vardb.variant_data_files package¶
There is one variant data file class for each file type that is loaded into vardb. Each class contains all of the information required to load that type of data into the database:
- column names and data types for the files
- information on the headers
- which table the data is loaded to in vardb
- PSQL commands for parsing and loading the data to tables on vardb
- routines to compute required metadata from the file, such as the creation date and md5sum
- the pipelines that the class belongs to
When a new data type is added to vardb, a new variant data file class must be created with this information.
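As a rough illustration of the items listed above, a new class could gather the declarative information in one place. This is a minimal sketch, not vardb code: the class name ExampleDataFile, its attribute names, the table name and the pipeline name are all hypothetical, and the COPY statement is only one plausible form of a PSQL load command.

```python
class ExampleDataFile:
    """Hypothetical stand-in for a variant data file class (not part of vardb)."""

    # Column names and data types for the file (made-up columns).
    columns = (("chromosome", "TEXT"), ("position", "INTEGER"), ("score", "FLOAT"))
    # Prefix that marks header lines in this made-up format.
    header_prefix = "#"
    # Table the data is loaded to (hypothetical table name).
    table = "example_table"
    # Pipelines the class belongs to (hypothetical pipeline name).
    pipelines = ("example_pipeline",)

    def __init__(self, path):
        self.path = path

    def copy_command(self):
        """Build a PostgreSQL COPY statement for loading tab-delimited rows."""
        names = ", ".join(name for name, _ in self.columns)
        return "COPY {} ({}) FROM STDIN WITH (FORMAT text)".format(self.table, names)
```

The real classes also compute file metadata such as the md5sum; that part is left out of this sketch.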
Submodules¶
vardb.variant_data_files.cnv module¶
cnv contains classes for germline (controlfree) and somatic (somatic_cnv) pipelines. The somatic_cnv pipeline actually creates several file types, which are all represented here.
-
class
vardb.variant_data_files.cnv.
ControlFreeC
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
ControlFreeC class gets metadata for cnvs produced by the ControlFreeC pipeline
-
class
vardb.variant_data_files.cnv.
HomozygousDeletion
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
-
class
vardb.variant_data_files.cnv.
HomozygousDeletion_v1
(**kwargs)¶ Bases:
vardb.variant_data_files.cnv.HomozygousDeletion
HomozygousDeletion class gets metadata for homozygous deletions that have been selected during review from the somatic cnv pipeline
-
class
vardb.variant_data_files.cnv.
HomozygousDeletion_v2
(**kwargs)¶ Bases:
vardb.variant_data_files.cnv.HomozygousDeletion
HomozygousDeletion class gets metadata for homozygous deletions that have been selected during review from the somatic cnv pipeline
-
class
vardb.variant_data_files.cnv.
HomozygousDeletion_v3
(**kwargs)¶ Bases:
vardb.variant_data_files.cnv.HomozygousDeletion
HomozygousDeletion class gets metadata for homozygous deletions that have been selected during review from the somatic cnv pipeline
-
class
vardb.variant_data_files.cnv.
SomaticCna
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
SomaticCna class gets metadata for raw cna data produced by the somatic cnv pipeline.
-
class
vardb.variant_data_files.cnv.
SomaticCnv
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
SomaticCnv class gets metadata for cnv segment data produced by the somatic cnv pipeline.
-
class
vardb.variant_data_files.cnv.
SomaticLOH
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
SomaticLOH class gets metadata for loss of heterozygosity (LOH) states produced by APOLLOH.
- Zygosity states are:
- DLOH = deletion-LOH (state 1)
- NLOH = copy-neutral-LOH (states 2,4)
- ALOH = amplified-LOH (states 5,8,9,13,14,19)
- HET = heterozygous (states 3,6,7)
- ASCNA = allele-specific-amplification (states 10,12,15,18)
- BCNA = balanced-amplification (states 11,16,17)
-
class
vardb.variant_data_files.cnv.
SomaticVAF
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
SomaticVAF class gets metadata and allele frequencies from APOLLOH.
- Tab-delimited output file for position-level results.
- 9 columns:
- chr (‘X’ and ‘Y’ will be output as 23 and 24)
- position
- reference count
- non-reference count
- total depth
- allelic ratio
- copy number (from input)
- APOLLOH genotype state
- Zygosity state.
- N additional columns:
- posterior marginal probabilities (responsibilities) for each APOLLOH genotype state.
- Zygosity states are:
- DLOH = deletion-LOH (state 1)
- NLOH = copy-neutral-LOH (states 2,4)
- ALOH = amplified-LOH (states 5,8,9,13,14,19)
- HET = heterozygous (states 3,6,7)
- ASCNA = allele-specific-amplification (states 10,12,15,18)
- BCNA = balanced-amplification (states 11,16,17)
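The column layout and state table above can be expressed as a small lookup and parser. This is a sketch under assumptions: the field names, the function name parse_position_line and the sample line are made up for illustration, and the type conversions are a guess at sensible types rather than what SomaticVAF actually does.

```python
# Zygosity labels and the APOLLOH genotype states they cover (from the list above).
ZYGOSITY_STATES = {
    "DLOH": (1,),
    "NLOH": (2, 4),
    "ALOH": (5, 8, 9, 13, 14, 19),
    "HET": (3, 6, 7),
    "ASCNA": (10, 12, 15, 18),
    "BCNA": (11, 16, 17),
}

# Reverse lookup: genotype state -> zygosity label.
STATE_TO_ZYGOSITY = {
    state: label for label, states in ZYGOSITY_STATES.items() for state in states
}

# Hypothetical names for the 9 fixed columns, in file order.
POSITION_FIELDS = (
    "chr", "position", "ref_count", "nonref_count", "depth",
    "allelic_ratio", "copy_number", "state", "zygosity",
)


def parse_position_line(line):
    """Split one tab-delimited position-level line into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(POSITION_FIELDS, fields[:9]))
    # chr stays a string ('X' and 'Y' arrive as '23' and '24').
    for key in ("position", "ref_count", "nonref_count", "depth", "state"):
        record[key] = int(record[key])
    record["allelic_ratio"] = float(record["allelic_ratio"])
    # Any remaining columns are the per-state posterior marginal probabilities.
    record["posteriors"] = [float(value) for value in fields[9:]]
    return record


# Made-up line, with only two posterior columns to keep it short.
example = parse_position_line("23\t1000\t10\t5\t15\t0.333\t2\t3\tHET\t0.1\t0.9\n")
```

With this lookup, the zygosity column of a parsed line can be cross-checked against STATE_TO_ZYGOSITY[record["state"]].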
-
class
vardb.variant_data_files.cnv.
TcgaCnv
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
-
class
vardb.variant_data_files.cnv.
TcgaGermlineMaskedCnv
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
vardb.variant_data_files.data_classes module¶
-
vardb.variant_data_files.data_classes.
DataClass
(**kwargs)¶ This is a factory for choosing the correct VariantDataFile subclass based on the pipeline information
Parameters: kwargs – metadata arguments
Returns: correct class
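A factory of this kind is often a lookup from pipeline name to class. The sketch below is illustrative only: the stub classes stand in for the real ones, the "pipeline" keyword and the registry contents are assumptions, and whether the real DataClass returns a class or an instance is not stated here (this sketch returns an instance).

```python
class VariantDataFile:
    """Stub base class for this sketch; the real class does much more."""

    def __init__(self, **kwargs):
        self.metadata = dict(kwargs)


class ControlFreeC(VariantDataFile):
    """Stub standing in for the germline cnv class."""


class SomaticCnv(VariantDataFile):
    """Stub standing in for the somatic cnv segment class."""


# Hypothetical registry: pipeline name -> class that handles its files.
PIPELINE_CLASSES = {
    "controlfree": ControlFreeC,
    "somatic_cnv": SomaticCnv,
}


def data_class(**kwargs):
    """Pick the class registered for kwargs['pipeline'] and instantiate it."""
    pipeline = kwargs.get("pipeline")
    try:
        cls = PIPELINE_CLASSES[pipeline]
    except KeyError:
        raise ValueError("no data file class for pipeline {!r}".format(pipeline))
    return cls(**kwargs)
```

Keeping the mapping in one dictionary means that adding a new data type only requires a new class and one new registry entry.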
vardb.variant_data_files.expression module¶
-
class
vardb.variant_data_files.expression.
RSEM
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
RSEM class gets metadata for .rsem files
-
class
vardb.variant_data_files.expression.
TranscriptNormalized
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
TranscriptNormalized class gets metadata for transcript.normalized files
vardb.variant_data_files.maf module¶
-
class
vardb.variant_data_files.maf.
TCGASimpleSomatic
(**kwargs)¶ Bases:
vardb.variant_data_files.variant_data_file.VariantDataFile
TCGASimpleSomatic class gets metadata for TCGA simple somatic MAF files
vardb.variant_data_files.variant_data_file module¶
-
class
vardb.variant_data_files.variant_data_file.
Columns
(cols)¶ Bases:
object
Immutable object containing a list of tuples with column name and type for each column of the data file
-
valid_types
= ('INT', 'FLOAT', 'DATE', 'TIMESTAMP', 'TEXT', 'BIGINT', 'INTEGER', 'BOOLEAN')¶
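One way to get an immutable, validated column list is to check each type against valid_types once and store the result as a tuple. The class below is a sketch with a deliberately different name (ColumnsSketch); how the real Columns class enforces immutability or reports bad types is not documented here, so the ValueError and the names property are assumptions.

```python
class ColumnsSketch:
    """Illustrative validated, read-only list of (name, type) tuples."""

    valid_types = ('INT', 'FLOAT', 'DATE', 'TIMESTAMP', 'TEXT',
                   'BIGINT', 'INTEGER', 'BOOLEAN')

    def __init__(self, cols):
        checked = []
        for name, col_type in cols:
            if col_type not in self.valid_types:
                raise ValueError(
                    "invalid type {!r} for column {!r}".format(col_type, name))
            checked.append((name, col_type))
        # A tuple of tuples cannot be modified in place after construction.
        self._cols = tuple(checked)

    @property
    def names(self):
        """Column names in file order."""
        return [name for name, _ in self._cols]

    def __iter__(self):
        return iter(self._cols)

    def __len__(self):
        return len(self._cols)
```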
-
-
class
vardb.variant_data_files.variant_data_file.
VariantDataFile
(**kwargs)¶ Bases:
object
-
close
()¶ Closes the file and resets the file pointer to None
-
get_columns_from_pandas
(filename, **kwargs)¶ Reads a file into a pandas dataframe and extracts the column names and column types of the data file. This is useful in cases where the data file has variable numbers of columns.
Parameters: - filename – path to data
- kwargs – any optional arguments for the pandas read_csv function
Returns: the Columns object corresponding to the columns in the data file
-
get_data
()¶ Sets the member variables for header and data. The header is a list of strings, and the file data is a pandas dataframe. Returns the data.
Returns: the data
-
get_data_ptr
()¶ Sets the file pointer to the first line of data. If the _get_header function has been properly defined in the subclasses, this should always work.
Returns: file pointer at beginning of data
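Positioning a file pointer at the first data line usually means reading header lines until one fails the header test, then stepping back. The sketch below assumes a caller-supplied predicate in place of the subclass _get_header logic; the function name seek_to_data and the '#' header convention in the example are illustrative, not taken from vardb.

```python
import io


def seek_to_data(handle, is_header_line):
    """Advance handle past header lines and leave it at the first data line."""
    while True:
        offset = handle.tell()        # remember where this line starts
        line = handle.readline()
        if not line or not is_header_line(line):
            handle.seek(offset)       # step back so the data line is unread
            return handle


# Example with an in-memory file whose header lines start with '#'.
sample = io.StringIO("##meta\n#CHROM\tPOS\n1\t100\n2\t200\n")
seek_to_data(sample, lambda text: text.startswith("#"))
first_data_line = sample.readline()
```

readline() is used instead of iterating over the file because tell() cannot be called on a text file while it is being iterated.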
-
get_header
()¶ Returns the file header, and closes the file
Returns: file header
-
get_md5sum
()¶ Gets md5sum and adds it to the metadata
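For reference, an md5sum can be computed in fixed-size chunks so that large data files are never read into memory at once. This is a generic sketch using hashlib, not the vardb implementation; the function names, the chunk size and the "md5sum" metadata key are assumptions.

```python
import hashlib


def file_md5sum(path, chunk_size=1 << 20):
    """Return the hex md5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_md5sum(metadata, path):
    """Store the digest in a metadata dict under a hypothetical key."""
    metadata["md5sum"] = file_md5sum(path)
    return metadata
```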
-
line_count
()¶ Calculates the line count of the file
Returns: line count of file with filename
-
open
()¶ Opens the data file for reading
Raises: VariantDataFileException if the file could not be opened
-
-
exception
vardb.variant_data_files.variant_data_file.
VariantDataFileException
¶ Bases:
exceptions.Exception
vardb.variant_data_files.vcf module¶
-
class
vardb.variant_data_files.vcf.
MutSeq_v1
(**kwargs)¶ Bases:
vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile
class for somatic vcf files created by mutation seq version 1.0.2
-
class
vardb.variant_data_files.vcf.
MutSeq_v2
(**kwargs)¶ Bases:
vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile
class for somatic vcf files created by mutation seq version 4.3.5
-
class
vardb.variant_data_files.vcf.
StrelkaIndels
(**kwargs)¶ Bases:
vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile
Class for strelka indel files
-
class
vardb.variant_data_files.vcf.
StrelkaSnps
(**kwargs)¶ Bases:
vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile
Class for strelka snp files
-
class
vardb.variant_data_files.vcf.
VCF
¶ Bases:
object
VCF class has functionality applicable to all vcf data classes
-
class
vardb.variant_data_files.vcf.
VCall
(**kwargs)¶ Bases:
vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile
class for vcf files created by vcall pipeline (mpileup)
-
get_md5sum
()¶ Gets md5sum and adds it to the metadata
-
normalize_indels
()¶ Normalizes the self.unnormalized file to self.path, but only if self.path does not already exist (that is, the file has not already been normalized)
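The "only if the output does not exist" behaviour can be sketched as a small guard around whatever performs the normalization. Here the normalizer is passed in as a callable so the example stays self-contained; the function name normalize_if_missing and its return value are assumptions, and the actual vt invocation is not shown.

```python
import os


def normalize_if_missing(unnormalized, path, normalize):
    """Run normalize(unnormalized, path) unless path already exists.

    Returns True if normalization ran, False if it was skipped.
    """
    if os.path.exists(path):
        return False              # already normalized; nothing to do
    normalize(unnormalized, path)
    return True
```

A guard like this makes repeated loads of the same file cheap, at the cost of not noticing when the unnormalized file changes after the first run.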
-
-
class
vardb.variant_data_files.vcf.
VcfAnnotations
(path)¶ Bases:
vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile
This class is for annotation VCFs. This VCF does not belong to a library and is only used for importing annotations.
vardb.variant_data_files.vcf_tools module¶
-
exception
vardb.variant_data_files.vcf_tools.
VcfToolsException
¶ Bases:
exceptions.Exception
-
vardb.variant_data_files.vcf_tools.
annotate
(log_path, vcf_path)¶ Wrapper function for running bioapps annotator on gphost
Parameters: - log_path – path to place log files
- vcf_path – path to vcf file to annotate
Returns:
-
vardb.variant_data_files.vcf_tools.
check_vt
(normalized_vcf_file, log_file)¶ Checks to make sure that the normalized vcf file was correctly created
Parameters: - normalized_vcf_file –
- log_file –
Returns:
-
vardb.variant_data_files.vcf_tools.
normalize
(unnormalized_file, normalized_file, log_path)¶ Wrapper function for running vt to normalize indels on gphost
Parameters: - unnormalized_file –
- normalized_file –
- log_path –
Returns: