vardb.metadata_wrangling package¶
This package retrieves and parses metadata from internal projects to produce vardb .loader files, which are then used to load the data into vardb.
- Each MetadataCollector is responsible for assembling the metadata for a single pipeline. It makes calls to other databases through their connections class, and/or scrapes the filesystem for the data and metadata for the pipeline.
- The main function in this package is make_loader, which collects the metadata and optionally compares it to a previous version so that only new and changed data is loaded.
Subpackages¶
- vardb.metadata_wrangling.oasis package
- Submodules
- vardb.metadata_wrangling.oasis.demographics module
- vardb.metadata_wrangling.oasis.diagnosis module
- vardb.metadata_wrangling.oasis.drug_map module
- vardb.metadata_wrangling.oasis.error_code module
- vardb.metadata_wrangling.oasis.helpers module
- vardb.metadata_wrangling.oasis.oasis module
- vardb.metadata_wrangling.oasis.output module
- vardb.metadata_wrangling.oasis.preprocess module
- vardb.metadata_wrangling.oasis.radiation module
- vardb.metadata_wrangling.oasis.treatment module
Submodules¶
vardb.metadata_wrangling.configuration module¶
-
class
vardb.metadata_wrangling.configuration.
Config
(config=None)¶ Bases:
object
-
evaluate
(key, *args, **kwargs)¶ Evaluates the function associated with key, passing it the given arguments, and returns the result
Parameters: - key – function key
- args – function positional arguments
- kwargs – function keyword arguments
Returns: the return value of the function
-
get
(key)¶ Gets the parameter associated with key in the configuration
Parameters: key – the configuration key to look up
Returns: value of Config key
-
keys
()¶ Returns keys
Returns: keys
-
set
(key, val)¶ Sets a key in the configuration dictionary
Parameters: - key – the configuration key to set
- val – the value to store under key
-
update
(config)¶ Update the Config object with new data
Parameters: config – a dictionary of key-value pairs
Raises: ValueError if a function parameter is not a function recognized in locals
-
validate
(required_keys)¶ Makes sure that all of the required keys are defined
Parameters: required_keys – the keys that must be defined in the configuration
Raises: ValueError if some keys are not defined
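The Config methods above can be illustrated with a minimal sketch. ConfigSketch is a hypothetical, dict-backed stand-in and not the vardb implementation. In particular, it stores callables directly, whereas the real update method appears to resolve function parameters against locals.

```python
class ConfigSketch:
    """Hypothetical dict-backed stand-in for vardb's Config (an assumption)."""

    def __init__(self, config=None):
        self._data = {}
        if config:
            self.update(config)

    def update(self, config):
        # Merge a dictionary of key-value pairs into the configuration.
        self._data.update(config)

    def get(self, key):
        return self._data[key]

    def set(self, key, val):
        self._data[key] = val

    def keys(self):
        return list(self._data)

    def validate(self, required_keys):
        # Fail early if any required key has not been defined.
        missing = [k for k in required_keys if k not in self._data]
        if missing:
            raise ValueError("missing config keys: " + ", ".join(missing))

    def evaluate(self, key, *args, **kwargs):
        # Call the function stored under key and return its result.
        func = self._data[key]
        if not callable(func):
            raise ValueError(f"config key {key!r} does not hold a function")
        return func(*args, **kwargs)


# Example values are made up for illustration.
config = ConfigSketch({"project": "POG", "label": str.upper})
config.set("data_type", "vcall")
config.validate(["project", "data_type"])
shouted = config.evaluate("label", "vcall")
```

Calling validate before any collection work starts keeps a missing key from surfacing halfway through a long run.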
-
vardb.metadata_wrangling.get_bam_cnvs module¶
Locates all pairs of bam_CNVs and bam files for the controlfreec pipeline. This is necessary because controlfreec is not currently tracked in a database. The bam_CNVs and bam files are then used to look up metadata in BioApps and LIMS.
-
exception
vardb.metadata_wrangling.get_bam_cnvs.
GetBamCNVsException
¶ Bases:
exceptions.Exception
-
vardb.metadata_wrangling.get_bam_cnvs.
get_bam_cnvs
(bam_cnv_pattern)¶ Locates the bam_cnvs and bam file pairs under a particular search pattern.
Returns: A pandas dataframe containing the BioApps lookup path (originating merged bam file path), library name, output data path, pipeline, and pipeline version for a given pair of bam_cnvs and bam files.
vardb.metadata_wrangling.helpers module¶
-
vardb.metadata_wrangling.helpers.
get_patient_identifier
(df)¶
-
vardb.metadata_wrangling.helpers.
get_pog_controlfreec_library_name
(df)¶
-
vardb.metadata_wrangling.helpers.
get_pog_gene_model
(df)¶
-
vardb.metadata_wrangling.helpers.
get_pog_id
(df)¶
vardb.metadata_wrangling.loader_maker module¶
Creates loader files to be used by vardb.variant_file_loaders.load_files to load data and metadata to vardb.
-
vardb.metadata_wrangling.loader_maker.
make_loader
(output_directory, project, query, previous_metadata_file=None, debug=False)¶ Creates a loader file, which includes all records matching a project and analysis query that need to be loaded to vardb. All metadata associated with the project and query is obtained and then compared to the metadata collected on a previous day (from previous_metadata_file). The rows that need to be loaded include all new, changed, and deleted rows in the new metadata as compared to the previous metadata. If previous_metadata_file is not specified, all rows in the current metadata are added to the loader file.
Parameters: - output_directory – Destination for new metadata and loader files
- project – project
- query – analysis to query for (e.g. vcall)
- previous_metadata_file – path to metadata file created on a previous day
- debug – True if you want to suppress errors for debugging purposes
Returns: path to loader file (None if no modified records were found)
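The control flow described for make_loader can be sketched with the standard library alone. Everything here is an assumption for illustration: make_loader_sketch, the example.loader file name, and the use of plain dictionaries in place of the collected metadata are hypothetical. The sketch also only keeps rows that are new or changed, and does not model deleted rows.

```python
import csv
import os
import tempfile


def make_loader_sketch(output_directory, new_rows, previous_rows=None):
    """Write rows needing a load to a tab-delimited file; return its path or None."""
    if previous_rows is None:
        # No earlier snapshot: every current row goes into the loader file.
        to_load = list(new_rows)
    else:
        # Keep only rows that do not appear unchanged in the earlier snapshot.
        to_load = [row for row in new_rows if row not in previous_rows]

    if not to_load:
        return None  # nothing was modified, so no loader file is produced

    path = os.path.join(output_directory, "example.loader")  # hypothetical name
    fieldnames = sorted(to_load[0])
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(to_load)
    return path


# Example rows are made up for illustration.
rows = [{"library": "A1", "pipeline": "vcall"}]
with tempfile.TemporaryDirectory() as directory:
    first_path = make_loader_sketch(directory, rows)
    first_exists = os.path.exists(first_path)
    unchanged = make_loader_sketch(directory, rows, previous_rows=rows)
```

Returning None when nothing changed lets a scheduled job skip the load step without inspecting an empty file.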
vardb.metadata_wrangling.locate_metadata_changes module¶
Locates changes between two (variant data) metadata DataFrames, including row changes, deletions, and additions.
-
vardb.metadata_wrangling.locate_metadata_changes.
locate_metadata_changes
(old_metadata, new_metadata)¶ Finds new/changed data to be loaded to the database
Parameters: - old_metadata – Includes all metadata found for a project and pipeline at time of last loading to vardb
- new_metadata – All new metadata for the same project and pipeline
Returns: A dataframe with just the new and changed metadata, or None if no changes occurred. This is used to make a loader file.
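A comparison of this kind can be sketched as follows. This is a simplified illustration, not the vardb code: the real function operates on pandas DataFrames, while locate_changes_sketch works on lists of dictionaries, and treating output_data_path as the row key is an assumption made for the example.

```python
def locate_changes_sketch(old_rows, new_rows, key="output_data_path"):
    """Classify rows as added, changed, or deleted; return None if nothing differs."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    # Rows whose key is only present in the new snapshot.
    added = [row for k, row in new_by_key.items() if k not in old_by_key]
    # Rows present in both snapshots whose contents differ.
    changed = [
        row for k, row in new_by_key.items()
        if k in old_by_key and old_by_key[k] != row
    ]
    # Rows whose key disappeared from the new snapshot.
    deleted = [row for k, row in old_by_key.items() if k not in new_by_key]

    if not (added or changed or deleted):
        return None
    return {"added": added, "changed": changed, "deleted": deleted}


# Example paths and versions are made up for illustration.
old = [
    {"output_data_path": "/data/a.vcf", "pipeline_version": "1.0"},
    {"output_data_path": "/data/b.vcf", "pipeline_version": "1.0"},
]
new = [
    {"output_data_path": "/data/a.vcf", "pipeline_version": "1.1"},
    {"output_data_path": "/data/c.vcf", "pipeline_version": "1.0"},
]
changes = locate_changes_sketch(old, new)
```

Keeping the three categories separate makes it possible to treat deletions differently from loads, if the loader needs to.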
vardb.metadata_wrangling.metadata_collector module¶
MetadataCollector is a base class with common functionality for assembling, cleaning, and extracting information from various database sources. A MetadataCollector subclass must be defined for each new data type. Any information that cannot be obtained directly from the databases can be specified through the Config object.
-
class
vardb.metadata_wrangling.metadata_collector.
ControlFreeCCollector
(config, debug=False)¶ Bases:
vardb.metadata_wrangling.metadata_collector.MetadataCollector
Collects metadata associated with controlfreec pipeline
-
data_type
= 'controlfreec'¶
-
-
class
vardb.metadata_wrangling.metadata_collector.
ExpressionCollector
(config, debug=False)¶ Bases:
vardb.metadata_wrangling.metadata_collector.MetadataCollector
Collects metadata associated with the gene coverage pipeline
-
collect_metadata
()¶ The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.
Returns: a Metadata object with the collected metadata
Modifies: self.metadata
-
data_type
= 'expression'¶
-
-
class
vardb.metadata_wrangling.metadata_collector.
Metadata
(df=None, path=None, debug=False)¶ Bases:
object
Metadata is a class for storing metadata information for loading to vardb. It takes either a dataframe or a path. It loads the data, validates it, and adds default values.
-
difference
(old_metadata)¶ Finds the difference between this metadata and another Metadata object
Parameters: old_metadata – a Metadata object to compare to
Returns:
-
k
= 'production'¶
-
output_to_tsv
(output_path)¶ Writes the metadata DataFrame to a tab-delimited file at output_path.
Parameters: output_path – full path to destination file
-
-
class
vardb.metadata_wrangling.metadata_collector.
MetadataCollector
(config, debug=False)¶ Bases:
object
Abstract class which defines the common operations needed to collect metadata for loading to vardb.
-
collect_metadata
()¶ The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.
Returns: a Metadata object with the collected metadata
Modifies: self.metadata
-
data_type
¶
-
classmethod
factory
(config, data_type, debug=False)¶ Returns the correct MetadataCollector subclass which corresponds to the data_type requested
Parameters: - config – a Config object which contains all required parameters to fully specify the MC class
- data_type – the data type that is to be collected
- debug – True if you want to suppress errors for debugging purposes
Returns: MC subclass corresponding to the data type
Raises: MetadataCollectorException if essential information is missing from config
-
metadata
= None¶
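The factory classmethod documented above selects a subclass by its data_type attribute. One common way to write such a dispatch is sketched below. The class names, the use of `__subclasses__()`, the choice to return an instance, and raising on an unknown data type are all assumptions for illustration, not the vardb implementation.

```python
class CollectorError(Exception):
    """Hypothetical stand-in for MetadataCollectorException."""


class CollectorSketch:
    data_type = None  # each subclass declares the data type it handles

    def __init__(self, config, debug=False):
        self.config = config
        self.debug = debug

    @classmethod
    def factory(cls, config, data_type, debug=False):
        # Pick the direct subclass whose data_type matches the request.
        for subclass in cls.__subclasses__():
            if subclass.data_type == data_type:
                return subclass(config, debug=debug)
        raise CollectorError(f"no collector for data type {data_type!r}")


class VCallSketch(CollectorSketch):
    data_type = "vcall"


class ExpressionSketch(CollectorSketch):
    data_type = "expression"


# The config here is a plain dict with a made-up value.
collector = CollectorSketch.factory({"project": "POG"}, "vcall")
```

With this layout, supporting a new data type only requires defining a new subclass with its own data_type value.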
-
-
exception
vardb.metadata_wrangling.metadata_collector.
MetadataCollectorException
¶ Bases:
exceptions.Exception
-
class
vardb.metadata_wrangling.metadata_collector.
ReviewedSomaticCNVCollector
(config, debug=False)¶ Bases:
vardb.metadata_wrangling.metadata_collector.MetadataCollector, vardb.metadata_wrangling.metadata_collector.TCFilter
Collects metadata associated with the reviewed somatic CNV pipeline
-
collect_metadata
()¶ The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.
Returns: a Metadata object with the collected metadata
Modifies: self.metadata
-
data_type
= 'somatic_cnv'¶
-
-
class
vardb.metadata_wrangling.metadata_collector.
ReviewedSomaticLOHCollector
(config, debug=False)¶ Bases:
vardb.metadata_wrangling.metadata_collector.MetadataCollector, vardb.metadata_wrangling.metadata_collector.TCFilter
Collects metadata associated with the somatic LOH pipeline.
-
collect_metadata
()¶ The main function of the metadata collector. This collects metadata for all analyses matching the specified project and data type as defined in the config. Data is obtained by querying BioApps, LIMS and optionally the file system. Validates metadata and returns a Metadata object.
Returns: a Metadata object with the collected metadata
Modifies: self.metadata
-
data_type
= 'somatic_loh'¶
-
-
class
vardb.metadata_wrangling.metadata_collector.
SomaticSmallVariantCollector
(config, debug=False)¶ Bases:
vardb.metadata_wrangling.metadata_collector.MetadataCollector
Collects metadata associated with strelka and mutationseq pipelines
-
data_type
= 'small_somatic'¶
-
-
class
vardb.metadata_wrangling.metadata_collector.
TCFilter
¶ Bases:
object
Collection of routines for filtering somatic cnv pipelines by the reviewed tumour content
-
filter_metadata
(bioapps_df, tumour_df)¶
-
get_tumour_content
(output_data_path)¶ Retrieves tumour content from path for somatic_cnv pipeline
Returns: tumour content
-
-
class
vardb.metadata_wrangling.metadata_collector.
VCallCollector
(config, debug=False)¶ Bases:
vardb.metadata_wrangling.metadata_collector.MetadataCollector
Collects metadata associated with the vcall pipeline
-
data_type
= 'vcall'¶
-
-
vardb.metadata_wrangling.metadata_collector.
throw_exception
(msg, debug)¶ Raises MetadataCollectorException if debug is False; if debug is True, the error message is only logged
Parameters: - msg (str) – error message
- debug (bool) – True to log the error without raising the exception, False to raise it
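The debug switch described above can be sketched with the standard logging module. This is a minimal illustration under assumptions: the logger name and CollectorError are hypothetical, and the sketch logs the message in both cases before deciding whether to raise, which the original description does not confirm.

```python
import logging

logger = logging.getLogger("metadata_collector_sketch")  # hypothetical logger name


class CollectorError(Exception):
    """Hypothetical stand-in for MetadataCollectorException."""


def throw_exception_sketch(msg, debug):
    # Record the problem regardless of mode (an assumption of this sketch).
    logger.error(msg)
    if not debug:
        # Normal mode: stop processing by raising.
        raise CollectorError(msg)


# Debug mode: the error is logged and execution continues.
debug_result = throw_exception_sketch("library name missing", debug=True)
```

Routing every failure through one helper means the choice between stopping and continuing is made in a single place.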