vardb.connections package
The Connections package contains classes for interfacing with various databases at the GSC. Currently, there are classes for vardb, BioApps (solexa) and LIMS. These classes provide all details of the connection (host, port, etc.) as well as useful apis for querying (all classes) and loading (vardb only) data. The BioApps and LIMS connection classes are used primarily for obtaining the metadata needed for loading data files to vardb. They are not general purpose (BioApps and LIMS have their own web apis for general-purpose queries), but are tailored for use with vardb.
Submodules
vardb.connections.bioapps module
BioApps class provides connection to BioApps database. It also contains useful apis to obtain metadata required for loading data to vardb.
-
class
vardb.connections.bioapps.
BioApps
¶ Bases:
vardb.connections.connection.Connection
-
get_expression_data
(projects, output_path=None)¶ Returns the expression data for the input projects.
Parameters: - projects – List of project names.
- output_path – If supplied, results will be written to this path.
Returns: a DataFrame containing the query results.
-
get_merge_data
(df, output_path=None)¶ Returns the merge data for the libraries in the input dataframe.
Parameters: - df – A dataframe with at least the merged bam file and the library name
- output_path – If supplied, results will be written to this path.
Returns: a DataFrame containing the query results.
-
get_somatic_data
(projects, pipelines, output_path=None)¶ Returns the somatic variant data for the input projects and pipelines.
Parameters: - projects – List of project names.
- pipelines – A list of pipelines to query. Options are {'strelka', 'mutationseq', 'cnv'}
- output_path – If supplied, results will be written to this path.
Returns: a DataFrame containing the query results.
-
get_tumor_content
(output_path=None)¶
-
get_variant_calling_data
(projects, output_path=None)¶ Returns the variant calling data for the input projects.
Parameters: - projects – List of project names.
- output_path – If supplied, results will be written to this path.
Returns: a DataFrame containing the query results.
-
vardb.connections.bioapps_api module
Obsolete for now. New apis are being written which should facilitate communication with BioApps. For now, we use raw SQL in the bioapps.py module.
-
class
vardb.connections.bioapps_api.
BioApps_API
(account_file='.bioappsaccount', username=None, password=None)¶ Bases:
vardb.connections.connection.Connection
BioApps provides a connection to bioapps api calls
-
aligned_libcore_map
= {'analysis_software': 'aligner', 'bioapps_data_path': 'input_data_path', 'sequence_length': 'read_length'}¶
-
database
= 'solexa'¶
-
get_sample_info
(library)¶
-
library_map
= {'barcode': 'barcode', 'exon_capture_kit_version_number': 'exon_capture_kit_version', 'lower_protocol': 'sequencing_protocol', 'project_name': 'project', 'upper_protocol': 'library_construction_protocol'}¶
-
server_string
= 'http://%s:%s@www.bcgsc.ca/data/sbs/viewer/api'¶
-
source_map
= {'anatomic_site': 'anatomic_site', 'original_source_name': 'patient_id', 'pathology': 'disease_status', 'pathology_alias': 'pathology_alias', 'pathology_type': 'pathology_type', 'sex': 'gender', 'stage': 'developmental_stage'}¶
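The three map attributes above read as field-name translation tables. A minimal sketch of how such a map could be applied to a record follows. It assumes the keys are the names used by BioApps and the values are the corresponding vardb column names, and that unmapped fields are dropped; neither point is stated in this documentation. The helper `rename_fields` and the example record are hypothetical.

```python
# Copied from the documented source_map attribute.
source_map = {
    "anatomic_site": "anatomic_site",
    "original_source_name": "patient_id",
    "pathology": "disease_status",
    "pathology_alias": "pathology_alias",
    "pathology_type": "pathology_type",
    "sex": "gender",
    "stage": "developmental_stage",
}


def rename_fields(record, field_map):
    """Hypothetical helper: keep only mapped keys, renamed to their targets."""
    return {field_map[key]: value for key, value in record.items()
            if key in field_map}


# Made-up record for illustration; "internal_id" is not in the map.
record = {"sex": "F", "stage": "adult", "pathology": "diseased", "internal_id": 42}
renamed = rename_fields(record, source_map)
```

Whether the real class discards unmapped fields or passes them through unchanged is not documented here, so treat that choice as illustrative.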
-
-
exception
vardb.connections.bioapps_api.
BioappsException
¶ Bases:
exceptions.Exception
vardb.connections.bioapps_sql module
Raw SQL queries used to obtain metadata from the BioApps database.
vardb.connections.connection module
Base class for connecting to a database. It gets credentials from an account file. We may want to change how we manage credentials at a later date.
-
class
vardb.connections.connection.
Connection
(account_file=None, username=None, password=None)¶ Bases:
object
Base class provides methods for getting login information from input parameters, or from file
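The credential lookup described above (explicit parameters, otherwise an account file) might look roughly like the sketch below. The function `resolve_credentials` is hypothetical, and the account-file layout (username on the first line, password on the second) is an assumption made only for illustration; the real file format is not documented here.

```python
import os
import tempfile


def resolve_credentials(account_file=None, username=None, password=None):
    """Return (username, password), preferring explicit arguments.

    Hypothetical helper. The two-line account-file layout is an assumption.
    """
    if username is not None and password is not None:
        return username, password
    if account_file is None:
        raise ValueError("supply username and password, or an account file")
    with open(os.path.expanduser(account_file)) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    if len(lines) < 2:
        raise ValueError("account file must hold a username and a password")
    return lines[0], lines[1]


# Demonstration with a throwaway account file.
with tempfile.NamedTemporaryFile("w", suffix=".account", delete=False) as tmp:
    tmp.write("demo_user\ndemo_password\n")
try:
    from_file = resolve_credentials(account_file=tmp.name)
    from_args = resolve_credentials(username="alice", password="secret")
finally:
    os.remove(tmp.name)
```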
-
exception
vardb.connections.connection.
ConnectionException
¶ Bases:
exceptions.Exception
vardb.connections.karen module
Karen class provides connection to Karen’s development database, which currently contains the tumour content and ploidy. This class will have to be modified once this data moves into development.
-
class
vardb.connections.karen.
Karen
¶ Bases:
vardb.connections.connection.Connection
-
get_karen_info
()¶ Returns the tumour content and ploidy for all libraries in the database
Returns: a dataframe with the results, keyed on the library name
-
pandas_query
(command, data=None)¶ Performs a query and returns the results as a pandas dataframe
Parameters: - command – SQL command
- data – parameters needed in command
Returns: dataframe with query results
-
vardb.connections.lims module
LIMS class provides connection to LIMS web apis. It also contains useful apis to obtain metadata required for loading data to vardb.
-
class
vardb.connections.lims.
LIMS
(account_file='.limsaccount', username=None, password=None)¶ Bases:
vardb.connections.connection.Connection
LIMS provides a connection to LIMS api calls
-
get_LIMS_info
(libraries)¶ Gets ethnicity and library strategy from LIMS
Parameters: libraries – a list of the library names to query
Returns: a dataframe with columns for the library_name, ethnicity and library_strategy
-
-
exception
vardb.connections.lims.
LimsException
¶ Bases:
exceptions.Exception
vardb.connections.vardb_connection module
VarDB class provides connection to vardb databases (production and test). The class provides apis for querying the database. The apis in this class should be available to all users with query access to the database. Functions that modify the database are found in the vardb_loader module.
-
class
vardb.connections.vardb_connection.
VarDB
(database=None, account_file='.gscaccount', username=None, password=None)¶ Bases:
vardb.connections.connection.Connection
Provides connection to vardb databases
-
analysis_exists
(md5sum, analysis_date)¶ Checks whether the file has already been loaded into gvdb
Parameters: - md5sum – the md5sum for the analysis to be retrieved
- analysis_date – the modified timestamp on the data file
Returns: True or False
-
analysis_record_exists
(metadata)¶ Checks whether the entire row defined by the metadata object is found in analysis. This determines whether the metadata has been updated.
Parameters: metadata – a SampleMetaData object
Returns: True if the metadata already exists in a row of analysis, False otherwise
-
count
(table)¶ Counts the number of rows in the table
Parameters: table – table name
Returns: the number of rows in table
-
data_exists
(table_name, md5sum, library_name)¶ Checks whether the data exists in table table_name
Parameters: - table_name – name of table to check
- md5sum –
- library_name –
Returns: True or False
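An existence check of this shape can be sketched as follows, with an in-memory SQLite database standing in for vardb. The column names `md5sum` and `library_name` are assumed from the parameter names, and the identifier check on `table_name` is an illustrative safeguard, since table names cannot be passed as bound parameters. None of this is taken from the actual implementation.

```python
import sqlite3


def data_exists(connection, table_name, md5sum, library_name):
    """Sketch: True if a row with this md5sum and library name is present."""
    if not table_name.isidentifier():
        raise ValueError("invalid table name: %r" % table_name)
    query = (
        "SELECT EXISTS (SELECT 1 FROM %s WHERE md5sum = ? AND library_name = ?)"
        % table_name
    )
    return bool(connection.execute(query, (md5sum, library_name)).fetchone()[0])


# Stand-in table with made-up values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vcall (md5sum TEXT, library_name TEXT)")
conn.execute("INSERT INTO vcall VALUES (?, ?)", ("abc123", "LIB0001"))
conn.commit()

found = data_exists(conn, "vcall", "abc123", "LIB0001")
missing = data_exists(conn, "vcall", "abc123", "LIB9999")
```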
-
dict_query
(command, data=None)¶ Runs command on database and returns results in dictionary format
Parameters: - command – a string psql command
- data – a dictionary of parameters to complete the query
Returns: dictionary result
Raises: VarDBException if there was a problem with command
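The command/data pairing above is the usual parameterized-query pattern: the SQL text carries named placeholders and the dictionary supplies their values. The sketch below shows the idea with Python's built-in sqlite3 module as a stand-in. The real class talks to a different database and may use a different placeholder syntax and a different dictionary layout; the one-dict-per-row result here is an assumption.

```python
import sqlite3


def dict_query(connection, command, data=None):
    """Sketch: run command with optional named parameters, return dict rows."""
    cursor = connection.execute(command, data or {})
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory SQLite stands in for the real database; values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (library_name TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [("LIB0001", "F"), ("LIB0002", "M")],
)

rows = dict_query(
    conn,
    "SELECT library_name, gender FROM sample WHERE library_name = :library_name",
    {"library_name": "LIB0001"},
)
```

Passing values through `data` rather than formatting them into the command string lets the driver handle quoting.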
-
get_analysis
(md5sum, analysis_date)¶ Gets records from analysis table
Parameters: - md5sum – the md5sum for the analysis to be retrieved
- analysis_date –
Returns: the row of analysis matching the selected md5sum if it exists, otherwise None
-
get_latest_analysis
(md5sum, analysis_date)¶ Gets records from the analysis_all table, i.e. these can be non-production
Parameters: - md5sum – the md5sum for the analysis to be retrieved
- library_name –
Returns: the row of analysis_all matching the selected md5sum if it exists, otherwise None
-
get_sample
(library_name)¶ Gets records from sample table
Parameters: library_name – a library name to look up in sample table
Returns: the entire row in sample table corresponding to the specified library name if it exists, otherwise None
-
library_exists
(table_name, library_name)¶ Checks whether the sample exists in the vardb_sample_tb based on the library_name
Parameters: - table_name – name of table to check
- library_name –
Returns: True or False
-
object_exists
(object_type, object_id)¶ Checks whether a file exists based on the object type and id
Parameters: - object_type – analysis_object_type
- object_id – analysis_object_id
Returns: True if it exists, otherwise False
-
pandas_query
(command, data=None)¶ Runs command on database and returns results in pandas dataframe format
Parameters: - command – a string psql command
- data – a dictionary of parameters to complete the query
Returns: pandas dataframe result
Raises: VarDBException if there was a problem with command
-
sample_exists
(library_name)¶ Checks whether the sample has already been loaded into gvdb
Parameters: library_name –
Returns: True or False
-
sample_record_exists
(metadata)¶ Checks if the sample metadata exists in sample
Parameters: metadata – a SampleMetaData object with the records to compare with the database
Returns: True if the library exists in the database with exactly the same values as metadata, False otherwise
-
save_to_file
(local_path, command, data=None, null_string='.', headers=False)¶ Saves the output of a query to file
Parameters: - local_path – destination path for the output
- command – PSQL query to format output
- data – a dictionary of parameters with which to complete the query command
- null_string – string to output as null
- headers – True if you want to output the column headers as well as the data
Raises: VarDBException if there is a problem with either the query or the file IO
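The roles of `null_string` and `headers` can be illustrated with a small stand-in that writes query results to a delimited file. This is a sketch only: SQLite replaces the real database, the tab-separated output format is an assumption, and the row count returned here is a convenience of the sketch rather than documented behaviour.

```python
import csv
import os
import sqlite3
import tempfile


def save_to_file(connection, local_path, command, data=None,
                 null_string=".", headers=False):
    """Sketch: write query results to a tab-separated file."""
    cursor = connection.execute(command, data or {})
    count = 0
    with open(local_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if headers:
            writer.writerow([col[0] for col in cursor.description])
        for row in cursor:
            # NULL values are written as null_string.
            writer.writerow([null_string if v is None else v for v in row])
            count += 1
    return count


# Stand-in data; the second row has a NULL ethnicity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (library_name TEXT, ethnicity TEXT)")
conn.executemany("INSERT INTO sample VALUES (?, ?)",
                 [("LIB0001", "unknown"), ("LIB0002", None)])

out_path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
written = save_to_file(
    conn, out_path,
    "SELECT library_name, ethnicity FROM sample ORDER BY library_name",
    headers=True,
)
with open(out_path) as handle:
    contents = handle.read()
```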
-
-
exception
vardb.connections.vardb_connection.
VarDBException
¶ Bases:
exceptions.Exception
vardb.connections.vardb_hdfs module
HDFS class provides connection to hdfs. The class provides apis for basic file manipulations on hdfs. It also controls the directory structure on hdfs, specifying where variant data and dim data live.
-
class
vardb.connections.vardb_hdfs.
HDFS
(database)¶ HDFS class contains connection information, routines for running HDFS commands
-
delete
(remote_path)¶ Deletes a file from hdfs
Parameters: remote_path – path on hdfs
Raises: HdfsException if there is a problem deleting
-
delete_variant_file
(file, temporary=False)¶ Remove a file from hdfs
Parameters: file – The data file object to be deleted
Raises: HdfsException if delete is unsuccessful
-
get_remote_path
(file, temporary=False)¶ Returns path to file on hdfs based on file properties
Parameters: - file – The file class
- temporary – A flag indicating whether this data is temporary, or if it should be stored in the main directory structure of hdfs
Returns: The remote path
-
upload_dim_files
(path, remote_file_name)¶ Function for uploading a dim vcf file from the local FS to HDFS
Parameters: - path – the local path to the vcf file
- remote_file_name – name of the dim table file that is stored in HDFS
-
upload_variant_file
(file, temporary=False)¶ Uploads a file to the correct directory on hdfs
Parameters: file – The data file object to be loaded
Raises: HdfsException if the upload is unsuccessful
-
-
exception
vardb.connections.vardb_hdfs.
HdfsException
¶ Bases:
exceptions.Exception
vardb.connections.vardb_loader module
Loader class provides connection to vardb databases (production and test) with apis that require pivotal permissions. These include any function which modifies data on the database.
-
class
vardb.connections.vardb_loader.
Loader
(database=None, account_file='.gscaccount', username=None, password=None)¶ Bases:
vardb.connections.vardb_connection.VarDB
Loader is a child class of VarDB. In addition to connecting, it checks that the user has the loader role (needed for loading etc.). This is redundant at the moment since only pivotal has loading privileges.
-
aggregate_effects
()¶ Joins frequently accessed data from the snp_eff, cosmic, clinvar, and dbsnp tables
Returns: The number of rows loaded to the annotations_agg table
-
aggregate_somatic_cnvs
()¶ Aggregates somatic cnvs to calculate the copy number by Ensembl gene id
Returns: the number of rows in gene_copies
-
aggregate_somatic_snps_indel
()¶ Aggregates all somatic snvs and indels in all tables by variant_id to somatic_snps_indels_agg
Returns: the number of rows in somatic_snps_indels_agg
-
aggregate_vcall
()¶ Aggregates the vcall table by variant_id to vcall_agg
Returns: the number of rows in vcall_agg
-
analyze
(table_name=None)¶ Runs analyze on either the whole database (table_name=None) or a specific table. This is important for good database performance.
Parameters: table_name – name of a table to analyze
-
create_external_table
(file, table_name, header_name, temporary=False)¶ Creates an external table from a data file
Parameters: - file – a variant data file object
- table_name – the name to give to the external table
- header_name – the name to give to the header table (logs errors in file format versus table format)
- temporary – True if the data is temporary, and not to be loaded to a permanent table on the database
Returns: the number of rows loaded to the external table
-
drop_external_table
(table_name, header_name=None)¶ Drops external table
Parameters: - table_name – name of table to be dropped
- header_name – a name of the header table corresponding to the external table to be dropped as well
-
drop_table
(table_name)¶ Drops table
Parameters: table_name – name of table to be dropped
-
dump_table
(table_name, dir)¶ Dumps a database table onto a TSV file in a specified local directory. The name of the file stored will always be of the form ‘table_name’.tsv
Parameters: - table_name – the name of a table in the database to be dumped
- dir – the directory on the local FS to store the dumped tables
-
filter_somatic_snvs_indels
()¶
-
get_unannotated_vcf
(local_path)¶ Copies data in unannotated_snps_indels to file
Parameters: local_path – local path where the data will be saved
-
load
(file, simulate)¶ Chooses the appropriate loading function for the type of the variant data file, then loads the data file to the appropriate table. Most of the information for data loading is contained within the variant data file class itself, including:
- the table to load to: file._table_name
- the psql command to use for loading: file._insert_cmd
Parameters: - file – a variant data file
- simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
Returns: the number of rows loaded to the data table
Raises: VarDBException
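The simulate flag describes a load that runs inside a transaction and is rolled back instead of committed. A minimal sketch of that pattern follows, using sqlite3 as a stand-in. The function `load_rows`, the single-column `vcall` table, and the row values are illustrative assumptions and do not reflect the real loading code.

```python
import sqlite3


def load_rows(connection, rows, simulate):
    """Sketch: insert rows, rolling back when simulate is true."""
    cursor = connection.cursor()
    try:
        cursor.executemany("INSERT INTO vcall (variant_id) VALUES (?)", rows)
        loaded = cursor.rowcount
    except sqlite3.Error:
        connection.rollback()
        raise
    if simulate:
        connection.rollback()  # discard the inserts, keep the row count
    else:
        connection.commit()
    return loaded


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vcall (variant_id INTEGER)")
conn.commit()

simulated = load_rows(conn, [(1,), (2,)], simulate=True)
after_simulation = conn.execute("SELECT COUNT(*) FROM vcall").fetchone()[0]
loaded = load_rows(conn, [(1,), (2,), (3,)], simulate=False)
after_load = conn.execute("SELECT COUNT(*) FROM vcall").fetchone()[0]
```

A simulated load still reports how many rows would have been written, which makes it usable as a dry run.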
-
load_functions
= {'MutSeq_v1': 'load_vcf_table', 'MutSeq_v2': 'load_vcf_table', 'StrelkaIndels': 'load_vcf_table', 'StrelkaSnps': 'load_vcf_table', 'VCall': 'load_vcall', 'VcfAnnotations': 'load_vcf_annotations'}¶
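A mapping from file type to method name, as in load_functions, suggests name-based dispatch. One way this could work is sketched below with `getattr`. The classes `VariantFile` and `SketchLoader`, the `file_type` attribute, and the stub method bodies are hypothetical; how the real Loader determines a file's type is not documented here.

```python
class VariantFile:
    """Hypothetical stand-in for a variant data file object."""

    def __init__(self, file_type):
        self.file_type = file_type


class SketchLoader:
    # Copied from the documented load_functions attribute.
    load_functions = {
        "MutSeq_v1": "load_vcf_table",
        "MutSeq_v2": "load_vcf_table",
        "StrelkaIndels": "load_vcf_table",
        "StrelkaSnps": "load_vcf_table",
        "VCall": "load_vcall",
        "VcfAnnotations": "load_vcf_annotations",
    }

    def load(self, file, simulate):
        """Look up the method name for the file type and call it."""
        method_name = self.load_functions.get(file.file_type)
        if method_name is None:
            raise ValueError("no loader registered for %r" % file.file_type)
        return getattr(self, method_name)(file, simulate)

    # Stub loaders: each returns its own name so the dispatch is visible.
    def load_vcf_table(self, file, simulate):
        return "load_vcf_table"

    def load_vcall(self, file, simulate):
        return "load_vcall"

    def load_vcf_annotations(self, file, simulate):
        return "load_vcf_annotations"


loader = SketchLoader()
chosen = {name: loader.load(VariantFile(name), simulate=True)
          for name in SketchLoader.load_functions}
```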
-
load_table
(file, simulate)¶ Loads the data file to the appropriate table
Parameters: - file – a variant data file
- simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
-
load_temp_vcf_table
(file, table_name)¶ Loads the vcf variant data file to a temporary table
Parameters: - file – a variant data file
- table_name – name of temporary vcf table
-
load_unannotated_snps_indels
(file, parsed_name)¶ Loads unannotated variants to the table unannotated_snps_indels from a temporary table which contains the info-parsed vcf data
Parameters: - file – the VCF file object to be loaded
- parsed_name – name of the temporary table with data parsed
Returns: the number of snps and indels added to unannotated_snps_indels table
-
load_vcall
(file, simulate)¶ Loads vcall data to vardb
Parameters: - file – VCall object
- simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
Returns: rows loaded
-
load_vcf_annotations
(file, simulate)¶ Loads the vcf annotations data file to the effects table
Parameters: - file – a variant data file
- simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
-
load_vcf_table
(file, simulate)¶ Loads the vcf variant data file to the appropriate table
Parameters: - file – a variant data file
- simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
-
mark_non_production
(output_data_path)¶
-
restore_table
(table_name, path)¶ Restores contents of table to match the contents of the file table_name.tsv in the local directory path. WARNING: any existing data in the table will be truncated!
Parameters: - table_name – the table we wish to restore back to the database
- path – the path to the TSV file in the Local FS that stores the table data
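The dump and restore pair can be pictured as a TSV round trip: dump writes `table_name.tsv` into a directory, and restore empties the table before reloading it from that file. The sketch below uses sqlite3 and the csv module as stand-ins. It treats the second argument of restore as a directory, which is only one reading of the description above, and it interpolates the table name directly, which is acceptable only for trusted names. The table `gene_dim` and its rows are made up.

```python
import csv
import os
import sqlite3
import tempfile


def dump_table(connection, table_name, directory):
    """Sketch: write every row of table_name to <directory>/<table_name>.tsv."""
    path = os.path.join(directory, table_name + ".tsv")
    cursor = connection.execute('SELECT * FROM "%s"' % table_name)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(cursor)
    return path


def restore_table(connection, table_name, directory):
    """Sketch: empty table_name, then reload it from its TSV dump."""
    path = os.path.join(directory, table_name + ".tsv")
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    connection.execute('DELETE FROM "%s"' % table_name)  # existing data is lost
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        connection.executemany(
            'INSERT INTO "%s" VALUES (%s)' % (table_name, placeholders), rows)
    connection.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_dim (gene_id TEXT, gene_name TEXT)")
original = [("G1", "alpha"), ("G2", "beta")]
conn.executemany("INSERT INTO gene_dim VALUES (?, ?)", original)
conn.commit()

workdir = tempfile.mkdtemp()
dump_path = dump_table(conn, "gene_dim", workdir)
conn.execute("DELETE FROM gene_dim")  # simulate losing the table contents
conn.commit()
restored_count = restore_table(conn, "gene_dim", workdir)
restored = conn.execute(
    "SELECT gene_id, gene_name FROM gene_dim ORDER BY rowid").fetchall()
```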
-
truncate_table
(table_name)¶ Truncates table
Parameters: table_name – name of table to be truncated
-
truncate_unannotated_snps_indels
()¶ Truncates the unannotated_snps_indels table
-
update_analysis
(records)¶ Updates analysis table to contain the information in records
Parameters: records – SampleMetaData object containing metadata to be loaded
Raises: VarDBException if update fails
-
update_sample
(records)¶ Loads a single record into the sample table
Parameters: records – SampleMetaData object containing metadata to be loaded
Raises: VarDBException if update fails
-