vardb.connections package

The Connections package contains classes for interfacing with various databases at the GSC. Currently, there are classes for vardb, BioApps (solexa) and LIMS. These classes provide all details of the connection (host, port, etc.) as well as useful apis for querying (all classes) and loading (vardb only) data. The BioApps and LIMS connection classes are used primarily for obtaining the metadata needed for loading data files to vardb. These classes are not general purpose (those systems have their own web apis for general-purpose queries), but are tailored for use with vardb.
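
As a rough usage sketch (the database name and library identifiers below are placeholders rather than values prescribed by the package), a typical query session instantiates one of these connection classes and works with the returned pandas dataframes:

    from vardb.connections.vardb_connection import VarDB
    from vardb.connections.lims import LIMS

    db = VarDB(database='vardb_test', account_file='.gscaccount')    # placeholder database name
    lims = LIMS(account_file='.limsaccount')

    n_samples = db.count('sample')                        # simple query helper on the vardb connection
    lims_df = lims.get_LIMS_info(['LIB0001', 'LIB0002'])  # metadata lookup for two placeholder libraries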

Submodules

vardb.connections.bioapps module

BioApps class provides connection to BioApps database. It also contains useful apis to obtain metadata required for loading data to vardb.

class vardb.connections.bioapps.BioApps

Bases: vardb.connections.connection.Connection

get_expression_data(projects, output_path=None)

Returns the expression data for the input projects.

Parameters:
  • projects – List of project names.
  • output_path – If supplied, results will be written to this path.
Returns:

a DataFrame containing the query results.

get_merge_data(df, output_path=None)

Returns the merge data for the libraries in the input dataframe.

Parameters:
  • df – A dataframe with at least the merged bam file and the library name
  • output_path – If supplied, results will be written to this path.
Returns:

a DataFrame containing the query results.

get_somatic_data(projects, pipelines, output_path=None)

Returns the somatic variant calling data for the input projects and pipelines.

Parameters:
  • projects – List of project names.
  • pipelines – A list of pipelines to query. Options are {'strelka', 'mutationseq', 'cnv'}
  • output_path – If supplied, results will be written to this path.
Returns:

a DataFrame containing the query results.

get_tumor_content(output_path=None)
get_variant_calling_data(projects, output_path=None)

Returns the variant calling data for the input projects.

Parameters:
  • projects – List of project names.
  • output_path – If supplied, results will be written to this path.
Returns:

a DataFrame containing the query results.
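
A minimal sketch of how these helpers might be combined (project names, pipelines and paths below are illustrative only; output_path is optional in every call):

    from vardb.connections.bioapps import BioApps

    bioapps = BioApps()

    # Each call returns a pandas DataFrame; output_path additionally writes the results to disk.
    vcall_df = bioapps.get_variant_calling_data(['project_a'], output_path='/tmp/vcall.tsv')
    somatic_df = bioapps.get_somatic_data(['project_a'], pipelines=['strelka', 'mutationseq'])
    tc_df = bioapps.get_tumor_content()   # tumour content is not keyed on project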

vardb.connections.bioapps_api module

Obsolete for now. New apis are being written which should facilitate communication with BioApps. For now, we use raw SQL in the bioapps.py module.

class vardb.connections.bioapps_api.BioApps_API(account_file='.bioappsaccount', username=None, password=None)

Bases: vardb.connections.connection.Connection

BioApps provides a connection to bioapps api calls

aligned_libcore_map = {'analysis_software': 'aligner', 'bioapps_data_path': 'input_data_path', 'sequence_length': 'read_length'}
database = 'solexa'
get_sample_info(library)
library_map = {'barcode': 'barcode', 'exon_capture_kit_version_number': 'exon_capture_kit_version', 'lower_protocol': 'sequencing_protocol', 'project_name': 'project', 'upper_protocol': 'library_construction_protocol'}
server_string = 'http://%s:%s@www.bcgsc.ca/data/sbs/viewer/api'
source_map = {'anatomic_site': 'anatomic_site', 'original_source_name': 'patient_id', 'pathology': 'disease_status', 'pathology_alias': 'pathology_alias', 'pathology_type': 'pathology_type', 'sex': 'gender', 'stage': 'developmental_stage'}
exception vardb.connections.bioapps_api.BioappsException

Bases: exceptions.Exception

vardb.connections.bioapps_sql module

Raw SQL queries used to obtain metadata from the BioApps database.

vardb.connections.connection module

Base class for connecting to a database. It gets credentials from an account file. We may want to change how we manage credentials at a later date.

class vardb.connections.connection.Connection(account_file=None, username=None, password=None)

Bases: object

Base class provides methods for getting login information from input parameters, or from file
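
A brief sketch of the two ways credentials can be supplied through this base class, as passed through by its subclasses (the username and password values are placeholders):

    from vardb.connections.vardb_connection import VarDB

    # Credentials resolved from an account file (name shown is the documented default;
    # how the path is resolved is implementation-specific).
    db = VarDB(account_file='.gscaccount')

    # Or supplied explicitly, bypassing the account file.
    db = VarDB(username='analyst', password='not-a-real-password')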

exception vardb.connections.connection.ConnectionException

Bases: exceptions.Exception

vardb.connections.karen module

Karen class provides connection to Karen’s development database, which currently contains the tumour content and ploidy. This class will have to be modified once this data moves into development.

class vardb.connections.karen.Karen

Bases: vardb.connections.connection.Connection

get_karen_info()

Returns the tumour content and ploidy for all libraries in the database

Returns:a dataframe with the results, keyed on the library name
pandas_query(command, data=None)

Performs a query and returns the results as a pandas dataframe

Parameters:
  • command – SQL command
  • data – parameters needed in command
Returns:

dataframe with query results
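
A short sketch of pulling tumour content and ploidy through this class; the SQL, table and column names in the second call are illustrative only:

    from vardb.connections.karen import Karen

    karen = Karen()
    tc_df = karen.get_karen_info()   # one row per library, keyed on library name

    # The generic query helper is also available; placeholder syntax depends on the driver.
    df = karen.pandas_query('SELECT * FROM content WHERE library = %(lib)s',
                            data={'lib': 'LIB0001'})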

vardb.connections.lims module

LIMS class provides connection to LIMS web apis. It also contains useful apis to obtain metadata required for loading data to vardb.

class vardb.connections.lims.LIMS(account_file='.limsaccount', username=None, password=None)

Bases: vardb.connections.connection.Connection

LIMS provides a connection to LIMS api calls

get_LIMS_info(libraries)

Gets ethnicity and library strategy from LIMS

Parameters:libraries – a list of the library names to query
Returns:a dataframe with columns for the library_name, ethnicity and library_strategy
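
A sketch of the typical call (library names are placeholders):

    from vardb.connections.lims import LIMS

    lims = LIMS()   # uses the default .limsaccount file
    df = lims.get_LIMS_info(['LIB0001', 'LIB0002'])
    # df has columns library_name, ethnicity and library_strategy (per the docstring above).
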
exception vardb.connections.lims.LimsException

Bases: exceptions.Exception

vardb.connections.vardb_connection module

VarDB class provides connection to vardb databases (production and test). The class provides apis for querying the database. The apis in this class should be available to all users with query access to the database. Functions that modify the database are found in the vardb_loader package.

class vardb.connections.vardb_connection.VarDB(database=None, account_file='.gscaccount', username=None, password=None)

Bases: vardb.connections.connection.Connection

Provides connection to vardb databases

analysis_exists(md5sum, analysis_date)

Checks whether the file has already been loaded into gvdb

Parameters:
  • md5sum – the md5sum for the analysis to be retrieved
  • analysis_date – the modified timestamp on the data file
Returns:

True or False

analysis_record_exists(metadata)

Checks whether the entire row defined by the metadata object is found in analysis. This determines whether the metadata has been updated.

Parameters:metadata – a SampleMetaData object
Returns:True if the metadata already exists in a row of analysis, False otherwise
count(table)

counts the number of rows in the table

Parameters:table – table name
Returns:the number of rows in table
data_exists(table_name, md5sum, library_name)

Checks whether the data exists in table table_name

Parameters:
  • table_name – name of table to check
  • md5sum
  • library_name
Returns:

True or False

dict_query(command, data=None)

Runs command on database and returns results in dictionary format

Parameters:
  • command – a string psql command
  • data – a dictionary of parameters to complete the query
Returns:

dictionary result

Raises:

VarDBException if there was a problem with command

get_analysis(md5sum, analysis_date)

Gets records from analysis table

Parameters:
  • md5sum – the md5sum for the analysis to be retrieved
  • analysis_date
Returns:

the row of analysis matching the selected md5sum if it exists, otherwise None

get_latest_analysis(md5sum, analysis_date)

Gets records from the analysis_all table, i.e. these can be non-production

Parameters:
  • md5sum – the md5sum for the analysis to be retrieved
  • analysis_date
Returns:

the row of analysis_all matching the selected md5sum if it exists, otherwise None

get_sample(library_name)

Gets records from sample table

Parameters:library_name – a library name to look up in sample table
Returns:the entire row in sample table corresponding to the specified library name if it exists, otherwise None
library_exists(table_name, library_name)

Checks whether the sample exists in the given table based on the library_name

Parameters:
  • table_name – name of table to check
  • library_name
Returns:

True or False

object_exists(object_type, object_id)

Checks whether a file exists based on the object type and id

Parameters:
  • object_type – analysis_object_type
  • object_id – analysis_object_id
Returns:

True if it exists, otherwise False

pandas_query(command, data=None)

Runs command on database and returns results in pandas dataframe format

Parameters:
  • command – a string psql command
  • data – a dictionary of parameters to complete the query
Returns:

pandas dataframe result

Raises:

VarDBException if there was a problem with command
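
A sketch of the same parameterized query run through both helpers; the SQL, table and placeholder syntax are illustrative and depend on the schema and the underlying driver:

    from vardb.connections.vardb_connection import VarDB

    db = VarDB()

    cmd = 'SELECT * FROM sample WHERE library_name = %(lib)s'
    as_dict = db.dict_query(cmd, data={'lib': 'LIB0001'})
    as_df = db.pandas_query(cmd, data={'lib': 'LIB0001'})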

sample_exists(library_name)

Checks whether the sample has already been loaded into gvdb

Parameters:library_name
Returns:True or False
sample_record_exists(metadata)

Checks if the sample metadata exists in sample

Parameters:metadata – a SampleMetaData object with the records to compare with the database
Returns:True if the library exists in the database and has exactly the same values as metadata, False otherwise
save_to_file(local_path, command, data=None, null_string='.', headers=False)

Saves the output of a query to file

Parameters:
  • local_path – destination path for the output
  • command – PSQL query to format output
  • data – a dictionary of parameters with which to complete the query command
  • null_string – string to output as null
  • headers – True if you want to output the column headers as well as the data
Raises:

VarDBException if there is a problem with either the query or the file IO
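
A sketch of exporting a query result to a local file; the destination path and the column names in the query are illustrative:

    from vardb.connections.vardb_connection import VarDB

    db = VarDB()
    db.save_to_file('/tmp/samples.tsv',
                    'SELECT library_name, project FROM sample',
                    null_string='.',
                    headers=True)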

exception vardb.connections.vardb_connection.VarDBException

Bases: exceptions.Exception

vardb.connections.vardb_hdfs module

HDFS class provides connection to hdfs. The class provides apis for basic file manipulations on hdfs. It also controls the directory structure on hdfs, specifying where variant data and dim data live.

class vardb.connections.vardb_hdfs.HDFS(database)

HDFS class contains connection information and routines for running HDFS commands

delete(remote_path)

Deletes a file from hdfs

Parameters:remote_path – path on hdfs
Raises:HdfsException if there is a problem deleting
delete_variant_file(file, temporary=False)

Remove a file from hdfs

Parameters:file – The data file object to be loaded
Raises:HdfsException if delete is unsuccessful
get_remote_path(file, temporary=False)

Returns path to file on hdfs based on file properties

Parameters:
  • file – The file class
  • temporary – A flag indicating whether this data is temporary, or if it should be stored in the main directory structure of hdfs
Returns:

The remote path

upload_dim_files(path, remote_file_name)

Uploads a dim vcf file from the local FS to HDFS

Parameters:
  • path – the local path to the vcf file
  • remote_file_name – name of the dim table file that is stored in HDFS

upload_variant_file(file, temporary=False)

Uploads a file to the correct directory on hdfs

Parameters:file – The data file object to be loaded
Raises:HdfsException if the upload is unsuccessful
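
A sketch of two of the file operations; the database name and the paths are placeholders, and the two calls are independent of each other:

    from vardb.connections.vardb_hdfs import HDFS

    hdfs = HDFS('vardb_test')                      # placeholder database name

    # Dim files are uploaded by local path and remote file name.
    hdfs.upload_dim_files('/tmp/genes.vcf', 'genes.vcf')

    # Deletes take a path on hdfs.
    hdfs.delete('/path/on/hdfs/old_file.vcf')
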
exception vardb.connections.vardb_hdfs.HdfsException

Bases: exceptions.Exception

vardb.connections.vardb_loader module

Loader class provides connection to vardb databases (production and test) with apis that require pivotal permissions. These include any function which modifies data on the database.

class vardb.connections.vardb_loader.Loader(database=None, account_file='.gscaccount', username=None, password=None)

Bases: vardb.connections.vardb_connection.VarDB

Loader is a child class of VarDB. In addition to connecting, it checks that the user has the loader role (needed for loading etc.). This is redundant at the moment since only pivotal has loading privileges.

aggregate_effects()

Joins frequently accessed data from the snp_eff, cosmic, clinvar, and dbsnp tables

Returns:The number of rows loaded to the annotations_agg table
aggregate_somatic_cnvs()

Aggregates somatic cnvs to calculate the copy number by Ensembl gene id

Returns:the number of rows in gene_copies
aggregate_somatic_snps_indel()

Aggregates all somatic snvs and indels in all tables by variant_id to somatic_snps_indels_agg

Returns:the number of rows in somatic_snps_indels_agg
aggregate_vcall()

Aggregates the vcall table by variant_id to vcall_agg

Returns:the number of rows in vcall_agg
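
A sketch of running the aggregation steps in sequence after a load; the order shown is illustrative rather than prescribed, and the database name is a placeholder:

    from vardb.connections.vardb_loader import Loader

    loader = Loader(database='vardb_test')

    # Each aggregation returns the number of rows in its target table.
    n_vcall = loader.aggregate_vcall()
    n_snps = loader.aggregate_somatic_snps_indel()
    n_cnvs = loader.aggregate_somatic_cnvs()
    n_effects = loader.aggregate_effects()

    loader.analyze()   # refresh database statistics afterwards (see analyze below)
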
analyze(table_name=None)

Runs analyze on either the whole database (table_name=None) or a specific table. This is important for good database performance.

Parameters:table_name – name of a table to analyze
create_external_table(file, table_name, header_name, temporary=False)

Creates an external table from a data file

Parameters:
  • file – a variant data file object
  • table_name – the name to give to the external table
  • header_name – the name to give to the header table (logs errors in file format vs. table format)
  • temporary – True if the data is temporary, and not to be loaded to a permanent table on the database
Returns:

the number of rows loaded to the external table

drop_external_table(table_name, header_name=None)

Drops external table

Parameters:
  • table_name – name of table to be dropped
  • header_name – the name of the header table corresponding to the external table; if supplied, it is dropped as well
drop_table(table_name)

Drops table

Parameters:table_name – name of table to be dropped
dump_table(table_name, dir)

Dumps a database table to a TSV file in a specified local directory. The name of the stored file will always be of the form 'table_name'.tsv

Parameters:
  • table_name – the name of a table in the database to be dumped
  • dir – the directory on the local FS to store the dumped tables
filter_somatic_snvs_indels()
get_unannotated_vcf(local_path)

Copies data in unannotated_snps_indels to file

Parameters:local_path – local path where the data will be saved
load(file, simulate)

Chooses the appropriate loading function for the type of the variant data file and loads the data file to the appropriate table. Most of the information needed for data loading is contained within the variant data file class itself, including:

  • the table to load to: file._table_name
  • the psql command to use for loading: file._insert_cmd
Parameters:
  • file – a variant data file
  • simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
Returns:

the number of rows loaded to the data table

Raises:

VarDBException

load_functions = {'MutSeq_v1': 'load_vcf_table', 'MutSeq_v2': 'load_vcf_table', 'StrelkaIndels': 'load_vcf_table', 'StrelkaSnps': 'load_vcf_table', 'VCall': 'load_vcall', 'VcfAnnotations': 'load_vcf_annotations'}
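
A sketch of dispatch-driven loading; the database name is a placeholder and data_file stands in for a variant data file object (one of the types named in load_functions above) constructed elsewhere:

    from vardb.connections.vardb_loader import Loader

    loader = Loader(database='vardb_test')

    # load() picks the load_* method registered for the file's type in load_functions;
    # simulate=True performs the load and then rolls it back.
    n_rows = loader.load(data_file, simulate=True)
    print('%d rows would have been loaded' % n_rows)
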
load_table(file, simulate)

loads the data file to the appropriate table

Parameters:
  • file – a variant data file
  • simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
load_temp_vcf_table(file, table_name)

loads the vcf variant data file to a temporary table

Parameters:
  • file – a variant data file
  • table_name – name of temporary vcf table
load_unannotated_snps_indels(file, parsed_name)

Loads unannotated variants to the table unannotated_snps_indels from a temporary table which contains the info-parsed vcf data

Parameters:
  • file – the VCF file object to be loaded
  • parsed_name – name of the temporary table containing the parsed data
Returns:

the number of snps and indels added to unannotated_snps_indels table

load_vcall(file, simulate)

Loads vcall data to vardb

Parameters:
  • file – VCall object
  • simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
Returns:

rows loaded

load_vcf_annotations(file, simulate)

loads the vcf annotations data file to the effects table

Parameters:
  • file – a variant data file
  • simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
load_vcf_table(file, simulate)

loads the vcf variant data file to the appropriate table

Parameters:
  • file – a variant data file
  • simulate – boolean indicating whether to actually load data, or to only simulate loading and to roll back changes
mark_non_production(output_data_path)
restore_table(table_name, path)

Restores contents of table to match the contents of the file table_name.tsv in the local directory path. WARNING: any existing data in the table will be truncated!

Parameters:
  • table_name – the table we wish to restore back to the database
  • path – the path to the TSV file in the Local FS that stores the table data
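
A sketch of the dump/restore round trip; the table name and paths are placeholders, and restore_table truncates the existing table before restoring (see the warning above):

    from vardb.connections.vardb_loader import Loader

    loader = Loader(database='vardb_test')        # placeholder database name

    loader.dump_table('sample', '/tmp/backups')                # writes /tmp/backups/sample.tsv
    loader.restore_table('sample', '/tmp/backups/sample.tsv')  # path points at the dumped TSV file
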
truncate_table(table_name)

Truncates table

Parameters:table_name – name of table to be truncated
truncate_unannotated_snps_indels()

Truncates the unannotated_snps_indels table

update_analysis(records)

Updates analysis table to contain the information in records

Parameters:records – SampleMetaData object containing metadata to be loaded
Raises:VarDBException if update fails
update_sample(records)

Loads a single record into the sample table

Parameters:records – SampleMetaData object containing metadata to be loaded
Raises:VarDBException if update fails