Overview
SCRIPTS
The scripts package includes stand-alone scripts for use in loading data to vardb.
VARDB
The vardb package contains utilities for interfacing with the variant databases. The scripts build on these utilities to implement more complex functionality.
variant_file_loaders
This package is responsible for loading the data files and their metadata to the variant database. To prevent data integrity issues, where the data and metadata are out of sync, we require that the data and metadata be loaded at the same time, in a single transaction.
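The single-transaction requirement can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the real database, and the table names, column names, and the load_together helper are all hypothetical; it is not the package's actual loading code.

```python
import sqlite3


def load_together(conn, metadata_row, data_rows):
    """Insert one metadata row and its data rows atomically (hypothetical helper)."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if any statement raises.
    with conn:
        cur = conn.execute(
            "INSERT INTO file_metadata (path, md5sum) VALUES (?, ?)",
            metadata_row,
        )
        file_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO variants (file_id, chrom, pos) VALUES (?, ?, ?)",
            [(file_id, chrom, pos) for chrom, pos in data_rows],
        )


# In-memory stand-in schema (assumed, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_metadata (id INTEGER PRIMARY KEY, path TEXT, md5sum TEXT)"
)
conn.execute(
    "CREATE TABLE variants (file_id INTEGER, chrom TEXT NOT NULL, pos INTEGER NOT NULL)"
)
conn.commit()

# A clean load: metadata and data are committed together.
load_together(conn, ("/data/a.tsv", "abc123"), [("1", 100), ("2", 200)])

# A bad data row (missing position) aborts the whole load, metadata included.
try:
    load_together(conn, ("/data/b.tsv", "def456"), [("1", 100), ("X", None)])
except sqlite3.IntegrityError:
    pass

metadata_count = conn.execute("SELECT COUNT(*) FROM file_metadata").fetchone()[0]
variant_count = conn.execute("SELECT COUNT(*) FROM variants").fetchone()[0]
```

Because the failed load is rolled back as a unit, the database never holds a metadata record without its data, or data without its metadata.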
The main input to the load_files function is the loader file, a tab-delimited file with a column for each required metadata field as well as the full path to the data file. The metadata_wrangling package creates this file from metadata obtained from other GSC databases, or by scraping the file system if the pipeline is not tracked in a database.
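As a rough illustration of what reading a loader file involves, the sketch below parses a tab-delimited file with the standard csv module and checks for required columns. The column names and the read_loader helper are assumptions made for the example; the real required columns are defined by vardb.

```python
import csv
import io

# Hypothetical column names; the real loader file defines its own.
REQUIRED_COLUMNS = {"library", "pipeline", "file_path"}


def read_loader(handle):
    """Return the rows of a tab-delimited loader file as dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"loader file is missing columns: {sorted(missing)}")
    return list(reader)


# An in-memory example standing in for a loader file on disk.
example = io.StringIO(
    "library\tpipeline\tfile_path\n"
    "LIB001\texample_pipeline\t/path/to/LIB001.tsv\n"
)
rows = read_loader(example)
```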
The metadata is stored and validated using the SampleMetaData class. This class defines all of the metadata fields that vardb requires, along with their expected types and any default values.
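One way to keep required fields, expected types, and defaults in a single place is a dataclass that checks itself on construction. The sketch below is not the SampleMetaData class itself; the class name, field names, and default values are all invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class SampleMetaDataSketch:
    """Hypothetical stand-in: fields without defaults are required."""

    library: str
    pipeline: str
    file_path: str
    genome_build: str = "hg19"  # assumed default, for illustration only
    is_tumour: bool = False     # assumed default, for illustration only

    def __post_init__(self):
        # Check every field against its annotated type.
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, field.type):
                raise TypeError(
                    f"{field.name} should be {field.type.__name__}, "
                    f"got {type(value).__name__}"
                )


meta = SampleMetaDataSketch(
    library="LIB001",
    pipeline="example_pipeline",
    file_path="/path/to/LIB001.tsv",
)
```

Omitting a required field raises a TypeError from the generated constructor, and a value of the wrong type raises a TypeError from the check in `__post_init__`.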
The information about each type of data file and how it is to be parsed and loaded to vardb is contained in the variant_data_files classes.
connections
The connections package contains classes for interfacing with various databases at the GSC. Currently, there are classes for vardb, BioApps (solexa) and LIMS. These classes provide all details of the connection (host, port, etc.) as well as useful APIs for querying (all classes) and loading (vardb only) data. The BioApps and LIMS connection classes are used primarily for obtaining the metadata needed for loading data files to vardb. They are not general purpose, since BioApps and LIMS have their own web APIs for general-purpose queries; instead, they are tailored for use with vardb.
variant_data_files
There is one variant data file class for each file type that is loaded to vardb. Each class contains all of the information required to load its type of data to the database:
- column names and data types for the files
- information on the headers
- which table the data is loaded to in vardb
- PSQL commands for parsing and loading the data to tables on vardb
- routines to compute required metadata from the file, such as the creation date and md5sum
- the pipelines that the class belongs to
When a new data type is added to vardb, a new variant data file class must be created with this information.
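A new class of this kind might look roughly like the sketch below. The class name, attribute names, table name, and columns are hypothetical, and the PSQL loading commands are left out. The file's modification time is used here as an approximation of the creation date.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


class VariantDataFileSketch:
    """Hypothetical description of one file type and how to load it."""

    table = "example_variants"  # target table (assumed name)
    columns = {"chrom": str, "pos": int, "ref": str, "alt": str}
    header_lines = 1            # lines to skip before the data
    pipelines = ("example_pipeline",)

    def __init__(self, path):
        self.path = path

    def md5sum(self, chunk_size=1 << 20):
        """Checksum the file in chunks so large files fit in memory."""
        digest = hashlib.md5()
        with open(self.path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def modified_date(self):
        """Last-modified time in UTC, used as a proxy for creation date."""
        return datetime.fromtimestamp(os.path.getmtime(self.path), tz=timezone.utc)


# Demonstrate on a small temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write(b"chrom\tpos\tref\talt\n1\t100\tA\tT\n")
data_file = VariantDataFileSketch(tmp.name)
checksum = data_file.md5sum()
modified = data_file.modified_date()
os.remove(tmp.name)
```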
metadata_wrangling
This package is responsible for retrieving and parsing metadata from internal projects to produce vardb .loader files, which are used for actually loading the data to vardb.
- Each MetadataCollector is responsible for assembling the metadata for a single pipeline. It makes calls to other databases through their connections classes, and/or scrapes the filesystem for the data and metadata for the pipeline.
- The main function in this package is make_loader, which collects the metadata and optionally compares it to a previous version, so that only new and changed data are loaded.
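The comparison against a previous version can be sketched as a keyed diff. This is only an illustration of the idea, not the make_loader implementation: the new_and_changed helper and the file_path and md5sum field names are assumptions.

```python
def new_and_changed(current, previous, key="file_path"):
    """Return rows of `current` that are absent from or differ in `previous`."""
    previous_by_key = {row[key]: row for row in previous}
    # A row is kept when its key is new, or when any field has changed.
    return [row for row in current if previous_by_key.get(row[key]) != row]


previous = [
    {"file_path": "/data/a.tsv", "md5sum": "aaa"},
    {"file_path": "/data/b.tsv", "md5sum": "bbb"},
]
current = [
    {"file_path": "/data/a.tsv", "md5sum": "aaa"},   # unchanged
    {"file_path": "/data/b.tsv", "md5sum": "bbb2"},  # changed checksum
    {"file_path": "/data/c.tsv", "md5sum": "ccc"},   # new file
]
to_load = new_and_changed(current, previous)
```

Here `to_load` holds the rows for `/data/b.tsv` and `/data/c.tsv`, while the unchanged `/data/a.tsv` is skipped.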
queries
The queries package contains standalone command-line routines for making production queries to the database. So far, these include:
- Germline CNV query: identifies CNVs that overlap with specified genes
- Experimental records: queries the database to obtain the effects and annotations for each variant in a test library, as well as all other libraries/patients in vcall in which variants matching those in the test library have been seen
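The overlap test at the heart of the germline CNV query can be illustrated in plain Python. The production query runs against the database, so this is only a sketch of the logic; the Interval type, the helper names, the gene names, and the coordinates are all made up, and intervals are assumed to be half-open.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive (half-open convention, assumed)


def overlaps(a, b):
    """True when two intervals on the same chromosome share any position."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def cnvs_overlapping_genes(cnvs, genes):
    """Pair each CNV with the name of every gene it overlaps."""
    return [
        (cnv, name)
        for cnv in cnvs
        for name, region in genes.items()
        if overlaps(cnv, region)
    ]


# Invented genes and CNVs, for illustration only.
genes = {
    "GENE_A": Interval("1", 1000, 2000),
    "GENE_B": Interval("2", 500, 900),
}
cnvs = [
    Interval("1", 1500, 3000),  # overlaps GENE_A
    Interval("1", 2000, 2500),  # starts where GENE_A ends: no overlap
    Interval("2", 100, 600),    # overlaps GENE_B
]
hits = cnvs_overlapping_genes(cnvs, genes)
```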