scripts package¶
The scripts package includes stand-alone scripts for loading data to vardb.
REQUIREMENTS
To run any of these scripts, you must have:
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- login files in your home directory:
- for vardb the file is .gscaccount
Login files contain a single line of text: “username:password”
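As an illustration only, a login file in this format can be read with a few lines of Python; the helper name read_credentials and this exact handling are assumptions for the sketch, not part of the scripts package:

    import os

    def read_credentials(filename=".gscaccount"):
        """Read a 'username:password' login file from the home directory (illustrative helper)."""
        path = os.path.join(os.path.expanduser("~"), filename)
        with open(path) as handle:
            username, password = handle.read().strip().split(":", 1)
        return username, password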
Submodules¶
scripts.load_weekly_variants_test_env module¶
scripts.vardb_aggregate module¶
Aggregates unpaired and somatic snps and indels to the tables vcall_agg and somatic_snps_indels_agg respectively. Aggregates the annotations to the table annotations_agg.
REQUIREMENTS
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- you must have login files in your home directory; login files have a single line of text with username:password
- for vardb the file is .gscaccount
INPUTS
usage: python vardb_aggregate.py [-h] --database DATABASE [--sim]
[--log_level {debug,info,warning,error}]
[--tables TABLES [TABLES ...]]
optional arguments:
-h, --help show this help message and exit
--sim, -s true if you want to simulate loading, but not commit
the transaction [False]
--log_level {debug,info,warning,error}, -l {debug,info,warning,error}
the level of logging
--tables TABLES [TABLES ...], -t TABLES [TABLES ...]
Input the list of table names you want to aggregate
(default is all tables). Options are:
filtered_simple_somatic, somatic_gene_copies,
somatic_snps_indels_agg, vcall_agg, annotations_agg
required named arguments:
--database DATABASE, -db DATABASE
the database to use
OUTPUTS
The output is logged to stdout.
MODIFIES
This script updates the aggregate tables below with new data in the data tables. The aggregate tables are:
- vcall_agg: from the vcall table, makes a table with variant id as primary key, where all libraries having that variant are collected in a list
- somatic_snps_indels_agg: from all somatic snv and indels tables, makes a table with variant id as primary key, where all libraries having that variant are collected in a list
- annotations_agg: from snp_eff, makes a table with variant id as primary key, where the annotations for the preferred transcript are selected from snp_eff and combined with dbsnp, cosmic, clinvar and darned annotations
- filtered_simple_somatic: finds only high quality somatic snvs and indels by filtering on quality, and looking for only variants called in both strelka and mutationseq pipelines
- somatic_gene_copies: from somatic_cnv, aggregates copy number variants by gene to find the number of whole copies of a gene
scripts.vardb_aggregate.aggregate(database, tables, simulate=False)¶
Aggregates:
- the vcall table to vcall_agg
- the somatic variants in all strelka and mutationseq tables to somatic_snps_indels_agg
- the annotations from snp_eff, dbsnp, cosmic, … to annotations_agg
Parameters: - database – database to aggregate
- tables – a list of tables to aggregate, None means aggregate all tables
- simulate – make no actual changes to the database, just simulate [False]
Returns: 0 if execution is successful, otherwise -1
Modifies: vcall_agg, somatic_snps_indels_agg, annotations_agg
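A minimal sketch of calling the aggregation step from Python rather than from the command line; it assumes the scripts package is importable on your PYTHONPATH, and the database name shown is only a placeholder:

    from scripts import vardb_aggregate

    # Aggregate only the variant-call and annotation tables on a test database.
    # Placeholder database name; simulate=True makes no changes, mirroring --sim.
    status = vardb_aggregate.aggregate(
        database="vardb_test",
        tables=["vcall_agg", "annotations_agg"],
        simulate=True,
    )
    assert status == 0, "aggregation reported an error"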
scripts.vardb_aggregate.main()¶
scripts.vardb_analyze module¶
This script runs analyze on the database. This is an important maintenance function required for optimal database performance, especially after large changes to the data, such as after loading.
REQUIREMENTS
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- you must have login files in your home directory; login files have a single line of text with username:password
- for vardb the file is .gscaccount
INPUTS
usage: python vardb_analyze.pyc [-h] --database DATABASE
[--log_level {debug,info,warning,error}]
optional arguments:
-h, --help show this help message and exit
--log_level {debug,info,warning,error}, -l {debug,info,warning,error}
the level of logging
required named arguments:
--database DATABASE, -db DATABASE
the database to use
OUTPUTS
The output is logged to stdout.
scripts.vardb_analyze.analyze(database)¶
Analyzes all tables on the specified database
Parameters: database – database to analyze
Returns: 0 if execution is successful, otherwise -1
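As a sketch, the same analyze step can be invoked from Python; the database name below is a placeholder:

    from scripts import vardb_analyze

    # Run analyze on every table in the (placeholder) test database.
    if vardb_analyze.analyze(database="vardb_test") != 0:
        raise RuntimeError("analyze failed")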
scripts.vardb_analyze.main()¶
scripts.vardb_load_annotations module¶
Updates the annotation information for loaded snvs and indels. Annotations are loaded automatically when the vcf files have been annotated at time of loading. If not, the new, unique unannotated variants are added to the table unannotated_snps_indels. This script annotates these variants by downloading this table to a vcf file, annotating the file offline, and then uploading the annotation information to snp_eff.
REQUIREMENTS
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- you must have login files in your home directory; login files have a single line of text with username:password
- for vardb the file is .gscaccount
INPUTS
usage: python vardb_load_annotations.pyc [-h] --database DATABASE [--sim]
[--no_trunc] [-ol]
[--log_level {debug,info,warning,error}]
[--remove_temp_files]
vcf_path log_path
positional arguments:
vcf_path path to vcf file containing unannotated variants
log_path path to log file
optional arguments:
-h, --help show this help message and exit
--sim, -s true if you want to simulate loading, but not commit
the transaction [False]
--no_trunc, -nt true if you do not want to truncate the
unannotated_snps_indels table after loading
annotations [False]
-ol, --overwrite_log true if you want to overwrite log file
--log_level {debug,info,warning,error}, -l {debug,info,warning,error}
the level of logging
--remove_temp_files, -rm
true if you want to remove the vcf annotations file
after loading [False]
required named arguments:
--database DATABASE, -db DATABASE
the database in which to load annotations
OUTPUTS
If the flag -rm is not set, the script will create several files in the vcf directory. They will have names:
- [vcf_path] (the unannotated snvs and indels from the unannotated_snps_indels table)
- [vcf_basename].eff.vcf
- [vcf_basename].eff.stats.html
- [vcf_basename].eff.stats.genes.txt
If the -rm flag is set, these files are removed after the variant annotations have been loaded.
A log file is also produced:
- [log_path]: the main log file for the annotations loader
MODIFIES
This script adds the annotations obtained for the variants in unannotated_snps_indels to the table snp_eff. If the --no_trunc flag is not set, the unannotated_snps_indels table is truncated after the annotations are successfully loaded to the effects table.
scripts.vardb_load_annotations.load_annotations(vcf_path, log_path, database, simulate=False, no_truncate=False, rm=False)¶
Loads SnpEff annotations to the database
Parameters: - vcf_path – location where unannotated variants downloaded from the database will be stored
- log_path – location of log file
- database – database to update
- simulate – true if you want to simulate loading, without actually making changes to the database [False]
- no_truncate – true if you do not want to truncate the table unannotated_snps_indels after effects loading [False]
- rm – true if you want to remove the temporary vcf files after effects loading [False]
Modifies: snp_eff; also truncates unannotated_snps_indels unless no_truncate is True
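A hedged example of calling the annotation loader directly from Python; the file paths and database name are placeholders, not values used by the package:

    from scripts import vardb_load_annotations

    # Download unannotated variants to a vcf, annotate offline, and load to snp_eff.
    # Placeholder paths and database; simulate=True only rehearses the load.
    vardb_load_annotations.load_annotations(
        vcf_path="/tmp/unannotated.vcf",
        log_path="/tmp/unannotated.vcf.log",
        database="vardb_test",
        simulate=True,
        no_truncate=True,   # keep unannotated_snps_indels intact
        rm=False,           # keep the intermediate annotation files
    )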
scripts.vardb_load_annotations.main()¶
scripts.vardb_load_dim_tables module¶
Updates dim tables with new data from file.
REQUIREMENTS
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- you must have login files in your home directory; login files have a single line of text with username:password
- for vardb the file is .gscaccount
INPUTS
usage: python vardb_load_dim_tables.py [-h] --database DATABASE
[-l {debug,info,warning,error}]
[-t TABLES [TABLES ...]]
optional arguments:
-h, --help show this help message and exit
-l {debug,info,warning,error}, --log_level {debug,info,warning,error}
the level of logging
-t TABLES [TABLES ...], --tables TABLES [TABLES ...]
A list of dim tables you want to load, if None, all
dim tables are loaded. Options are: clinvar, dbsnp,
cosmic, oasis, pog_comparator
required named arguments:
--database DATABASE, -db DATABASE
the destination database
OUTPUTS
The output is logged to stdout.
MODIFIES
This script updates the dim tables below with new data from their source files.
Tables:
- clinvar, dbsnp, cosmic: external annotation databases. They should be updated whenever a new version is downloaded.
- oasis: the clinical tables that come from the Oasis databases. They should be updated after the oasis metadata wrangling scripts are run on a new oasis data dump.
- pog_comparator: an Excel sheet managed by the bioinformaticians for POG. It is updated nightly.
scripts.vardb_load_dim_tables.load_dim_tables(database, tables, log_level='info')¶
(Re)loads dim tables specified by the tables argument
Parameters: - database – database to load to
- tables – a list of “tables” (each key can correspond to a group of tables in the config file)
- log_level – the level of logging (default is 'info')
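For illustration, reloading only the external annotation dim tables might look like the following sketch (the database name is a placeholder):

    from scripts import vardb_load_dim_tables

    # Reload the external annotation sources only; other dim tables are untouched.
    vardb_load_dim_tables.load_dim_tables(
        database="vardb_test",
        tables=["clinvar", "dbsnp", "cosmic"],
        log_level="debug",
    )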
scripts.vardb_load_dim_tables.main()¶
scripts.vardb_load_variants module¶
Loads data to vardb databases on HAWQ
REQUIREMENTS
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- you must have login files in your home directory; login files have a single line of text with username:password
- for vardb the file is .gscaccount
INPUTS
usage: python vardb_load_variants.py [-h] -db DATABASE [-d LOG_DIRECTORY] [-s]
[-l {debug,info,warning,error}] [-ol]
[-k KWARGS [KWARGS ...]]
load_file
positional arguments:
load_file file to load to gvdb
optional arguments:
-h, --help show this help message and exit
-d LOG_DIRECTORY, --log_dir LOG_DIRECTORY
destination directory for log files [load file
directory]
-s, --sim true if you want to simulate loading, but not commit
the transaction [False]
-l {debug,info,warning,error}, --log_level {debug,info,warning,error}
the level of logging
-ol, --overwrite_log true if you want to overwrite log file
-k KWARGS [KWARGS ...], --kwargs KWARGS [KWARGS ...]
keyword arguments allow you to enter metadata that is
common to all files to be loaded
required named arguments:
-db DATABASE, --database DATABASE
the database in which to load
OUTPUTS
The only file created by this API is a log file. The log file is unique to a particular load file; by default it has the same name as the load_file plus a .log extension. The file is not overwritten: if you run the command twice, new log entries are appended to the end. Stderr is also automatically logged to the file. Note that all of the classes and functions called within the API have their own loggers, which write to the same file so that an error or a bug can be tracked precisely.
If there is an error during processing, an empty [load_file].err file is created in the log directory. If this file exists when the program is started, it is first deleted. If the program fails before the load_file can be obtained from the parameters, a .failure file is generated in the directory from which the script was called.
The format of the log file is: time (Y-m-d H:M:S) | name of calling object (e.g. __hdfs__) : message
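The loader configures its own loggers, but a format equivalent to the one described above can be reproduced with the standard library; this is a sketch of the log format only, not the loader's actual configuration:

    import logging

    # "time (Y-m-d H:M:S) | name of calling object : message"
    logging.basicConfig(
        filename="example_load_file.log",   # placeholder log file name
        format="%(asctime)s | %(name)s : %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        level=logging.INFO,
    )
    logging.getLogger("__hdfs__").info("example message")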
MODIFIES
This script loads variant and expression data from data file specified in the loader file to fact tables in the specified database. It also updates the sample and analysis tables with the metadata associated with the file as obtained from bioapps, lims, and/or the load file. The script automatically detects whether the file, sample metadata or analysis metadata has been loaded previously or has changed and takes the appropriate action to
- load the data for the first time
- ignore data that has been already loaded, or
- update the sample/analysis information
scripts.vardb_load_variants.load_variants(load_file, database, simulate=False, **kwargs)¶
Loads data and metadata specified in the load_file to vardb databases
Parameters: - load_file – tab-separated file with the data needed for loading to vardb
- database – database to be loaded
- simulate – true if you want to only simulate (but not actually load) data [False]
- kwargs – additional keyword arguments
Returns: 0 - successful execution, -1 - unrecoverable errors, >0 - number of files not loaded
Modifies: Adds variants to the variant table associated with the type of data file, and adds metadata to sample and analysis
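A minimal sketch of driving the loader from Python; the load file path and database name are placeholders, and the scripts package is assumed to be importable:

    from scripts import vardb_load_variants

    # Simulate loading one loader file; kwargs can carry metadata common to all files.
    not_loaded = vardb_load_variants.load_variants(
        "/path/to/POG_vcall.loader",   # placeholder load file
        database="vardb_test",
        simulate=True,
    )
    if not_loaded != 0:
        print("unrecoverable error or files not loaded:", not_loaded)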
scripts.vardb_load_variants.main()¶
scripts.vardb_make_loader module¶
This is the pipeline responsible for retrieving and parsing metadata from internal projects to produce vardb .loader files.
REQUIREMENTS
- gsc_sbs_loader group privileges on bioapps
- permission to use the lims apis
- you must have login files in your home directory; login files have a single line of text with username:password
- for bioapps it is .bioappsaccount
- for lims it is .limsaccount
INPUTS
usage: python vardb_make_loader.pyc [-h] --output_directory OUTPUT_DIRECTORY
--project PROJECT --query QUERY
[--log_level {debug,info,warning,error}]
[--last PREVIOUS_METADATA_FILE] [--debug]
optional arguments:
-h, --help show this help message and exit
--log_level {debug,info,warning,error}, -log {debug,info,warning,error}
the level of logging
--last PREVIOUS_METADATA_FILE, -l PREVIOUS_METADATA_FILE
Path to previous metadata file; for finding only new
or changed data
--debug, -d true if you want to run in debug mode
required named arguments:
--output_directory OUTPUT_DIRECTORY, -o OUTPUT_DIRECTORY
directory for saving output
--project PROJECT, -p PROJECT
Project to query. Options are: POG, GPH
--query QUERY, -q QUERY
Query type to perform. Options are: {'POG':
['controlfreec', 'vcall', 'expression',
'small_somatic', 'somatic_cnv', 'somatic_loh'], 'GPH':
['vcall', 'small_somatic', 'somatic_cnv',
'somatic_loh']}
OUTPUTS
Two files are created:
- [project]_[pipeline]_[yyyy-mm-dd].tsv: a file with all data found matching the project and pipeline selected
- [project]_[pipeline]_[yyyy-mm-dd].loader: a file with just the data that is new/changed from PREVIOUS_METADATA_FILE
- PREVIOUS_METADATA_FILE is the previous week’s .tsv file
Output is logged to stdout.
scripts.vardb_make_loader.main()¶
scripts.vardb_make_loader.make_loader(output_directory, project, query_type, previous_metadata_file=None, debug=False)¶
Creates loader files by collecting data from several sources
Parameters: - output_directory – Location of loader files to be written
- project – the project to query
- query_type – the query type to run
- previous_metadata_file – [optional] path to the previous metadata file with which to compare current metadata
- debug – [optional] True if you want to suppress errors for debugging purposes
Returns: 0 = success, -1 = failure
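An illustrative call to the loader-file builder; the project and query values follow the options listed in the usage above, while the output directory and previous metadata file path are placeholders:

    from scripts import vardb_make_loader

    # Build this week's POG vcall loader file, comparing against last week's tsv.
    result = vardb_make_loader.make_loader(
        output_directory="/path/to/loader_files",              # placeholder
        project="POG",
        query_type="vcall",
        previous_metadata_file="/path/to/POG_vcall_prev.tsv",  # placeholder
        debug=False,
    )
    assert result == 0, "make_loader failed"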
scripts.vardb_rollout_db module¶
Rolls out an “empty” database. The data tables are recreated from ddl.
REQUIREMENTS
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- you must have login files in your home directory; login files have a single line of text with username:password
- for vardb the file is .gscaccount
INPUTS
usage: python vardb_rollout_db.py [-h] --database DATABASE
[-l {debug,info,warning,error}]
optional arguments:
-h, --help show this help message and exit
-l {debug,info,warning,error}, --log_level {debug,info,warning,error}
the level of logging
required named arguments:
--database DATABASE, -db DATABASE
the database to rollout
OUTPUTS
Logs to stdout
MODIFIES
EXTREME CAUTION should be used with this function, because it truncates all existing data. Before using it on the test databases, always check to make sure no one else is using them. To roll out the production database, users must type in “yes” to confirm.
scripts.vardb_rollout_db.main()¶
scripts.vardb_rollout_db.rollout(database, log_level='info')¶
Rolls out the database
Parameters: - database – database to roll out
- log_level – how much information to log (default is ‘info’)
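Because rollout truncates all existing data, a sketch like the following should only ever be pointed at a scratch or test database; the database name below is a placeholder:

    from scripts import vardb_rollout_db

    # DANGER: recreates the data tables from ddl and truncates all existing data.
    vardb_rollout_db.rollout(database="vardb_scratch", log_level="info")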
scripts.vardb_update_db module¶
Updates the database with new variants. For each project and pipeline that is currently loaded to vardb, it:
- Obtains updated data and metadata from BioApps/LIMS etc
- Finds new/changed data to be loaded, and creates a loader file
- Loads data to vardb
- Loads new annotations
- Aggregates tables
- Updates database indices
- Runs the germline CNV query
Creates a log file in the directory specified by the environment file vardb_load_weekly_variant_env.py.
REQUIREMENTS
- pivotal group privileges on the vardb databases
- pxf permissions on hdfs
- gsc_sbs_loader group privileges on bioapps
- permission to use the lims apis
- you must have login files in your home directory; login files have a single line of text with username:password
- for bioapps it is .bioappsaccount
- for lims it is .limsaccount
- for vardb the file is .gscaccount
scripts.vardb_update_db.analyze(loader_dict)¶
Runs analyze on full database.
scripts.vardb_update_db.get_most_recent_file(path, prefix)¶
scripts.vardb_update_db.load_annotations(loader_dict)¶
Loads annotations to the target database.
scripts.vardb_update_db.load_variants(loader_dict)¶
Loads each created loader file to the target database.
scripts.vardb_update_db.main(*args, **kwargs)¶
scripts.vardb_update_db.make_load_files()¶
Creates the load files for each type of project/query.
scripts.vardb_update_db.run_aggregation(loader_dict)¶
Runs snp aggregation.
scripts.vardb_update_db.run_germline_query(loader_dict)¶
Runs the germline query.
scripts.vardb_update_db.update_dim_tables()¶
Updates dim tables
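A sketch of the weekly update sequence implied by the functions above, in the order listed at the top of this module. It assumes make_load_files() returns the loader_dict consumed by the later steps, which this page does not state explicitly:

    from scripts import vardb_update_db

    # Assumption: make_load_files() returns the loader_dict passed to the later steps.
    loader_dict = vardb_update_db.make_load_files()

    vardb_update_db.load_variants(loader_dict)       # load each created loader file
    vardb_update_db.load_annotations(loader_dict)    # annotate new unique variants
    vardb_update_db.run_aggregation(loader_dict)     # rebuild the *_agg tables
    vardb_update_db.analyze(loader_dict)             # run analyze on the full database
    vardb_update_db.run_germline_query(loader_dict)  # run the germline CNV query
    vardb_update_db.update_dim_tables()              # refresh the dim tables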
scripts.vardb_update_db_env module¶
Environment file with parameters required for updating the vardb database.