scripts package

The scripts package includes stand-alone scripts for loading data into vardb.

REQUIREMENTS

To run any of these scripts, you must have

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • login files in your home directory; login files have a single line of text with username:password (see the sketch below for the expected format)
    • for vardb the file is .gscaccount
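
The login files are plain text. As a minimal illustration of the expected format (the helper name below is hypothetical; the scripts read these files internally):

from pathlib import Path

def read_credentials(filename=".gscaccount"):
    """Hypothetical helper: return (username, password) from a one-line login file in $HOME."""
    line = (Path.home() / filename).read_text().strip()
    username, password = line.split(":", 1)  # the file contains a single "username:password" line
    return username, password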

Submodules

scripts.load_weekly_variants_test_env module

scripts.vardb_aggregate module

Aggregates unpaired and somatic snps and indels into the tables vcall_agg and somatic_snps_indels_agg, respectively, and aggregates the annotations into the table annotations_agg.

REQUIREMENTS

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for vardb the file is .gscaccount

INPUTS

usage: python vardb_aggregate.py [-h] --database DATABASE [--sim]
                                 [--log_level {debug,info,warning,error}]
                                 [--tables TABLES [TABLES ...]]

optional arguments:
  -h, --help            show this help message and exit
  --sim, -s             true if you want to simulate loading, but not commit
                        the transaction [False]
  --log_level {debug,info,warning,error}, -l {debug,info,warning,error}
                        the level of logging
  --tables TABLES [TABLES ...], -t TABLES [TABLES ...]
                        Input the list of table names you want to aggregate
                        (default is all tables). Options are:
                        filtered_simple_somatic, somatic_gene_copies,
                        somatic_snps_indels_agg, vcall_agg, annotations_agg

required named arguments:
  --database DATABASE, -db DATABASE
                        the database to use

OUTPUTS

The output is logged to stdout.

MODIFIES

This script updates the aggregate tables below with new data in the data tables. The aggregate tables are:

  • vcall_agg: from the vcall table, makes a table with variant id as primary key, where all libraries having that variant are collected in a list
  • somatic_snps_indels_agg: from all somatic snv and indels tables, makes a table with variant id as primary key, where all libraries having that variant are collected in a list
  • annotations_agg: from snp_eff, makes a table with variant id as primary key, where the annotations for the preferred transcript are selected from snp_eff and combined with the dbsnp, cosmic, clinvar and darned annotations
  • filtered_simple_somatic: finds only high quality somatic snvs and indels by filtering on quality, and looking for only variants called in both strelka and mutationseq pipelines
  • somatic_gene_copies: from somatic_cnv, aggregates copy number variants by gene to find the number of whole copies of a gene
scripts.vardb_aggregate.aggregate(database, tables, simulate=False)

Aggregates:

  1. the vcall table to vcall_agg
  2. the somatic variants in all strelka and mutationseq tables to somatic_snps_indels_agg
  3. the annotations from snp_eff, dbsnp, cosmic, … to annotations_agg
Parameters:
  • database – database to aggregate
  • tables – a list of tables to aggregate, None means aggregate all tables
  • simulate – make no actual changes to the database, just simulate [False]
Returns:

0 if execution is successful, otherwise -1

Modifies:

vcall_agg, somatic_snps_indels_agg, annotations_agg
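
A minimal usage sketch of aggregate(); the database name and table selection below are placeholder assumptions:

from scripts.vardb_aggregate import aggregate

# Dry run: re-aggregate only two of the aggregate tables without committing.
status = aggregate("vardb_test", tables=["vcall_agg", "annotations_agg"], simulate=True)
# status is 0 on success, -1 otherwise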

scripts.vardb_aggregate.main()

scripts.vardb_analyze module

This script runs analyze on the database. This is an important maintenance operation required for optimal database performance, especially after large changes to the data, such as after loading.

REQUIREMENTS

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for vardb the file is .gscaccount

INPUTS

usage: python vardb_analyze.pyc [-h] --database DATABASE
                                [--log_level {debug,info,warning,error}]

optional arguments:
  -h, --help            show this help message and exit
  --log_level {debug,info,warning,error}, -l {debug,info,warning,error}
                        the level of logging

required named arguments:
  --database DATABASE, -db DATABASE
                        the database to use

OUTPUTS

The output is logged to stdout.

scripts.vardb_analyze.analyze(database)

Analyzes all tables on the specified database

Parameters:database – the database to analyze
Returns:0 if execution is successful, otherwise -1
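
A minimal usage sketch, assuming a placeholder database name:

from scripts.vardb_analyze import analyze

# Refresh table statistics after a large load.
if analyze("vardb_test") != 0:
    raise RuntimeError("analyze failed")
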
scripts.vardb_analyze.main()

scripts.vardb_load_annotations module

Updates the annotation information for loaded snvs and indels. Annotations are loaded automatically when the vcf files have already been annotated at the time of loading. If not, the new, unique unannotated variants are added to the table unannotated_snps_indels. This script annotates these variants by downloading this table to a vcf file, annotating the file offline, and then uploading the annotation information to snp_eff.

REQUIREMENTS

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for vardb the file is .gscaccount

INPUTS

usage: python vardb_load_annotations.pyc [-h] --database DATABASE [--sim]
                                         [--no_trunc] [-ol]
                                         [--log_level {debug,info,warning,error}]
                                         [--remove_temp_files]
                                         vcf_path log_path

positional arguments:
  vcf_path              path to vcf file containing unannotated variants
  log_path              path to log file

optional arguments:
  -h, --help            show this help message and exit
  --sim, -s             true if you want to simulate loading, but not commit
                        the transaction [False]
  --no_trunc, -nt       true if you do not want to truncate the
                        unannotated_snps_indels table after loading
                        annotations [False]
  -ol, --overwrite_log  true if you want to overwrite log file
  --log_level {debug,info,warning,error}, -l {debug,info,warning,error}
                        the level of logging
  --remove_temp_files, -rm
                        true if you want to remove the vcf annotations file
                        after loading [False]

required named arguments:
  --database DATABASE, -db DATABASE
                        the database in which to load annotations

OUTPUTS

If the flag -rm is not set, the script will create several files in the vcf directory. They will have names:

  • [vcf_path] (the unannotated snvs and indels from the unannotated_snps_indels table)
  • [vcf_basename].eff.vcf
  • [vcf_basename].eff.stats.html
  • [vcf_basename].eff.stats.genes.txt

If the -rm flag is set, these files are removed after the variant annotations have been loaded.

A log file is also produced:

  • [log_path]: the main log file for the annotations loader

MODIFIES

This script adds the annotations obtained for the variants in unannotated_snps_indels to the table snp_eff. If the --no_trunc flag is not set, the unannotated_snps_indels table is truncated after the annotations are successfully loaded to the effects table.

scripts.vardb_load_annotations.load_annotations(vcf_path, log_path, database, simulate=False, no_truncate=False, rm=False)

Loads SnpEff annotations to the database

Parameters:
  • vcf_path – location where unannotated variants downloaded from the database will be stored
  • log_path – location of log file
  • database – database to update
  • simulate – true if you want to simulate loading, without actually making changes to the database [False]
  • no_truncate – true if you do not want to truncate the table unannotated_snps_indels after effects loading [False]
  • rm – true if you want to remove the temporary vcf files after effects loading [False]
Modifies:

snp_eff; unannotated_snps_indels (truncated unless no_truncate is True)
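
A minimal usage sketch of load_annotations(); the paths and database name are placeholders:

from scripts.vardb_load_annotations import load_annotations

load_annotations(
    vcf_path="/tmp/unannotated.vcf",    # unannotated variants are downloaded here
    log_path="/tmp/annotations.log",
    database="vardb_test",
    simulate=True,                      # dry run: the transaction is not committed
)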

scripts.vardb_load_annotations.main()

scripts.vardb_load_dim_tables module

Updates dim tables with new data from file.

REQUIREMENTS

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for vardb the file is .gscaccount

INPUTS

usage: python vardb_load_dim_tables.py [-h] --database DATABASE
                                       [-l {debug,info,warning,error}]
                                       [-t TABLES [TABLES ...]]

optional arguments:
  -h, --help            show this help message and exit
  -l {debug,info,warning,error}, --log_level {debug,info,warning,error}
                        the level of logging
  -t TABLES [TABLES ...], --tables TABLES [TABLES ...]
                        A list of dim tables you want to load, if None, all
                        dim tables are loaded. Options are: clinvar, dbsnp,
                        cosmic, oasis, pog_comparator

required named arguments:
  --database DATABASE, -db DATABASE
                        the destination database

OUTPUTS

The output is logged to stdout.

MODIFIES

This script updates the dim tables below with new data in the data tables.

Tables:

  • clinvar, dbsnp, cosmic: These are all external annotation databases. They should be updated whenever a new version is downloaded.
  • oasis: These are the clinical tables that come from the Oasis databases. They should be updated after the oasis metadata wrangling scripts are run on a new oasis data dump.
  • pog_comparator: This is an Excel sheet managed by the bioinformaticians for POG. It is updated nightly.

scripts.vardb_load_dim_tables.load_dim_tables(database, tables, log_level='info')

(Re)loads dim tables specified by the tables argument

Parameters:
  • database – database to load to
  • tables – a list of “tables” (each key can correspond to a group of tables in the config file)
  • log_level
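
A minimal usage sketch, assuming a placeholder database name:

from scripts.vardb_load_dim_tables import load_dim_tables

# Reload only the external annotation dim tables after new releases are downloaded.
load_dim_tables("vardb_test", tables=["clinvar", "dbsnp", "cosmic"], log_level="info")
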
scripts.vardb_load_dim_tables.main()

scripts.vardb_load_variants module

Loads data to vardb databases on HAWQ

REQUIREMENTS

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for vardb the file is .gscaccount

INPUTS

usage: python vardb_load_variants.py [-h] -db DATABASE [-d LOG_DIRECTORY] [-s]
                                     [-l {debug,info,warning,error}] [-ol]
                                     [-k KWARGS [KWARGS ...]]
                                     load_file

positional arguments:
  load_file             file to load to gvdb

optional arguments:
  -h, --help            show this help message and exit
  -d LOG_DIRECTORY, --log_dir LOG_DIRECTORY
                        destination directory for log files [load file
                        directory]
  -s, --sim             true if you want to simulate loading, but not commit
                        the transaction [False]
  -l {debug,info,warning,error}, --log_level {debug,info,warning,error}
                        the level of logging
  -ol, --overwrite_log  true if you want to overwrite log file
  -k KWARGS [KWARGS ...], --kwargs KWARGS [KWARGS ...]
                        keyword arguments allow you to enter metadata that is
                        common to all files to be loaded

required named arguments:
  -db DATABASE, --database DATABASE
                        the database in which to load

OUTPUTS

The only file created by this API is a log file. The log file is unique for a particular load file; by default it has the same name as the load_file plus a .log extension. The file is not overwritten: if you run the command twice, new log entries are appended to the end. Stderr is also automatically logged to the file. Note that all of the classes and functions called within the API have their own loggers, which write to the same file so that an error or a bug can be tracked precisely.

If there is an error during processing, an empty [load_file].err file is created in the log directory. If this file exists when the program is started, it is first deleted. If the program fails before the load_file can be obtained from the parameter, a .failure file is generated in the directory from which the script is called.

The format of the log file is: time (Y-m-d H:M:S) | name of calling object (e.g. __hdfs__) : message
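
For illustration only, a logging configuration producing the format described above might look like the sketch below; the loader configures its own loggers internally, and the file name here is a placeholder:

import logging

formatter = logging.Formatter(
    fmt="%(asctime)s | %(name)s : %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
handler = logging.FileHandler("example_load_file.log")  # placeholder log path
handler.setFormatter(formatter)
logging.getLogger("__hdfs__").addHandler(handler)       # one logger per calling object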

MODIFIES

This script loads variant and expression data from the data files specified in the loader file to fact tables in the specified database. It also updates the sample and analysis tables with the metadata associated with each file, as obtained from bioapps, lims, and/or the load file. The script automatically detects whether the file, sample metadata, or analysis metadata has been loaded previously or has changed, and takes the appropriate action to

  • load the data for the first time
  • ignore data that has been already loaded, or
  • update the sample/analysis information
scripts.vardb_load_variants.load_variants(load_file, database, simulate=False, **kwargs)

Loads data and metadata specified in the load_file to vardb databases

Parameters:
  • load_file – tab-separated file with the data needed for loading to vardb
  • database – database to be loaded
  • simulate – true if you want to only simulate (but not actually load) data [False]
  • kwargs – additional keyword arguments
Returns:

0 - successful execution, -1 - unrecoverable errors, >0 - number of files not loaded

Modifies:

Adds variants to the variant table associated with the type of data file, and adds metadata to sample and analysis
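
A minimal usage sketch of load_variants(); the loader file and database name are placeholders:

from scripts.vardb_load_variants import load_variants

result = load_variants(
    "POG_small_somatic.loader",   # placeholder loader file
    "vardb_test",
    simulate=True,                # dry run: nothing is committed
)
# result: 0 = success, -1 = unrecoverable errors, >0 = number of files not loaded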

scripts.vardb_load_variants.main()

scripts.vardb_make_loader module

This is the pipeline responsible for retrieving and parsing metadata from internal projects to produce vardb .loader files.

REQUIREMENTS

  • gsc_sbs_loader group privileges on bioapps
  • permission to use the lims apis
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for bioapps it is .bioappsaccount
    • for lims it is .limsaccount

INPUTS

usage: python vardb_make_loader.pyc [-h] --output_directory OUTPUT_DIRECTORY
                                    --project PROJECT --query QUERY
                                    [--log_level {debug,info,warning,error}]
                                    [--last PREVIOUS_METADATA_FILE] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  --log_level {debug,info,warning,error}, -log {debug,info,warning,error}
                        the level of logging
  --last PREVIOUS_METADATA_FILE, -l PREVIOUS_METADATA_FILE
                        Path to previous metadata file; for finding only new
                        or changed data
  --debug, -d           true if you want to run in debug mode

required named arguments:
  --output_directory OUTPUT_DIRECTORY, -o OUTPUT_DIRECTORY
                        directory for saving output
  --project PROJECT, -p PROJECT
                        Project to query. Options are: POG, GPH
  --query QUERY, -q QUERY
                        Query type to perform. Options are: {'POG':
                        ['controlfreec', 'vcall', 'expression',
                        'small_somatic', 'somatic_cnv', 'somatic_loh'], 'GPH':
                        ['vcall', 'small_somatic', 'somatic_cnv',
                        'somatic_loh']}

OUTPUTS

Two files are created:

  • [project]_[pipeline]_[yyyy-mm-dd].tsv: a file with all data found matching the project and pipeline selected
  • [project]_[pipeline]_[yyyy-mm-dd].loader: a file with just the data that is new/changed from PREVIOUS_METADATA_FILE
    PREVIOUS_METADATA_FILE is the previous week’s .tsv file

Output is logged to stdout.

scripts.vardb_make_loader.main()
scripts.vardb_make_loader.make_loader(output_directory, project, query_type, previous_metadata_file=None, debug=False)

Creates loader files by collecting data from several sources

Parameters:
  • output_directory – Location of loader files to be written
  • projects – a list of projects to query (if None, all projects will be queried)
  • query_types – [optional] a list of query types to be run (if None, all queries will be run)
  • previous_metadata_file – [optional] path to the previous metadata file with which to compare current metadata
  • debug – [optional] True if you want to suppress errors for debugging purposes
Returns:

0 = success, -1 = failure
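
A minimal usage sketch; the output directory and previous metadata file are placeholders:

from scripts.vardb_make_loader import make_loader

status = make_loader(
    output_directory="/tmp/loaders",
    project="POG",
    query_type="small_somatic",
    previous_metadata_file="/tmp/loaders/previous_POG_small_somatic.tsv",  # last week's .tsv
)
# status: 0 = success, -1 = failure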

scripts.vardb_rollout_db module

Rolls out an “empty” database. The data tables are recreated from ddl.

REQUIREMENTS

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for vardb the file is .gscaccount

INPUTS

usage: python vardb_rollout_db.py [-h] --database DATABASE
                                  [-l {debug,info,warning,error}]

optional arguments:
  -h, --help            show this help message and exit
  -l {debug,info,warning,error}, --log_level {debug,info,warning,error}
                        the level of logging

required named arguments:
  --database DATABASE, -db DATABASE
                        the database to rollout

OUTPUTS

Logs to stdout

MODIFIES

EXTREME CAUTION should be used with this function, because it truncates all existing data. Before using it on the test databases, always check to make sure no one else is using them. To roll out the production database, users must type in “yes” to confirm.

scripts.vardb_rollout_db.main()
scripts.vardb_rollout_db.rollout(database, log_level='info')

Rolls out the database

Parameters:
  • database – database to roll out
  • log_level – how much information to log (default is ‘info’)
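
A minimal usage sketch; only ever point this at a test database, since rollout truncates all existing data. The database name below is a placeholder:

from scripts.vardb_rollout_db import rollout

rollout("vardb_test", log_level="debug")  # recreates the data tables from ddl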

scripts.vardb_update_db module

Updates the database with new variants. For each project and pipeline that is currently loaded to vardb, it:

  • Obtains updated data and metadata from BioApps/LIMS, etc.
  • Finds new/changed data to be loaded, and creates a loader file
  • Loads data to vardb
  • Loads new annotations
  • Aggregates tables
  • Updates database indices
  • Runs the germline CNV query

Creates a log file in the directory specified by the environment file vardb_load_weekly_variant_env.py.

REQUIREMENTS

  • pivotal group privileges on the vardb databases
  • pxf permissions on hdfs
  • gsc_sbs_loader group privileges on bioapps
  • permission to use the lims apis
  • you must have login files in your home directory; login files have a single line of text with username:password
    • for bioapps it is .bioappsaccount
    • for lims it is .limsaccount
    • for vardb the file is .gscaccount
scripts.vardb_update_db.analyze(loader_dict)

Runs analyze on full database.

scripts.vardb_update_db.get_most_recent_file(path, prefix)
scripts.vardb_update_db.load_annotations(loader_dict)

Loads annotations to the target database.

scripts.vardb_update_db.load_variants(loader_dict)

Loads each created loader file to the target database.

scripts.vardb_update_db.main(*args, **kwargs)
scripts.vardb_update_db.make_load_files()

Creates the load files for each type of project/query.

scripts.vardb_update_db.run_aggregation(loader_dict)

Runs snp aggregation.

scripts.vardb_update_db.run_germline_query(loader_dict)

Runs the germline query.

scripts.vardb_update_db.update_dim_tables()

Updates dim tables.
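
As a rough illustration, the helpers documented above could be chained in the order given in the module description; the structure of loader_dict is internal to the module, so the empty dict below is only a placeholder assumption:

from scripts import vardb_update_db

loader_dict = {}                                 # placeholder; built internally by the module
vardb_update_db.update_dim_tables()              # refresh dim tables
vardb_update_db.make_load_files()                # create loader files per project/query
vardb_update_db.load_variants(loader_dict)       # load each loader file to the target database
vardb_update_db.load_annotations(loader_dict)    # annotate new variants
vardb_update_db.run_aggregation(loader_dict)     # rebuild the aggregate tables
vardb_update_db.analyze(loader_dict)             # update database statistics
vardb_update_db.run_germline_query(loader_dict)  # run the germline CNV query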

scripts.vardb_update_db_env module

Environment file with parameters required for updating the vardb database.