VCF to TileDB Import Design - GenomicsDB/GenomicsSampleAPIs GitHub Wiki
Functionality
The VCF import process performs:
1. Registration of MetaDB related items from a list of VCF files.
2. Optional sorting, compression, and indexing of input VCF files using bcftools.
3. Construction of required configs for GenomicsDB import.
4. Optional loading of TileDB array.
Config File
See utils/example_configs/vcf_import.config for an example.
| Field | Mandatory | Description | 
|---|---|---|
| workspace | Yes | Full path to TileDB workspace where the array will exist. | 
| array | Yes | Name of array to import VCFs into. | 
| assembly | Yes | Name of assembly. This can be an existing assembly in MetaDB or new assembly to be registered from VCF contig tags. | 
| dburi | Yes | MetaDB Instance to make a connection to, as defined in alembic.ini: driver://user:pass@localhost/dbname | 
| vcf_type | No | Defaults to TN (tumor/normal). The default is overridden when the VCF contains fewer than 2 samples. Needed to keep track of tumor and normal samples in MetaDB. | 
| sort_compress | No | Defaults to True. Will sort and compress files with bcftools. | 
| index | No | Defaults to True. Will index files with bcftools. | 
| source_idx | No | Defaults to 0. Used to specify the ordering of normal sample in the VCF file. | 
| target_idx | No | Defaults to 1. Used to specify the ordering of tumor sample in the VCF file. | 
| sample_name | No | A map of how to read sample information. Defaults to empty {}. | 
| sample_name / derive_sample | No | Defaults to `header`. `header`: read the samples from the header; `tag`: read the samples from the sample tag; `file`: read the samples from the file name. Set `split_by` and `split_index` appropriately for `tag` and `file`, or see defaults below. | 
| sample_name / split_by | No | Defaults to None. If `tag`, this names the field in the sample tag to take the sample name from. If `file`, the file name is split by this string. | 
| sample_name / split_index | No | Defaults to 0. Used when splitting the file name (`file`). | 
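As a rough illustration of the table above, the sketch below checks the mandatory fields and fills in the documented defaults. It assumes the config is a JSON file and that the `sample_name / ...` rows are keys nested under `sample_name`; the function name `load_import_config` and the example values are hypothetical and not taken from the importer. See utils/example_configs/vcf_import.config for the authoritative example.

```python
import json

MANDATORY = ("workspace", "array", "assembly", "dburi")

# Defaults as listed in the table above.
DEFAULTS = {
    "vcf_type": "TN",
    "sort_compress": True,
    "index": True,
    "source_idx": 0,
    "target_idx": 1,
    "sample_name": {},
}
SAMPLE_NAME_DEFAULTS = {"derive_sample": "header", "split_by": None, "split_index": 0}


def load_import_config(text):
    """Parse a JSON config string, check mandatory fields, apply defaults."""
    raw = json.loads(text)
    missing = [key for key in MANDATORY if key not in raw]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    config = dict(DEFAULTS)
    config.update(raw)
    # Fill in the nested sample_name defaults without discarding user values.
    sample_name = dict(SAMPLE_NAME_DEFAULTS)
    sample_name.update(config["sample_name"])
    config["sample_name"] = sample_name
    return config


# Hypothetical example values, for illustration only.
example = json.dumps({
    "workspace": "/path/to/workspace",
    "array": "example_array",
    "assembly": "b37",
    "dburi": "driver://user:pass@localhost/dbname",
    "sample_name": {"derive_sample": "tag", "split_by": "SampleName"},
})
config = load_import_config(example)
print(config["sort_compress"], config["sample_name"])
```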
GenomicsDB vcf2tiledb Executable
Unlike the MAF importer, the VCF import process does not require an intermediate CSV before passing data to the GenomicsDB loading process. The GenomicsDB VCF import requires that the VCFs are sorted, block compressed, and indexed (addressed by step 2 above). It also requires three configuration files for the GenomicsDB vcf2tiledb import binary, much like the MAF/CSV import process: i) callset_mapping, ii) vid_mapping, and iii) loader config (addressed by steps 3 and 4 above).
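The sort, compress, and index preparation could be sketched as follows. This only builds the command lines and does not run them. The helper name `bcftools_commands` and the file names are hypothetical, and the exact bcftools options the importer uses are not stated here; the sketch assumes `bcftools sort` with bgzip-compressed output (`-O z`) and a tabix index (`bcftools index -t`).

```python
import shlex


def bcftools_commands(vcf_path, out_path):
    """Return the commands to sort and bgzip one VCF, then index the result."""
    sort_cmd = ["bcftools", "sort", "-O", "z", "-o", out_path, vcf_path]
    index_cmd = ["bcftools", "index", "-t", out_path]
    return [sort_cmd, index_cmd]


# Hypothetical file names, for illustration only.
commands = bcftools_commands("sample1.vcf", "sample1.sorted.vcf.gz")
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where bcftools is installed.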
Understanding the VCF Format
The main use case for the VCF import process is importing VCFs produced by a somatic variant calling pipeline. These VCFs contain two sample columns, one for the NORMAL sample and one for the TUMOR sample. Each column is represented as a CallSet in the variant store, so each VCF has two CallSets associated with it. The VCF importer reads the config to determine how to retrieve sample information from the VCF. This is required because a sample name can be labeled in a VCF in two ways: i) sample in header and ii) sample tag identifiers, both described below. See the Config File section for the fields related to sample information handling. Note that tumor and normal samples are inferred based on the presence of normal, or target and primary in the sample name.
The secondary use case for the VCF import process is to import single-sample VCF files or composite VCF files (no sample). The type of VCF file is inferred from the number of samples available.
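A minimal sketch of choosing the VCF type from the sample count is shown below. Only `TN` comes from the config table; the labels `single` and `composite` and the function name `infer_vcf_type` are hypothetical placeholders, not the importer's own values.

```python
def infer_vcf_type(sample_names, configured="TN"):
    """Pick a VCF type from the number of sample columns.

    With two or more samples the configured type (default TN) is kept;
    with fewer than two it is overridden.
    """
    count = len(sample_names)
    if count >= 2:
        return configured
    if count == 1:
        return "single"     # hypothetical label for a single-sample VCF
    return "composite"      # hypothetical label for a VCF with no sample


print(infer_vcf_type(["sample1N", "sample1T"]))
print(infer_vcf_type(["sample1"]))
print(infer_vcf_type([]))
```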
Sample in Header
In the simplest case, the sample information is available in the header line, i.e. sample1N and sample1T below.
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    sample1N    sample1T
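Reading the sample names from such a header line could look like the sketch below. It assumes a tab-delimited `#CHROM` line, as in a standard VCF, with the nine fixed columns through FORMAT coming first. The function name `samples_from_header` is hypothetical; `source_idx` and `target_idx` are the config fields described above.

```python
FIXED_COLUMNS = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT


def samples_from_header(header_line):
    """Return the sample names that follow the FORMAT column."""
    if not header_line.startswith("#CHROM"):
        raise ValueError("not a #CHROM header line")
    columns = header_line.rstrip("\r\n").split("\t")
    return columns[FIXED_COLUMNS:]


header = "\t".join([
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
    "sample1N", "sample1T",
])
samples = samples_from_header(header)

# Defaults from the config table: normal at index 0, tumor at index 1.
source_idx, target_idx = 0, 1
normal, tumor = samples[source_idx], samples[target_idx]
print(normal, tumor)  # sample1N sample1T
```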
Sample Tag Identifiers
Often, the sample information in the header is more generic and cannot be used to uniquely identify the sample, i.e. NORMAL and TUMOR below.
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NORMAL    TUMOR
If this is the case, sample tag(s) must be provided in the header for each sample, and the sample_name field of the config file should specify how to retrieve this information. For example, the tags below would require "split_by": "SampleName".
##SAMPLE=<ID=NORMAL,Description="Wild type",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1N>
##SAMPLE=<ID=TUMOR,Description="Mutant",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1T>
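One way to map the generic column names to real sample names is sketched below. This is an illustration under assumptions, not the importer's code: the function names are hypothetical, and the regular expressions only handle simple `key=value` pairs with optional double-quoted values.

```python
import re

SAMPLE_LINE = re.compile(r"^##SAMPLE=<(?P<body>.*)>\s*$")
# A key followed by either a double-quoted value or a run without commas.
KEY_VALUE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_sample_tag(line):
    """Return the key/value pairs of one ##SAMPLE line, or None if no match."""
    match = SAMPLE_LINE.match(line)
    if match is None:
        return None
    return {key: value.strip('"') for key, value in KEY_VALUE.findall(match.group("body"))}


def sample_names_from_tags(header_lines, split_by):
    """Map each sample tag ID to the value of the field named by split_by."""
    names = {}
    for line in header_lines:
        fields = parse_sample_tag(line)
        if fields and "ID" in fields and split_by in fields:
            names[fields["ID"]] = fields[split_by]
    return names


header_lines = [
    '##SAMPLE=<ID=NORMAL,Description="Wild type",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1N>',
    '##SAMPLE=<ID=TUMOR,Description="Mutant",Platform=ILLUMINA,Protocol=WGS,SampleName=sample1T>',
]
names = sample_names_from_tags(header_lines, "SampleName")
print(names)  # {'NORMAL': 'sample1N', 'TUMOR': 'sample1T'}
```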
ReferenceSet and References
ReferenceSet and Reference registration will be derived from the contig tags of the first VCF file. This section (truncated) is required for proper VCF import:
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
...
##contig=<ID=MT,assembly=b37,length=16569>
##contig=<ID=X,assembly=b37,length=155270560>
##contig=<ID=Y,assembly=b37,length=59373566>
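Extracting reference information from these contig tags might be sketched as below. The function name `parse_contigs` is hypothetical, and the simple comma split assumes that no contig attribute contains a quoted comma.

```python
import re

CONTIG_LINE = re.compile(r"^##contig=<(.*)>\s*$")


def parse_contigs(header_lines):
    """Return (name, assembly, length) for every ##contig header line."""
    references = []
    for line in header_lines:
        match = CONTIG_LINE.match(line)
        if match is None:
            continue
        fields = dict(item.split("=", 1) for item in match.group(1).split(","))
        references.append((fields["ID"], fields.get("assembly"), int(fields["length"])))
    return references


header_lines = [
    "##fileformat=VCFv4.1",
    "##contig=<ID=1,assembly=b37,length=249250621>",
    "##contig=<ID=MT,assembly=b37,length=16569>",
]
references = parse_contigs(header_lines)
print(references)  # [('1', 'b37', 249250621), ('MT', 'b37', 16569)]
```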
Usage
See the usage section for examples.
python vcf2tile.py --help
usage: vcf2tile.py [-h] -c CONFIG -d OUTPUTDIR -i INPUTS [INPUTS ...]
                   [-a APPEND_CALLSETS] [-l LOADER]
Register VCF with MetaDB and create TileDB JSON.
optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        input configuration file for VCF import
  -d OUTPUTDIR, --outputdir OUTPUTDIR
                        Output directory where the outputs need to be stored
  -i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
                        VCF files to be imported.
  -a APPEND_CALLSETS, --append_callsets APPEND_CALLSETS
                        CallSet mapping file to append.
  -l LOADER, --loader LOADER
                        Loader JSON to load data into Tile DB.
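For readers who want to script against this interface, the help text above can be mirrored with `argparse` as in the sketch below. This is a reconstruction from the help output only, not the source of vcf2tile.py, and the file names in the example invocation are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="vcf2tile.py",
        description="Register VCF with MetaDB and create TileDB JSON.",
    )
    parser.add_argument("-c", "--config", required=True,
                        help="input configuration file for VCF import")
    parser.add_argument("-d", "--outputdir", required=True,
                        help="Output directory where the outputs need to be stored")
    parser.add_argument("-i", "--inputs", nargs="+", required=True,
                        help="VCF files to be imported.")
    parser.add_argument("-a", "--append_callsets",
                        help="CallSet mapping file to append.")
    parser.add_argument("-l", "--loader",
                        help="Loader JSON to load data into Tile DB.")
    return parser


# Hypothetical invocation with two input VCFs and no optional arguments.
args = build_parser().parse_args(
    ["-c", "vcf_import.config", "-d", "out", "-i", "a.vcf.gz", "b.vcf.gz"]
)
print(args.config, args.outputdir, args.inputs)
```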