VCF to TileDB Import Usage - GenomicsDB/GenomicsSampleAPIs GitHub Wiki

The VCF import process sorts, indexes, and compresses input VCFs, registers the appropriate items with MetaDB, produces the required configs for proper loading into TileDB, and (optionally) uses these configs to load the CSV into a specified TileDB array. A user can choose to either:

  1. Perform a complete create-and-load process using vcf2tile.py to populate MetaDB, generate the required configs, and load GenomicsDB.
  2. Use vcf2tile.py to populate MetaDB and generate the required configs, but load GenomicsDB using GenomicsDB commands (see the Loading Genomics DB section for details).

General Requirements for VCF Import

  1. Create a TileDB workspace if you don't already have one. A single workspace can hold multiple arrays. Assuming GenomicsDB/bin is in your path (export PATH=/path/to/GenomicsDB/bin:$PATH), run: create_tiledb_workspace /path/to/desired/Workspace. The target path must not already exist when you run this command.
  2. Edit utils/example_configs/load_to_tile.cfg to reflect the correct paths, as well as optional preferred settings.
  3. Edit the tile loader JSON (as identified in load_to_tile.cfg) to specify the workspace and array name, as well as any desired optional settings. See the GenomicsDB documentation for more information on these fields.
  4. Specify a VCF config file, following the example at store/utils/example_configs/vcf_import.config. More information here.
  5. Make sure the workspace and array names are consistent across all config files.
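
Step 1 can be scripted so that the workspace is created only when it is missing. The following is a minimal sketch and not part of the repository: `ensure_workspace` is a hypothetical helper name, and it assumes `create_tiledb_workspace` is on your PATH and takes the workspace path as its only argument, as in the command shown in step 1.

```python
import os
import subprocess


def ensure_workspace(path, runner=subprocess.check_call):
    """Create a TileDB workspace at `path` unless the path already exists.

    Hypothetical helper. Returns True if the create command was run,
    False if the path was already present and nothing was done.
    """
    if os.path.exists(path):
        # The workspace path must not exist before creation, so skip it.
        return False
    # Assumed invocation: the tool name followed by the workspace path.
    runner(["create_tiledb_workspace", path])
    return True
```

Calling `ensure_workspace("/path/to/desired/Workspace")` before an import run makes the step safe to repeat. The `runner` parameter exists only so the command can be swapped out when testing.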

Option 1: Complete Loading with VCF2Tile Script

  • Inside the virtual environment, cd utils
  • Run python vcf2tile.py -h to see the usage help.
  • Run the vcf2tile script from inside the utils directory of the store repo, as follows:
python vcf2tile.py \
-c <path to project config file> \
-d <desired location to write output> \
-i <relative path to single or list of VCF files to be imported> \
-l <loader config to load data into tiledb (`example_configs/load_to_tile.cfg`)> 
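
When the import is driven from another script, the invocation above can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_vcf2tile_command` and `run_vcf2tile` are hypothetical helper names, only the flags shown above (-c, -d, -i, -l) are used, and a single value is passed to -i because the exact syntax for a list of VCF files is not specified here.

```python
import subprocess


def build_vcf2tile_command(config, output_dir, vcf_input, loader_config=None):
    """Return the argument list for vcf2tile.py (hypothetical helper).

    Flags mirror the usage above: -c, -d, -i, and the optional -l.
    """
    cmd = ["python", "vcf2tile.py",
           "-c", config,
           "-d", output_dir,
           "-i", vcf_input]
    if loader_config is not None:
        # With -l, the script also loads the data into TileDB.
        cmd += ["-l", loader_config]
    return cmd


def run_vcf2tile(utils_dir, **kwargs):
    """Run the script with utils_dir as the working directory."""
    subprocess.check_call(build_vcf2tile_command(**kwargs), cwd=utils_dir)
```

Leaving `loader_config` unset produces the same command without -l, so the script only populates MetaDB and generates the configs.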

Option 2: Loading using Genomics DB commands

Follow the instructions under Option 1 above, but run the script without the -l option, then follow the instructions in the Loading Genomics DB section.