MAF to Tile CSV Usage

The MAF import process produces a CSV from a list of MAF files, registers the appropriate items with MetaDB, produces the required configs for proper loading into GenomicsDB, and (optionally) uses these configs to load the CSV into a specified TileDB array. A user can choose to either:

  1. Perform a complete create and load process using maf2tile.py to produce a CSV, populate MetaDB, generate the required configs, and load GenomicsDB.
  2. Use maf2tile.py to generate the CSV, populate MetaDB, and generate the required configs, but load GenomicsDB using GenomicsDB commands (see the Loading Genomics DB section for details).

General Requirements for MAF Import

  1. Create a TileDB workspace if one doesn't already exist. Note that there can be multiple arrays in a given workspace. Assuming GenomicsDB/bin is in your path (export PATH=/path/to/GenomicsDB/bin:$PATH), run: `create_tiledb_workspace /path/to/desired/Workspace`. This command requires that the workspace does not already exist.
  2. Edit utils/example_configs/load_to_tile.cfg to reflect the correct paths, as well as optional preferred settings.
  3. Edit the tile loader JSON (as identified in load_to_tile.cfg) to specify the workspace and array name, as well as any desired optional settings. See the GenomicsDB documentation for more information on these fields.
  4. Specify a project config file as seen in the examples, e.g. store/utils/example_configs/icgc_config.json. More information on the fields in the project_config.json file is available here.
  5. Make sure the workspace and array names are consistent across all config files.
  6. If you don't want to append to an existing CSV, make sure the file specified by -o does not exist or is empty.
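
Requirement 6 can be checked before a run with a few lines of Python. This is only a minimal sketch: the helper name `output_is_fresh` is hypothetical, and it assumes the file named by -o is written inside the directory given by -d, which the wiki does not state explicitly.

```python
import os


def output_is_fresh(output_dir, output_name):
    """Return True if the target CSV does not exist or is an empty file.

    Assumption (not confirmed by the wiki): the file named by -o is
    created inside the directory given by -d.
    """
    path = os.path.join(output_dir, output_name)
    return not os.path.exists(path) or os.path.getsize(path) == 0
```

If the check returns False, a new run would most likely append to the existing CSV, so either remove the file or pick a different -o name first.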

Option 1: Complete Loading with MAF2Tile Script

  • Inside the virtual environment, cd utils

  • Run python maf2tile.py -h to see the usage help.

  • Run the maf2tile script from inside the utils directory of the store repo, as follows:

    python maf2tile.py \
    -c <path to project config file> \
    -o <desired name of output file> \
    -d <desired location to write output> \
    -i <relative path to single or list of MAF files to be imported> \
    -z <mafs are gzipped> \
    -l <loader config to load data into tiledb> 
    

    If you prefer to use Spark, specify -s:

    python maf2tile.py \
    -c <path to project config file> \
    -o <desired name of output file> \
    -d <desired location to write output> \
    -i <relative path to single or list of MAF files to be imported> \
    -l <loader config to load data into tiledb> \
    -s <spark-master-uri>
    

The Spark version assumes the MAFs are not gzipped.
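
For scripted runs, the two invocations above can be assembled programmatically. The following is a rough sketch rather than part of the repository: `build_maf2tile_command` is a hypothetical helper, and it assumes that -i accepts several space-separated paths and that -z is a switch without a value. Neither assumption is confirmed by the wiki, so check `-h` before relying on it.

```python
import shlex


def build_maf2tile_command(config, output, outdir, mafs,
                           loader=None, gzipped=False, spark_master=None):
    """Build the argument list for maf2tile.py from the flags shown above."""
    if gzipped and spark_master:
        # The Spark version assumes the MAFs are not gzipped.
        raise ValueError("-z and -s cannot be combined")
    cmd = ["python", "maf2tile.py",
           "-c", config, "-o", output, "-d", outdir, "-i"]
    cmd.extend(mafs)      # assumption: -i takes space-separated paths
    if gzipped:
        cmd.append("-z")  # assumption: -z is a value-less switch
    if loader:
        cmd += ["-l", loader]  # omit the loader to skip loading (Option 2)
    if spark_master:
        cmd += ["-s", spark_master]
    return cmd


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    cmd = build_maf2tile_command("icgc_config.json", "out.csv", "/tmp/out",
                                 ["a.maf", "b.maf"], loader="load_to_tile.cfg")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to `subprocess.run(cmd, cwd="utils", check=True)` from inside the virtual environment. Leaving out `loader` gives the command used for Option 2.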

The utility scripts used in the import process are described in detail here.

Note: If you need to construct a TileDB CSV from files in differing formats, specify a config for each format and perform separate runs of maf2tile.py; see here for more information.

Option 2: Loading using Genomics DB commands

Follow the instructions under Option 1 above, but run the script without the -l option, then follow the instructions under Loading Genomics DB.