MAF to Tile CSV Usage - GenomicsDB/GenomicsSampleAPIs GitHub Wiki
The MAF import process produces a CSV from a list of MAF files, registers the appropriate items with MetaDB, produces the required configs for proper loading into GenomicsDB, and (optionally) uses these configs to load the CSV into a specified TileDB array. A user can choose to either:
- Perform a complete create-and-load process using `maf2tile.py` to produce a CSV, populate MetaDB, generate the required configs, and load GenomicsDB.
- Use `maf2tile.py` to generate the CSV, populate MetaDB, and generate the required configs, but load GenomicsDB using GenomicsDB commands. See the Loading Genomics DB section for details.
General Requirements for MAF Import
- Create a TileDB Workspace if one doesn't already exist. Note that there can be multiple arrays in a given Workspace. Assuming GenomicsDB/bin is in your path (`export PATH=/path/to/GenomicsDB/bin:$PATH`), run `create_tiledb_workspace /path/to/desired/Workspace`. The path passed to this command must not already exist.
- Edit `utils/example_configs/load_to_tile.cfg` to reflect the correct paths, as well as any optional preferred settings.
- Edit the tile loader JSON (as identified in `load_to_tile.cfg`) to specify the workspace and array name, as well as any desired optional settings. See the GenomicsDB documentation for more information on these fields.
- Specify a project config file as seen in the examples, e.g. `store/utils/example_configs/icgc_config.json`. More information on fields in the project_config.json file here.
- Make sure the workspace and array names are consistent across all config files.
- If you don't want to append to an existing CSV, make sure the file specified by `-o` does not exist or is empty.
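The last requirement can be checked before a run. The following is a minimal sketch, not part of the repository: the helper name `output_is_fresh` and the file names are hypothetical, and it only tests whether the path that would be passed to `-o` is missing or empty.

```python
import os
import tempfile


def output_is_fresh(path):
    """Return True if `path` does not exist or is an empty file.

    Hypothetical helper: a False result means maf2tile.py would
    append to existing CSV content at this path.
    """
    return not os.path.exists(path) or os.path.getsize(path) == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder output name; substitute the value you pass to -o.
        candidate = os.path.join(tmp, "maf_import.csv")
        print("missing file is fresh:", output_is_fresh(candidate))

        # Simulate a leftover CSV from an earlier run.
        with open(candidate, "w") as handle:
            handle.write("previous,row\n")
        print("non-empty file is fresh:", output_is_fresh(candidate))
```

If the check fails and appending is not intended, remove or truncate the file before running the import.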
Option 1: Complete Loading with MAF2Tile Script
- Inside the virtual environment, `cd utils`.
- Run `python maf2tile.py -h` to get the usage help.
- Run the maf2tile script from inside the utils directory of the store repo, as follows:
python maf2tile.py \
    -c <path to project config file> \
    -o <desired name of output file> \
    -d <desired location to write output> \
    -i <relative path to single or list of MAF files to be imported> \
    -z <mafs are gzipped> \
    -l <loader config to load data into tiledb>
If you prefer to use Spark, specify `-s`:
python maf2tile.py \
    -c <path to project config file> \
    -o <desired name of output file> \
    -d <desired location to write output> \
    -i <relative path to single or list of MAF files to be imported> \
    -l <loader config to load data into tiledb> \
    -s <spark-master-uri>
The Spark version assumes the MAFs are not gzipped.
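For scripted runs, the two invocations above can be assembled programmatically. This is a sketch under stated assumptions rather than a documented interface: the function `build_maf2tile_cmd` is hypothetical, `-z` is assumed to be a switch that takes no value, and `-i` is assumed to accept several paths separated by spaces. Check `python maf2tile.py -h` before relying on either assumption.

```python
import shlex


def build_maf2tile_cmd(config, output, out_dir, inputs,
                       gzipped=False, loader=None, spark_master=None):
    """Build an argument list for maf2tile.py (hypothetical helper).

    Assumptions: -z is a boolean switch and -i takes one or more paths.
    """
    if gzipped and spark_master:
        # The Spark version assumes the MAFs are not gzipped.
        raise ValueError("cannot combine gzipped MAFs with Spark")

    cmd = ["python", "maf2tile.py",
           "-c", config,
           "-o", output,
           "-d", out_dir,
           "-i", *inputs]
    if gzipped:
        cmd.append("-z")
    if loader:
        # Leaving the loader out skips the load step (no -l).
        cmd += ["-l", loader]
    if spark_master:
        cmd += ["-s", spark_master]
    return cmd


if __name__ == "__main__":
    # All paths below are placeholders for illustration only.
    cmd = build_maf2tile_cmd(
        config="example_configs/icgc_config.json",
        output="maf_import.csv",
        out_dir="/path/to/output",
        inputs=["mafs/sample1.maf", "mafs/sample2.maf"],
        loader="example_configs/load_to_tile.cfg",
    )
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, cwd="utils", check=True)` from the root of the store repo, which matches the instruction to run the script from inside the utils directory.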
The utility scripts used in the import process are described in detail here.
Note: If you need to construct a TileDB CSV from MAFs with differing file formats, specify a config for each format and perform separate runs of `maf2tile.py`. See here for more information.
Option 2: Loading using Genomics DB commands
Follow the instructions under Option 1 above, but run the script without the `-l` option, then follow the instructions under Loading Genomics DB.