MAF to Tile CSV Design

Common Translation Layer

The Common Translation Layer repository of tools can be found at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils. The tools in this repository are described in the following sub-sections.

loader.py

Loader utility for GenomicsDB. It uses a loader config to generate the loader configuration file required by GenomicsDB. When -l is specified, the loader takes the callset and vid maps and automates the GenomicsDB loading process.
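
The sketch below illustrates that flow with a simplified stand-in, assuming the script fills a loader template with the callset and vid map paths and then optionally invokes the GenomicsDB loader executable. The function name and arguments are illustrative, not loader.py's actual interface.

```python
import json
import subprocess

def build_loader_json(template_path, callset_map, vid_map, out_path, loader_exe=None):
    """Simplified stand-in for loader.py; the real script's interface differs."""
    with open(template_path) as fp:
        config = json.load(fp)

    # Point the loader config at the maps produced by the MAF importer.
    config["callset_mapping_file"] = callset_map
    config["vid_mapping_file"] = vid_map

    with open(out_path, "w") as fp:
        json.dump(config, fp, indent=2)

    # Rough equivalent of passing -l: hand the generated JSON to the loader executable.
    if loader_exe is not None:
        subprocess.check_call([loader_exe, out_path])
```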

example_configs/load_to_tile.cfg

Loader configuration file that lets a user specify MPI (if desired) as well as pointers to the GenomicsDB loader executable, etc.

example_configs/tile_loader.json

An example loader config with some basic settings for loading GenomicsDB. More details can be found on the GenomicsDB wiki.
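
For orientation, a minimal loader config of this kind might look like the sketch below. The key names follow the GenomicsDB loader JSON, but the values are placeholders; consult the GenomicsDB wiki for the authoritative field list.

```python
import json

# Minimal sketch of a GenomicsDB loader config; keys follow the GenomicsDB
# loader JSON, but all values below are placeholders.
loader_config = {
    "row_based_partitioning": False,
    "column_partitions": [
        {"begin": 0, "workspace": "/data/tiledb_ws", "array": "maf_array"}
    ],
    "callset_mapping_file": "/data/callset_map.json",
    "vid_mapping_file": "/data/vid_map.json",
    "treat_deletions_as_intervals": True,
    "delete_and_create_tiledb_array": True,
    "size_per_column_partition": 16384,
}

with open("tile_loader.json", "w") as fp:
    json.dump(loader_config, fp, indent=2)
```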

example_configs/tiledb_config.json

Config that defines the valid assemblies. This is a placeholder for future configuration information.

csvline.py

The CSVLine class provides methods to populate the fields expected for a Tile DB entry and generates a CSV line. It also validates the entries before generating the line. The expected usage is for a higher-level program that understands the input format to populate the CSVLine structure and retrieve a CSV line compatible with the GenomicsDB vcf2tiledb loader. Usage note: populate the ALT field first, since it determines the size of the PL and AD fields.
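
A hypothetical usage sketch is shown below; the method names (set_position, set_alt, set_field, get_csv_line) are illustrative stand-ins rather than the actual CSVLine API, but the call order reflects the note above: ALT is set before PL and AD.

```python
from csvline import CSVLine  # assumes utils/ is on PYTHONPATH

line = CSVLine()

# Illustrative method names; the real CSVLine API may differ.
line.set_position(row=0, column=123456)   # TileDB row/column coordinates
line.set_alt(["T"])                       # set ALT first: it sizes PL and AD
line.set_field("REF", "C")
line.set_field("AD", [10, 4])
line.set_field("PL", [35, 0, 40])

csv_text = line.get_csv_line()            # validated, vcf2tiledb-compatible line
```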

file2tile.py

The File2Tile class provides the key data structures and functionality required to build a conversion script. It uses the ConfigReader described below.

configuration.py

The ConfigReader class takes the master configuration file that details the minimum required mapping to build a CSV file. Example configuration files can be found in the repository. The fields of the configuration file are described in the table below, and an illustrative sketch follows the table:

| Field | Mandatory | Description |
| --- | --- | --- |
| DB_URI | Yes | MetaDB instance to make a connection to, as defined in alembic.ini: driver://user:pass@localhost/dbname |
| TileDBConfig | Yes | Used by tools\tiledb\translate to know which assemblies are supported |
| TileDBSchema | Yes | Used by tools\tiledb\translate. This is the assembly that will be used for Tile DB |
| TileDBSchema/workspace | Yes | Used by import.py for creating and loading into the TileDB schema |
| TileDBSchema/array | Yes | Used by import.py for creating and loading into the TileDB schema |
| TileDBSchema/fields_list | Yes | Fields that correspond to the columns produced by the MAF importer |
| TileDBSchema/ftypes_list | Yes | Types that correspond to the fields in fields_list |
| HeaderStartsWith | Yes | Required to identify which line is the header |
| VariantSetMap | Yes | Dictionary that defines the Variant Set name |
| VariantSetMap/Dynamic | No | Boolean field; true implies that the Variant Set is dynamic and is a field in the input file |
| VariantSetMap/VariantSet | Yes | String that is the Variant Set, or the header name if Dynamic is true |
| VariantSetMap/VariantLookup | No | Boolean field; true implies that the Variant Set is dynamically looked up based on the VariantSet field in the input file |
| VariantSetMap/VariantConfig | No | Input JSON file that provides a map of translation from the input data to Variant Set names. This is a custom field for ICGC translation |
| VariantSetMap/LookupIdx | No | Index of the field in the Variant Config dictionary's values that should be used as the Variant Set name |
| IndividualId | No | Map of the header name that points to the IndividualId. If IndividualId is not specified, it defaults to `Individual_<SourceSample>`. The default addresses the fact that some MAF files do not have an individual identifier column, and it assumes the dataset contains only one source sample per individual. If this is not the case, IndividualId must be set |
| SourceSampleId | Yes | Map of the header name that points to the SourceSampleId. If IndividualId is not specified, this is used as the unique identifier for an Individual |
| TargetSampleId | Yes | Map of the header name that points to the TargetSampleId. The TargetSample diff against the SourceSample is what is considered a CallSet for an Individual |
| CallSetId | Yes | Dictionary that defines the Call Set |
| CallSetId/Dynamic | No | Boolean field; true implies that the Call Set is dynamic and is a field in the input file |
| CallSetId/CallSetName | Yes | String that is the CallSetName, or the header name if Dynamic is true |
| Position | Yes | Dictionary that defines the Position |
| Position/assembly | Yes | Dictionary that defines the assembly |
| Position/assembly/Dynamic | No | Boolean field; true implies that the assembly is dynamic and is a field in the input file |
| Position/assembly/assemblyName | Yes | String that is the assembly name, or the header name if Dynamic is true |
| Position/chromosome | Yes | Header name that maps to the chromosome |
| Position/Location | Yes | Header name that maps to the start position |
| Position/End | Yes | Header name that maps to the end position |
| TileDBMapping | Yes | Dictionary of mappings between the Tile DB CSV names and header names. The number of items can be dynamic |
| Separators | Yes | Dictionary of separators that the script uses to parse input data. Separators/line is the only mandatory field; the others are script-specific |
| Separators/line | Yes | Defines the separator string that Python's split() operates on to split lines from the input file |
| GTMapping | Yes | Defines the mapping for translating symbols in the genotype into genotype values. This mapping can be empty |
| Constants | Yes | Defines the constants that are used by the script. The mandatory constant is "pliody" |
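
To make the mapping concrete, the following sketch writes a master configuration containing the mandatory fields from the table. The header names (Chromosome, Start_position, Tumor_Sample_Barcode, and so on) and all paths and values are illustrative placeholders, not values taken from a shipped configuration.

```python
import json

# Illustrative master configuration covering the mandatory fields above;
# header names, paths, and values are placeholders only.
master_config = {
    "DB_URI": "postgresql://user:pass@localhost/metadb",
    "TileDBConfig": "example_configs/tiledb_config.json",
    "TileDBSchema": {
        "workspace": "/data/tiledb_ws",
        "array": "maf_array",
        "fields_list": ["REF", "ALT", "QUAL", "GT"],
        "ftypes_list": ["char", "char", "float", "int"],
    },
    "HeaderStartsWith": "Hugo_Symbol",
    "VariantSetMap": {"Dynamic": False, "VariantSet": "Example_MAF_VariantSet"},
    "SourceSampleId": "Matched_Norm_Sample_Barcode",
    "TargetSampleId": "Tumor_Sample_Barcode",
    "CallSetId": {"Dynamic": True, "CallSetName": "Tumor_Sample_Barcode"},
    "Position": {
        "assembly": {"Dynamic": False, "assemblyName": "hg19"},
        "chromosome": "Chromosome",
        "Location": "Start_position",
        "End": "End_position",
    },
    "TileDBMapping": {"REF": "Reference_Allele", "ALT": "Tumor_Seq_Allele2"},
    "Separators": {"line": "\t"},
    "GTMapping": {},
    "Constants": {"pliody": 2},
}

with open("maf_master_config.json", "w") as fp:
    json.dump(master_config, fp, indent=2)
```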

hg19.json

Defines the HG19 assembly with the length of each chromosome, the specific order in which the chromosomes of the given assembly should be placed along the TileDB horizontal dimension, and the offset factor that defines the padding between chromosomes. This information is used by translate.py to compute the column numbers for Tile DB.

This file defines the Reference and ReferenceSet in MetaDB.
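
As an illustration of how such an assembly file can drive the column computation, the sketch below lays chromosomes end to end with padding and converts a (chromosome, position) pair into a TileDB column number. The key names and the padding rule are assumptions, not the actual hg19.json schema or translate.py logic.

```python
# Illustrative only: key names and padding rule are assumptions, not the
# actual hg19.json schema or translate.py implementation.
assembly = {
    "assembly": "hg19",
    "offset_factor": 1.1,          # leaves ~10% padding after each chromosome
    "chromosomes": [               # order fixes placement along the column axis
        {"name": "1", "length": 249250621},
        {"name": "2", "length": 243199373},
        # ... remaining chromosomes ...
    ],
}

def chromosome_offsets(asm):
    """Map chromosome name -> starting TileDB column."""
    offsets, column = {}, 0
    for chrom in asm["chromosomes"]:
        offsets[chrom["name"]] = column
        column += int(chrom["length"] * asm["offset_factor"])
    return offsets

def to_tiledb_column(asm, chromosome, position):
    """Convert a 1-based position on a chromosome to a global column number."""
    return chromosome_offsets(asm)[chromosome] + position - 1

print(to_tiledb_column(assembly, "2", 100))   # falls in chromosome 2's column range
```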

Input Translation Layer

The input translation layer is the custom script layer that understands the nuances of the input data set and passes them on to the common translation layer to generate the CSV file. The ICGC data set is used as the example input to show how an input translation layer is scripted. You can find the code at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils/.

maf_importer.py

Conversion script. See details in the Usage Notes below.
A few things to keep in mind:

  • Connects to a database instance to register new entities and assign TileDB rows.
  • Include all the files that need to be loaded into Tile DB in a single conversion instead of doing a piece-wise conversion, because Tile DB expects a consistent set of sample IDs in the input CSV file.
  • Produces the CallSet map and VID map files required for GenomicsDB loading (sketched below).
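
For reference, the two map files follow GenomicsDB's callset and vid mapping formats. The sketch below writes a tiny example of each; the sample names, row indices, contig offsets, and field definitions are chosen purely for illustration, and the GenomicsDB wiki remains the authoritative source for these schemas.

```python
import json

# Tiny illustrative examples of the two maps the importer produces; the
# overall shape follows GenomicsDB's callset/vid mapping JSON, but every
# value below is a placeholder.
callset_map = {
    "callsets": {
        "Individual_SAMPLE01_Tumor": {"row_idx": 0, "idx_in_file": 0,
                                      "filename": "converted.csv"},
        "Individual_SAMPLE02_Tumor": {"row_idx": 1, "idx_in_file": 1,
                                      "filename": "converted.csv"},
    }
}

vid_map = {
    "contigs": {
        "1": {"length": 249250621, "tiledb_column_offset": 0},
        "2": {"length": 243199373, "tiledb_column_offset": 274175683},
    },
    "fields": {
        "PL": {"vcf_field_class": ["FORMAT"], "type": "int", "length": "G"},
        "DP": {"vcf_field_class": ["INFO", "FORMAT"], "type": "int"},
    },
}

with open("callset_map.json", "w") as fp:
    json.dump(callset_map, fp, indent=2)
with open("vid_map.json", "w") as fp:
    json.dump(vid_map, fp, indent=2)
```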

maf_pyspark.py

Conversion script with the same functionality as maf_importer.py, but it uses the Spark map-reduce hooks. Running it on a distributed Spark cluster reduces the conversion run times by orders of magnitude. The options are the same as for maf_importer.py through import.py, except that -t is not supported.

Input Configuration

The input configuration expects at least a master configuration file, as described in the table above. NOTE: ICGC data also requires a variants config (JSON) file that describes the mapping of ICGC fields to variant names; an illustrative sketch follows.
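
The sketch below shows one plausible shape for such a variants config, consistent with the VariantSetMap/VariantConfig and LookupIdx fields described earlier: a dictionary whose values are lists, with LookupIdx selecting which element becomes the Variant Set name. The keys and values are invented for illustration, not taken from the actual ICGC configuration.

```python
import json

# Invented example: each key is a value seen in the input file's VariantSet
# field, each value is a list, and LookupIdx picks which element of that
# list becomes the Variant Set name.
variants_config = {
    "SSM": ["simple_somatic_mutation", "open"],
    "CNSM": ["copy_number_somatic_mutation", "controlled"],
}

lookup_idx = 0  # would come from VariantSetMap/LookupIdx in the master config
variant_set_name = variants_config["SSM"][lookup_idx]
print(variant_set_name)  # -> "simple_somatic_mutation"

with open("icgc_variants_config.json", "w") as fp:
    json.dump(variants_config, fp, indent=2)
```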

Usage

Syntax

usage: maf2tile.py [-h] -c CONFIG -d OUTPUTDIR -i INPUTS [INPUTS ...] [-z]
                   [-s SPARK] [-o OUTPUT] [-a APPEND_CALLSETS] [-l LOADER]

Convert MAF format to Tile DB CSV

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        input configuration file for MAF conversion
  -d OUTPUTDIR, --outputdir OUTPUTDIR
                        Output directory where the outputs need to be stored
  -i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
                        List of input MAF files to convert
  -z, --gzipped         True/False indicating if the input file is a gzipped
                        file or not
  -s SPARK, --spark SPARK
                        Run as spark. Where SPARK is the spark-master URI
  -o OUTPUT, --output OUTPUT
                        output Tile DB CSV file (without the path) which will
                        be stored in the output directory. Required for spark.
  -a APPEND_CALLSETS, --append_callsets APPEND_CALLSETS
                        CallSet mapping file to append.
  -l LOADER, --loader LOADER
                        Loader JSON to load data into Tile DB.

See the usage section for examples.