MAF to Tile CSV Design

Common Translation Layer

The Common Translation Layer repository of tools can be found at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils. The tools in this repository are described in the following sub-sections.

loader.py

Loader utility for GenomicsDB. It uses a loader config to generate the loader configuration file required by GenomicsDB. When -l is specified, the loader takes the callset and vid maps and automates the GenomicsDB loading process.
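
The sketch below illustrates that flow with a simplified stand-in, assuming the script fills a loader template with the callset and vid map paths and then optionally invokes the GenomicsDB loader executable. The function name and arguments are illustrative, not loader.py's actual interface.

```python
import json
import subprocess

def build_loader_json(template_path, callset_map, vid_map, out_path, loader_exe=None):
    """Simplified stand-in for loader.py; the real script's interface differs."""
    with open(template_path) as fp:
        config = json.load(fp)

    # Point the loader config at the maps produced by the MAF importer.
    config["callset_mapping_file"] = callset_map
    config["vid_mapping_file"] = vid_map

    with open(out_path, "w") as fp:
        json.dump(config, fp, indent=2)

    # Rough equivalent of passing -l: hand the generated JSON to the loader executable.
    if loader_exe is not None:
        subprocess.check_call([loader_exe, out_path])
```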

example_configs/load_to_tile.cfg

Loader configuration file that lets a user specify MPI (if desired) as well as pointers to the GenomicsDB loader executable, etc.

example_configs/tile_loader.json

An example loader config with some basic settings for loading GenomicsDB. More details can be found on the GenomicsDB wiki.
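
For orientation, a minimal loader config of this kind might look like the sketch below. The key names follow the GenomicsDB loader JSON, but the values are placeholders; consult the GenomicsDB wiki for the authoritative field list.

```python
import json

# Minimal sketch of a GenomicsDB loader config; keys follow the GenomicsDB
# loader JSON, but all values below are placeholders.
loader_config = {
    "row_based_partitioning": False,
    "column_partitions": [
        {"begin": 0, "workspace": "/data/tiledb_ws", "array": "maf_array"}
    ],
    "callset_mapping_file": "/data/callset_map.json",
    "vid_mapping_file": "/data/vid_map.json",
    "treat_deletions_as_intervals": True,
    "delete_and_create_tiledb_array": True,
    "size_per_column_partition": 16384,
}

with open("tile_loader.json", "w") as fp:
    json.dump(loader_config, fp, indent=2)
```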

example_configs/tiledb_config.json

Config that defines the valid assemblies. This is a placeholder for future configuration information.

csvline.py

The CSVLine class provides methods to populate the fields expected for a Tile DB entry and generates a CSV line. It also validates the entries before generating the line. The expected usage is for a higher-level program that understands the input format to populate the CSVLine structure and retrieve a CSV line compatible with the GenomicsDB vcf2tiledb loader. Usage note: populate the ALT field first, since it determines the size of the PL and AD fields.
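
A hypothetical usage sketch is shown below; the method names (set_position, set_alt, set_field, get_csv_line) are illustrative stand-ins rather than the actual CSVLine API, but the call order reflects the note above: ALT is set before PL and AD.

```python
from csvline import CSVLine  # assumes utils/ is on PYTHONPATH

line = CSVLine()

# Illustrative method names; the real CSVLine API may differ.
line.set_position(row=0, column=123456)   # TileDB row/column coordinates
line.set_alt(["T"])                       # set ALT first: it sizes PL and AD
line.set_field("REF", "C")
line.set_field("AD", [10, 4])
line.set_field("PL", [35, 0, 40])

csv_text = line.get_csv_line()            # validated, vcf2tiledb-compatible line
```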

file2tile.py

The File2Tile class provides the key data structures and functionality required to build a conversion script. It uses the ConfigReader described below.

configuration.py

The ConfigReader class takes the master configuration file that details the minimum required mapping to build a CSV file. Example configuration files can be found in the repository. The fields of the configuration file are described in the table below, and an illustrative sketch follows the table:

| Field | Mandatory | Description |
| --- | --- | --- |
| DB_URI | Yes | MetaDB instance to make a connection to, as defined in alembic.ini: driver://user:pass@localhost/dbname |
| TileDBConfig | Yes | Used by tools\tiledb\translate to know which assemblies are supported |
| TileDBSchema | Yes | Used by tools\tiledb\translate. This is the assembly that will be used for Tile DB |
| TileDBSchema/workspace | Yes | Used by import.py for creating and loading into the TileDB schema |
| TileDBSchema/array | Yes | Used by import.py for creating and loading into the TileDB schema |
| TileDBSchema/fields_list | Yes | Fields that correspond to the columns produced by the MAF importer |
| TileDBSchema/ftypes_list | Yes | Types that correspond to the fields in fields_list |
| HeaderStartsWith | Yes | Required to identify which line is the header |
| VariantSetMap | Yes | Dictionary that defines the Variant Set name |
| VariantSetMap/Dynamic | No | Boolean field; true implies that the Variant Set is dynamic and is a field in the input file |
| VariantSetMap/VariantSet | Yes | String that is the Variant Set, or the header name if Dynamic is true |
| VariantSetMap/VariantLookup | No | Boolean field; true implies that the Variant Set is dynamically looked up based on the VariantSet field in the input file |
| VariantSetMap/VariantConfig | No | Input JSON file that provides a map of translation from the input data to Variant Set names. This is a custom field for ICGC translation |
| VariantSetMap/LookupIdx | No | Index of the field in the Variant Config dictionary's values that should be used as the Variant Set name |
| IndividualId | No | Map of the header name that points to the IndividualId. If IndividualId is not specified, it defaults to `Individual_<SourceSample>`. The default addresses the fact that some MAF files do not have an individual identifier column, and it assumes the dataset contains only one source sample per individual. If this is not the case, IndividualId must be set |
| SourceSampleId | Yes | Map of the header name that points to the SourceSampleId. If IndividualId is not specified, this is used as the unique identifier for an Individual |
| TargetSampleId | Yes | Map of the header name that points to the TargetSampleId. The TargetSample diff against the SourceSample is what is considered a CallSet for an Individual |
| CallSetId | Yes | Dictionary that defines the Call Set |
| CallSetId/Dynamic | No | Boolean field; true implies that the Call Set is dynamic and is a field in the input file |
| CallSetId/CallSetName | Yes | String that is the CallSetName, or the header name if Dynamic is true |
| Position | Yes | Dictionary that defines the Position |
| Position/assembly | Yes | Dictionary that defines the assembly |
| Position/assembly/Dynamic | No | Boolean field; true implies that the assembly is dynamic and is a field in the input file |
| Position/assembly/assemblyName | Yes | String that is the assembly name, or the header name if Dynamic is true |
| Position/chromosome | Yes | Header name that maps to the chromosome |
| Position/Location | Yes | Header name that maps to the start position |
| Position/End | Yes | Header name that maps to the end position |
| TileDBMapping | Yes | Dictionary of mappings between the Tile DB CSV names and header names. The number of items can be dynamic |
| Separators | Yes | Dictionary of separators that the script uses to parse input data. Separators/line is the only mandatory field; the others are script-specific |
| Separators/line | Yes | Defines the separator string that Python's split() operates on to split lines from the input file |
| GTMapping | Yes | Defines the mapping for translating symbols in the genotype into genotype values. This mapping can be empty |
| Constants | Yes | Defines the constants that are used by the script. The mandatory constant is "pliody" |
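
To make the mapping concrete, the following sketch writes a master configuration containing the mandatory fields from the table. The header names (Chromosome, Start_position, Tumor_Sample_Barcode, and so on) and all paths and values are illustrative placeholders, not values taken from a shipped configuration.

```python
import json

# Illustrative master configuration covering the mandatory fields above;
# header names, paths, and values are placeholders only.
master_config = {
    "DB_URI": "postgresql://user:pass@localhost/metadb",
    "TileDBConfig": "example_configs/tiledb_config.json",
    "TileDBSchema": {
        "workspace": "/data/tiledb_ws",
        "array": "maf_array",
        "fields_list": ["REF", "ALT", "QUAL", "GT"],
        "ftypes_list": ["char", "char", "float", "int"],
    },
    "HeaderStartsWith": "Hugo_Symbol",
    "VariantSetMap": {"Dynamic": False, "VariantSet": "Example_MAF_VariantSet"},
    "SourceSampleId": "Matched_Norm_Sample_Barcode",
    "TargetSampleId": "Tumor_Sample_Barcode",
    "CallSetId": {"Dynamic": True, "CallSetName": "Tumor_Sample_Barcode"},
    "Position": {
        "assembly": {"Dynamic": False, "assemblyName": "hg19"},
        "chromosome": "Chromosome",
        "Location": "Start_position",
        "End": "End_position",
    },
    "TileDBMapping": {"REF": "Reference_Allele", "ALT": "Tumor_Seq_Allele2"},
    "Separators": {"line": "\t"},
    "GTMapping": {},
    "Constants": {"pliody": 2},
}

with open("maf_master_config.json", "w") as fp:
    json.dump(master_config, fp, indent=2)
```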

hg19.json

Defines the HG19 assembly with the length of each chromosome, the specific order in which the chromosomes of the given assembly should be placed along the TileDB horizontal dimension, and the offset factor that defines the padding between chromosomes. This information is used by translate.py to compute the column numbers for Tile DB.

This file defines the Reference and ReferenceSet in MetaDB.
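
As an illustration of how such an assembly file can drive the column computation, the sketch below lays chromosomes end to end with padding and converts a (chromosome, position) pair into a TileDB column number. The key names and the padding rule are assumptions, not the actual hg19.json schema or translate.py logic.

```python
# Illustrative only: key names and padding rule are assumptions, not the
# actual hg19.json schema or translate.py implementation.
assembly = {
    "assembly": "hg19",
    "offset_factor": 1.1,          # leaves ~10% padding after each chromosome
    "chromosomes": [               # order fixes placement along the column axis
        {"name": "1", "length": 249250621},
        {"name": "2", "length": 243199373},
        # ... remaining chromosomes ...
    ],
}

def chromosome_offsets(asm):
    """Map chromosome name -> starting TileDB column."""
    offsets, column = {}, 0
    for chrom in asm["chromosomes"]:
        offsets[chrom["name"]] = column
        column += int(chrom["length"] * asm["offset_factor"])
    return offsets

def to_tiledb_column(asm, chromosome, position):
    """Convert a 1-based position on a chromosome to a global column number."""
    return chromosome_offsets(asm)[chromosome] + position - 1

print(to_tiledb_column(assembly, "2", 100))   # falls in chromosome 2's column range
```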

Input Translation Layer

The input translation layer is the custom script layer that understands the nuances of the input data set and passes them on to the common translation layer to generate the CSV file. The ICGC data set is used as the example input to show how an input translation layer is scripted. You can find the code at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils/.

maf_importer.py

Conversion script. See details in the Usage Notes below.
A few things to keep in mind:

  • Connects to a database instance to register new entities and assign TileDB rows.
  • Include all the files that need to be loaded into Tile DB in a single conversion instead of doing a piece-wise conversion, because Tile DB expects a consistent set of sample IDs in the input CSV file.
  • Produces the CallSet map and VID map files required for GenomicsDB loading (sketched below).
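
For reference, the two map files follow GenomicsDB's callset and vid mapping formats. The sketch below writes a tiny example of each; the sample names, row indices, contig offsets, and field definitions are chosen purely for illustration, and the GenomicsDB wiki remains the authoritative source for these schemas.

```python
import json

# Tiny illustrative examples of the two maps the importer produces; the
# overall shape follows GenomicsDB's callset/vid mapping JSON, but every
# value below is a placeholder.
callset_map = {
    "callsets": {
        "Individual_SAMPLE01_Tumor": {"row_idx": 0, "idx_in_file": 0,
                                      "filename": "converted.csv"},
        "Individual_SAMPLE02_Tumor": {"row_idx": 1, "idx_in_file": 1,
                                      "filename": "converted.csv"},
    }
}

vid_map = {
    "contigs": {
        "1": {"length": 249250621, "tiledb_column_offset": 0},
        "2": {"length": 243199373, "tiledb_column_offset": 274175683},
    },
    "fields": {
        "PL": {"vcf_field_class": ["FORMAT"], "type": "int", "length": "G"},
        "DP": {"vcf_field_class": ["INFO", "FORMAT"], "type": "int"},
    },
}

with open("callset_map.json", "w") as fp:
    json.dump(callset_map, fp, indent=2)
with open("vid_map.json", "w") as fp:
    json.dump(vid_map, fp, indent=2)
```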

maf_pyspark.py

Conversion script with the same functionality as maf_importer.py, but it uses the Spark map-reduce hooks. Running it on a distributed Spark cluster reduces the conversion run times by orders of magnitude. The options are the same as for maf_importer.py through import.py, except that -t is not supported.

Input Configuration

The input configuration expects at least a master configuration file, as described in the table above. NOTE: ICGC data also requires a variants config (JSON) file that describes the mapping of ICGC fields to variant names; an illustrative sketch follows.
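
The sketch below shows one plausible shape for such a variants config, consistent with the VariantSetMap/VariantConfig and LookupIdx fields described earlier: a dictionary whose values are lists, with LookupIdx selecting which element becomes the Variant Set name. The keys and values are invented for illustration, not taken from the actual ICGC configuration.

```python
import json

# Invented example: each key is a value seen in the input file's VariantSet
# field, each value is a list, and LookupIdx picks which element of that
# list becomes the Variant Set name.
variants_config = {
    "SSM": ["simple_somatic_mutation", "open"],
    "CNSM": ["copy_number_somatic_mutation", "controlled"],
}

lookup_idx = 0  # would come from VariantSetMap/LookupIdx in the master config
variant_set_name = variants_config["SSM"][lookup_idx]
print(variant_set_name)  # -> "simple_somatic_mutation"

with open("icgc_variants_config.json", "w") as fp:
    json.dump(variants_config, fp, indent=2)
```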

Usage

Syntax

usage: maf2tile.py [-h] -c CONFIG -d OUTPUTDIR -i INPUTS [INPUTS ...] [-z]
                   [-s SPARK] [-o OUTPUT] [-a APPEND_CALLSETS] [-l LOADER]

Convert MAF format to Tile DB CSV

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        input configuration file for MAF conversion
  -d OUTPUTDIR, --outputdir OUTPUTDIR
                        Output directory where the outputs need to be stored
  -i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
                        List of input MAF files to convert
  -z, --gzipped         True/False indicating if the input file is a gzipped
                        file or not
  -s SPARK, --spark SPARK
                        Run as spark. Where SPARK is the spark-master URI
  -o OUTPUT, --output OUTPUT
                        output Tile DB CSV file (without the path) which will
                        be stored in the output directory. Required for spark.
  -a APPEND_CALLSETS, --append_callsets APPEND_CALLSETS
                        CallSet mapping file to append.
  -l LOADER, --loader LOADER
                        Loader JSON to load data into Tile DB.

See the usage section for examples.