MAF to Tile CSV Design - GenomicsDB/GenomicsSampleAPIs GitHub Wiki
Common Translation Layer
The Common Translation Layer's repository of tools can be found at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils. The tools in this repository are described in the following sub-sections.
loader.py
Loader util for GenomicsDB. Uses a loader config to generate the required loader configuration file for GenomicsDB. When -l is specified, this loader takes the callset and vid maps to automate the GenomicsDB loading process.
example_configs/load_to_tile.cfg
Loader configuration file allowing a user to specify mpi (if desired) as well as pointers to GenomicsDB loader executable, etc.
example_configs/tile_loader.json
An example loader config with some basic settings for loading GenomicsDB. More details can be found on the GenomicsDB wiki.
example_configs/tiledb_config.json
Config defines the valid assemblies. This is a placeholder for future configuration information.
csvline.py
class CSVLine
provides the methods to populate the fields expected for a TileDB entry and generates a CSV line. It also validates the entries before generating the line. The expected usage is that a higher-level program that understands the input format calls this class, populates the CSVLine structure, and gets back a CSV line that is compatible with the GenomicsDB vcf2tiledb loader.
Usage Note: Populate the ALT field first since it determines the size of PL and AD fields.
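The dependency between ALT and the PL/AD sizes follows from standard VCF conventions: with ploidy p and n ALT alleles there is one PL value per possible genotype and one AD value per allele (REF plus each ALT). A minimal sketch of that arithmetic (the function names are illustrative, not part of the CSVLine API):

```python
from math import comb

def num_pl_values(num_alt, ploidy=2):
    """Number of PL entries: one per genotype, C(alleles + ploidy - 1, ploidy)."""
    alleles = num_alt + 1  # REF plus each ALT
    return comb(alleles + ploidy - 1, ploidy)

def num_ad_values(num_alt):
    """Number of AD entries: one depth per allele (REF plus each ALT)."""
    return num_alt + 1

# A diploid site with two ALT alleles has 6 possible genotypes, hence 6 PL
# values, and 3 alleles, hence 3 AD values:
print(num_pl_values(2))  # 6
print(num_ad_values(2))  # 3
```

This is why ALT must be populated first: until the number of ALT alleles is known, the PL and AD arrays cannot be sized.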
file2tile.py
class File2Tile
provides the key data structure and functionality required to build a conversion script. Uses the ConfigReader described below.
configuration.py
class ConfigReader
takes the Master configuration file that details the minimum required mapping to build a CSV file. These configuration files can be found here. The fields of the configuration file are described below:
Field | Mandatory | Description |
---|---|---|
DB_URI | Yes | MetaDB Instance to make a connection to, as defined in alembic.ini: driver://user:pass@localhost/dbname |
TileDBConfig | Yes | Used by tools/tiledb/translate to know which assemblies are supported |
TileDBSchema | Yes | Used by tools/tiledb/translate. This is the assembly that will be used for Tile DB |
TileDBSchema/workspace | Yes | Used by import.py for creating and loading into TileDB schema |
TileDBSchema/array | Yes | Used by import.py for creating and loading into TileDB schema |
TileDBSchema/fields_list | Yes | Fields that correspond to the columns produced from the MAF importer |
TileDBSchema/ftypes_list | Yes | Types that correspond to the fields in fields_list |
HeaderStartsWith | Yes | Required to identify which line is the header |
VariantSetMap | Yes | Dictionary that defines the Variant Name |
VariantSetMap/Dynamic | No | Is a Boolean field where true implies that the Variant Set is dynamic and is a field in the input file. |
VariantSetMap/VariantSet | Yes | String that is the Variant Set or the name in the header if Dynamic is true |
VariantSetMap/VariantLookup | No | Is a Boolean field where true implies that the Variant Set is dynamically looked up based on the VariantSet field in the input file. |
VariantSetMap/VariantConfig | No | Input JSON file that provides a map of translation from the input data to Variant Set Names. This is a custom field for ICGC translation. |
VariantSetMap/LookupIdx | No | Index of the field in the Variant Config dictionary's values that should be used as Variant Set Name |
IndividualId | No | Map of the header name that points to the IndividualId. If IndividualId is not specified, the IndividualId will default to Individual_<SourceSample> . The default addresses the fact that some MAF files do not have an Individual identifier column, and it also assumes that the dataset contains only one Source sample per Individual. If this is not the case, the IndividualId must be set. |
SourceSampleId | Yes | Map of the header name that points to the SourceSampleId. If IndividualId is not specified, this is used as the unique identifier for an Individual. |
TargetSampleId | Yes | Map of the header name that points to the TargetSampleId. The TargetSample diff against the Source Sample is what is considered a CallSet for an Individual. |
CallSetId | Yes | Dictionary that defines the Call Set |
CallSetId/Dynamic | No | Is a Boolean field where true implies that the Call Set is dynamic and is a field in the input file. |
CallSetId/CallSetName | Yes | String that is the CallSetName or the header name if Dynamic is true |
Position | Yes | Dictionary that defines Position |
Position/assembly | Yes | Dictionary that defines the assembly |
Position/assembly/Dynamic | No | Is a Boolean field where true implies that the assembly is dynamic and is a field in the input file. |
Position/assembly/assemblyName | Yes | String that is the assemblyName or the header name if Dynamic is true |
Position/chromosome | Yes | Header name that maps to the chromosome |
Position/Location | Yes | Header name that maps to the Start Position |
Position/End | Yes | Header name that maps to the End Position |
TileDBMapping | Yes | Dictionary of mapping between the Tile DB CSV names and header names. The number of items can be dynamic. |
Separators | Yes | Dictionary of separators that the script can use to parse input data. Separators/line is the only mandatory field; the others are script-specific. |
Separators/line | Yes | Defines the separator string that the python/split function operates on to split the lines from input file |
GTMapping | Yes | Defines the mapping for translating symbols in the genotype into genotype values. This mapping can be empty |
Constants | Yes | Defines the constants that are used by the script. The mandatory constant is "ploidy" |
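Pulling the mandatory fields together, a skeletal master configuration might look like the following. All values here are illustrative placeholders (the MAF column names such as Tumor_Sample_Barcode are standard MAF headers, but the paths, set names, and field lists are invented for this sketch):

```json
{
  "DB_URI": "postgresql://user:pass@localhost/metadb",
  "TileDBConfig": "example_configs/tiledb_config.json",
  "TileDBSchema": {
    "workspace": "/data/tiledb_ws",
    "array": "maf_array",
    "fields_list": ["REF", "ALT", "QUAL"],
    "ftypes_list": ["char", "char", "float"]
  },
  "HeaderStartsWith": "Hugo_Symbol",
  "VariantSetMap": {"Dynamic": false, "VariantSet": "my_variant_set"},
  "SourceSampleId": "Matched_Norm_Sample_Barcode",
  "TargetSampleId": "Tumor_Sample_Barcode",
  "CallSetId": {"Dynamic": true, "CallSetName": "Tumor_Sample_Barcode"},
  "Position": {
    "assembly": {"Dynamic": false, "assemblyName": "hg19"},
    "chromosome": "Chromosome",
    "Location": "Start_Position",
    "End": "End_Position"
  },
  "TileDBMapping": {"REF": "Reference_Allele", "ALT": "Tumor_Seq_Allele2"},
  "Separators": {"line": "\t"},
  "GTMapping": {},
  "Constants": {"ploidy": 2}
}
```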
hg19.json
Defines the HG19 Assembly with the length of each chromosome, the specific order in which the chromosomes of the given assembly should be placed along the TileDB horizontal dimension, and the offset factor that defines the padding between chromosomes. translate.py uses this information to compute the column numbers for TileDB.
This file defines the Reference and ReferenceSet in MetaDB.
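The column computation this enables is an offset-plus-position scheme: each chromosome is assigned a starting offset along the horizontal dimension, and a genomic position maps to offset plus position. A minimal sketch, with hypothetical offsets (the real values are derived from hg19.json and its padding factor):

```python
# Illustrative flattened offsets: each chromosome starts where the previous
# (padded) chromosome ends along the TileDB horizontal dimension. These two
# values are placeholders, not taken from hg19.json.
CHROM_OFFSETS = {"1": 0, "2": 249250621}

def tiledb_column(chromosome, position):
    """Map a (chromosome, position) pair to a global TileDB column number."""
    return CHROM_OFFSETS[chromosome] + position

print(tiledb_column("2", 100))  # 249250721
```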
Input Translation layer
The input translation layer is the custom script layer that understands the nuances of the input data set and passes them on to the common translation layer to generate the CSV file. The ICGC data set is used as an example of how an input translation layer is scripted. You can find the code at https://github.com/Intel-HLS/GenomicsSampleAPIs/tree/master/utils/.
### maf_importer.py
Conversion Script. See details in the Usage Notes below.
A few things to keep in mind:
- Connects to a database instance to register new entities and assign TileDB rows.
- Include all the files that need to be loaded into Tile DB in a single conversion instead of doing piece-wise conversion, because Tile DB expects a consistent set of sample IDs in the input CSV file.
- Produces the CallSet Map and VID map file required for GenomicsDB loading.
### maf_pyspark.py
Conversion script with the same functionality as maf_importer.py, but using the Spark map-reduce hooks. Running it on a distributed Spark cluster reduces conversion run times by orders of magnitude. The options are the same as for maf_importer.py through import.py, except that -t is not supported.
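The Spark variant follows the usual pattern of distributing a pure per-line conversion function across the cluster. A rough sketch of that pattern (the conversion logic and field indices are placeholders, not the real maf_pyspark.py code):

```python
def maf_line_to_csv(line, sep="\t"):
    """Placeholder per-line conversion: split a MAF line on the configured
    separator and re-join a subset of fields as a CSV record. The real
    script populates a full CSVLine; here we just take the first three
    columns to show the shape of the map step."""
    fields = line.rstrip("\n").split(sep)
    return ",".join(fields[:3])

# With a SparkContext `sc`, the same function distributes cleanly because it
# carries no shared state (illustrative paths; requires a running cluster):
#   sc.textFile("input.maf") \
#     .map(maf_line_to_csv) \
#     .saveAsTextFile("output_csv")

print(maf_line_to_csv("BRAF\t7\t140453136\textra"))  # BRAF,7,140453136
```

Keeping the per-line conversion free of shared state is what makes the serial and Spark versions able to share options and behavior.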
Input Configuration
The input configuration expects at least a master configuration file, as described in the table above. NOTE: ICGC data additionally requires a variants config (JSON) file that describes the mapping of ICGC fields to variant names.
Usage
Syntax
```
usage: maf2tile.py [-h] -c CONFIG -d OUTPUTDIR -i INPUTS [INPUTS ...] [-z]
                   [-s SPARK] [-o OUTPUT] [-a APPEND_CALLSETS] [-l LOADER]

Convert MAF format to Tile DB CSV

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        input configuration file for MAF conversion
  -d OUTPUTDIR, --outputdir OUTPUTDIR
                        Output directory where the outputs need to be stored
  -i INPUTS [INPUTS ...], --inputs INPUTS [INPUTS ...]
                        List of input MAF files to convert
  -z, --gzipped         True/False indicating if the input file is a gzipped
                        file or not
  -s SPARK, --spark SPARK
                        Run as spark. Where SPARK is the spark-master URI
  -o OUTPUT, --output OUTPUT
                        output Tile DB CSV file (without the path) which will
                        be stored in the output directory. Required for spark.
  -a APPEND_CALLSETS, --append_callsets APPEND_CALLSETS
                        CallSet mapping file to append.
  -l LOADER, --loader LOADER
                        Loader JSON to load data into Tile DB.
```
See the usage section for examples.