Use case description - GooglingTheCancerGenome/ga-integr8 GitHub Wiki
Goal: rank genomic structural variants (SVs) associated with cancer by integrating functional annotations.
Input-Steps-Output
- Description of the annotation process for a first selection of two datasets (genes and TADs) that can be used to determine the best setup (database type, optimization, ...) for annotation of large sets of SVs.
Input
Old dataset (127 SVs):
For initial testing, a smaller test set can be used:
Dataset: TP.txt, TN.txt
Description: list of 127 SVs
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5307971/, SVs (multiple types) classified as 'pathogenic' from patients with ID.
Format: tab-delimited
Example:
2 5 149016758 119485964 "-/+"
Per-column description:
Column 1, 'Chromosome 1': first chromosome on which the SV is located (lower name always comes first)
Column 2, 'Chromosome 2': second chromosome on which the SV is located (rows are sorted by Chromosome 1 first, then by Chromosome 2)
Column 3, 'Position 1': breakpoint position on chromosome 1 (this dataset assumes the same start and end position)
Column 4, 'Position 2': breakpoint position on chromosome 2
Column 5, 'Orientation': orientation of the SV. Currently not used.
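As an illustration of the format above, the following is a minimal sketch of parsing one line of the 5-column file. The names `SimpleSV` and `parse_simple_sv` are hypothetical and not part of the repository; stripping the double quotes around the orientation is an assumption based on the example row.

```python
from typing import NamedTuple


class SimpleSV(NamedTuple):
    """One row of the 5-column test set (hypothetical container)."""
    chrom1: str
    chrom2: str
    pos1: int
    pos2: int
    orientation: str


def parse_simple_sv(line: str) -> SimpleSV:
    """Parse one tab-delimited line: chrom1, chrom2, pos1, pos2, orientation."""
    chrom1, chrom2, pos1, pos2, orientation = line.rstrip("\n").split("\t")
    # Assumption: the orientation is wrapped in double quotes, as in the example.
    return SimpleSV(chrom1, chrom2, int(pos1), int(pos2), orientation.strip('"'))


sv = parse_simple_sv('2\t5\t149016758\t119485964\t"-/+"')
print(sv)
```

Chromosome names are kept as strings so that X and Y can be handled alongside numbered chromosomes.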
New dataset (70000 SVs):
Pathogenic SVs:
Dataset: https://github.com/GooglingTheCancerGenome/ga-integr8/blob/master/data/TPTNSet/dataset_Mix_2015-10-27.txt
Description: list of approximately 70000 pathogenic SVs of varying SV types, found in multiple patients with varying cancer types.
Source: In-house
Generated by: https://github.com/GooglingTheCancerGenome/ga-integr8/blob/master/data/TPTNSet/parse_Mix.ipynb
Format: tab-delimited
Example:
1 2380382 2380382 - 19 49682633 49682633 - Baca_Cell_2013 P01-28 inter_chr prostate cancer
Per-column description:
Column 1, 'chr1': first chromosome on which the SV is located
Column 2, 's1': breakpoint start position on the first chromosome
Column 3, 'e1': breakpoint end position on the first chromosome
Column 4, 'o1': orientation of the breakpoint-junction on the first chromosome (e.g. head or tail)
Column 5, 'chr2': second chromosome on which the breakpoint-junction is located
Column 6, 's2': breakpoint start position on the second chromosome
Column 7, 'e2': breakpoint end position on the second chromosome
Column 8, 'o2': orientation of the breakpoint-junction on the second chromosome
Column 9, 'source': publication in which the SV was reported
Column 10, 'sample_name': identifier of the sample in which the SV was detected
Column 11, 'sv_type': type of the SV as described in the source paper
Column 12, 'cancer_type': cancer type in which the SV was identified, as described in the source paper
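The example row above contains twelve tab-separated fields. The sketch below reads such a row into a dictionary, assuming that the eighth field is the orientation on the second chromosome (named `o2` here by analogy with `o1`). `FIELDS` and `parse_sv_line` are hypothetical names used for illustration only.

```python
# Assumed field order, derived from the example row; 'o2' is an assumed name.
FIELDS = [
    "chr1", "s1", "e1", "o1",
    "chr2", "s2", "e2", "o2",
    "source", "sample_name", "sv_type", "cancer_type",
]


def parse_sv_line(line: str) -> dict:
    """Parse one tab-delimited SV row into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # Breakpoint coordinates are integers; everything else stays a string.
    for key in ("s1", "e1", "s2", "e2"):
        record[key] = int(record[key])
    return record


example = "\t".join([
    "1", "2380382", "2380382", "-",
    "19", "49682633", "49682633", "-",
    "Baca_Cell_2013", "P01-28", "inter_chr", "prostate cancer",
])
record = parse_sv_line(example)
print(record["chr1"], record["chr2"], record["cancer_type"])
```

Splitting on tabs only (rather than on any whitespace) keeps multi-word values such as the cancer type intact.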
Benign SVs:
Dataset: https://github.com/GooglingTheCancerGenome/ga-integr8/blob/master/data/TPTNSet/dataset_1000G_2016-01-26.txt
Description: list of approximately 70000 benign SVs of varying SV types, obtained from the 1000 Genomes Project
Source: In-house
Generated by: https://github.com/GooglingTheCancerGenome/ga-integr8/blob/master/data/TPTNSet/parse_1000G_CNVs_from_sv_map.ipynb
Format: tab-delimited
Per-column description: same as for the pathogenic SV dataset
Annotation steps required to obtain a matrix with features (on the set of 127 SVs)
- Process SV input file (either of the files above should work):
  - Sort by chromosome 1, then by chromosome 2. Numbered chromosomes come before X and Y. Coordinates should be ascending, and the start position should come before the end position.
- Annotate the SVs with gene-based features
  - Read dataset MP_Genelist_HGNC_v2.txt
  - Search for all genes within 2 Mb (before and after) of each SV (using the left-most start and right-most end)
  - Obtain the identifiers, pLI and RVIS scores, and HPO terms of these genes
- Annotate the SVs with TAD-based features
  - Read tads.txt (see datasets page for details)
  - Determine which TADs each SV overlaps (an overlap means sharing at least 1 bp)
- Generate the output (see Output)
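The sorting, gene-window, and TAD-overlap steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation. It assumes closed 1-based intervals and in-memory lists of dictionaries with hypothetical keys (`id`, `chrom`, `start`, `end`), since the layouts of MP_Genelist_HGNC_v2.txt and tads.txt are not described here. It also handles a single interval on one chromosome; how the two breakpoints of an inter-chromosomal SV are combined is left open.

```python
WINDOW = 2_000_000  # 2 Mb search window on each side of the SV


def chrom_sort_key(chrom):
    """Sort numbered chromosomes numerically, then X, then Y."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def overlaps(start_a, end_a, start_b, end_b):
    """True if two closed intervals share at least 1 bp."""
    return start_a <= end_b and start_b <= end_a


def genes_near(chrom, start, end, genes, window=WINDOW):
    """IDs of genes on `chrom` within `window` bp of [start, end]."""
    lo, hi = max(0, start - window), end + window
    return [g["id"] for g in genes
            if g["chrom"] == chrom and overlaps(lo, hi, g["start"], g["end"])]


def tads_overlapping(chrom, start, end, tads):
    """IDs of TADs on `chrom` sharing at least 1 bp with [start, end]."""
    return [t["id"] for t in tads
            if t["chrom"] == chrom and overlaps(start, end, t["start"], t["end"])]


# Toy data; gene and TAD records are made up for illustration.
svs = [("X", "Y", 5, 5), ("10", "X", 7, 7), ("2", "5", 9, 9), ("2", "3", 1, 1)]
svs_sorted = sorted(
    svs, key=lambda sv: (chrom_sort_key(sv[0]), chrom_sort_key(sv[1]), sv[2])
)

genes = [
    {"id": "GENE_A", "chrom": "1", "start": 3_000_000, "end": 3_010_000},
    {"id": "GENE_B", "chrom": "1", "start": 5_000_000, "end": 5_020_000},
    {"id": "GENE_C", "chrom": "2", "start": 1_700_000, "end": 1_710_000},
]
tads = [
    {"id": "ID1", "chrom": "1", "start": 1_000_000, "end": 1_736_435},
    {"id": "ID2", "chrom": "1", "start": 1_736_436, "end": 2_000_000},
]

nearby = genes_near("1", 1_736_435, 1_736_435, genes)
hit_tads = tads_overlapping("1", 1_736_435, 1_736_435, tads)
print(svs_sorted, nearby, hit_tads)
```

The linear scans are fine for the 127-SV test set. For the sets of roughly 70000 SVs, an indexed lookup (for example an interval tree or a database range query) would likely be needed, which is the kind of setup choice this use case is meant to evaluate.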
Output
Description: each row describes the genomic position of a structural variant and its associated annotations.
Format: tab-separated (debatable)
Example:
1 1736435 1736435 10 1843621 1843623 ID1,ID2,ID3 ENSG00000223972,ENSG00000222623 NA,1.35381157129201e-10 NA,96.68621701 HP:0001744,HP:0002721,HP:0001876
Per-column description (the within-column separator is debatable):
Column 1, 'Chromosome 1': name of the first chromosome on which the SV is located
Column 2, 'Start 1': start position of the SV on the first chromosome
Column 3, 'End 1': end position of the SV on the first chromosome
Column 4, 'Chromosome 2': name of the second chromosome on which the SV is located
Column 5, 'Start 2': start position of the SV on the second chromosome
Column 6, 'End 2': end position of the SV on the second chromosome
Column 7, 'Overlapping TADs': IDs of the TADs that the SV overlaps with
Column 8, 'Nearby gene IDs': IDs of genes that are within 2 Mb of the SV
Column 9, 'pLI': pLI score (pathogenicity indicator) of the genes within 2 Mb
Column 10, 'RVIS': RVIS score (pathogenicity indicator) of the genes within 2 Mb
Column 11, 'HPO': HPO (patient phenotype) terms that are associated with the genes within 2 Mb.
The first 6 columns describe the SV and are taken directly from the input file. The remaining columns contain the annotations.
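A minimal sketch of how one such output row could be assembled, assuming a comma as the within-column separator and `NA` for missing scores, as in the example row. `join_values` and `format_output_row` are hypothetical helper names. Scores are passed as strings here so that their formatting is not altered.

```python
def join_values(values, sep=","):
    """Join the values of one annotation column; None becomes 'NA'."""
    return sep.join("NA" if v is None else str(v) for v in values)


def format_output_row(sv, tads, gene_ids, pli, rvis, hpo, sep=","):
    """Build one tab-separated row: 6 SV columns followed by 5 annotation columns."""
    sv_columns = [str(x) for x in sv]
    annotation_columns = [
        join_values(col, sep) for col in (tads, gene_ids, pli, rvis, hpo)
    ]
    return "\t".join(sv_columns + annotation_columns)


row = format_output_row(
    sv=("1", 1736435, 1736435, "10", 1843621, 1843623),
    tads=["ID1", "ID2", "ID3"],
    gene_ids=["ENSG00000223972", "ENSG00000222623"],
    pli=[None, "1.35381157129201e-10"],
    rvis=[None, "96.68621701"],
    hpo=["HP:0001744", "HP:0002721", "HP:0001876"],
)
print(row)
```

Keeping the pLI and RVIS lists in the same order as the gene IDs lets each score be matched back to its gene by position.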