Datasets - GooglingTheCancerGenome/ga-integr8 GitHub Wiki

Core data sources

Our current use case

Genes

Dataset: MP_Genelist_HGNC_v2.txt
Description: List of all protein-coding genes and pre-selected gene annotations
Source: in-house (HPC)
Format: tab-separated

Example:

DDX11L1 100287596 ENSG00000223972 1 11869 14412 1 11869 NA NA NA NA NA 0 0

Per column description:

Column 1, 'hgnc_symbol': unique symbol assigned to a gene (source: https://www.genenames.org/)
Column 2, 'entrezgene': Entrez identifier of a gene
Column 3, 'ensembl_gene_id': Ensembl identifier of a gene (source: https://www.ensembl.org/info/genome/stable_ids/index.html)
Column 4, 'refseq_mrna': RefSeq mRNA identifiers associated with the gene (source: https://www.ncbi.nlm.nih.gov/refseq/)
Column 5, 'chromosome_name': name of the chromosome the gene is located on, formatted as e.g. 1 or X, without the 'chr' prefix.
Column 6, 'start_position': coordinates of the start position of the gene
Column 7, 'end_position': coordinates of the end position of the gene
Column 8, 'strand': strand orientation of the gene, either 1 or -1.
Column 9, 'phenotype_description': string of phenotype information associated with that gene (source: ?)
Column 10, 'Transcription_Start_Site': TSS coordinates of the gene.
Column 11, 'pLI': probability that the gene is intolerant to loss-of-function variants (see: http://blog.gene-talk.de/?p=639)
Column 12, 'RVIS': residual variation intolerance score (see: http://genic-intolerance.org/about.jsp)
Column 13, 'Entrez_gene_name': gene name associated with the Entrez gene identifier.
Column 14, 'HPO_Terms': names of the HPO terms associated with the gene, i.e. the HPO IDs spelled out in words.
Column 15, 'HPO_Term_IDs': HPO IDs associated with the gene (source: http://human-phenotype-ontology.github.io/about.html)
Column 16, 'Number_HPO_Terms_Gene': Number of HPO terms associated with the gene.
Column 17, 'redin_score': ? (source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5032982/)
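
A minimal sketch of reading this file in Python, using only the standard library. The column names are taken from the list above; everything else is an assumption: the function name `read_genes` is hypothetical, the file is assumed to have either no header row or one that repeats the column names, and "NA" and empty fields are treated as missing. The sample row reuses the example above, but the positions of its two empty fields ('refseq_mrna' and 'phenotype_description') are a guess, since the example shows only 15 of the 17 values.

```python
import csv
import io

# Column names as listed in the per-column description above.
GENE_COLUMNS = [
    "hgnc_symbol", "entrezgene", "ensembl_gene_id", "refseq_mrna",
    "chromosome_name", "start_position", "end_position", "strand",
    "phenotype_description", "Transcription_Start_Site", "pLI", "RVIS",
    "Entrez_gene_name", "HPO_Terms", "HPO_Term_IDs",
    "Number_HPO_Terms_Gene", "redin_score",
]


def read_genes(handle):
    """Yield one dict per gene row from a tab-separated file handle."""
    reader = csv.DictReader(
        handle, fieldnames=GENE_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Assumption: skip a header row if the file happens to have one.
        if row["hgnc_symbol"] == "hgnc_symbol":
            continue
        # Assumption: "NA" and empty strings both mean "no value".
        gene = {k: (None if v in ("", "NA") else v) for k, v in row.items()}
        # Convert the coordinate columns to integers where present.
        for key in ("start_position", "end_position", "strand"):
            if gene[key] is not None:
                gene[key] = int(gene[key])
        yield gene


# Sample row: placement of the two empty fields is assumed, not confirmed.
example = (
    "DDX11L1\t100287596\tENSG00000223972\t\t1\t11869\t14412\t1\t\t11869"
    "\tNA\tNA\tNA\tNA\tNA\t0\t0\n"
)
genes = list(read_genes(io.StringIO(example)))
print(genes[0]["hgnc_symbol"], genes[0]["chromosome_name"], genes[0]["start_position"])
```

For the real file, replace the `io.StringIO` object with `open("MP_Genelist_HGNC_v2.txt", newline="")`.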

TADs

Dataset: tads/*.txt
Description: List of all TADs across multiple cell types
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5478386/, Table S3
Format: tab-separated. Each file lists the TADs in one cell type.

Example:

chr10 4880000 4920000

Per column description:

Column 1, 'Chromosome': chromosome name that the TAD is located on
Column 2, 'Start': start position of the TAD
Column 3, 'End': end position of the TAD
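
Note that the TAD files write chromosomes with a 'chr' prefix (chr10), whereas the gene list omits it (10), so the names need to be normalized before the two datasets are compared. The sketch below shows one way to do that. The function names `read_tads` and `tads_containing` are hypothetical, and the coordinates are assumed to form a closed interval [Start, End]; the source table does not state the convention here, so check it before relying on boundary positions.

```python
import io


def read_tads(handle):
    """Return a list of (chromosome, start, end) tuples from a TAD file handle."""
    tads = []
    for line in handle:
        fields = line.split()  # accepts tabs or spaces between columns
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skip blank lines and a possible header row
        # Drop the 'chr' prefix so names match the gene list (e.g. '10', 'X').
        chrom = fields[0][3:] if fields[0].startswith("chr") else fields[0]
        tads.append((chrom, int(fields[1]), int(fields[2])))
    return tads


def tads_containing(tads, chrom, position):
    """Return the TADs on `chrom` whose interval contains `position`.

    Assumption: Start and End are both inclusive.
    """
    return [t for t in tads if t[0] == chrom and t[1] <= position <= t[2]]


# The example row from above.
example = "chr10\t4880000\t4920000\n"
tads = read_tads(io.StringIO(example))
hits = tads_containing(tads, "10", 4900000)
print(hits)
```

A linear scan like this is fine for a handful of lookups; for genome-wide queries an interval index would be the more usual choice.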

SVs

[TODO] in-house generated file

Potential data sources

Regulatory elements:

(!) Promoters -> ENCODE
(!) Enhancers -> ENCODE
Repressors

3D conformation:

(!) Hi-C loops -> ENCODE (https://www.encodeproject.org/search/?type=Experiment&assay_title=Hi-C), hdf5 format
(!) TADs -> ENCODE (https://www.encodeproject.org/search/?type=Experiment&assay_title=Hi-C), bed format
CTCF sites
Virtual 4C

Epigenetics:

Methylation marks
Accessibility (DNase I hypersensitivity)

Genes:

Gene locations and annotations
Conservation (GERP)

Expression:

Gene expression / eQTL

Networks:

Pathways
PPI networks

Phenotype:

(!) HPO terms -> http://human-phenotype-ontology.github.io/, ontologies available, OBO/OWL format
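
As an illustration of what the OBO format offers, the sketch below builds a lookup from HPO IDs to term names, which is the relation between the 'HPO_Term_IDs' and 'HPO_Terms' columns of the gene list. It reads only the `id:` and `name:` tags of `[Term]` stanzas and ignores everything else (synonyms, `is_a` relations, obsolete flags). The function name `read_obo_term_names` is hypothetical and the embedded sample is a hand-written excerpt for illustration, not a copy of the released ontology file.

```python
import io


def read_obo_term_names(handle):
    """Map term IDs to names from the [Term] stanzas of an OBO file handle."""
    names = {}
    in_term = False
    term_id = None
    for raw in handle:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and line.startswith("name: ") and term_id is not None:
            names[term_id] = line[len("name: "):]
    return names


# Hand-written sample in OBO syntax (illustrative only).
example = """format-version: 1.2

[Term]
id: HP:0000001
name: All

[Term]
id: HP:0000118
name: Phenotypic abnormality
is_a: HP:0000001 ! All

[Typedef]
id: part_of
name: part of
"""
names = read_obo_term_names(io.StringIO(example))
print(names["HP:0000118"])
```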

Other:

CADD scores?

Tools:

References:

Common file formats used by the ENCODE Consortium

Example for describing data sources: https://github.com/candYgene/pbg-ld/wiki/SGN-tomato-data-description
