Datasets - GooglingTheCancerGenome/ga-integr8 GitHub Wiki
Core data sources
- our current use case
Genes
Dataset: MP_Genelist_HGNC_v2.txt
Description: List of all protein-coding genes and pre-selected gene annotations
Source: in-house (HPC)
Format: tab-separated
Example:
DDX11L1 100287596 ENSG00000223972 1 11869 14412 1 11869 NA NA NA NA NA 0 0
Per column description:
Column 1, 'hgnc_symbol': unique symbol assigned to a gene by HGNC (source: https://www.genenames.org/)
Column 2, 'entrezgene': Entrez identifier of a gene
Column 3, 'ensembl_gene_id': Ensembl identifier of a gene (source: https://www.ensembl.org/info/genome/stable_ids/index.html)
Column 4, 'refseq_mrna': RefSeq mRNA identifiers associated with the gene (source: https://www.ncbi.nlm.nih.gov/refseq/)
Column 5, 'chromosome_name': name of the chromosome the gene is located on, formatted as e.g. 1 or X, without the 'chr' prefix.
Column 6, 'start_position': genomic coordinate of the start of the gene
Column 7, 'end_position': genomic coordinate of the end of the gene
Column 8, 'strand': strand orientation of the gene, either 1 or -1.
Column 9, 'phenotype_description': string of phenotype information associated with that gene (source: ?)
Column 10, 'Transcription_Start_Site': TSS coordinates of the gene.
Column 11, 'pLI': probability that the gene is intolerant to loss-of-function variants (see: http://blog.gene-talk.de/?p=639)
Column 12, 'RVIS': residual variation intolerance score (see: http://genic-intolerance.org/about.jsp)
Column 13, 'Entrez_gene_name': gene name associated with the Entrez gene identifier.
Column 14, 'HPO_Terms': names of the HPO terms associated with the gene (the HPO IDs spelled out in words).
Column 15, 'HPO_Term_IDs': HPO IDs associated with the gene (source: http://human-phenotype-ontology.github.io/about.html)
Column 16, 'Number_HPO_Terms_Gene': Number of HPO terms associated with the gene.
Column 17, 'redin_score': ? (source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5032982/)
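The column layout above can be turned into a small reader. The following is a minimal sketch, not the project's own loader: the function name `read_gene_list`, the `has_header` switch, and the choice to treat both empty fields and `NA` as missing values are assumptions made for illustration. The demo row uses placeholder values, not real data.

```python
import csv
import io

# Column names taken from the per-column description above.
GENE_COLUMNS = [
    "hgnc_symbol", "entrezgene", "ensembl_gene_id", "refseq_mrna",
    "chromosome_name", "start_position", "end_position", "strand",
    "phenotype_description", "Transcription_Start_Site", "pLI", "RVIS",
    "Entrez_gene_name", "HPO_Terms", "HPO_Term_IDs",
    "Number_HPO_Terms_Gene", "redin_score",
]


def read_gene_list(handle, has_header=False):
    """Yield one dict per gene from a tab-separated file handle.

    Assumptions (not confirmed by the source): empty fields and the
    literal string 'NA' both mean "missing" and are mapped to None;
    whether the file starts with a header line is left to the caller.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(GENE_COLUMNS):
            raise ValueError(
                f"row {line_number}: expected {len(GENE_COLUMNS)} fields, "
                f"got {len(row)}"
            )
        yield {
            name: (None if value in ("", "NA") else value)
            for name, value in zip(GENE_COLUMNS, row)
        }


# Hypothetical row with placeholder values, for demonstration only.
demo_fields = ["GENE1", "1", "ENSG_X", "", "1", "100", "200", "-1", "",
               "100", "NA", "NA", "name", "NA", "NA", "0", "0"]
demo = io.StringIO("\t".join(demo_fields) + "\n")
for gene in read_gene_list(demo):
    print(gene["hgnc_symbol"], gene["chromosome_name"], gene["pLI"])
```

Values are kept as strings so that no column is silently coerced; numeric conversion can be added per column once the exact value ranges are known. With a real file, pass `open("MP_Genelist_HGNC_v2.txt", newline="")` as the handle.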
TADs
Dataset: tads/*.txt
Description: List of all TADs across multiple cell types
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5478386/, Table S3
Format: tab-separated. Each file lists the TADs in one cell type.
Example:
chr10 4880000 4920000
Per column description:
Column 1, 'Chromosome': name of the chromosome the TAD is located on, with the 'chr' prefix (e.g. chr10)
Column 2, 'Start': start position of the TAD
Column 3, 'End': end position of the TAD
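Note that the TAD files use chromosome names with the 'chr' prefix (chr10), while the gene list uses names without it (1, X). A minimal sketch of a reader that parses the three columns and optionally strips the prefix, so both datasets can be compared, could look like this. The names `Tad` and `read_tads` and the `strip_chr_prefix` parameter are hypothetical, and skipping rows whose start or end is not an integer is an assumption made to tolerate a possible header line.

```python
import csv
import io
from typing import NamedTuple


class Tad(NamedTuple):
    chromosome: str
    start: int
    end: int


def read_tads(handle, strip_chr_prefix=True):
    """Read TADs from a tab-separated handle with chromosome, start, end.

    Rows with fewer than three fields or non-integer coordinates
    (e.g. a header line) are skipped; this is an assumption.
    """
    tads = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or not (row[1].isdigit() and row[2].isdigit()):
            continue
        chromosome = row[0]
        if strip_chr_prefix and chromosome.startswith("chr"):
            chromosome = chromosome[3:]  # 'chr10' -> '10', as in the gene list
        tads.append(Tad(chromosome, int(row[1]), int(row[2])))
    return tads


# The example line shown above, with tabs as separators.
example = io.StringIO("chr10\t4880000\t4920000\n")
tads = read_tads(example)
print(tads)
```

Since each file lists the TADs of one cell type, the function would be called once per file in tads/*.txt, for example while iterating over `glob.glob("tads/*.txt")`.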
SVs
[TODO] in-house generated file
Potential data sources
Regulatory elements:
- (!) Promoters -> ENCODE
- (!) Enhancers -> ENCODE
- Repressors
3D conformation:
- (!) Hi-C loops -> ENCODE (https://www.encodeproject.org/search/?type=Experiment&assay_title=Hi-C), hdf5 format
- (!) TADs -> ENCODE (https://www.encodeproject.org/search/?type=Experiment&assay_title=Hi-C), bed format
- CTCF sites
- Virtual 4C
Epigenetics:
- Methylation marks
- Accessibility (DNase I hypersensitivity)
Genes:
- Gene locations and annotations
- Conservation (GERP)
Expression:
- Gene expression / eQTL
Networks:
- Pathways
- PPI networks
Phenotype:
- (!) HPO terms -> http://human-phenotype-ontology.github.io/, ontologies available, OBO/OWL format
Other:
- CADD scores?
Tools:
References:
- Common file formats used by the ENCODE Consortium
Example for describing data sources: https://github.com/candYgene/pbg-ld/wiki/SGN-tomato-data-description