Datasets - GooglingTheCancerGenome/ga-integr8 GitHub Wiki

Core data sources

  • our current use case

Genes

Dataset: MP_Genelist_HGNC_v2.txt
Description: List of all protein-coding genes and pre-selected gene annotations
Source: in-house (HPC)
Format: tab-separated

Example:

DDX11L1 100287596 ENSG00000223972 1 11869 14412 1 11869 NA NA NA NA NA 0 0

Per column description:

Column 1, 'hgnc_symbol': unique symbol assigned to a gene (source: https://www.genenames.org/)
Column 2, 'entrezgene': Entrez identifier of a gene
Column 3, 'ensembl_gene_id': Ensembl identifier of a gene (source: https://www.ensembl.org/info/genome/stable_ids/index.html)
Column 4, 'refseq_mrna': RefSeq mRNA identifiers associated with the gene (source: https://www.ncbi.nlm.nih.gov/refseq/)
Column 5, 'chromosome_name': name of the chromosome the gene is located on, formatted as e.g. 1 or X, without the 'chr' prefix.
Column 6, 'start_position': start coordinate of the gene
Column 7, 'end_position': end coordinate of the gene
Column 8, 'strand': strand orientation of the gene, either 1 or -1.
Column 9, 'phenotype_description': string of phenotype information associated with that gene (source: ?)
Column 10, 'Transcription_Start_Site': TSS coordinates of the gene.
Column 11, 'pLI': probability that the gene is intolerant to loss-of-function variants (see: http://blog.gene-talk.de/?p=639)
Column 12, 'RVIS': residual variation intolerance score (see: http://genic-intolerance.org/about.jsp)
Column 13, 'Entrez_gene_name': gene name associated with the Entrez gene identifier.
Column 14, 'HPO_Terms': text string giving the HPO IDs in words (the HPO term names).
Column 15, 'HPO_Term_IDs': HPO IDs associated with the gene (source: http://human-phenotype-ontology.github.io/about.html)
Column 16, 'Number_HPO_Terms_Gene': Number of HPO terms associated with the gene.
Column 17, 'redin_score': ? (source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5032982/)
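
The column list above can be turned into a small reader. The following is a minimal sketch, not the project's own loader: it assumes the file has no header row, that every row carries all 17 tab-separated fields, and that missing values appear as 'NA' or as empty fields. The function name `read_genes` and the sample row are hypothetical and serve only as illustration.

```python
import csv
import io

# Column names in the order given in the per-column description.
GENE_COLUMNS = [
    "hgnc_symbol", "entrezgene", "ensembl_gene_id", "refseq_mrna",
    "chromosome_name", "start_position", "end_position", "strand",
    "phenotype_description", "Transcription_Start_Site", "pLI", "RVIS",
    "Entrez_gene_name", "HPO_Terms", "HPO_Term_IDs",
    "Number_HPO_Terms_Gene", "redin_score",
]

# Columns converted to int; all other columns are kept as strings.
INT_COLUMNS = {"start_position", "end_position", "strand"}


def read_genes(handle):
    """Yield one dict per gene row; 'NA' and empty fields become None.

    Assumption: no header row and exactly 17 tab-separated fields per row.
    """
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(GENE_COLUMNS):
            raise ValueError(
                f"expected {len(GENE_COLUMNS)} fields, got {len(fields)}"
            )
        row = {}
        for name, value in zip(GENE_COLUMNS, fields):
            if value in ("", "NA"):
                row[name] = None
            elif name in INT_COLUMNS:
                row[name] = int(value)
            else:
                row[name] = value
        yield row


# Synthetic row for illustration only (not taken from the real file).
sample = "\t".join([
    "GENE_A", "1", "ENSG_X", "NM_X", "1", "100", "200", "-1",
    "NA", "100", "0.5", "NA", "gene a", "", "NA", "0", "NA",
]) + "\n"

for gene in read_genes(io.StringIO(sample)):
    print(gene["hgnc_symbol"], gene["chromosome_name"], gene["start_position"])
```

To read the real file, pass an open file handle instead of the `io.StringIO` object, for example `open("MP_Genelist_HGNC_v2.txt", newline="")`.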

TADs

Dataset: tads/*.txt
Description: List of all TADs across multiple cell types
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5478386/, Table S3
Format: tab-separated. Each file lists the TADs in one cell type.

Example:

chr10 4880000 4920000

Per column description:

Column 1, 'Chromosome': chromosome name that the TAD is located on
Column 2, 'Start': start position of the TAD
Column 3, 'End': end position of the TAD
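
The TAD files use chromosome names with a 'chr' prefix, whereas the gene list stores them without it, so the two need to be brought to one convention before they can be compared. The following is a minimal sketch under stated assumptions: the files have no header row, and the file name (without extension) identifies the cell type. The names `Tad`, `strip_chr`, `read_tads` and `read_tad_dir` are hypothetical.

```python
import io
from pathlib import Path
from typing import NamedTuple


class Tad(NamedTuple):
    chromosome: str  # stored without the 'chr' prefix, as in the gene list
    start: int
    end: int


def strip_chr(name):
    """Return the chromosome name without a leading 'chr'."""
    return name[3:] if name.startswith("chr") else name


def read_tads(handle):
    """Parse tab-separated TAD rows (chromosome, start, end) into Tad tuples."""
    tads = []
    for line in handle:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        chrom, start, end = line.split("\t")[:3]
        tads.append(Tad(strip_chr(chrom), int(start), int(end)))
    return tads


def read_tad_dir(directory):
    """Map each file stem (assumed to name the cell type) to its list of TADs."""
    result = {}
    for path in sorted(Path(directory).glob("*.txt")):
        with path.open() as handle:
            result[path.stem] = read_tads(handle)
    return result


# The example row shown above, with tabs restored.
example = io.StringIO("chr10\t4880000\t4920000\n")
print(read_tads(example))
```

Calling `read_tad_dir("tads")` would load every file matched by tads/*.txt into a dictionary keyed by file stem.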

SVs

[TODO] in-house generated file

Potential data sources

Regulatory elements:

  • (!) Promoters -> ENCODE
  • (!) Enhancers -> ENCODE
  • Repressors

3D conformation:

Epigenetics:

  • Methylation marks
  • Accessibility (DNase I hypersensitivity)

Genes:

  • Gene locations and annotations
  • Conservation (GERP)

Expression:

  • Gene expression / eQTL

Networks:

  • Pathways
  • PPI networks

Phenotype:

Other:

  • CADD scores?

Tools:

References:

Example for describing data sources: https://github.com/candYgene/pbg-ld/wiki/SGN-tomato-data-description
