interpretation - ToolForVol/doc-synMall GitHub Wiki

📝 Introduction

SynMall is a one-stop synonymous mutation database that stores synonymous mutations across the entire human genome. It contains over 97 million synonymous mutations, corresponding to 25 million unique combinations of genome coordinate, reference base, and replacement base.

All potential sSNV

Field Name Description
Variant38 sSNV on GRCh38, formatted as {chromosome}_{position}_{reference allele}/{alternate allele}
Chromosome Chromosome of sSNV
Position38 Position coordinate of sSNV, based on GRCh38
Reference Allele The reference allele on the genome
Alternate Allele The alternate allele of sSNV
Position19 Position coordinate of sSNV, based on GRCh37, lifted over with LiftOver (unmapped records are represented by -)
Source Source that this sSNV comes from.
G=Generated with protein coding transcripts;
S=synVep;
F=FavorAnnotator;
C=CADDv1.7
Variant19 sSNV on GRCh37, formatted as {chromosome}_{position}_{reference allele}/{alternate allele} (unmapped records are represented by -)
ID dbSNP rsID, based on dbSNP build 156 (b156)
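
The variant identifier format and the one-letter Source codes listed above can be unpacked programmatically. The following is a minimal sketch, assuming the identifier is stored exactly as {chromosome}_{position}_{reference allele}/{alternate allele} and that a Source value is a string containing one or more of the letters G, S, F, C. The names `parse_variant` and `decode_sources` are hypothetical helpers, not part of SynMall.

```python
from typing import NamedTuple, Optional

# Source codes as listed in the field table above.
SOURCE_CODES = {
    "G": "Generated with protein coding transcripts",
    "S": "synVep",
    "F": "FavorAnnotator",
    "C": "CADDv1.7",
}


class Variant(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str


def parse_variant(variant_id: str) -> Optional[Variant]:
    """Parse '{chromosome}_{position}_{ref}/{alt}'; return None for '-' (unmapped)."""
    variant_id = variant_id.strip()
    if variant_id == "-":
        return None
    # Split from the right so a chromosome name containing '_' stays intact.
    chromosome, position, alleles = variant_id.rsplit("_", 2)
    ref, alt = alleles.split("/")
    return Variant(chromosome, int(position), ref, alt)


def decode_sources(codes: str) -> list[str]:
    """Expand source letters; characters that are not known codes are skipped
    (assumption: any delimiter between codes can be ignored)."""
    return [SOURCE_CODES[c] for c in codes if c in SOURCE_CODES]
```

For example, `parse_variant("1_69091_A/G")` yields chromosome `"1"`, position `69091`, reference `"A"`, and alternate `"G"` (the coordinate is an illustrative value only).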

✒ Annotation Result Interpretation

🖥 in silico Prediction

Common Pathogenic Prediction Score

This section compiles pathogenicity prediction scores for mutations, measured using computational tools that are not limited to a specific type of mutation. The table below lists the names of these tools and the meanings of their fields.

Field Name Description Reference
CADD_RawScore Raw score from the model, indicating whether a variant is more likely to be "observed" or "simulated".
>0: observed
<0: simulated
Rentzsch P, Witten D, Cooper G M, et al. CADD: predicting the deleteriousness of variants throughout the human genome[J]. Nucleic Acids Research, 2019, 47(D1): D886-D894.
CADD_PHRED CADD PHRED score: the raw score rank-scaled against ~8.6 billion possible SNVs.
Range: roughly 1 to 99 (10 = top 10%, 20 = top 1%)
Same as above
DANN_score DANN is a functional prediction score retrained on the CADD training data using a deep neural network. Scores range from 0 to 1. A larger score indicates a higher probability of being damaging. Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants[J]. Bioinformatics, 2015, 31(5): 761-763.
Eigen_Score A functional prediction score based on conservation, allele frequencies, and deleteriousness prediction Ionita-Laza I, McCallum K, Xu B, et al. A spectral approach integrating functional genomic annotations for coding and noncoding variants[J]. Nature genetics, 2016, 48(2): 214-220.
FATHMM-MKL_Score Discriminate between pathogenic variants and benign variants.
>0.5: deleterious
<=0.5: neutral or benign
Shihab H A, Rogers M F, Gough J, et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation[J]. Bioinformatics, 2015, 31(10): 1536-1543.
FATHMM-XF_Score Discriminate between pathogenic variants and benign variants.
>0.5: deleterious
<=0.5: neutral or benign
Rogers M F, Shihab H A, Mort M, et al. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features[J]. Bioinformatics, 2018, 34(3): 511-513.
CAPICE_Score The higher the score, the more likely that the variant is pathogenic. Li S, van der Velde K J, De Ridder D, et al. CAPICE: a computational method for consequence-agnostic pathogenicity interpretation of clinical exome variations[J]. Genome Medicine, 2020, 12: 1-11.
TraP_Score The chance of a variant being pathogenic, the higher the score the higher the damage the variant is predicted to have.
0.459 <= score < 0.93: possibly damaging
>=0.93: probably damaging
Gelfman S, Wang Q, McSweeney K M, et al. Annotating pathogenic non-coding variants in genic regions[J]. Nat Commun, 2017, 8(1): 236.
PhD-SNPg_Score A binary classifier for predicting pathogenic variants.
->1: Pathogenic
->0: Benign
Capriotti E, Fariselli P. PhD-SNPg: a webserver and lightweight tool for scoring single nucleotide variants[J]. Nucleic acids research, 2017, 45(W1): W247-W252.
GPN-MSA_Score Refers to the deleteriousness of one position.
cutoff: -7
Benegas G, Albors C, Aw A J, et al. GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction[J]. bioRxiv, 2023.
CScape-somatic_Score Discriminate between pathogenic variants and benign variants.
>0.5: deleterious
<=0.5: neutral or benign
Rogers M F, Gaunt T R, Campbell C. CScape-somatic: distinguishing driver and passenger point mutations in the cancer genome[J]. Bioinformatics, 2020, 36(12): 3637-3644.
CScape_Score Discriminate between pathogenic variants and benign variants.
>0.5: deleterious
<=0.5: neutral or benign
Rogers M F, Shihab H A, Gaunt T R, et al. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome[J]. Sci Rep, 2017, 7(1): 11597.
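
The cutoffs in the table above can be turned into labels with a few comparisons. This is a minimal sketch using only the thresholds stated in the table (0.5 for FATHMM-MKL, FATHMM-XF, CScape and CScape-somatic; 0.459 and 0.93 for TraP). The function names are hypothetical, missing scores are assumed to arrive as `None`, and the lower TraP bound is assumed to be inclusive.

```python
from typing import Optional


def half_cutoff_label(score: Optional[float]) -> Optional[str]:
    """Label for tools whose table entry uses the 0.5 cutoff
    (FATHMM-MKL, FATHMM-XF, CScape, CScape-somatic)."""
    if score is None:  # assumption: a missing annotation is passed as None
        return None
    return "deleterious" if score > 0.5 else "neutral or benign"


def trap_label(score: Optional[float]) -> Optional[str]:
    """Label for TraP_Score using the 0.459 / 0.93 cutoffs from the table."""
    if score is None:
        return None
    if score >= 0.93:
        return "probably damaging"
    if score >= 0.459:  # assumption: lower bound treated as inclusive
        return "possibly damaging"
    return "below the possibly-damaging cutoff"
```

Keeping one function per cutoff scheme, rather than one per tool, reflects that several tools in the table share the same 0.5 threshold.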

sSNV-specific Pathogenic Prediction Score

This section compiles pathogenicity prediction scores specifically designed for synonymous mutations, measured using computational tools. The table below lists the field names of these tools, their meanings, and the corresponding references.

Field Name Description Reference
EnDSM_Score Detects deleterious sSNVs based on an ensemble learning framework. Cheng N, Wang H, Tang X, et al. An Ensemble Framework for Improving the Prediction of Deleterious Synonymous Mutation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(5): 2603-2611.
frDSM_Score Deleterious synonymous mutation prediction using logistic regression. Wang H, Sun J, Liu M, et al. frDSM: An Ensemble Predictor With Effective Feature Representation for Deleterious Synonymous Mutation in Human Genome[J]. IEEE/ACM Trans Comput Biol Bioinform, 2023, 20(1): 371-377.
PrDSM_Score Predictive score of PrDSM for each synonymous mutation.
Range: [0, 1]
>0.308: deleterious
<=0.308: benign
Cheng N, Li M, Zhao L, et al. Comparison and integration of computational methods for deleterious synonymous mutation prediction[J]. Brief Bioinform, 2020, 21(3): 970-981.
usDSM_Score A prediction score for deleterious synonymous mutations. The larger the score is, the more likely the mutation is deleterious. Tang X, Zhang T, Cheng N, et al. usDSM: a novel method for deleterious synonymous mutation prediction using undersampling scheme[J]. Brief Bioinform, 2021, 22(5).
usDSM_Class The prediction of usDSM model, deleterious or benign. Same as above.
Syntool_Score Intolerance score to single-nucleotide variation. Zhang T, Wu Y, Lan Z, et al. Syntool: A Novel Region-Based Intolerance Score to Single Nucleotide Substitution for Synonymous Mutations Predictions Based on 123,136 Individuals[J]. Biomed Res Int, 2017, 2017: 5096208.
Syntool_Score_P Percentile of the intolerance score to single-nucleotide variation. Same as above.
SliVA_Score A tool for the automated harmfulness prediction of synonymous (silent) mutations within the human genome.
Range: [0, 1]
->1: Harmful
Buske O J, Manickaraj A, Mital S, et al. Identification of deleterious synonymous variants in human genomes[J]. Bioinformatics, 2013, 29(15): 1843-1850.
synVep_Score Evaluates the effects of human synonymous variants across different transcripts. Zeng Z, Aptekmann A A, Bromberg Y. Decoding the effects of synonymous variants[J]. Nucleic acids research, 2021, 49(22): 12673-12691.

Regulatory/Functional Prediction Score

This section compiles information on whether mutations have regulatory or functional effects based on computational tools, rather than necessarily being pathogenic. Many tools are designed for non-coding mutations/regions, but they also provide precomputed scores for the whole genome or regions including some synonymous mutations, making them applicable for annotating synonymous mutations. The table below lists the field names of these tools, their meanings, and the corresponding references.

Field Name Description Reference
MACIE01 The estimated joint posterior probabilities of not evolutionarily conserved and regulatory functional Li X, Yung G, Zhou H, et al. A multi-dimensional integrative scoring framework for predicting functional variants in the human genome[J]. The American Journal of Human Genetics, 2022, 109(3): 446-456.
MACIE10 The estimated joint posterior probabilities of evolutionarily conserved and not regulatory functional Same as above
MACIE00 The estimated joint posterior probabilities of not evolutionarily conserved and not regulatory functional Same as above
MACIE11 The estimated joint posterior probabilities of both evolutionarily conserved and regulatory functional Same as above
MACIE_conserved The estimated posterior probability of evolutionarily conserved Same as above
MACIE_regulatory The estimated posterior probability of regulatory functional Same as above
MACIE_anyclass The estimated posterior probability of evolutionarily conserved or regulatory functional Same as above
FunSeq_Score A flexible framework to prioritize regulatory mutations from cancer genome sequencing (integrative score). Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013)
GenoCanyon_Score Predict the functional potential at each nucleotide. Lu, Q., Hu, Y., Sun, J. et al. A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data. Sci Rep 5, 10576 (2015).
FIRE_Score A score refers to the variant's potential to regulate the expression levels of nearby genes. Ioannidis N M, Davis J R, DeGorter M K, et al. FIRE: functional inference of genetic variants that regulate gene expression[J]. Bioinformatics, 2017, 33(24): 3895-3901.
CDTS_Score Context-dependent tolerance score (CDTS). The lower the score is, the more intolerant the region is to variation. di Iulio, J. et al. The human noncoding genome defined by genetic diversity. Nat. Genet. 50, 333–337 (2018)
CDTS_percentile Genome-wide percentile of the CDTS_Score. The lower the percentile, the more constrained the region is. Same as above
ReMM_Score Scores the positions in the human genome in terms of their regulatory probability.
->0: non-deleterious;
->1: deleterious
Smedley D, Schubach M, Jacobsen J O B, et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease[J]. The American Journal of Human Genetics, 2016, 99(3): 595-606.
ALoFT_Score ALoFT provides extensive annotations to putative loss-of-function variants (LoF) in protein-coding genes including functional, evolutionary and network features (integrative score). Balasubramanian S, Fu Y, Pawashe M, et al. Using ALoFT to determine the impact of putative loss-of-function variants in protein-coding genes[J]. Nature communications, 2017, 8(1): 1-11.
ALoFT_Description ALoFT annotation can predict the impact of premature stop variants and classify them as dominant disease-causing, recessive disease-causing and benign variants (integrative score). Same as above
LINSIGHT_Score The LINSIGHT score (integrative score). A higher LINSIGHT score indicates more functionality. Range: [0.215, 0.995]. Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017)
RegSeq0 Regulatory sequence model HEK293T Schubach M, Maass T, Nazaretyan L, et al. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions[J]. Nucleic Acids Research, 2024, 52(D1): D1143-D1154.
RegSeq1 Regulatory sequence model K562 Same as above
RegSeq2 Regulatory sequence model HepG2 Same as above
RegSeq3 Regulatory sequence model HeLa-S3 Same as above
RegSeq4 Regulatory sequence model MCF-7 Same as above
RegSeq5 Regulatory sequence model iPS DF 19.11 Same as above
RegSeq6 Regulatory sequence model GM23338 Same as above
RegSeq7 Regulatory sequence model GC-matched background Same as above
SpliceAI-acc-gain Masked SpliceAI acceptor gain score (default: 0*) Jaganathan, K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535- 548.e24 (2019).
SpliceAI-acc-loss Masked SpliceAI acceptor loss score (default: 0) Same as above
SpliceAI-don-gain Masked SpliceAI donor gain score (default: 0) Same as above
SpliceAI-don-loss Masked SpliceAI donor loss score (default: 0) Same as above
MMSp_acceptor MMSplice acceptor score (default: 0) Cheng J, Nguyen T Y D, Cygan K J, et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing[J]. Genome biology, 2019, 20: 1-15.
MMSp_exon MMSplice exon score (default: 0) Same as above
MMSp_donor MMSplice donor score (default: 0) Same as above
dbscSNV-ada_Score Adaboost classifier score from dbscSNV (default: 0*) Jian X, Boerwinkle E, Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome[J]. Nucleic acids research, 2014, 42(22): 13534-13544.
dbscSNV-rf_Score Random forest classifier score from dbscSNV (default: 0*) Same as above
TargetScan_Score Targetscan (default: 0*) Friedman, R. C., Farh, K. K.-H., Burge, C. B. & Bartel, D. P. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 19, 92–105 (2009).
mirSVR-Score mirSVR-Score (default: 0*) Betel D, Koppal A, Agius P, et al. Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites[J]. Genome biology, 2010, 11: 1-14.
mirSVR-E mirSVR-E (default: 0) Same as above
mirSVR-Aln mirSVR-Aln (default: 0) Same as above
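
The MACIE fields in the table above are related to one another: the three marginal fields can be derived from the four joint posterior probabilities. The following is a minimal sketch, assuming the four joint classes (MACIE00, MACIE01, MACIE10, MACIE11) are mutually exclusive, with the first digit denoting "evolutionarily conserved" and the second "regulatory functional" as described in the table. The function name is hypothetical, and stored values may differ slightly from these sums because of rounding.

```python
def macie_marginals(p00: float, p01: float, p10: float, p11: float) -> dict:
    """Derive marginal probabilities from the four joint MACIE classes.

    p00 is accepted for completeness but is not needed for the sums below.
    """
    return {
        # conserved = (conserved, not regulatory) + (conserved, regulatory)
        "MACIE_conserved": p10 + p11,
        # regulatory = (not conserved, regulatory) + (conserved, regulatory)
        "MACIE_regulatory": p01 + p11,
        # any class = every joint class except "neither"
        "MACIE_anyclass": p01 + p10 + p11,
    }
```

When the four joint probabilities sum to 1, `MACIE_anyclass` is equivalently `1 - MACIE00`.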

🩺 Disease Information

HGMD

  • Reference: Stenson P D, Mort M, Ball E V, et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting[J]. Human genetics, 2020, 139: 1197-1207.
  • Retrieve Source: Professional, 2023.3
  • Brief Introduction: The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that are thought to underlie, or are closely associated with human inherited disease.
Field Name Description
acc_num The HGMD accession number for this mutation. Typically these are strings consisting of CM, CP, CX or HM etc, followed by a six digit integer, such as CM995289. Foreign key for the GENOMIC_COORDS and MUTNOMEN tables.
chrom_hg19
pos_hg19
ref_hg19
alt_hg19
class_hg19
mut_hg19
chrom_hg38
pos_hg38
ref_hg38
alt_hg38
class_hg38
mut_hg38
diseasegene
chrom If known, the number of the chromosome (including X and Y). DEPRECATED
genename A human readable, fully spelled out name for the gene.
gdbid Identifier for the GDB Genome Database. When a matching record has not been identified, the field contains NULL. Present for historical reasons, as GDB no longer exists.
omimid Identifier for the OMIM database, http://www.ncbi.nlm.nih.gov/omim. When a matching record has not been identified, the field contains NULL.
amino The amino acid change caused by the mutation, in triple-letter code.
deletion Deletions are presented in terms of the deleted bases in lower case plus, in upper case, 10 bp DNA sequence flanking both sides of the lesion. Intron/exon boundary information may be provided where identified (e.g. I12E13). The codon number in the CODON field represents the last whole codon preceding the deletion and is marked in the given sequence by the caret character (^).
insertion Insertions are presented in terms of the inserted bases in lower case plus, in upper case, 10 bp DNA sequence flanking both sides of the lesion. The numbered codon from the AMINO field is preceded in the given sequence by the caret character (^).
codon The number of the altered codon mapped to the HGMD cDNA sequence provided.
codonAff The codon affected by the mutation in question.
descr A textual description of the mutation.
refseq The NCBI mRNA reference sequence utilised by HGMD.
hgvs Composite HGVS cDNA based nomenclature for the mutation.
hgvsAll Composite HGVS nomenclature for fulltext indexing and searching purposes.
dbsnp Links the variants in HGMD to a corresponding dbSNP entry.
chromosome Strictly a number from 1-22, X or Y.
startCoord Number of the first nucleotide of the mutation (chromosomal coordinate). For deletions, the first deleted nucleotide, for insertions, the last nucleotide before the inserted sequence, for single nucleotide mutations, the number of the mutated nucleotide.
endCoord Number of the last nucleotide of the mutation (chromosomal coordinate). For deletions, the last deleted nucleotide; for insertions, the first nucleotide after the inserted sequence; for single nucleotide mutations, the number of the mutated nucleotide (should be identical to startCoord).
expected_inheritance Inheritance data curated from multiple literature sources (only where such data may be unequivocally assigned).
gnomad_AC Allele counts for HGMD variants exactly matching variants found in the Genome Aggregation Database
gnomad_AF Allele frequency from gnomAD.
gnomad_AN Total number of alleles sequenced by gnomAD at the matching locus.
tag This field categorizes mutations and polymorphisms. There are seven possible values, DM, DM?, DP, DFP, FP, FTV and R.
dmsupport Positive or negative score depending on the support (or lack of support) of the extra references for pathogenicity or functional alteration. Experimental.
rankscore Ranking score is a single score between 0 and 1, with 1 being most likely disease-causing. The score is computed using machine learning, and is based upon multiple lines of evidence, including HGMD literature support for pathogenicity, evolutionary conservation (100-way vertebrate alignment), variant allele frequency and in-silico prediction. This feature is under ongoing development.
mutype Primary type of mutation logged in HGMD. (i.e. missense, initiation, nonsense, synonymous, noncoding, frameshift, inframe, gross, canonical-splice, exonic-splice, splice, nonstop, regulatory).
author Reference field. All the reference fields refer to the literature report that the corresponding mutation was obtained from. Last name of the first author
title
fullname Reference field. The approved Medline abbreviation for the journal. Foreign key for the base table JOURNAL.FULLNAME field
allname ALLNAME contains the name spelled out in its entirety.
vol Reference field. There are 6 possible values for this field.
page Reference field. Number of the first page of the article.
year Reference field. Year the article was published, in four digits.
pmid Reference field. There are 5 possible values, numeric, HGOL, LSDB, NO ID and ABST.
pmidAll This field contains all of the PubMed Ids from primary and additional references that are associated with that variant.
reftag The REFTAG field contains one of five values: APR for additional phenotype report, FCR for functional characterisation report, MCR for molecular characterisation report, ACR for additional case report (detailing an additional case of the mutation) and SAR for simple additional report.
comments Free text comments by the curator.
new_date The date when the mutation was added to the database.
base This field is specific to single base pair substitutions and contains the description of the nucleotide change. This is presented in terms of a triplet change. For example, TAC-TAT represents a change of the last nucleotide C in the triplet to a T. TGT-TAT represents a change of the middle nucleotide G to an A.
clinvarID
clinvar_clnsig
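
The `base` field above encodes a single base pair substitution as a triplet change such as TAC-TAT. As a minimal sketch of how that notation can be read, the hypothetical helper below (not part of HGMD) reports which position in the triplet changed and the nucleotides involved, assuming the value is always two three-letter triplets joined by a hyphen.

```python
def describe_triplet_change(base: str) -> tuple:
    """Return (position_in_triplet, ref_nucleotide, alt_nucleotide).

    Position is 1-based: 1 = first, 2 = middle, 3 = last nucleotide.
    """
    ref_triplet, alt_triplet = base.strip().upper().split("-")
    if len(ref_triplet) != 3 or len(alt_triplet) != 3:
        raise ValueError(f"expected two triplets, got {base!r}")
    changes = [
        (index + 1, ref, alt)
        for index, (ref, alt) in enumerate(zip(ref_triplet, alt_triplet))
        if ref != alt
    ]
    if len(changes) != 1:
        raise ValueError(f"expected exactly one substituted nucleotide in {base!r}")
    return changes[0]
```

With the two examples from the field description, `describe_triplet_change("TAC-TAT")` gives `(3, "C", "T")` and `describe_triplet_change("TGT-TAT")` gives `(2, "G", "A")`.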

ClinVar

  • Reference: Landrum M J, Chitipiralla S, Brown G R, et al. ClinVar: improvements to accessing data[J]. Nucleic acids research, 2020, 48(D1): D835-D844.
  • Retrieve Source: https://www.ncbi.nlm.nih.gov/clinvar/ , 2024-06-11
  • Brief Introduction: ClinVar is a freely accessible public archive, maintained by the NIH, that aggregates and provides interpretations of the relationships between human genetic variants and diseases.
Field Name Description
AF_ESP allele frequencies from GO-ESP
AF_EXAC allele frequencies from ExAC
AF_TGP allele frequencies from TGP
ALLELEID the ClinVar Allele ID
CLNDN ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
CLNDNINCL For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
CLNDISDB Tag-value pairs of disease database name and identifier submitted for germline classifications, e.g. OMIM:NNNNNN
CLNDISDBINCL For included Variant: Tag-value pairs of disease database name and identifier for germline classifications, e.g. OMIM:NNNNNN
CLNHGVS Top-level (primary assembly, alt, or patch) HGVS expression
CLNREVSTAT ClinVar review status of germline classification for the Variation ID
CLNSIG Aggregate germline classification for this single variant; multiple values are separated by a vertical bar
CLNSIGCONF Conflicting germline classification for this single variant; multiple values are separated by a vertical bar
CLNSIGINCL Germline classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar
CLNVC Variant type
CLNVCSO Sequence Ontology id for variant type
CLNVI the variant's clinical sources reported as tag-value pairs of database and variant identifier
DBVARID nsv accessions from dbVar for the variant
GENEINFO Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
MC comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequence
ONCDN ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDB
ONCDNINCL For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDBINCL
ONCDISDB Tag-value pairs of disease database name and identifier submitted for oncogenicity classifications, e.g. MedGen:NNNNNN
ONCDISDBINCL For included variant: Tag-value pairs of disease database name and identifier for oncogenicity classifications, e.g. OMIM:NNNNNN
ONC Aggregate oncogenicity classification for this single variant; multiple values are separated by a vertical bar
ONCINCL Oncogenicity classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar
ONCREVSTAT ClinVar review status of oncogenicity classification for the Variation ID
ONCCONF Conflicting oncogenicity classification for this single variant; multiple values are separated by a vertical bar
ORIGIN Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other
RS dbSNP ID (i.e. rs number)
SCIDN ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDB
SCIDNINCL For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDBINCL
SCIDISDB Tag-value pairs of disease database name and identifier submitted for somatic clinical impact classifications, e.g. MedGen:NNNNNN
SCIDISDBINCL For included variant: Tag-value pairs of disease database name and identifier for somatic clinical impact classifications, e.g. OMIM:NNNNNN
SCIREVSTAT ClinVar review status of somatic clinical impact for the Variation ID
SCI Aggregate somatic clinical impact for this single variant; multiple values are separated by a vertical bar
SCIINCL Somatic clinical impact classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar
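
Several ClinVar fields in the table above pack multiple values into one string or number. The sketch below decodes two of them: ORIGIN, whose listed values are powers of two that "may be added" (treated here as a bit mask, which is an assumption), and GENEINFO, whose pairs are delimited by a vertical bar with a colon between symbol and id. The function names and the gene symbols in the usage note are hypothetical.

```python
# Allele-origin flags as listed in the ORIGIN field description.
ORIGIN_FLAGS = {
    1: "germline",
    2: "somatic",
    4: "inherited",
    8: "paternal",
    16: "maternal",
    32: "de-novo",
    64: "biparental",
    128: "uniparental",
    256: "not-tested",
    512: "tested-inconclusive",
    1073741824: "other",
}


def decode_origin(value: int) -> list[str]:
    """Expand a summed ORIGIN value into its individual origin labels."""
    if value == 0:
        return ["unknown"]
    return [name for bit, name in ORIGIN_FLAGS.items() if value & bit]


def parse_geneinfo(geneinfo: str) -> list[tuple[str, int]]:
    """Split 'SYMBOL:id|SYMBOL:id' into (symbol, gene id) pairs."""
    pairs = []
    for item in geneinfo.split("|"):
        symbol, gene_id = item.split(":")
        pairs.append((symbol, int(gene_id)))
    return pairs
```

For instance, `decode_origin(3)` returns `["germline", "somatic"]`, and `parse_geneinfo("GENE1:111|GENE2:222")` returns `[("GENE1", 111), ("GENE2", 222)]`.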

COSMIC

  • Reference: Tate J G, Bamford S, Jubb H C, et al. COSMIC: the catalogue of somatic mutations in cancer[J]. Nucleic acids research, 2019, 47(D1): D941-D947.
  • Retrieve Source: https://cancer.sanger.ac.uk/cosmic/ , v100
  • Brief Introduction: COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world's largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer.
Field Name Description
COSMIC_MUTATION_ID Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome.
GENE Gene name
TRANSCRIPT Transcript accession
STRAND Gene strand
LEGACY_ID Legacy Mutation ID
CDS CDS annotation
AA Peptide annotation
HGVSC HGVS cds syntax
HGVSP HGVS peptide syntax
HGVSG HGVS genomic syntax
SAMPLE_COUNT How many genome screens samples have this mutation
IS_CANONICAL The Ensembl Canonical transcript is a single, representative transcript identified at every locus
TIER Indicates to which tier of the Cancer Gene Census the gene belongs (1/2)
SO_TERM SO term for this mutation
COMISC_SOURCE This record comes from TARGETED_SCREEN or GENOME_SCREEN. GENOME_SCREEN: Coding point mutations from genome wide screens (including whole exome sequencing) from the current release; TARGETED_SCREEN: Complete curated COSMIC dataset (targeted screens) from the current release.

GWAS Catalog

  • Reference: Sollis E, Mosaku A, Abid A, et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource[J]. Nucleic acids research, 2023, 51(D1): D977-D985.
  • Retrieve Source: https://www.ebi.ac.uk/gwas/home , v1.0.2
  • Brief Introduction: The NHGRI-EBI GWAS Catalog is a FAIR knowledgebase providing standardized GWAS data, containing variant-trait associations and metadata for over 45,000 published GWAS, with expanded data types and improved interoperability, curated from publications or prepublication author submissions.
Field Name Description
DATE ADDED TO CATALOG Date a study is published in the catalog
PUBMEDID PubMed identification number
FIRST AUTHOR Last name and initials of first author
DATE Publication date (online (epub) date if available)
JOURNAL Abbreviated journal name
LINK PubMed URL
STUDY Title of paper
DISEASE/TRAIT Disease or trait examined in study
INITIAL SAMPLE SIZE Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)
REPLICATION SAMPLE SIZE Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)
REGION Cytogenetic region associated with rs number
CHR_ID Chromosome number associated with rs number
CHR_POS Chromosomal position associated with rs number
REPORTED GENE(S) Gene(s) reported by author
MAPPED_GENE Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is located within multiple genes, these genes are listed separated by commas. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.
UPSTREAM_GENE_ID Entrez Gene ID for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_ID Entrez Gene ID for nearest downstream gene to rs number, if not within gene
SNP_GENE_IDS Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts
UPSTREAM_GENE_DISTANCE Distance in kb for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_DISTANCE Distance in kb for nearest downstream gene to rs number, if not within gene
STRONGEST SNP-RISK ALLELE SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype
SNPS Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)
MERGED Denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes;)
SNP_ID_CURRENT Current rs number (will differ from strongest SNP when merged = 1)
CONTEXT Provides information on a variant’s predicted most severe functional effect from Ensembl
INTERGENIC Denotes whether SNP is in intergenic region (0 = no; 1 = yes)
RISK ALLELE FREQUENCY Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.
P-VALUE Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 × 10^-7 is rounded to 5 × 10^-7).
PVALUE_MLOG -log(p-value)
P-VALUE (TEXT) Information describing context of p-value (e.g. females, smokers).
OR or BETA Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that prior to 2021, any OR <1 was inverted, along with the reported allele, so that all ORs included in the Catalog were >1. This is no longer done, meaning that associations added after 2021 may have OR <1. Appropriate unit and increase/decrease are included for beta coefficients.
95% CI (TEXT) Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.
PLATFORM [SNPS PASSING QC] Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable
CNV Study of copy number variation (yes/no)
MAPPED_TRAIT Mapped Experimental Factor Ontology trait for this study
MAPPED_TRAIT_URI URI of the EFO trait
STUDY ACCESSION Accession ID allocated to a GWAS Catalog study
GENOTYPING TECHNOLOGY Genotyping technology/ies used in this study, with additional array information (ex. Immunochip or Exome array) in brackets.
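
Two of the GWAS Catalog fields above involve simple arithmetic: P-VALUE is rounded to one significant digit, and PVALUE_MLOG is the negative logarithm of the p-value. The sketch below illustrates both, assuming the logarithm is base 10 (the field description only says "-log"). The function names are hypothetical and this is not the Catalog's own code.

```python
import math


def round_to_one_significant_digit(p_value: float) -> float:
    """Round a p-value to one significant digit via scientific notation."""
    if p_value <= 0:
        raise ValueError("p-value must be positive")
    # '.0e' formats as one digit times a power of ten, e.g. '5e-07'.
    return float(f"{p_value:.0e}")


def pvalue_mlog(p_value: float) -> float:
    """Negative log of the p-value (assumption: base 10)."""
    if p_value <= 0:
        raise ValueError("p-value must be positive")
    return -math.log10(p_value)
```

Using the example from the P-VALUE description, a published value of 4.8e-7 rounds to 5e-7, whose negative base-10 logarithm is about 6.3.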

GRASP

  • Reference: Leslie R, O'Donnell C J, Johnson A D. GRASP: analysis of genotype–phenotype results from 1390 genome-wide association studies and corresponding open access database[J]. Bioinformatics, 2014, 30(12): i185-i194.
  • Retrieve Source: https://sites.google.com/site/jpopgen/wgsa , v2
  • Brief Introduction: GRASP contains over 6.2 million SNP-phenotype associations from 1390 GWAS studies, re-annotated with 16 diverse sources including RNA editing sites, lincRNAs, and PTMs.
Field Name Description
rs Latest SNP ID from dbSNP; it can differ from the original SNP entry in the database due to SNP merges (merged = 1)
PMID PubMed identifier for paper from which the SNP association originates
p-value P-value for SNP-phenotype association
phenotype Phenotype description of SNP-phenotype entry
ancestry Ethnodemographic description of the paper population(s) (e.g., European, Mixed)
platform Description of genotyping and/or imputation platform(s) and number of SNP markers (specified or approximated) included in post-QC analyses

DisGeNET

  • Reference: Piñero J, Ramírez-Anguita J M, Saüch-Pitarch J, et al. The DisGeNET knowledge platform for disease genomics: 2019 update[J]. Nucleic acids research, 2020, 48(D1): D845-D855.
  • Retrieve Source: https://www.disgenet.org/home/ , 2020.3
  • Brief Introduction: DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated to human diseases.
Field Name Description
snpId dbSNP variant Identifier
class type of variant
chromosome Chromosome of the variant
position Position in chromosome
DSI The Disease Specificity Index for the variant
DPI The Disease Pleiotropy Index for the variant
NofDiseases Number of diseases associated to the variant
NofPmids Total number of publications reporting the Variant-Disease association

❗ ClinGen

The ClinGen website does not provide detailed descriptions of these fields.

  • Reference: Rehm H L, Berg J S, Brooks L D, et al. ClinGen—the clinical genome resource[J]. New England Journal of Medicine, 2015, 372(23): 2235-2242.
  • Retrieve Source: https://clinicalgenome.org/ , 2024-03-27
  • Brief Introduction: ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to building a central resource that defines the clinical relevance of genes and variants for use in precision medicine and research.
Field Name Description
#Variation
ClinVar Variation Id
Allele Registry Id ClinGen canonical allele identifier, example: CA200893
HGVS Expressions
HGNC Gene Symbol
Disease
Mondo Id Mondo IDs are required for describing the disease entity in the ClinGen Gene and Variant Curation Interfaces
Mode of Inheritance Inheritance pattern of the disease; a gene may be associated with multiple inheritance patterns
Assertion Clinical Validity Classification
Applied Evidence Codes (Met)
Applied Evidence Codes (Not Met)
Summary of interpretation
PubMed Articles
Expert Panel
Guideline
Approval Date
Published Date
Retracted
Evidence Repo Link
Uuid
HGVSg

VariSNP

  • Reference: Schaafsma G C P, Vihinen M. VariSNP, a benchmark database for variations from dbSNP[J]. Human mutation, 2015, 36(2): 161-166.
  • Retrieve Source: https://lap676.srv.lu.se/VariSNP/index.php , 2017-02-16
  • Brief Introduction: VariSNP is a benchmark database suite comprising variation datasets that can be used for developing and testing the performance of variant effect prediction tools. VariSNP contains datasets selected from dbSNP that were filtered to exclude disease-related variants found in ClinVar, Swiss-Prot and PhenCode, so all remaining variations are considered neutral or non-pathogenic.
Field Name Description
dbSNP_id dbSNP RefSNP cluster ID number (rs#)
heterozygosity Estimated average heterozygosity from allele frequencies of this RefSNP. Values between 0 and 1. You can find a document describing the computation of average heterozygosity and standard error for dbSNP RefSNP clusters at NCBI
heterozygosity_standard_error Standard error of heterozygosity estimate.
creation_date Date when the RefSNP cluster was instantiated
creation_build Build number (NCBI release) when the RefSNP cluster was instantiated
update_date Most recent date the RefSNP cluster was updated (member added or deleted)
update_build Build number (NCBI release) when the RefSNP cluster was updated
observed_alleles Observed variation alleles. All allele(s) observed at this position in the reference. Can be something like A/C or A/C/G/T or -/ACC
asn_from Start position of the SNP on the contig, counting from 0. This position is always counted from the beginning of the contig, regardless of the SNP's orientation to the contig and regardless of the contig's orientation to the chromosome
asn_to End position of snp on contig
reference_allele Reference allele(s), this can be a '-' in the case of an insertion
orientation Orientation of RefSNP sequence to contig sequence. Values are 'forward' or 'reverse'
minor_allele_frequency Global minor allele frequency. dbSNP is reporting the minor allele frequency for each rs included in a default global population. Since this is being provided to distinguish common polymorphism from rare variants, the MAF is actually the second most frequent allele value. In other words, if there are 3 alleles, with frequencies of 0.50, 0.49, and 0.01, the MAF will be reported as 0.49. The current default global population is 1000Genome phase 1 genotype data from 1094 worldwide individuals, released in the May 2011 dataset. Values from 0 to 0.50
minor_allele Minor allele
sample_size Sample size, which is the number of chromosomes in the sample population
validation Validation method, type of evidence used to confirm the variation. Present values can be byHapMap; byOtherPop; byFrequency; by1000G; by2Hit2Allele; byCluster
hgvs_names Description(s) of the variation according to HGVS recommendations
allele_origin Genetic origin of the allele, e.g. germline, somatic, inherited, maternal
clinical_significance Clinical significance. Assertions of clinical significance for alleles of human sequence variations are reported as provided by the submitter and not interpreted by NCBI. Submissions based on processing data from OMIM® were assigned the value of probable-pathogenic. If there is a published authoritative guideline about the pathogenicity of any allele, that is included in the report. The supported values are: unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other
functional_class Variation functional class. Variations are assigned functional classes, which report if a variation is located in a locus region, in a transcript, or in a coding region. This column contains one or more functional classes (fxnClass), values can be cds-indel, downstream-variant-500B, frameshift-variant, intron-variant, missense, nc-transcript-variant, reference, splice-acceptor-variant, splice-donor-variant, stop-gained, stop-lost, synonymous-codon, upstream-variant-2KB, utr-variant-3-prime. In this column you can also find the Sequence Ontology term corresponding to the functional class (soTerm), the mRNA accession (mrnaAcc) and version (mrnaVer), gene symbol (symbol) and the Entrez gene id (geneid)
ncbi_gi NCBI gi number.
ncbi_accession NCBI accession and version number of reference sequence, e.g. NG_01234.5
gene_symbol Gene symbol (provided by HGNC).
refseq_start_description Description relative to transcription start on reference sequence
coding_dna_description Coding DNA variant description according to HGVS recommendations
protein_description Protein variant description according to HGVS recommendations
coding_reference NCBI RefSeq accession and version number (mRNA), e.g. NM_01234.5
protein_reference NCBI RefSeq accession and version number (protein), e.g. NP_01234.5
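
The minor_allele_frequency rule described in the table above (the reported MAF is the frequency of the second most frequent allele) can be illustrated with a short Python sketch. The function name `minor_allele` and the input dictionary are hypothetical and chosen for illustration; this is not dbSNP's or VariSNP's own code.

```python
def minor_allele(freqs):
    """Return (allele, frequency) for the second most frequent allele.

    `freqs` maps each observed allele to its frequency in the population.
    """
    if len(freqs) < 2:
        raise ValueError("at least two alleles are needed to define a minor allele")
    # Rank alleles from most to least frequent and take the runner-up.
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return ranked[1]


# Example from the field description: three alleles at 0.50, 0.49 and 0.01.
allele, maf = minor_allele({"A": 0.50, "C": 0.49, "G": 0.01})
print(allele, maf)
```

With the three-allele example from the description, the reported value is 0.49 rather than 0.01, which is why the field ranges from 0 to 0.50.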

dbDSM

  • Reference: Wen P, Xiao P, Xia J. dbDSM: a manually curated database for deleterious synonymous mutations[J]. Bioinformatics, 2016, 32(12): 1914-1916.
  • Retrieve Source: http://www.xialab.info:8080/dbDSM/index.jsp , v2
  • Brief Introduction: dbDSM (Database of Deleterious Synonymous Mutation) is an integrated database that collects multiple sources related to deleterious synonymous mutations.
Field Name Description
dbDSM Number The access number of a variant in dbDSM
Disease The main phenotype of the patient
DOID The identifier of a disease linked to OMIM database
Gene Gene name
GeneID The unique identifier for a gene
MIM The identifier of a gene linked to OMIM database
Map Location The map location for this gene
Protein A protein reference level representation of the variant
cDNA A coding reference level representation of the variant
SNPID dbSNP identifier of the variant. If there is no rs ID, this field is "n/a"
Refseq Transcript Refseq Transcript that the variant resides on
P-value P-value in GWAS
Strand Whether the variant occurs on the forward strand (+) or the reverse strand (-)
GRCh38 Position The position of variant on GRCh38
GRCh37 Position The position of variant on GRCh37
Ref Reference allele
Alt Alternate allele
Year Published time of an article
PMID Pubmed ID for an article
Classification Deleterious mechanism of a variant
Strength of Evidence Clinical classification of a variant
Key Sentence Deleterious evidence of a variant extracted from the article
Source The source of a variant
Score dbDSM score of a variant, derived from the SilVA, DDIG-SN, FATHMM-MKL, TraP and CADD scores by voting: the dbDSM score increases by one for each tool whose score is above its threshold value.
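
As a rough illustration of the voting rule described for the Score field, the Python sketch below counts one vote per tool whose score exceeds its threshold. The threshold values in `EXAMPLE_THRESHOLDS` are placeholders invented for this example, since the actual cutoffs used by dbDSM are not listed here, and the function name `voting_score` is likewise hypothetical.

```python
# Placeholder cutoffs for illustration only; these are NOT dbDSM's real thresholds.
EXAMPLE_THRESHOLDS = {
    "SilVA": 0.5,
    "DDIG-SN": 0.5,
    "FATHMM-MKL": 0.5,
    "TraP": 0.5,
    "CADD": 15.0,
}


def voting_score(tool_scores, thresholds=EXAMPLE_THRESHOLDS):
    """Count the tools whose score is above its threshold.

    Tools with a missing score (absent or None) cast no vote.
    """
    votes = 0
    for tool, cutoff in thresholds.items():
        score = tool_scores.get(tool)
        if score is not None and score > cutoff:
            votes += 1
    return votes


# Two of the three available scores exceed their placeholder cutoffs.
print(voting_score({"SilVA": 0.9, "CADD": 20.0, "TraP": 0.1}))
```

Under this reading the score is an integer between 0 and 5, one point per tool.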

PharmGKB

  • Reference: Gong L, Whirl‐Carrillo M, Klein T E. PharmGKB, an integrated resource of pharmacogenomic knowledge[J]. Current protocols, 2021, 1(8): e226.
  • Retrieve Source: https://www.pharmgkb.org/ , 2024-03-06
  • Brief Introduction: The Pharmacogenomics Knowledgebase (PharmGKB) is an integrated online knowledge resource for the understanding of how genetic variation contributes to variation in drug response.
  1. var_pheno_ann.tsv: Contains associations in which the variant affects a phenotype, with or without drug information.
  2. var_drug_ann.tsv: Contains associations in which the variant affects a drug dose, response, metabolism, etc.
  3. var_fa_ann.tsv: Contains in vitro and functional analysis-type associations.
Field Name Description
Variant Annotation ID Unique ID number for each variant/drug annotation.
Variant/Haplotypes dbSNP rsID or haplotype(s) involved in the association. In some cases, an association is based on a gene phenotype group such as "poor metabolizers" or "intermediate activity". In these cases, the gene phenotype is found in this field.
Gene HGNC symbol for the gene involved in the association. Typically the variants will be within the gene boundaries, but occasionally this will not be true. E.g. the variant in the annotation may be upstream of the gene but is reported to affect the gene's expression or otherwise associated with the gene.
Drug(s) The drug(s) involved in the association. If there is more than one drug listed, the association may apply to each drug individually or the combination of the drugs together. The field "Multiple drugs And/or" will designate "or" - meaning that it applies to each drug - or "and" - meaning that the association is for the combination.
PMID PubMed identifier for the article supporting the annotation.
Phenotype Category Options are "efficacy", "toxicity", "dosage", "metabolism/PK", "PD", "other".
Significance The significance of the association as stated by the author; options are [yes, no, not stated].
Notes Free text field for notes added by the curator.
Sentence The structured annotation sentence generated by the variant annotation tool based on the information entered by the curator.
Alleles The basis for comparison in the annotation. In this field, there may be a variant, one or more haplotypes grouped together, one or more genotypes grouped together or one or more diplotypes grouped together. If there is a gene phenotype in the "Variant/Haplotypes" field (described above), this field will be blank
Specialty Population Any special populations this annotation is relevant to (e.g. pediatric).
Assay Type Information about the type of assay performed.
  1. Relationship
Field Name Description
Entity1_id Diseases, genes and drugs are designated by their PharmGKB IDs.
Entity1_type Disease, Drug, Gene, VariantLocation or Haplotype.
Entity2_id Diseases, genes and drugs are designated by their PharmGKB IDs.
Entity2_type Disease, Drug, Gene, VariantLocation or Haplotype.
Evidence VIP, VariantAnnotation, ClinicalAnnotation, DosingGuideline, DrugLabel or Pathway. Comma separated list because the evidence for a relationship could come from multiple sources in PharmGKB.
Association Possible values: “associated”, “not associated” or “ambiguous”.
PK PK stands for “Pharmacokinetic”. Relationships are marked as PK if the pair of entities was found in a pharmacokinetic pathway on PharmGKB, or if the Variant Annotation or VIP was annotated with PK in some manner
PD PD stands for “Pharmacodynamic”. Relationships are marked as PD if the pair of entities was found in a pharmacodynamic pathway on PharmGKB, or if the Variant Annotation or VIP was annotated with PD in some manner.
PMIDs PubMed IDs that were used to support the listed relationship. Semi-colon delimited list.
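
Because Evidence is a comma-separated list and PMIDs is a semicolon-delimited list, both fields usually need to be split before use. The sketch below shows one way to do that in Python; the function name `parse_relationship` and every value in the example row are made up for illustration.

```python
def parse_relationship(row):
    """Return a copy of one relationship row with Evidence and PMIDs split into lists."""

    def split(value, separator):
        # Drop surrounding whitespace and skip empty items.
        return [item.strip() for item in value.split(separator) if item.strip()]

    parsed = dict(row)
    parsed["Evidence"] = split(row.get("Evidence", ""), ",")
    parsed["PMIDs"] = split(row.get("PMIDs", ""), ";")
    return parsed


# Hypothetical row: identifiers and PMIDs are placeholders, not real records.
example_row = {
    "Entity1_id": "PA_EXAMPLE_1",
    "Entity1_type": "Gene",
    "Entity2_id": "PA_EXAMPLE_2",
    "Entity2_type": "Drug",
    "Evidence": "VariantAnnotation,ClinicalAnnotation",
    "Association": "associated",
    "PMIDs": "11111111;22222222",
}
print(parse_relationship(example_row))
```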
  1. Clinical
Field Name Description
variant name or symbol of the variant
gene HGNC ID of the gene
type category or categories that the annotation falls in
level of evidence strength of evidence for the annotation
chemicals drug(s) associated with the variant in the annotation; from the PharmGKB drug vocabulary
phenotypes associated disease phenotype(s), where applicable
  1. Variant
Field Name Description
Variant ID The PharmGKB identifier for this variant
Variant Name The PharmGKB name for this variant
Gene IDs The PharmGKB identifiers for genes associated with this variant
Gene Symbols The HGNC symbols for genes associated with this variant
Location The location of this variation on a reference sequence (either RefSeq or GenBank), if available. HGVS format when applicable
Variant Annotation count The count of Variant Annotations done on this variant
Clinical Annotation count The count of all Clinical Annotations done on this variant
Level 1/2 Clinical Annotation count The count of Level 1 or Level 2 ("top") Clinical Annotations done on this variant
Guideline Annotation count The count of Dosing Guideline Annotations of which this variant is a part
Label Annotation count The count of Drug Label Annotations in which this variant is mentioned
Synonyms A comma-separated list of synonyms for this variant. Includes HGVS names, retired RSIDs, and other names

👁 Epigenetic Information

ENCODE

  • Reference: Davis C A, Hitz B C, Sloan C A, et al. The Encyclopedia of DNA elements (ENCODE): data portal update[J]. Nucleic acids research, 2018, 46(D1): D794-D801.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: Chemical modifications (e.g., methylation and acetylation) to the histone proteins present in chromatin influence gene expression by changing how accessible the chromatin is to transcription.
Field Name Description
EncodeH3K4me1-sum Sum of Encode H3K4me1 levels (from 13 cell lines) (default: 0.76)
EncodeH3K4me1-max Maximum Encode H3K4me1 level (from 13 cell lines) (default: 0.37)
EncodeH3K4me2-sum Sum of Encode H3K4me2 levels (from 14 cell lines) (default: 0.73)
EncodeH3K4me2-max Maximum Encode H3K4me2 level (from 14 cell lines) (default: 0.37)
EncodeH3K4me3-sum Sum of Encode H3K4me3 levels (from 14 cell lines) (default: 0.81)
EncodeH3K4me3-max Maximum Encode H3K4me3 level (from 14 cell lines) (default: 0.38)
EncodeH3K9ac-sum Sum of Encode H3K9ac levels (from 13 cell lines) (default: 0.82)
EncodeH3K9ac-max Maximum Encode H3K9ac level (from 13 cell lines) (default: 0.41)
EncodeH3K9me3-sum Sum of Encode H3K9me3 levels (from 14 cell lines) (default: 0.81)
EncodeH3K9me3-max Maximum Encode H3K9me3 level (from 14 cell lines) (default: 0.38)
EncodeH3K27ac-sum Sum of Encode H3K27ac levels (from 14 cell lines) (default: 0.74)
EncodeH3K27ac-max Maximum Encode H3K27ac level (from 14 cell lines) (default: 0.36)
EncodeH3K27me3-sum Sum of Encode H3K27me3 levels (from 14 cell lines) (default: 0.93)
EncodeH3K27me3-max Maximum Encode H3K27me3 level (from 14 cell lines) (default: 0.47)
EncodeH3K36me3-sum Sum of Encode H3K36me3 levels (from 10 cell lines) (default: 0.71)
EncodeH3K36me3-max Maximum Encode H3K36me3 level (from 10 cell lines) (default: 0.39)
EncodeH3K79me2-sum Sum of Encode H3K79me2 levels (from 13 cell lines) (default: 0.64)
EncodeH3K79me2-max Maximum Encode H3K79me2 level (from 13 cell lines) (default: 0.34)
EncodeH4K20me1-sum Sum of Encode H4K20me1 levels (from 11 cell lines) (default: 0.88)
EncodeH4K20me1-max Maximum Encode H4K20me1 level (from 11 cell lines) (default: 0.47)
EncodeH2AFZ-sum Sum of Encode H2AFZ levels (from 13 cell lines) (default: 0.9)
EncodeH2AFZ-max Maximum Encode H2AFZ level (from 13 cell lines) (default: 0.42)
EncodeDNase-sum Sum of Encode DNase-seq levels (from 12 cell lines) (default: 0.0)
EncodeDNase-max Maximum Encode DNase-seq level (from 12 cell lines) (default: 0.0)
EncodetotalRNA-sum Sum of Encode totalRNA-seq levels (from 10 cell lines always minus and plus strand) (default: 0.0)
EncodetotalRNA-max Maximum Encode totalRNA-seq level (from 10 cell lines, minus and plus strand separately) (default: 0.0)

chromHMM

  • Reference: Ernst J, Kellis M. Chromatin-state discovery and genome annotation with ChromHMM[J]. Nature protocols, 2017, 12(12): 2478-2492.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: ChromHMM annotates the noncoding genome using epigenomic data across multiple cell types by employing a multivariate hidden Markov model to infer chromatin-state signatures, generating genome-wide annotations and facilitating functional interpretations through automated enrichment analysis.
Field Name Description
cHmm_E1 Number of 48 cell types in chromHMM state E1_poised (default: 1.92*)
cHmm_E2 Number of 48 cell types in chromHMM state E2_repressed (default: 1.92)
cHmm_E3 Number of 48 cell types in chromHMM state E3_dead (default: 1.92)
cHmm_E4 Number of 48 cell types in chromHMM state E4_dead (default: 1.92)
cHmm_E5 Number of 48 cell types in chromHMM state E5_repressed (default: 1.92)
cHmm_E6 Number of 48 cell types in chromHMM state E6_repressed (default: 1.92)
cHmm_E7 Number of 48 cell types in chromHMM state E7_weak (default: 1.92)
cHmm_E8 Number of 48 cell types in chromHMM state E8_gene (default: 1.92)
cHmm_E9 Number of 48 cell types in chromHMM state E9_gene (default: 1.92)
cHmm_E10 Number of 48 cell types in chromHMM state E10_gene (default: 1.92)
cHmm_E11 Number of 48 cell types in chromHMM state E11_gene (default: 1.92)
cHmm_E12 Number of 48 cell types in chromHMM state E12_distal (default: 1.92)
cHmm_E13 Number of 48 cell types in chromHMM state E13_distal (default: 1.92)
cHmm_E14 Number of 48 cell types in chromHMM state E14_distal (default: 1.92)
cHmm_E15 Number of 48 cell types in chromHMM state E15_weak (default: 1.92)
cHmm_E16 Number of 48 cell types in chromHMM state E16_tss (default: 1.92)
cHmm_E17 Number of 48 cell types in chromHMM state E17_proximal (default: 1.92)
cHmm_E18 Number of 48 cell types in chromHMM state E18_proximal (default: 1.92)
cHmm_E19 Number of 48 cell types in chromHMM state E19_tss (default: 1.92)
cHmm_E20 Number of 48 cell types in chromHMM state E20_poised (default: 1.92)
cHmm_E21 Number of 48 cell types in chromHMM state E21_dead (default: 1.92)
cHmm_E22 Number of 48 cell types in chromHMM state E22_repressed (default: 1.92)
cHmm_E23 Number of 48 cell types in chromHMM state E23_weak (default: 1.92)
cHmm_E24 Number of 48 cell types in chromHMM state E24_distal (default: 1.92)
cHmm_E25 Number of 48 cell types in chromHMM state E25_distal (default: 1.92)

❗ ORegAnno

The ORegAnno website is no longer available, and WGSA, which provides this data, only describes the two fields annotated below.

  • Reference: Lesurf, R. et al. ORegAnno 3.0: a community-driven resource for curated regulatory annotation. Nucleic Acids Res. 44, D126-132 (2016).
  • Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
  • Brief Introduction: The Open Regulatory Annotation database (ORegAnno) is a resource for curated regulatory annotation. It contains information about regulatory regions, transcription factor binding sites, RNA binding sites, regulatory variants, haplotypes, and other regulatory elements.
Field Name Description
#Chrom
Start
End
ORegAnno_ID
Species
Outcome
Type The type of regulatory region by ORegAnno
Gene_Symbol
Gene_ID
Gene_Source
Regulatory_Element_Symbol
Regulatory_Element_ID
Regulatory_Element_Source
dbSNP_ID
PMID The PMID of the paper describing the regulation
Dataset
Build
Strand

DICE

  • Reference: Schmiedel B J, Singh D, Madrigal A, et al. Impact of genetic polymorphisms on human immune cell gene expression[J]. Cell, 2018, 175(6): 1701-1715. e16.
  • Retrieve Source: https://dice-database.org/downloads , 2.23.2022
  • Brief Introduction: The DICE project aims to elucidate the role of common genetic variations in human disease by creating reference transcriptomic and epigenomic maps of immune cells, identifying functional SNPs affecting gene expression, and investigating regulatory mechanisms and cell-type specific effects, including those influenced by sex, to reveal how disease risk-associated polymorphisms impact pathogenesis.
Field Name Description
DICE_rs_ID dbSNP rsID
DICE_FILTER Filter status
DICE_Cell_Type Different cell type reported in DICE
DICE_Gene Ensembl ID
DICE_GeneSymbol Gene symbol
DICE_Pvalue P-value of the eQTL association
DICE_Beta The beta value indicates if expression for the alt allele is higher (if beta is positive) or lower (if beta is negative)

Geuvadis

  • Reference: Lappalainen T, Sammeth M, Friedländer M R, et al. Transcriptome and genome sequencing uncovers functional variation in humans[J]. Nature, 2013, 501(7468): 506-511.
  • Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
  • Brief Introduction: Geuvadis provides the first uniformly processed RNA-seq dataset from 462 individuals across multiple populations, revealing extensive genetic variation in gene regulation and providing insights into causal regulatory mechanisms and disease-associated loci.
Field Name Description
Geuvadis_eQTL_target_gene Ensembl ID of the gene that the eQTL is associated with, from the Geuvadis project

GTEx

Field Name Description
variant_id variant ID in the format {chr}_{pos}_{ref_base}_{ref_seq}/{alt_seq}
gene_id GENCODE/Ensembl gene ID
tss_distance distance between variant and transcription start site. Positive when variant is downstream of the TSS, negative otherwise
ma_samples number of samples carrying the minor allele
ma_count total number of minor alleles across individuals
maf minor allele frequency observed in the set of donors for a given tissue
pval_nominal nominal p-value of the variant-gene association
slope regression slope
slope_se standard error of the regression slope
pval_nominal_threshold nominal p-value threshold for calling a variant-gene pair significant for the gene
min_pval_nominal smallest nominal p-value for the gene
pval_beta beta-approximated permutation p-value for the gene
tissue_type Human tissue in GTEx in which the association was observed

Transcription Factor

  • Reference: Rentzsch P, Witten D, Cooper G M, et al. CADD: predicting the deleteriousness of variants throughout the human genome[J]. Nucleic Acids Research, 2019, 47(D1): D886-D894.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: Transcription-factor-related information retrieved from CADD v1.7.
Field Name Description
RemapOverlapTF Remap number of different transcription factors binding (default: -0.5)
RemapOverlapCL Remap number of different transcription factor - cell line combinations binding (default: -0.5)

GeneHancer

  • Reference: Fishilevich S, Nudel R, Rappaport N, et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards[J]. Database, 2017, 2017: bax028.
  • Retrieve Source: https://favor.genohub.org/
  • Brief Introduction: GeneHancer predictions are fully integrated in the widely used GeneCards Suite, whereby candidate enhancers and their annotations are displayed on every relevant GeneCard.
Field Name Description
GeneHancer Predicted human enhancer sites from the GeneHancer database.

Super Enhancer

  • Reference: Hnisz D, Abraham B J, Lee T I, et al. Super-enhancers in the control of cell identity and disease[J]. Cell, 2013, 155(4): 934-947.
  • Retrieve Source: https://favor.genohub.org/
  • Brief Introduction: This study produced a catalog of super-enhancers in a broad range of human cell types and found that super-enhancers associate with genes that control and define the biology of these cells.
Field Name Description
Super Enhancer Predicted super-enhancer sites and targets in a range of human cell types.

Enhancer Finder

  • Reference: Erwin G D, Oksenberg N, Truty R M, et al. Integrating diverse datasets improves developmental enhancer prediction[J]. PLoS computational biology, 2014, 10(6): e1003677.
  • Retrieve Source: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003677#references
  • Brief Introduction: EnhancerFinder integrates DNA sequence motifs, evolutionary patterns, and functional genomics data to predict developmental enhancers and their tissue specificity, which outperforms single-data approaches, identifies 84,301 enhancers genome-wide, and provides functional annotations enriched near relevant genes and GWAS lead SNPs, with predictions validated in vivo and available as a UCSC Genome Browser track.
Field Name Description
Enhancer_Finder_General_Prediction_MKL_Scores Whether the site is within a predicted general developmental enhancer, along with MKL scores.
Enhancer_Finder_General_Prediction_H3K27ac_H3K4me1_Contexts The H3K27ac and H3K4me1 marks from the feature data overlapping each predicted enhancer.
Enhancer_Finder_Limb_MKL_Scores Whether the site is within a predicted limb-specific enhancer, along with MKL scores.
Enhancer_Finder_Brain_MKL_Scores Whether the site is within a predicted brain-specific enhancer, along with MKL scores.
Enhancer_Finder_Heart_MKL_Scores Whether the site is within a predicted heart-specific enhancer, along with MKL scores.

CAGE Promoter

  • Reference: The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas[J]. Nature, 2014, 507(7493): 462-470.
  • Retrieve Source: https://favor.genohub.org/
  • Brief Introduction: Using single-molecule cDNA sequencing, the FANTOM5 project mapped transcription start sites (TSSs) in human and mouse cells. The resulting atlas reveals few 'housekeeping' genes, many composite promoters with cell-type-specific TSSs, and differing evolutionary rates for TSSs, links key transcription factors to cell states, and provides comprehensive mammalian cell-type-specific transcriptome profiles for biomedical research.
Field Name Description
cage_promoter CAGE defined promoter sites from Fantom 5
cage_tc CAGE tag cluster

CAGE Enhancer

  • Reference: Andersson R, Gebhard C, Miguel-Escalada I, et al. An atlas of active enhancers across human cell types and tissues[J]. Nature, 2014, 507(7493): 455-461.
  • Retrieve Source: https://favor.genohub.org/
  • Brief Introduction: CAGE Enhancer utilizes the FANTOM5 panel of samples, covering the majority of human tissues and cell types, to produce an atlas of active, in vivo-transcribed enhancers.
Field Name Description
cage_enhancer CAGE defined permissive Enhancer sites from Fantom 5

snoRNABase/miRBase

  • Reference:
    1. miRBase: Kozomara, A. & Griffiths-Jones, S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 42, D68–D73 (2014).
    2. snoRNABase: Lestrade, L. & Weber, M. J. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 34, D158-162 (2006).
  • Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
  • Brief Introduction:
    • miRBase: The miRBase database contains 24,521 microRNA loci from 206 species and includes a high-confidence subset based on deep sequencing data.
    • snoRNABase: The snoRNA-LBME-db is an online database containing experimentally verified and predicted human C/D box and H/ACA box snoRNAs, and scaRNAs, which guide RNA modifications and maturation, providing detailed annotations, predicted base pairings.
Field Name Description
sno_miRNA_name Name of the snoRNA or miRNA in which the site is located (from miRBase/snoRNABase)
sno_miRNA_type Type of the snoRNA or miRNA (from miRBase/snoRNABase)

👥 Allele Frequency

gnomAD

  • Reference: Chen S, Francioli L C, Goodrich J K, et al. A genomic mutational constraint map using variation in 76,156 human genomes[J]. Nature, 2024, 625(7993): 92-100.
  • Retrieve Source: https://gnomad.broadinstitute.org/news/2023-11-gnomad-v4-0/
  • Brief Introduction: The gnomAD database is composed of exome and genome sequences from around the world. We have removed cohorts that were recruited for pediatric disease, except for a small number of diverse cohorts where we have included unaffected relatives.
Field Name Description
exomes_AF Exomes Alternate allele frequency
exomes_AF_XX Exomes Alternate allele frequency in XX samples
exomes_AF_XY Exomes Alternate allele frequency in XY samples
exomes_AF_afr_XX Exomes Alternate allele frequency in XX samples of African/African-American ancestry
exomes_AF_afr_XY Exomes Alternate allele frequency in XY samples of African/African-American ancestry
exomes_AF_afr Exomes Alternate allele frequency in samples of African/African-American ancestry
exomes_AF_amr_XX Exomes Alternate allele frequency in XX samples of Latino ancestry
exomes_AF_amr_XY Exomes Alternate allele frequency in XY samples of Latino ancestry
exomes_AF_amr Exomes Alternate allele frequency in samples of Latino ancestry
exomes_AF_asj_XX Exomes Alternate allele frequency in XX samples of Ashkenazi Jewish ancestry
exomes_AF_asj_XY Exomes Alternate allele frequency in XY samples of Ashkenazi Jewish ancestry
exomes_AF_asj Exomes Alternate allele frequency in samples of Ashkenazi Jewish ancestry
exomes_AF_eas_XX Exomes Alternate allele frequency in XX samples of East Asian ancestry
exomes_AF_eas_XY Exomes Alternate allele frequency in XY samples of East Asian ancestry
exomes_AF_eas Exomes Alternate allele frequency in samples of East Asian ancestry
exomes_AF_fin_XX Exomes Alternate allele frequency in XX samples of Finnish ancestry
exomes_AF_fin_XY Exomes Alternate allele frequency in XY samples of Finnish ancestry
exomes_AF_fin Exomes Alternate allele frequency in samples of Finnish ancestry
exomes_AF_mid_XX Exomes Alternate allele frequency in XX samples of Middle Eastern ancestry
exomes_AF_mid_XY Exomes Alternate allele frequency in XY samples of Middle Eastern ancestry
exomes_AF_mid Exomes Alternate allele frequency in samples of Middle Eastern ancestry
exomes_AF_nfe_XX Exomes Alternate allele frequency in XX samples of Non-Finnish European ancestry
exomes_AF_nfe_XY Exomes Alternate allele frequency in XY samples of Non-Finnish European ancestry
exomes_AF_nfe Exomes Alternate allele frequency in samples of Non-Finnish European ancestry
genomes_AF Genomes Alternate allele frequency
genomes_AF_XX Genomes Alternate allele frequency in XX samples
genomes_AF_XY Genomes Alternate allele frequency in XY samples
genomes_AF_afr_XX Genomes Alternate allele frequency in XX samples of African/African-American ancestry
genomes_AF_afr_XY Genomes Alternate allele frequency in XY samples of African/African-American ancestry
genomes_AF_afr Genomes Alternate allele frequency in samples of African/African-American ancestry
genomes_AF_amr_XX Genomes Alternate allele frequency in XX samples of Latino ancestry
genomes_AF_amr_XY Genomes Alternate allele frequency in XY samples of Latino ancestry
genomes_AF_amr Genomes Alternate allele frequency in samples of Latino ancestry
genomes_AF_asj_XX Genomes Alternate allele frequency in XX samples of Ashkenazi Jewish ancestry
genomes_AF_asj_XY Genomes Alternate allele frequency in XY samples of Ashkenazi Jewish ancestry
genomes_AF_asj Genomes Alternate allele frequency in samples of Ashkenazi Jewish ancestry
genomes_AF_eas_XX Genomes Alternate allele frequency in XX samples of East Asian ancestry
genomes_AF_eas_XY Genomes Alternate allele frequency in XY samples of East Asian ancestry
genomes_AF_eas Genomes Alternate allele frequency in samples of East Asian ancestry
genomes_AF_fin_XX Genomes Alternate allele frequency in XX samples of Finnish ancestry
genomes_AF_fin_XY Genomes Alternate allele frequency in XY samples of Finnish ancestry
genomes_AF_fin Genomes Alternate allele frequency in samples of Finnish ancestry
genomes_AF_mid_XX Genomes Alternate allele frequency in XX samples of Middle Eastern ancestry
genomes_AF_mid_XY Genomes Alternate allele frequency in XY samples of Middle Eastern ancestry
genomes_AF_mid Genomes Alternate allele frequency in samples of Middle Eastern ancestry
genomes_AF_nfe_XX Genomes Alternate allele frequency in XX samples of Non-Finnish European ancestry
genomes_AF_nfe_XY Genomes Alternate allele frequency in XY samples of Non-Finnish European ancestry
genomes_AF_nfe Genomes Alternate allele frequency in samples of Non-Finnish European ancestry
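
The gnomAD field names above follow a regular pattern: a dataset prefix (exomes or genomes), `AF`, an optional population code, and an optional sex karyotype (XX or XY). The Python sketch below rebuilds the column list from that pattern, which can be handy for selecting these columns programmatically. The function name `gnomad_af_columns` is hypothetical, and the pattern is inferred from the table rather than taken from gnomAD documentation.

```python
DATASETS = ("exomes", "genomes")
POPULATIONS = ("afr", "amr", "asj", "eas", "fin", "mid", "nfe")
SEXES = ("XX", "XY")


def gnomad_af_columns():
    """Build the allele-frequency column names in the order listed in the table."""
    columns = []
    for dataset in DATASETS:
        # Overall frequency, then the sex-stratified frequencies.
        columns.append(f"{dataset}_AF")
        columns.extend(f"{dataset}_AF_{sex}" for sex in SEXES)
        for population in POPULATIONS:
            # Sex-stratified frequencies come before the population-wide one.
            columns.extend(f"{dataset}_AF_{population}_{sex}" for sex in SEXES)
            columns.append(f"{dataset}_AF_{population}")
    return columns


columns = gnomad_af_columns()
print(len(columns), columns[:4])
```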

UK10K

  • Reference: UK10K Consortium. The UK10K project identifies rare variants in health and disease[J]. Nature, 2015, 526(7571): 82-90.
  • Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
  • Brief Introduction: The UK10K project will enable researchers in the UK and beyond to better understand the link between low-frequency and rare genetic changes, and human disease caused by harmful changes to the proteins the body makes.
Field Name Description
RS_ID dbSNP ID.
DP -
VQSLOD -
AC Alternative allele count in called genotypes in UK10K cohorts.
AN Total allele count in called genotypes in UK10K cohorts.
AF Alternative allele frequency in called genotypes in UK10K cohorts.
AC_TWINSUK Alternative allele count in called genotypes in UK10K TWINSUK cohort.
AN_TWINSUK Total allele count in called genotypes in UK10K TWINSUK cohort.
AF_TWINSUK Alternative allele frequency in called genotypes in UK10K TWINSUK cohort.
AC_ALSPAC Alternative allele count in called genotypes in UK10K ALSPAC cohort.
AN_ALSPAC Total allele count in called genotypes in UK10K ALSPAC cohort.
AF_ALSPAC Alternative allele frequency in called genotypes in UK10K ALSPAC cohort.
AF_AFR -
AF_AMR -
AF_ASN -
AF_EUR -
AF_MAX -
ESP_MAF -
CSQ Consequence of the given variant, e.g. ENST00000342066:SAMD11:synonymous_variant:21:7:Q>Q
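
The count and frequency fields above are related by AF = AC / AN. The following is a minimal Python sketch of that relationship, assuming a record is available as a dict keyed by the field names in the table. The dict layout, the helper name `allele_frequency`, and the numbers are illustrative only and are not part of SynMall or UK10K.

```python
def allele_frequency(ac, an):
    """Return AC / AN, or None when no genotypes were called (AN == 0)."""
    if an == 0:
        return None
    return ac / an


# Hypothetical record; the counts are made up for illustration.
record = {
    "AC": 3, "AN": 7562,
    "AC_TWINSUK": 1, "AN_TWINSUK": 3708,
    "AC_ALSPAC": 2, "AN_ALSPAC": 3854,
}

af_all = allele_frequency(record["AC"], record["AN"])
af_twinsuk = allele_frequency(record["AC_TWINSUK"], record["AN_TWINSUK"])
af_alspac = allele_frequency(record["AC_ALSPAC"], record["AN_ALSPAC"])
```

Returning `None` for AN == 0 keeps "no data" distinct from a true frequency of 0.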

ExAC

Field Name Description
ExAC_ALL Allele frequency in total ExAC samples
ExAC_AFR Allele frequency in African & African American ExAC samples
ExAC_AMR Allele frequency in American ExAC samples
ExAC_EAS Allele frequency in East Asian ExAC samples
ExAC_FIN Allele frequency in Finnish ExAC samples
ExAC_NFE Allele frequency in Non-Finnish European ExAC samples
ExAC_OTH Allele frequency in other ExAC samples
ExAC_SAS Allele frequency in South Asian ExAC samples
ExAC_nonpsych_ALL Allele frequency in total ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_AFR Allele frequency in African & African American ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_AMR Allele frequency in American ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_EAS Allele frequency in East Asian ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_FIN Allele frequency in Finnish ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_NFE Allele frequency in Non-Finnish European ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_OTH Allele frequency in other ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_SAS Allele frequency in South Asian ExAC samples excluding psychiatric cohorts
ExAC_nonTCGA_QUAL Phred-scaled quality score for the assertion made in ALT
ExAC_nonTCGA_FILTER PASS if this position has passed all filters
ExAC_nonTCGA_ALL Allele frequency in total ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_AFR Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in African & African American ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_AMR Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in American ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_EAS Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in East Asian ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_FIN Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in Finnish ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_NFE Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in Non-Finnish European ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_Adj Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in total ExAC samples excluding TCGA cohorts
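
The "adjusted" frequencies above count only genotypes that pass the DP >= 10 and GQ >= 20 thresholds. The sketch below illustrates that filtering idea under stated assumptions: diploid genotypes, one dict per sample with hypothetical keys `alt_alleles`, `dp`, and `gq`. It is not the ExAC pipeline itself.

```python
def adjusted_af(genotypes, min_dp=10, min_gq=20):
    """Allele frequency over genotypes with DP >= min_dp and GQ >= min_gq.

    Each genotype is a dict with hypothetical keys:
      alt_alleles: number of alternate alleles (0, 1 or 2; diploid assumed)
      dp: read depth, gq: genotype quality
    Returns None if no genotype passes the filters.
    """
    kept = [g for g in genotypes if g["dp"] >= min_dp and g["gq"] >= min_gq]
    allele_number = 2 * len(kept)
    if allele_number == 0:
        return None
    allele_count = sum(g["alt_alleles"] for g in kept)
    return allele_count / allele_number


# Made-up genotypes: the second fails DP, the fourth fails GQ.
example_genotypes = [
    {"alt_alleles": 1, "dp": 30, "gq": 99},
    {"alt_alleles": 0, "dp": 5, "gq": 99},
    {"alt_alleles": 2, "dp": 12, "gq": 20},
    {"alt_alleles": 1, "dp": 40, "gq": 10},
]
example_af = adjusted_af(example_genotypes)
```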

Kaviar

Field Name Description
Kaviar_AF Allele frequency in Kaviar
Kaviar_AC Allele count in Kaviar
Kaviar_AN Total allele number in Kaviar

GME

  • Reference: Scott E M, Halees A, Itan Y, et al. Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery[J]. Nature genetics, 2016, 48(9): 1071-1076.
  • Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
  • Brief Introduction:
Field Name Description
GME_AF
GME_NWA
GME_NEA
GME_AP
GME_Israel
GME_SD
GME_TP
GME_CA

NCI-60

Field Name Description
NCI60_AF

AbraOM

  • Reference: Naslavsky M S, Yamamoto G L, de Almeida T F, et al. Exomic variants of an elderly cohort of Brazilians in the ABraOM database[J]. Human mutation, 2017, 38(7): 751-763.
  • Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
  • Brief Introduction:
Field Name Description
ABRAOM_AF
ABRAOM_Filter
ABRAOM_Cegh_Filter

ESP6500

  • Reference: Fu W, O'Connor T D, Jun G, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants[J]. Nature, 2013, 493(7431): 216-220.
  • Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
  • Brief Introduction:
Field Name Description
esp6500siv2_all
esp6500siv2_aa
esp6500siv2_ea

TOPMed BRAVO

  • Reference: Taliun D, Harris D N, Kessler M D, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program[J]. Nature, 2021, 590(7845): 290-299.
  • Retrieve Source: https://favor.genohub.org/
  • Brief Introduction:
Field Name Description
bravo_an TOPMed Bravo Genome Allele Number.
bravo_af TOPMed Bravo Genome Allele Frequency.
filter_status TOPMed QC status of the given variant.

Primate

Allele frequencies from five primate species, as used in AlphaMissense. We have mapped them to GRCh38.

Field Name Reference Retrieve Source
Bonobos_AF Genetic variation in Pan species is shaped by demographic history and harbors lineage-specific functions[J]. Genome biology and evolution, 2019, 11(4): 1178-1191. https://figshare.com/articles/dataset/Han_etal_Data_tsv_gz/7855850
Gorilla_AF Great ape genetic diversity and population history[J]. Nature, 2013, 499(7459): 471-475. https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Gorilla.vcf.gz
Pan_troglodytes_AF Same as above https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Pan_troglodytes.vcf.gz
Pongo_pygmaeus_AF Same as above https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Pongo_pygmaeus.vcf.gz
Pongo_abelii_AF Same as above https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Pongo_abelii.vcf.gz

🧬 Conservation Score

siPhy

  • Reference: Garber M, Guttman M, Clamp M, et al. Identifying novel constrained elements by exploiting biased substitution patterns[J]. Bioinformatics, 2009, 25(12): i54-i62.
  • Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
  • Brief Introduction: siPhy leverages deeply sequenced clades to identify evolutionary selection by detecting both rate-based conservation and substitution patterns indicative of natural selection, employing a statistical method for biased nucleotide substitutions, a learning algorithm to infer site-specific biases from sequence alignments, and a hidden Markov model to detect constrained elements.
Field Name Description
siPhy_rankscore The rank of the SiPhy_29way_logOdds score among all SiPhy_29way_logOdds scores in genome

bStatistic

  • Reference: McVicker G, Gordon D, Davis C, et al. Widespread Genomic Signatures of Natural Selection in Hominid Evolution[J]. PLoS genetics, 2009, 5(5): e1000471.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: Selection on genomic functional elements can be detected by its effects on population diversity at linked neutral sites, as shown by our analysis of human polymorphisms and sequence differences among five primate species relative to conserved sequence features.
Field Name Description
bStatistic Background selection (B) value estimate. Ranges from 0 to 1000. It estimates the expected fraction (×1000) of neutral diversity present at a site. Values close to 0 represent near-complete removal of diversity as a result of background selection, and values near 1000 indicate the absence of background selection.
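
Because the B value is a fraction scaled by 1000, dividing by 1000 recovers the expected fraction of neutral diversity remaining at a site. A small Python sketch of that conversion follows; the helper name `b_to_fraction` is hypothetical.

```python
def b_to_fraction(b_value):
    """Convert a bStatistic value (0-1000) to the expected fraction (0-1)
    of neutral diversity remaining at the site."""
    if not 0 <= b_value <= 1000:
        raise ValueError(f"bStatistic out of range [0, 1000]: {b_value}")
    return b_value / 1000


# 850 -> 0.85 of neutral diversity retained (weak background selection);
# 100 -> 0.10 retained (strong background selection).
weak = b_to_fraction(850)
strong = b_to_fraction(100)
```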

FitCons

  • Reference: Gulko B, Melissa J. Hubisz, Gronau I, Siepel A (2015). Probabilities of fitness consequences for point mutations across the human genome. Nature Genetics, 47, 276-283.
  • Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
  • Brief Introduction: FitCons, a novel computational method, estimates the probability that a point mutation at each genome position will influence fitness, using high-throughput functional genomic data to cluster genomic positions and assess fitness consequences.
Field Name Description
integrated_fitCons_score FitCons scores (i6) based on function evidence from multiple cell types, the higher the score the more potential for interesting genomic function
integrated_confidence_value Confidence value for the integrated_fitCons_score:
0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05),
2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)
GM12878_fitCons_score FitCons scores (gm) based on function evidence from the GM12878 cell type, the higher the score the more potential for interesting genomic function
GM12878_confidence_value Confidence value for the GM12878_fitCons_score:
0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05),
2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)
H1-hESC_fitCons_score FitCons scores (h1) based on function evidence from the H1-hESC cell type, the higher the score the more potential for interesting genomic function
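H1-hESC_confidence_value Confidence value for the H1-hESC_fitCons_score:
0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05),
2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)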
HUVEC_fitCons_score FitCons scores (hu) based on function evidence from the HUVEC cell type, the higher the score the more potential for interesting genomic function
HUVEC_confidence_value Confidence value for the HUVEC_fitCons_score:
0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05),
2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)
integrated_fitCons_score_rankscore Rank of the integrated_fitCons_score among all integrated_fitCons_scores in genome
GM12878_fitCons_score_rankscore Rank of the GM12878_fitCons_score among all GM12878_fitCons_scores in genome
H1-hESC_fitCons_score_rankscore Rank of the H1-hESC_fitCons_score among all H1-hESC_fitCons_scores in genome
HUVEC_fitCons_score_rankscore Rank of the HUVEC_fitCons_score among all HUVEC_fitCons_scores in genome
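
The confidence values in the table are integer codes from 0 to 3. The sketch below shows one way to decode them and to keep only scores at or above a chosen confidence level. The function names and the `(score, confidence)` tuple layout are assumptions made for illustration.

```python
# Labels copied from the confidence-value descriptions above.
CONFIDENCE_LABELS = {
    0: "High confidence (p < ~0.003)",
    1: "Likely significant (p < 0.05)",
    2: "Likely informative (p < 0.25)",
    3: "Best estimate (p >= 0.25)",
}


def describe_confidence(value):
    """Return the label for a FitCons confidence code (0-3)."""
    return CONFIDENCE_LABELS[int(value)]


def keep_confident(scored, max_code=1):
    """Keep (score, confidence_code) pairs whose code is <= max_code.

    Lower codes mean higher confidence, so max_code=1 keeps codes 0 and 1.
    """
    return [(score, code) for score, code in scored if code <= max_code]


# Made-up (score, confidence) pairs.
pairs = [(0.71, 0), (0.55, 2), (0.62, 1), (0.08, 3)]
confident = keep_confident(pairs)
```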

PhastCons

  • Reference: Siepel A, Bejerano G, Pedersen J S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes[J]. Genome research, 2005, 15(8): 1034-1050.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: PhastCons, a program based on a two-state phylogenetic hidden Markov model, was used to conduct a comprehensive search for conserved elements across vertebrate genomes, utilizing genome-wide alignments of five vertebrate species, four insect species, two Caenorhabditis species, and seven Saccharomyces species.
Field Name Description
priPhCons Primate PhastCons conservation score (excl. human) (default: 0.0)
mamPhCons Mammalian PhastCons conservation score (excl. human) (default: 0.0)
verPhCons Vertebrate PhastCons conservation score (excl. human) (default: 0.0)

PhyloP

  • Reference: Pollard K S, Hubisz M J, Rosenbloom K R, et al. Detection of nonneutral substitution rates on mammalian phylogenies[J]. Genome research, 2010, 20(1): 110-121.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: PhyloP addresses the broader problem of detecting departures from neutral nucleotide substitution rates in either direction, potentially in a clade-specific manner, using four statistical tests (likelihood ratio, score, exact distributions, GERP).
Field Name Description
priPhyloP Primate PhyloP score (excl. human) (default: -0.029)
mamPhyloP Mammalian PhyloP score (excl. human) (default: -0.005)
verPhyloP Vertebrate PhyloP score (excl. human) (default: 0.042)

GERP++

  • Reference: Davydov E V, Goode D L, Sirota M, et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++[J]. PLoS computational biology, 2010, 6(12): e1001025.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: GERP++ uses maximum likelihood evolutionary rate estimation for position-specific scoring. In contrast to previous bottom-up methods, it employs a novel dynamic programming approach to subsequently define constrained elements.
Field Name Description
GerpRS Gerp element score (default: 0)
GerpRSpval Gerp element p-Value (default: 0)
GerpN Neutral evolution score defined by GERP++ (default: 3.0)
GerpS Rejected Substitution score defined by GERP++ (default: -0.2)

Zoonomia

  • Reference: Christmas M J, Kaplow I M, Genereux D P, et al. Evolutionary constraint and innovation across hundreds of placental mammals[J]. Science, 2023, 380(6643): eabn3943.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction: Zoonomia, the largest comparative genomics resource for mammals, aligns genomes of 240 species to identify bases likely affecting fitness and disease risk, revealing 332 million evolutionarily constrained bases in the human genome, with many outside protein-coding exons, and associating changes in genes and regulatory elements with unique mammalian traits that could inform therapeutic development.
Field Name Description
ZooPriPhyloP Zoonomia Primate PhyloP conservation score (43 genomes) (default: 0.005)
ZooVerPhyloP Zoonomia Vertebrate PhyloP conservation score (241 vertebrate genomes) (default: -0.1460)
ZooRoCC Zoonomia Runs of Contiguous Constraint (default: 0)
ZooUCE Zoonomia UltraConserved Elements (default: 0)

👶🏻 De novo Variants

De novo mutations (DNMs) are variants observed in an individual but not seen in either parent. Variants of this type have been reported to play prominent roles in several genetic diseases.
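
The definition above can be expressed as a simple check on trio genotypes. The following Python sketch assumes each genotype is given as a tuple of alleles at one site. The function name `is_de_novo` is hypothetical, and a real DNM caller would also account for genotype quality, depth, and mosaicism.

```python
def is_de_novo(child, mother, father, alt):
    """Return True if `alt` is present in the child but in neither parent.

    Each genotype is a tuple of allele strings, e.g. ("C", "T").
    """
    return alt in child and alt not in mother and alt not in father


# Child carries T; both parents are homozygous reference: candidate DNM.
candidate = is_de_novo(("C", "T"), ("C", "C"), ("C", "C"), "T")

# Mother also carries T: the allele is inherited, not de novo.
inherited = is_de_novo(("C", "T"), ("C", "T"), ("C", "C"), "T")
```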

❗ Gene4Denovo

The Gene4Denovo website and publication do not describe the field names below; this annotation comes from ANNOVAR.

  • Reference: Zhao G, Li K, Li B, et al. Gene4Denovo: an integrated database and analytic platform for de novo mutations in humans[J]. Nucleic acids research, 2020, 48(D1): D913-D926.
  • Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
  • Brief Introduction: Gene4Denovo integrated 580 799 DNMs, including 30 060 coding DNMs detected by WES/WGS from 23 951 individuals across 24 phenotypes and prioritized a list of candidate genes with different degrees of statistical evidence, including 346 genes with false discovery rates <0.05.
Field Name Description
DN ID The variant identifier in Gene4Denovo, such as dn65354.
Patient ID
Phenotype Annotated information about gene function according to OMIM, ClinVar, denovo-db, MGI, HPO.
Platform
Study
Pubmed ID

denovo-db

  • Reference: Turner T N, Yi Q, Krumm N, et al. denovo-db: A compendium of human de novo variants[J]. Nucleic acids research, 2017, 45(D1): D804-D811.
  • Retrieve Source: https://denovo-db.gs.washington.edu/denovo-db/ (we retrieved only non-SSC samples because of the denovo-db terms of use).
  • Brief Introduction: denovo-db contained 40 different studies and 32,991 de novo variants from 23,098 trios.
Field Name Description
SAMPLE_CT Observed Sample Count
NumProbands The total number of probands involved in the study.
SampleIDs If some type of sample identifier is given in the study we use that exactly. If there is no sample identifier we use the name of the study and start numbering such that every variant has a unique sample identifier.
SequenceType The sequence type used in the study.
Validation The validation status describes the result of some orthogonal validation method (for example Sanger sequencing). The values are either yes or unknown meaning either valid or not known, respectively. Any variants that are not valid are removed early in the pipeline and are not represented in denovo-db.
PrimaryPhenotype The primary phenotype is the main phenotype of the patient for inclusion in the study.
StudyName This is the name of the study.
PubmedId Pubmed ID for the study publication.
NumControls The total number of controls involved in the study.

📦 Other

Local Nucleotide Diversity

  • Reference: Gazal, S., Finucane, H., Furlotte, N. et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat Genet 49, 1421–1427 (2017).
  • Retrieve Source: https://favor.genohub.org/
  • Brief Introduction:
Field Name Description
nucdiv Nucleotide diversity measures how likely the region is to diversify.
Range: [0.05, 60.25] (default: 0).
recombination_rate Recombination rate measures how likely the region is to undergo recombination.
Range: [0, 54.96]

Mappability

  • Reference: Mehran Karimzadeh, Carl Ernst, Anshul Kundaje, Michael M Hoffman, Umap and Bismap: quantifying genome and methylome mappability, Nucleic Acids Research, Volume 46, Issue 20, 16 November 2018, Page e120
  • Retrieve Source: https://favor.genohub.org/
  • Brief Introduction:
Field Name Description
k*_bismap Mappability of the bisulfite-converted genome. Bisulfite sequencing approaches used to identify DNA methylation introduce large numbers of reads that map to multiple regions. This annotation identifies mappability of the bisulfite-converted genome.
Range: [0, 1] (default: 0).
k*_umap Mappability of unconverted genome. It measures the extent to which a position can be uniquely mapped by sequence reads. Lower mappability means the estimates of genomic and epigenomic characteristics from sequencing assays are less reliable, and the region has increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation.
Range: [0, 1] (default: 0).

Mutation Density

  • Reference: Rentzsch P, Witten D, Cooper G M, et al. CADD: predicting the deleteriousness of variants throughout the human genome[J]. Nucleic Acids Research, 2019, 47(D1): D886-D894.
  • Retrieve Source: https://cadd.gs.washington.edu/download
  • Brief Introduction:
Field Name Description
Dist2Mutation Distance between the closest BRAVO SNV up and downstream (position itself excluded) (default: 0*)
Freq100bp Number of frequent (MAF > 0.05) BRAVO SNV in 100 bp window nearby (default: 0)
Rare100bp Number of rare (MAF < 0.05) BRAVO SNV in 100 bp window nearby (default: 0)
Sngl100bp Number of single occurrence BRAVO SNV in 100 bp window nearby (default: 0)
Freq1000bp Number of frequent (MAF > 0.05) BRAVO SNV in 1000 bp window nearby (default: 0)
Rare1000bp Number of rare (MAF < 0.05) BRAVO SNV in 1000 bp window nearby (default: 0)
Sngl1000bp Number of single occurrence BRAVO SNV in 1000 bp window nearby (default: 0)
Freq10000bp Number of frequent (MAF > 0.05) BRAVO SNV in 10000 bp window nearby (default: 0)
Rare10000bp Number of rare (MAF < 0.05) BRAVO SNV in 10000 bp window nearby (default: 0)
Sngl10000bp Number of single occurrence BRAVO SNV in 10000 bp window nearby (default: 0)
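
The window counts above can be illustrated with a short Python sketch. It rests on several assumptions that the table does not state: the window is centred on the query position (distance <= window // 2), SNVs are given as `(position, maf, allele_count)` tuples, and a "single occurrence" SNV is one with allele count 1, counted separately from the rare class. This is not CADD's implementation.

```python
def window_counts(position, snvs, window):
    """Count frequent, rare and single-occurrence SNVs near `position`.

    snvs: iterable of (position, maf, allele_count) tuples (hypothetical layout).
    window: window size in bp, assumed to be centred on `position`.
    """
    half = window // 2
    counts = {"freq": 0, "rare": 0, "sngl": 0}
    for pos, maf, allele_count in snvs:
        if abs(pos - position) > half:
            continue  # outside the window
        if allele_count == 1:
            counts["sngl"] += 1  # single occurrence (assumed: AC == 1)
        elif maf > 0.05:
            counts["freq"] += 1
        elif maf < 0.05:
            counts["rare"] += 1
    return counts


# Made-up SNVs around position 100.
example_snvs = [
    (100, 0.2, 500),
    (120, 0.01, 20),
    (149, 0.0001, 1),
    (151, 0.3, 900),
    (400, 0.001, 3),
]
counts_100bp = window_counts(100, example_snvs, 100)
counts_1000bp = window_counts(100, example_snvs, 1000)
```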

🦍 Different Species sSNV

We further collected sSNVs from Ensembl Variation 112 and mapped them to GRCh38.

Field Name Description
species_chromosome Chromosome of this variant
species_position Position of this variant
rs_id dbSNP rsID
reference_allele Reference allele of this variant
alternate_allele Alternate allele of this variant
evidence_status Support evidence of this variant, see details in
https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status
original_source The original source this variant comes from.
RefPep Amino acid translated with reference allele.
VarPep Variant peptide that is translated as a result of a missense variant. Format=Index|Amino_acid|Feature_id. The index identifies the missense variant. The amino acid translated with the missense variant. The feature id for the feature overlapping the variant.
VE Variant effect of a variant overlapping a sequence feature, as computed by the Ensembl variant effect pipeline. Format=Consequence|Index|Feature_type|Feature_id. The index identifies the variant sequence for which the effect is described.
CSQ Consequence annotations from Ensembl's Variant Effect Pipeline. Format=Allele|Consequence|Feature_type|Feature|Amino_acids|SIFT
ensembl_transcript_id Transcript information of this variant.
species_variant Variant id, format as {chrom}_{position}_{ref}/{alt}
hg38_variant The human synonymous mutation (in GRCh38 coordinates) to which the variant of this species maps.
reference_genome Reference genome of this variant.
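
The `{chrom}_{position}_{ref}/{alt}` identifier format used by species_variant can be split back into its parts. The sketch below is one possible parser, not SynMall code: the `Variant` type and `parse_variant` name are hypothetical, and treating `-` as an unmapped placeholder is an assumption carried over from the convention described for unmapped records elsewhere in this document.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    position: int
    ref: str
    alt: str


def parse_variant(variant_id: str) -> Optional[Variant]:
    """Parse '{chrom}_{position}_{ref}/{alt}'; return None for '-'."""
    if variant_id == "-":
        return None  # assumed placeholder for unmapped records
    # Split from the right so chromosome or scaffold names that
    # themselves contain underscores stay intact.
    chrom, position, alleles = variant_id.rsplit("_", 2)
    ref, alt = alleles.split("/")
    return Variant(chrom, int(position), ref, alt)


# Made-up identifiers for illustration.
simple = parse_variant("1_925952_G/A")
scaffold = parse_variant("NW_0123_45_C/T")
```

Splitting from the right matters for non-human assemblies, where scaffold names often contain underscores.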