📝 Introduction

SynMall is a one-stop synonymous mutation database that stores synonymous mutations across the entire human genome. It contains over 97 million synonymous mutations, corresponding to 25 million unique combinations of genomic coordinate, reference base, and replacement base.

All potential sSNV

Field Name	Description
Variant38	sSNV on GRCh38, formatted as {chromosome}_{position}_{reference allele}/{alternate allele}
Chromosome	Chromosome of sSNV
Position38	Position of the sSNV on GRCh38
Reference Allele	The reference allele on the genome
Alternate Allele	The alternate allele of the sSNV
Position19	Position of the sSNV on GRCh37, lifted over with LiftOver (unmapped records are shown as `-`)
Source	Source of the sSNV. G=Generated from protein-coding transcripts; S=synVep; F=FavorAnnotator; C=CADD v1.7
Variant19	sSNV on GRCh37, formatted as {chromosome}_{position}_{reference allele}/{alternate allele} (unmapped records are shown as `-`)
ID	dbSNP rsID (build b156)
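
The variant identifier format described above can be split back into its parts with a few lines of Python. This is a minimal sketch, not SynMall's own code: the `SSNV` type, the `parse_variant` function, and the example identifier `1_69091_A/G` are illustrative assumptions, and whether chromosomes carry a `chr` prefix is not stated here.

```python
from typing import NamedTuple, Optional


class SSNV(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str


def parse_variant(variant: str) -> Optional[SSNV]:
    """Parse '{chromosome}_{position}_{ref}/{alt}'; return None for the '-' placeholder."""
    if variant == "-":  # unmapped record
        return None
    # Split from the right so a chromosome name containing '_' stays intact.
    chromosome, position, alleles = variant.rsplit("_", 2)
    ref, alt = alleles.split("/")
    return SSNV(chromosome, int(position), ref, alt)


print(parse_variant("1_69091_A/G"))  # hypothetical identifier
print(parse_variant("-"))
```

Returning `None` for `-` keeps unmapped GRCh37 records distinguishable from parse failures, which raise `ValueError`.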

✒ Annotation Result Interpretation

🖥 in silico Prediction

Common Pathogenic Prediction Score

This section compiles pathogenicity prediction scores for mutations, measured using computational tools that are not limited to a specific type of mutation. The table below lists the names of these tools and the meanings of their fields.

Field Name	Description	Reference
CADD_RawScore	Raw score from the model; higher values indicate that a variant is more likely to be "simulated" (not observed) than "observed", and therefore more likely to be deleterious.	Rentzsch P, Witten D, Cooper G M, et al. CADD: predicting the deleteriousness of variants throughout the human genome[J]. Nucleic Acids Research, 2019, 47(D1): D886-D894.
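CADD_PHRED	PHRED-scaled CADD score, ranked against all ~8.6 billion possible SNVs. Typical range: 1 to 99 (for example, 10 or greater marks the top 10% of SNVs, 20 or greater the top 1%)	Same as above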
DANN_score	DANN is a functional prediction score retrained on the CADD training data using a deep neural network. Scores range from 0 to 1. A larger number indicates a higher probability of being damaging.	Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants[J]. Bioinformatics, 2015, 31(5): 761-763.
Eigen_Score	A functional prediction score based on conservation, allele frequencies, and deleteriousness prediction	Ionita-Laza I, McCallum K, Xu B, et al. A spectral approach integrating functional genomic annotations for coding and noncoding variants[J]. Nature genetics, 2016, 48(2): 214-220.
FATHMM-MKL_Score	Discriminate between pathogenic variants and benign variants. >0.5: deleterious <=0.5: neutral or benign	Shihab H A, Rogers M F, Gough J, et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation[J]. Bioinformatics, 2015, 31(10): 1536-1543.
FATHMM-XF_Score	Discriminate between pathogenic variants and benign variants. >0.5: deleterious <=0.5: neutral or benign	Rogers M F, Shihab H A, Mort M, et al. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features[J]. Bioinformatics, 2018, 34(3): 511-513.
CAPICE_Score	The higher the score, the more likely that the variant is pathogenic.	Li S, van der Velde K J, De Ridder D, et al. CAPICE: a computational method for consequence-agnostic pathogenicity interpretation of clinical exome variations[J]. Genome Medicine, 2020, 12: 1-11.
TraP_Score	The chance of a variant being pathogenic; the higher the score, the more damaging the variant is predicted to be. Between 0.459 and 0.93: possibly damaging; >=0.93: probably damaging	Gelfman S, Wang Q, McSweeney K M, et al. Annotating pathogenic non-coding variants in genic regions[J]. Nat Commun, 2017, 8(1): 236.
PhD-SNPg_Score	A binary classifier for predicting pathogenic variants. Scores approaching 1: pathogenic; scores approaching 0: benign	Capriotti E, Fariselli P. PhD-SNPg: a webserver and lightweight tool for scoring single nucleotide variants[J]. Nucleic acids research, 2017, 45(W1): W247-W252.
GPN-MSA_Score	Refers to the deleteriousness of one position. cutoff: -7	Benegas G, Albors C, Aw A J, et al. GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction[J]. bioRxiv, 2023.
CScape-somatic_Score	Discriminate between pathogenic variants and benign variants. >0.5: deleterious <=0.5: neutral or benign	Rogers M F, Gaunt T R, Campbell C. CScape-somatic: distinguishing driver and passenger point mutations in the cancer genome[J]. Bioinformatics, 2020, 36(12): 3637-3644.
CScape_Score	Discriminate between pathogenic variants and benign variants. >0.5: deleterious <=0.5: neutral or benign	Rogers M F, Shihab H A, Gaunt T R, et al. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome[J]. Sci Rep, 2017, 7(1): 11597.
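
The cutoffs listed in the table can be turned into labels as sketched below. The function names and the example row are hypothetical, and treating the lower TraP bound (0.459) as inclusive is an assumption, since the table does not say which side the boundary falls on.

```python
def classify_half_cutoff(score: float) -> str:
    """FATHMM-MKL, FATHMM-XF, CScape and CScape-somatic: > 0.5 is deleterious."""
    return "deleterious" if score > 0.5 else "neutral or benign"


def classify_trap(score: float) -> str:
    """TraP: >= 0.93 probably damaging, 0.459 up to 0.93 possibly damaging."""
    if score >= 0.93:
        return "probably damaging"
    if score >= 0.459:  # assumption: the lower bound is inclusive
        return "possibly damaging"
    # The table gives no label below 0.459, so none is invented here.
    return "below the possibly-damaging cutoff"


# Made-up values, keyed by the field names from the table.
row = {"FATHMM-XF_Score": 0.71, "CScape_Score": 0.5, "TraP_Score": 0.62}
print(classify_half_cutoff(row["FATHMM-XF_Score"]))
print(classify_half_cutoff(row["CScape_Score"]))
print(classify_trap(row["TraP_Score"]))
```

Tools without a stated cutoff (for example CAPICE or Eigen) are left out on purpose; their scores are best compared by rank rather than by a fixed threshold.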

sSNV-specific Pathogenic Prediction Score

This section compiles pathogenicity prediction scores specifically designed for synonymous mutations, measured using computational tools. The table below lists the field names of these tools, their meanings, and the corresponding references.

Field Name	Description	Reference
EnDSM_Score	Detects deleterious sSNVs using an ensemble learning framework.	Cheng N, Wang H, Tang X, et al. An Ensemble Framework for Improving the Prediction of Deleterious Synonymous Mutation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(5): 2603-2611.
frDSM_Score	Deleterious synonymous mutation prediction using logistic regression.	Wang H, Sun J, Liu M, et al. frDSM: An Ensemble Predictor With Effective Feature Representation for Deleterious Synonymous Mutation in Human Genome[J]. IEEE/ACM Trans Comput Biol Bioinform, 2023, 20(1): 371-377.
PrDSM_Score	Predictive score of PrDSM for each synonymous mutation. Range: [0, 1]. >0.308: deleterious; <=0.308: benign	Cheng N, Li M, Zhao L, et al. Comparison and integration of computational methods for deleterious synonymous mutation prediction[J]. Brief Bioinform, 2020, 21(3): 970-981.
usDSM_Score	A prediction score for deleterious synonymous mutations. The larger the score is, the more likely the mutation is deleterious.	Tang X, Zhang T, Cheng N, et al. usDSM: a novel method for deleterious synonymous mutation prediction using undersampling scheme[J]. Brief Bioinform, 2021, 22(5).
usDSM_Class	The prediction of usDSM model, deleterious or benign.	Same as above.
Syntool_Score	Intolerance score to single-nucleotide variation	Zhang T, Wu Y, Lan Z, et al. Syntool: A Novel Region-Based Intolerance Score to Single Nucleotide Substitution for Synonymous Mutations Predictions Based on 123,136 Individuals[J]. Biomed Res Int, 2017, 2017: 5096208.
Syntool_Score_P	Percentile of the intolerance score to single-nucleotide variation	Same as above.
SliVA_Score	Automated harmfulness prediction for synonymous (silent) mutations within the human genome. Range: [0, 1]; scores approaching 1: harmful	Buske O J, Manickaraj A, Mital S, et al. Identification of deleterious synonymous variants in human genomes[J]. Bioinformatics, 2013, 29(15): 1843-1850.
synVep_Score	Evaluates the effects of human synonymous variants on different transcripts.	Zeng Z, Aptekmann A A, Bromberg Y. Decoding the effects of synonymous variants[J]. Nucleic acids research, 2021, 49(22): 12673-12691.

Regulatory/Functional Prediction Score

This section compiles information on whether mutations have regulatory or functional effects based on computational tools, rather than necessarily being pathogenic. Many tools are designed for non-coding mutations/regions, but they also provide precomputed scores for the whole genome or regions including some synonymous mutations, making them applicable for annotating synonymous mutations. The table below lists the field names of these tools, their meanings, and the corresponding references.

Field Name	Description	Reference
MACIE01	The estimated joint posterior probabilities of not evolutionarily conserved and regulatory functional	Li X, Yung G, Zhou H, et al. A multi-dimensional integrative scoring framework for predicting functional variants in the human genome[J]. The American Journal of Human Genetics, 2022, 109(3): 446-456.
MACIE10	The estimated joint posterior probabilities of evolutionarily conserved and not regulatory functional	Same as above
MACIE00	The estimated joint posterior probabilities of not evolutionarily conserved and not regulatory functional	Same as above
MACIE11	The estimated joint posterior probabilities of both evolutionarily conserved and regulatory functional	Same as above
MACIE_conserved	The estimated posterior probability of evolutionarily conserved	Same as above
MACIE_regulatory	The estimated posterior probability of regulatory functional	Same as above
MACIE_anyclass	The estimated posterior probability of evolutionarily conserved or regulatory functional	Same as above
FunSeq_Score	A flexible framework to prioritize regulatory mutations from cancer genome sequencing (integrative score).	Khurana, E. et al. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587 (2013)
GenoCanyon_Score	Predict the functional potential at each nucleotide.	Lu, Q., Hu, Y., Sun, J. et al. A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data. Sci Rep 5, 10576 (2015).
FIRE_Score	A score refers to the variant's potential to regulate the expression levels of nearby genes.	Ioannidis N M, Davis J R, DeGorter M K, et al. FIRE: functional inference of genetic variants that regulate gene expression[J]. Bioinformatics, 2017, 33(24): 3895-3901.
CDTS_Score	Context-dependent tolerance score (CDTS). The lower the score, the more intolerant the region is to variation.	di Iulio, J. et al. The human noncoding genome defined by genetic diversity. Nat. Genet. 50, 333–337 (2018)
CDTS_percentile	Genome-wide percentile of CDTS_Score. The lower the percentile, the more constrained the region is.	Same as above
ReMM_Score	Scores the positions in the human genome in terms of their regulatory probability. ->0: non-deleterious; ->1: deleterious	Smedley D, Schubach M, Jacobsen J O B, et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease[J]. The American Journal of Human Genetics, 2016, 99(3): 595-606.
ALoFT_Score	ALoFT provides extensive annotations to putative loss-of-function variants (LoF) in protein-coding genes including functional, evolutionary and network features (integrative score).	Balasubramanian S, Fu Y, Pawashe M, et al. Using ALoFT to determine the impact of putative loss-of-function variants in protein-coding genes[J]. Nature communications, 2017, 8(1): 1-11.
ALoFT_Description	ALoFT annotation can predict the impact of premature stop variants and classify them as dominant disease-causing, recessive disease-causing and benign variants (integrative score).	Same as above
LINSIGHT_Score	The LINSIGHT score (integrative score). A higher LINSIGHT score indicates more functionality. Range: [0.215, 0.995].	Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017)
RegSeq0	Regulatory sequence model HEK293T	Schubach M, Maass T, Nazaretyan L, et al. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions[J]. Nucleic Acids Research, 2024, 52(D1): D1143-D1154.
RegSeq1	Regulatory sequence model K562	Same as above
RegSeq2	Regulatory sequence model HepG2	Same as above
RegSeq3	Regulatory sequence model HeLa-S3	Same as above
RegSeq4	Regulatory sequence model MCF-7	Same as above
RegSeq5	Regulatory sequence model iPS DF 19.11	Same as above
RegSeq6	Regulatory sequence model GM23338	Same as above
RegSeq7	Regulatory sequence model GC-matched background	Same as above
SpliceAI-acc-gain	Masked SpliceAI acceptor gain score (default: 0*)	Jaganathan, K. et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell 176, 535-548.e24 (2019).
SpliceAI-acc-loss	Masked SpliceAI acceptor loss score (default: 0)	Same as above
SpliceAI-don-gain	Masked SpliceAI donor gain score (default: 0)	Same as above
SpliceAI-don-loss	Masked SpliceAI donor loss score (default: 0)	Same as above
MMSp_acceptor	MMSplice acceptor score (default: 0)	Cheng J, Nguyen T Y D, Cygan K J, et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing[J]. Genome biology, 2019, 20: 1-15.
MMSp_exon	MMSplice exon score (default: 0)	Same as above
MMSp_donor	MMSplice donor score (default: 0)	Same as above
dbscSNV-ada_Score	Adaboost classifier score from dbscSNV (default: 0*)	Jian X, Boerwinkle E, Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome[J]. Nucleic acids research, 2014, 42(22): 13534-13544.
dbscSNV-rf_Score	Random forest classifier score from dbscSNV (default: 0*)	Same as above
TargetScan_Score	Targetscan (default: 0*)	Friedman, R. C., Farh, K. K.-H., Burge, C. B. & Bartel, D. P. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res. 19, 92–105 (2009).
mirSVR-Score	mirSVR-Score (default: 0*)	Betel D, Koppal A, Agius P, et al. Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites[J]. Genome biology, 2010, 11: 1-14.
mirSVR-E	mirSVR-E (default: 0)	Same as above
mirSVR-Aln	mirSVR-Aln (default: 0)	Same as above
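
As described above, the four MACIE joint classes (MACIE00, MACIE01, MACIE10, MACIE11) cover every combination of "conserved" and "regulatory", so the three summary fields should follow from them by addition. The sketch below shows that relationship under the assumption that the stored values are ordinary probabilities; the function name and the input numbers are made up, and stored values may differ slightly because of rounding.

```python
import math
from typing import Tuple


def macie_marginals(p00: float, p01: float, p10: float, p11: float) -> Tuple[float, float, float]:
    """Derive (conserved, regulatory, anyclass) from the four joint posteriors.

    First index = evolutionarily conserved, second index = regulatory functional,
    matching the MACIE01 / MACIE10 descriptions in the table.
    """
    conserved = p10 + p11         # conserved, with or without regulatory function
    regulatory = p01 + p11        # regulatory, with or without conservation
    anyclass = p01 + p10 + p11    # at least one of the two classes
    return conserved, regulatory, anyclass


# Made-up joint posteriors that sum to 1.
conserved, regulatory, anyclass = macie_marginals(0.70, 0.10, 0.15, 0.05)
print(conserved, regulatory, anyclass)
print(math.isclose(anyclass, 1.0 - 0.70))  # anyclass is the complement of MACIE00
```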

🩺 Disease Information

HGMD

Reference: Stenson P D, Mort M, Ball E V, et al. The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting[J]. Human genetics, 2020, 139: 1197-1207.
Retrieve Source: Professional, 2023.3
Brief Introduction: The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that are thought to underlie, or are closely associated with human inherited disease.

Field Name	Description
acc_num	The HGMD accession number for this mutation. Typically these are strings consisting of CM, CP, CX or HM etc, followed by a six digit integer, such as CM995289. Foreign key for the GENOMIC_COORDS and MUTNOMEN tables.
chrom_hg19
pos_hg19
ref_hg19
alt_hg19
class_hg19
mut_hg19
chrom_hg38
pos_hg38
ref_hg38
alt_hg38
class_hg38
mut_hg38
diseasegene
chrom	If known, the number of the chromosome (including X and Y). DEPRECATED
genename	A human readable, fully spelled out name for the gene.
gdbid	Identifier for the GDB Genome Database. When a matching record has not been identified, the field contains NULL. Present for historical reasons, as GDB no longer exists.
omimid	Identifier for the OMIM database, http://www.ncbi.nlm.nih.gov/omim. When a matching record has not been identified, the field contains NULL.
amino	The amino acid change caused by the mutation, in triple-letter code.
deletion	Deletions are presented in terms of the deleted bases in lower case plus, in upper case, 10 bp DNA sequence flanking both sides of the lesion. Intron/exon boundary information may be provided where identified (e.g. I12E13). The codon number in the CODON field represents the last whole codon preceding the deletion and is marked in the given sequence by the caret character (^).
insertion	Insertions are presented in terms of the inserted bases in lower case plus, in upper case, 10 bp DNA sequence flanking both sides of the lesion. The numbered codon from the AMINO field is preceded in the given sequence by the caret character (^).
codon	The number of the altered codon mapped to the HGMD cDNA sequence provided.
codonAff	The codon affected by the mutation in question.
descr	A textual description of the mutation.
refseq	The NCBI mRNA reference sequence utilised by HGMD.
hgvs	Composite HGVS cDNA based nomenclature for the mutation.
hgvsAll	Composite HGVS nomenclature for fulltext indexing and searching purposes.
dbsnp	Links the variants in HGMD to a corresponding dbSNP entry.
chromosome	Strictly a number from 1-22, X or Y.
startCoord	Number of the first nucleotide of the mutation (chromosomal coordinate). For deletions, the first deleted nucleotide, for insertions, the last nucleotide before the inserted sequence, for single nucleotide mutations, the number of the mutated nucleotide.
endCoord	Number of the last nucleotide of the mutation (chromosomal coordinate). For deletions, the last deleted nucleotide; for insertions, the first nucleotide after the inserted sequence; for single nucleotide mutations, the number of the mutated nucleotide (should be identical to startCoord).
expected_inheritance	Inheritance data curated from multiple literature sources (only where such data may be unequivocally assigned).
gnomad_AC	Allele counts for HGMD variants exactly matching variants found in the Genome Aggregation Database
gnomad_AF	Allele frequency from gnomAD.
gnomad_AN	Total number of alleles sequenced by gnomAD at the matching locus.
tag	This field categorizes mutations and polymorphisms. There are seven possible values, DM, DM?, DP, DFP, FP, FTV and R.
dmsupport	Positive or negative score depending on the support (or lack of support) of the extra references for pathogenicity or functional alteration. Experimental.
rankscore	Ranking score is a single score between 0 and 1, with 1 being most likely disease-causing. The score is computed using machine learning and is based on multiple lines of evidence, including HGMD literature support for pathogenicity, evolutionary conservation (100-way vertebrate alignment), variant allele frequency and in-silico prediction. This feature is under ongoing development.
mutype	Primary type of mutation logged in HGMD. (i.e. missense, initiation, nonsense, synonymous, noncoding, frameshift, inframe, gross, canonical-splice, exonic-splice, splice, nonstop, regulatory).
author	Reference field. All the reference fields refer to the literature report that the corresponding mutation was obtained from. Last name of the first author
title
fullname	Reference field. The approved Medline abbreviation for the journal. Foreign key for the base table JOURNAL.FULLNAME field
allname	ALLNAME contains the name spelled out in its entirety.
vol	Reference field. There are 6 possible values for this field.
page	Reference field. Number of the first page of the article.
year	Reference field. Year the article was published, in four digits.
pmid	Reference field. There are 5 possible values, numeric, HGOL, LSDB, NO ID and ABST.
pmidAll	This field contains all of the PubMed Ids from primary and additional references that are associated with that variant.
reftag	The REFTAG field contains one of five values: APR for additional phenotype report, FCR for functional characterisation report, MCR for molecular characterisation report, ACR for additional case report (detailing an additional case of the mutation) and SAR for simple additional report.
comments	Free text comments by the curator.
new_date	The date when the mutation was added to the database.
base	This field is specific to single base pair substitutions and contains the description of the nucleotide change. This is presented in terms of a triplet change. For example, TAC-TAT represents a change of the last nucleotide C in the triplet to a T. TGT-TAT represents a change of the middle nucleotide G to an A.
clinvarID
clinvar_clnsig
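
The triplet notation used by the `base` field (for example TAC-TAT) can be decoded into the changed codon position and nucleotides. This is an illustrative sketch only; `triplet_change` is a hypothetical helper, not part of HGMD or SynMall.

```python
from typing import Tuple


def triplet_change(base: str) -> Tuple[int, str, str]:
    """Return (codon_position, ref_nt, alt_nt) for a 'REF-ALT' codon pair.

    codon_position is 1-based: 1 = first, 2 = middle, 3 = last nucleotide.
    """
    ref_codon, alt_codon = base.split("-")
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError(f"expected two triplets, got {base!r}")
    diffs = [
        (i + 1, r, a)
        for i, (r, a) in enumerate(zip(ref_codon, alt_codon))
        if r != a
    ]
    if len(diffs) != 1:
        raise ValueError(f"expected exactly one changed nucleotide in {base!r}")
    return diffs[0]


print(triplet_change("TAC-TAT"))  # last nucleotide, C to T
print(triplet_change("TGT-TAT"))  # middle nucleotide, G to A
```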

ClinVar

Reference: Landrum M J, Chitipiralla S, Brown G R, et al. ClinVar: improvements to accessing data[J]. Nucleic acids research, 2020, 48(D1): D835-D844.
Retrieve Source: https://www.ncbi.nlm.nih.gov/clinvar/ , 2024-06-11
Brief Introduction: ClinVar is a freely accessible public archive maintained by the NIH that aggregates and provides interpretations of the relationships between human genetic variants and diseases.

Field Name	Description
AF_ESP	allele frequencies from GO-ESP
AF_EXAC	allele frequencies from ExAC
AF_TGP	allele frequencies from TGP
ALLELEID	the ClinVar Allele ID
CLNDN	ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
CLNDNINCL	For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB
CLNDISDB	Tag-value pairs of disease database name and identifier submitted for germline classifications, e.g. OMIM:NNNNNN
CLNDISDBINCL	For included variant: Tag-value pairs of disease database name and identifier for germline classifications, e.g. OMIM:NNNNNN
CLNHGVS	Top-level (primary assembly, alt, or patch) HGVS expression
CLNREVSTAT	ClinVar review status of germline classification for the Variation ID
CLNSIG	Aggregate germline classification for this single variant; multiple values are separated by a vertical bar
CLNSIGCONF	Conflicting germline classification for this single variant; multiple values are separated by a vertical bar
CLNSIGINCL	Germline classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar
CLNVC	Variant type
CLNVCSO	Sequence Ontology id for variant type
CLNVI	the variant's clinical sources reported as tag-value pairs of database and variant identifier
DBVARID	nsv accessions from dbVar for the variant
GENEINFO	Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
MC	Comma-separated list of molecular consequences in the form Sequence Ontology ID|molecular_consequence
ONCDN	ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDB
ONCDNINCL	For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDBINCL
ONCDISDB	Tag-value pairs of disease database name and identifier submitted for oncogenicity classifications, e.g. MedGen:NNNNNN
ONCDISDBINCL	For included variant: Tag-value pairs of disease database name and identifier for oncogenicity classifications, e.g. OMIM:NNNNNN
ONC	Aggregate oncogenicity classification for this single variant; multiple values are separated by a vertical bar
ONCINCL	Oncogenicity classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar
ONCREVSTAT	ClinVar review status of oncogenicity classification for the Variation ID
ONCCONF	Conflicting oncogenicity classification for this single variant; multiple values are separated by a vertical bar
ORIGIN	Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other
RS	dbSNP ID (i.e. rs number)
SCIDN	ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDB
SCIDNINCL	For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDBINCL
SCIDISDB	Tag-value pairs of disease database name and identifier submitted for somatic clinical impact classifications, e.g. MedGen:NNNNNN
SCIDISDBINCL	For included variant: Tag-value pairs of disease database name and identifier for somatic clinical impact classifications, e.g. OMIM:NNNNNN
SCIREVSTAT	ClinVar review status of somatic clinical impact for the Variation ID
SCI	Aggregate somatic clinical impact for this single variant; multiple values are separated by a vertical bar
SCIINCL	Somatic clinical impact classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar
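
Two of the ClinVar fields above are packed values. ORIGIN is a sum of power-of-two flags, and GENEINFO is a list of symbol:id pairs separated by vertical bars. The sketch below unpacks both; the helper names are hypothetical and the example values are illustrative only.

```python
from typing import List, Tuple

# Flag values copied from the ORIGIN row of the table.
ORIGIN_FLAGS = {
    1: "germline", 2: "somatic", 4: "inherited", 8: "paternal",
    16: "maternal", 32: "de-novo", 64: "biparental", 128: "uniparental",
    256: "not-tested", 512: "tested-inconclusive", 1073741824: "other",
}


def decode_origin(value: int) -> List[str]:
    """Expand a summed ORIGIN value into its individual allele origins."""
    if value == 0:
        return ["unknown"]
    return [name for bit, name in ORIGIN_FLAGS.items() if value & bit]


def parse_geneinfo(geneinfo: str) -> List[Tuple[str, int]]:
    """Split 'SYMBOL:ID|SYMBOL:ID' into (symbol, gene id) pairs."""
    pairs = []
    for item in geneinfo.split("|"):
        symbol, gene_id = item.split(":")
        pairs.append((symbol, int(gene_id)))
    return pairs


print(decode_origin(3))                        # 1 + 2
print(parse_geneinfo("BRCA1:672|NBR2:10230"))  # illustrative value
```

Because each origin is a distinct bit, a bitwise AND recovers every component of the sum without ambiguity.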

COSMIC

Reference: Tate J G, Bamford S, Jubb H C, et al. COSMIC: the catalogue of somatic mutations in cancer[J]. Nucleic acids research, 2019, 47(D1): D941-D947.
Retrieve Source: https://cancer.sanger.ac.uk/cosmic/ , v100
Brief Introduction: COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world's largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer.

Field Name	Description
COSMIC_MUTATION_ID	Genomic mutation identifier (COSV) to indicate the definitive position of the variant on the genome.
GENE	Gene name
TRANSCRIPT	Transcript accession
STRAND	Gene strand
LEGACY_ID	Legacy Mutation ID
CDS	CDS annotation
AA	Peptide annotation
HGVSC	HGVS cds syntax
HGVSP	HGVS peptide syntax
HGVSG	HGVS genomic syntax
SAMPLE_COUNT	How many genome screens samples have this mutation
IS_CANONICAL	The Ensembl Canonical transcript is a single, representative transcript identified at every locus
TIER	Indicates to which tier of the Cancer Gene Census the gene belongs (1/2)
SO_TERM	SO term for this mutation
COMISC_SOURCE	This record comes from TARGETED_SCREEN or GENOME_SCREEN. GENOME_SCREEN: Coding point mutations from genome wide screens (including whole exome sequencing) from the current release; TARGETED_SCREEN: Complete curated COSMIC dataset (targeted screens) from the current release.

GWAS Catalog

Reference: Sollis E, Mosaku A, Abid A, et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource[J]. Nucleic acids research, 2023, 51(D1): D977-D985.
Retrieve Source: https://www.ebi.ac.uk/gwas/home , v1.0.2
Brief Introduction: The NHGRI-EBI GWAS Catalog is a FAIR knowledgebase providing standardized GWAS data, containing variant-trait associations and metadata for over 45,000 published GWAS, with expanded data types and improved interoperability, curated from publications or prepublication author submissions.

Field Name	Description
DATE ADDED TO CATALOG	Date a study is published in the catalog
PUBMEDID	PubMed identification number
FIRST AUTHOR	Last name and initials of first author
DATE	Publication date (online (epub) date if available)
JOURNAL	Abbreviated journal name
LINK	PubMed URL
STUDY	Title of paper
DISEASE/TRAIT	Disease or trait examined in study
INITIAL SAMPLE SIZE	Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)
REPLICATION SAMPLE SIZE	Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)
REGION	Cytogenetic region associated with rs number
CHR_ID	Chromosome number associated with rs number
CHR_POS	Chromosomal position associated with rs number
REPORTED GENE(S)	Gene(s) reported by author
MAPPED_GENE	Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is located within multiple genes, these genes are listed separated by commas. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.
UPSTREAM_GENE_ID	Entrez Gene ID for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_ID	Entrez Gene ID for nearest downstream gene to rs number, if not within gene
SNP_GENE_IDS	Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts
UPSTREAM_GENE_DISTANCE	Distance in kb for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_DISTANCE	Distance in kb for nearest downstream gene to rs number, if not within gene
STRONGEST SNP-RISK ALLELE	SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype
SNPS	Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)
MERGED	Denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes;)
SNP_ID_CURRENT	Current rs number (will differ from strongest SNP when merged = 1)
CONTEXT	Provides information on a variant’s predicted most severe functional effect from Ensembl
INTERGENIC	Denotes whether SNP is in intergenic region (0 = no; 1 = yes)
RISK ALLELE FREQUENCY	Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.
P-VALUE	Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 × 10^-7 is rounded to 5 × 10^-7).
PVALUE_MLOG	-log(p-value)
P-VALUE (TEXT)	Information describing context of p-value (e.g. females, smokers).
OR or BETA	Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that prior to 2021, any OR <1 was inverted, along with the reported allele, so that all ORs included in the Catalog were >1. This is no longer done, meaning that associations added after 2021 may have OR <1. Appropriate unit and increase/decrease are included for beta coefficients.
95% CI (TEXT)	Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.
PLATFORM [SNPS PASSING QC]	Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable
CNV	Study of copy number variation (yes/no)
MAPPED_TRAIT	Mapped Experimental Factor Ontology trait for this study
MAPPED_TRAIT_URI	URI of the EFO trait
STUDY ACCESSION	Accession ID allocated to a GWAS Catalog study
GENOTYPING TECHNOLOGY	Genotyping technology/ies used in this study, with additional array information (ex. Immunochip or Exome array) in brackets.
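
The relationship between the P-VALUE and PVALUE_MLOG fields can be checked with a short calculation. This sketch assumes the logarithm is base 10, as is conventional for GWAS summary data; the function names are hypothetical.

```python
import math


def pvalue_mlog(p_value: float) -> float:
    """Return -log10(p), the quantity stored in PVALUE_MLOG (base 10 assumed)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


def mlog_to_pvalue(mlog: float) -> float:
    """Invert the transform: recover p from -log10(p)."""
    return 10.0 ** (-mlog)


p = float("5E-7")  # the rounded example from the P-VALUE description
mlog = pvalue_mlog(p)
print(round(mlog, 3))
print(mlog_to_pvalue(mlog))
```

Working on the -log10 scale avoids floating-point underflow when comparing very small p-values.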

GRASP

Reference: Leslie R, O'Donnell C J, Johnson A D. GRASP: analysis of genotype–phenotype results from 1390 genome-wide association studies and corresponding open access database[J]. Bioinformatics, 2014, 30(12): i185-i194.
Retrieve Source: https://sites.google.com/site/jpopgen/wgsa , v2
Brief Introduction: GRASP contains over 6.2 million SNP-phenotype associations from 1390 GWAS studies, re-annotated with 16 diverse sources including RNA editing sites, lincRNAs, and PTMs.

Field Name	Description
rs	Latest SNP ID from dbSNP; it can differ from the original SNP entry in the database due to SNP merges (merged = 1)
PMID	PubMed identifier for paper from which the SNP association originates
p-value	P-value for SNP-phenotype association
phenotype	Phenotype description of SNP-phenotype entry
ancestry	Ethnodemographic description of the paper population(s) (e.g., European, Mixed)
platform	Description of genotyping and/or imputation platform(s) and number of SNP markers (specified or approximated) included in post-QC analyses

DisGeNET

Reference: Piñero J, Ramírez-Anguita J M, Saüch-Pitarch J, et al. The DisGeNET knowledge platform for disease genomics: 2019 update[J]. Nucleic acids research, 2020, 48(D1): D845-D855.
Retrieve Source: https://www.disgenet.org/home/ , 2020.3
Brief Introduction: DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated to human diseases.

Field Name	Description
snpId	dbSNP variant Identifier
class	type of variant
chromosome	Chromosome of the variant
position	Position in chromosome
DSI	The Disease Specificity Index for the variant
DPI	The Disease Pleiotropy Index for the variant
NofDiseases	Number of diseases associated to the variant
NofPmids	Total number of publications reporting the Variant-Disease association

❗ ClinGen

The ClinGen website does not provide detailed descriptions of these fields.

Reference: Rehm H L, Berg J S, Brooks L D, et al. ClinGen—the clinical genome resource[J]. New England Journal of Medicine, 2015, 372(23): 2235-2242.
Retrieve Source: https://clinicalgenome.org/ , 2024-03-27
Brief Introduction: ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to building a central resource that defines the clinical relevance of genes and variants for use in precision medicine and research.

Field Name	Description
#Variation
ClinVar Variation Id
Allele Registry Id	ClinGen canonical allele identifier, example: CA200893
HGVS Expressions
HGNC Gene Symbol
Disease
Mondo Id	MonDO IDs are required for describing the disease entity in the ClinGen Gene and Variant Curation Interfaces
Mode of Inheritance	a gene may also be associated with multiple inheritance patterns
Assertion	Clinical Validity Classification
Applied Evidence Codes (Met)
Applied Evidence Codes (Not Met)
Summary of interpretation
PubMed Articles
Expert Panel
Guideline
Approval Date
Published Date
Retracted
Evidence Repo Link
Uuid
HGVSg

VariSNP

Reference: Schaafsma G C P, Vihinen M. VariSNP, a benchmark database for variations from dbSNP[J]. Human mutation, 2015, 36(2): 161-166.
Retrieve Source: https://lap676.srv.lu.se/VariSNP/index.php , 2017-02-16
Brief Introduction: VariSNP is a benchmark database suite comprising variation datasets that can be used for developing and testing the performance of variant effect prediction tools. VariSNP contains datasets selected from dbSNP which were filtered for disease-related variants found in ClinVar, Swiss-Prot and PhenCode, so all variations are considered neutral or non-pathogenic.

Field Name	Description
dbSNP_id	dbSNP RefSNP cluster ID number (rs#)
heterozygosity	Estimated average heterozygosity from allele frequencies of this RefSNP. Values between 0 and 1. You can find a document describing the computation of average heterozygosity and standard error for dbSNP RefSNP clusters at NCBI
heterozygosity_standard_error	Standard error of heterozygosity estimate.
creation_date	Date when the RefSNP cluster was instantiated
creation_build	Build number (NCBI release) when the RefSNP cluster was instantiated
update_date	Most recent date the RefSNP cluster was updated (member added or deleted)
update_build	Build number (NCBI release) when the RefSNP cluster was updated
observed_alleles	Observed variation alleles. All allele(s) observed at this position in the reference. Can be something like A/C or A/C/G/T or -/ACC
asn_from	Start position of snp on contig, counting from 0. This position is always from the beginning of the contig regardless of the snp orientation to contig and regardless of the contig orientation to chromosome
asn_to	End position of snp on contig
reference_allele	Reference allele(s), this can be a '-' in the case of an insertion
orientation	Orientation of RefSNP sequence to contig sequence. Values are 'forward' or 'reverse'
minor_allele_frequency	Global minor allele frequency. dbSNP is reporting the minor allele frequency for each rs included in a default global population. Since this is being provided to distinguish common polymorphism from rare variants, the MAF is actually the second most frequent allele value. In other words, if there are 3 alleles, with frequencies of 0.50, 0.49, and 0.01, the MAF will be reported as 0.49. The current default global population is 1000Genome phase 1 genotype data from 1094 worldwide individuals, released in the May 2011 dataset. Values from 0 to 0.50
minor_allele	Minor allele
sample_size	Sample size, which is the number of chromosomes in the sample population
validation	Validation method, type of evidence used to confirm the variation. Present values can be byHapMap; byOtherPop; byFrequency; by1000G; by2Hit2Allele; byCluster
hgvs_names	Description(s) of the variation according to HGVS recommendations
allele_origin	Genetic origin of the allele, e.g. germline, somatic, inherited, maternal
clinical_significance	Clinical significance. Assertions of clinical significance for alleles of human sequence variations are reported as provided by the submitter and not interpreted by NCBI. Submissions based on processing data from OMIM® were assigned the value of `probable-pathogenic`. If there is a published authoritative guideline about the pathogenicity of any allele, that is included in the report. The supported values are: unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other
functional_class	Variation functional class. Variations are assigned functional classes, which report whether a variation is located in a locus region, in a transcript, or in a coding region. This column contains one or more functional classes (fxnClass); values can be cds-indel, downstream-variant-500B, frameshift-variant, intron-variant, missense, nc-transcript-variant, reference, splice-acceptor-variant, splice-donor-variant, stop-gained, stop-lost, synonymous-codon, upstream-variant-2KB, utr-variant-3-prime. This column can also contain the Sequence Ontology term corresponding to the functional class (soTerm), the mRNA accession (mrnaAcc) and version (mrnaVer), the gene symbol (symbol) and the Entrez gene ID (geneid)
ncbi_gi	NCBI gi number.
ncbi_accession	NCBI accession and version number of reference sequence, e.g. NG_01234.5
gene_symbol	Gene symbol (provided by HGNC).
refseq_start_description	Description relative to transcription start on reference sequence
coding_dna_description	Coding DNA variant description according to HGVS recommendations
protein_description	Protein variant description according to HGVS recommendations
coding_reference	NCBI RefSeq accession and version number (mRNA), e.g. NM_01234.5
protein_reference	NCBI RefSeq accession and version number (protein), e.g. NP_01234.5
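
The `minor_allele_frequency` rule described above (the reported value is the frequency of the second most frequent allele, not the rarest one) can be sketched as follows. This is a minimal illustration only; the function name `dbsnp_style_maf` and the input dictionary layout are assumptions made for this example and are not part of VariSNP or dbSNP.

```python
def dbsnp_style_maf(allele_freqs):
    """Return (allele, frequency) for the second most frequent allele.

    allele_freqs: dict mapping allele -> frequency (hypothetical layout).
    """
    if len(allele_freqs) < 2:
        raise ValueError("need at least two alleles to define a minor allele")
    # Rank alleles from most to least frequent and take the runner-up.
    ranked = sorted(allele_freqs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[1]


# The three-allele case from the field description: 0.50, 0.49 and 0.01.
allele, maf = dbsnp_style_maf({"A": 0.50, "C": 0.49, "G": 0.01})
print(allele, maf)  # C 0.49
```

With three alleles the rarest one (0.01) is ignored, which is why the reported MAF can be far larger than the frequency of the least common allele.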

dbDSM

Reference: Wen P, Xiao P, Xia J. dbDSM: a manually curated database for deleterious synonymous mutations[J]. Bioinformatics, 2016, 32(12): 1914-1916.
Retrieve Source: http://www.xialab.info:8080/dbDSM/index.jsp , v2
Brief Introduction: dbDSM (Database of Deleterious Synonymous Mutation) is an integrated database that collects multiple sources related to deleterious synonymous mutations.

Field Name	Description
dbDSM Number	The access number of a variant in dbDSM
Disease	The main phenotype of the patient
DOID	The identifier of a disease linked to OMIM database
Gene	Gene name
GeneID	The unique identifier for a gene
MIM	The identifier of a gene linked to OMIM database
Map Location	The map location for this gene
Protein	A protein reference level representation of the variant
cDNA	A coding reference level representation of the variant
SNPID	dbSNP identifier of the variant. If there is no rs id this field is "n/a"
Refseq Transcript	Refseq Transcript that the variant resides on
P-value	P-value in GWAS
Strand	Whether the variant occurs on the forward strand (+) or the reverse strand (-)
GRCh38 Position	The position of variant on GRCh38
GRCh37 Position	The position of variant on GRCh37
Ref	Reference allele
Alt	Alternate allele
Year	Published time of an article
PMID	Pubmed ID for an article
Classification	Deleterious mechanism of a variant
Strength of Evidence	Clinical classification of a variant
Key Sentence	Deleterious evidence of a variant extracted from the article
Source	The source of a variant
Score	dbDSM score of a variant, derived from the SilVA, DDIG-SN, FATHMM-MKL, TraP and CADD scores. A voting method is used to evaluate the variant: the dbDSM score increases by one for each tool whose score is above its threshold value.
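
The voting scheme behind the `Score` field can be sketched as below. This is only an illustration of the counting logic: the threshold values in `ASSUMED_THRESHOLDS` are placeholders invented for this example (the actual cutoffs used by dbDSM are not listed here), and the function name `vote_score` is hypothetical.

```python
# Placeholder cutoffs, NOT the thresholds used by dbDSM.
ASSUMED_THRESHOLDS = {
    "SilVA": 0.5,
    "DDIG-SN": 0.5,
    "FATHMM-MKL": 0.5,
    "TraP": 0.5,
    "CADD": 15.0,
}


def vote_score(tool_scores, thresholds=ASSUMED_THRESHOLDS):
    """Count how many tools score the variant above their threshold."""
    votes = 0
    for tool, cutoff in thresholds.items():
        value = tool_scores.get(tool)
        # A tool with no score for this variant contributes no vote.
        if value is not None and value > cutoff:
            votes += 1
    return votes


scores = {"SilVA": 0.7, "DDIG-SN": 0.2, "FATHMM-MKL": 0.9, "TraP": None, "CADD": 20.1}
print(vote_score(scores))  # 3
```

Under this scheme the score ranges from 0 (no tool above its threshold) to 5 (all five tools agree).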

PharmGKB

Reference: Gong L, Whirl‐Carrillo M, Klein T E. PharmGKB, an integrated resource of pharmacogenomic knowledge[J]. Current protocols, 2021, 1(8): e226.
Retrieve Source: https://www.pharmgkb.org/ , 2024-03-06
Brief Introduction: The Pharmacogenomics Knowledgebase (PharmGKB) is an integrated online knowledge resource for the understanding of how genetic variation contributes to variation in drug response.

var_pheno_ann.tsv: Contains associations in which the variant affects a phenotype, with or without drug information.
var_drug_ann.tsv: Contains associations in which the variant affects a drug dose, response, metabolism, etc.
var_fa_ann.tsv: Contains in vitro and functional analysis-type associations.

Field Name	Description
Variant Annotation ID	Unique ID number for each variant/drug annotation.
Variant/Haplotypes	dbSNP rsID or haplotype(s) involved in the association. In some cases, an association is based on a gene phenotype group such as "poor metabolizers" or "intermediate activity". In these cases, the gene phenotype is found in this field.
Gene	HGNC symbol for the gene involved in the association. Typically the variants will be within the gene boundaries, but occasionally this will not be true. E.g. the variant in the annotation may be upstream of the gene but is reported to affect the gene's expression or otherwise associated with the gene.
Drug(s)	The drug(s) involved in the association. If there is more than one drug listed, the association may apply to each drug individually or the combination of the drugs together. The field "Multiple drugs And/or" will designate "or" - meaning that it applies to each drug - or "and" - meaning that the association is for the combination.
PMID	PubMed identifier for the article supporting the annotation.
Phenotype Category	Options are "efficacy", "toxicity", "dosage", "metabolism/PK", "PD", "other".
Significance	The significance of the association as stated by the author; options are [yes, no, not stated].
Notes	Free text field for notes added by the curator.
Sentence	The structured annotation sentence generated by the variant annotation tool based on the information entered by the curator.
Alleles	The basis for comparison in the annotation. In this field, there may be a variant, one or more haplotypes grouped together, one or more genotypes grouped together or one or more diplotypes grouped together. If there is a gene phenotype in the "Variant/Haplotypes" field (described above), this field will be blank
Specialty Population	Any special populations this annotation is relevant to (e.g. pediatric).
Assay Type	Information about the type of assay performed.

Relationship

Field Name	Description
Entity1_id	Diseases, genes and drugs are designated by their PharmGKB IDs.
Entity1_type	Disease, Drug, Gene, VariantLocation1 or Haplotype2.
Entity2_id	Diseases, genes and drugs are designated by their PharmGKB IDs.
Entity2_type	Disease, Drug, Gene, VariantLocation1 or Haplotype2.
Evidence	VIP, VariantAnnotation, ClinicalAnnotation, DosingGuideline, DrugLabel or Pathway. Comma separated list because the evidence for a relationship could come from multiple sources in PharmGKB.
Association	Possible values: “associated”, “not associated” or “ambiguous”.
PK	PK stands for “Pharmacokinetic”. Relationships are marked as PK if the pair of entities was found in a pharmacokinetic pathway on PharmGKB, or if the Variant Annotation or VIP was annotated with PK in some manner
PD	PD stands for “Pharmacodynamic”. Relationships are marked as PD if the pair of entities was found in a pharmacodynamic pathway on PharmGKB, or if the Variant Annotation or VIP was annotated with PD in some manner.
PMIDs	PubMed IDs that were used to support the listed relationship. Semi-colon delimited list.
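
Because `Evidence` is a comma-separated list and `PMIDs` is a semicolon-delimited list, both fields usually need to be split before use. The sketch below assumes each record is available as a dictionary keyed by the field names in the table above; the function name `parse_relationship` and the sample identifiers are hypothetical.

```python
def parse_relationship(row):
    """Split the multi-valued Evidence and PMIDs fields into lists."""
    evidence = [e.strip() for e in row.get("Evidence", "").split(",") if e.strip()]
    pmids = [p.strip() for p in row.get("PMIDs", "").split(";") if p.strip()]
    parsed = dict(row)
    parsed["Evidence"] = evidence
    parsed["PMIDs"] = pmids
    return parsed


# Made-up record for illustration only.
row = {
    "Entity1_id": "PA_EXAMPLE_1",
    "Entity1_type": "Gene",
    "Entity2_id": "PA_EXAMPLE_2",
    "Entity2_type": "Drug",
    "Evidence": "VariantAnnotation,ClinicalAnnotation",
    "Association": "associated",
    "PMIDs": "11111111;22222222",
}
parsed = parse_relationship(row)
print(parsed["Evidence"])  # ['VariantAnnotation', 'ClinicalAnnotation']
print(parsed["PMIDs"])     # ['11111111', '22222222']
```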

Clinical

Field Name	Description
variant	name or symbol of the variant
gene	HGNC ID of the gene
type	category or categories that the annotation falls in
level of evidence	strength of evidence for the annotation
chemicals	drug(s) associated with the variant in the annotation; from the PharmGKB drug vocabulary
phenotypes	associated disease phenotype(s), where applicable

Variant

Field Name	Description
Variant ID	The PharmGKB identifier for this variant
Variant Name	The PharmGKB name for this variant
Gene IDs	The PharmGKB identifiers for genes associated with this variant
Gene Symbols	The HGNC symbols for genes associated with this variant
Location	The location of this variation on a reference sequence (either RefSeq or GenBank), if available. HGVS format when applicable
Variant Annotation count	The count of Variant Annotations done on this variant
Clinical Annotation count	The count of all Clinical Annotations done on this variant
Level 1/2 Clinical Annotation count	The count of Level 1 or Level 2 ("top") Clinical Annotations done on this variant
Guideline Annotation count	The count of Dosing Guideline Annotations of which this variant is a part
Label Annotation count	The count of Drug Label Annotations in which this variant is mentioned
Synonyms	A comma-separated list of synonyms for this variant. Includes HGVS names, retired RSIDs, and other names

👁 Epigenetic Information

ENCODE

Reference: Davis C A, Hitz B C, Sloan C A, et al. The Encyclopedia of DNA elements (ENCODE): data portal update[J]. Nucleic acids research, 2018, 46(D1): D794-D801.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: Chemical modifications (e.g., methylation and acetylation) to the histone proteins present in chromatin influence gene expression by changing how accessible the chromatin is to transcription.

Field Name	Description
EncodeH3K4me1-sum	Sum of Encode H3K4me1 levels (from 13 cell lines) (default: 0.76)
EncodeH3K4me1-max	Maximum Encode H3K4me1 level (from 13 cell lines) (default: 0.37)
EncodeH3K4me2-sum	Sum of Encode H3K4me2 levels (from 14 cell lines) (default: 0.73)
EncodeH3K4me2-max	Maximum Encode H3K4me2 level (from 14 cell lines) (default: 0.37)
EncodeH3K4me3-sum	Sum of Encode H3K4me3 levels (from 14 cell lines) (default: 0.81)
EncodeH3K4me3-max	Maximum Encode H3K4me3 level (from 14 cell lines) (default: 0.38)
EncodeH3K9ac-sum	Sum of Encode H3K9ac levels (from 13 cell lines) (default: 0.82)
EncodeH3K9ac-max	Maximum Encode H3K9ac level (from 13 cell lines) (default: 0.41)
EncodeH3K9me3-sum	Sum of Encode H3K9me3 levels (from 14 cell lines) (default: 0.81)
EncodeH3K9me3-max	Maximum Encode H3K9me3 level (from 14 cell lines) (default: 0.38)
EncodeH3K27ac-sum	Sum of Encode H3K27ac levels (from 14 cell lines) (default: 0.74)
EncodeH3K27ac-max	Maximum Encode H3K27ac level (from 14 cell lines) (default: 0.36)
EncodeH3K27me3-sum	Sum of Encode H3K27me3 levels (from 14 cell lines) (default: 0.93)
EncodeH3K27me3-max	Maximum Encode H3K27me3 level (from 14 cell lines) (default: 0.47)
EncodeH3K36me3-sum	Sum of Encode H3K36me3 levels (from 10 cell lines) (default: 0.71)
EncodeH3K36me3-max	Maximum Encode H3K36me3 level (from 10 cell lines) (default: 0.39)
EncodeH3K79me2-sum	Sum of Encode H3K79me2 levels (from 13 cell lines) (default: 0.64)
EncodeH3K79me2-max	Maximum Encode H3K79me2 level (from 13 cell lines) (default: 0.34)
EncodeH4K20me1-sum	Sum of Encode H4K20me1 levels (from 11 cell lines) (default: 0.88)
EncodeH4K20me1-max	Maximum Encode H4K20me1 level (from 11 cell lines) (default: 0.47)
EncodeH2AFZ-sum	Sum of Encode H2AFZ levels (from 13 cell lines) (default: 0.9)
EncodeH2AFZ-max	Maximum Encode H2AFZ level (from 13 cell lines) (default: 0.42)
EncodeDNase-sum	Sum of Encode DNase-seq levels (from 12 cell lines) (default: 0.0)
EncodeDNase-max	Maximum Encode DNase-seq level (from 12 cell lines) (default: 0.0)
EncodetotalRNA-sum	Sum of Encode totalRNA-seq levels (from 10 cell lines always minus and plus strand) (default: 0.0)
EncodetotalRNA-max	Maximum Encode totalRNA-seq level (from 10 cell lines, minus and plus strand separately) (default: 0.0)

chromHMM

Reference: Ernst J, Kellis M. Chromatin-state discovery and genome annotation with ChromHMM[J]. Nature protocols, 2017, 12(12): 2478-2492.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: ChromHMM annotates the noncoding genome using epigenomic data across multiple cell types by employing a multivariate hidden Markov model to infer chromatin-state signatures, generating genome-wide annotations and facilitating functional interpretations through automated enrichment analysis.

Field Name	Description
cHmm_E1	Number of 48 cell types in chromHMM state E1_poised (default: 1.92*)
cHmm_E2	Number of 48 cell types in chromHMM state E2_repressed (default: 1.92)
cHmm_E3	Number of 48 cell types in chromHMM state E3_dead (default: 1.92)
cHmm_E4	Number of 48 cell types in chromHMM state E4_dead (default: 1.92)
cHmm_E5	Number of 48 cell types in chromHMM state E5_repressed (default: 1.92)
cHmm_E6	Number of 48 cell types in chromHMM state E6_repressed (default: 1.92)
cHmm_E7	Number of 48 cell types in chromHMM state E7_weak (default: 1.92)
cHmm_E8	Number of 48 cell types in chromHMM state E8_gene (default: 1.92)
cHmm_E9	Number of 48 cell types in chromHMM state E9_gene (default: 1.92)
cHmm_E10	Number of 48 cell types in chromHMM state E10_gene (default: 1.92)
cHmm_E11	Number of 48 cell types in chromHMM state E11_gene (default: 1.92)
cHmm_E12	Number of 48 cell types in chromHMM state E12_distal (default: 1.92)
cHmm_E13	Number of 48 cell types in chromHMM state E13_distal (default: 1.92)
cHmm_E14	Number of 48 cell types in chromHMM state E14_distal (default: 1.92)
cHmm_E15	Number of 48 cell types in chromHMM state E15_weak (default: 1.92)
cHmm_E16	Number of 48 cell types in chromHMM state E16_tss (default: 1.92)
cHmm_E17	Number of 48 cell types in chromHMM state E17_proximal (default: 1.92)
cHmm_E18	Number of 48 cell types in chromHMM state E18_proximal (default: 1.92)
cHmm_E19	Number of 48 cell types in chromHMM state E19_tss (default: 1.92)
cHmm_E20	Number of 48 cell types in chromHMM state E20_poised (default: 1.92)
cHmm_E21	Number of 48 cell types in chromHMM state E21_dead (default: 1.92)
cHmm_E22	Number of 48 cell types in chromHMM state E22_repressed (default: 1.92)
cHmm_E23	Number of 48 cell types in chromHMM state E23_weak (default: 1.92)
cHmm_E24	Number of 48 cell types in chromHMM state E24_distal (default: 1.92)
cHmm_E25	Number of 48 cell types in chromHMM state E25_distal (default: 1.92)

❗ ORegAnno

The ORegAnno website is no longer available, and WGSA, which provides this data, describes only the following two fields.

Reference: Lesurf, R. et al. ORegAnno 3.0: a community-driven resource for curated regulatory annotation. Nucleic Acids Res. 44, D126-132 (2016).
Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
Brief Introduction: The Open Regulatory Annotation database (ORegAnno) is a resource for curated regulatory annotation. It contains information about regulatory regions, transcription factor binding sites, RNA binding sites, regulatory variants, haplotypes, and other regulatory elements.

Field Name	Description
#Chrom
Start
End
ORegAnno_ID
Species
Outcome
Type	The type of regulatory region by ORegAnno
Gene_Symbol
Gene_ID
Gene_Source
Regulatory_Element_Symbol
Regulatory_Element_ID
Regulatory_Element_Source
dbSNP_ID
PMID	The PMID of the paper describing the regulation
Dataset
Build
Strand

DICE

Reference: Schmiedel B J, Singh D, Madrigal A, et al. Impact of genetic polymorphisms on human immune cell gene expression[J]. Cell, 2018, 175(6): 1701-1715. e16.
Retrieve Source: https://dice-database.org/downloads , 2.23.2022
Brief Introduction: The DICE project aims to elucidate the role of common genetic variations in human disease by creating reference transcriptomic and epigenomic maps of immune cells, identifying functional SNPs affecting gene expression, and investigating regulatory mechanisms and cell-type specific effects, including those influenced by sex, to reveal how disease risk-associated polymorphisms impact pathogenesis.

Field Name	Description
DICE_rs_ID	dbSNP rsID
DICE_FILTER	Filter status
DICE_Cell_Type	Cell type reported in DICE
DICE_Gene	Ensembl ID
DICE_GeneSymbol	Gene symbol
DICE_Pvalue	P-value
DICE_Beta	The beta value indicates if expression for the alt allele is higher (if beta is positive) or lower (if beta is negative)
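
The sign convention of `DICE_Beta` can be read programmatically as in the sketch below. It assumes a record is available as a dictionary keyed by the field names above; the function name `describe_dice_effect`, the sample values, and the handling of a beta of exactly zero are assumptions made for this example.

```python
def describe_dice_effect(row):
    """Translate the sign of DICE_Beta into the direction of the alt-allele effect."""
    beta = float(row["DICE_Beta"])
    if beta > 0:
        direction = "higher"
    elif beta < 0:
        direction = "lower"
    else:
        # Not covered by the field description; treated here as no change.
        direction = "unchanged"
    return direction


# Made-up record for illustration only.
row = {"DICE_GeneSymbol": "GENE1", "DICE_Cell_Type": "CELL_TYPE_A", "DICE_Beta": "-0.42"}
print(describe_dice_effect(row))  # lower
```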

Geuvadis

Reference: Lappalainen T, Sammeth M, Friedländer M R, et al. Transcriptome and genome sequencing uncovers functional variation in humans[J]. Nature, 2013, 501(7468): 506-511.
Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
Brief Introduction: Geuvadis is the first uniformly processed RNA-seq data from 462 individuals across multiple populations, revealing extensive genetic variation in gene regulation and providing insights into causal regulatory mechanisms and disease-associated loci.

Field Name	Description
Geuvadis_eQTL_target_gene	Ensembl ID of the gene that the eQTL is associated with, from the Geuvadis project

GTEx

Reference: Lonsdale, J., Thomas, J., Salvatore, M. et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585 (2013).
Retrieve Source: https://storage.googleapis.com/adult-gtex/bulk-qtl/v8/single-tissue-cis-qtl/GTEx_Analysis_v8_eQTL.tar
Brief Introduction: The Genotype-Tissue Expression (GTEx) project aims to create a resource database and tissue bank to study the relationship between genetic variation and gene expression in human tissues.

Field Name	Description
variant_id	variant ID in the format {chr}_{pos}_{ref_seq}/{alt_seq}
gene_id	GENCODE/Ensembl gene ID
tss_distance	distance between variant and transcription start site. Positive when variant is downstream of the TSS, negative otherwise
ma_samples	number of samples carrying the minor allele
ma_count	total number of minor alleles across individuals
maf	minor allele frequency observed in the set of donors for a given tissue
pval_nominal	nominal p-value of the variant-gene association
slope	regression slope
slope_se	standard error of the regression slope
pval_nominal_threshold	nominal p-value threshold for calling a variant-gene pair significant for the gene
min_pval_nominal	smallest nominal p-value for the gene
pval_beta	beta-approximated permutation p-value for the gene
tissue_type	Human tissue type in GTEx
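
A few of the GTEx fields are typically interpreted together. The sketch below assumes that `pval_nominal` holds the nominal p-value of a variant-gene pair and `pval_nominal_threshold` the gene-level cutoff, and that a pair counts as significant when the p-value does not exceed the cutoff. The function name `summarize_eqtl` and the sample values are hypothetical.

```python
def summarize_eqtl(row):
    """Flag significance and the variant's side relative to the TSS."""
    pval = float(row["pval_nominal"])
    cutoff = float(row["pval_nominal_threshold"])
    tss = int(row["tss_distance"])
    # Positive tss_distance means the variant lies downstream of the TSS.
    if tss > 0:
        side = "downstream"
    elif tss < 0:
        side = "upstream"
    else:
        side = "at_tss"
    return {"significant": pval <= cutoff, "tss_side": side}


# Made-up record for illustration only.
row = {"pval_nominal": "1e-8", "pval_nominal_threshold": "2.5e-5", "tss_distance": "-1200"}
print(summarize_eqtl(row))  # {'significant': True, 'tss_side': 'upstream'}
```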

Transcription Factor

Reference: Rentzsch P, Witten D, Cooper G M, et al. CADD: predicting the deleteriousness of variants throughout the human genome[J]. Nucleic Acids Research, 2019, 47(D1): D886-D894.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: Transcription-factor-related information retrieved from CADD v1.7.

Field Name	Description
RemapOverlapTF	Remap number of different transcription factors binding (default: -0.5)
RemapOverlapCL	Remap number of different transcription factor - cell line combinations binding (default: -0.5)

GeneHancer

Reference: Fishilevich S, Nudel R, Rappaport N, et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards[J]. Database, 2017, 2017: bax028.
Retrieve Source: https://favor.genohub.org/
Brief Introduction: GeneHancer predictions are fully integrated in the widely used GeneCards Suite, whereby candidate enhancers and their annotations are displayed on every relevant GeneCard.

Field Name	Description
GeneHancer	Predicted human enhancer sites from the GeneHancer database.

Super Enhancer

Reference: Hnisz D, Abraham B J, Lee T I, et al. Super-enhancers in the control of cell identity and disease[J]. Cell, 2013, 155(4): 934-947.
Retrieve Source: https://favor.genohub.org/
Brief Introduction: This study produced a catalog of super-enhancers in a broad range of human cell types and found that super-enhancers are associated with genes that control and define the biology of these cells.

Field Name	Description
Super Enhancer	Predicted super-enhancer sites and targets in a range of human cell types.

Enhancer Finder

Reference: Erwin G D, Oksenberg N, Truty R M, et al. Integrating diverse datasets improves developmental enhancer prediction[J]. PLoS computational biology, 2014, 10(6): e1003677.
Retrieve Source: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003677#references
Brief Introduction: EnhancerFinder integrates DNA sequence motifs, evolutionary patterns, and functional genomics data to predict developmental enhancers and their tissue specificity, which outperforms single-data approaches, identifies 84,301 enhancers genome-wide, and provides functional annotations enriched near relevant genes and GWAS lead SNPs, with predictions validated in vivo and available as a UCSC Genome Browser track.

Field Name	Description
Enhancer_Finder_General_Prediction_MKL_Scores	Whether the site is within a predicted general developmental enhancer, along with MKL scores.
Enhancer_Finder_General_Prediction_H3K27ac_H3K4me1_Contexts	The H3K27ac and H3K4me1 marks from the feature data overlapping each predicted enhancer.
Enhancer_Finder_Limb_MKL_Scores	Whether the site is within a predicted limb-specific enhancer, along with MKL scores.
Enhancer_Finder_Brain_MKL_Scores	Whether the site is within a predicted brain-specific enhancer, along with MKL scores.
Enhancer_Finder_Heart_MKL_Scores	Whether the site is within a predicted heart-specific enhancer, along with MKL scores.

CAGE Promoter

Reference: The FANTOM Consortium and the RIKEN PMI and CLST (DGT). A promoter-level mammalian expression atlas[J]. Nature, 2014, 507(7493): 462-470.
Retrieve Source: https://favor.genohub.org/
Brief Introduction: Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) in human and mouse cells, revealing few 'housekeeping' genes, many composite promoters with cell-type-specific TSSs, and differing evolutionary rates for TSSs, linking key transcription factors to cell states, with the FANTOM5 project providing comprehensive mammalian cell-type-specific transcriptome profiles for biomedical research.

Field Name	Description
cage_promoter	CAGE-defined promoter sites from FANTOM5
cage_tc	CAGE tag cluster

CAGE Enhancer

Reference: Andersson R, Gebhard C, Miguel-Escalada I, et al. An atlas of active enhancers across human cell types and tissues[J]. Nature, 2014, 507(7493): 455-461.
Retrieve Source: https://favor.genohub.org/
Brief Introduction: CAGE Enhancer utilizes the FANTOM5 panel of samples, covering the majority of human tissues and cell types, to produce an atlas of active, in vivo-transcribed enhancers.

Field Name	Description
cage_enhancer	CAGE-defined permissive enhancer sites from FANTOM5

snoRNABase/miRBase

Reference:
1. miRBase: Kozomara, A. & Griffiths-Jones, S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 42, D68–D73 (2014).
2. snoRNABase: Lestrade, L. & Weber, M. J. snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 34, D158-162 (2006).
Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
Brief Introduction:
- miRBase: The miRBase database contains 24,521 microRNA loci from 206 species and includes a high-confidence subset based on deep sequencing data.
- snoRNABase: The snoRNA-LBME-db is an online database containing experimentally verified and predicted human C/D box and H/ACA box snoRNAs, and scaRNAs, which guide RNA modifications and maturation, providing detailed annotations, predicted base pairings.

Field Name	Description
sno_miRNA_name	The name of snoRNA or miRNA if the site is located within (from miRBase/snoRNABase)
sno_miRNA_type	the type of snoRNA or miRNA (from miRBase/snoRNABase)

👥 Allele Frequency

gnomAD

Reference: Chen S, Francioli L C, Goodrich J K, et al. A genomic mutational constraint map using variation in 76,156 human genomes[J]. Nature, 2024, 625(7993): 92-100.
Retrieve Source: https://gnomad.broadinstitute.org/news/2023-11-gnomad-v4-0/
Brief Introduction: The gnomAD database is composed of exome and genome sequences from around the world. We have removed cohorts that were recruited for pediatric disease, except for a small number of diverse cohorts where we have included unaffected relatives.

Field Name	Description
exomes_AF	Exomes Alternate allele frequency
exomes_AF_XX	Exomes Alternate allele frequency in XX samples
exomes_AF_XY	Exomes Alternate allele frequency in XY samples
exomes_AF_afr_XX	Exomes Alternate allele frequency in XX samples of African/African-American ancestry
exomes_AF_afr_XY	Exomes Alternate allele frequency in XY samples of African/African-American ancestry
exomes_AF_afr	Exomes Alternate allele frequency in samples of African/African-American ancestry
exomes_AF_amr_XX	Exomes Alternate allele frequency in XX samples of Latino ancestry
exomes_AF_amr_XY	Exomes Alternate allele frequency in XY samples of Latino ancestry
exomes_AF_amr	Exomes Alternate allele frequency in samples of Latino ancestry
exomes_AF_asj_XX	Exomes Alternate allele frequency in XX samples of Ashkenazi Jewish ancestry
exomes_AF_asj_XY	Exomes Alternate allele frequency in XY samples of Ashkenazi Jewish ancestry
exomes_AF_asj	Exomes Alternate allele frequency in samples of Ashkenazi Jewish ancestry
exomes_AF_eas_XX	Exomes Alternate allele frequency in XX samples of East Asian ancestry
exomes_AF_eas_XY	Exomes Alternate allele frequency in XY samples of East Asian ancestry
exomes_AF_eas	Exomes Alternate allele frequency in samples of East Asian ancestry
exomes_AF_fin_XX	Exomes Alternate allele frequency in XX samples of Finnish ancestry
exomes_AF_fin_XY	Exomes Alternate allele frequency in XY samples of Finnish ancestry
exomes_AF_fin	Exomes Alternate allele frequency in samples of Finnish ancestry
exomes_AF_mid_XX	Exomes Alternate allele frequency in XX samples of Middle Eastern ancestry
exomes_AF_mid_XY	Exomes Alternate allele frequency in XY samples of Middle Eastern ancestry
exomes_AF_mid	Exomes Alternate allele frequency in samples of Middle Eastern ancestry
exomes_AF_nfe_XX	Exomes Alternate allele frequency in XX samples of Non-Finnish European ancestry
exomes_AF_nfe_XY	Exomes Alternate allele frequency in XY samples of Non-Finnish European ancestry
exomes_AF_nfe	Exomes Alternate allele frequency in samples of Non-Finnish European ancestry
genomes_AF	Genomes Alternate allele frequency
genomes_AF_XX	Genomes Alternate allele frequency in XX samples
genomes_AF_XY	Genomes Alternate allele frequency in XY samples
genomes_AF_afr_XX	Genomes Alternate allele frequency in XX samples of African/African-American ancestry
genomes_AF_afr_XY	Genomes Alternate allele frequency in XY samples of African/African-American ancestry
genomes_AF_afr	Genomes Alternate allele frequency in samples of African/African-American ancestry
genomes_AF_amr_XX	Genomes Alternate allele frequency in XX samples of Latino ancestry
genomes_AF_amr_XY	Genomes Alternate allele frequency in XY samples of Latino ancestry
genomes_AF_amr	Genomes Alternate allele frequency in samples of Latino ancestry
genomes_AF_asj_XX	Genomes Alternate allele frequency in XX samples of Ashkenazi Jewish ancestry
genomes_AF_asj_XY	Genomes Alternate allele frequency in XY samples of Ashkenazi Jewish ancestry
genomes_AF_asj	Genomes Alternate allele frequency in samples of Ashkenazi Jewish ancestry
genomes_AF_eas_XX	Genomes Alternate allele frequency in XX samples of East Asian ancestry
genomes_AF_eas_XY	Genomes Alternate allele frequency in XY samples of East Asian ancestry
genomes_AF_eas	Genomes Alternate allele frequency in samples of East Asian ancestry
genomes_AF_fin_XX	Genomes Alternate allele frequency in XX samples of Finnish ancestry
genomes_AF_fin_XY	Genomes Alternate allele frequency in XY samples of Finnish ancestry
genomes_AF_fin	Genomes Alternate allele frequency in samples of Finnish ancestry
genomes_AF_mid_XX	Genomes Alternate allele frequency in XX samples of Middle Eastern ancestry
genomes_AF_mid_XY	Genomes Alternate allele frequency in XY samples of Middle Eastern ancestry
genomes_AF_mid	Genomes Alternate allele frequency in samples of Middle Eastern ancestry
genomes_AF_nfe_XX	Genomes Alternate allele frequency in XX samples of Non-Finnish European ancestry
genomes_AF_nfe_XY	Genomes Alternate allele frequency in XY samples of Non-Finnish European ancestry
genomes_AF_nfe	Genomes Alternate allele frequency in samples of Non-Finnish European ancestry
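
A common use of the per-ancestry fields is to find the population with the highest alternate allele frequency. The sketch below builds the field names from the ancestry codes used in the table above (`afr`, `amr`, `asj`, `eas`, `fin`, `mid`, `nfe`). The function name `max_population_af` and the way missing values are represented (`None`, empty string, `-` or `.`) are assumptions made for this example.

```python
POPULATIONS = ("afr", "amr", "asj", "eas", "fin", "mid", "nfe")
MISSING = (None, "", "-", ".")  # assumed placeholders for absent values


def max_population_af(record, prefix="exomes"):
    """Return (population, AF) with the highest frequency, or None if all are missing."""
    best = None
    for pop in POPULATIONS:
        raw = record.get(f"{prefix}_AF_{pop}")
        if raw in MISSING:
            continue
        af = float(raw)
        if best is None or af > best[1]:
            best = (pop, af)
    return best


# Made-up record for illustration only.
record = {"exomes_AF_afr": "0.012", "exomes_AF_eas": "0.2", "exomes_AF_nfe": "-"}
print(max_population_af(record))  # ('eas', 0.2)
```

Passing `prefix="genomes"` applies the same lookup to the `genomes_AF_*` fields.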

UK10K

Reference: The UK10K Consortium. The UK10K project identifies rare variants in health and disease[J]. Nature, 2015, 526(7571): 82-90.
Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
Brief Introduction: The UK10K project will enable researchers in the UK and beyond to better understand the link between low-frequency and rare genetic changes, and human disease caused by harmful changes to the proteins the body makes.

Field Name	Description
RS_ID	dbSNP ID.
DP	-
VQSLOD	-
AC	Alternative allele count in called genotypes in UK10K cohorts.
AN	Total allele count in called genotypes in UK10K cohorts.
AF	Alternative allele frequency in called genotypes in UK10K cohorts.
AC_TWINSUK	Alternative allele count in called genotypes in UK10K TWINSUK cohort.
AN_TWINSUK	Total allele count in called genotypes in UK10K TWINSUK cohort.
AF_TWINSUK	Alternative allele frequency in called genotypes in UK10K TWINSUK cohort.
AC_ALSPAC	Alternative allele count in called genotypes in UK10K ALSPAC cohort.
AN_ALSPAC	Total allele count in called genotypes in UK10K ALSPAC cohort.
AF_ALSPAC	Alternative allele frequency in called genotypes in UK10K ALSPAC cohort.
AF_AFR	-
AF_AMR	-
AF_ASN	-
AF_EUR	-
AF_MAX	-
ESP_MAF	-
CSQ	Consequence of the given variant, e.g. ENST00000342066:SAMD11:synonymous_variant:21:7:Q>Q
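
The CSQ example in the table above is colon-delimited. A minimal Python sketch for splitting it follows. It assumes, based only on that single example, that the first three parts are the transcript ID, the gene symbol and the consequence term. The remaining parts are not described on this page, so they are kept as raw strings. The names `Uk10kCsq` and `parse_uk10k_csq` are hypothetical helpers and not part of SynMall.

```python
from typing import NamedTuple, Tuple


class Uk10kCsq(NamedTuple):
    transcript_id: str
    gene_symbol: str
    consequence: str
    extra: Tuple[str, ...]  # trailing parts; their meaning is not documented here


def parse_uk10k_csq(value: str) -> Uk10kCsq:
    """Split one colon-delimited CSQ entry into named parts."""
    parts = value.split(":")
    if len(parts) < 3:
        raise ValueError(f"unexpected CSQ entry: {value!r}")
    transcript_id, gene_symbol, consequence, *rest = parts
    return Uk10kCsq(transcript_id, gene_symbol, consequence, tuple(rest))


csq = parse_uk10k_csq("ENST00000342066:SAMD11:synonymous_variant:21:7:Q>Q")
print(csq.transcript_id, csq.gene_symbol, csq.consequence, csq.extra)
```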

ExAC

Reference: Karczewski K J, Weisburd B, Thomas B, et al. The ExAC browser: displaying reference data information from over 60 000 exomes[J]. Nucleic acids research, 2017, 45(D1): D840-D845.
Retrieve Source:
- ExAC, ExAC_nonpsych are retrieved from https://annovar.openbioinformatics.org/en/latest/
- ExAC_nonTCGA is retrieved from https://sites.google.com/site/jpopgen/wgsa
Brief Introduction:

Field Name	Description
ExAC_ALL	Allele frequency in total ExAC samples
ExAC_AFR	Allele frequency in African & African American ExAC samples
ExAC_AMR	Allele frequency in American ExAC samples
ExAC_EAS	Allele frequency in East Asian ExAC samples
ExAC_FIN	Allele frequency in Finnish ExAC samples
ExAC_NFE	Allele frequency in Non-Finnish European ExAC samples
ExAC_OTH	Allele frequency in other ExAC samples
ExAC_SAS	Allele frequency in South Asian ExAC samples
ExAC_nonpsych_ALL	Allele frequency in total ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_AFR	Allele frequency in African & African American ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_AMR	Allele frequency in American ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_EAS	Allele frequency in East Asian ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_FIN	Allele frequency in Finnish ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_NFE	Allele frequency in Non-Finnish European ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_OTH	Allele frequency in other ExAC samples excluding psychiatric cohorts
ExAC_nonpsych_SAS	Allele frequency in South Asian ExAC samples excluding psychiatric cohorts
ExAC_nonTCGA_QUAL	Phred-scaled quality score for the assertion made in ALT
ExAC_nonTCGA_FILTER	PASS if this position has passed all filters
ExAC_nonTCGA_ALL	Allele frequency in total ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_AFR	Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in African & African American ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_AMR	Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in American ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_EAS	Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in East Asian ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_FIN	Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in Finnish ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_NFE	Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in Non-Finnish European ExAC samples excluding TCGA cohorts
ExAC_nonTCGA_Adj	Adjusted Alt allele frequency (DP >= 10 & GQ >= 20) in total ExAC samples excluding TCGA cohorts
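
A common use of the per-population columns is to take the highest frequency across populations. The Python sketch below does this for the ExAC_* columns listed above. It assumes that a record is available as a mapping from field name to string, and that missing values are written as `.`, `-` or an empty string. These missing-value markers and the function name `max_population_af` are assumptions for illustration.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Population-level columns from the table above.
# ExAC_ALL and ExAC_OTH are left out on purpose.
EXAC_POPULATION_FIELDS = (
    "ExAC_AFR", "ExAC_AMR", "ExAC_EAS", "ExAC_FIN", "ExAC_NFE", "ExAC_SAS",
)


def max_population_af(
    record: Mapping[str, str],
    fields: Sequence[str] = EXAC_POPULATION_FIELDS,
    missing: Sequence[str] = (".", "-", ""),  # assumed missing-value markers
) -> Optional[Tuple[str, float]]:
    """Return (field name, frequency) for the largest non-missing value."""
    best: Optional[Tuple[str, float]] = None
    for name in fields:
        raw = record.get(name, ".")
        if raw in missing:
            continue
        af = float(raw)
        if best is None or af > best[1]:
            best = (name, af)
    return best


# Hypothetical record, for illustration only.
example = {"ExAC_AFR": "0.001", "ExAC_NFE": "0.02", "ExAC_EAS": "."}
print(max_population_af(example))
```

The same function can be pointed at the ExAC_nonpsych_* or ExAC_nonTCGA_* columns by passing a different `fields` sequence.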

Kaviar

Reference: Glusman G, Caballero J, Mauldin D E, et al. Kaviar: an accessible system for testing SNV novelty[J]. Bioinformatics, 2011, 27(22): 3216-3217.
Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
Brief Introduction:

Field Name	Description
Kaviar_AF	Allele frequency in Kaviar
Kaviar_AC	Allele count in Kaviar
Kaviar_AN	Total allele number in Kaviar

GME

Reference: Scott E M, Halees A, Itan Y, et al. Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery[J]. Nature genetics, 2016, 48(9): 1071-1076.
Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
Brief Introduction:

Field Name	Description
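GME_AF	Allele frequency in all GME samples
GME_NWA	Allele frequency in Northwest African samples
GME_NEA	Allele frequency in Northeast African samples
GME_AP	Allele frequency in Arabian Peninsula samples
GME_Israel	Allele frequency in Israel samples
GME_SD	Allele frequency in Syrian Desert samples
GME_TP	Allele frequency in Turkish Peninsula samples
GME_CA	Allele frequency in Central Asian samples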

NCI-60

Reference: Reinhold W C, Varma S, Sousa F, et al. NCI-60 whole exome sequencing and pharmacological CellMiner analyses[J]. PloS one, 2014, 9(7): e101670.
Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
Brief Introduction:

Field Name	Description
NCI60_AF	Allele frequency in the NCI-60 cell line panel

ABraOM

Reference: Naslavsky M S, Yamamoto G L, de Almeida T F, et al. Exomic variants of an elderly cohort of Brazilians in the ABraOM database[J]. Human mutation, 2017, 38(7): 751-763.
Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
Brief Introduction:

Field Name	Description
ABRAOM_AF
ABRAOM_Filter
ABRAOM_Cegh_Filter

ESP6500

Reference: Fu W, O'Connor T D, Jun G, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants[J]. Nature, 2013, 493(7431): 216-220.
Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
Brief Introduction:

Field Name	Description
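esp6500siv2_all	Allele frequency in all ESP6500 samples
esp6500siv2_aa	Allele frequency in African American ESP6500 samples
esp6500siv2_ea	Allele frequency in European American ESP6500 samples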

TOPMed BRAVO

Reference: Taliun D, Harris D N, Kessler M D, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program[J]. Nature, 2021, 590(7845): 290-299.
Retrieve Source: https://favor.genohub.org/
Brief Introduction:

Field Name	Description
bravo_an	TOPMed Bravo Genome Allele Number.
bravo_af	TOPMed Bravo Genome Allele Frequency.
filter_status	TOPMed QC status of the given variant.

Primate

Allele frequencies from the 5 primate species used in AlphaMissense; we have mapped them onto GRCh38.

Field Name	Reference	Retrieve Source
Bonobos_AF	Genetic variation in Pan species is shaped by demographic history and harbors lineage-specific functions[J]. Genome biology and evolution, 2019, 11(4): 1178-1191.	https://figshare.com/articles/dataset/Han_etal_Data_tsv_gz/7855850
Gorilla_AF	Great ape genetic diversity and population history[J]. Nature, 2013, 499(7459): 471-475.	https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Gorilla.vcf.gz
Pan_troglodytes_AF	Same as above	https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Pan_troglodytes.vcf.gz
Pongo_pygmaeus_AF	Same as above	https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Pongo_pygmaeus.vcf.gz
Pongo_abelii_AF	Same as above	https://eichlerlab.gs.washington.edu/greatape/data/VCFs/SNPs/Pongo_abelii.vcf.gz

🧬 Conservation Score

siPhy

Reference: Garber M, Guttman M, Clamp M, et al. Identifying novel constrained elements by exploiting biased substitution patterns[J]. Bioinformatics, 2009, 25(12): i54-i62.
Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
Brief Introduction: siPhy leverages deeply sequenced clades to identify evolutionary selection by detecting both rate-based conservation and substitution patterns indicative of natural selection, employing a statistical method for biased nucleotide substitutions, a learning algorithm to infer site-specific biases from sequence alignments, and a hidden Markov model to detect constrained elements.

Field Name	Description
siPhy_rankscore	The rank of the SiPhy_29way_logOdds score among all SiPhy_29way_logOdds scores in genome

bStatistic

Reference: McVicker G, Gordon D, Davis C, et al. Widespread genomic signatures of natural selection in hominid evolution[J]. PLoS genetics, 2009, 5(5): e1000471.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: Selection on genomic functional elements can be detected by its effects on population diversity at linked neutral sites, as shown by our analysis of human polymorphisms and sequence differences among five primate species relative to conserved sequence features.

Field Name	Description
bStatistic	Background selection (B) value estimate. Ranges from 0 to 1000. It estimates the expected fraction (scaled by 1000) of neutral diversity present at a site. Values close to 0 represent near-complete removal of diversity as a result of background selection, and values near 1000 indicate an absence of background selection.
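
Because the B value is stored on a 0 to 1000 scale, it can be converted back to a fraction by dividing by 1000. The short Python sketch below shows this conversion. The function name `b_to_fraction` is a hypothetical helper.

```python
def b_to_fraction(b_statistic: float) -> float:
    """Convert a B value on the 0-1000 scale to a fraction between 0 and 1."""
    if not 0 <= b_statistic <= 1000:
        raise ValueError(f"bStatistic out of range: {b_statistic}")
    return b_statistic / 1000.0


retained = b_to_fraction(850)  # expected fraction of neutral diversity retained
removed = 1.0 - retained       # fraction attributed to background selection
print(retained, removed)
```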

FitCons

Reference: Gulko B, Hubisz M J, Gronau I, et al. Probabilities of fitness consequences for point mutations across the human genome[J]. Nature genetics, 2015, 47(3): 276-283.
Retrieve Source: https://sites.google.com/site/jpopgen/wgsa
Brief Introduction: FitCons, a novel computational method, estimates the probability that a point mutation at each genome position will influence fitness, using high-throughput functional genomic data to cluster genomic positions and assess fitness consequences.

Field Name	Description
integrated_fitCons_score	FitCons scores (i6) based on function evidence from multiple cell types, the higher the score the more potential for interesting genomic function
integrated_confidence_value	Confidence value for the integrated_fitCons_score: 0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05), 2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)
GM12878_fitCons_score	FitCons scores (gm) based on function evidence from the GM12878 cell type, the higher the score the more potential for interesting genomic function
GM12878_confidence_value	Confidence value for the GM12878_fitCons_score: 0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05), 2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)
H1-hESC_fitCons_score	FitCons scores (h1) based on function evidence from the H1-hESC cell type, the higher the score the more potential for interesting genomic function
H1-hESC_confidence_value	Confidence value for the H1-hESC_fitCons_score: 0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05), 2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)
HUVEC_fitCons_score	FitCons scores (hu) based on function evidence from the HUVEC cell type, the higher the score the more potential for interesting genomic function
HUVEC_confidence_value	Confidence value for the HUVEC_fitCons_score: 0 - High confidence values (p<~.003), 1 - Likely Significant (p<.05), 2 - Likely Informative (p<.25), 3 - Best estimate (p>=.25)
integrated_fitCons_score_rankscore	Rank of the integrated_fitCons_score among all integrated_fitCons_scores in genome
GM12878_fitCons_score_rankscore	Rank of the GM12878_fitCons_score among all GM12878_fitCons_scores in genome
H1-hESC_fitCons_score_rankscore	Rank of the H1-hESC_fitCons_score among all H1-hESC_fitCons_scores in genome
HUVEC_fitCons_score_rankscore	Rank of the HUVEC_fitCons_score among all HUVEC_fitCons_scores in genome
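
The confidence codes 0 to 3 described in the table above can be turned into readable labels with a small lookup. This is a minimal Python sketch. The names `FITCONS_CONFIDENCE` and `describe_confidence` are hypothetical, and unrecognised or missing codes are mapped to "unknown" as an assumption.

```python
# Labels copied from the confidence value descriptions in the table above.
FITCONS_CONFIDENCE = {
    0: "High confidence (p < ~0.003)",
    1: "Likely significant (p < 0.05)",
    2: "Likely informative (p < 0.25)",
    3: "Best estimate (p >= 0.25)",
}


def describe_confidence(code) -> str:
    """Map a confidence code (int or numeric string) to its label."""
    try:
        return FITCONS_CONFIDENCE[int(code)]
    except (KeyError, ValueError, TypeError):
        return "unknown"


print(describe_confidence("1"))
```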

PhastCons

Reference: Siepel A, Bejerano G, Pedersen J S, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes[J]. Genome research, 2005, 15(8): 1034-1050.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: PhastCons, a program based on a two-state phylogenetic hidden Markov model, was used to conduct a comprehensive search for conserved elements across vertebrate genomes, utilizing genome-wide alignments of five vertebrate species, four insect species, two Caenorhabditis species, and seven Saccharomyces species.

Field Name	Description
priPhCons	Primate PhastCons conservation score (excl. human) (default: 0.0)
mamPhCons	Mammalian PhastCons conservation score (excl. human) (default: 0.0)
verPhCons	Vertebrate PhastCons conservation score (excl. human) (default: 0.0)

PhyloP

Reference: Pollard K S, Hubisz M J, Rosenbloom K R, et al. Detection of nonneutral substitution rates on mammalian phylogenies[J]. Genome research, 2010, 20(1): 110-121.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: PhyloP addresses the broader problem of detecting departures from neutral nucleotide substitution rates in either direction, potentially in a clade-specific manner, using four statistical tests (likelihood ratio, score, exact distributions, GERP).

Field Name	Description
priPhyloP	Primate PhyloP score (excl. human) (default: -0.029)
mamPhyloP	Mammalian PhyloP score (excl. human) (default: -0.005)
verPhyloP	Vertebrate PhyloP score (excl. human) (default: 0.042)

GERP++

Reference: Davydov E V, Goode D L, Sirota M, et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++[J]. PLoS computational biology, 2010, 6(12): e1001025.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: GERP++ uses maximum likelihood evolutionary rate estimation for position-specific scoring. In contrast to previous bottom-up methods, it employs a novel dynamic programming approach to subsequently define constrained elements.

Field Name	Description
GerpRS	Gerp element score (default: 0)
GerpRSpval	Gerp element p-Value (default: 0)
GerpN	Neutral evolution score defined by GERP++ (default: 3.0)
GerpS	Rejected Substitution score defined by GERP++ (default: -0.2)

Zoonomia

Reference: Christmas M J, Kaplow I M, Genereux D P, et al. Evolutionary constraint and innovation across hundreds of placental mammals[J]. Science, 2023, 380(6643): eabn3943.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction: Zoonomia, the largest comparative genomics resource for mammals, aligns genomes of 240 species to identify bases likely affecting fitness and disease risk, revealing 332 million evolutionarily constrained bases in the human genome, with many outside protein-coding exons, and associating changes in genes and regulatory elements with unique mammalian traits that could inform therapeutic development.

Field Name	Description
ZooPriPhyloP	Zoonomia Primate PhyloP conservation score (43 genomes) (default: 0.005)
ZooVerPhyloP	Zoonomia Vertebrate PhyloP conservation score (241 vertebrate genomes) (default: -0.1460)
ZooRoCC	Zoonomia Runs of Contiguous Constraint (default: 0)
ZooUCE	Zoonomia UltraConserved Elements (default: 0)
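
The "(default: ...)" values listed in the PhastCons, PhyloP, GERP++ and Zoonomia tables can be collected into a single lookup. The Python sketch below assumes these defaults are the values to substitute when an annotation is missing, and that missing entries appear as `None`, an empty string, `.` or `NA`. Both points are assumptions for illustration, and `fill_defaults` is a hypothetical helper.

```python
# Defaults copied from the "(default: ...)" notes in the tables above.
CONSERVATION_DEFAULTS = {
    "priPhCons": 0.0, "mamPhCons": 0.0, "verPhCons": 0.0,
    "priPhyloP": -0.029, "mamPhyloP": -0.005, "verPhyloP": 0.042,
    "GerpRS": 0.0, "GerpRSpval": 0.0, "GerpN": 3.0, "GerpS": -0.2,
    "ZooPriPhyloP": 0.005, "ZooVerPhyloP": -0.1460,
    "ZooRoCC": 0.0, "ZooUCE": 0.0,
}


def fill_defaults(record, missing=(None, "", ".", "NA")):
    """Return a copy of record with conservation fields parsed or defaulted."""
    filled = dict(record)
    for field, default in CONSERVATION_DEFAULTS.items():
        value = filled.get(field)
        filled[field] = default if value in missing else float(value)
    return filled


# Hypothetical record, for illustration only.
row = fill_defaults({"GerpN": "5.1", "priPhyloP": "."})
print(row["GerpN"], row["priPhyloP"], row["verPhyloP"])
```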

👶🏻 De novo Variants

De novo mutations (DNMs) are variants observed in an individual but not in either parent. Variants of this type have been reported to play prominent roles in several genetic diseases.

❗ Gene4Denovo

Neither the Gene4Denovo website nor its publication describes the field names below; this annotation comes from ANNOVAR.

Reference: Zhao G, Li K, Li B, et al. Gene4Denovo: an integrated database and analytic platform for de novo mutations in humans[J]. Nucleic acids research, 2020, 48(D1): D913-D926.
Retrieve Source: https://annovar.openbioinformatics.org/en/latest/
Brief Introduction: Gene4Denovo integrated 580 799 DNMs, including 30 060 coding DNMs detected by WES/WGS from 23 951 individuals across 24 phenotypes and prioritized a list of candidate genes with different degrees of statistical evidence, including 346 genes with false discovery rates <0.05.

Field Name	Description
DN ID	The variant identifier in Gene4Denovo, such as dn65354.
Patient ID
Phenotype	Annotated information about gene function according to OMIM, ClinVar, denovo-db, MGI, HPO.
Platform
Study
Pubmed ID

denovo-db

Reference: Turner T N, Yi Q, Krumm N, et al. denovo-db: A compendium of human de novo variants[J]. Nucleic acids research, 2017, 45(D1): D804-D811.
Retrieve Source: https://denovo-db.gs.washington.edu/denovo-db/, we only retrieved non-SSC Samples due to terms of use of denovo-db.
Brief Introduction: denovo-db contained 40 different studies and 32,991 de novo variants from 23,098 trios.

Field Name	Description
SAMPLE_CT	Observed Sample Count
NumProbands	The total number of probands involved in the study.
SampleIDs	If some type of sample identifier is given in the study we use that exactly. If there is no sample identifier we use the name of the study and start numbering such that every variant has a unique sample identifier.
SequenceType	The sequence type used in the study.
Validation	The validation status describes the result of some orthogonal validation method (for example Sanger sequencing). The values are either yes or unknown meaning either valid or not known, respectively. Any variants that are not valid are removed early in the pipeline and are not represented in denovo-db.
PrimaryPhenotype	The primary phenotype is the main phenotype of the patient for inclusion in the study.
StudyName	This is the name of the study.
PubmedId	Pubmed ID for the study publication.
NumControls	The total number of controls involved in the study.

📦 Other

Local Nucleotide Diversity

Reference: Gazal, S., Finucane, H., Furlotte, N. et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nat Genet 49, 1421–1427 (2017).
Retrieve Source: https://favor.genohub.org/
Brief Introduction:

Field Name	Description
nucdiv	Nucleotide diversity measures how likely the region is to diversify. Range: [0.05, 60.25] (default: 0).
recombination_rate	Recombination rate measures how likely the region is to undergo recombination. Range: [0, 54.96]

Mappability

Reference: Mehran Karimzadeh, Carl Ernst, Anshul Kundaje, Michael M Hoffman, Umap and Bismap: quantifying genome and methylome mappability, Nucleic Acids Research, Volume 46, Issue 20, 16 November 2018, Page e120
Retrieve Source: https://favor.genohub.org/
Brief Introduction:

Field Name	Description
k*_bismap	Mappability of the bisulfite-converted genome. Bisulfite sequencing approaches used to identify DNA methylation introduce large numbers of reads that map to multiple regions. This annotation identifies mappability of the bisulfite-converted genome. Range: [0, 1] (default: 0).
k*_umap	Mappability of unconverted genome. It measures the extent to which a position can be uniquely mapped by sequence reads. Lower mappability means the estimates of genomic and epigenomic characteristics from sequencing assays are less reliable, and the region has increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Range: [0, 1] (default: 0).
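
The `k*` prefix in the field names above reads as a wildcard. The Python sketch below assumes the `*` stands for an integer read length, giving column names such as `k24_umap`. These concrete names are hypothetical examples and are not confirmed by this page. The pattern picks such columns out of a header and sorts them by type and length.

```python
import re

# Assumed naming scheme: "k" + integer length + "_umap" or "_bismap".
MAPPABILITY_FIELD = re.compile(r"^k(\d+)_(umap|bismap)$")


def mappability_columns(columns):
    """Return (name, k, kind) for each mappability column, sorted by kind then k."""
    found = []
    for name in columns:
        match = MAPPABILITY_FIELD.match(name)
        if match:
            found.append((name, int(match.group(1)), match.group(2)))
    return sorted(found, key=lambda item: (item[2], item[1]))


# Hypothetical header, for illustration only.
header = ["k24_umap", "k100_bismap", "bravo_af", "k36_umap"]
print(mappability_columns(header))
```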

Mutation Density

Reference: Rentzsch P, Witten D, Cooper G M, et al. CADD: predicting the deleteriousness of variants throughout the human genome[J]. Nucleic Acids Research, 2019, 47(D1): D886-D894.
Retrieve Source: https://cadd.gs.washington.edu/download
Brief Introduction:

Field Name	Description
Dist2Mutation	Distance between the closest BRAVO SNV up and downstream (position itself excluded) (default: 0*)
Freq100bp	Number of frequent (MAF > 0.05) BRAVO SNV in 100 bp window nearby (default: 0)
Rare100bp	Number of rare (MAF < 0.05) BRAVO SNV in 100 bp window nearby (default: 0)
Sngl100bp	Number of single occurrence BRAVO SNV in 100 bp window nearby (default: 0)
Freq1000bp	Number of frequent (MAF > 0.05) BRAVO SNV in 1000 bp window nearby (default: 0)
Rare1000bp	Number of rare (MAF < 0.05) BRAVO SNV in 1000 bp window nearby (default: 0)
Sngl1000bp	Number of single occurrence BRAVO SNV in 1000 bp window nearby (default: 0)
Freq10000bp	Number of frequent (MAF > 0.05) BRAVO SNV in 10000 bp window nearby (default: 0)
Rare10000bp	Number of rare (MAF < 0.05) BRAVO SNV in 10000 bp window nearby (default: 0)
Sngl10000bp	Number of single occurrence BRAVO SNV in 10000 bp window nearby (default: 0)
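
The Freq, Rare and Sngl counts above classify neighbouring BRAVO SNVs by minor allele frequency within a window. The Python sketch below illustrates that classification under several assumptions that this page does not confirm: the window is centred on the query position, a single occurrence means an allele count of 1, and singletons are counted only under Sngl. The names `Snv` and `window_counts` are hypothetical, and this is not the CADD implementation.

```python
from typing import Dict, Iterable, NamedTuple


class Snv(NamedTuple):
    position: int
    maf: float          # minor allele frequency
    allele_count: int   # number of observed alternate alleles


def window_counts(center: int, snvs: Iterable[Snv], window: int = 100) -> Dict[str, int]:
    """Count frequent, rare and single-occurrence SNVs in a centred window."""
    half = window // 2  # assumption: window is centred on the query position
    counts = {f"Freq{window}bp": 0, f"Rare{window}bp": 0, f"Sngl{window}bp": 0}
    for snv in snvs:
        if abs(snv.position - center) > half:
            continue
        if snv.allele_count == 1:   # assumption: singletons counted separately
            counts[f"Sngl{window}bp"] += 1
        elif snv.maf > 0.05:
            counts[f"Freq{window}bp"] += 1
        elif snv.maf < 0.05:
            counts[f"Rare{window}bp"] += 1
        # MAF exactly 0.05 falls in neither class as the thresholds are written
    return counts


# Hypothetical neighbouring SNVs, for illustration only.
neighbours = [
    Snv(100, 0.2, 500),
    Snv(120, 0.001, 3),
    Snv(140, 0.00001, 1),
    Snv(400, 0.3, 900),
]
print(window_counts(110, neighbours, window=100))
print(window_counts(110, neighbours, window=1000))
```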

🦍 Different Species sSNV

We further collected sSNVs from Ensembl Variation 112 and mapped them onto GRCh38.

Field Name	Description
species_chromosome	Chromosome of this variant
species_position	Position of this variant
rs_id	dbSNP rsID
reference_allele	Reference allele of this variant
alternate_allele	Alternate allele of this variant
evidence_status	Support evidence of this variant, see details in https://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#evidence_status
original_source	The original source this variant comes from.
RefPep	Amino acid translated with reference allele.
VarPep	Variant peptide that is translated as a result of a missense variant. Format=Index|Amino_acid|Feature_id. Index identifies the missense variant; Amino_acid is the amino acid translated with the missense variant; Feature_id is the feature overlapping the variant.
VE	Variant effect of a variant overlapping a sequence feature, as computed by the Ensembl variant effect pipeline. Format=Consequence|Index|Feature_type|Feature_id. Index identifies the variant sequence for which the effect is described.
CSQ	Consequence annotations from Ensembl's Variant Effect Pipeline. Format=Allele|Consequence|Feature_type|Feature|Amino_acids|SIFT
ensembl_transcript_id	Transcript information of this variant.
species_variant	Variant id, format as {chrom}_{position}_{ref}/{alt}
hg38_variant	The human synonymous mutation, in GRCh38 coordinates, that this species' variant maps to.
reference_genome	Reference genome of this variant.
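
The `species_variant` field uses the `{chrom}_{position}_{ref}/{alt}` layout. The Python sketch below parses that layout. It assumes single-base A/C/G/T alleles, since these are SNVs, and allows underscores inside the chromosome or scaffold name. The names `Variant` and `parse_variant_id` are hypothetical, and the example identifiers are made up for illustration.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str


# The chromosome part is greedy so that scaffold names containing "_" still parse.
VARIANT_ID = re.compile(r"^(?P<chrom>.+)_(?P<pos>\d+)_(?P<ref>[ACGT])/(?P<alt>[ACGT])$")


def parse_variant_id(variant_id: str) -> Variant:
    """Parse a {chrom}_{position}_{ref}/{alt} identifier into its parts."""
    match = VARIANT_ID.match(variant_id)
    if match is None:
        raise ValueError(f"not a {{chrom}}_{{position}}_{{ref}}/{{alt}} id: {variant_id!r}")
    return Variant(match["chrom"], int(match["pos"]), match["ref"], match["alt"])


print(parse_variant_id("1_925952_G/A"))
print(parse_variant_id("NW_0036.1_42_C/T"))
```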
