4. Annotation Resource: DisGeNET - bcb420-2022/RuoXuan_Wang GitHub Wiki

Objective

Learn about an annotation data set

Duration

Time estimated: 1.5 h; taken 2 h;
date started: 2022-02-28; date completed: 2022-03-01

Progress

  • Find an annotation data set for human genes - any data set that adds functional, process, location, disease status ... to a set of genes.
  • Questions to Answer:

1. What sort of data is it? What sort of information does it offer us?

  • DisGeNET collects genes and variants associated with human diseases from publicly available databases.
  • It is divided into Gene-Disease Associations (GDAs), Variant-Disease Associations (VDAs), and Disease-Disease Associations. There is also an IntAct Coronavirus dataset.
  • The current version of DisGeNET (v7.0) contains 1,134,942 GDAs, between 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369,554 VDAs, between 194,515 variants and 14,155 diseases, traits, and phenotypes.

2. When and where was it published? Was it published?

3. Is this annotation set updated regularly or is it a static source?

  • It is updated fairly regularly: there have been yearly updates since the publication of v2 in 2015. The exception is 2018, when there seems to have been no update, as v5 and v6 were published in 2017 and 2019 respectively.

4. Where can I find this data? (link to the download web address or ftp site or publication where it can be found)

5. How is the data formatted and released? Does it exist in some sort of standard file format?

  • Gene-Disease Associations, Variant-Disease Associations, and Disease-Disease Associations are released as tab-separated files in the gmt format (Gene Matrix Transposed file format). Each row represents one gene set and contains: ID (tab) Description (tab) Gene (tab) Gene (tab), with one further tab-separated field for each additional gene.
  • There are also SQLite files: DisGeNET SQLite 2020 - v7.0.
  • The DisGeNET association type ontology is an OWL ontology that has been integrated into the Semanticscience Integrated Ontology (SIO).
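
The row layout described above (ID, Description, then one gene per field) can be read with a few lines of Python. This is only a minimal sketch of a generic gmt reader, not DisGeNET's own tooling; the function names `parse_gmt_line` and `parse_gmt` and the sample row are made up for illustration, and the sample values are not real DisGeNET records.

```python
from typing import Dict, Iterable, List, Tuple


def parse_gmt_line(line: str) -> Tuple[str, str, List[str]]:
    """Split one gmt row into (ID, description, list of genes)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a gmt row needs at least an ID and a description")
    set_id, description = fields[0], fields[1]
    # Drop empty fields, e.g. the one produced by a trailing tab.
    genes = [gene for gene in fields[2:] if gene]
    return set_id, description, genes


def parse_gmt(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each gene-set ID to its genes, skipping blank lines."""
    gene_sets: Dict[str, List[str]] = {}
    for line in lines:
        if not line.strip():
            continue
        set_id, _description, genes = parse_gmt_line(line)
        gene_sets[set_id] = genes
    return gene_sets


# Hypothetical row: placeholder ID, description, and gene names.
sample = ["C0000001\tExample disease\tGENE1\tGENE2\t\n"]
gene_sets = parse_gmt(sample)
```

With a real download, the list `sample` would be replaced by an open file handle, since iterating over a text file yields one row per line.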

6. What identifiers are associated with these annotations?

  • In the gmt files, ID = Disease Concept Unique Identifier; Description = Disease Name; Gene = identified by Entrez Gene ID or HGNC gene symbol; the gene name, along with its UniProt accession, may also be stored.
  • For diseases, entries are mapped to UMLS® CUIs. The source databases use MeSH identifiers, MIM identifiers, or disease names for disease terms.
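
Because genes may appear either as Entrez Gene IDs or as HGNC symbols, it can help to tell the two apart before joining against another data set. The sketch below is an assumption-laden illustration, not part of DisGeNET: it treats an all-digit string as an Entrez Gene ID and anything else as a symbol, and it checks only the usual shape of a UMLS CUI (the letter C followed by seven digits), not whether the concept actually exists. The function names are hypothetical.

```python
import re

# Shape check only: "C" followed by seven digits (assumed CUI layout).
CUI_PATTERN = re.compile(r"C[0-9]{7}")
ENTREZ_PATTERN = re.compile(r"[0-9]+")


def looks_like_cui(identifier: str) -> bool:
    """Return True if the string has the shape of a UMLS CUI."""
    return CUI_PATTERN.fullmatch(identifier) is not None


def classify_gene_identifier(identifier: str) -> str:
    """Guess whether a gene field is an Entrez Gene ID or a gene symbol."""
    if ENTREZ_PATTERN.fullmatch(identifier):
        return "entrez_gene_id"
    return "gene_symbol"
```

For example, `classify_gene_identifier("672")` returns `"entrez_gene_id"` and `classify_gene_identifier("BRCA1")` returns `"gene_symbol"`; a symbol made only of digits would be misclassified, which is one reason to prefer a single identifier type throughout an analysis.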

  • Record this annotation source in your journal and add it to the list of annotations
    • this is the journal entry, will add to Student_Wiki now

Conclusion and outlook

  • Exploring disease and gene connections on DisGeNET was very interesting.
  • There are many databases and annotation sources available, which can make it hard to find a comprehensive source for analysis. That is why standardized use of identifiers and up-to-date sources are so important.
  • Will have to start Assignment 2 soon

References

  • Piñero, J., Queralt-Rosinach, N., Bravo, À., Deu-Pons, J., Bauer-Mehren, A., Baron, M., Sanz, F., & Furlong, L. I. (2015). DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: The Journal of Biological Databases and Curation, 2015, bav028. https://doi.org/10.1093/database/bav028
  • Piñero, J., Saüch, J., Sanz, F., & Furlong, L. I. (2021). The DisGeNET cytoscape app: Exploring and visualizing disease genomics data. Computational and Structural Biotechnology Journal, 19, 2960–2967. https://doi.org/10.1016/j.csbj.2021.05.015