4. Annotation Resource: DisGeNET - bcb420-2022/RuoXuan_Wang GitHub Wiki

Objective

Learn about an annotation data set

Time estimated: 1.5 h; taken 2 h;
date started: 2022-02-28; date completed: 2022-03-01

Find an annotation data set for human genes - any data set that adds functional, process, location, disease status ... to a set of genes.
- looked for databases, finally decided to find a disease annotation dataset
- DisGeNET at https://www.disgenet.org/home/
Questions to Answer:

DisGeNET collects genes and variants associated with human diseases from publicly available databases.
It is divided into Gene-Disease Associations (GDAs), Variant-Disease Associations (VDAs), and Disease-Disease Associations. There is also an IntAct Coronavirus dataset.
The current version of DisGeNET (v7.0) contains 1,134,942 GDAs, between 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369,554 VDAs, between 194,515 variants and 14,155 diseases, traits, and phenotypes.

Yes. It is updated fairly regularly: there have been yearly updates since the publication of v2 in 2015. The exception is 2018, which seems to have had no update, as v5 and v6 were published in 2017 and 2019 respectively.

Data can be downloaded in bulk at https://www.disgenet.org/downloads
Data can also be obtained through the Cytoscape app or the REST API (disgenet.org/api/).
The website homepage is https://www.disgenet.org/home/ and an online browser can also be found there.

Gene-Disease Associations, Variant-Disease Associations, and Disease-Disease Associations are formatted as tab-separated files using the gmt format (Gene Matrix Transposed file format). Each row represents one gene set and contains: ID (tab) Description (tab) Gene (tab) Gene (tab).
There are also SQLite files: DisGeNET SQLite 2020 - v7.0.
The DisGeNET association type ontology is an OWL ontology that has been integrated into the Semanticscience Integrated Ontology (SIO).
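A downloaded SQLite file can be inspected without knowing its layout in advance. The snippet below is a minimal sketch using only Python's standard `sqlite3` module. It lists the tables and their columns by asking SQLite itself, so it assumes nothing about the DisGeNET schema. The helper names `list_tables` and `list_columns` and the file name in the usage note are hypothetical.

```python
import sqlite3
from typing import List


def list_tables(conn: sqlite3.Connection) -> List[str]:
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def list_columns(conn: sqlite3.Connection, table: str) -> List[str]:
    """Return the column names of one table, in declaration order."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [row[1] for row in rows]
```

To use it, open the downloaded file with something like `conn = sqlite3.connect("disgenet_2020.db")` (the file name here is an assumption; use whatever name the download has), then call `list_tables(conn)` and `list_columns(conn, name)` for each table before writing any queries.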

In the gmt files, ID = Disease Concept Unique Identifier; Description = Disease Name; Gene = identified by Entrez gene id or HGNC gene symbol; the gene name, along with its UniProt accession, may also be stored.
Disease entries are mapped to UMLS® CUIs. The source databases themselves use MeSH identifiers, MIM identifiers, or disease names for disease terms.
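The row layout described above (ID, Description, then one gene per tab-separated field) can be read with a few lines of Python. This is a minimal sketch, not DisGeNET's own tooling. The names `GeneSet`, `parse_gmt_line`, and `parse_gmt` are hypothetical, and the sample row uses made-up placeholder values rather than real DisGeNET records.

```python
from typing import Dict, Iterable, List, NamedTuple


class GeneSet(NamedTuple):
    disease_id: str    # first field: disease identifier (a UMLS CUI in DisGeNET)
    disease_name: str  # second field: disease name
    genes: List[str]   # remaining fields: gene identifiers or symbols


def parse_gmt_line(line: str) -> GeneSet:
    """Parse one gmt row: ID <tab> Description <tab> Gene <tab> Gene ..."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("a gmt row needs at least an ID and a description")
    # Drop empty fields, which appear when a row ends with a trailing tab.
    genes = [field for field in fields[2:] if field]
    return GeneSet(fields[0], fields[1], genes)


def parse_gmt(lines: Iterable[str]) -> Dict[str, GeneSet]:
    """Parse many rows into a dict keyed by disease ID, skipping blank lines."""
    gene_sets = {}
    for line in lines:
        if line.strip():
            gene_set = parse_gmt_line(line)
            gene_sets[gene_set.disease_id] = gene_set
    return gene_sets


# Placeholder row for illustration only; not taken from DisGeNET.
sample_rows = ["C0000000\tExample disease\tGENE_A\tGENE_B\t\n"]
sample_sets = parse_gmt(sample_rows)
```

With a real download, `parse_gmt(open(path, encoding="utf-8"))` would give one `GeneSet` per disease, assuming the file follows the layout described above.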

Record this annotation source in your journal and add it to the list of annotations
- this is the journal entry, will add to Student_Wiki now

Exploring disease and gene connections on DisGeNET was very interesting.
There are many databases and annotation sources available, which may be a problem when you are trying to find comprehensive sources for analysis. That is why standardized use of identifiers and up-to-date sources are so important.
Will have to start Assignment 2 soon

Piñero, J., Queralt-Rosinach, N., Bravo, À., Deu-Pons, J., Bauer-Mehren, A., Baron, M., Sanz, F., & Furlong, L. I. (2015). DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database: The Journal of Biological Databases and Curation, 2015, bav028. https://doi.org/10.1093/database/bav028
Piñero, J., Saüch, J., Sanz, F., & Furlong, L. I. (2021). The DisGeNET cytoscape app: Exploring and visualizing disease genomics data. Computational and Structural Biotechnology Journal, 19, 2960–2967. https://doi.org/10.1016/j.csbj.2021.05.015