4. Annotation Resource: DisGeNET - bcb420-2022/RuoXuan_Wang GitHub Wiki
Objective
Learn about an annotation data set
Duration
Time estimated: 1.5 h; taken 2 h;
date started: 2022-02-28; date completed: 2022-03-01
Progress
- Find an annotation data set for human genes - any data set that adds functional, process, location, disease status ... to a set of genes.
- looked for databases, finally decided to find a disease annotation dataset
- DisGeNET at https://www.disgenet.org/home/
- Questions to Answer:
1. What sort of data is it? What sort of information does it offer us?
- DisGeNET collects genes and variants associated with human diseases from publicly available databases.
- It is divided into Gene-Disease Associations (GDAs), Variant-Disease Associations (VDAs), and Disease-Disease Associations. There is also an IntAct Coronavirus dataset.
- The current version of DisGeNET (v7.0) contains 1,134,942 GDAs, between 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes, and 369,554 VDAs, between 194,515 variants and 14,155 diseases, traits, and phenotypes.
2. When and where was it published? Was it published?
- Originally published in 2015: DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes
- Latest publication in 2021: The DisGeNET cytoscape app: Exploring and visualizing disease genomics data
3. Is this annotation set updated regularly or is it a static source?
- It is updated fairly regularly: there have been roughly yearly updates since v2 was published in 2015. The exception is 2018, which appears to have had no update, as v5 and v6 were published in 2017 and 2019 respectively.
4. Where can I find this data? (link to the download web address or ftp site or publication where it can be found)
- Data can be downloaded in bulk at https://www.disgenet.org/downloads
- Data can also be obtained through the Cytoscape app or the REST API (disgenet.org/api/)
- The website homepage is https://www.disgenet.org/home/, and an online browser can also be found there
5. How is the data formatted and released? Does it exist in some sort of standard file format?
- Gene-Disease Associations, Variant-Disease Associations, and Disease-Disease Associations are formatted as tab-separated files using the gmt format (Gene Matrix Transposed file format). Each line represents one gene set and contains: ID (tab) Description (tab) Gene (tab) Gene (tab) ..., with one field per additional gene.
- There are also SQLite files: DisGeNET SQLite 2020 - v7.0.
- The DisGeNET association type ontology is an OWL ontology that has been integrated into the Semanticscience Integrated Ontology (SIO).
6. What identifiers are associated with these annotations?
- In the gmt files, ID = Disease Concept Unique Identifier (CUI); Description = Disease Name; Gene = identified by Entrez gene ID or HGNC gene symbol. The gene name, along with its UniProt accession, may also be stored.
- For diseases, entries are mapped to the UMLS® CUIs. The source databases use MeSH, or MIM identifiers, or disease names for disease terms.
- Record this annotation source in your journal and add it to the list of annotations
- this is the journal entry, will add to Student_Wiki now
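
As a note to self on the gmt layout described under questions 5 and 6, here is a minimal Python sketch of how such a file could be read. It assumes only the line layout noted above (ID, Description, then one gene per tab-separated field). The names `GeneSet`, `parse_gmt`, and `diseases_by_gene` are my own, and the example rows are made-up placeholders, not real DisGeNET records.

```python
import io
from typing import Dict, Iterable, List, NamedTuple


class GeneSet(NamedTuple):
    """One gmt line: a disease and the genes annotated to it."""
    disease_id: str      # assumed to hold the disease CUI
    disease_name: str    # assumed to hold the disease name
    genes: List[str]     # Entrez gene IDs or HGNC symbols


def parse_gmt(handle: Iterable[str]) -> List[GeneSet]:
    """Parse gmt-formatted lines: ID <tab> Description <tab> Gene <tab> Gene ..."""
    gene_sets = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected ID, description, and at least one gene")
        # Drop empty gene fields left by trailing tabs.
        genes = [gene for gene in fields[2:] if gene]
        gene_sets.append(GeneSet(fields[0], fields[1], genes))
    return gene_sets


def diseases_by_gene(gene_sets: Iterable[GeneSet]) -> Dict[str, List[str]]:
    """Invert the gene sets into a gene -> list of disease IDs lookup."""
    index: Dict[str, List[str]] = {}
    for gene_set in gene_sets:
        for gene in gene_set.genes:
            index.setdefault(gene, []).append(gene_set.disease_id)
    return index


# Placeholder rows for illustration only (not real DisGeNET data).
EXAMPLE = (
    "C0000001\tExample disease A\tGENE1\tGENE2\n"
    "C0000002\tExample disease B\tGENE2\tGENE3\n"
)

gene_sets = parse_gmt(io.StringIO(EXAMPLE))
index = diseases_by_gene(gene_sets)
print(index["GENE2"])
```

For a downloaded file, the same function should work on an open file handle, e.g. `parse_gmt(open(path, encoding="utf-8"))`, where `path` is wherever the gmt file was saved. Inverting the gene sets into a gene-to-disease lookup is handy because annotation usually starts from a gene list rather than from a disease.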
Conclusion and outlook
- Exploring disease and gene connections on DisGeNET was very interesting.
- There are many databases and annotation sources available, which can make it hard to find a comprehensive source for an analysis. That is why standardized use of identifiers and up-to-date sources are so important.
- Will have to start Assignment 2 soon
References
- Piñero, J., Queralt-Rosinach, N., Bravo, À., Deu-Pons, J., Bauer-Mehren, A., Baron, M., Sanz, F., & Furlong, L. I. (2015). DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database : the journal of biological databases and curation, 2015, bav028. https://doi.org/10.1093/database/bav028
- Piñero, J., Saüch, J., Sanz, F., & Furlong, L. I. (2021). The DisGeNET cytoscape app: Exploring and visualizing disease genomics data. Computational and Structural Biotechnology Journal, 19, 2960–2967. https://doi.org/10.1016/j.csbj.2021.05.015