Annotation Sources - bcb420-2023/Jielin_Yang GitHub Wiki

Objectives

Identify an annotation dataset for human genes that adds functional information to the genes.
Retrieve the details of the dataset.
Identify how the dataset contributes to the functional annotation of genes.

Time management

Time estimated: 2h, taken: 1.5h.

Start date: 2023-02-21, End date: 2023-02-21.

Procedure and Results

Annotation source

The annotation source chosen is the Database for Annotation, Visualization and Integrated Discovery (DAVID).

Dataset details

What sort of data is it? What sort of information does it offer us?
- DAVID is a web-based tool that allows users to perform functional annotation of gene lists. Its knowledgebase is curated from several other databases, such as Gene Ontology and KEGG, and it associates a unique identifier with the entries from these curated databases. DAVID supports functional annotation, functional clustering, and identification of enriched functional terms associated with a set of genes. It also supports disease and pathway analysis, which identifies enriched pathways and known diseases associated with the set of genes. Additionally, DAVID supports identifying protein-protein interactions. This information is offered through the web-based tool as a direct presentation of results, statistical analysis of the enriched annotations, and graphical visualizations, depending on the question being asked.
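To make "enriched functional terms" concrete, here is a minimal sketch of a one-sided hypergeometric test, a common generic way to ask whether a term appears in a gene list more often than expected by chance. It is an illustration only and is not claimed to be the exact statistic DAVID reports; the function name and all counts below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(list_hits, list_size, background_hits, background_size):
    """Return P(X >= list_hits) for a one-sided hypergeometric test.

    list_hits:        genes in the query list annotated with the term
    list_size:        total genes in the query list
    background_hits:  genes in the background annotated with the term
    background_size:  total genes in the background
    """
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    # Sum the probability of every outcome at least as extreme as the observed one.
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 3 of 5 query genes carry a term that only
# 4 of 20 background genes carry.
p_value = hypergeom_enrichment_p(
    list_hits=3, list_size=5, background_hits=4, background_size=20
)
print(f"enrichment p-value: {p_value:.4f}")
```

A small p-value suggests the term is over-represented in the list relative to the background. In practice, many terms are tested at once, so a multiple-testing correction would normally be applied as well.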
When and where was it published? Was it published?
- DAVID was first published by Dennis et al. (2003) in Genome Biology, in a paper that introduced the web-based interface integrating gene annotation and visualization. Several publications have followed, including one describing newly developed tools for DAVID (Huang et al. 2007). In 2009, a Nature Protocols paper detailed the use of DAVID for genome-scale annotation (Huang et al. 2009). The latest publication describes an update of DAVID that incorporates new species and adopts a new gene system (Sherman et al. 2021).
Is this annotation set updated regularly or is it a static source?
- The annotation set is updated regularly. DAVID consists of two parts: the set of tools integrated into the web interface, and the DAVID knowledgebase, which curates the annotation sources. The knowledgebase is updated quarterly, with the schedule found here. The web interface is updated once every several years, and each update brings a set of new tools and major restructuring of the existing tools. The history of releases can be found here.
Where can I find this data? (link to the download web address or ftp site or publication where it can be found)
- The data can be accessed through the functional annotation tool in the web interface, which can be found here. In addition, the data can be downloaded upon request using the request form, where the species, the primary ID type (e.g. official gene symbol), and the annotation type (e.g. GO terms) should be specified.
How is the data formatted and released? Does it exist in some sort of standard file format?
- The downloaded data is a tab-delimited flat file, which directly associates each identifier (the ID used for the query) with the annotations retrieved from the DAVID knowledgebase, such as GO terms, pathways, and diseases.
- Querying through the web interface returns data in different formats. These results are also tab-delimited, but they do not follow a standard format: they contain the statistical results for enriched functions of the submitted genes, and the layout differs depending on the tool used. All of these results can be downloaded as a txt file from the web interface after the analysis.
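As a rough illustration of working with a tab-delimited flat file like the one described above, the sketch below reads an identifier-to-annotation table into a dictionary. The column names (`ID`, `Annotation`), the comma separator between terms, and the sample rows are all assumptions made for this example; they are not taken from an actual DAVID download, so the real header and separators should be checked before reuse.

```python
import csv
import io

# Made-up sample rows standing in for a downloaded flat file.
# This is NOT real DAVID output; the layout is an assumption.
SAMPLE = (
    "ID\tAnnotation\n"
    "TP53\tGO:0006915,GO:0006974\n"
    "BRCA1\tGO:0006281\n"
    "GENE_X\t\n"
)


def read_annotation_table(handle, id_column="ID", term_column="Annotation", term_sep=","):
    """Map each identifier to the list of annotation terms found in its rows."""
    annotations = {}
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        gene_id = row[id_column].strip()
        raw_terms = row[term_column] or ""  # a short row yields None for missing fields
        terms = [t.strip() for t in raw_terms.split(term_sep) if t.strip()]
        # Identifiers may appear on several rows, so extend rather than overwrite.
        annotations.setdefault(gene_id, []).extend(terms)
    return annotations


annotations = read_annotation_table(io.StringIO(SAMPLE))
for gene_id, terms in annotations.items():
    print(gene_id, terms)
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` wrapper. Identifiers without annotations are kept with an empty list so that unmapped genes stay visible.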
What identifiers are associated with these annotations?
- The annotations are associated with a set of gene identifiers, and the user chooses which identifier type to use when retrieving annotations. One of the identifiers is the official gene symbol, which is defined by the official nomenclature of the selected organism. For human, these are the HGNC symbols. Additionally, the dataset supports Ensembl gene IDs and GenBank accession numbers, which allows annotation of genes that cannot be mapped to an HGNC symbol.
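Because a single identifier type is chosen when retrieving annotations, a mixed gene list may need to be split by type first. The sketch below does this with simple pattern matching. It is a heuristic written for this page, not part of DAVID: the function names are hypothetical, the GenBank pattern covers only the classic nucleotide accession shapes (one letter plus five digits, or two letters plus six digits), and anything unmatched is merely assumed to be a gene symbol.

```python
import re
from collections import defaultdict

# Human Ensembl gene IDs: "ENSG" followed by 11 digits, optionally versioned.
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}(?:\.\d+)?$")
# Simplified GenBank nucleotide accession shapes (assumption: classic formats only).
GENBANK_ACCESSION = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?$")


def guess_id_type(identifier):
    """Return a best-guess identifier type label for one ID string."""
    if ENSEMBL_GENE.match(identifier):
        return "ensembl_gene_id"
    if GENBANK_ACCESSION.match(identifier):
        return "genbank_accession"
    # Fallback: assume a gene symbol; this is not validated against HGNC.
    return "gene_symbol"


def group_by_id_type(identifiers):
    """Group identifiers by guessed type so each group can be submitted separately."""
    groups = defaultdict(list)
    for identifier in identifiers:
        cleaned = identifier.strip()
        if cleaned:
            groups[guess_id_type(cleaned)].append(cleaned)
    return dict(groups)


# "U12345" is a made-up, accession-shaped string used only for illustration.
mixed_ids = ["TP53", "ENSG00000141510", "U12345", " BRCA1 "]
print(group_by_id_type(mixed_ids))
```

Each resulting group can then be submitted with the matching identifier type selected. The symbol fallback is deliberately permissive, so symbols should still be checked against the HGNC nomenclature where that matters.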

Conclusion

DAVID is an easy-to-use functional annotation tool with a regularly updated, highly comprehensive knowledgebase that is curated from several other established databases. Its web-based interface and dataset download options make it a useful tool for functional annotation of genes.

References

Sherman BT, Hao M, Qiu J, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022 Mar 23. doi:10.1093/nar/gkac194

Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protoc. 2009;4(1):44-57.

Dennis G Jr, Sherman BT, Hosack DA, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4(5):P3.

Huang DW, Sherman BT, Tan Q, et al. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 2007;35(Web Server issue):W169-W175. doi:10.1093/nar/gkm415