Journal Entry 3: Homework Assignment : Annotation sources - bcb420-2022/Sabbir_Hossain GitHub Wiki


Objective

Summarize the Consensus CDS (CCDS) protein set database, an annotation database/source provider, and answer questions about it.

Time est.: 30 mins Time used: 0.5 h Date started: 2022/04/21
Date completed: 2022/04/21

Progress & Notes

Activities & Tasks

Find an annotation data set (excluding GO and Reactome which I have outlined below as an example) for human genes - any data set that adds functional, process, location, disease status ... to a set of genes.

Find out the following information:

What sort of data is it? What sort of information does it offer us?

The Consensus CDS (CCDS) project is a collaborative effort to find a core group of consistently annotated and high-quality human and mouse protein coding regions. The long-term objective is to encourage the adoption of a common set of gene annotations.

When and where was it published? Was it published?

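Yes, it was published. The most recent releases date from 2019 for mouse and 2018 for human (see the list of all releases). The project is described in the following publications: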

The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D. Genome Res. 2009 Jul;19(7):1316-23. PubMed: PMID: 19498102

Tracking and coordinating an international curation effort for the CCDS Project. Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L, Aken B, Barrell D, Frankish A, Wallin C, Searle S, Diekhans M, Harrow J, Pruitt KD. Database 2012 Mar 20;2012:bas008. doi: 10.1093/database/bas008. PubMed: PMID: 22434842

Current status and new features of the Consensus Coding Sequence database. Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, Hiatt SM, Frankish A, Suner MM, Rajput B, Steward CA, Brown GR, Bennett R, Murphy M, Wu W, Kay MP, Hart J, Rajan J, Weber J, Snow C, Riddick LD, Hunt T, Webb D, Thomas M, Tamez P, Rangwala SH, McGarvey KM, Pujar S, Shkeda A, Mudge JM, Gonzalez JM, Gilbert JG, Trevanion SJ, Baertsch R, Harrow JL, Hubbard T, Ostell JM, Haussler D, Pruitt KD. Nucleic Acids Res. 2014 Jan 1;42(1):D865-72. doi: 10.1093/nar/gkt1059. PubMed: PMID: 24217909

Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation. Pujar S, O'Leary NA, Farrell CM, Loveland JE, Mudge JM, Wallin C, Girón CG, Diekhans M, Barnes I, Bennett R, Berry AE, Cox E, Davidson C, Goldfarb T, Gonzalez JM, Hunt T, Jackson J, Joardar V, Kay MP, Kodali VK, Martin FJ, McAndrews M, McGarvey KM, Murphy M, Rajput B, Rangwala SH, Riddick LD, Seal RL, Suner MM, Webb D, Zhu S, Aken BL, Bruford EA, Bult CJ, Frankish A, Murphy T, Pruitt KD. Nucleic Acids Res. 2018 Jan 4;46(D1):D221-D228. doi: 10.1093/nar/gkx1031. PubMed: PMID: 29126148 PubMed Central: PMCID: PMC5753299

Is this annotation set updated regularly or is it a static source?

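It appears to be static. The data is published as discrete releases rather than updated continuously, and the most recent ones date from 2018 (human) and 2019 (mouse).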

Where can I find this data? (link to the download web address or ftp site or publication where it can be found)

Direct link to the FTP site holding all releases for both mouse and human.

How is the data formatted and released? Does it exist in some sort of standard file format?

As FTP archive releases. The CCDS collection contains full-length coding sequences (with a starting ATG and a valid stop codon) that can be translated from the genome without frameshifts. The Havana team at EMBL-EBI and the RefSeq annotation group at NCBI are the two primary curation groups.
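To make the "full-length" criterion concrete, here is a minimal Python sketch that checks whether a nucleotide string starts with ATG, ends with a stop codon, and has no stop codon in between when read in frame. The function name `is_full_length_cds` and the example sequences are my own illustrations, not part of CCDS, and the check assumes the standard genetic code; the real curation process is considerably more involved.

```python
# Stop codons of the standard genetic code (assumption: no special cases).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_full_length_cds(seq: str) -> bool:
    """Return True if seq looks like a complete, in-frame coding sequence."""
    seq = seq.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG":            # must begin with the start codon
        return False
    if codons[-1] not in STOP_CODONS:  # must end with a stop codon
        return False
    # No premature stop codon inside the reading frame.
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


print(is_full_length_cds("ATGGCCTAA"))     # complete toy CDS
print(is_full_length_cds("ATGTAAGCCTGA"))  # premature stop codon
```

The first call prints `True` and the second prints `False`, because the second toy sequence has an in-frame TAA before its final stop codon.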

What identifiers are associated with these annotations?

Each CDS in the set is tracked with a CCDS ID. The general process flow for defining the CCDS gene set is as follows:

  1. Compare the genome annotation results from the collaborating groups.
  2. Identify annotated coding regions that have identical coordinates on the genome, and assess their quality.
  3. Remove lower-quality CDSs from the core set pending further review by the collaborating groups.
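The comparison step above can be sketched in Python as a set intersection over coordinate records. This is only an illustration under my own assumptions: the `Cds` record, the `consensus` function, and all coordinates are invented toy values, and the real CCDS pipeline adds quality assessment and manual review on top of the coordinate match.

```python
from typing import NamedTuple


class Cds(NamedTuple):
    """Hypothetical record for one annotated coding region."""
    chrom: str
    strand: str
    exons: tuple  # tuple of (start, end) pairs, one per coding exon


def consensus(set_a, set_b):
    """Return the CDSs annotated with identical coordinates in both sets."""
    return sorted(set(set_a) & set(set_b))


# Toy annotations from two groups (invented coordinates).
group_1 = [
    Cds("1", "+", ((100, 200), (300, 400))),
    Cds("1", "-", ((900, 1000),)),
]
group_2 = [
    Cds("1", "+", ((100, 200), (300, 400))),
    Cds("1", "-", ((900, 1010),)),  # differs by 10 bases at the exon end
]

shared = consensus(group_1, group_2)
print(shared)
```

Only the first record is in `shared`, because the second one differs between the two groups and would need further review before joining the core set.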

