Entry 8.1: CCDS (Annotation Sources Assignment) - bcb420-2025/Chloe_Calica GitHub Wiki

Objective: Find an annotation dataset that is not already in the Student Wiki list and find out some information about it as outlined in the assignment.

Estimated Time: 1 hr

Actual Time: 2 hr (because I cannot focus for the life of me 𖦹ᯅ𖦹)

Selecting an Annotation Source

At the time I did this homework, the following resources were already in the Student Wiki list:

Gene Ontology (GO) and Reactome provided by Prof Isserlin
From classmates: Human Protein Atlas, InterPro, gnomAD, ClinVar, GENCODE, TCGA's Pan-Cancer Atlas, and RefSeq

I'm somewhat familiar with KEGG and initially wanted to cover it, but then I realized the dataset isn't available for download without a subscription, so I decided to look for other resources. I stumbled upon CCDS, confirmed it was downloadable, and selected it for this assignment.

CCDS Resource Information

1. What sort of data is it? What sort of information does it offer us?

CCDS, or the Consensus Coding Sequence Project, is a collaborative effort by EBI, HGNC, MGI, and NCBI to identify a core set of protein-coding regions that are consistently annotated and of high quality. The CCDS gene set includes coding regions that are annotated as full-length (valid start and stop codons), can be translated without frameshifts, and use consensus splice sites. The datasets also include the chromosome number and positions, the nucleotide and protein sequences of the CDS, the exon composition of the CDS, and associated IDs such as NCBI accessions, Entrez Gene IDs, and RefSeq/UniProtKB/Swiss-Prot sequence IDs.
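To make the "full-length, no frameshift" criteria above more concrete for myself, here is a minimal sketch of how such a check could look. This is my own simplification, not the actual CCDS pipeline: the function name `looks_full_length` is made up, it assumes an ATG start codon and the standard stop codons, and it ignores splice sites and special cases entirely.

```python
# Standard stop codons (assumption: standard genetic code only).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_full_length(cds):
    """Return True if a CDS passes three simplified full-length checks."""
    seq = cds.upper()
    # Length must be a multiple of 3, otherwise the reading frame is broken.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Must begin with a start codon (simplified to ATG only).
    if codons[0] != "ATG":
        return False
    # Must end with a stop codon.
    if codons[-1] not in STOP_CODONS:
        return False
    # No in-frame stop codon before the final one.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


print(looks_full_length("ATGAAATAA"))
print(looks_full_length("ATGAAATA"))
```

The first call passes all three checks, while the second fails because its length is not a multiple of three.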

2. When and where was it published? Was it published?

CCDS was created in 2005, with the first publication appearing in Genome Research in July 2009. Three more articles about CCDS were published in 2012, 2014, and 2018, the most recent one in Nucleic Acids Research.

3. Is this annotation set updated regularly or is it a static source?

CCDS datasets are updated whenever there are major updates to the human reference genome. The last release was on October 26, 2022, alongside the most recent genome release, GRCh38.p14 (Genome Reference Consortium Human Build 38 patch release 14). Previous release notes and associated reference genomes can be accessed here. Additionally, the datasets are maintained with weekly updates to reflect the latest data for the current release.

4. Where can I find this data? (link to the download web address or ftp site or publication where it can be found)

The main FTP download page contains links to both the human and mouse datasets, as well as links to previously available versions and a README file explaining the dataset formatting.
This link to the human datasets contains multiple annotation files which are listed in the next section.

5. How is the data formatted and released? Does it exist in some sort of standard file format?

Annotation datasets are all .txt files, with the exception of the nucleotide and protein sequence datasets, which are formatted as compressed FASTA files.

BuildInfo.[YearMonthDay].txt
CCDS.[YearMonthDay].txt
CCDS2Sequence.[YearMonthDay].txt
CCDS2UniProtKB.[YearMonthDay].txt
CCDS_attributes.[YearMonthDay].txt
CCDS_exons.[YearMonthDay].txt
CCDS_nucleotide.[YearMonthDay].fna.gz
CCDS_protein.[YearMonthDay].faa.gz
CCDS_protein_exons.[YearMonthDay].faa.gz

Note that each file is named with a datestamp in [YearMonthDay] format, but the current files for a specific release are labelled with "current" in their names instead of the datestamp.

The README file for the above CCDS datasets contains descriptions of each file as well as the information they contain.
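As a rough idea of how one of the .txt annotation files could be loaded, here is a small sketch. It assumes the file is tab-delimited with a single header line that starts with `#`; check the README for the real layout. The helper name `read_ccds_table`, the column names, and the sample rows below are all made up for illustration and are not real CCDS records.

```python
import csv
import io


def read_ccds_table(handle):
    """Yield one dict per row, taking column names from the file's header line."""
    # Assumption: the first line is a '#'-prefixed, tab-separated header.
    header = handle.readline().rstrip("\n").lstrip("#").split("\t")
    reader = csv.DictReader(handle, fieldnames=header, delimiter="\t")
    for row in reader:
        yield row


# Made-up sample standing in for a downloaded file.
sample = io.StringIO(
    "#chromosome\tgene\tccds_id\n"
    "1\tGENE_A\tCCDS0001.1\n"
    "2\tGENE_B\tCCDS0002.1\n"
)
rows = list(read_ccds_table(sample))
print(rows[0]["ccds_id"])
```

For a real download, the same function should work on a file handle opened with `open(path, newline="")`, where `path` points to a local copy of one of the .txt files listed above.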

6. What identifiers are associated with these annotations?

Annotated genes in CCDS have a unique identifier number followed by a version number, e.g. CCDS1.1 or CCDS234.1.
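Since the ID is just "CCDS" plus a number, a dot, and a version, it can be split with a small regular expression. This is a sketch based only on the two example IDs above; the pattern and the `parse_ccds_id` helper are my own and not part of any official CCDS tooling.

```python
import re

# Pattern inferred from the examples above: CCDS<number>.<version>
CCDS_ID = re.compile(r"^CCDS(\d+)\.(\d+)$")


def parse_ccds_id(identifier):
    """Split a CCDS ID into (identifier number, version number) as integers."""
    match = CCDS_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"Not a CCDS ID: {identifier!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_ccds_id("CCDS234.1"))  # (234, 1)
```

Keeping the version separate makes it easy to compare whether two files refer to the same CCDS record but at different versions.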