Annotation sources‐Ensembl - bcb420-2024/Krutika_Joshi GitHub Wiki

Annotation source: Ensembl

1. What sort of data is it? What sort of information does it offer us?

  • Ensembl is an open-source genome annotation resource that covers protein-coding and non-coding genes, splice variants, cDNA and protein sequences, and non-coding RNAs. In addition to gene function and expression, Ensembl annotations contain data about gene regulation, variant effects, and phylogenetic relationships. In this respect it is comparable to the GO categorization, in which annotations describe the biological processes, molecular functions, and cellular components associated with genes.

2. When and where was it published? Was it published?

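  • Yes. Ensembl was launched in 1999 as a joint project of the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and the Wellcome Trust Sanger Institute, both located in Hinxton, United Kingdom. The project was first described in the journal Nucleic Acids Research in 2002.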

3. Is this annotation set updated regularly or is it a static source?

  • It is updated regularly. A new release of the overall annotation comes out roughly every 3 months, although some releases take longer depending on the scope of the update. A release can include new genomic data, updated annotations, bug fixes, and new features. Updates to individual species are irregular, because information is added as new evidence and assemblies become available. The most recent update was published on January 5, 2024.

4. Where can I find this data? (link to the download web address or ftp site or publication where it can be found)

  • The data can be found directly on the Ensembl website (https://www.ensembl.org), and bulk files can be downloaded from the Ensembl FTP site (https://ftp.ensembl.org/pub/). Depending on the research question, different tools on the Ensembl site can be used. For example, BioMart can be used for custom dataset exports, and BLAST/BLAT for sequence searches.
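
As a rough illustration of where a bulk download lives, the sketch below builds the address of a GTF annotation file on the Ensembl FTP site. It assumes the directory layout `release-<N>/gtf/<species>/<Species>.<assembly>.<N>.gtf.gz`; the function name `ensembl_gtf_url` is hypothetical, and release 111 with assembly GRCh38 is used only as an example. The function only builds the string and does not contact the server, so it is worth checking the result in a browser before relying on it.

```python
def ensembl_gtf_url(release: int, species: str, assembly: str) -> str:
    """Build the assumed FTP address of a species' GTF annotation file.

    The path layout is an assumption about the public FTP site,
    not something this function verifies.
    """
    species_dir = species.lower()         # directory name, e.g. homo_sapiens
    file_stem = species_dir.capitalize()  # file name prefix, e.g. Homo_sapiens
    return (
        f"https://ftp.ensembl.org/pub/release-{release}/gtf/{species_dir}/"
        f"{file_stem}.{assembly}.{release}.gtf.gz"
    )


# Example: human annotation, release 111, assembly GRCh38
print(ensembl_gtf_url(111, "homo_sapiens", "GRCh38"))
```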

5. How is the data formatted and released? Does it exist in some sort of standard file format?

  • Ensembl supports many formats, but the most common ones are FASTA, GFF/GTF, VCF, and BED.
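
To make the GTF format more concrete, here is a minimal sketch of reading one GTF record. A GTF line has nine tab-separated columns, and the last column holds `key "value";` attribute pairs. The function name `parse_gtf_line` and the sample line are hypothetical (the IDs and coordinates are made up), and the attribute parsing is simplified: it assumes values contain no semicolons and keeps only the last value of a repeated key.

```python
def parse_gtf_line(line: str) -> dict:
    """Split one GTF record into its nine columns and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final semicolon
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Made-up sample record, for illustration only
sample = (
    "1\tensembl\tgene\t1000\t2000\t.\t+\t.\t"
    'gene_id "ENSG00000000001"; gene_name "EXAMPLE1"; gene_biotype "protein_coding";'
)
record = parse_gtf_line(sample)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_id"])
```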

6. What identifiers are associated with these annotations?

  • Ensembl has its own identifiers, called Ensembl IDs. Whatever the feature of interest (i.e., gene, transcript, or protein), each one has a unique Ensembl ID. In addition, Ensembl gives users the option to cross-reference identifiers from other databases.
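
The structure of an Ensembl ID can be sketched with a small parser. This assumes the usual stable ID layout: the prefix `ENS`, an optional species code (absent for human, e.g. `MUS` for mouse), a feature letter (`G` gene, `T` transcript, `P` protein, `E` exon, `R` regulatory feature), an 11-digit number, and an optional `.version` suffix. The pattern is a simplification that ignores other ID types, and the function name `describe_ensembl_id` is hypothetical.

```python
import re

# Feature letters handled by this simplified sketch
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "E": "exon",
    "R": "regulatory feature",
}

# ENS + optional species code + feature letter + 11 digits + optional version
ID_PATTERN = re.compile(
    r"^ENS(?P<species>[A-Z]{3,})?(?P<feature>[GTPER])(?P<number>\d{11})"
    r"(?:\.(?P<version>\d+))?$"
)


def describe_ensembl_id(stable_id: str):
    """Return the parts of an Ensembl ID, or None if it does not match."""
    match = ID_PATTERN.match(stable_id)
    if match is None:
        return None
    version = match.group("version")
    return {
        "species_prefix": match.group("species") or "",  # empty for human IDs
        "feature": FEATURE_TYPES[match.group("feature")],
        "version": int(version) if version else None,
    }


for example in ["ENSG00000139618", "ENSMUST00000000001.5", "BRCA2"]:
    print(example, describe_ensembl_id(example))
```

Identifiers that do not follow this layout, such as the gene symbol in the last example, return `None`, which is a simple way to separate Ensembl IDs from cross-referenced identifiers in a mixed list.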