04 Functional annotation - saltpinna/Genome_analysis_project GitHub Wiki

The functional annotation was performed using Prokka. The script can be found under code/scripts and is called Prokka_script.sh. The annotation resulted in 3127 coding sequences, 18 rRNAs, 70 tRNAs and 1 tmRNA.
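The contents of Prokka_script.sh are not reproduced on this page. As a minimal sketch of what such a Prokka call might look like, the function below wraps a single invocation. The function name run_prokka, the input file name, the output directory and the CPU count are all assumptions for illustration; only the options --outdir, --prefix and --cpus are taken from Prokka itself.

```shell
# Hypothetical wrapper around one Prokka run (not the actual Prokka_script.sh).
# Arguments: assembly FASTA, output directory, optional output prefix
# (defaults to "annotation", matching the annotation.faa file used on this page).
run_prokka() {
    assembly=$1
    outdir=$2
    prefix=${3:-annotation}
    # --cpus 2 is an arbitrary example value.
    prokka --outdir "$outdir" --prefix "$prefix" --cpus 2 "$assembly"
}
```

It could be called as run_prokka contigs.fasta prokka_out, which would write files such as prokka_out/annotation.gff and prokka_out/annotation.faa.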

Questions

What types of features are detected by the software? Which ones are more reliable a priori?

The software detected coding sequences (which can either have an actual function assigned or be annotated as a hypothetical protein), rRNAs, tRNAs and tmRNAs. The features that are most reliable a priori are the annotated sequences with known functions: the tRNAs, rRNAs, tmRNAs and the coding sequences corresponding to proteins with known functions.

How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?

The annotation resulted in 3127 coding sequences, 18 rRNAs, 70 tRNAs and 1 tmRNA. The article predicts 3095 coding sequences for the genome, but it does not state how many rRNAs, tRNAs and tmRNAs were found. My annotation therefore contains 32 more coding sequences than the article reports. This is probably because my assembly is longer than the one in the article.
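The feature counts above can be reproduced from the GFF file that Prokka writes. The following is a minimal sketch, assuming the output is named annotation.gff (by analogy with annotation.faa) and follows the usual GFF3 layout with the feature type in column 3; the function name count_features is made up for this example.

```shell
# Count feature types (column 3) in a GFF3 file.
# Comment lines are skipped, and reading stops at the "##FASTA" line,
# after which Prokka appends the contig sequences.
count_features() {
    awk -F'\t' '
        /^##FASTA/ { exit }
        /^#/       { next }
        NF >= 9    { n[$3]++ }
        END        { for (t in n) print t "\t" n[t] }
    ' "$1" | LC_ALL=C sort
}
```

Running count_features annotation.gff prints one line per feature type (for example CDS, rRNA, tRNA and tmRNA) with its count.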

Why is it more difficult to do the functional annotation in eukaryotic genomes?

It is more difficult because eukaryotic genes are split into exons and introns, which allows alternative splicing. This means that one coding region can result in multiple different proteins. The exon sequences are often shorter than the intron sequences, which makes them even more difficult to identify when annotating the genome.

How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?

1359 genes were annotated as 'hypothetical protein'. This was found by running the command [grep 'hypothetical protein' annotation.faa | wc -l]. These genes are labelled as hypothetical because the software was not able to find any related proteins in its databases. To tackle this problem, it would be best to improve the assembly given to Prokka in the first place. This would close any gaps in the sequence and give a better assembly, which would probably result in a better annotation.
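To put the number of hypothetical proteins in relation to the total, the grep command above can be extended a little. This is only a sketch: the function name hypothetical_fraction is invented here, and it assumes that every protein in annotation.faa has exactly one FASTA header line starting with ">".

```shell
# Report how many proteins in a protein FASTA file are annotated as
# "hypothetical protein", together with the total and the percentage.
hypothetical_fraction() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    hyp=$(grep -c '^>.*hypothetical protein' "$1" || true)
    total=$(grep -c '^>' "$1" || true)
    awk -v h="$hyp" -v t="$total" \
        'BEGIN { printf "%d of %d (%.1f%%)\n", h, t, (t ? 100 * h / t : 0) }'
}
```

Called as hypothetical_fraction annotation.faa, it prints the count, the total and the percentage on one line. With the numbers reported here (1359 of 3127 coding sequences, assuming all of them appear in annotation.faa), roughly 43% of the proteins are hypothetical.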

How can you evaluate the quality of the obtained functional annotation?

The annotation obtained could be evaluated by comparing it to the annotation of a curated genome in a database.

How comparable are the results obtained from two different structural annotation software tools?

The results depend on the algorithms and databases used by each structural annotation tool, so two tools may produce different results. Some algorithms focus on different features when doing the annotation, which would result in different annotations. The database used also plays a large part.
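One simple way to put a number on such differences is to compare the predicted coding sequences of two annotations of the same assembly. The sketch below assumes both tools write GFF3 files that use the same contig names; the function names cds_coords and compare_cds are hypothetical. It only counts a CDS as shared when contig, start, end and strand all match exactly, which is a strict criterion, since two tools that pick different start codons for the same gene will count as a mismatch.

```shell
# Print "contig start end strand" for every CDS in a GFF3 file, sorted.
cds_coords() {
    awk -F'\t' '
        /^##FASTA/            { exit }
        !/^#/ && $3 == "CDS"  { print $1, $4, $5, $7 }
    ' "$1" | LC_ALL=C sort
}

# Count CDS coordinates shared between two GFF3 files and unique to each.
compare_cds() {
    a=$(mktemp) && b=$(mktemp) || return 1
    cds_coords "$1" > "$a"
    cds_coords "$2" > "$b"
    echo "shared: $(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' ')"
    echo "only_first: $(LC_ALL=C comm -23 "$a" "$b" | wc -l | tr -d ' ')"
    echo "only_second: $(LC_ALL=C comm -13 "$a" "$b" | wc -l | tr -d ' ')"
    rm -f "$a" "$b"
}
```

For example, compare_cds annotation.gff other_tool.gff (both file names are placeholders) prints three lines with the number of identical CDS predictions and the number found by only one of the tools. The same comparison could be used against the annotation of a curated reference genome, provided the coordinates refer to the same sequence.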