Genome Annotation - MaryamDost/GenomeAnalysis GitHub Wiki

To identify the genes responsible for L. ferriphilum’s characteristic metabolism, one must identify the coding regions in the genome and find out their functions. This is done through structural and functional annotation, which was performed here using Prokka and eggNOG.

Prokka is a software tool that annotates prokaryotic genomes: it identifies coding regions and RNA genes (e.g., tRNA, rRNA) and describes the features of the annotated sequences. The code used to run Prokka is found in the Code directory. More information about Prokka and its outputs can be found in its manual.

eggNOG-mapper is a tool for fast functional annotation. It was used here for a second functional annotation, run on its online server with the protein sequences from the Prokka structural annotation (.faa file) as input.
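A quick sanity check before uploading the .faa file is to count its protein records, which should match the number of CDS reported by Prokka in the Result table. The sketch below is only an illustration: the function name count_proteins and the file name in the usage comment are assumptions, not part of the original workflow.

```shell
#!/usr/bin/env bash
# Count protein records in a FASTA file (every record starts with ">").
count_proteins() {
    grep -c '^>' "$1"
}

# Usage (hypothetical file name; Prokka names its outputs after --prefix):
# count_proteins Lferriphilum.faa
```

If the count differs from the CDS count, the wrong file was probably selected, for example the nucleotide .ffn file instead of the .faa file.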

Result

	Article	prokka	eggNOG
Total no. of genes	2541	2656	2594
No. of RNA genes (rRNA/tRNA/tmRNA)	6/48/1	6/48/1
No. of CDS with functional prediction	1846	1356	1960
CDS	2486	2594	2019
CRISPR	1	1

Discussion

The results show that the values obtained with eggNOG for the total number of genes (2594 vs. 2541) and the number of CDS with functional prediction (1960 vs. 1846) are closer to the values in the article than the Prokka values are. For the number of CDS, however, the Prokka value (2594) is closer to the article (2486) than the eggNOG value (2019). eggNOG uses orthology predictions for functional annotation, which are considered more precise than traditional homology searches: it avoids transferring annotations from paralogs, which are duplicated genes with a higher chance of having diverged in function. The difference in CDS could be due to the same reason.
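To make the comparison explicit, the absolute difference between each tool and the article can be computed per row. This is a minimal sketch using the numbers from the table above; the function name abs_diff_table and the shortened row labels are made up for the illustration.

```shell
#!/usr/bin/env bash
# Read tab-separated rows "label<TAB>article<TAB>prokka<TAB>eggNOG" on stdin
# and print |tool - article| for both tools.
abs_diff_table() {
    awk -F'\t' '
        function abs(x) { return x < 0 ? -x : x }
        { printf "%s\tprokka:%d\teggNOG:%d\n", $1, abs($3 - $2), abs($4 - $2) }
    '
}

# Values copied from the Result table (article, prokka, eggNOG).
printf '%s\t%s\t%s\t%s\n' \
    "Total genes"       2541 2656 2594 \
    "CDS with function" 1846 1356 1960 \
    "CDS"               2486 2594 2019 | abs_diff_table
```

For these values the differences are 115 (Prokka) and 53 (eggNOG) for total genes, 490 and 114 for CDS with functional prediction, and 108 and 467 for CDS.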

Overall, the values are quite reasonable.

Lab manual questions

What types of features are detected by the software? Which ones are more reliable a priori?

Prokka detects coding regions (CDS), RNAs (rRNA, tRNA, tmRNA) and CRISPR arrays. I think the most reliable are the coding regions, because prokaryotic genes lack introns, and the rRNA and tRNA sequences, because they are widely conserved among bacterial species.

How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?

The number of features detected by Prokka and the values from the article can be seen in the table in the Result section. The main difference is in the detected coding regions (2486 in the article vs. 2594 in my Prokka result), which might be due to the difference in the databases used, as a custom database specific to the Leptospirillum genus was constructed in the article.
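The per-type counts can also be read directly from the Prokka GFF file by tallying its third column. The following is a sketch rather than the command used for the table; the function name count_features is an assumption. Prokka appends the contig sequences after a ##FASTA line, so parsing stops there.

```shell
#!/usr/bin/env bash
# Count annotated features per type (GFF3 column 3) in a Prokka GFF file.
count_features() {
    awk -F'\t' '
        /^##FASTA/ { exit }        # stop before the embedded sequences
        /^#/       { next }        # skip header and comment lines
        NF >= 9    { n[$3]++ }     # column 3 holds the feature type
        END        { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}

# Usage (file name taken from the grep command on this page):
# count_features Lferriphilum.gff
```

Each output line holds a feature type (for example CDS, tRNA or rRNA) and its count, separated by a tab.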

Why is it more difficult to do the functional annotation in eukaryotic genomes?

Eukaryotic genomes contain many repetitive sequences that complicate the annotation process. Repeats are therefore identified and masked, which simply changes the bases in repetitive regions to “N” or “X” so that downstream tools ignore them. Annotation is also difficult because only a small percentage of the genome is coding and the genes are split into exons and introns. These features make it very hard to annotate without supporting evidence.
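As a toy illustration of the masking step described above, the sketch below replaces one interval of a sequence with N. It is not a repeat finder: the repeat coordinates are assumed to be known already, and the function name hard_mask is made up for the example.

```shell
#!/usr/bin/env bash
# Hard-mask a 1-based, inclusive interval of a sequence with "N".
# Usage: hard_mask SEQUENCE START END
hard_mask() {
    awk -v seq="$1" -v s="$2" -v e="$3" 'BEGIN {
        masked = substr(seq, s, e - s + 1)   # the interval to hide
        gsub(/./, "N", masked)               # replace every base with N
        print substr(seq, 1, s - 1) masked substr(seq, e + 1)
    }'
}

hard_mask ACGTACGTAC 3 6   # prints ACNNNNGTAC
```

The sequence keeps its length, so the coordinates of all downstream features stay valid after masking.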

How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?

Running the command grep -o -i hypothetical Lferriphilum.gff | wc -l in the terminal gave 1330 as the number of hypothetical proteins. Note that this counts every occurrence of the word in the file, not only CDS entries.
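A stricter count restricts the search to CDS rows whose product attribute is "hypothetical protein" and ignores the sequences that Prokka embeds after the ##FASTA line. This is a sketch under the assumption that the GFF follows Prokka's usual layout; the function name count_hypothetical is made up, and the result may differ from the grep-based count.

```shell
#!/usr/bin/env bash
# Count CDS features annotated as "hypothetical protein" in a Prokka GFF file.
count_hypothetical() {
    awk -F'\t' '
        /^##FASTA/ { exit }        # stop before the embedded sequences
        /^#/       { next }        # skip header and comment lines
        $3 == "CDS" && $9 ~ /product=hypothetical protein/ { n++ }
        END { print n + 0 }        # print 0 when nothing matched
    ' "$1"
}

# Usage:
# count_hypothetical Lferriphilum.gff
```

Matching on the product attribute avoids counting the word when it appears in other fields of the same row.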

Prokka predicts a coding sequence using Prodigal but fails to assign a function when it finds no homologous gene with a known function in the searched database. This happens when the gene is not present in the database or the protein it encodes has not yet been functionally characterized. One way this was tackled in the article was to create a custom database; however, many hypothetical proteins were still annotated there. L. ferriphilum is not very well studied, so it is likely that many of these proteins have new and unknown structures and functions that are important for understanding its biochemical and physiological pathways. To tackle the problem, I would look for an efficient approach to predict the function of these genes. An example of such a technique is protein-protein interaction analysis.

How can you evaluate the quality of the obtained functional annotation?

When the expected result is not known, it is difficult to evaluate its quality. One way is to validate the annotation experimentally, but this may not be an option as it is expensive and time-consuming. The result can be strengthened, though, by checking for other supporting information, for example whether promoter sites are present.

How comparable are the results obtained from two different structural annotation softwares?

In this project the structural annotation was conducted with Prokka only. However, I would expect the results from two different tools to be similar in our case because we are dealing with a bacterial genome: the software only has to find likely open reading frames and does not have to predict possible splice variants (bacterial genes generally lack introns).