7. Annotation - lovisalittbrand/Genome_Analysis GitHub Wiki

Introduction

Genome annotation is the process of identifying the genetic elements of interest in a genome. We might want to determine which parts of the genome are biologically meaningful for the organism's survival, or identify other characteristics of interest. Each feature is labelled with information such as the function and structure of the element, as well as the process it is involved in. In my study, the aim is to determine the biochemical activity of the different bins.

Method

Prokka was used to produce a structural and functional annotation of the bins. Prokka identifies features such as protein coding regions and RNA genes (tRNA, rRNA). It first determines the protein coding regions (CDS) and then predicts the function of each encoded protein by similarity to proteins in databases. The following parameters were specified when running the analyses:

--kingdom Archaea: bins that most likely correspond to archaeal genome(s) were run with this flag. This information was obtained in the binning step. The default kingdom, Bacteria, was used for all other bins.
--prefix: specifies the output filename prefix, here "bin_X".
In addition, the bins were given as input and an output directory was specified for the annotation. The script for the binning can be found here.
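The parameters above can be combined into a single Prokka call per bin. The following is a minimal sketch, not the script used in this project: the directory names (bins/, annotation/) and the .fa extension are assumptions, and the function only prints the command so it can be inspected before being run.

```shell
# Sketch: build the Prokka command line for one bin.
# Assumptions: bins live in bins/ as FASTA files, output goes to annotation/.
prokka_cmd() {
    local bin_fasta="$1"
    local kingdom="${2:-Bacteria}"   # Bacteria is Prokka's default kingdom
    local name
    name="$(basename "${bin_fasta%.*}")"   # bins/bin_X.fa -> bin_X
    printf 'prokka --kingdom %s --prefix %s --outdir %s %s\n' \
        "$kingdom" "$name" "annotation/$name" "$bin_fasta"
}

# Print the command for a bacterial bin and for a bin flagged as archaeal.
prokka_cmd bins/bin_X.fa
prokka_cmd bins/bin_X.fa Archaea
```

Piping the printed line to `bash` would run the annotation, provided Prokka is installed and on the PATH.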

Result

The annotation with Prokka provides us with valuable information such as the number of coding sequences (CDS) as well as genes coding for tRNA, rRNA and tmRNA. Additionally, it predicts the amount of non-coding RNA. It uses Prodigal to predict the protein coding sequences. However, Prodigal is not able to functionally annotate the predicted genes. For that purpose, Prokka searches several databases to assign a function to the CDS features. The three main databases are UniProtKB, ISfinder and the NCBI Bacterial Antimicrobial Resistance Reference Gene Database. The functional annotation can be used to further analyze particular genes of interest, for example those with a notably high level of expression. Many of the proteins in the annotations were labelled as hypothetical proteins. These correspond to regions that have been structurally annotated as protein coding by Prodigal, but for which no similar protein was found in the databases. The number of hypothetical proteins was found with the following bash command:

grep -c "hypothetical" bin_X.gff
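A slightly stricter variant of this count is sketched below. It assumes the standard GFF3 layout that Prokka writes (feature type in column 3, attributes such as product=... in column 9, sequences appended after a ##FASTA line) and counts only CDS features, so that a match elsewhere in the file is not included. The function name is made up for this example.

```shell
# Sketch: count CDS features and hypothetical proteins in one Prokka GFF file.
# Prints two tab-separated numbers: <CDS count> <hypothetical count>.
count_hypothetical() {
    awk -F'\t' '
        /^##FASTA/ { exit }      # stop before the embedded sequences
        $3 == "CDS" {
            cds++
            if ($9 ~ /product=hypothetical protein/) hyp++
        }
        END { printf "%d\t%d\n", cds, hyp }
    ' "$1"
}

# Example (hypothetical path): count_hypothetical annotation/bin_X/bin_X.gff
```

Running the function over every bin in a loop gives the CDS and hypothetical protein counts in one pass.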

An overview of the results from Prokka is presented in table 1 below. Only the 9 bins passing the quality threshold are shown.

Table 1: Prokka annotation

Bin	Contigs	CDS	tRNA	rRNA	tmRNA	Hypothetical protein
Bin_1	54	2416	36	0	1	812
Bin_2	241	2872	52	2	1	1317
Bin_6	223	1713	31	6	1	513
Bin_7	334	2018	38	1	0	896
Bin_12	236	1365	36	1	1	552
Bin_16	338	1817	29	0	0	830
Bin_19	42	1516	36	1	1	588
Bin_24	297	2439	32	0	1	963
Bin_26	67	1183	40	1	0	631

As one can see in the table, the bins contain between roughly 1200 and 2900 CDS. Genes for tRNA are quite common, with about 30-50 found in each bin. Genes for rRNA and tmRNA are much rarer. Finally, a large share of the CDS, between roughly 30% and 53% depending on the bin, are regarded as hypothetical proteins. That is not too surprising, as the samples contain many non-cultivated species that may never have been studied before. It is therefore likely that they contain proteins that are not present in the common databases.
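The share of hypothetical proteins per bin can be computed directly from table 1. The sketch below assumes the table has been saved as a tab-separated file with the same column order (CDS in column 3, hypothetical proteins in column 7); the file name table1.tsv is made up for this example.

```shell
# Sketch: percentage of CDS annotated as hypothetical protein, per bin.
# Input: tab-separated table with a header line, laid out like table 1.
hypothetical_percent() {
    awk -F'\t' 'NR > 1 && $3 > 0 { printf "%s\t%.1f\n", $1, 100 * $7 / $3 }' "$1"
}

# Example (hypothetical file): hypothetical_percent table1.tsv
```

For Bin_1, for instance, this gives 812 / 2416, or about 33.6%.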

Questions

What types of features are detected by the software? Which ones are more reliable a priori?

The features detected are mainly CDS, tmRNA, tRNA and rRNA. The number of each feature can be found in table 1 above. A priori, I would say that the RNA features are more reliable. This is because they are often encoded by highly conserved sequences that do not vary much between organisms. The protein coding sequences are probably much more uncertain, as not every region that has, for example, a start and a stop codon actually encodes a protein.

Why is it more difficult to do the functional annotation in eukaryotic genomes?

One difference between eukaryotes and prokaryotes lies in protein biosynthesis: in eukaryotes, messenger RNA can undergo alternative splicing after transcription. This means that the region encoded by a specific exon can be spliced out, so that different combinations of exons encode the final proteins. Because of the many possible splice patterns, a protein cannot simply be predicted from the genome sequence, as we may not know which exons are used. Structural annotation of protein coding sequences usually involves identifying ORFs, which should include both a start and a stop codon. As eukaryotic introns may contain stop codons, predicting the position of protein coding sequences becomes harder. In addition, when alternative splicing is used, not every exon contains a stop codon.

How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?

As mentioned, hypothetical proteins lack functional annotation because no match was found in the databases that Prokka searches in a hierarchical manner [1]. The number per bin is given in table 1. To reduce the number of hypothetical proteins, you could search for the sequence with BLAST or in other databases containing functionally annotated protein sequences. You could also run alternative annotation software that uses databases other than Prokka's, which may contain the desired functional information. Prokka additionally allows custom databases to be built [2].
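As a first step towards such a search, the hypothetical proteins can be pulled out of the protein FASTA file that Prokka writes. The sketch below assumes that the .faa headers carry the product description after the sequence ID (for example ">ID hypothetical protein"); the function name and paths are made up for this example, and the BLAST call in the comment is only one possible follow-up.

```shell
# Sketch: keep only FASTA records whose header says "hypothetical protein".
extract_hypothetical() {
    awk '
        /^>/ { keep = ($0 ~ /hypothetical protein/) }   # decide once per record
        keep                                            # print header and sequence lines
    ' "$1"
}

# Example (hypothetical paths and database):
#   extract_hypothetical annotation/bin_X/bin_X.faa > bin_X_hypothetical.faa
#   blastp -query bin_X_hypothetical.faa -db <database> -outfmt 6 -evalue 1e-5 > bin_X_hits.tsv
```

If the extra search turns up well-annotated proteins, they could be collected in a FASTA file and passed to Prokka with its --proteins option on a second run.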

How can you evaluate the quality of the obtained functional annotation?

Without the time constraint, it would be valuable to try different annotation tools and compare their results. If they annotate the same features similarly, this strengthens the reliability and thereby the quality of the result. If one tool declares a region a hypothetical protein and another a non-coding region, it becomes less likely that the region is actually a protein coding sequence. On the other hand, hypothetical proteins might be assigned a function by another annotation tool, which increases the likelihood that they are real proteins. Another measure that can be used to assess the quality is the E-value. When searching a database with a sequence, the E-value tells us how many hits of at least that quality we would expect to obtain by random chance. A shorter sequence is more likely to find a close enough match by chance. The lower the E-value, the higher the quality of the annotated protein: the match is less likely to have arisen by chance, so the assigned function is more likely to be correct.
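When hits are available in BLAST's tabular format (-outfmt 6), where the E-value is the 11th column, an E-value cut-off can be applied with a short filter. This is a minimal sketch; the default threshold of 1e-5 is an arbitrary choice for illustration, not a value used in this project.

```shell
# Sketch: keep only BLAST tabular hits with an E-value at or below a threshold.
# $1: BLAST output in -outfmt 6, $2: maximum E-value (default 1e-5, an assumption)
filter_by_evalue() {
    awk -F'\t' -v max="${2:-1e-5}" '($11 + 0) <= (max + 0)' "$1"
}

# Example (hypothetical file): filter_by_evalue bin_X_hits.tsv 1e-10
```

Adding 0 to both values forces a numeric comparison, so E-values written in scientific notation such as 1e-30 are compared as numbers rather than as text.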

How comparable are the results obtained from two different structural annotation tools?

It may be hard to judge the performance of different tools and compare their results without a reference genome. With a reference genome, we can tell how correct the predictions were and whether the tools made similar errors. It is then possible to compare the sequence positions of the CDS, as well as the number of tRNA, rRNA and tmRNA genes. However, as our metagenome lacks a reference genome, it is difficult to compare different annotations and evaluate their performance. But if many tools obtain similar predictions they may, as discussed in the previous question, strengthen each other, even though we can't say for certain that the predictions are 100% correct.
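Even without a reference genome, the agreement between two structural annotations of the same bin can be quantified by counting the CDS that both tools place at identical coordinates. The sketch below assumes both tools write GFF3 and use the same contig names; the function names are made up for this example. Exact matching is a strict criterion, since two gene predictors may agree on a gene but pick different start codons.

```shell
# Sketch: list CDS as contig, start, end, strand from a GFF3 file, sorted.
cds_coords() {
    awk -F'\t' '
        /^##FASTA/ { exit }                     # skip embedded sequences
        $3 == "CDS" { print $1 "\t" $4 "\t" $5 "\t" $7 }
    ' "$1" | sort
}

# Count CDS with identical coordinates in two annotations of the same bin.
shared_cds() {
    local a b
    a="$(mktemp)"; b="$(mktemp)"
    cds_coords "$1" > "$a"
    cds_coords "$2" > "$b"
    comm -12 "$a" "$b" | awk 'END { print NR }'   # lines present in both files
    rm -f "$a" "$b"
}

# Example (hypothetical files): shared_cds bin_X_toolA.gff bin_X_toolB.gff
```

Dividing the shared count by the number of CDS from each tool gives a rough measure of how far the two annotations agree.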

References

[1] Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30: 2068–2069. https://academic.oup.com/bioinformatics/article/30/14/2068/2390517
[2] Seemann T. 2020. tseemann/prokka. https://github.com/tseemann/prokka#databases