Analysis 6: Genome Annotation with Prokka - cecilia-andersson/Genome_Analysis_Project GitHub Wiki

Methods

As input, I used the FASTA file from each bin (1.fa, 2.fa, etc.) and ran Prokka on each one with default parameters.
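The per-bin runs could be scripted roughly as follows. This is only a sketch, not the exact commands used here: the `bins/` and `annotation/` directory names and the helpers `build_prokka_command` and `annotate_bins` are hypothetical. `--outdir` and `--prefix` only control where the output goes and how it is named, so all annotation settings stay at the Prokka defaults.

```python
import subprocess
from pathlib import Path


def build_prokka_command(fasta_path, out_root="annotation"):
    """Build the Prokka command line for one bin (annotation settings left at defaults)."""
    fasta_path = Path(fasta_path)
    bin_name = fasta_path.stem  # "1.fa" -> "1"
    return [
        "prokka",
        "--outdir", str(Path(out_root) / f"bin_{bin_name}"),  # one output folder per bin
        "--prefix", f"bin_{bin_name}",                        # output files named bin_1.gff, ...
        str(fasta_path),
    ]


def annotate_bins(bin_dir="bins", out_root="annotation"):
    """Run Prokka once for every *.fa file in bin_dir (needs prokka on the PATH)."""
    Path(out_root).mkdir(parents=True, exist_ok=True)
    for fasta_path in sorted(Path(bin_dir).glob("*.fa")):
        subprocess.run(build_prokka_command(fasta_path, out_root), check=True)
```

Calling `annotate_bins()` would then annotate every bin in turn and stop at the first run that fails, because of `check=True`.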

Results

Prokka outputs several files, including a .gff file containing the sequences and annotations, a protein FASTA file (.faa) of the translated coding sequences, and a nucleotide FASTA file (.ffn) of all predicted genomic features. I was able to look through the annotations of some bins using UGENE. The example below shows a sequence (ANIMGAOC_00021) that matched a previously identified enzyme.

[Screenshot: UGENE annotation view of ANIMGAOC_00021]
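Besides browsing in UGENE, the .gff file can be summarized with a few lines of Python. The sketch below counts how many features of each type (column 3 of the GFF) a file contains. The function name `count_feature_types` is hypothetical, and the code assumes the usual Prokka layout in which the contig sequences follow a `##FASTA` line at the end of the file.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 feature types (column 3), ignoring comments and the sequence section."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if not line or line.startswith("#"):
            continue  # blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) == 9:  # a well-formed GFF3 feature line has nine columns
            counts[fields[2]] += 1
    return counts
```

For example, `count_feature_types(open("bin_3.gff"))` (with a hypothetical file name) would return a `Counter` keyed by feature type such as `CDS` or `tRNA`.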

Discussion

Prokka uses databases of previously identified bacterial protein sequences, as well as small RNA sequences, to identify genes within unidentified samples. It also predicts coding sequences that match no described protein; these are reported as hypothetical proteins. In this analysis I annotated each bin individually to get an idea of what each species may contribute to the overall environment at each location.

Every bin had hundreds of predicted coding sequences, and when I examined the output .gff files further in Unipro UGENE, I saw that although most were hypothetical proteins, most bins also contained several previously identified proteins. For example, the annotated protein sequence pictured above in 'Results' was identified in Bin 3. Many of the identified proteins, like this one, have described enzymatic functions. Unfortunately, I do not think I will have enough time to search each bin thoroughly for all metabolism-associated proteins and carry out the analysis described in the reference paper, but it's very cool to see that the environmental functions of even unidentified species can be described in this way. Each bin also has at least one rRNA, tRNA, or tmRNA identified.
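The split between hypothetical and previously identified proteins could also be tallied directly from a .gff file instead of by eye. The following is a minimal sketch under a few assumptions: the function names are hypothetical, each CDS line is assumed to carry a `product=` attribute, and a product of exactly `hypothetical protein` is taken to mean "no database match".

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Turn a GFF3 attribute string such as 'ID=x;product=y' into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def summarize_cds_products(gff_lines):
    """Return (number of hypothetical CDS, list of named CDS products)."""
    hypothetical = 0
    named = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # sequence section starts here
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # only protein-coding features are of interest here
        product = parse_attributes(fields[8]).get("product", "hypothetical protein")
        if product == "hypothetical protein":
            hypothetical += 1
        else:
            named.append(product)
    return hypothetical, named
```

The list of named products would be a starting point for picking out metabolism-associated proteins per bin, although deciding which products count as metabolic would still need manual curation.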

To think about:

  • What types of features are detected by the software? Which ones are more reliable a priori?

Coding sequences, rRNA, tRNA, and tmRNA.
