Gene Annotation - sellwe/Genome-Analysis GitHub Wiki

For gene annotation I used Prokka v.1.45-5b58020 on the Canu-assembled PacBio contigs. Prokka performs structural and functional annotation: it identifies coding sequences (CDS), which usually correspond to protein-coding genes, and assigns a function where possible. It also finds rRNA, tRNA and tmRNA genes. My hope is that it will identify genes that are relevant to the difference between cells grown in human serum and BH in the downstream transcriptomic analyses.

Prokka results

In the .gbk GenBank file we can see the start and stop positions of each feature, as well as its predicted function:

We can see a summary from the .txt output file:

The 3093 CDS found here is very close to the study, which identified 3095 CDS. This indicates that the Canu assembly has high gene coverage.
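The comparison with the study can also be done programmatically. The following is a minimal sketch, assuming the Prokka .txt summary consists of simple `key: value` lines; the function names `parse_prokka_summary` and `relative_difference` are hypothetical helpers, and the sample text is a placeholder (only the CDS counts 3093 and 3095 come from this page).

```python
def parse_prokka_summary(text):
    """Parse 'key: value' lines of a Prokka-style .txt summary into a dict.

    Assumption: every informative line has the form 'key: value'.
    Lines without a colon are ignored.
    """
    summary = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        # Store counts as integers, everything else as plain text.
        summary[key] = int(value) if value.isdigit() else value
    return summary


def relative_difference(observed, reference):
    """Absolute difference between two counts, relative to the reference."""
    return abs(observed - reference) / reference


if __name__ == "__main__":
    # Placeholder summary text; in practice this would be read from the .txt file.
    sample = "organism: Genus species strain\nCDS: 3093\n"
    summary = parse_prokka_summary(sample)
    study_cds = 3095  # CDS count reported by the study
    diff = relative_difference(summary["CDS"], study_cds)
    print(f"CDS here: {summary['CDS']}, study: {study_cds}, relative difference: {diff:.3%}")
```

A difference of 2 CDS out of 3095 is well below one percent, which is what "very close" refers to here.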

From the .gff output file we can derive how many CDS are found on each of the contigs:

Most of the CDS are found on the largest contig (the suspected chromosome).
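One way to derive the per-contig CDS counts from the .gff file is sketched below. It assumes the standard tab-separated GFF3 layout (sequence ID in column 1, feature type in column 3) and that the file may end with a `##FASTA` section holding the sequences. The function name `count_cds_per_contig` and the contig names in the demo are hypothetical, not taken from the actual output.

```python
from collections import Counter


def count_cds_per_contig(gff_lines):
    """Count CDS features per contig from an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not annotation
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "CDS":
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up example; a real run would iterate over open("<prefix>.gff").
    demo = [
        "##gff-version 3",
        "\t".join(["contig_1", "Prodigal", "CDS", "1", "300", ".", "+", "0", "ID=a"]),
        "\t".join(["contig_1", "Aragorn", "tRNA", "400", "470", ".", "+", ".", "ID=b"]),
        "\t".join(["contig_2", "Prodigal", "CDS", "10", "250", ".", "-", "0", "ID=c"]),
        "##FASTA",
        ">contig_1",
    ]
    for contig, n in count_cds_per_contig(demo).most_common():
        print(contig, n)
```

Sorting with `most_common()` puts the contig with the most CDS first, which makes it easy to check whether the largest contig dominates.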

Prokka also annotated 1361 genes as "hypothetical proteins", which is 44% of the 3093 identified CDS. These had no good BLAST matches in the default databases (UniProtKB/Swiss-Prot), but this is to be expected for prokaryotes, especially one like E. faecium, which is not as well studied as, say, E. coli. For a deeper understanding of the hypothetical proteins we could also BLAST against larger non-curated databases or do manual curation. For the scope of this experiment, however, I moved forward with this annotation for the downstream differential expression analysis.
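The share of hypothetical proteins can be recomputed from the .gff file. This is a rough sketch, assuming that each CDS line carries a `product=...` entry in the GFF3 attributes column and that unassigned genes are labelled exactly `hypothetical protein`. The function name `hypothetical_fraction` is a hypothetical helper; the final lines simply redo the arithmetic with the counts reported above.

```python
def hypothetical_fraction(gff_lines):
    """Return (hypothetical CDS, total CDS) from an iterable of GFF3 lines."""
    total = 0
    hypothetical = 0
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # annotation section is over
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        total += 1
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if attributes.get("product") == "hypothetical protein":
            hypothetical += 1
    return hypothetical, total


if __name__ == "__main__":
    # Arithmetic check with the counts reported on this page.
    n_hypothetical, n_cds = 1361, 3093
    print(f"{n_hypothetical / n_cds:.1%} of CDS are hypothetical proteins")
```

Matching on the exact product string is a deliberate simplification; products with other uninformative names would need extra patterns.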