02b_Annotation with Prokka - esogin/seagrassOmics GitHub Wiki

Annotation of Gene Catalog with Prokka: Troubleshooting

Maggie Sogin

Created: July 8, 2019

Updated: July 8, 2019

My initial attempts to run the Prokka annotation pipeline failed. It is unclear at the moment why, but the pipeline simply did not predict annotations for the coding sequences (and took a very long time to run).

Troubleshooting:

Subset the initial clustered sequences to the first 10,000 lines only (5,000 sequences at two lines per record).

head -n 10000 clusters_rep_seq.fasta > troubleshooting/subset.fasta
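Because head -n counts lines rather than FASTA records, a subset taken this way holds a predictable number of sequences only if every record is exactly two lines (header plus one sequence line). A minimal sketch of subsetting by record count instead, using awk. The function name `fasta_head` and the toy file are made up for illustration:

```shell
# fasta_head N FILE: print the first N FASTA records of FILE,
# regardless of how many lines each record spans.
fasta_head() {
    awk -v n="$1" '/^>/ { if (++count > n) exit } { print }' "$2"
}

# Toy input (invented data): three records, the first wrapped over two lines.
demo_dir=$(mktemp -d)
printf '>group000001 00000001\nATGC\nATGC\n>group000002 00000002\nGGCC\n>group000003 00000003\nTTAA\n' > "$demo_dir/toy.fasta"

# Keep the first two records and count the headers in the result.
fasta_head 2 "$demo_dir/toy.fasta" > "$demo_dir/subset.fasta"
grep -c '^>' "$demo_dir/subset.fasta"
```

On the real data this would be called as `fasta_head 10000 clusters_rep_seq.fasta`, which yields 10,000 sequences rather than 10,000 lines.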

Fix the FASTA headers. Currently they are in the format group000000 00000000; remove the second field (the space and the digits after it).

sed 's/ [0-9]*//g' subset.fasta > subset.fixed.fasta
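To see what the sed expression does, it can be run on a single record. A small sketch; the header and sequence values are invented:

```shell
# Each match is a space followed by zero or more digits, so the second
# header field is dropped. Sequence lines contain no spaces and pass through.
fixed=$(printf '>group000001 00000001\nATGCATGC\n' | sed 's/ [0-9]*//g')
printf '%s\n' "$fixed"
```

Note that the pattern also matches a bare space, so any other space in a header would be removed as well.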

Run Prokka:

prokka subset.fixed.fasta --outdir troubleshooting_1 --metagenome --cpus 24 --mincontiglen 500

That worked pretty well: Prokka annotated roughly 600 of the 5,000 sequences (most were skipped because they were shorter than the minimum contig length).
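Before running Prokka, it can help to check how many sequences would survive the --mincontiglen 500 filter. A rough sketch with awk, assuming the threshold is inclusive (sequences of exactly 500 bp are kept); the function name `count_min_len` and the toy records are hypothetical:

```shell
# count_min_len MIN FILE: count FASTA records whose sequence length is >= MIN.
# Sequence lines are summed per record, so wrapped records are handled.
count_min_len() {
    awk -v min="$1" '
        /^>/ { if (seen && len >= min) kept++; seen = 1; len = 0; next }
             { len += length($0) }
        END  { if (seen && len >= min) kept++; print kept + 0 }
    ' "$2"
}

# Toy input (invented): one 4 bp record, one 600 bp record on a single line,
# and one 600 bp record wrapped over two lines.
demo_fasta=$(mktemp)
awk 'BEGIN {
    s = ""; for (i = 0; i < 600; i++) s = s "A"
    print ">short";   print "ATGC"
    print ">single";  print s
    print ">wrapped"; print substr(s, 1, 300); print substr(s, 301)
}' > "$demo_fasta"

count_min_len 500 "$demo_fasta"
```

Run against subset.fixed.fasta, the count should be in the same range as the number of sequences Prokka actually annotated.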

Now try without fixing the FASTA headers:

prokka subset.fasta --outdir troubleshooting_1 --metagenome --cpus 24 --mincontiglen 500

Skipping the header fix made no difference to the output generated.

One potential issue, based on comments from previous users, is that Parallel is out of date, so Prokka freezes when it gets to the annotation step. Here is the log info:

Could not run command: cat result/sprot.faa | parallel --gnu --plain -j 48 --block 24570138 --recstart '>' --pipe blastp -query - -db /opt/share/software/packages/prokka-1.11/bin/../db/kingdom/Bacteria/sprot -evalue 1e-06 -num_threads 1 -num_descriptions 1 -num_alignments 1 -seg no > result/sprot.blast 2> /dev/null

Parallel in Cologne is from 2011. However, this doesn't explain why I can make it work with smaller datasets (subsets, 100s instead of 1000s of sequences).
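The failing command in the log sends stderr to /dev/null, so the underlying error message is hidden. A quick first check is which parallel binary is on PATH and what version it reports. A minimal sketch; the variable names are arbitrary:

```shell
# Report which parallel binary is first on PATH and its version banner.
# stdin is closed so the check cannot block waiting for input.
if command -v parallel >/dev/null 2>&1; then
    parallel_path=$(command -v parallel)
    parallel_info=$(parallel --version </dev/null 2>&1 | head -n 1)
else
    parallel_path="(not found)"
    parallel_info="parallel is not on PATH"
fi
echo "path:    $parallel_path"
echo "version: $parallel_info"
```

Rerunning the logged command by hand without the `2> /dev/null` redirect would be the next step to see the actual error.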

Try to break Prokka with a 1M-sequence file:

prokka subset_1M.fasta --outdir troubleshooting_3 --metagenome --cpus 24 --mincontiglen 500

The tbl2asn command failed. Apparently NCBI requires that you update it every year (see https://github.com/tseemann/prokka/issues/139). I did as suggested and placed the new tbl2asn program in my tools folder.
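For a replacement tbl2asn to be found, its folder has to come before any older copy on PATH. The sketch below shows only the shell side of that, with a temporary directory and a stub script standing in for the real tools folder and the NCBI binary (both are placeholders, not the actual setup). Whether Prokka then picks up this copy depends on how the installed Prokka version orders its bundled binaries relative to PATH, which is not verified here.

```shell
# Placeholder tools folder; in practice this would hold the tbl2asn
# binary downloaded from NCBI. A stub script stands in for it here.
tools_dir=$(mktemp -d)
printf '#!/bin/sh\necho "stub tbl2asn"\n' > "$tools_dir/tbl2asn"
chmod +x "$tools_dir/tbl2asn"

# Prepend the folder so its tbl2asn wins over any older copy on PATH.
PATH="$tools_dir:$PATH"
export PATH
hash -r 2>/dev/null || true   # forget any cached command locations

# Show which tbl2asn the shell now resolves.
command -v tbl2asn
```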

See if Prokka now works with 1M sequences:

prokka subset_1M.fasta --outdir troubleshooting_4 --metagenome --cpus 24 --mincontiglen 500

Using the new tbl2asn program solved the issue in attempt 5 (where I managed to break it).

For the complete dataset/gene library, I have 73,357,971 coding sequences.

Try Prokka with 10M sequences next (this will likely take a while, but it is the last scaled-down test I can think of before scaling up to the full dataset):

prokka subset_10M.fasta --outdir troubleshooting_5 --metagenome --cpus 24 --mincontiglen 500
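To put the test sizes in perspective, each subset can be expressed as a share of the 73,357,971-sequence catalog. A small arithmetic sketch; the function name `subset_share` is made up for illustration:

```shell
# subset_share N TOTAL: print N as a percentage of TOTAL, one decimal place.
subset_share() {
    awk -v n="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * n / total }'
}

total=73357971   # full gene catalog size, from the text above
for n in 1000000 10000000; do
    echo "$n sequences = $(subset_share "$n" "$total")% of the catalog"
done
```

The 10M run covers roughly 14% of the data, so the full dataset is still about a sevenfold step up from the largest test.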