Plasmid Identification - sellwe/Genome-Analysis GitHub Wiki

My assembly produced 10 contigs, one of which is the chromosome. I ran each of the remaining 9 contigs through BLASTn to find homologs for these suspected plasmids. The study identified 6 plasmids, so my hope is that my contigs cover all 6 of them. I searched against the nr/nt database with the organism setting "Enterococcus faecium plasmid (taxid:1352)". For each contig I looked for the top result from "Enterococcus faecium strain E745", in order to match my strain. Screenshots of the top hits for each of my contigs are included below:

tig 2: image

tig 3: image

tig 4: image

tig 5: image

tig 6: image

tig 7: image

tig 8: image

tig 9: image

tig 12: image
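
Picking the top E745 hit per contig, as described above, was done by eye from the BLAST result pages. A minimal sketch of the same selection rule is shown here, assuming the hits for one contig have already been exported as a list of records. The function name `top_strain_hit`, the record layout, the `XX000000.1` entry and all bit scores are hypothetical and only illustrate the logic.

```python
def top_strain_hit(hits, strain="E745"):
    """Return the highest-scoring hit whose title mentions the strain, or None."""
    matching = [h for h in hits if strain in h["title"]]
    return max(matching, key=lambda h: h["bitscore"], default=None)


# Hypothetical hits for a single contig; titles and scores are invented.
hits = [
    {"title": "Enterococcus faecium strain EXAMPLE plasmid pX",
     "accession": "XX000000.1", "bitscore": 900.0},
    {"title": "Enterococcus faecium strain E745 plasmid pl3",
     "accession": "CP014532.1", "bitscore": 850.0},
    {"title": "Enterococcus faecium strain E745 plasmid pl1",
     "accession": "CP014530.1", "bitscore": 400.0},
]

best = top_strain_hit(hits)
print(best["accession"], best["title"])
```

Note that the best overall hit is skipped here if it comes from another strain, which mirrors the choice of only keeping E745 entries.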

| My contig | Length (bp) | Reads | Top BLAST E745 hit name | GenBank accession number |
|---|---|---|---|---|
| tig00000002 | 199383 | 580 | plasmid pl1 | CP014530.1 |
| tig00000003 | 22773 | 36 | plasmid pl5 | CP014534.1 |
| tig00000004 | 14735 | 7 | plasmid pl1 | CP014530.1 |
| tig00000005 | 40014 | 124 | plasmid pl2 | CP014531.1 |
| tig00000006 | 15014 | 14 | plasmid pl5 | CP014534.1 |
| tig00000007 | 16045 | 9 | plasmid pl4 | CP014533.1 |
| tig00000008 | 15429 | 21 | plasmid pl3 | CP014532.1 |
| tig00000009 | 24949 | 28 | plasmid pl6 | CP014535.1 |
| tig00000012 | 4226 | 6 | plasmid pl6 | CP014535.1 |
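
The table above can be checked mechanically for duplicate hits and for coverage of all six plasmids. The following is a small sketch that groups the contigs by their top hit. The values are copied from the table, while the variable names are only illustrative.

```python
from collections import defaultdict

# (contig, length, reads, top E745 hit, GenBank accession), copied from the table
hits = [
    ("tig00000002", 199383, 580, "pl1", "CP014530.1"),
    ("tig00000003", 22773, 36, "pl5", "CP014534.1"),
    ("tig00000004", 14735, 7, "pl1", "CP014530.1"),
    ("tig00000005", 40014, 124, "pl2", "CP014531.1"),
    ("tig00000006", 15014, 14, "pl5", "CP014534.1"),
    ("tig00000007", 16045, 9, "pl4", "CP014533.1"),
    ("tig00000008", 15429, 21, "pl3", "CP014532.1"),
    ("tig00000009", 24949, 28, "pl6", "CP014535.1"),
    ("tig00000012", 4226, 6, "pl6", "CP014535.1"),
]

# Group contig names under the plasmid they hit
by_plasmid = defaultdict(list)
for contig, length, reads, plasmid, accession in hits:
    by_plasmid[plasmid].append(contig)

for plasmid in sorted(by_plasmid):
    contigs = by_plasmid[plasmid]
    print(f"{plasmid}: {len(contigs)} contig(s): {', '.join(contigs)}")

# The study reports six plasmids, pl1 to pl6
expected = {f"pl{i}" for i in range(1, 7)}
missing = expected - set(by_plasmid)
print("missing plasmids:", sorted(missing))
```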

As expected, I got some duplicate hits for the plasmids, but all 6 plasmids were detected. I only picked the top hit from E745, and interestingly all of the entries have the same source. For example, for tig00000002:

image

This is an entry from 2016 by the same authors who published my paper a year later. This further indicates to me that the assembly managed to identify the correct plasmids, since I basically have the same dataset as they had.

Sorting the .txt file in the Prokka output by contig gives the following:

image

Here we also see that the first contig is the chromosome: most of the CDS (coding DNA sequences) are located there, along with all of the genes coding for rRNA, tRNA and tmRNA. The rest should represent the plasmids. The number of CDS identified by Prokka can now be compared with the CDS count of the corresponding GenBank reference plasmid.
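
Per-contig feature counts like the ones discussed above could also be obtained by tallying the feature column of the GFF file that Prokka writes. This is a different route from sorting the .txt file, shown only as a minimal sketch. The function name `count_features` and the miniature input are made up for illustration, and the parser assumes the standard nine-column, tab-separated GFF3 layout with an optional `##FASTA` section at the end.

```python
from collections import Counter, defaultdict


def count_features(gff_text):
    """Return {contig: Counter({feature_type: n})} for GFF3 text."""
    counts = defaultdict(Counter)
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequences follow this marker, no more annotations
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        seqid, feature_type = fields[0], fields[2]
        counts[seqid][feature_type] += 1
    return counts


# Made-up miniature input, NOT taken from the real annotation
example = "\n".join([
    "##gff-version 3",
    "tig00000001\tProdigal\tCDS\t100\t400\t.\t+\t0\tID=demo_00001",
    "tig00000001\tAragorn\ttRNA\t500\t575\t.\t+\t.\tID=demo_00002",
    "tig00000002\tProdigal\tCDS\t50\t350\t.\t-\t0\tID=demo_00003",
    "##FASTA",
    ">tig00000001",
    "ACGT",
])

counts = count_features(example)
for contig in sorted(counts):
    print(contig, dict(counts[contig]))
```

On a real run the text would be read from the Prokka `.gff` file instead of the inline example.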

Plasmid Contig Summary

pl1 (CP014530.1) — 221 CDS

| My contig | Length (bp) | Reads | CDS |
|---|---|---|---|
| tig00000002 | 199383 | 580 | 199 |
| tig00000004 | 14735 | 7 | 27 |

Total CDS: 199 + 27 = 226
Percent of CDS identified: 226/221 ≈ 102.3%

I got two fragments that together appear to capture the entire plasmid 1.

pl2 (CP014531.1) — 41 CDS

| My contig | Length (bp) | Reads | CDS |
|---|---|---|---|
| tig00000005 | 40014 | 124 | 49 |

Percent of CDS identified: 49/41 ≈ 119.5%

I think it captured the entire plasmid, with slight overannotation.

pl3 (CP014532.1) — 12 CDS

| My contig | Length (bp) | Reads | CDS |
|---|---|---|---|
| tig00000008 | 15429 | 21 | 20 |

Percent of CDS identified: 20/12 ≈ 166.7%

Large overannotation

pl4 (CP014533.1) — 31 CDS

| My contig | Length (bp) | Reads | CDS |
|---|---|---|---|
| tig00000007 | 16045 | 9 | 28 |

Percent of CDS identified: 28/31 ≈ 90.3%

Almost full coverage.

pl5 (CP014534.1) — 68 CDS

| My contig | Length (bp) | Reads | CDS |
|---|---|---|---|
| tig00000003 | 22773 | 36 | 34 |
| tig00000006 | 15014 | 14 | 18 |

Total CDS: 34 + 18 = 52
Percent of CDS identified: 52/68 ≈ 76.5%

Partial coverage. The plasmid might be fragmented or underannotated.

pl6 (CP014535.1) — 74 CDS

| My contig | Length (bp) | Reads | CDS |
|---|---|---|---|
| tig00000009 | 24949 | 28 | 32 |
| tig00000012 | 4226 | 6 | 5 |

Total CDS: 32 + 5 = 37
Percent of CDS identified: 37/74 = 50.0%

Incomplete. The plasmid might be fragmented or underannotated.
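
The per-plasmid arithmetic in this summary can be reproduced with a few lines of Python. This is only a sketch: the CDS counts are copied from the summary above, and the names `reference_cds`, `contig_cds` and `percent_identified` are illustrative.

```python
# CDS counts of the GenBank reference plasmids, from the headings above
reference_cds = {"pl1": 221, "pl2": 41, "pl3": 12, "pl4": 31, "pl5": 68, "pl6": 74}

# Prokka CDS counts per contig, grouped by the plasmid each contig matched
contig_cds = {
    "pl1": {"tig00000002": 199, "tig00000004": 27},
    "pl2": {"tig00000005": 49},
    "pl3": {"tig00000008": 20},
    "pl4": {"tig00000007": 28},
    "pl5": {"tig00000003": 34, "tig00000006": 18},
    "pl6": {"tig00000009": 32, "tig00000012": 5},
}


def percent_identified(plasmid):
    """Summed contig CDS as a percentage of the reference CDS count."""
    total = sum(contig_cds[plasmid].values())
    return 100 * total / reference_cds[plasmid]


for plasmid in sorted(reference_cds):
    total = sum(contig_cds[plasmid].values())
    ref = reference_cds[plasmid]
    print(f"{plasmid}: {total}/{ref} CDS = {percent_identified(plasmid):.1f}%")
```

This ratio is only a rough proxy. It compares counts, not the genes themselves, so a value above 100% points to differences in annotation and does not mean that more of the plasmid was recovered.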

Some of the plasmids were satisfactorily captured (pl1, pl2 and pl4). The remaining plasmids were correctly identified, but their CDS counts were either lacking or overannotated compared to the GenBank entries. There might have been some misassembly, where Canu didn't join the fragments into contigs properly. A few contigs were also removed (contigs 10 and 11 are "missing"). These might hold information that would have completed the set. Prokka might also not have annotated these regions correctly. If we look back at the Canu .tigInfo file:

image

We see that tigs 10 and 11 are unassembled. Perhaps some of the missing genes are located here, but the software was unable to assemble them into contigs. An interesting follow-up would be to mimic the study and use the Illumina short reads together with the Nanopore long reads in the SPAdes assembler, targeting these contigs specifically in order to get full plasmid coverage, and then redo the plasmid identification step.