Plasmid Identification - sellwe/Genome-Analysis GitHub Wiki
My assembly produced 10 contigs, one of which is the chromosome. I ran each of the remaining 9 contigs through BLASTn to find homologs for these suspected plasmids. The study identified 6 plasmids, so my hope is that my contigs cover all 6 of them. I searched the nr/nt database with the organism filter set to "Enterococcus faecium plasmid (taxid:1352)". For each contig I looked for the top hit from "Enterococcus faecium strain E745", to match my strain. Screenshots of the top hits for each contig are included below:
tig 2:
tig 3:
tig 4:
tig 5:
tig 6:
tig 7:
tig 8:
tig 9:
tig 12:
My contig | Length | Reads | Top BLAST E745 hit name | GenBank accession no. |
---|---|---|---|---|
tig00000002 | 199383 | 580 | plasmid pl1 | CP014530.1 |
tig00000003 | 22773 | 36 | plasmid pl5 | CP014534.1 |
tig00000004 | 14735 | 7 | plasmid pl1 | CP014530.1 |
tig00000005 | 40014 | 124 | plasmid pl2 | CP014531.1 |
tig00000006 | 15014 | 14 | plasmid pl5 | CP014534.1 |
tig00000007 | 16045 | 9 | plasmid pl4 | CP014533.1 |
tig00000008 | 15429 | 21 | plasmid pl3 | CP014532.1 |
tig00000009 | 24949 | 28 | plasmid pl6 | CP014535.1 |
tig00000012 | 4226 | 6 | plasmid pl6 | CP014535.1 |
As expected, I got some duplicate hits for the plasmids, but all 6 plasmids were detected. I only picked the top hit from E745, and interestingly all of the entries have the same source. For example, for tig00000002:
This is an entry from 2016 by the same authors who published my paper a year later. This further indicates that the assembly identified the correct plasmids, since I have essentially the same dataset as they had.
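The duplicate hits are easier to see when the BLAST table is regrouped by plasmid. The following is a minimal sketch that uses only the contig names, lengths and top hits from the table above. The variable and function names are made up for illustration.

```python
from collections import defaultdict

# (contig, length in bp, top E745 plasmid hit), copied from the BLAST table
top_hits = [
    ("tig00000002", 199383, "pl1"),
    ("tig00000003", 22773, "pl5"),
    ("tig00000004", 14735, "pl1"),
    ("tig00000005", 40014, "pl2"),
    ("tig00000006", 15014, "pl5"),
    ("tig00000007", 16045, "pl4"),
    ("tig00000008", 15429, "pl3"),
    ("tig00000009", 24949, "pl6"),
    ("tig00000012", 4226, "pl6"),
]


def group_by_plasmid(hits):
    """Map each plasmid name to the list of (contig, length) pairs that hit it."""
    groups = defaultdict(list)
    for contig, length, plasmid in hits:
        groups[plasmid].append((contig, length))
    return dict(groups)


groups = group_by_plasmid(top_hits)
for plasmid in sorted(groups):
    contigs = groups[plasmid]
    total_bp = sum(length for _, length in contigs)
    names = ", ".join(name for name, _ in contigs)
    print(f"{plasmid}: {len(contigs)} contig(s), {total_bp} bp in total ({names})")
```

Plasmids that appear with more than one contig (pl1, pl5 and pl6) are the ones with duplicate hits.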
Sorting the .txt file in the Prokka output by contig gives the following:
Here we also see that the first contig is the chromosome: it holds most of the CDS (coding DNA sequences) and all of the genes coding for rRNA, tRNA and tmRNA. The remaining contigs should represent the plasmids. The number of CDS identified by Prokka can now be compared with the CDS count of the corresponding GenBank reference plasmid.
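The per-contig CDS counts can also be tallied directly from Prokka's GFF3 output. The sketch below is one possible way to do this, not necessarily how the counts on this page were produced. The function names and the example path are hypothetical. It assumes the standard tab-separated GFF3 layout (sequence ID in column 1, feature type in column 3) and stops at the `##FASTA` marker that precedes the embedded sequences.

```python
from collections import Counter


def count_cds_per_contig(gff_lines):
    """Count CDS features per contig from an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # annotation section is over; the rest is sequence data
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "CDS":
            counts[fields[0]] += 1
    return counts


def count_cds_in_file(path):
    """Convenience wrapper that reads a GFF3 file from disk."""
    with open(path) as handle:
        return count_cds_per_contig(handle)


# Example call (hypothetical path, adjust to the actual Prokka output folder):
# for contig, n_cds in sorted(count_cds_in_file("prokka_out/annotation.gff").items()):
#     print(contig, n_cds)
```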
Plasmid Contig Summary
pl1 (CP014530.1) — 221 CDS
My Contig | Length | Reads | CDS |
---|---|---|---|
tig00000002 | 199383 | 580 | 199 |
tig00000004 | 14735 | 7 | 27 |
Total CDS: 199 + 27 = 226
Percent of CDS identified: 226/221 ≈ 102.3%
I got two fragments that together captured all of plasmid 1.
pl2 (CP014531.1) — 41 CDS
My Contig | Length | Reads | CDS |
---|---|---|---|
tig00000005 | 40014 | 124 | 49 |
Percent of CDS identified: 49/41 ≈ 119.5%
I think this contig captured the entire plasmid, with slight overannotation.
pl3 (CP014532.1) — 12 CDS
My Contig | Length | Reads | CDS |
---|---|---|---|
tig00000008 | 15429 | 21 | 20 |
Percent of CDS identified: 20/12 ≈ 166.7%
Large overannotation.
pl4 (CP014533.1) — 31 CDS
My Contig | Length | Reads | CDS |
---|---|---|---|
tig00000007 | 16045 | 9 | 28 |
Percent of CDS identified: 28/31 ≈ 90.3%
Almost full coverage.
pl5 (CP014534.1) — 68 CDS
My Contig | Length | Reads | CDS |
---|---|---|---|
tig00000003 | 22773 | 36 | 34 |
tig00000006 | 15014 | 14 | 18 |
Total CDS: 34 + 18 = 52
Percent of CDS identified: 52/68 ≈ 76.5%
Partial coverage. The plasmid might be fragmented or underannotated.
pl6 (CP014535.1) — 74 CDS
My Contig | Length | Reads | CDS |
---|---|---|---|
tig00000009 | 24949 | 28 | 32 |
tig00000012 | 4226 | 6 | 5 |
Total CDS: 32 + 5 = 37
Percent of CDS identified: 37/74 = 50.0%
Incomplete. The plasmid might be fragmented or underannotated.
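The percentages above can be recomputed in one pass. This is a small sketch using only the CDS counts listed in the tables on this page. The dictionary and function names are made up for illustration.

```python
# CDS counts of the GenBank reference plasmids (from the headings above)
reference_cds = {"pl1": 221, "pl2": 41, "pl3": 12, "pl4": 31, "pl5": 68, "pl6": 74}

# Prokka CDS counts of the contigs assigned to each plasmid (from the tables above)
prokka_cds = {
    "pl1": [199, 27],
    "pl2": [49],
    "pl3": [20],
    "pl4": [28],
    "pl5": [34, 18],
    "pl6": [32, 5],
}


def percent_identified(found_counts, reference_count):
    """Summed CDS over all contigs of a plasmid, as a percentage of the reference."""
    return 100.0 * sum(found_counts) / reference_count


for plasmid, reference in reference_cds.items():
    found = prokka_cds[plasmid]
    pct = percent_identified(found, reference)
    print(f"{plasmid}: {sum(found)}/{reference} = {pct:.1f}%")
```

A value above 100% means Prokka called more CDS on the contigs than the GenBank entry lists. A value below 100% means fewer.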
Some of the plasmids were satisfactorily captured (pl1, pl2 and pl4). The remaining plasmids were correctly identified, but their contigs were either missing CDS or overannotated compared to the GenBank entries. There might have been some misassembly where Canu didn't join the fragments into contigs properly. A few contigs were also removed (contigs 10 and 11 are "missing"). These might hold information that would have completed the set. Prokka might also not have annotated these regions correctly. If we look back at the Canu .tigInfo file:
We see that tigs 10 and 11 are unassembled. Perhaps some of the missing genes are located there, but the software was unable to assemble them into contigs. An interesting follow-up would be to mimic the study and use the Illumina short reads together with the Nanopore long reads in the SPAdes assembler, targeting these contigs specifically to get full plasmid coverage, and then redo the plasmid identification step.
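The check for unassembled tigs can be done programmatically instead of by eye. The sketch below rests on assumptions that should be verified against the actual file: that the .tigInfo file is tab-separated with a header row, that the ID column is named `#tigID`, that the class column is named `tigClass`, and that assembled tigs carry the class `contig`. Both column names are exposed as parameters so they can be changed if the file differs.

```python
import csv


def unassembled_tigs(lines, id_column="#tigID", class_column="tigClass",
                     assembled_class="contig"):
    """Return the IDs of all tigs whose class is not the assembled class.

    The column names and the class label are assumptions about the
    .tigInfo layout; check the header of the real file before relying on them.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row[id_column] for row in reader if row[class_column] != assembled_class]


# Example call (hypothetical file name):
# with open("assembly.contigs.layout.tigInfo") as handle:
#     print(unassembled_tigs(handle))
```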