Plasmid Identification - sellwe/Genome-Analysis GitHub Wiki

My assembly produced 10 contigs, one of which is the chromosome. I ran each of the remaining 9 contigs through BLASTn to find homologs for these suspected plasmids. The study identified 6 plasmids, so my hope is that my contigs cover all 6 of them. I searched the nr/nt database with the organism setting "Enterococcus faecium plasmid (taxid:1352)". For each contig I looked for the top result from "Enterococcus faecium strain E745", in order to match my strain. Screenshots of the top hits for each of my contigs are included below:

tig 2:

tig 3:

tig 4:

tig 5:

tig 6:

tig 7:

tig 8:

tig 9:

tig 12:

My contig	Length	Reads	Top BLAST E745 hit name	GenBank Accession nr
tig00000002	199383	580	plasmid pl1	CP014530.1
tig00000003	22773	36	plasmid pl5	CP014534.1
tig00000004	14735	7	plasmid pl1	CP014530.1
tig00000005	40014	124	plasmid pl2	CP014531.1
tig00000006	15014	14	plasmid pl5	CP014534.1
tig00000007	16045	9	plasmid pl4	CP014533.1
tig00000008	15429	21	plasmid pl3	CP014532.1
tig00000009	24949	28	plasmid pl6	CP014535.1
tig00000012	4226	6	plasmid pl6	CP014535.1
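
The selection rule behind the table above (keep the best hit whose title names strain E745) can be sketched in Python. This is only a sketch and not how the table was made, since the searches were run in the BLAST web interface. It assumes the hits were exported as tab-separated text with four columns (query, subject accession, bit score, subject title). That column layout, the function name `top_strain_hits`, the bit scores and the placeholder accession `XX000000.1` in the toy input are all invented for illustration.

```python
import csv
import io


def top_strain_hits(tsv_text, strain="E745"):
    """Return {query: (accession, title)} for the highest-scoring hit
    whose title mentions the given strain.

    Assumes four tab-separated columns per line (hypothetical layout):
    query, subject accession, bit score, subject title.
    """
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        query, accession, bitscore, title = row
        if strain not in title:
            continue  # hit is from another strain
        score = float(bitscore)
        if query not in best or score > best[query][0]:
            best[query] = (score, accession, title)
    return {q: (acc, title) for q, (_, acc, title) in best.items()}


# Toy input: the scores and the XX000000.1 entry are made up.
example = "\n".join([
    "tig00000002\tXX000000.1\t1200\tEnterococcus faecium strain OTHER plasmid",
    "tig00000002\tCP014530.1\t1000\tEnterococcus faecium strain E745 plasmid pl1",
    "tig00000005\tCP014531.1\t900\tEnterococcus faecium strain E745 plasmid pl2",
])

hits = top_strain_hits(example)
for contig, (accession, title) in sorted(hits.items()):
    print(contig, accession, title, sep="\t")
```

Filtering on the strain name before comparing scores mirrors the manual approach: a higher-scoring hit from another strain is ignored in favour of the best E745 entry.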

As expected, I had some duplicate hits for the plasmids, but all 6 plasmids were detected. I only picked the top hit from E745 and, interestingly, all of the entries have the same source. For example, for tig00000002:

This is an entry from 2016 by the same authors who published my paper a year later. This further indicates to me that the assembly recovered the correct plasmids, since I basically have the same dataset as they had.

Sorting the .txt file from the Prokka output by contig, I get the following:

Here we also see that the first contig is the chromosome: it holds most of the CDS (Coding DNA Sequence) and all of the genes coding for rRNA, tRNA and tmRNA. The rest should represent the plasmids. The number of CDS identified by Prokka can now be compared with the CDS count of the corresponding GenBank reference plasmid.
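
A per-contig feature count like the one described above can also be derived programmatically. The sketch below assumes the Prokka GFF3 output (the `.gff` file) is used rather than the .txt file, because GFF3 lines carry the contig name in the first column and the feature type in the third. The function name `count_features_per_contig` and the toy records (coordinates, IDs) are invented for illustration.

```python
from collections import Counter, defaultdict


def count_features_per_contig(gff_lines):
    """Count feature types (CDS, rRNA, tRNA, ...) per contig.

    Expects GFF3 lines: nine tab-separated columns, with the sequence ID
    in column 1 and the feature type in column 3.
    """
    counts = defaultdict(Counter)
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        seqid, feature_type = fields[0], fields[2]
        counts[seqid][feature_type] += 1
    return counts


# Toy records: all coordinates and IDs are made up.
toy_gff = [
    "##gff-version 3",
    "\t".join(["tig00000001", "Prodigal", "CDS", "10", "400", ".", "+", "0", "ID=a"]),
    "\t".join(["tig00000001", "Prodigal", "CDS", "500", "900", ".", "-", "0", "ID=b"]),
    "\t".join(["tig00000001", "Aragorn", "tRNA", "950", "1020", ".", "+", ".", "ID=c"]),
    "\t".join(["tig00000002", "Prodigal", "CDS", "5", "300", ".", "+", "0", "ID=d"]),
    "##FASTA",
    ">tig00000001",
    "ACGT",
]

feature_counts = count_features_per_contig(toy_gff)
for contig in sorted(feature_counts):
    print(contig, dict(feature_counts[contig]))
```

With a real file, the lines could be passed in directly from an open file handle, for example `count_features_per_contig(open(path))`, where `path` points at the annotation file.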

Plasmid Contig Summary

pl1 (CP014530.1) — 221 CDS

My Contig	Length	Reads	CDS
tig00000002	199383	580	199
tig00000004	14735	7	27

Total CDS: 199 + 27 = 226
Percent of CDS identified: 226/221 ≈ 102.3%

I got two fragments that together captured the entire plasmid 1.

pl2 (CP014531.1) — 41 CDS

My Contig	Length	Reads	CDS
tig00000005	40014	124	49

Percent of CDS identified: 49/41 ≈ 119.5%

I think it captured the entire plasmid. Slight overannotation.

pl3 (CP014532.1) — 12 CDS

My Contig	Length	Reads	CDS
tig00000008	15429	21	20

Percent of CDS identified: 20/12 ≈ 166.7%

Large overannotation.

pl4 (CP014533.1) — 31 CDS

My Contig	Length	Reads	CDS
tig00000007	16045	9	28

Percent of CDS identified: 28/31 ≈ 90.3%

Almost full coverage.

pl5 (CP014534.1) — 68 CDS

My Contig	Length	Reads	CDS
tig00000003	22773	36	34
tig00000006	15014	14	18

Total CDS: 34 + 18 = 52
Percent of CDS identified: 52/68 ≈ 76.5%

Partial coverage. Might be fragmented or underannotated.

pl6 (CP014535.1) — 74 CDS

My Contig	Length	Reads	CDS
tig00000009	24949	28	32
tig00000012	4226	6	5

Total CDS: 32 + 5 = 37
Percent of CDS identified: 37/74 = 50.0%

Incomplete. Might be fragmented or underannotated.
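
The percentages in the summary above can be reproduced with a few lines of Python. The CDS counts are copied from the tables in this section. The names `reference_cds`, `contig_cds` and `percent_identified` are only illustrative.

```python
# CDS counts of the GenBank reference plasmids (from the summary above).
reference_cds = {"pl1": 221, "pl2": 41, "pl3": 12, "pl4": 31, "pl5": 68, "pl6": 74}

# Prokka CDS counts for the contigs assigned to each plasmid.
contig_cds = {
    "pl1": {"tig00000002": 199, "tig00000004": 27},
    "pl2": {"tig00000005": 49},
    "pl3": {"tig00000008": 20},
    "pl4": {"tig00000007": 28},
    "pl5": {"tig00000003": 34, "tig00000006": 18},
    "pl6": {"tig00000009": 32, "tig00000012": 5},
}


def percent_identified(plasmid):
    """Summed contig CDS as a percentage of the reference CDS count."""
    total = sum(contig_cds[plasmid].values())
    return 100 * total / reference_cds[plasmid]


for plasmid in sorted(reference_cds):
    total = sum(contig_cds[plasmid].values())
    ref = reference_cds[plasmid]
    print(f"{plasmid}: {total}/{ref} = {percent_identified(plasmid):.1f}%")
```

Values above 100% mean Prokka called more CDS on my contigs than the GenBank entry lists, and values below 100% mean it called fewer. The ratio alone does not say whether the same genes were found.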

Some of the plasmids were satisfactorily captured (pl1, pl2 and pl4). The remaining plasmids were correctly identified, but they were either incompletely covered or overannotated compared to the GenBank entries. There might have been some misassembly where Canu did not join the fragments into contigs properly. A few contigs were also removed (contigs 10 and 11 are "missing"). These might hold information that would have completed the set. Prokka might also not have annotated these regions correctly. If we look back at the Canu .tigInfo file:

We see that tig 10 and 11 are unassembled. Perhaps some of the missing genes are located there, but the software was unable to assemble them into contigs. An interesting follow-up would be to mimic the study and use the Illumina short reads together with the Nanopore long reads in the SPAdes assembler, targeting these contigs specifically in order to get full plasmid coverage, and then redo the plasmid identification step.