Genome Assembly - sellwe/Genome-Analysis GitHub Wiki
To sequence the genome of E. faecium 745, the authors of the study used:
- Pacific Biosciences RS II SMRT technology for long reads
- MinION system with R7 flowcell chemistry (Oxford Nanopore Technologies) for long reads
- Illumina HiSeq 100 bp paired-end sequencing for short reads
To assemble the genome of E. faecium, I used two different assembly methods: Canu with the PacBio long reads, and SPAdes with the Nanopore long reads in combination with the Illumina short reads. I did not preprocess the long reads; I used them as provided, either raw or already processed. The preprocessing of the short reads is described in the previous chapter. After each assembly I ran Quast as a quality check. The goal was to compare the two methods and choose which assembly to use in downstream analyses.
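As a rough illustration of the Canu and Quast steps described above, the sketch below only builds the command lines; it does not run anything. The file names, output directories, prefix and the `genomeSize=3m` estimate are assumptions chosen for the example and are not the values used in this project. The flags shown (`-p`, `-d`, `genomeSize=`, `-pacbio` for Canu; `-o` for `quast.py`) are the standard ones for these tools.

```python
import shlex
from typing import List


def canu_command(prefix: str, out_dir: str, genome_size: str, pacbio_reads: str) -> List[str]:
    """Build a Canu 2.x command line for PacBio reads (not executed here)."""
    return [
        "canu",
        "-p", prefix,                  # prefix for output file names
        "-d", out_dir,                 # output directory
        f"genomeSize={genome_size}",   # rough genome size estimate, e.g. "3m"
        "-pacbio", pacbio_reads,       # PacBio long reads
    ]


def quast_command(assembly_fasta: str, out_dir: str) -> List[str]:
    """Build a Quast command line that evaluates one assembly FASTA."""
    return ["quast.py", assembly_fasta, "-o", out_dir]


if __name__ == "__main__":
    # All paths below are hypothetical placeholders.
    canu = canu_command("efaecium", "canu_out", "3m", "pacbio_reads.fastq.gz")
    quast = quast_command("canu_out/efaecium.contigs.fasta", "quast_canu")
    print(shlex.join(canu))
    print(shlex.join(quast))
```

Passing either list to `subprocess.run(cmd, check=True)` would execute it, provided the tool is installed and on the `PATH`.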
PacBio long reads with Canu v2.2 assembly evaluation
Quality from Quast:
The quality looks good. I got only 10 contigs, and the longest contig is 2762476 bp, which is very close to the length of the chromosome in the study (2765010 bp, a difference of 2534 bp). This indicates a near-complete assembly of the chromosome. For this part of the assembly, the authors of the study added a polishing step in which they used the Illumina short reads with the BWA software to correct the PacBio assembly, which could explain the slight difference between our assemblies.
The N50 is the same as the length of the longest contig (2762476 bp). N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the assembly, so here the longest contig alone accounts for more than half of the assembled bases. Similarly, L50 = 1 indicates that only 1 contig is needed to cover 50% of the assembly.
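To make the N50 and L50 definitions concrete, here is a minimal sketch that computes both from a list of contig lengths. Only the longest-contig length (2762476 bp) comes from the Quast report; the other lengths in the examples are made up for illustration.

```python
from typing import List, Tuple


def n50_l50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a list of contig lengths.

    Contigs are sorted from longest to shortest and summed until the running
    total reaches at least half of the assembly size. N50 is the length of the
    contig that crosses that threshold; L50 is how many contigs were needed.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return length, count
    raise AssertionError("unreachable: the running total always reaches the total")


if __name__ == "__main__":
    # One dominant contig plus hypothetical smaller ones: N50 is the longest contig.
    print(n50_l50([2762476, 200000, 100000, 50000]))  # (2762476, 1)
    # A fragmented toy assembly needs more contigs to reach 50%.
    print(n50_l50([500, 400, 300, 200, 100]))         # (400, 2)
```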
We can also look at the .tigInfo file from the assembly to see the lengths of the contigs:
I got 10 contigs in total. One is presumably the chromosome, and the other 9 are either plasmids or fragments of plasmids. In comparison, the study identified 6 plasmids. The extra contigs in my assembly could be small repeats, or the assembler may have split some plasmids into several contigs, either because of the assembly settings or because I did no further scaffolding or polishing. I will do plasmid identification later to see whether the assembly recovered the same plasmids as the study. The authors of the study had 9 contigs that they suspected to be plasmids, and they used PCR amplification, the Illumina short reads and the Nanopore long reads to combine these contigs and close the gaps.
As an additional evaluation step, this assembly was compared to a reference genome; see the "Comparative Genomics" section.
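The contig count and lengths discussed above can also be cross-checked directly from the assembly FASTA, independently of the .tigInfo file. The sketch below is a minimal, dependency-free FASTA reader; the file name in the comment is a hypothetical placeholder, and the demo sequences are toy data rather than real contigs.

```python
from typing import Dict, Iterable


def contig_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each FASTA record name to its sequence length.

    The record name is the first whitespace-delimited token after '>'.
    Sequence lines may be wrapped; their lengths are summed per record.
    """
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


if __name__ == "__main__":
    # With a real assembly (hypothetical path):
    #   with open("canu_out/efaecium.contigs.fasta") as fh:
    #       lengths = contig_lengths(fh)
    toy_fasta = [">tig1 some description", "ACGTACGT", "ACGT", ">tig2", "AC"]
    lengths = contig_lengths(toy_fasta)
    # Print contigs from longest to shortest.
    for contig, size in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        print(contig, size)
```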
Extra analysis: Genome assembly using Illumina/Nanopore
Illumina short reads + Nanopore long reads with SPAdes v3.15.5 assembly evaluation
In the previous step I trimmed the Illumina short reads. I used these trimmed short reads together with the Nanopore long reads for the assembly.
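A hybrid SPAdes run of this kind can be sketched as follows. The function only builds the command line, and all file names and the output directory are hypothetical placeholders rather than the paths used here. `-1`/`-2` (paired-end reads), `--nanopore` and `-o` are standard `spades.py` options.

```python
import shlex
from typing import List


def spades_hybrid_command(reads_1: str, reads_2: str, nanopore_reads: str, out_dir: str) -> List[str]:
    """Build a SPAdes command for Illumina paired-end plus Nanopore reads."""
    return [
        "spades.py",
        "-1", reads_1,                 # forward Illumina reads
        "-2", reads_2,                 # reverse Illumina reads
        "--nanopore", nanopore_reads,  # Nanopore long reads used to resolve repeats and close gaps
        "-o", out_dir,                 # output directory
    ]


if __name__ == "__main__":
    # Hypothetical file names for the trimmed short reads and the long reads.
    cmd = spades_hybrid_command(
        "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz", "nanopore.fastq.gz", "spades_out"
    )
    print(shlex.join(cmd))
```

Swapping the two trimmed files for the untrimmed ones is the only change needed to repeat the run on the raw reads.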
Quality from Quast:
This assembly immediately looks much worse than the PacBio assembly. I got 83 contigs, indicating a very fragmented assembly. Some of them seem to be listed as 0 bp long, although this most likely reflects Quast's "# contigs (>= 0 bp)" row, which counts contigs of any length. The longest contig is only 559726 bp, so the chromosome was not captured in a single contig. The L50 tells us that 4 contigs are needed to cover 50% of the assembly. Because I had done my own trimming of the Illumina reads, I wanted to see whether running the assembly on the raw reads would yield better results.
Untrimmed SPAdes quality:
But it still looks similar: slightly fewer contigs, but still an L50 of 4, and the longest contig is even shorter than before. These results suggest to me that the Nanopore sequencing was not as successful as the PacBio sequencing for whole-genome assembly. Based on these results, Canu with the PacBio reads appears to have assembled the genome better, and I will move forward with the Canu assembly in downstream analyses.