3. Genome Assembly & Evaluation

PacBio Genome Assembly

De novo assembly with Canu

A de novo assembly of the PacBio reads was performed using Canu, an assembler designed for noisy long reads from PacBio or Oxford Nanopore sequencing. The assembly was run as a batch job on the UPPMAX Snowy cluster. By default, Canu performs three steps:

  1. Correction of errors in the raw long reads based on read overlaps.
  2. Trimming of low-quality sequences.
  3. Assembly of the reads into contigs.

As seen in the script 01_canu_pacbio_assembly.sh, Canu was run using the following parameters:

  • genomeSize=3m: The estimated genome size of E. faecium, roughly 3 Mb (Qin et al. 2012).
  • -pacbio-raw $INPUT_DIR/*.fastq.gz: Specifies the input as raw (uncorrected) PacBio reads.
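The script 01_canu_pacbio_assembly.sh itself is not reproduced on this page, so the following is only a minimal Python sketch of how the two parameters above could be combined into a Canu command line. The function name build_canu_command, the output directory argument, and the use of -p and -d are assumptions for illustration; the default prefix pacbio_assembly is inferred from the name of the contig file evaluated below.

```python
from pathlib import Path


def build_canu_command(input_dir, output_dir, prefix="pacbio_assembly"):
    """Build the argument list for a Canu run on raw PacBio reads.

    Hypothetical helper: the prefix and output directory are assumptions,
    not values taken from the original batch script.
    """
    # Expand *.fastq.gz ourselves, since no shell is involved here.
    reads = sorted(str(path) for path in Path(input_dir).glob("*.fastq.gz"))
    if not reads:
        raise FileNotFoundError(f"no *.fastq.gz files found in {input_dir}")
    return [
        "canu",
        "-p", prefix,            # prefix for Canu's output files (assumed)
        "-d", str(output_dir),   # directory for Canu's output (assumed)
        "genomeSize=3m",         # estimated genome size, as listed above
        "-pacbio-raw", *reads,   # raw PacBio reads, as listed above
    ]
```

On a system where canu is on the PATH, the returned list could be passed to subprocess.run(cmd, check=True). Building the command as a list avoids shell quoting problems with file names.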

QUAST Evaluation

The quality of the PacBio assembly was evaluated using QUAST. QUAST generates statistics such as:

  • Number of contigs: The total number of assembled contigs.
  • Total length: The total length of all contigs in base pairs.
  • N50: The length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length.
  • L50: The smallest number of contigs whose combined length covers at least 50% of the total assembly length.
  • GC-content: The percentage of guanine and cytosine bases in the assembly.
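To make the definitions above concrete, here is a small Python sketch that computes the same five statistics from a list of contig sequences. It illustrates the definitions only and is not QUAST's own implementation; the function name assembly_stats and the toy contigs are made up for this example.

```python
def assembly_stats(contigs):
    """Summarize a list of contig sequences (strings of A/C/G/T)."""
    lengths = sorted((len(seq) for seq in contigs), reverse=True)
    total = sum(lengths)

    # Walk from the longest contig down until half the assembly is covered.
    n50 = l50 = 0
    covered = 0
    for count, length in enumerate(lengths, start=1):
        covered += length
        if 2 * covered >= total:
            n50, l50 = length, count
            break

    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C") for seq in contigs)
    gc_percent = 100 * gc_bases / total if total else 0.0

    return {
        "num_contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "l50": l50,
        "gc_percent": gc_percent,
    }


# Toy contigs of 40, 30, 20 and 10 bp (made-up sequences, not project data).
toy_contigs = ["G" * 40, "A" * 30, "C" * 20, "T" * 10]
stats = assembly_stats(toy_contigs)
```

For the toy contigs the total length is 100 bp. The longest contig (40 bp) covers less than half, while the two longest together (70 bp) cover more than half, so N50 is 30 and L50 is 2. The GC-content is 60%.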

QUAST was run on a compute node of the UPPMAX Snowy cluster by loading QUAST version 5.0.2 and then running the following command:

python /sw/bioinfo/quast/5.0.2/snowy/bin/quast.py pacbio_assembly.contigs.fasta -o quast_output

The results from the evaluation are found in the output file pacbio_assembly_report.pdf. The assembly consists of 9 contigs, each longer than 10,000 bp. The total length is 3.1 Mb, which matches the expected genome size of E. faecium. The largest contig is 2,763,567 bp, so a single contig covers well over 75% (roughly 89%) of the total assembly length. Overall, these statistics suggest a highly contiguous, good-quality genome assembly.

Hybrid Assembly (Extra Analysis)

De novo assembly with SPAdes

A hybrid de novo assembly was performed using SPAdes with Illumina and Nanopore reads. A hybrid assembly combines the strengths of both data types: the Illumina reads are short but highly accurate, while the Nanopore reads are less accurate but long enough to span repeats and so help resolve them.

SPAdes was run using a SLURM script on UPPMAX, as can be seen here: 02_spades_hybrid_assembly.sh

QUAST Evaluation

A QUAST evaluation of the SPAdes hybrid assembly was performed in the same way as for the PacBio assembly. The output file hybrid_assembly_report.pdf lists the results of the quality evaluation. The hybrid assembly consists of 108 contigs, of which 16 are larger than 1,000 bp and 7 are larger than 50,000 bp. The largest contig is 1,184,553 bp, and the total assembly length is 3,130,588 bp. The N50 value is 842,521 bp and the L50 value is 2, indicating a relatively high level of contiguity.
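The reported values can be checked against each other with a little arithmetic. An L50 of 2 means the N50 contig is the second-largest contig, so the largest contig alone must cover less than half of the assembly, while the two largest together must cover at least half. The Python sketch below performs that check; the function name l50_is_two is made up for this example.

```python
def l50_is_two(total_length, largest_contig, n50):
    """Return True if L50 = 2 is consistent with the given summary values."""
    half = total_length / 2
    # The N50 contig cannot be longer than the largest contig.
    if n50 > largest_contig:
        return False
    # Largest contig alone falls short of half; the top two reach it.
    return largest_contig < half <= largest_contig + n50


# Values reported by QUAST for the hybrid assembly.
consistent = l50_is_two(
    total_length=3_130_588,
    largest_contig=1_184_553,
    n50=842_521,
)
```

Half of the assembly is 1,565,294 bp. The largest contig (1,184,553 bp) is below that, and the two largest contigs together (2,027,074 bp) are above it, so the reported N50 and L50 are mutually consistent.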

Overall, the hybrid assembly is of acceptable quality. However, it is more fragmented than the PacBio assembly (108 contigs compared to 9), so the PacBio assembly is used in further analyses.

Assembly Evaluation (Extra Analysis)

Extra evaluations of the PacBio assembly include comparisons to a reference genome. The reference genome in this project is NZ_CP014529.1, which was downloaded from NCBI using the following commands:

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/750/885/GCF_001750885.1_ASM175088v1/GCF_001750885.1_ASM175088v1_genomic.fna.gz
gunzip GCF_001750885.1_ASM175088v1_genomic.fna.gz
mv GCF_001750885.1_ASM175088v1_genomic.fna reference_E745.fasta

MUMmerplot evaluation

MUMmerplot provides an alignment dot-plot comparing two assemblies and was used for visually inspecting how well the de novo assembly aligns with the reference genome.

MUMmerplot was executed using the following script: 03_mummerplot_evaluation.sh, and the output can be seen in Figure 2.

Figure 2: MUMmerplot output showing the alignment between the PacBio de novo assembly and the reference genome.

In Figure 2, two clear red diagonals show that two large contigs from the PacBio assembly align well with the reference genome. Both diagonals run in the same direction, which suggests that these contigs are in the same orientation as the reference. In mummerplot's default color scheme, red marks forward matches and blue marks reverse-complement matches, so the blue dots are short alignments in the opposite orientation; they likely stem from short fragments, small contigs, or repetitive elements.

In the upper right corner, a cluster of red lines and blue dots is present, implying that many small contigs from the assembly align to a specific region of the reference. This cluster could be due to repetitive sequences in the reference genome, or it may reflect a fragmented or poorly resolved region in the assembly.
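The orientation reading of the dot plot can also be expressed in terms of alignment coordinates. In MUMmer's coordinate output, a reverse-complement alignment is reported with descending coordinates in one of the two sequences. The Python sketch below classifies alignments on that basis. The Alignment type, the function name, and the example coordinates are made up for illustration and are not taken from the project's output.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Start and end coordinates of one alignment in both sequences."""
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int


def orientation(aln):
    """Return 'forward' if both sequences run the same way, else 'reverse'."""
    ref_ascending = aln.ref_end >= aln.ref_start
    qry_ascending = aln.qry_end >= aln.qry_start
    return "forward" if ref_ascending == qry_ascending else "reverse"


# Made-up coordinates: one long forward match and one short reverse match.
examples = [
    Alignment(1, 500_000, 1, 500_000),
    Alignment(700_000, 700_900, 12_900, 12_000),
]
labels = [orientation(aln) for aln in examples]
```

Here the first alignment is classified as forward and the second as reverse, which corresponds to a rising and a falling segment in the dot plot.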

MUMmer evaluation

The MUMmer tool dnadiff was used to compare the PacBio assembly to the reference genome. dnadiff aligns the two sequences and generates a report summarizing their similarities and differences.

dnadiff was executed on a Snowy compute node on UPPMAX using the following command in the terminal:

dnadiff -p pacbio_vs_E745 /home/mane9823/Genome-Analysis/analyses/DNA_analyses/02_genome_assembly/01_pacbio/assembly_results/pacbio_assembly.contigs.fasta /home/mane9823/Genome-Analysis/data/reference_data/reference_E745.fasta

The report pacbio_vs_E745.report can be found here:

As seen in the dnadiff report, the Canu assembly covers the reference genome almost completely. Almost all assembled bases align with the reference, and the average identity is above 99.9%. The report lists a few structural differences (2 relocations and 2 translocations) as well as 484 indels and 68 SNPs. These differences could reflect genuine genetic variation between the sequenced sample and the reference strain, such as mutations or differences in plasmid content, or they could be assembly errors.
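Values such as those quoted above can be pulled out of the report programmatically. The Python sketch below assumes that the data rows of the .report file consist of a label followed by two whitespace-separated columns, one per input sequence in the order they were given to dnadiff. That layout is an assumption and should be checked against the actual file. The function names are made up for this example.

```python
def parse_dnadiff_report(text):
    """Collect three-column rows (label, first input, second input).

    Assumes data rows look like 'Label  value1  value2'; section headers
    such as '[SNPs]' and any other lines are skipped. If a label occurs
    more than once, the first occurrence is kept.
    """
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0].startswith("["):
            continue
        label, first, second = fields
        rows.setdefault(label, (first, second))
    return rows


def leading_number(value):
    """Turn '68' or '9(100.00%)' into a float, dropping any '(...)' suffix."""
    return float(value.split("(")[0])
```

With the report text loaded, for example via Path("pacbio_vs_E745.report").read_text(), a row could then be looked up by its label, such as rows.get("TotalSNPs"). The exact label names are likewise assumptions to verify against the report.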