De novo Genome Assembly - mel-jo/probable-memory GitHub Wiki

De novo Genome Assembly with DNA long reads

In this step, the genomes (and plasmids) of all three S. rimosus strains are assembled using the Canu-corrected DNA long reads.

Tools used in this step

  • Flye : Flye is an efficient de novo genome assembler, optimized to work with noisier DNA long reads.
  • Canu : Canu is a genome assembler designed and optimized for high noise long read data from Nanopore and PacBio instruments.
  • Bandage : Bandage is a GUI-based tool for visualizing assembly graphs.

Input for this step

.fasta.gz files of the corrected DNA long reads.

Strain File
R7 SRR24413072.correctedReads.fasta.gz
HP126 SRR24413066.correctedReads.fasta.gz
DV3 SRR24413081.correctedReads.fasta.gz

De novo assembly with Flye

Instead of building contigs directly, Flye first generates rough, overlapping sequences, called disjointigs, from the raw DNA long reads. These disjointigs are concatenated into a long synthetic draft genome, which is then aligned to itself to find repeat regions. From these self-alignments, Flye builds a repeat graph that represents all genomic repeats. The original long reads are then mapped back to the graph to untangle and resolve two kinds of repeats:

  • Bridged repeats - Repeats that are spanned by at least one long read
  • Unbridged repeats - Repeats not spanned by any read; resolved using small differences between the repeat copies (SNPs/indels)
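
The difference between the two repeat classes can be illustrated with a toy check. This is a minimal sketch and not Flye's implementation: the function name classify_repeat, the "start end" coordinate format and the example coordinates are all hypothetical. A repeat is reported as bridged when at least one read alignment starts before the repeat and ends after it.

```shell
#!/usr/bin/env bash
# Toy illustration only; Flye resolves repeats on the repeat graph itself.
# classify_repeat REPEAT_START REPEAT_END
# Reads "read_start read_end" pairs (hypothetical coordinates) from stdin.
classify_repeat() {
  awk -v rs="$1" -v re="$2" '
    # A read bridges the repeat if it extends past both repeat boundaries.
    ($1 + 0) < (rs + 0) && ($2 + 0) > (re + 0) { bridged = 1 }
    END { print (bridged ? "bridged" : "unbridged") }
  '
}

# Example: a repeat at 5000-8000 and three made-up read alignments.
printf '%s\n' "1000 6000" "4000 9500" "7000 12000" | classify_repeat 5000 8000
```

Only the second read in the example reaches past both ends of the repeat, which is enough to call it bridged.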

After graph simplification and traversal, Flye outputs highly contiguous and accurate contigs. One advantage Flye holds over other assemblers is that it skips the initial read-correction step, which makes it substantially faster while still producing highly accurate assemblies.

Figure reused with permission from Kolmogorov et al.

Reference: Kolmogorov, M., Yuan, J., Lin, Y. et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37, 540–546 (2019). https://doi.org/10.1038/s41587-019-0072-8

The Process

Using Flye, the DNA long reads are assembled de novo. Depending on the type of long reads, either the --nano-corr flag (corrected reads) or the --nano-raw flag (raw reads) is used.
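
The choice of flag can be written as a small helper. This is only a sketch: the function name flye_read_flag and the labels "corrected" and "raw" are assumptions made for illustration, while the two flags are the ones named above.

```shell
#!/usr/bin/env bash
# flye_read_flag READ_TYPE -> prints the Flye input flag for that read type.
# The helper name and the "corrected"/"raw" labels are hypothetical.
flye_read_flag() {
  case "$1" in
    corrected) echo "--nano-corr" ;;
    raw)       echo "--nano-raw" ;;
    *)         echo "unknown read type: $1" >&2; return 1 ;;
  esac
}

# The reads in this step were corrected with Canu beforehand.
flye_read_flag corrected
```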

Code
# load the necessary modules

module load bioinfo-tools
module load Flye

# genome size is a rough estimate taken from the S. rimosus genome assemblies available at NCBI

flye --nano-corr sample_correctedReads.fasta.gz \
      --out-dir sample_assembly/ \
      --genome-size 9.6m \
      --threads 2
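
To run the same command for all three strains, the inputs listed earlier can be looped over. The sketch below is a dry run that only prints the commands; the helper name flye_cmd and the output directory names (<strain>_assembly) are hypothetical and not taken from the original workflow.

```shell
#!/usr/bin/env bash
# Dry run: print one Flye command per strain instead of executing it.
# flye_cmd and the <strain>_assembly directory names are hypothetical.
flye_cmd() {
  local strain="$1" reads="$2"
  echo "flye --nano-corr ${reads} --out-dir ${strain}_assembly/ --genome-size 9.6m --threads 2"
}

# Strain names and read files as listed in the input table.
while read -r strain reads; do
  flye_cmd "$strain" "$reads"
done <<'EOF'
R7 SRR24413072.correctedReads.fasta.gz
HP126 SRR24413066.correctedReads.fasta.gz
DV3 SRR24413081.correctedReads.fasta.gz
EOF
```

Once the printed commands look right, they can be run directly or placed in a job script.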

Each assembly produces an assembly_graph.gfa file, which is a graph representation of the assembly. It can be viewed after converting it to an image format such as .png using Bandage.

Code
module load bioinfo-tools
module load Bandage
Bandage image assembly_graph.gfa assembly_graph.png

Strain Assembly from Canu-corrected DNA long reads
R7
Contigs:
  • contig_3 – 9,356,146 bp (cov: 40, mult: 1)
  • contig_2 – 292,392 bp (cov: 73, mult: 2)
HP126
Contigs:
  • contig_1 – 6,969,382 bp (cov: 39, mult: 1)
  • contig_4 – 1,032,043 bp (cov: 40, mult: 1)
  • contig_3 – 550,818 bp (cov: 88, mult: 2)
  • contig_2 – 170,280 bp (cov: 62, mult: 1)
DV3
Contigs:
  • contig_1 – 6,973,577 bp (cov: 35, mult: 1)
  • contig_4 – 1,058,229 bp (cov: 73, mult: 2)
  • contig_3 – 550,795 bp (cov: 109, mult: 3, repeat)
  • contig_2 – 170,282 bp (cov: 45, mult: 1)
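
As a quick sanity check, the contig lengths listed above can be summed per strain and compared with the 9.6 Mb value passed to --genome-size. The helper name sum_lengths is hypothetical, and the sum counts each contig once, ignoring the multiplicity values.

```shell
#!/usr/bin/env bash
# Sum the contig lengths reported above for each strain (integer arithmetic).
# sum_lengths is a hypothetical helper; each contig is counted once.
sum_lengths() {
  local total=0 len
  for len in "$@"; do
    total=$(( total + len ))
  done
  echo "$total"
}

echo "R7    $(sum_lengths 9356146 292392)"
echo "HP126 $(sum_lengths 6969382 1032043 550818 170280)"
echo "DV3   $(sum_lengths 6973577 1058229 550795 170282)"
```

R7 comes to 9,648,538 bp, close to the 9.6 Mb estimate, while HP126 and DV3 come to 8,722,523 bp and 8,752,883 bp.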

Output files from this step

After assembly, Flye generates many files. The main files of interest are:

File Description
assembly.fasta Final raw assembled contigs
assembly_info.txt Per-contig data such as length, coverage and circularity
assembly_graph.gfa Graph-based representation of the assembly
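
Since this step also aims to recover plasmids, it can help to pull circular contigs out of assembly_info.txt. The sketch below assumes the file is tab-separated with the columns seq_name, length, cov., circ., repeat, mult., alt_group and graph_path, and that circular contigs are marked with Y in the fourth column. The function name circular_contigs and the example rows are made up and are not the values from the assemblies on this page.

```shell
#!/usr/bin/env bash
# circular_contigs FILE -> prints name, length and coverage of circular contigs.
# Assumes Flye's tab-separated assembly_info.txt with circ. in column 4 ("Y"/"N").
circular_contigs() {
  awk -F'\t' '!/^#/ && $4 == "Y" { print $1, $2, $3 }' "$1"
}

# Made-up example file, NOT taken from the assemblies described here.
info=$(mktemp)
printf '#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n' > "$info"
printf 'contig_1\t5000000\t40\tN\tN\t1\t*\t1\n' >> "$info"
printf 'contig_2\t150000\t60\tY\tN\t1\t*\t2\n' >> "$info"
circular_contigs "$info"
rm -f "$info"
```

A circular contig with higher coverage than the main chromosome is a reasonable plasmid candidate, although this should be confirmed by other means.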

De novo assembly with Canu

Canu works in three stages: correction, trimming and assembly. The stages are independent of each other, and each can be run on its own at any time. For information about the correction step, refer here. During trimming, Canu checks which parts of each read are well supported by overlaps with other reads. Unsupported parts may be chimeric regions, adapters or low-quality tails, so the reads are trimmed down to their most confident regions.

For assembly, Canu builds an overlap graph in which each read is a node and each overlap is an edge. It then uses a "best overlap" approach: for each end of a read, it picks the longest and most reliable overlap to another read. Following these best overlaps builds paths through the graph that become contigs, while overlap error rates are checked to avoid false connections, especially from repeats or very similar regions.
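
The "best overlap" idea can be sketched on a toy overlap list. This is a simplified illustration and not Canu's actual algorithm: the input format (read, read end, partner, overlap length, error rate), the function name best_overlaps and the error threshold are all hypothetical. For each read end it keeps the longest overlap whose error rate is at or below the threshold.

```shell
#!/usr/bin/env bash
# best_overlaps MAX_ERROR
# stdin columns (hypothetical format): read  end  partner  overlap_length  error_rate
# Prints, per read end, the longest overlap with error_rate <= MAX_ERROR.
best_overlaps() {
  awk -v max_err="$1" '
    ($5 + 0) <= (max_err + 0) {
      key = $1 ":" $2
      if (!(key in best_len) || ($4 + 0) > best_len[key]) {
        best_len[key] = $4 + 0
        best_partner[key] = $3
      }
    }
    END {
      for (key in best_len) print key, best_partner[key], best_len[key]
    }
  ' | sort
}

# Made-up overlaps: the longest one for readA's 3' end is too error-prone.
printf '%s\n' \
  "readA 3 readB 4200 0.010" \
  "readA 3 readC 6100 0.090" \
  "readA 3 readD 3900 0.012" \
  "readB 5 readA 4200 0.010" | best_overlaps 0.045
```

In the example, the 6,100 bp overlap with readC is rejected because of its error rate, so readB becomes the best partner for the 3' end of readA.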

Finally, Canu outputs both contigs and an assembly graph (in GFA format), which can help visualize unresolved parts or complex regions.

Reference: Koren, S., Walenz, B.P., Berlin, K. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722-736 (2017). https://doi.org/10.1101/gr.215087.116

The Process

Since the DNA long reads are already corrected using Canu, that step is skipped and the reads are subjected to trimming and assembly.

Code
#loading the modules
module load bioinfo-tools
module load canu

#since canu cannot work on compressed files directly when the input is corrected long reads,
#the file is copied to the scratch space ($SNIC_TMP), uncompressed there and then passed to canu.
#the output is copied back to the project directory afterwards

#remember the starting directory so that the relative paths still resolve after cd
#(WORKDIR is a variable name chosen here for illustration)
WORKDIR=$PWD

cd "$SNIC_TMP"
cp "$WORKDIR/corrected_read/correctedReads.fasta.gz" ./corrected.fasta.gz
gunzip corrected.fasta.gz

canu \
 -p canu_assembly \
 -d canu_out/ \
 genomeSize=9.6m \
 -nanopore-corrected corrected.fasta \
 useGrid=false \
 maxThreads=4

cp -r canu_out/ "$WORKDIR/assemblies/canu/"

The -nanopore-corrected flag tells Canu that the input DNA long reads have already been corrected. After assembly, Canu produces a .report file which summarizes the assemblies it has produced.

Metric R7 HP126 DV3
Number of contigs 5 3 3
Total assembly size 9,735,284 bp 8,813,208 bp 8,773,496 bp
Largest contig 4,907,691 bp 8,573,036 bp 8,030,267 bp
N50 / NG50 4,907,691 bp 8,573,036 bp 8,030,267 bp
L50 / LG50 1 contig 1 contig 1 contig
N90 292,278 bp 194,840 bp 192,109 bp
Unassembled sequences 41 (660,453 bp) 17 (317,900 bp) 25 (440,341 bp)
Coverage (after trimming) ~37.6× ~35.25× ~37.35×
Corrected bases (post-trimming) 360,802,329 bp 338,480,564 bp 358,640,427 bp
Mean error rate (estimated) ~0.88% ~1.89% ~1.01%
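
The coverage row can be reproduced from the corrected-bases row. The sketch below assumes that coverage is the number of corrected bases divided by the 9.6 Mb genomeSize given to Canu; the function name coverage is hypothetical.

```shell
#!/usr/bin/env bash
# coverage BASES GENOME_SIZE_BP -> fold coverage with two decimals.
# Assumption: coverage = corrected bases / genomeSize (9.6m = 9,600,000 bp).
coverage() {
  awk -v bases="$1" -v size="$2" 'BEGIN { printf "%.2f\n", bases / size }'
}

genome_size=9600000
echo "R7    $(coverage 360802329 "$genome_size")"
echo "HP126 $(coverage 338480564 "$genome_size")"
echo "DV3   $(coverage 358640427 "$genome_size")"
```

The results (37.58, 35.26 and 37.36) match the coverage row up to rounding, which suggests that the reported coverage is relative to the specified genome size rather than to the assembled length.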

Output files from this step

After assembly, Canu generates many files, named after the prefix passed to -p (canu_assembly in the command above). The main files of interest are:

File Description
canu_assembly.contigs.fasta Final raw assembled contigs
canu_assembly.report Metrics of the assembly
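
The N50 value in the metrics table can be recomputed from a contigs FASTA file. The sketch below is a generic N50 calculation and not Canu's own code: the function name n50 is hypothetical, and the example uses a made-up FASTA file rather than the real assemblies.

```shell
#!/usr/bin/env bash
# n50 FASTA -> prints the N50 of the sequences in a FASTA file.
# N50: length of the contig at which the running sum of lengths, taken from
# longest to shortest, first reaches half of the total assembly size.
n50() {
  awk '
    /^>/ { if (len) print len; len = 0; next }   # new record: emit previous length
    { len += length($0) }                         # add up sequence characters
    END { if (len) print len }
  ' "$1" | sort -rn | awk '
    { lens[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        running += lens[i]
        if (running * 2 >= total) { print lens[i]; exit }
      }
    }
  '
}

# Made-up FASTA with contigs of 8, 6, 4 and 2 bp (total 20 bp).
fa=$(mktemp)
printf '>a\nACGTACGT\n>b\nACGTAC\n>c\nACGT\n>d\nAC\n' > "$fa"
n50 "$fa"
rm -f "$fa"
```

For the toy file the result is 6, because the two longest contigs together (14 bp) are the first to reach half of the 20 bp total. The same logic explains why the N50 for R7 equals its largest contig: 4,907,691 bp is already more than half of 9,735,284 bp.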