2C Nanopore Assembly - NU-CPGME/sl_workshop

July / August, 2024

Developed by:
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])

1. Prepare the nanopore demo data

To overcome the 100 Mb file size limit on GitHub we had split the full nanopore read file, so now we have to rejoin the two downloaded halves. This is as simple as just using cat to sequentially write the two parts into a single file.

cd ~/sl_workshop_2024/demo_data/nanopore_reads

cat PA_nanopore_pt1.fastq.gz PA_nanopore_pt2.fastq.gz > PA_nanopore.fastq.gz

2. Create the nanopore micromamba environment

micromamba deactivate

micromamba create -y \
  -n nanopore \
  -c conda-forge \
  -c bioconda \
  -c defaults \
  filtlong=0.2.1 raven-assembler=1.5.3 medaka=1.11.3 quast

3. Read filtering

Light filtering of the sequence read files can sometimes improve assemblies by removing very short and/or very low quality reads.

A tool that can be used to do this is Filtlong.

We'll filter based on two criteria:

A minimum read length of 1000 bp (--min_length 1000)
Removing the lowest quality 5% of the reads, i.e. keeping the 95% of the sequence with the highest base qualities (--keep_percent 95)

To see other filtering options, follow the link above.

Filtlong outputs the filtered reads to STDOUT, so we'll capture that output with a pipe | and gzip it:

micromamba activate nanopore

filtlong \
  --min_length 1000 \
  --keep_percent 95 \
  PA_nanopore.fastq.gz | \
  gzip > PA_nanopore_filt.fastq.gz

More filtering is not necessariy better. Assemblies can sometimes get worse the more you try to filter your input reads, so a light touch is preferable.

4. Assemblers

There are many choices for long read assemblers and there can be a fair amount of variation in their outputs. Here are few assembler options to be aware of:

4.1 Flye

Includes polishing step that aligns reads to assembly and corrects likely errors
Good balance between speed and accuracy

4.2 Raven

OLC assembler with accelerated overlap step
Includes polish step and circularizes contigs if possible

4.3 Minimap2 / Miniasm / Minipolish

Three separate programs that are run sequentially to generate a consensus assembly:
- Minimap2: Generates all-vs-all mapping of the nanopore reads to each other to find overlaps
- Miniasm: OLC-based assembler to combine reads based on overlaps. This does not include a consensus-generating step so the sequence quality is the same as the individual reads used to generate the sequences
- Minipolish: Aligns reads to miniasm-out sequences to produce a polished and higher-quality consensus sequence
The minipolish package includes scripts to run all the steps sequentially

4.4 Canu

Revision of the Celera assembler designed for Sanger sequence assemblies, but optimized for Nanopore or PacBio reads
Performs read correction, read trimming, and assembly
Many options, highly customizable
Very slow

4.5 NextDenovo / NextPolish

Performs read correction prior to assembly (like Canu), but faster and requiring fewer computer resources
Requires separate configuration files which decreases efficiency of multiple assemblies

4.6 NECAT

Read correction, assembly, and optional polishing
Also needs separate configuration files for each assembly

5. Assembly

For this exercise we'll pick one assembler, Raven, because in testing it was one of the fastest-performing and produced the most contiguous sequence for our demo sequence reads. Raven is not necessarily the best assembler in all situations.

For this example we will use the -t 1 option to use our single VM CPU and will also add --graphical-fragment-assembly raven.gfa to output a graph file we can view in Bandage.

raven \
  -t 1 \
  --graphical-fragment-assembly raven.gfa \
  PA_nanopore_filt.fastq.gz > raven.fasta

We can view the results in Quast. Note the difference relative to the Quast results from the Illumina assembly we performed yesterday.

quast raven.fasta

And we can view the contig graph in Bandage. If more than one contig was produced it may show ambiguity in repetitive regions.

Bandage load raven.gfa

OPTIONAL: Click here for Flye assembly instructions

To assemble the reads using Flye, you will first need to create a micromamba environment for Flye:

micromamba deactivate
micromamba create -y \
  -n flye \
  -c conda-forge \
  -c bioconda \
  -c defaults \
  flye=2.9.3 quast

To run the assembly, we'll call Flye using one thread and output the results to a directory "flye_assembly". Since our reads were generated on an Oxford Nanopore R10.4.1 flowcell using the super high accuracy basecalling model, we will use the --nano-hq for the reads. We'll also activate the scaffolding feature for the graph output. Note that scaffolding is off by default as it may add gaps to the assembly contigs. We'll use it for demonstration purposes here, but you will usually leave scaffolding off.

micromamba activate flye
flye \
  --nano-hq PA_nanopore_filt.fastq.gz \
  --threads 1 \
  --scaffold \
  --out-dir flye_assembly

When this completes you can view characteristics of the contig(s) produced such as their lengths and whether they were circularized or not by viewing the flye_assembly/assembly_info.txt file.

For example (your results may not look like this):

#seq_name	length	cov.	circ.	repeat	mult.	alt_group	graph_path
contig_1	6724283	25	Y	N	1	*	1

View assembly characteristics in Quast and and graph in Bandage:

quast flye_assembly/assembly.fasta

Bandage load flye_assembly/assembly_graph.gfa

OPTIONAL: Click here for Miniasm assembly instructions

To assemble the reads using the Minimap2 / Miniasm / Minipolish pipeline, you will first need to create a micromamba environment:

micromamba deactivate
micromamba create -y \
  -n miniasm \
  -c conda-forge \
  -c bioconda \
  -c defaults \
  minimap miniasm minipolish quast

First we'll use minimap2 to align all of our sequence reads to each other (all-vs-all) using built-in parameters optimized for Nanopore reads (-x ava-ont) and identify all the overlap coordinates. The output of minimap2 is a .paf formatted file. We'll gzip this to save storage space.
```
micromamba activate miniasm
minimap2 \
  -x ava-ont \
  -t 1 \
  PA_nanopore_filt.fastq.gz \
  PA_nanopore_filt.fastq.gz | \
  gzip > PA_nanopore.paf.gz
```
Next, we'll generate a non-consensus assembly of the reads with miniasm based on the overlap coordinates. This is output to a .gfa graph file.
```
miniasm \
  -f PA_nanopore_filt.fastq.gz \
  PA_nanopore.paf.gz > PA_nanopore.gfa
```
Now we'll use minipolish to correct any errors in and generate the consensus sequences. This is also output to a .gfa graph file.
```
minipolish \
  --threads 1 \
  PA_nanopore_filt.fastq.gz \
  PA_nanopore.gfa > PA_nanopore_minipolish.gfa
```
Finally, we'll convert the .gfa formatted contig file to a fasta-formatted sequence file using a simple awk one-liner:
```
awk '/^S/{print ">"$2"\n"$3}' PA_nanopore_polished.gfa > PA_nanopore_minipolish.fasta
```

Now you can view the assembly in Quast:

quast PA_nanopore_minipolish.fasta

And the graph with Bandage:

Bandage load PA_nanopore_minipolish.gfa

6. Polishing

Although the assemblers and pipelines listed above include a polishing step where sequence reads are used to validate and correct the assembly, you'll likely also want to perform a secondary polishing step using Oxford Nanopore's Medaka.

This program has multiple functions, however we'll be using it here for polishing the assembly and generating a new consensus sequence. Though they all use the same fastq read files as input, the difference between Medaka and the other polisher algorithms is that Medaka uses neural networks based specifically on the basecalling model used to generate the sequence reads. It tends to produce more accurate results than other alignment-based methods.

In this step, we will use the predefined medaka_consensus pipeline in Medaka that performs the following functions:

mini_align: aligns reads to the assembly with minimap2
medaka consensus: Generates the consensus
medaka stitch: Outputs the fasta sequence

The -i option points to our sequence reads and -d is the assembly produced by Raven. Normally we would not set the -b option for the batch size, but we have to reduce the value from the default 100 to keep from running out of memory in the VM.

As newer versions of the Dorado basecaller output the name of the basecalling model in the header of each read, Medaka will try to automatically detect the model from the input reads. If you are using reads generated from older versions of Guppy, for instance, you may have to manually provide Medaka with the model.

micromamba deactivate
micromamba activate nanopore
medaka_consensus \
  -i PA_nanopore_filt.fastq.gz \
  -d raven.fasta \
  -b 20

Results will be output as consensus.fasta a directory named medaka.

OPTIONAL: Annotate the new assembly

Now that you've generated a new high quality genome assembly, you can annotate it using Prokka.

Download the reference genome sequence PAO1 from NCBI:

wget -O PAO1.gbk.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/006/765/GCA_000006765.1_ASM676v1/GCA_000006765.1_ASM676v1_genomic.gbff.gz
gzip -d PAO1.gbk.gz

Use the annotation micromamba environment and generate annotations from your Medaka consensus sequence.

7. Combined Long- and Short Read Assembly

There are advantages and disadvantages to the short- and long read assembly approaches:

Short Read Assembly	Long Read Assembly
Low sequence error rate	Higher sequence error (variable)
Poor repeat resolution	Better repeat resolution
Fragmentary assembly	High contiguity / completeness

Assembly approaches that combine the strengths of both data types may improve assembly accuracy and completeness.

Combined assembly approaches:

Hybrid assembly

De-bruijn graph assembly with long and short reads or use long reads to order contigs generated with short reads and fill gaps (Example: SPAdes)

Sequential assembly

Assemble using long reads alone using one (or more) of the long-read assemblers
Correct any potential assembly errors with high accuracy short reads using one (or more) of the software packages below:

Optional: Join and circularize chromosomes and plasmids. This used to be a more useful step with assemblies using older, less accurate R9.4.1 Nanopore flow cells as short read correction was sometime so extensive as to allow circularization to be performed.

Circlator

Sequential Assembly Pipeline Program: Unicycler

This program uses SPAdes, miniasm, and Racon with short and long reads to generate assemblies Steps:

Generates assembly from short reads (SPAdes)
Assembly with combined long reads and contigs from step 1 (miniasm/Racon)
Bridge ambiguous connections using long reads
Circularization of contigs

Trycycler

Back to table of contents

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

2C Nanopore Assembly - NU-CPGME/sl_workshop_2024 GitHub Wiki

1. Prepare the nanopore demo data

2. Create the nanopore micromamba environment

3. Read filtering

4. Assemblers

4.1 Flye

4.2 Raven

4.3 Minimap2 / Miniasm / Minipolish

4.4 Canu

4.5 NextDenovo / NextPolish

4.6 NECAT

5. Assembly

6. Polishing

7. Combined Long- and Short Read Assembly

Combined assembly approaches:

Hybrid assembly

Sequential assembly

Back to table of contents

⚠️ GitHub.com Fallback ⚠️

2C Nanopore Assembly - NU-CPGME/sl_workshop_2024 GitHub Wiki

1. Prepare the nanopore demo data

2. Create the nanopore micromamba environment

3. Read filtering

4. Assemblers

4.1 Flye

4.2 Raven

4.3 Minimap2 / Miniasm / Minipolish

4.4 Canu

4.5 NextDenovo / NextPolish

4.6 NECAT

5. Assembly

6. Polishing

7. Combined Long- and Short Read Assembly

Combined assembly approaches:

Hybrid assembly

Sequential assembly

Back to table of contents

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️