2C Nanopore Assembly - NU-CPGME/sl_workshop_2024 GitHub Wiki
July / August, 2024
Developed by:
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])
To overcome the 100 Mb file size limit on GitHub we had split the full nanopore read file, so now we have to rejoin the two downloaded halves. This is as simple as just using cat
to sequentially write the two parts into a single file.
cd ~/sl_workshop_2024/demo_data/nanopore_reads
cat PA_nanopore_pt1.fastq.gz PA_nanopore_pt2.fastq.gz > PA_nanopore.fastq.gz
micromamba deactivate
micromamba create -y \
-n nanopore \
-c conda-forge \
-c bioconda \
-c defaults \
filtlong=0.2.1 raven-assembler=1.5.3 medaka=1.11.3 quast
Light filtering of the sequence read files can sometimes improve assemblies by removing very short and/or very low quality reads.
A tool that can be used to do this is Filtlong.
We'll filter based on two criteria:
- A minimum read length of 1000 bp (
--min_length 1000
) - Removing the lowest quality 5% of the reads, i.e. keeping the 95% of the sequence with the highest base qualities (
--keep_percent 95
)
To see other filtering options, follow the link above.
Filtlong outputs the filtered reads to STDOUT, so we'll capture that output with a pipe |
and gzip it:
micromamba activate nanopore
filtlong \
--min_length 1000 \
--keep_percent 95 \
PA_nanopore.fastq.gz | \
gzip > PA_nanopore_filt.fastq.gz
More filtering is not necessariy better. Assemblies can sometimes get worse the more you try to filter your input reads, so a light touch is preferable.
There are many choices for long read assemblers and there can be a fair amount of variation in their outputs. Here are few assembler options to be aware of:
4.1 Flye
- Includes polishing step that aligns reads to assembly and corrects likely errors
- Good balance between speed and accuracy
4.2 Raven
- OLC assembler with accelerated overlap step
- Includes polish step and circularizes contigs if possible
4.3 Minimap2 / Miniasm / Minipolish
- Three separate programs that are run sequentially to generate a consensus assembly:
- Minimap2: Generates all-vs-all mapping of the nanopore reads to each other to find overlaps
- Miniasm: OLC-based assembler to combine reads based on overlaps. This does not include a consensus-generating step so the sequence quality is the same as the individual reads used to generate the sequences
- Minipolish: Aligns reads to miniasm-out sequences to produce a polished and higher-quality consensus sequence
- The minipolish package includes scripts to run all the steps sequentially
4.4 Canu
- Revision of the Celera assembler designed for Sanger sequence assemblies, but optimized for Nanopore or PacBio reads
- Performs read correction, read trimming, and assembly
- Many options, highly customizable
- Very slow
4.5 NextDenovo / NextPolish
- Performs read correction prior to assembly (like Canu), but faster and requiring fewer computer resources
- Requires separate configuration files which decreases efficiency of multiple assemblies
4.6 NECAT
- Read correction, assembly, and optional polishing
- Also needs separate configuration files for each assembly
For this exercise we'll pick one assembler, Raven, because in testing it was one of the fastest-performing and produced the most contiguous sequence for our demo sequence reads. Raven is not necessarily the best assembler in all situations.
For this example we will use the -t 1
option to use our single VM CPU and will also add --graphical-fragment-assembly raven.gfa
to output a graph file we can view in Bandage.
raven \
-t 1 \
--graphical-fragment-assembly raven.gfa \
PA_nanopore_filt.fastq.gz > raven.fasta
We can view the results in Quast. Note the difference relative to the Quast results from the Illumina assembly we performed yesterday.
quast raven.fasta
And we can view the contig graph in Bandage. If more than one contig was produced it may show ambiguity in repetitive regions.
Bandage load raven.gfa
OPTIONAL: Click here for Flye assembly instructions
To assemble the reads using Flye, you will first need to create a micromamba environment for Flye:
micromamba deactivate
micromamba create -y \
-n flye \
-c conda-forge \
-c bioconda \
-c defaults \
flye=2.9.3 quast
To run the assembly, we'll call Flye using one thread and output the results to a directory "flye_assembly". Since our reads were generated on an Oxford Nanopore R10.4.1 flowcell using the super high accuracy basecalling model, we will use the --nano-hq
for the reads. We'll also activate the scaffolding feature for the graph output. Note that scaffolding is off by default as it may add gaps to the assembly contigs. We'll use it for demonstration purposes here, but you will usually leave scaffolding off.
micromamba activate flye
flye \
--nano-hq PA_nanopore_filt.fastq.gz \
--threads 1 \
--scaffold \
--out-dir flye_assembly
When this completes you can view characteristics of the contig(s) produced such as their lengths and whether they were circularized or not by viewing the flye_assembly/assembly_info.txt
file.
For example (your results may not look like this):
#seq_name length cov. circ. repeat mult. alt_group graph_path
contig_1 6724283 25 Y N 1 * 1
View assembly characteristics in Quast and and graph in Bandage:
quast flye_assembly/assembly.fasta
Bandage load flye_assembly/assembly_graph.gfa
OPTIONAL: Click here for Miniasm assembly instructions
To assemble the reads using the Minimap2 / Miniasm / Minipolish pipeline, you will first need to create a micromamba environment:micromamba deactivate
micromamba create -y \
-n miniasm \
-c conda-forge \
-c bioconda \
-c defaults \
minimap miniasm minipolish quast
-
First we'll use minimap2 to align all of our sequence reads to each other (all-vs-all) using built-in parameters optimized for Nanopore reads (
-x ava-ont
) and identify all the overlap coordinates. The output of minimap2 is a .paf formatted file. We'll gzip this to save storage space.micromamba activate miniasm minimap2 \ -x ava-ont \ -t 1 \ PA_nanopore_filt.fastq.gz \ PA_nanopore_filt.fastq.gz | \ gzip > PA_nanopore.paf.gz
-
Next, we'll generate a non-consensus assembly of the reads with miniasm based on the overlap coordinates. This is output to a .gfa graph file.
miniasm \ -f PA_nanopore_filt.fastq.gz \ PA_nanopore.paf.gz > PA_nanopore.gfa
-
Now we'll use minipolish to correct any errors in and generate the consensus sequences. This is also output to a .gfa graph file.
minipolish \ --threads 1 \ PA_nanopore_filt.fastq.gz \ PA_nanopore.gfa > PA_nanopore_minipolish.gfa
-
Finally, we'll convert the .gfa formatted contig file to a fasta-formatted sequence file using a simple awk one-liner:
awk '/^S/{print ">"$2"\n"$3}' PA_nanopore_polished.gfa > PA_nanopore_minipolish.fasta
Now you can view the assembly in Quast:
quast PA_nanopore_minipolish.fasta
And the graph with Bandage:
Bandage load PA_nanopore_minipolish.gfa
Although the assemblers and pipelines listed above include a polishing step where sequence reads are used to validate and correct the assembly, you'll likely also want to perform a secondary polishing step using Oxford Nanopore's Medaka.
This program has multiple functions, however we'll be using it here for polishing the assembly and generating a new consensus sequence. Though they all use the same fastq read files as input, the difference between Medaka and the other polisher algorithms is that Medaka uses neural networks based specifically on the basecalling model used to generate the sequence reads. It tends to produce more accurate results than other alignment-based methods.
In this step, we will use the predefined medaka_consensus
pipeline in Medaka that performs the following functions:
-
mini_align
: aligns reads to the assembly with minimap2 -
medaka consensus
: Generates the consensus -
medaka stitch
: Outputs the fasta sequence
The -i
option points to our sequence reads and -d
is the assembly produced by Raven. Normally we would not set the -b
option for the batch size, but we have to reduce the value from the default 100 to keep from running out of memory in the VM.
As newer versions of the Dorado basecaller output the name of the basecalling model in the header of each read, Medaka will try to automatically detect the model from the input reads. If you are using reads generated from older versions of Guppy, for instance, you may have to manually provide Medaka with the model.
micromamba deactivate
micromamba activate nanopore
medaka_consensus \
-i PA_nanopore_filt.fastq.gz \
-d raven.fasta \
-b 20
Results will be output as consensus.fasta
a directory named medaka
.
OPTIONAL: Annotate the new assembly
Now that you've generated a new high quality genome assembly, you can annotate it using Prokka.
Download the reference genome sequence PAO1 from NCBI:
wget -O PAO1.gbk.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/006/765/GCA_000006765.1_ASM676v1/GCA_000006765.1_ASM676v1_genomic.gbff.gz
gzip -d PAO1.gbk.gz
Use the annotation
micromamba environment and generate annotations from your Medaka consensus sequence.
There are advantages and disadvantages to the short- and long read assembly approaches:
Short Read Assembly | Long Read Assembly |
---|---|
Low sequence error rate | Higher sequence error (variable) |
Poor repeat resolution | Better repeat resolution |
Fragmentary assembly | High contiguity / completeness |
Assembly approaches that combine the strengths of both data types may improve assembly accuracy and completeness.
De-bruijn graph assembly with long and short reads or use long reads to order contigs generated with short reads and fill gaps (Example: SPAdes)
- Assemble using long reads alone using one (or more) of the long-read assemblers
- Correct any potential assembly errors with high accuracy short reads using one (or more) of the software packages below:
- Optional: Join and circularize chromosomes and plasmids. This used to be a more useful step with assemblies using older, less accurate R9.4.1 Nanopore flow cells as short read correction was sometime so extensive as to allow circularization to be performed.
Sequential Assembly Pipeline Program: Unicycler
This program uses SPAdes, miniasm, and Racon with short and long reads to generate assemblies Steps:
- Generates assembly from short reads (SPAdes)
- Assembly with combined long reads and contigs from step 1 (miniasm/Racon)
- Bridge ambiguous connections using long reads
- Circularization of contigs
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.