12 Long Read Assembly - NU-CPGME/quest_genomics_2025 GitHub Wiki
February 2025
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])
Long read sequencing technologies have provided the promise for better resolution and more complete assembly of microbial genomes. Long reads have the potential to bridge repetitive regions in the genome and resolve ambiguities in genome organization that short reads might miss.
There are several programs available for creating assemblies from long reads. These include, but are not limited to:
These each have their pros and cons which are outside of the scope of this workshop to discuss here. Suffice it to say Flye and Raven are most commonly used for microbial genome assemblies.
Here we'll use Raven for our initial assembly as it is fast, simple, and usually does a pretty good first pass job. We're going to assemble a Klebsiella pneumoniae genome from reads sequenced on an Oxford Nanopore R10.4.1 flowcell and basecalled using the super accuracy model.
When given sufficient resources Raven runs pretty fast, so we're going to submit the assembly as a Slurm batch job to run 12 parallel threads.
raven.sh
#!/bin/bash
#SBATCH --job-name="raven"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o raven.out
#SBATCH -e raven.err
eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/autocycler_env
raven \
-t 12 \
-F KP_nanopore.raven.gfa \
/projects/b1042/OzerLab/workshop/nanopore_assembly/workshop_reads/KP_nanopore.fastq.gz \
> KP_nanopore.raven.fasta
-t
: Number of threads
-F
: Name of the GFA file to output in addition to the assembly. This is an optional step that will produce a graphical fragment assembly (GFA) file for assembly visualization using Bandage.
Output
Raven adds some information about the assembly to the header lines of the output fasta file. To output just the header lines to the terminal, you can use the grep command to output lines that have the >
character:
grep ">" KP_nanopore.raven.fasta
Output of above command:
Note
Raven's algorithm is non-deterministic so your output may look a bit different than this.
>Utg1568 LN:i:46600 RC:i:9 XO:i:1
>Utg1574 LN:i:9286 RC:i:19 XO:i:1
>Utg1576 LN:i:74878 RC:i:10 XO:i:1
>Utg1578 LN:i:5589041 RC:i:738 XO:i:1
Header field | Interpretation |
---|---|
LN:i |
Length of the assembled sequence |
RC:i |
Number of reads that were used in the polishing step |
XO:i |
Was the sequence circularized? 0 = No, 1 = Yes |
We can download the KP_nanopore.raven.gfa
file and view it in Bandage. To do so, open Bandage, select 'Load graph' in the menu to load the file, then click the 'Draw graph' button.
In my case, this show four circularized sequences. You can select the 'Length' and 'Depth' buttons to show the lengths of each of the sequences as well as the coverage depths.

It is good practice to perform at least one round of polishing on an assembly regardless of which program was used to generate the assembly. This compares the assembly to the sequencing reads and corrects any assembly errors found. Oxford Nanopore provides a polishing program that is tuned to their read data called Medaka. This program compares reads to the assembly and generates a consensus sequence.
medaka.sh
#!/bin/bash
#SBATCH --job-name="medaka"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o medaka.out
#SBATCH -e medaka.err
eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/medaka_env
unset PYTHONPATH
medaka_consensus \
-i /projects/b1042/OzerLab/workshop/nanopore_assembly/workshop_reads/KP_nanopore.fastq.gz \
-d KP_nanopore.raven.fasta \
-o KP_nanopore_medaka \
-t 12 \
-f --bacteria 1>&2
Setting | Description |
---|---|
-i |
Read sequence file |
-d |
Assembly file |
-o |
Directory where the output will be generated |
-t |
Number of threads |
-f |
Force overwrite of outputs |
--bacteria |
Optimize models for bacterial seqeunces |
Output
The output will be found in KP_nanopore_medaka/consensus.fasta
. Medaka strips out and does not update the fasta header information that Raven produced, so we'll use a perl script to get information about the post-polishing consensus sequences.
perl /projects/p30002/Scripts/fasta_list.pl KP_nanopore_medaka/consensus.fasta
Output of above command:
Note
Again, your output may look a bit different than this.
id length pct_gc num_gap
Utg1578 5589050 57.2 0
Utg1576 74885 54.0 0
Utg1574 9294 55.2 0
Utg1568 46614 52.8 0
As you can see, the lengths of the sequences changed a bit after polishing.
If you're lucky enough to have assembled circularaized sequences, you're probably going to want to eventually annotate and deposit these sequences. To prepare for this, you'll probably want to rotate the sequences such that the linearized represntations start with a canonical gene. Usually for bacterial chromosomes this is the dnaA gene and for plasmids it's the repA gene.
To your great fortune, a program to reorient your sequences already exists: dnaapler.
dnaapler.sh
#!/bin/bash
#SBATCH --job-name="dnaapler"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:05:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o dnaapler.out
#SBATCH -e dnaapler.err
eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/medaka_env
dnaapler all \
-i KP_nanopore_medaka/consensus.fasta \
-o dnaapler \
-t 12 \
-a nearest \
-f
Setting | Description |
---|---|
all |
Reorients 1 or more contigs to begin with any of dnaA, terL, repA or COG1474 |
-i |
Input sequence |
-o |
Output directory name |
-t |
Number of threads |
-a |
Autocomplete method. nearest means if none of the target genes is found, the contig will be reoriented to start with the coding sequence closest to the start of the contig |
-f |
Force ovewriting of the output directory |
Outputs
In the dnaapler
folder you'll find several files, but the most important one is the dnaapler_reoriented.fasta
file, of course. The program adds information about rotation to the fasta header, so you can use grep again to see what the program did:
grep ">" dnaapler/dnaapler_reoriented.fasta
Output of above command:
Note
Again, your output may look a bit different what is above.
>Utg1578 rotated=True rotated_gene=dnaA
>Utg1576 rotated=True rotated_gene=repA
>Utg1574 rotated=True
>Utg1568 rotated=True rotated_gene=repA
More detail about what dnaapler did can be found in the dnaapler/dnaapler_all_reorientation_summary.tsv
file.
Because there can be so much variation betweeen the outputs of the differnt assembler programs, combining the outputs from more than one program can often yield more accurate assembly results. Autocycler is a newer pipeline that performs several steps to reconcile results from multiple assemblers. It performs the following steps:
- Separates the reads into multiple overlapping subsamples
- Generate assemblies from each of the subsampled read sets using multiple assemblers
- Combine the assembly results
- Cluster results from the assemblers to identify separate sequnces shared by the assemblies
- Trim excess sequence from the clusters
- Resolve and combine the sequences in each cluster to produce consensus sequences
This can be a very hands-on, step-by-step process requiring careful monitoring and curation at each step to make sure the results are accurate and make sense.
I've written a perl script to generate separate Slurm scripts for each of the Autocycler pipeline steps. This can be found at /projects/p30002/batch_scripts/autocycler_batch.pl
.
Caution
Each step of the process requires manual validation, so before you use autocycler you should be very familiar with all of the steps performed by the program, potential pitfallls, and how to resolve them. The Autocycler guide is the best place to learn more about these steps.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.