February 2025

Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])

Long read sequencing technologies have provided the promise for better resolution and more complete assembly of microbial genomes. Long reads have the potential to bridge repetitive regions in the genome and resolve ambiguities in genome organization that short reads might miss.

There are several programs available for creating assemblies from long reads. These include, but are not limited to:

These each have their pros and cons which are outside of the scope of this workshop to discuss here. Suffice it to say Flye and Raven are most commonly used for microbial genome assemblies.

Step 1 - Generate a de novo assembly

Here we'll use Raven for our initial assembly as it is fast, simple, and usually does a pretty good first pass job. We're going to assemble a Klebsiella pneumoniae genome from reads sequenced on an Oxford Nanopore R10.4.1 flowcell and basecalled using the super accuracy model.

When given sufficient resources Raven runs pretty fast, so we're going to submit the assembly as a Slurm batch job to run 12 parallel threads.

raven.sh

#!/bin/bash
#SBATCH --job-name="raven"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o raven.out
#SBATCH -e raven.err

eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/autocycler_env

raven \ 
    -t 12 \
    -F KP_nanopore.raven.gfa \
    /projects/b1042/OzerLab/workshop/nanopore_assembly/workshop_reads/KP_nanopore.fastq.gz \
    > KP_nanopore.raven.fasta

-t: Number of threads

-F: Name of the GFA file to output in addition to the assembly. This is an optional step that will produce a graphical fragment assembly (GFA) file for assembly visualization using Bandage.

Output

Raven adds some information about the assembly to the header lines of the output fasta file. To output just the header lines to the terminal, you can use the grep command to output lines that have the > character:

grep ">" KP_nanopore.raven.fasta

Output of above command:

Note

Raven's algorithm is non-deterministic so your output may look a bit different than this.

>Utg1568 LN:i:46600 RC:i:9 XO:i:1
>Utg1574 LN:i:9286 RC:i:19 XO:i:1
>Utg1576 LN:i:74878 RC:i:10 XO:i:1
>Utg1578 LN:i:5589041 RC:i:738 XO:i:1

Header field	Interpretation
`LN:i`	Length of the assembled sequence
`RC:i`	Number of reads that were used in the polishing step
`XO:i`	Was the sequence circularized? 0 = No, 1 = Yes

We can download the KP_nanopore.raven.gfa file and view it in Bandage. To do so, open Bandage, select 'Load graph' in the menu to load the file, then click the 'Draw graph' button.

In my case, this show four circularized sequences. You can select the 'Length' and 'Depth' buttons to show the lengths of each of the sequences as well as the coverage depths.

Step 2 - Assembly polishing

It is good practice to perform at least one round of polishing on an assembly regardless of which program was used to generate the assembly. This compares the assembly to the sequencing reads and corrects any assembly errors found. Oxford Nanopore provides a polishing program that is tuned to their read data called Medaka. This program compares reads to the assembly and generates a consensus sequence.

medaka.sh

#!/bin/bash
#SBATCH --job-name="medaka"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o medaka.out
#SBATCH -e medaka.err

eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/medaka_env
unset PYTHONPATH

medaka_consensus \
    -i /projects/b1042/OzerLab/workshop/nanopore_assembly/workshop_reads/KP_nanopore.fastq.gz \
    -d KP_nanopore.raven.fasta \
    -o KP_nanopore_medaka \
    -t 12 \
    -f --bacteria 1>&2

Setting	Description
`-i`	Read sequence file
`-d`	Assembly file
`-o`	Directory where the output will be generated
`-t`	Number of threads
`-f`	Force overwrite of outputs
`--bacteria`	Optimize models for bacterial seqeunces

Output

The output will be found in KP_nanopore_medaka/consensus.fasta. Medaka strips out and does not update the fasta header information that Raven produced, so we'll use a perl script to get information about the post-polishing consensus sequences.

perl /projects/p30002/Scripts/fasta_list.pl KP_nanopore_medaka/consensus.fasta

Output of above command:

Note

Again, your output may look a bit different than this.

id	length	pct_gc	num_gap
Utg1578	5589050	57.2	0
Utg1576	74885	54.0	0
Utg1574	9294	55.2	0
Utg1568	46614	52.8	0

As you can see, the lengths of the sequences changed a bit after polishing.

Step 3 - Sequence Rotation

If you're lucky enough to have assembled circularaized sequences, you're probably going to want to eventually annotate and deposit these sequences. To prepare for this, you'll probably want to rotate the sequences such that the linearized represntations start with a canonical gene. Usually for bacterial chromosomes this is the dnaA gene and for plasmids it's the repA gene.

To your great fortune, a program to reorient your sequences already exists: dnaapler.

dnaapler.sh

#!/bin/bash
#SBATCH --job-name="dnaapler"
#SBATCH -A b1042
#SBATCH -p genomics
#SBATCH -t 00:05:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH -o dnaapler.out
#SBATCH -e dnaapler.err

eval "$(conda shell.bash hook)"
conda activate /projects/p30002/condaenvs/medaka_env

dnaapler all \
  -i KP_nanopore_medaka/consensus.fasta \
  -o dnaapler \
  -t 12 \
  -a nearest \
  -f

Setting	Description
`all`	Reorients 1 or more contigs to begin with any of dnaA, terL, repA or COG1474
`-i`	Input sequence
`-o`	Output directory name
`-t`	Number of threads
`-a`	Autocomplete method. `nearest` means if none of the target genes is found, the contig will be reoriented to start with the coding sequence closest to the start of the contig
`-f`	Force ovewriting of the output directory

Outputs

In the dnaapler folder you'll find several files, but the most important one is the dnaapler_reoriented.fasta file, of course. The program adds information about rotation to the fasta header, so you can use grep again to see what the program did:

grep ">" dnaapler/dnaapler_reoriented.fasta

Output of above command:

Note

Again, your output may look a bit different what is above.

>Utg1578 rotated=True rotated_gene=dnaA
>Utg1576 rotated=True rotated_gene=repA
>Utg1574 rotated=True
>Utg1568 rotated=True rotated_gene=repA

More detail about what dnaapler did can be found in the dnaapler/dnaapler_all_reorientation_summary.tsv file.

Step 4 - Autocycler

Because there can be so much variation betweeen the outputs of the differnt assembler programs, combining the outputs from more than one program can often yield more accurate assembly results. Autocycler is a newer pipeline that performs several steps to reconcile results from multiple assemblers. It performs the following steps:

Separates the reads into multiple overlapping subsamples
Generate assemblies from each of the subsampled read sets using multiple assemblers
Combine the assembly results
Cluster results from the assemblers to identify separate sequnces shared by the assemblies
Trim excess sequence from the clusters
Resolve and combine the sequences in each cluster to produce consensus sequences

This can be a very hands-on, step-by-step process requiring careful monitoring and curation at each step to make sure the results are accurate and make sense.

I've written a perl script to generate separate Slurm scripts for each of the Autocycler pipeline steps. This can be found at /projects/p30002/batch_scripts/autocycler_batch.pl.

Caution

Each step of the process requires manual validation, so before you use autocycler you should be very familiar with all of the steps performed by the program, potential pitfallls, and how to resolve them. The Autocycler guide is the best place to learn more about these steps.

Back to table of contents

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

12 Long Read Assembly - NU-CPGME/quest_genomics_2025 GitHub Wiki

Step 1 - Generate a de novo assembly

Step 2 - Assembly polishing

Step 3 - Sequence Rotation

Step 4 - Autocycler

Back to table of contents

⚠️ GitHub.com Fallback ⚠️

12 Long Read Assembly - NU-CPGME/quest_genomics_2025 GitHub Wiki

Step 1 - Generate a de novo assembly

Step 2 - Assembly polishing

Step 3 - Sequence Rotation

Step 4 - Autocycler

Back to table of contents

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️