Genome Assembly Concepts - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

3.1 Genome Assembly Concepts

In this section we will cover:

  • Assembly Strategies – reference-guided vs. de novo approaches
  • Sequencing Technologies – short reads (Illumina) vs. long reads (PacBio, Oxford Nanopore)
  • k-mer & de Bruijn Graphs – constructing and traversing the graph for assembly
  • Overlap–Layout–Consensus (OLC) – classic assembly via read overlap detection
  • Assembly Metrics – N50/L50, total length, completeness (e.g. BUSCO scores)
  • Popular Tools – SPAdes, Velvet (short reads); Canu, Flye (long reads)
  • Hands-On Demo – assemble a small bacterial genome and evaluate output

3.1.1 Assembly Strategies

When reconstructing a genome from sequencing reads, there are two broad approaches:

A) Reference-Guided (a.k.a. “Mapping” or “Resequencing”)

You align your reads to a known reference genome and then call consensus.

Workflow

  1. Index the reference
bwa index ref.fa

  1. Map reads
bwa mem ref.fa reads_R1.fastq reads_R2.fastq \
  | samtools sort -o aln.bam
samtools index aln.bam

  1. Call variants & consensus
samtools mpileup -uf ref.fa aln.bam \
  | bcftools call -c - \
  | bcftools consensus -o consensus.fa

Pros

  • Very fast & low-memory
  • Highly accurate in conserved regions
  • Straightforward variant calling downstream

Cons

  • Cannot discover large structural variants easily
  • Biased by the reference (reference “gaps” remain unassembled)
  • Fails when there’s no close reference

Typical Use Cases

  • Human resequencing (whole-genome, exome)
  • Pathogen outbreak tracking against a known strain

B) De Novo (assembly “from scratch”)

You assemble reads into contigs without any reference.

Workflow (de Bruijn graph example with SPAdes)

  1. Run SPAdes
spades.py \
  -1 reads_R1.fastq -2 reads_R2.fastq \
  -o spades_out/
  1. Inspect contigs
less spades_out/contigs.fasta

  1. Evaluate assembly
quast spades_out/contigs.fasta -o quast_report/

Pros

  • Unbiased; can recover novel sequences
  • Captures large insertions / structural variants
  • No requirement for a “close” reference

Cons

  • Higher compute (CPU / RAM)
  • Assemblies can be fragmented (especially with short reads)
  • Requires careful k-mer choice & parameter tuning

Typical Use Cases

  • Novel microbial genome assembly
  • Metagenome assembly
  • Polyploid or highly divergent species
Feature Reference-Guided De Novo
Input requirement High-quality reference Only reads
Compute cost Low Moderate → High
Bias Reference bias Reference-free
Complexity detection Limited Good (SVs, novel sequences)
Downstream processing Variant calling, consensus Contig scaffolding, annotation

Section 3.1.1 complete: “Assembly Strategies – reference-guided vs de novo”


3.1.2 Sequencing Technologies – Short Reads vs. Long Reads

Illumina (Short Reads)

Pros

  • Very high per-base accuracy (>99.9%)
  • Extremely high throughput (hundreds of Gb per run)
  • Low cost per Gb
  • Mature ecosystem of bioinformatics tools

Cons

  • Read length limited to ~50–300 bp
  • Difficult to resolve long repeats or structural variants
  • Library preparation can introduce GC-bias

Typical Use Cases

  • SNP & small indel calling
  • RNA-Seq / transcript quantification
  • Targeted resequencing (exomes, panels)

PacBio & Oxford Nanopore (Long Reads)

Pros

  • Read lengths from 10 kb up to >100 kb
  • Excellent for spanning repeats & structural variants
  • Direct detection of base modifications (ONT)
  • Phasing of haplotypes in diploid/polyploid genomes

Cons

  • Higher raw error rate (∼90–99% depending on chemistry)
  • Lower overall throughput per run
  • Higher cost per Gb (though improving)
  • Bioinformatics pipelines still maturing

Typical Use Cases

  • De novo genome assembly
  • Structural variant discovery
  • Full-length isoform sequencing
  • Metagenome assembly & phasing

Technology Comparison

Feature Illumina (Short Reads) PacBio / ONT (Long Reads)
Read length 50 – 300 bp 10 kb – 100 kb+
Per-base accuracy > 99.9 % ∼ 90 – 99 %
Throughput Very high Moderate
Cost per Gb Low Higher
Bias GC-bias possible Systematic indel/errors
Ideal applications SNP/indel calling, RNA-Seq De novo assembly, SV detection

Section 3.1.2 complete: “Sequencing Technologies – short reads (Illumina) vs. long reads (PacBio, Oxford Nanopore)”


3.1.3 k-mer & de Bruijn Graphs – Constructing and Traversing the Graph for Assembly

What Is a k-mer?

  • A k-mer is any subsequence of length k extracted from your reads.
  • Example: from the read AATGC, and k = 3, we get the k-mers:
AAT, ATG, TGC

The de Bruijn Graph Model

  • Nodes represent all possible (k – 1)-mers.
  • Directed edges represent each k-mer, connecting its prefix (first k – 1 bases) to its suffix (last k – 1 bases).
  • Assembly becomes an Eulerian path problem: find a path that traverses every edge exactly once.

Example (k = 3)

  1. Reads
AATGC
  1. Extract k-mers
AAT, ATG, TGC
  1. Build nodes (k – 1 = 2)
AA, AT, TG, GC
  1. Add edges
  • AAT: AA → AT
  • ATG: AT → TG
  • TGC: TG → GC
  1. Traverse
    An Eulerian path:
AA → AT → TG → GC

Reconstruct the genome:

AATGC

Practical Considerations

  • Choosing k
  • Small k → graph highly connected, risk of repeats collapsing
  • Large k → graph fragmented, less overlap between reads
  • Error handling
  • Sequencing errors create “tips” (dead‐end branches) and “bubbles” (parallel paths)
  • Assemblers trim tips and pop bubbles to clean the graph
  • Graph simplification
  • Tip trimming: remove short dead‐end paths
  • Bubble popping: merge parallel paths that differ by few bases

Popular de Bruijn Graph Assemblers

  • SPAdes (short reads, multi-k)
  • Velvet (short reads)
  • ABySS (distributed, short reads)
  • MEGAHIT (metagenomes)

Each implements its own heuristics for error correction, k-mer counting, graph simplification, and contig extraction.


Section 3.1.3 complete: “k-mer & de Bruijn Graphs – constructing and traversing the graph for assembly”


3.1.4 Overlap–Layout–Consensus (OLC) – Classic Assembly via Read Overlap Detection

What Is OLC?

The Overlap–Layout–Consensus paradigm reconstructs a genome by:

  1. Overlap: finding all significant pairwise overlaps between reads
  2. Layout: arranging reads into a layout graph based on those overlaps
  3. Consensus: deriving the final sequence by resolving disagreements among overlapping reads

1. Overlap

  • Perform an all-vs-all comparison of reads (e.g., with suffix arrays, minimizers, or seed-and-extend)
  • Identify high-confidence overlaps above a length/identity threshold
  • Record overlaps as edges between read nodes

2. Layout

  • Build a directed overlap graph where nodes = reads and edges = overlaps
  • Simplify the graph by removing transitive overlaps and spurious edges
  • Extract a layout path (or set of contig paths) that traverses reads in order

3. Consensus

  • For each contig path, align overlapping reads to a work sequence
  • At each position, call the consensus base (e.g. majority vote, quality-weighted)
  • Produce polished contigs representing the assembled genome

Pros & Cons

Feature OLC
Best for Long reads (PacBio, ONT), moderate dataset sizes
Computational cost High (all-vs-all overlaps ∼ O(N²) reads)
Repeat resolution Natural via overlap context
Error handling Can incorporate quality scores & polishing
Memory footprint High for short-read datasets

Pros

  • Handles variable read lengths and high error rates (long reads)
  • Captures complex repeat structures via read overlaps
  • Consensus phase can incorporate multiple rounds of polishing

Cons

  • All-vs-all overlap detection is computationally expensive
  • Not suitable for extremely high-coverage short-read projects without downsampling
  • Requires careful tuning of overlap length and identity thresholds

Popular OLC Assemblers

  • Celera Assembler – one of the first whole-genome OLC pipelines
  • Canu (formerly PBcR) – optimized for noisy long reads, includes adaptive k-mer filtering
  • FALCON / FALCON-Unzip – diploid assembly with haplotype phasing
  • Miniasm + Racon – ultrafast long-read layout and polishing

Section 3.1.4 complete: “Overlap–Layout–Consensus (OLC) – classic assembly via read overlap detection”

3.1.5 Assembly Metrics – N50/L50, Total Length, Completeness (BUSCO)

Key Definitions

  • Total Assembly Length
    Sum of the lengths of all contigs or scaffolds. Reflects how much of the genome you’ve recovered.

  • N50
    The length N such that 50% of the total assembly length is contained in contigs of length ≥ N.

    1. Sort contigs by length (long → short).
    2. Sum their lengths until you reach half of the total assembly length.
    3. The length of the contig at that point is the N50.
  • L50
    The minimum number of contigs whose combined length makes up at least 50% of the total assembly.
    Equivalently, the rank of the contig at N50 in the sorted list.

  • Completeness (BUSCO)
    BUSCO (Benchmarking Universal Single-Copy Orthologs) estimates the proportion of expected single-copy orthologous genes that are:

    • C: Complete (single-copy [S] or duplicated [D])
    • F: Fragmented
    • M: Missing
      from your assembly, based on a lineage-specific database.

Tools & Commands

  1. QUAST (Quality Assessment)
    quast.py contigs.fasta -o quast_report/
    

Generates a report with contig counts, N50/L50, total length, GC%, and more.

2. BUSCO (Completeness)
```bash
busco \
  -i contigs.fasta \
  -o busco_out \
  -l bacteria_odb10 \
  -m genome

Reports percentage of complete, fragmented, and missing orthologs.

Example QUAST Summary

Metric Value
Number of contigs 45
Total length (bp) 4,736,284
Largest contig (bp) 892,341
N50 (bp) 327,814
L50 5
GC content (%) 50.3

Example BUSCO Results (bacteria_odb10)

# BUSCO version: 5.0.0
# Running in mode: genome
# Lineage dataset: bacteria_odb10 (n=124)
# 
# C:98.4%[S:97.6%,D:0.8%],F:0.8%,M:0.8%,n:124

  • C (Complete): 98.4% of the 124 expected genes are fully present.
  • S (Single-copy): 97.6%.
  • D (Duplicated): 0.8%.
  • F (Fragmented): 0.8%.
  • M (Missing): 0.8%.

Section 3.1.5 complete: “Assembly Metrics – N50/L50, total length, completeness (BUSCO scores)”

3.1.6 Popular Tools – SPAdes, Velvet (short reads); Canu, Flye (long reads)

Below are four widely used assemblers—two optimized for short Illumina reads, two for long‐read data.


SPAdes (Short Reads)

Description:
SPAdes implements a multi‐k de Bruijn‐graph approach with built-in error correction and careful bubble/tip handling, yielding high‐quality bacterial and small‐eukaryote assemblies.

Installation (Ubuntu/WSL):

sudo apt update
sudo apt install -y spades

Usage Example

spades.py \
  -1 reads_R1.fastq \
  -2 reads_R2.fastq \
  --careful \
  -o spades_out/

Pros & Cons (SPAdes)

  • ✔️ Excellent single-cell & metagenome performance
  • ✔️ Automated error correction
  • ❌ Moderate RAM use (tens of GB)
  • ❌ Slower on very large genomes

Velvet (Short Reads)

Description:
Velvet constructs a single-k de Bruijn graph, then applies tour-de-force graph cleanup (tour bus, error removal) to produce contigs efficiently.

Installation (Ubuntu/WSL):
⬇️

# Example install via package manager:
sudo apt update
sudo apt install velvet

Usage Example:

velveth velvet_k31 31 -fastq -shortPaired reads_R1.fastq reads_R2.fastq
velvetg velvet_k31 -exp_cov auto -cov_cutoff auto

Pros & Cons (Velvet)

  • ✔️ Very fast and low-memory
  • ✔️ Transparent parameter settings (k, coverage)
  • ❌ Single-k can miss some repeat resolution
  • ❌ Manual tuning of -exp_cov / -cov_cutoff often needed

Canu (Long Reads)

Description:
Canu is an OLC assembler tailored for noisy long reads (PacBio, ONT). It adaptively filters k-mers, corrects reads, and then builds contigs via overlap detection.

Installation (Ubuntu/WSL):
⬇️

# Example installation via package manager
sudo apt update
sudo apt install canu

Usage Example:

canu \
  -p ecoli -d canu_out \
  genomeSize=4.7m \
  -pacbio-raw long_reads.fastq

Pros & Cons (Canu)

  • ✔️ Excellent consensus accuracy after polishing
  • ✔️ Handles very long reads > 100 kb
  • ❌ High CPU & disk I/O requirements
  • ❌ Long runtimes on large genomes

Flye (Long Reads)

Description:
Flye uses a repeat graph approach to assemble long reads with minimal pre-processing. It’s fast, memory-efficient, and includes polishing.

Installation (pip in bioenv):
⬇️

# Activate your bioenv environment, then:
pip install flye

Usage Example:

flye \
  --nano-raw long_reads.fastq \
  --genome-size 4.7m \
  --out-dir flye_out \
  --threads 8

Pros & Cons (Flye)

  • ✔️ Rapid assembly (minutes to hours)
  • ✔️ Low memory footprint
  • ❌ Performance degrades on ultra-high coverage
  • ❌ Polishing steps still recommended

Hands-On Demo – assemble a small bacterial genome and evaluate output

This tutorial walks you through a fully reproducible de novo assembly of a small bacterial genome (e.g., E. coli K-12 MG1655, ~4.6 Mb) using synthetic reads, SPAdes, and quality assessment with BUSCO (and optionally QUAST).


1. Prepare your environment

In your WSL Ubuntu shell:

# Update package list & install Python venv support
sudo apt update
sudo apt install -y python3-venv

Re-create and enter your project venv:

cd /mnt/c/Users/IAGhe/OneDrive/Documents/Bioinformatics
python3 -m venv iss_env
source iss_env/bin/activate

With (iss_env) on your prompt, upgrade pip and install InSilicoSeq:

pip install --upgrade pip
pip install insilicoseq

2. Clean up any previous runs

Remove leftover data so you start fresh:

mkdir -p raw_reads/simulated_small
mkdir -p assembly/spades_output

3. Create work directories

mkdir -p raw_reads/simulated_small
mkdir -p assembly/spades_output

4. Obtain or verify the reference genome

4. Obtain (or verify) the reference genome

If you don’t yet have ref_genome.fa in your project root, download and unzip the NCBI RefSeq for E. coli K-12 MG1655:

# (re)download the compressed RefSeq FASTA
wget -O ref_genome.fa.gz \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/\
GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

# decompress to plain FASTA
gunzip ref_genome.fa.gz

Once you have ref_genome.fa, verify that it contains the expected header and sequence:

head -n5 ref_genome.fa
# >NC_000913.3 Escherichia coli K-12 substr. MG1655, complete genome
# AGCTTTCATTCTG...      ← looks like DNA sequence
# GATCTGGCAG...  

If you already had ref_genome.fa, just run the head command above to confirm it’s uncompressed and correct.

5. Simulate 500 K paired-end Illumina reads

Use InSilicoSeq’s built-in hiseq error model to generate ~50× coverage:

time iss generate \
  --model hiseq \
  --genomes ref_genome.fa \
  --n_reads 500000 \
  --cpus 8 \
  --output raw_reads/simulated_small

After ~1 min you should see status messages culminating in “Read generation complete.”

Verify the two FASTQs and abundance file:

ls -lh raw_reads/simulated_small_*\.fastq
# raw_reads/simulated_small_R1.fastq   68M
# raw_reads/simulated_small_R2.fastq   68M

ls raw_reads/simulated_small_abundance.txt

6. Assemble with SPAdes

Make sure SPAdes is installed in WSL (sudo apt install spades), then:

spades.py \
  --threads 8 \
  --memory 12 \
  -1 raw_reads/simulated_small_R1.fastq \
  -2 raw_reads/simulated_small_R2.fastq \
  -o assembly/spades_output

When it finishes, your assembled contigs live in:

ls assembly/spades_output/contigs.fasta

7. Inspect your assembly

1. List output files

ls assembly/spades_output

2. Count contigs & view the first one

grep -c "^>" assembly/spades_output/contigs.fasta
head -n5 assembly/spades_output/contigs.fasta

8. Basic assembly statistics

If you have seqkit:

seqkit stats assembly/spades_output/contigs.fasta

This gives you number of contigs, total length, N50, etc.

9. Evaluate completeness with BUSCO

Install BUSCO if needed (sudo apt install busco), then run:

busco \
  -i assembly/spades_output/contigs.fasta \
  -l bacteria_odb10 \
  -o busco_out \
  -m genome

You’ll see a summary like:

C:100.0% [S:100.0%,D:0.0%], F:0.0%, M:0.0%, n:124
124 Complete BUSCOs (C)

10. (Optional) Assess with QUAST

If QUAST is available:

quast.py \
  assembly/spades_output/contigs.fasta \
  -R ref_genome.fa \
  -o quast_report

Inspect quast_report/report.tsv for N50, genome fraction, misassemblies, etc.


You have now gone from reference genome → synthetic reads → de novo assembly → quality assessment in one reproducible pipeline!