Genome Assembly Concepts - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

3.1 Genome Assembly Concepts

In this section we will cover:

Assembly Strategies – reference-guided vs. de novo approaches
Sequencing Technologies – short reads (Illumina) vs. long reads (PacBio, Oxford Nanopore)
k-mer & de Bruijn Graphs – constructing and traversing the graph for assembly
Overlap–Layout–Consensus (OLC) – classic assembly via read overlap detection
Assembly Metrics – N50/L50, total length, completeness (e.g. BUSCO scores)
Popular Tools – SPAdes, Velvet (short reads); Canu, Flye (long reads)
Hands-On Demo – assemble a small bacterial genome and evaluate output

3.1.1 Assembly Strategies

When reconstructing a genome from sequencing reads, there are two broad approaches:

A) Reference-Guided (a.k.a. “Mapping” or “Resequencing”)

You align your reads to a known reference genome and then call consensus.

Workflow

Index the reference

bwa index ref.fa

Map reads

bwa mem ref.fa reads_R1.fastq reads_R2.fastq \
  | samtools sort -o aln.bam
samtools index aln.bam

Call variants & consensus

samtools mpileup -uf ref.fa aln.bam \
  | bcftools call -c - \
  | bcftools consensus -o consensus.fa

Pros

Very fast & low-memory
Highly accurate in conserved regions
Straightforward variant calling downstream

Cons

Cannot discover large structural variants easily
Biased by the reference (reference “gaps” remain unassembled)
Fails when there’s no close reference

Typical Use Cases

Human resequencing (whole-genome, exome)
Pathogen outbreak tracking against a known strain

B) De Novo (assembly “from scratch”)

You assemble reads into contigs without any reference.

Workflow (de Bruijn graph example with SPAdes)

Run SPAdes

spades.py \
  -1 reads_R1.fastq -2 reads_R2.fastq \
  -o spades_out/

Inspect contigs

less spades_out/contigs.fasta

Evaluate assembly

quast spades_out/contigs.fasta -o quast_report/

Pros

Unbiased; can recover novel sequences
Captures large insertions / structural variants
No requirement for a “close” reference

Cons

Higher compute (CPU / RAM)
Assemblies can be fragmented (especially with short reads)
Requires careful k-mer choice & parameter tuning

Typical Use Cases

Novel microbial genome assembly
Metagenome assembly
Polyploid or highly divergent species

Feature	Reference-Guided	De Novo
Input requirement	High-quality reference	Only reads
Compute cost	Low	Moderate → High
Bias	Reference bias	Reference-free
Complexity detection	Limited	Good (SVs, novel sequences)
Downstream processing	Variant calling, consensus	Contig scaffolding, annotation

✅ Section 3.1.1 complete: “Assembly Strategies – reference-guided vs de novo”

3.1.2 Sequencing Technologies – Short Reads vs. Long Reads

Illumina (Short Reads)

Pros

Very high per-base accuracy (>99.9%)
Extremely high throughput (hundreds of Gb per run)
Low cost per Gb
Mature ecosystem of bioinformatics tools

Cons

Read length limited to ~50–300 bp
Difficult to resolve long repeats or structural variants
Library preparation can introduce GC-bias

Typical Use Cases

SNP & small indel calling
RNA-Seq / transcript quantification
Targeted resequencing (exomes, panels)

PacBio & Oxford Nanopore (Long Reads)

Pros

Read lengths from 10 kb up to >100 kb
Excellent for spanning repeats & structural variants
Direct detection of base modifications (ONT)
Phasing of haplotypes in diploid/polyploid genomes

Cons

Higher raw error rate (∼90–99% depending on chemistry)
Lower overall throughput per run
Higher cost per Gb (though improving)
Bioinformatics pipelines still maturing

Typical Use Cases

De novo genome assembly
Structural variant discovery
Full-length isoform sequencing
Metagenome assembly & phasing

Technology Comparison

Feature	Illumina (Short Reads)	PacBio / ONT (Long Reads)
Read length	50 – 300 bp	10 kb – 100 kb+
Per-base accuracy	> 99.9 %	∼ 90 – 99 %
Throughput	Very high	Moderate
Cost per Gb	Low	Higher
Bias	GC-bias possible	Systematic indel/errors
Ideal applications	SNP/indel calling, RNA-Seq	De novo assembly, SV detection

✅ Section 3.1.2 complete: “Sequencing Technologies – short reads (Illumina) vs. long reads (PacBio, Oxford Nanopore)”

3.1.3 k-mer & de Bruijn Graphs – Constructing and Traversing the Graph for Assembly

What Is a k-mer?

A k-mer is any subsequence of length k extracted from your reads.
Example: from the read AATGC, and k = 3, we get the k-mers:

AAT, ATG, TGC

The de Bruijn Graph Model

Nodes represent all possible (k – 1)-mers.
Directed edges represent each k-mer, connecting its prefix (first k – 1 bases) to its suffix (last k – 1 bases).
Assembly becomes an Eulerian path problem: find a path that traverses every edge exactly once.

Example (k = 3)

Reads

AATGC

Extract k-mers

AAT, ATG, TGC

Build nodes (k – 1 = 2)

AA, AT, TG, GC

Add edges

AAT: AA → AT
ATG: AT → TG
TGC: TG → GC

Traverse
An Eulerian path:

AA → AT → TG → GC

Reconstruct the genome:

AATGC

Practical Considerations

Choosing k
Small k → graph highly connected, risk of repeats collapsing
Large k → graph fragmented, less overlap between reads
Error handling
Sequencing errors create “tips” (dead‐end branches) and “bubbles” (parallel paths)
Assemblers trim tips and pop bubbles to clean the graph
Graph simplification
Tip trimming: remove short dead‐end paths
Bubble popping: merge parallel paths that differ by few bases

Popular de Bruijn Graph Assemblers

SPAdes (short reads, multi-k)
Velvet (short reads)
ABySS (distributed, short reads)
MEGAHIT (metagenomes)

Each implements its own heuristics for error correction, k-mer counting, graph simplification, and contig extraction.

✅ Section 3.1.3 complete: “k-mer & de Bruijn Graphs – constructing and traversing the graph for assembly”

3.1.4 Overlap–Layout–Consensus (OLC) – Classic Assembly via Read Overlap Detection

What Is OLC?

The Overlap–Layout–Consensus paradigm reconstructs a genome by:

Overlap: finding all significant pairwise overlaps between reads
Layout: arranging reads into a layout graph based on those overlaps
Consensus: deriving the final sequence by resolving disagreements among overlapping reads

1. Overlap

Perform an all-vs-all comparison of reads (e.g., with suffix arrays, minimizers, or seed-and-extend)
Identify high-confidence overlaps above a length/identity threshold
Record overlaps as edges between read nodes

2. Layout

Build a directed overlap graph where nodes = reads and edges = overlaps
Simplify the graph by removing transitive overlaps and spurious edges
Extract a layout path (or set of contig paths) that traverses reads in order

3. Consensus

For each contig path, align overlapping reads to a work sequence
At each position, call the consensus base (e.g. majority vote, quality-weighted)
Produce polished contigs representing the assembled genome

Pros & Cons

Feature	OLC
Best for	Long reads (PacBio, ONT), moderate dataset sizes
Computational cost	High (all-vs-all overlaps ∼ O(N²) reads)
Repeat resolution	Natural via overlap context
Error handling	Can incorporate quality scores & polishing
Memory footprint	High for short-read datasets

Pros

Handles variable read lengths and high error rates (long reads)
Captures complex repeat structures via read overlaps
Consensus phase can incorporate multiple rounds of polishing

Cons

All-vs-all overlap detection is computationally expensive
Not suitable for extremely high-coverage short-read projects without downsampling
Requires careful tuning of overlap length and identity thresholds

Popular OLC Assemblers

Celera Assembler – one of the first whole-genome OLC pipelines
Canu (formerly PBcR) – optimized for noisy long reads, includes adaptive k-mer filtering
FALCON / FALCON-Unzip – diploid assembly with haplotype phasing
Miniasm + Racon – ultrafast long-read layout and polishing

✅ Section 3.1.4 complete: “Overlap–Layout–Consensus (OLC) – classic assembly via read overlap detection”

3.1.5 Assembly Metrics – N50/L50, Total Length, Completeness (BUSCO)

Key Definitions

Total Assembly Length
Sum of the lengths of all contigs or scaffolds. Reflects how much of the genome you’ve recovered.
N50
The length N such that 50% of the total assembly length is contained in contigs of length ≥ N.
1. Sort contigs by length (long → short).
2. Sum their lengths until you reach half of the total assembly length.
3. The length of the contig at that point is the N50.
L50
The minimum number of contigs whose combined length makes up at least 50% of the total assembly.
Equivalently, the rank of the contig at N50 in the sorted list.
Completeness (BUSCO)
BUSCO (Benchmarking Universal Single-Copy Orthologs) estimates the proportion of expected single-copy orthologous genes that are:
- C: Complete (single-copy [S] or duplicated [D])
- F: Fragmented
- M: Missing
  from your assembly, based on a lineage-specific database.

Tools & Commands

QUAST (Quality Assessment)

quast.py contigs.fasta -o quast_report/


Generates a report with contig counts, N50/L50, total length, GC%, and more.

2. BUSCO (Completeness)
```bash
busco \
  -i contigs.fasta \
  -o busco_out \
  -l bacteria_odb10 \
  -m genome

Reports percentage of complete, fragmented, and missing orthologs.

Example QUAST Summary

Metric	Value
Number of contigs	45
Total length (bp)	4,736,284
Largest contig (bp)	892,341
N50 (bp)	327,814
L50	5
GC content (%)	50.3

Example BUSCO Results (bacteria_odb10)

# BUSCO version: 5.0.0
# Running in mode: genome
# Lineage dataset: bacteria_odb10 (n=124)
# 
# C:98.4%[S:97.6%,D:0.8%],F:0.8%,M:0.8%,n:124

C (Complete): 98.4% of the 124 expected genes are fully present.
S (Single-copy): 97.6%.
D (Duplicated): 0.8%.
F (Fragmented): 0.8%.
M (Missing): 0.8%.

✅ Section 3.1.5 complete: “Assembly Metrics – N50/L50, total length, completeness (BUSCO scores)”

3.1.6 Popular Tools – SPAdes, Velvet (short reads); Canu, Flye (long reads)

Below are four widely used assemblers—two optimized for short Illumina reads, two for long‐read data.

SPAdes (Short Reads)

Description:
SPAdes implements a multi‐k de Bruijn‐graph approach with built-in error correction and careful bubble/tip handling, yielding high‐quality bacterial and small‐eukaryote assemblies.

Installation (Ubuntu/WSL):

sudo apt update
sudo apt install -y spades

Usage Example

spades.py \
  -1 reads_R1.fastq \
  -2 reads_R2.fastq \
  --careful \
  -o spades_out/

Pros & Cons (SPAdes)

✔️ Excellent single-cell & metagenome performance
✔️ Automated error correction
❌ Moderate RAM use (tens of GB)
❌ Slower on very large genomes

Velvet (Short Reads)

Description:
Velvet constructs a single-k de Bruijn graph, then applies tour-de-force graph cleanup (tour bus, error removal) to produce contigs efficiently.

Installation (Ubuntu/WSL):
⬇️

# Example install via package manager:
sudo apt update
sudo apt install velvet

Usage Example:

velveth velvet_k31 31 -fastq -shortPaired reads_R1.fastq reads_R2.fastq
velvetg velvet_k31 -exp_cov auto -cov_cutoff auto

Pros & Cons (Velvet)

✔️ Very fast and low-memory
✔️ Transparent parameter settings (k, coverage)
❌ Single-k can miss some repeat resolution
❌ Manual tuning of -exp_cov / -cov_cutoff often needed

Canu (Long Reads)

Description:
Canu is an OLC assembler tailored for noisy long reads (PacBio, ONT). It adaptively filters k-mers, corrects reads, and then builds contigs via overlap detection.

Installation (Ubuntu/WSL):
⬇️

# Example installation via package manager
sudo apt update
sudo apt install canu

Usage Example:

canu \
  -p ecoli -d canu_out \
  genomeSize=4.7m \
  -pacbio-raw long_reads.fastq

Pros & Cons (Canu)

✔️ Excellent consensus accuracy after polishing
✔️ Handles very long reads > 100 kb
❌ High CPU & disk I/O requirements
❌ Long runtimes on large genomes

Flye (Long Reads)

Description:
Flye uses a repeat graph approach to assemble long reads with minimal pre-processing. It’s fast, memory-efficient, and includes polishing.

Installation (pip in bioenv):
⬇️

# Activate your bioenv environment, then:
pip install flye

Usage Example:

flye \
  --nano-raw long_reads.fastq \
  --genome-size 4.7m \
  --out-dir flye_out \
  --threads 8

Pros & Cons (Flye)

✔️ Rapid assembly (minutes to hours)
✔️ Low memory footprint
❌ Performance degrades on ultra-high coverage
❌ Polishing steps still recommended

Hands-On Demo – assemble a small bacterial genome and evaluate output

This tutorial walks you through a fully reproducible de novo assembly of a small bacterial genome (e.g., E. coli K-12 MG1655, ~4.6 Mb) using synthetic reads, SPAdes, and quality assessment with BUSCO (and optionally QUAST).

1. Prepare your environment

In your WSL Ubuntu shell:

# Update package list & install Python venv support
sudo apt update
sudo apt install -y python3-venv

Re-create and enter your project venv:

cd /mnt/c/Users/IAGhe/OneDrive/Documents/Bioinformatics
python3 -m venv iss_env
source iss_env/bin/activate

With (iss_env) on your prompt, upgrade pip and install InSilicoSeq:

pip install --upgrade pip
pip install insilicoseq

2. Clean up any previous runs

Remove leftover data so you start fresh:

mkdir -p raw_reads/simulated_small
mkdir -p assembly/spades_output

3. Create work directories

mkdir -p raw_reads/simulated_small
mkdir -p assembly/spades_output

4. Obtain or verify the reference genome

4. Obtain (or verify) the reference genome

If you don’t yet have ref_genome.fa in your project root, download and unzip the NCBI RefSeq for E. coli K-12 MG1655:

# (re)download the compressed RefSeq FASTA
wget -O ref_genome.fa.gz \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/\
GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

# decompress to plain FASTA
gunzip ref_genome.fa.gz

Once you have ref_genome.fa, verify that it contains the expected header and sequence:

head -n5 ref_genome.fa
# >NC_000913.3 Escherichia coli K-12 substr. MG1655, complete genome
# AGCTTTCATTCTG...      ← looks like DNA sequence
# GATCTGGCAG...

If you already had ref_genome.fa, just run the head command above to confirm it’s uncompressed and correct.

5. Simulate 500 K paired-end Illumina reads

Use InSilicoSeq’s built-in hiseq error model to generate ~50× coverage:

time iss generate \
  --model hiseq \
  --genomes ref_genome.fa \
  --n_reads 500000 \
  --cpus 8 \
  --output raw_reads/simulated_small

After ~1 min you should see status messages culminating in “Read generation complete.”

Verify the two FASTQs and abundance file:

ls -lh raw_reads/simulated_small_*\.fastq
# raw_reads/simulated_small_R1.fastq   68M
# raw_reads/simulated_small_R2.fastq   68M

ls raw_reads/simulated_small_abundance.txt

6. Assemble with SPAdes

Make sure SPAdes is installed in WSL (sudo apt install spades), then:

spades.py \
  --threads 8 \
  --memory 12 \
  -1 raw_reads/simulated_small_R1.fastq \
  -2 raw_reads/simulated_small_R2.fastq \
  -o assembly/spades_output

When it finishes, your assembled contigs live in:

ls assembly/spades_output/contigs.fasta

7. Inspect your assembly

1. List output files

ls assembly/spades_output

2. Count contigs & view the first one

grep -c "^>" assembly/spades_output/contigs.fasta
head -n5 assembly/spades_output/contigs.fasta

8. Basic assembly statistics

If you have seqkit:

seqkit stats assembly/spades_output/contigs.fasta

This gives you number of contigs, total length, N50, etc.

9. Evaluate completeness with BUSCO

Install BUSCO if needed (sudo apt install busco), then run:

busco \
  -i assembly/spades_output/contigs.fasta \
  -l bacteria_odb10 \
  -o busco_out \
  -m genome

You’ll see a summary like:

C:100.0% [S:100.0%,D:0.0%], F:0.0%, M:0.0%, n:124
124 Complete BUSCOs (C)

10. (Optional) Assess with QUAST

If QUAST is available:

quast.py \
  assembly/spades_output/contigs.fasta \
  -R ref_genome.fa \
  -o quast_report

Inspect quast_report/report.tsv for N50, genome fraction, misassemblies, etc.

You have now gone from reference genome → synthetic reads → de novo assembly → quality assessment in one reproducible pipeline!