Genome Assembly Concepts - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
3.1 Genome Assembly Concepts
In this section we will cover:
- Assembly Strategies – reference-guided vs. de novo approaches
- Sequencing Technologies – short reads (Illumina) vs. long reads (PacBio, Oxford Nanopore)
- k-mer & de Bruijn Graphs – constructing and traversing the graph for assembly
- Overlap–Layout–Consensus (OLC) – classic assembly via read overlap detection
- Assembly Metrics – N50/L50, total length, completeness (e.g. BUSCO scores)
- Popular Tools – SPAdes, Velvet (short reads); Canu, Flye (long reads)
- Hands-On Demo – assemble a small bacterial genome and evaluate output
3.1.1 Assembly Strategies
When reconstructing a genome from sequencing reads, there are two broad approaches:
A) Reference-Guided (a.k.a. “Mapping” or “Resequencing”)
You align your reads to a known reference genome and then call consensus.
Workflow
- Index the reference
bwa index ref.fa
- Map reads
bwa mem ref.fa reads_R1.fastq reads_R2.fastq \
| samtools sort -o aln.bam
samtools index aln.bam
- Call variants & consensus
samtools mpileup -uf ref.fa aln.bam \
| bcftools call -c - \
| bcftools consensus -o consensus.fa
Pros
- Very fast & low-memory
- Highly accurate in conserved regions
- Straightforward variant calling downstream
Cons
- Cannot discover large structural variants easily
- Biased by the reference (reference “gaps” remain unassembled)
- Fails when there’s no close reference
Typical Use Cases
- Human resequencing (whole-genome, exome)
- Pathogen outbreak tracking against a known strain
B) De Novo (assembly “from scratch”)
You assemble reads into contigs without any reference.
Workflow (de Bruijn graph example with SPAdes)
- Run SPAdes
spades.py \
-1 reads_R1.fastq -2 reads_R2.fastq \
-o spades_out/
- Inspect contigs
less spades_out/contigs.fasta
- Evaluate assembly
quast spades_out/contigs.fasta -o quast_report/
Pros
- Unbiased; can recover novel sequences
- Captures large insertions / structural variants
- No requirement for a “close” reference
Cons
- Higher compute (CPU / RAM)
- Assemblies can be fragmented (especially with short reads)
- Requires careful k-mer choice & parameter tuning
Typical Use Cases
- Novel microbial genome assembly
- Metagenome assembly
- Polyploid or highly divergent species
Feature | Reference-Guided | De Novo |
---|---|---|
Input requirement | High-quality reference | Only reads |
Compute cost | Low | Moderate → High |
Bias | Reference bias | Reference-free |
Complexity detection | Limited | Good (SVs, novel sequences) |
Downstream processing | Variant calling, consensus | Contig scaffolding, annotation |
✅ Section 3.1.1 complete: “Assembly Strategies – reference-guided vs de novo”
3.1.2 Sequencing Technologies – Short Reads vs. Long Reads
Illumina (Short Reads)
Pros
- Very high per-base accuracy (>99.9%)
- Extremely high throughput (hundreds of Gb per run)
- Low cost per Gb
- Mature ecosystem of bioinformatics tools
Cons
- Read length limited to ~50–300 bp
- Difficult to resolve long repeats or structural variants
- Library preparation can introduce GC-bias
Typical Use Cases
- SNP & small indel calling
- RNA-Seq / transcript quantification
- Targeted resequencing (exomes, panels)
PacBio & Oxford Nanopore (Long Reads)
Pros
- Read lengths from 10 kb up to >100 kb
- Excellent for spanning repeats & structural variants
- Direct detection of base modifications (ONT)
- Phasing of haplotypes in diploid/polyploid genomes
Cons
- Higher raw error rate (∼90–99% depending on chemistry)
- Lower overall throughput per run
- Higher cost per Gb (though improving)
- Bioinformatics pipelines still maturing
Typical Use Cases
- De novo genome assembly
- Structural variant discovery
- Full-length isoform sequencing
- Metagenome assembly & phasing
Technology Comparison
Feature | Illumina (Short Reads) | PacBio / ONT (Long Reads) |
---|---|---|
Read length | 50 – 300 bp | 10 kb – 100 kb+ |
Per-base accuracy | > 99.9 % | ∼ 90 – 99 % |
Throughput | Very high | Moderate |
Cost per Gb | Low | Higher |
Bias | GC-bias possible | Systematic indel/errors |
Ideal applications | SNP/indel calling, RNA-Seq | De novo assembly, SV detection |
✅ Section 3.1.2 complete: “Sequencing Technologies – short reads (Illumina) vs. long reads (PacBio, Oxford Nanopore)”
3.1.3 k-mer & de Bruijn Graphs – Constructing and Traversing the Graph for Assembly
What Is a k-mer?
- A k-mer is any subsequence of length k extracted from your reads.
- Example: from the read
AATGC
, and k = 3, we get the k-mers:
AAT, ATG, TGC
The de Bruijn Graph Model
- Nodes represent all possible (k – 1)-mers.
- Directed edges represent each k-mer, connecting its prefix (first k – 1 bases) to its suffix (last k – 1 bases).
- Assembly becomes an Eulerian path problem: find a path that traverses every edge exactly once.
Example (k = 3)
- Reads
AATGC
- Extract k-mers
AAT, ATG, TGC
- Build nodes (k – 1 = 2)
AA, AT, TG, GC
- Add edges
- AAT: AA → AT
- ATG: AT → TG
- TGC: TG → GC
- Traverse
An Eulerian path:
AA → AT → TG → GC
Reconstruct the genome:
AATGC
Practical Considerations
- Choosing k
- Small k → graph highly connected, risk of repeats collapsing
- Large k → graph fragmented, less overlap between reads
- Error handling
- Sequencing errors create “tips” (dead‐end branches) and “bubbles” (parallel paths)
- Assemblers trim tips and pop bubbles to clean the graph
- Graph simplification
- Tip trimming: remove short dead‐end paths
- Bubble popping: merge parallel paths that differ by few bases
Popular de Bruijn Graph Assemblers
- SPAdes (short reads, multi-k)
- Velvet (short reads)
- ABySS (distributed, short reads)
- MEGAHIT (metagenomes)
Each implements its own heuristics for error correction, k-mer counting, graph simplification, and contig extraction.
✅ Section 3.1.3 complete: “k-mer & de Bruijn Graphs – constructing and traversing the graph for assembly”
3.1.4 Overlap–Layout–Consensus (OLC) – Classic Assembly via Read Overlap Detection
What Is OLC?
The Overlap–Layout–Consensus paradigm reconstructs a genome by:
- Overlap: finding all significant pairwise overlaps between reads
- Layout: arranging reads into a layout graph based on those overlaps
- Consensus: deriving the final sequence by resolving disagreements among overlapping reads
1. Overlap
- Perform an all-vs-all comparison of reads (e.g., with suffix arrays, minimizers, or seed-and-extend)
- Identify high-confidence overlaps above a length/identity threshold
- Record overlaps as edges between read nodes
2. Layout
- Build a directed overlap graph where nodes = reads and edges = overlaps
- Simplify the graph by removing transitive overlaps and spurious edges
- Extract a layout path (or set of contig paths) that traverses reads in order
3. Consensus
- For each contig path, align overlapping reads to a work sequence
- At each position, call the consensus base (e.g. majority vote, quality-weighted)
- Produce polished contigs representing the assembled genome
Pros & Cons
Feature | OLC |
---|---|
Best for | Long reads (PacBio, ONT), moderate dataset sizes |
Computational cost | High (all-vs-all overlaps ∼ O(N²) reads) |
Repeat resolution | Natural via overlap context |
Error handling | Can incorporate quality scores & polishing |
Memory footprint | High for short-read datasets |
Pros
- Handles variable read lengths and high error rates (long reads)
- Captures complex repeat structures via read overlaps
- Consensus phase can incorporate multiple rounds of polishing
Cons
- All-vs-all overlap detection is computationally expensive
- Not suitable for extremely high-coverage short-read projects without downsampling
- Requires careful tuning of overlap length and identity thresholds
Popular OLC Assemblers
- Celera Assembler – one of the first whole-genome OLC pipelines
- Canu (formerly PBcR) – optimized for noisy long reads, includes adaptive k-mer filtering
- FALCON / FALCON-Unzip – diploid assembly with haplotype phasing
- Miniasm + Racon – ultrafast long-read layout and polishing
✅ Section 3.1.4 complete: “Overlap–Layout–Consensus (OLC) – classic assembly via read overlap detection”
3.1.5 Assembly Metrics – N50/L50, Total Length, Completeness (BUSCO)
Key Definitions
-
Total Assembly Length
Sum of the lengths of all contigs or scaffolds. Reflects how much of the genome you’ve recovered. -
N50
The length N such that 50% of the total assembly length is contained in contigs of length ≥ N.- Sort contigs by length (long → short).
- Sum their lengths until you reach half of the total assembly length.
- The length of the contig at that point is the N50.
-
L50
The minimum number of contigs whose combined length makes up at least 50% of the total assembly.
Equivalently, the rank of the contig at N50 in the sorted list. -
Completeness (BUSCO)
BUSCO (Benchmarking Universal Single-Copy Orthologs) estimates the proportion of expected single-copy orthologous genes that are:- C: Complete (single-copy [S] or duplicated [D])
- F: Fragmented
- M: Missing
from your assembly, based on a lineage-specific database.
Tools & Commands
- QUAST (Quality Assessment)
quast.py contigs.fasta -o quast_report/
Generates a report with contig counts, N50/L50, total length, GC%, and more.
2. BUSCO (Completeness)
```bash
busco \
-i contigs.fasta \
-o busco_out \
-l bacteria_odb10 \
-m genome
Reports percentage of complete, fragmented, and missing orthologs.
Example QUAST Summary
Metric | Value |
---|---|
Number of contigs | 45 |
Total length (bp) | 4,736,284 |
Largest contig (bp) | 892,341 |
N50 (bp) | 327,814 |
L50 | 5 |
GC content (%) | 50.3 |
Example BUSCO Results (bacteria_odb10)
# BUSCO version: 5.0.0
# Running in mode: genome
# Lineage dataset: bacteria_odb10 (n=124)
#
# C:98.4%[S:97.6%,D:0.8%],F:0.8%,M:0.8%,n:124
- C (Complete): 98.4% of the 124 expected genes are fully present.
- S (Single-copy): 97.6%.
- D (Duplicated): 0.8%.
- F (Fragmented): 0.8%.
- M (Missing): 0.8%.
✅ Section 3.1.5 complete: “Assembly Metrics – N50/L50, total length, completeness (BUSCO scores)”
3.1.6 Popular Tools – SPAdes, Velvet (short reads); Canu, Flye (long reads)
Below are four widely used assemblers—two optimized for short Illumina reads, two for long‐read data.
SPAdes (Short Reads)
Description:
SPAdes implements a multi‐k de Bruijn‐graph approach with built-in error correction and careful bubble/tip handling, yielding high‐quality bacterial and small‐eukaryote assemblies.
Installation (Ubuntu/WSL):
sudo apt update
sudo apt install -y spades
Usage Example
spades.py \
-1 reads_R1.fastq \
-2 reads_R2.fastq \
--careful \
-o spades_out/
Pros & Cons (SPAdes)
- ✔️ Excellent single-cell & metagenome performance
- ✔️ Automated error correction
- ❌ Moderate RAM use (tens of GB)
- ❌ Slower on very large genomes
Velvet (Short Reads)
Description:
Velvet constructs a single-k de Bruijn graph, then applies tour-de-force graph cleanup (tour bus, error removal) to produce contigs efficiently.
Installation (Ubuntu/WSL):
⬇️
# Example install via package manager:
sudo apt update
sudo apt install velvet
Usage Example:
velveth velvet_k31 31 -fastq -shortPaired reads_R1.fastq reads_R2.fastq
velvetg velvet_k31 -exp_cov auto -cov_cutoff auto
Pros & Cons (Velvet)
- ✔️ Very fast and low-memory
- ✔️ Transparent parameter settings (
k
, coverage) - ❌ Single-k can miss some repeat resolution
- ❌ Manual tuning of
-exp_cov
/-cov_cutoff
often needed
Canu (Long Reads)
Description:
Canu is an OLC assembler tailored for noisy long reads (PacBio, ONT). It adaptively filters k-mers, corrects reads, and then builds contigs via overlap detection.
Installation (Ubuntu/WSL):
⬇️
# Example installation via package manager
sudo apt update
sudo apt install canu
Usage Example:
canu \
-p ecoli -d canu_out \
genomeSize=4.7m \
-pacbio-raw long_reads.fastq
Pros & Cons (Canu)
- ✔️ Excellent consensus accuracy after polishing
- ✔️ Handles very long reads > 100 kb
- ❌ High CPU & disk I/O requirements
- ❌ Long runtimes on large genomes
Flye (Long Reads)
Description:
Flye uses a repeat graph approach to assemble long reads with minimal pre-processing. It’s fast, memory-efficient, and includes polishing.
Installation (pip in bioenv):
⬇️
# Activate your bioenv environment, then:
pip install flye
Usage Example:
flye \
--nano-raw long_reads.fastq \
--genome-size 4.7m \
--out-dir flye_out \
--threads 8
Pros & Cons (Flye)
- ✔️ Rapid assembly (minutes to hours)
- ✔️ Low memory footprint
- ❌ Performance degrades on ultra-high coverage
- ❌ Polishing steps still recommended
Hands-On Demo – assemble a small bacterial genome and evaluate output
This tutorial walks you through a fully reproducible de novo assembly of a small bacterial genome (e.g., E. coli K-12 MG1655, ~4.6 Mb) using synthetic reads, SPAdes, and quality assessment with BUSCO (and optionally QUAST).
1. Prepare your environment
In your WSL Ubuntu shell:
# Update package list & install Python venv support
sudo apt update
sudo apt install -y python3-venv
Re-create and enter your project venv:
cd /mnt/c/Users/IAGhe/OneDrive/Documents/Bioinformatics
python3 -m venv iss_env
source iss_env/bin/activate
With (iss_env) on your prompt, upgrade pip and install InSilicoSeq:
pip install --upgrade pip
pip install insilicoseq
2. Clean up any previous runs
Remove leftover data so you start fresh:
mkdir -p raw_reads/simulated_small
mkdir -p assembly/spades_output
3. Create work directories
mkdir -p raw_reads/simulated_small
mkdir -p assembly/spades_output
4. Obtain or verify the reference genome
4. Obtain (or verify) the reference genome
If you don’t yet have ref_genome.fa
in your project root, download and unzip the NCBI RefSeq for E. coli K-12 MG1655:
# (re)download the compressed RefSeq FASTA
wget -O ref_genome.fa.gz \
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/\
GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
# decompress to plain FASTA
gunzip ref_genome.fa.gz
Once you have ref_genome.fa, verify that it contains the expected header and sequence:
head -n5 ref_genome.fa
# >NC_000913.3 Escherichia coli K-12 substr. MG1655, complete genome
# AGCTTTCATTCTG... ← looks like DNA sequence
# GATCTGGCAG...
If you already had ref_genome.fa, just run the head command above to confirm it’s uncompressed and correct.
5. Simulate 500 K paired-end Illumina reads
Use InSilicoSeq’s built-in hiseq error model to generate ~50× coverage:
time iss generate \
--model hiseq \
--genomes ref_genome.fa \
--n_reads 500000 \
--cpus 8 \
--output raw_reads/simulated_small
After ~1 min you should see status messages culminating in “Read generation complete.”
Verify the two FASTQs and abundance file:
ls -lh raw_reads/simulated_small_*\.fastq
# raw_reads/simulated_small_R1.fastq 68M
# raw_reads/simulated_small_R2.fastq 68M
ls raw_reads/simulated_small_abundance.txt
6. Assemble with SPAdes
Make sure SPAdes is installed in WSL (sudo apt install spades), then:
spades.py \
--threads 8 \
--memory 12 \
-1 raw_reads/simulated_small_R1.fastq \
-2 raw_reads/simulated_small_R2.fastq \
-o assembly/spades_output
When it finishes, your assembled contigs live in:
ls assembly/spades_output/contigs.fasta
7. Inspect your assembly
1. List output files
ls assembly/spades_output
2. Count contigs & view the first one
grep -c "^>" assembly/spades_output/contigs.fasta
head -n5 assembly/spades_output/contigs.fasta
8. Basic assembly statistics
If you have seqkit:
seqkit stats assembly/spades_output/contigs.fasta
This gives you number of contigs, total length, N50, etc.
9. Evaluate completeness with BUSCO
Install BUSCO if needed (sudo apt install busco), then run:
busco \
-i assembly/spades_output/contigs.fasta \
-l bacteria_odb10 \
-o busco_out \
-m genome
You’ll see a summary like:
C:100.0% [S:100.0%,D:0.0%], F:0.0%, M:0.0%, n:124
124 Complete BUSCOs (C)
10. (Optional) Assess with QUAST
If QUAST is available:
quast.py \
assembly/spades_output/contigs.fasta \
-R ref_genome.fa \
-o quast_report
Inspect quast_report/report.tsv for N50, genome fraction, misassemblies, etc.
You have now gone from reference genome → synthetic reads → de novo assembly → quality assessment in one reproducible pipeline!