Genome Assemble concept - igheyas/Bioinformatics GitHub Wiki

What Is Genome Assembly?

Genome assembly is the process of reconstructing an organism’s complete DNA sequence (its genome) from millions of short fragments (“reads”) produced by a sequencer. Think of it as solving a jigsaw puzzle without the picture on the box: you have many small pieces (reads), and you want to join them back together into the original, full-length chromosomes.

Why Do We Do It?

Discover Novel Organisms or Strains

When studying a microbe or virus for which no reference genome exists, assembly lets you see its unique genes and variants.

Annotate Genes & Pathways

A complete genome lets you predict where protein-coding genes, regulatory elements, and other features lie.

Compare Genomes

By assembling multiple strains or species, you can pinpoint structural differences (insertions, deletions, rearrangements) that underlie antibiotic resistance, virulence, or evolutionary adaptation.

Build Reference Databases

High-quality assemblies feed public repositories (RefSeq, GenBank) so that future studies have reliable backbones for mapping, variant calling, or metagenomic profiling.

How Does It Work?

1. Generate Reads

Short-read platforms (Illumina): millions of 100–250 bp reads, very accurate but too short to span long repeats.
Long-read platforms (PacBio, Oxford Nanopore): reads of tens of kilobases or more; historically lower per-base accuracy than short reads (PacBio HiFi reads are a notable exception), but long enough to bridge repeats and structural variants.

2. Pre-processing

Quality control & trimming: remove adapters or low-quality bases.
Error correction: especially important for noisy long reads, to reduce downstream misassemblies.
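The trimming step can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding, a simple "cut low-quality bases from the 3' end" rule, and exact adapter matching; the function names and the threshold of 20 are illustrative choices, not the behavior of any particular trimming tool.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, quality, min_q=20):
    """Drop bases from the 3' end until the last base meets the threshold."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def strip_adapter(seq, quality, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, quality
    return seq[:pos], quality[:pos]
```

For example, `trim_3prime("ACGTACGTAC", "IIIIIIII##")` returns `("ACGTACGT", "IIIIIIII")`, because `#` encodes a Phred score of 2. Dedicated trimmers also tolerate mismatches in the adapter and use sliding-window quality rules, which this sketch leaves out.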

3. Assemble the Pieces

De Bruijn Graph (DBG) Assemblers (e.g. SPAdes, Velvet)

Break reads into overlapping k-mers.
Build a graph where each node is a k-mer and an edge joins two k-mers that are adjacent in a read, i.e. that overlap by k−1 bases.
Traverse the graph to reconstruct longer contiguous sequences (“contigs”).
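The three steps above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: reads are error-free, only the forward strand is considered (no reverse complements), and contigs are reported as maximal non-branching paths. The function names are made up for this example; real assemblers such as SPAdes or Velvet add error removal, coverage handling, and much more.

```python
def build_graph(reads, k):
    """Nodes are k-mers; an edge links k-mers that are adjacent in a read."""
    succ, pred = {}, {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            succ.setdefault(kmer, set())
            pred.setdefault(kmer, set())
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred


def contigs_from_reads(reads, k):
    """Walk every maximal non-branching path and spell out its sequence."""
    succ, pred = build_graph(reads, k)

    def is_inner(node):
        # Exactly one way in and one way out: safe to extend through.
        return len(pred[node]) == 1 and len(succ[node]) == 1

    contigs = []
    for node in sorted(succ):
        if is_inner(node):
            continue  # paths start only at tips or branch points
        for nxt in sorted(succ[node]):
            path, cur = node, nxt
            while is_inner(cur):
                path += cur[-1]
                cur = next(iter(succ[cur]))
            path += cur[-1]
            contigs.append(path)
    return contigs
```

With `reads = ["ACGTC", "GTCAG"]` and `k = 3`, `contigs_from_reads` returns `["ACGTCAG"]`. Isolated cycles and single, unconnected k-mers are not reported by this sketch.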

Overlap–Layout–Consensus (OLC) Assemblers (e.g. Canu, Flye)

Directly find overlaps between reads.
Lay out the reads into a draft assembly graph based on those overlaps.
Compute a consensus sequence from the laid-out reads, then polish it.
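As a rough illustration of the overlap and layout steps, the sketch below finds exact suffix-prefix overlaps and greedily merges the best-overlapping pair until none remain. It assumes error-free, same-strand reads, so the consensus step is trivial; the function names and the `min_len` default are hypothetical. Canu and Flye use approximate overlaps and far more careful graph construction than this greedy merge.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return seqs  # nothing left to merge
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for idx, s in enumerate(seqs)
                if idx not in (best_i, best_j)] + [merged]
```

For instance, `greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])` returns `["ATGGCGTACCTGA"]`. The all-against-all comparison is quadratic in the number of reads, which is why production assemblers index reads before looking for overlaps.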

4. Scaffolding & Gap-filling

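Use paired-end or mate-pair information (Illumina) or ultra-long reads (Nanopore) to link contigs into larger “scaffolds,” ordering and orienting them. Gaps of estimated size between linked contigs are written as runs of N and can later be closed (gap-filling) with reads that span them.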
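Once the order, orientation, and approximate spacing of contigs are known, writing out a scaffold is mostly bookkeeping. The sketch below assumes that a hypothetical `layout` (contig name plus strand) and list of gap estimates have already been inferred from read pairs or long reads; it only shows how contigs are oriented and joined with runs of N.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(contigs, layout, gaps):
    """Join oriented contigs, padding each gap with N characters.

    contigs: dict mapping contig name -> sequence
    layout:  list of (name, strand) tuples in scaffold order, strand "+" or "-"
    gaps:    estimated gap sizes between consecutive layout entries
    """
    if len(gaps) != len(layout) - 1:
        raise ValueError("need exactly one gap estimate between each pair")
    parts = []
    for idx, (name, strand) in enumerate(layout):
        seq = contigs[name]
        parts.append(seq if strand == "+" else reverse_complement(seq))
        if idx < len(gaps):
            # Keep at least one N so the join stays visible as a gap.
            parts.append("N" * max(gaps[idx], 1))
    return "".join(parts)
```

With `contigs = {"c1": "ACGT", "c2": "AAGG"}`, the call `build_scaffold(contigs, [("c1", "+"), ("c2", "-")], [3])` returns `"ACGTNNNCCTT"`.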

5. Assembly Validation & Polishing

Metrics:
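- N50: the contig length such that contigs of at least that length contain half of the total assembly length; higher is better.
- L50: the number of contigs needed to reach that half; lower is better.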
- Total length: how close to the expected genome size.
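These contiguity metrics are simple to compute from a list of contig lengths. The sketch below uses the common definition based on total assembly length (not an expected genome size, which would give NG50); the function name is illustrative.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total, taken from
    longest to shortest, first reaches half of the assembly length.
    L50 is how many contigs were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")
```

For contig lengths `[80, 70, 50, 40, 30, 20, 10]` (total 300), the two longest contigs already cover 150 bases, so `n50_l50` returns `(70, 2)`.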
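Completeness and correctness checks:
- BUSCO: searches for a lineage-specific set of near-universal single-copy genes; missing or fragmented hits point to missing content.
- QUAST: computes assembly statistics and, if a reference is available, compares contigs to it and reports misassemblies, genome fraction, and mismatch/indel rates.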
Error correction (polishing): map reads back to the assembly and fix residual base errors or small indels.
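The idea behind polishing can be shown with a per-position majority vote. The sketch below assumes reads have already been mapped and are given as ungapped `(start, sequence)` placements on the draft, so it can only fix substitutions; real polishers work from gapped alignments and also correct small indels. The function name and input format are hypothetical.

```python
from collections import Counter


def polish(draft, alignments):
    """Replace draft bases that a strict majority of aligned reads disagree with.

    alignments: iterable of (start, read_seq) ungapped placements on the draft.
    """
    columns = [Counter() for _ in draft]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(draft):
                columns[pos][base] += 1

    polished = []
    for base, counts in zip(draft, columns):
        if counts:
            top, votes = counts.most_common(1)[0]
            # Change the base only on a strict majority; keep the draft on ties.
            if votes * 2 > sum(counts.values()):
                base = top
        polished.append(base)
    return "".join(polished)
```

Given the draft `"ACGTTCGA"` and alignments `[(0, "ACGTAC"), (2, "GTACGA"), (1, "CGTACG")]`, all three reads support `A` at position 4, so `polish` returns `"ACGTACGA"`.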

Putting It All Together

1. Raw reads (FASTQ)
   └─► trim/clean → corrected reads
         │
         ▼
2. Assemble (SPAdes, Canu, Flye)
   └─► contigs.fasta
         │
         ▼
3. Scaffold (paired reads, long reads)
   └─► scaffolds.fasta
         │
         ▼
4. Evaluate (N50, BUSCO, QUAST)
   └─► quality report