MAG Quality and Annotation - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.2.9 MAG Quality & Annotation

After recovering metagenome-assembled genomes (MAGs), it’s essential to assess their quality (completeness, contamination) and then annotate gene content. We cover three key tools: CheckM, BUSCO, and Prokka.


Installation

# Using conda (recommended)
conda install -c conda-forge -c bioconda checkm-genome busco prokka

You can also install Prokka via sudo apt install prokka on Debian/Ubuntu, but conda ensures version consistency.

A. CheckM: Completeness & Contamination

What it does:

CheckM uses lineage-specific marker sets to estimate genome completeness and contamination.

# 1) Create an output directory
mkdir -p quality/checkm

# 2) Run the lineage workflow on your MAG folder
checkm lineage_wf \
  --reduced_tree \
  -x fa \
  bins/DASTool_DASTool_bins/ \
  quality/checkm \
  -t 8
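  • -x fa tells CheckM that your MAG files have the extension .fa

  • --reduced_tree uses a smaller reference tree, which lowers the memory needed for tree placement at some cost in lineage resolution

  • -t 8 runs the workflow on 8 threads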
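Because -x has to match the bin file names, it can help to count the matching files before starting a long run. The following is a minimal sketch; count_bins and the demo folder are hypothetical names used only for illustration, not part of CheckM.

```shell
# Sketch: count files with a given extension in a bin folder.
# count_bins is a hypothetical helper, not a CheckM command.
count_bins() {
  local dir=$1 ext=$2
  find "$dir" -maxdepth 1 -type f -name "*.${ext}" | wc -l | tr -d ' '
}

# Hypothetical demo folder standing in for bins/DASTool_DASTool_bins/
demo=$(mktemp -d)
touch "$demo/bin.1.fa" "$demo/bin.2.fa" "$demo/notes.txt"

echo "bins with .fa:    $(count_bins "$demo" fa)"
echo "bins with .fasta: $(count_bins "$demo" fasta)"
```

A count of zero for the extension you plan to pass to -x suggests the flag or the folder path needs adjusting.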

Key output files in quality/checkm/:

  • lineage.ms (marker set used)

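  • storage/final_summary.tsv: summary table. CheckM prints this table to the terminal by default; it exists as a file only if the run passes -f with that path (add --tab_table for tab-separated output). Columns include: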

    • Bin Id

    • Completeness (%)

    • Contamination (%)

    • Strain heterogeneity (%)

Example final_summary.tsv:

Bin Id     Completeness   Contamination   Strain heterogeneity
bin.1.fa   98.6           1.2             0.0
bin.2.fa   75.4           2.8             15.0
…          …              …               …
  • Goal: MAGs with ≥ 90 % completeness and ≤ 5 % contamination are considered high-quality.
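These thresholds can be applied to a saved summary table with a short awk filter. This is a minimal sketch that assumes a tab-separated file with a header row and the four columns in the order shown in the example; a real CheckM table may carry additional columns, in which case the column numbers need adjusting. The function name hq_bins and the sample file are hypothetical.

```shell
# Sketch: list bins that meet completeness/contamination cutoffs.
# Assumes tab-separated input with a header row and the columns
# Bin Id, Completeness, Contamination, Strain heterogeneity (in that order).
hq_bins() {
  local table=$1 min_comp=${2:-90} max_cont=${3:-5}
  awk -F'\t' -v c="$min_comp" -v x="$max_cont" \
    'NR > 1 && $2 + 0 >= c && $3 + 0 <= x { print $1 }' "$table"
}

# Hypothetical sample data mirroring the example table.
sample=$(mktemp)
printf 'Bin Id\tCompleteness\tContamination\tStrain heterogeneity\n' >  "$sample"
printf 'bin.1.fa\t98.6\t1.2\t0.0\n'                                  >> "$sample"
printf 'bin.2.fa\t75.4\t2.8\t15.0\n'                                 >> "$sample"

# Default cutoffs: completeness >= 90, contamination <= 5
hq_bins "$sample"
```

The cutoffs are optional arguments, so a stricter call such as hq_bins "$sample" 95 2 works without editing the function.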

B. BUSCO: Single-Copy Orthologs

What it does:

BUSCO searches for conserved single-copy orthologous genes to assess completeness and fragmentation.

# 1) Create output directory
mkdir -p quality/busco

# 2) Run BUSCO on each MAG
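for bin in bins/DASTool_DASTool_bins/*.fa; do
  name=$(basename "$bin" .fa)
  # -l selects the bacterial lineage dataset; -o is a run name (not a path),
  # so the parent folder is given with --out_path
  busco \
    -i "$bin" \
    -l bacteria_odb10 \
    -o "$name" \
    --out_path quality/busco \
    -m genome \
    -c 8
done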
  • -l bacteria_odb10 uses the bacteria-specific BUSCO set (124 single-copy orthologs in odb10).

  • BUSCO will download the lineage data automatically.

Key output in each quality/busco/<bin>/ folder:

  • short_summary.specific.bacteria_odb10.<bin>.txt

  • run_bacteria_odb10/full_table.tsv

Example summary line (illustrative values; a real bacteria_odb10 run reports n:124):
C:96.4%[S:94.1%,D:2.3%],F:1.2%,M:2.4%,n:1240
  • C = Complete BUSCOs (S = single copy, D = duplicated)

  • F = Fragmented BUSCOs

  • M = Missing BUSCOs

  • n = Number of BUSCO groups searched
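To compare many MAGs it is convenient to pull the individual fields out of this one-line summary. The sketch below does so with grep and cut; busco_field is a hypothetical helper rather than part of BUSCO, and the summary string is the illustrative line from this section.

```shell
# Sketch: extract one field (C, S, D, F, M or n) from a BUSCO summary line.
# busco_field is a hypothetical helper, not a BUSCO command.
busco_field() {
  local line=$1 key=$2
  # Keep only "<key>:<number>", then drop the key and the colon
  printf '%s\n' "$line" | grep -o "${key}:[0-9.]*" | cut -d: -f2
}

summary='C:96.4%[S:94.1%,D:2.3%],F:1.2%,M:2.4%,n:1240'

# Print each field as "key<TAB>value"
for key in C S D F M n; do
  printf '%s\t%s\n' "$key" "$(busco_field "$summary" "$key")"
done
```

In a summary line S and D add up to C, and C, F and M add up to 100 %, which gives a quick sanity check on parsed values.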

C. Prokka: Genome Annotation

What it does:

Prokka predicts coding sequences (CDS), rRNAs, tRNAs, and other features, assigning functional annotations.

# 1) Create annotation output dir
mkdir -p annotation/prokka

# 2) Annotate each MAG
for bin in bins/DASTool_DASTool_bins/*.fa; do
  name=$(basename "$bin" .fa)
  prokka \
    --outdir annotation/prokka/"$name" \
    --prefix "$name" \
    --cpus 8 \
    --kingdom Bacteria \
    "$bin"
done

Key output files in annotation/prokka/<bin>/:

  • .gff – GFF3 with gene features

  • .gbk – GenBank file

  • .faa – protein translations (FASTA)

  • .ffn – nucleotide CDS (FASTA)

  • .tbl – feature table for GenBank

  • .txt – summary statistics (number of CDS, rRNA, tRNA)

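Example Prokka summary (illustrative layout; the actual .txt file lists plain key: value lines such as contigs, bases, CDS, rRNA, and tRNA):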

# Assembly: bin.1.fa
# Contigs: 52  (total length: 3,745,128 bp)
# Genes:   3,650
# CDS:     3,550
# rRNA:    3
# tRNA:    45
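Feature counts like these can also be recomputed directly from the GFF file, which is useful as a cross-check. This is a minimal sketch assuming a standard GFF3 layout (nine tab-separated columns, feature type in column 3, an optional ##FASTA section at the end); count_features and the tiny sample file are hypothetical.

```shell
# Sketch: tally feature types (column 3) in a GFF3 file.
# Stops at the ##FASTA marker, after which the file holds sequence data.
count_features() {
  awk -F'\t' '
    /^##FASTA/ { exit }          # end of the annotation section
    /^#/       { next }          # skip header and comment lines
    NF >= 9    { n[$3]++ }       # column 3 is the feature type
    END        { for (t in n) printf "%s\t%d\n", t, n[t] }
  ' "$1" | sort
}

# Hypothetical three-feature sample; real Prokka files are far larger.
gff=$(mktemp)
{
  printf '##gff-version 3\n'
  printf 'contig_1\tProdigal:2.6\tCDS\t10\t400\t.\t+\t0\tID=X_00001\n'
  printf 'contig_1\tProdigal:2.6\tCDS\t500\t900\t.\t-\t0\tID=X_00002\n'
  printf 'contig_1\tAragorn:1.2\ttRNA\t950\t1025\t.\t+\t.\tID=X_00003\n'
  printf '##FASTA\n>contig_1\nACGT\n'
} > "$gff"

count_features "$gff"
```

Pointing count_features at a real .gff file from the Prokka output folder should give one line per feature type.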

D. Summary

quality/
├─ checkm/
│   ├─ lineage.ms
│   └─ storage/final_summary.tsv
└─ busco/
    ├─ bin.1/
    │   ├─ short_summary.specific.bacteria_odb10.bin.1.txt
    │   └─ run_bacteria_odb10/full_table.tsv
    └─ bin.2/ …

annotation/
└─ prokka/
    ├─ bin.1/
    │   ├─ bin.1.gff
    │   ├─ bin.1.faa
    │   ├─ bin.1.ffn
    │   └─ …
    └─ bin.2/ …

Best Practices:

  • Use CheckM and BUSCO together for complementary completeness estimates.

  • Focus downstream analyses on high-quality MAGs (≥ 90 % complete, ≤ 5 % contaminated).

  • Retain Prokka annotations (GFF and protein FASTA) for metabolic reconstruction and comparative genomics.
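One way to use the two completeness estimates together is to place them side by side and look at where they disagree. The sketch below assumes you have already reduced each tool's output to a two-column, tab-separated file (bin name, percentage); those files, their values, and the compare_estimates helper are all hypothetical.

```shell
# Sketch: join CheckM completeness and BUSCO "C" percentages per bin.
# Inputs are hypothetical two-column TSV files: bin name, percentage.
compare_estimates() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { checkm[$1] = $2; next }   # first file: CheckM values
    $1 in checkm {                        # second file: BUSCO values
      diff = checkm[$1] - $2
      if (diff < 0) diff = -diff          # absolute difference
      print $1, checkm[$1], $2, diff
    }
  ' "$1" "$2"
}

# Hypothetical example values
checkm_tsv=$(mktemp); busco_tsv=$(mktemp)
printf 'bin.1\t98.6\nbin.2\t75.4\n' > "$checkm_tsv"
printf 'bin.1\t96.4\nbin.2\t70.0\n' > "$busco_tsv"

# Columns: bin, CheckM completeness, BUSCO complete, absolute difference
compare_estimates "$checkm_tsv" "$busco_tsv"
```

Bins with a large difference between the two estimates are reasonable candidates for a closer manual look before downstream analysis.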
