MAG Quality and Annotation - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.2.9 MAG Quality & Annotation

After recovering metagenome-assembled genomes (MAGs), it’s essential to assess their quality (completeness, contamination) and then annotate gene content. We cover three key tools: CheckM, BUSCO, and Prokka.

Installation

# Using conda (recommended); bioconda packages also need the conda-forge channel
conda install -c conda-forge -c bioconda checkm-genome busco prokka

You can also install Prokka via sudo apt install prokka on Debian/Ubuntu, but conda ensures version consistency.
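Before running the pipeline it can help to confirm that all three programs are on your PATH. The helper below is a minimal sketch; the function name require_tools is made up for this tutorial and is not part of any of the tools.

```shell
# require_tools (hypothetical helper): report which of the given commands
# are missing from PATH. Returns 0 if all are found, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (inside the activated conda environment):
#   require_tools checkm busco prokka && echo "all tools found"
```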

A. CheckM: Completeness & Contamination

What it does:

CheckM uses lineage-specific marker sets to estimate genome completeness and contamination.

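# 1) Create an output directory
mkdir -p quality/checkm

# 2) Run the lineage workflow on your MAG folder
#    (options first, then the bin folder and the output folder)
checkm lineage_wf \
  --reduced_tree \
  -x fa \
  -t 8 \
  --tab_table \
  -f quality/checkm/storage/final_summary.tsv \
  bins/DASTool_DASTool_bins/ \
  quality/checkm

-x fa tells CheckM that your MAG files have the extension .fa
--reduced_tree uses a smaller reference tree, which needs much less memory at the cost of coarser lineage placement
--tab_table and -f write the summary as a tab-separated file; without -f, CheckM only prints the table to the terminal

Key output files in quality/checkm/:

lineage.ms (marker set used)
storage/final_summary.tsv (the summary table requested with -f above), with columns including:
- Bin Id
- Completeness (%)
- Contamination (%)
- Strain heterogeneity (%)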

Example final_summary.tsv (abridged to the key columns; CheckM reports bin IDs without the file extension):

Bin Id	Completeness	Contamination	Strain heterogeneity
bin.1	98.6	1.2	0.0
bin.2	75.4	2.8	15.0
…	…	…	…

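Goal: MAGs with ≥ 90 % completeness and ≤ 5 % contamination are commonly treated as high-quality. The MIMAG standard uses > 90 % and < 5 %, and additionally expects the 5S, 16S and 23S rRNA genes and at least 18 tRNAs for a high-quality draft.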
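The completeness and contamination thresholds can be applied to the summary table with a short awk filter. This is a sketch only: it assumes the summary was saved as tab-separated text with a header row that uses the column names Bin Id, Completeness and Contamination, and hq_bins is a hypothetical helper name.

```shell
# hq_bins FILE [MIN_COMPLETENESS] [MAX_CONTAMINATION]   (hypothetical helper)
# Prints the IDs of bins that meet both thresholds (defaults: 90 and 5).
# Assumes a tab-separated table whose header row names the columns.
hq_bins() {
  awk -F'\t' -v min_comp="${2:-90}" -v max_cont="${3:-5}" '
    NR == 1 {                        # map header names to column numbers
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ($(col["Completeness"]) + 0) >= (min_comp + 0) &&
    ($(col["Contamination"]) + 0) <= (max_cont + 0) {
      print $(col["Bin Id"])
    }
  ' "$1"
}

# Usage:
#   hq_bins quality/checkm/storage/final_summary.tsv
#   hq_bins quality/checkm/storage/final_summary.tsv 50 10   # looser cut-offs
```

Looking columns up by header name rather than by position keeps the filter working when the table carries extra columns between the bin ID and the quality estimates.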

B. BUSCO: Single-Copy Orthologs

What it does:

BUSCO searches for conserved single-copy orthologous genes to assess completeness and fragmentation.

# 1) Create output directory
mkdir -p quality/busco

# 2) Run BUSCO on each MAG
#    -l bacteria_odb10 selects the bacterial lineage dataset
#    -o takes a run name, not a path; --out_path sets the parent folder
for bin in bins/DASTool_DASTool_bins/*.fa; do
  name=$(basename "$bin" .fa)
  busco \
    -i "$bin" \
    -l bacteria_odb10 \
    -o "$name" \
    --out_path quality/busco \
    -m genome \
    -c 8
done

-l bacteria_odb10 uses the bacteria-specific BUSCO set (124 single-copy orthologs).
BUSCO downloads the lineage data automatically on first use.

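Key output in each quality/busco/<name>/ folder:

short_summary.<name>.txt
run_<name>/full_table.tsv

Exact names depend on the BUSCO version: v5 inserts the lineage, e.g. short_summary.specific.bacteria_odb10.<name>.txt and run_bacteria_odb10/full_table.tsv.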

Example summary line (illustrative values; bacteria_odb10 contains 124 BUSCOs):

C:96.0%[S:93.5%,D:2.4%],F:1.6%,M:2.4%,n:124

C = Complete BUSCOs (S = single copy, D = duplicated)
F = Fragmented BUSCOs
M = Missing BUSCOs
n = Number of BUSCO groups searched
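To collect these scores for many MAGs, the one-line summary can be split into its fields. The sketch below assumes the line has the C/S/D/F/M/n layout shown above; busco_scores is a hypothetical helper name, and any extra fields that a given BUSCO version appends after n are ignored.

```shell
# busco_scores (hypothetical helper): read BUSCO summary text on stdin and
# print the C, S, D, F and M percentages followed by n, space-separated.
# Lines that do not contain the one-line summary are skipped.
busco_scores() {
  sed -n 's/.*C:\([0-9.]*\)%\[S:\([0-9.]*\)%,D:\([0-9.]*\)%],F:\([0-9.]*\)%,M:\([0-9.]*\)%,n:\([0-9]*\).*/\1 \2 \3 \4 \5 \6/p'
}

# Usage (SHORT_SUMMARY_FILE stands for one of the short summary text files):
#   busco_scores < SHORT_SUMMARY_FILE
```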

C. Prokka: Genome Annotation

What it does:

Prokka predicts coding sequences (CDS), rRNAs, tRNAs, and other features, assigning functional annotations.

# 1) Create annotation output dir
mkdir -p annotation/prokka

# 2) Annotate each MAG
for bin in bins/DASTool_DASTool_bins/*.fa; do
  name=$(basename "$bin" .fa)
  prokka \
    --outdir annotation/prokka/"$name" \
    --prefix "$name" \
    --cpus 8 \
    --kingdom Bacteria \
    "$bin"
done

Key output files in annotation/prokka/<name>/:

<name>.gff – GFF3 with gene features, followed by the contig sequences
<name>.gbk – GenBank file
<name>.faa – protein translations (FASTA)
<name>.ffn – nucleotide sequences of all predicted features (FASTA)
<name>.tbl – feature table for GenBank submission
<name>.txt – summary statistics (number of CDS, rRNA, tRNA)

Example Prokka summary (illustrative values; the <name>.txt file itself uses plain key: value lines such as contigs: 52):

# Assembly: bin.1.fa
# Contigs: 52  (total length: 3,745,128 bp)
# Genes:   3,650
# CDS:     3,550
# rRNA:    3
# tRNA:    45
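The same kind of counts can be recomputed directly from the GFF3 file, which is a useful cross-check on the summary. This is a minimal sketch: gff_feature_counts is a hypothetical helper name, and it relies on the feature type being in the third tab-separated column and on the sequence section starting at a ##FASTA line.

```shell
# gff_feature_counts FILE (hypothetical helper): count features by type,
# using the third column of a GFF3 file. Output: TYPE<TAB>COUNT, sorted.
gff_feature_counts() {
  awk -F'\t' '
    /^##FASTA/ { exit }             # sequences follow; stop reading (END still runs)
    /^#/       { next }             # skip header and comment lines
    NF >= 9    { n[$3]++ }          # a feature line has nine columns
    END        { for (t in n) print t "\t" n[t] }
  ' "$1" | sort
}

# Usage:
#   gff_feature_counts annotation/prokka/bin.1/bin.1.gff
```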

D. Summary

quality/
├─ checkm/
│   ├─ lineage.ms
│   └─ storage/final_summary.tsv
└─ busco/
    ├─ bin.1/
    │   ├─ short_summary.bin.1.txt
    │   └─ run_bin.1/full_table.tsv
    └─ bin.2/ …

annotation/
└─ prokka/
    ├─ bin.1/
    │   ├─ bin.1.gff
    │   ├─ bin.1.faa
    │   ├─ bin.1.ffn
    │   └─ …
    └─ bin.2/ …

Best Practices:

Use CheckM and BUSCO together for complementary completeness estimates.
Focus downstream analyses on high-quality MAGs (≥ 90 % complete, ≤ 5 % contaminated).
Retain Prokka annotations (GFF and protein FASTA) for metabolic reconstruction and comparative genomics.
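One way to act on these points is to gather the annotation files of the high-quality MAGs only. The sketch below assumes the annotation/prokka/<name>/<name>.<ext> layout from the summary above; annotation_paths and the file hq_bin_ids.txt (one bin ID per line) are hypothetical names used for illustration.

```shell
# annotation_paths PROKKA_DIR (hypothetical helper): read bin IDs on stdin,
# one per line, and print the GFF and protein FASTA paths that exist for
# each bin. Missing files are reported on stderr.
annotation_paths() {
  while IFS= read -r bin; do
    for ext in gff faa; do
      f="$1/$bin/$bin.$ext"
      if [ -f "$f" ]; then
        echo "$f"
      else
        echo "not found: $f" >&2
      fi
    done
  done
}

# Usage (hq_bin_ids.txt is a hypothetical list of high-quality bin IDs):
#   annotation_paths annotation/prokka < hq_bin_ids.txt
```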
