MAG Quality and Annotation - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
After recovering metagenome-assembled genomes (MAGs), it's essential to assess their quality (completeness, contamination) and then annotate gene content. We cover three key tools: CheckM, BUSCO, and Prokka.
# Using conda (recommended)
conda install -c bioconda checkm-genome busco prokka
You can also install Prokka via sudo apt install prokka on Debian/Ubuntu, but conda ensures version consistency.
CheckM uses lineage-specific marker sets to estimate genome completeness and contamination.
# 1) Create an output directory
mkdir -p quality/checkm
# 2) Run the lineage workflow on your MAG folder
checkm lineage_wf \
  --reduced_tree \
  -x fa \
  -t 8 \
  bins/DASTool_DASTool_bins/ \
  quality/checkm
- -x fa: tells CheckM that your MAG files have the extension .fa
- --reduced_tree: uses a reduced reference tree for lineage placement, which lowers memory use and speeds up placement
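Because -x decides which files CheckM treats as bins, a quick count before launching the run can catch a mismatched extension. The following is a minimal sketch under stated assumptions: the directory and file names are hypothetical and created on the fly, so point `bin_dir` at your own bin folder in practice.

```shell
# Hypothetical bin directory with mixed extensions, created only for this demo.
bin_dir=$(mktemp -d)
touch "$bin_dir/bin.1.fa" "$bin_dir/bin.2.fa" "$bin_dir/bin.3.fasta"

ext=fa
# Count the files whose names end in ".$ext" versus all files in the folder.
n_match=$(find "$bin_dir" -maxdepth 1 -type f -name "*.$ext" | wc -l | tr -d ' ')
n_total=$(find "$bin_dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
echo "$n_match of $n_total files end in .$ext"
```

If the two numbers differ, some bins would probably be skipped; renaming them or changing the value passed to -x are the usual remedies.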
- lineage.ms: the marker set used
- storage/final_summary.tsv: summary table with columns:
  - Bin Id
  - Completeness (%)
  - Contamination (%)
  - Strain heterogeneity (%)
- Example final_summary.tsv:
Bin Id | Completeness (%) | Contamination (%) | Strain heterogeneity (%) |
---|---|---|---|
bin.1.fa | 98.6 | 1.2 | 0.0 |
bin.2.fa | 75.4 | 2.8 | 15.0 |
… | … | … | … |
- Goal: MAGs with ≥ 90 % completeness and ≤ 5 % contamination are considered high-quality.
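The high-quality thresholds can be applied to the summary table with a one-line awk filter. This sketch assumes a tab-separated file with one header row and the column order from the example table; the sample file is rebuilt here from those example values, so the path is hypothetical.

```shell
# Hypothetical sample table mirroring the example values (tab-separated).
tmpdir=$(mktemp -d)
summary="$tmpdir/final_summary.tsv"
printf 'Bin Id\tCompleteness\tContamination\tStrain heterogeneity\n' > "$summary"
printf 'bin.1.fa\t98.6\t1.2\t0.0\n' >> "$summary"
printf 'bin.2.fa\t75.4\t2.8\t15.0\n' >> "$summary"

# Skip the header, then keep bins with completeness >= 90 and contamination <= 5.
high_quality=$(awk -F'\t' 'NR > 1 && $2 >= 90 && $3 <= 5 { print $1 }' "$summary")
echo "$high_quality"
```

With the example values only bin.1.fa passes, because bin.2.fa falls below the completeness threshold.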
BUSCO searches for conserved single-copy orthologous genes to assess completeness and fragmentation.
# 1) Create output directory
mkdir -p quality/busco
# 2) Run BUSCO on each MAG
for bin in bins/DASTool_DASTool_bins/*.fa; do
  name=$(basename "$bin" .fa)
  # -l bacteria_odb10: bacterial lineage dataset
  # Results are written to quality/busco/$name
  busco \
    -i "$bin" \
    -l bacteria_odb10 \
    -o "$name" \
    --out_path quality/busco \
    -m genome \
    -c 8
done
- -l bacteria_odb10: uses the bacteria-specific BUSCO set (124 single-copy orthologs).
- BUSCO will download the lineage data automatically.
- short_summary.<name>.txt
- run_<name>/full_table.tsv
C:96.4%[S:94.1%,D:2.3%],F:1.2%,M:2.4%,n:1240
- C = Complete BUSCOs (S = single copy, D = duplicated)
- F = Fragmented BUSCOs
- M = Missing BUSCOs
- n = Number of BUSCO groups searched
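When comparing many MAGs it helps to pull the individual percentages out of this one-line score. The sketch below parses the example string with sed; in practice the line would first be extracted from each short summary file (for instance with grep), and that step is left out here as an assumption about your file layout.

```shell
# Example score line copied from the text above.
line='C:96.4%[S:94.1%,D:2.3%],F:1.2%,M:2.4%,n:1240'

# Capture the number that follows each label.
complete=$(printf '%s\n' "$line" | sed -E 's/^C:([0-9.]+)%.*/\1/')
duplicated=$(printf '%s\n' "$line" | sed -E 's/.*D:([0-9.]+)%.*/\1/')
missing=$(printf '%s\n' "$line" | sed -E 's/.*M:([0-9.]+)%.*/\1/')

printf 'complete=%s duplicated=%s missing=%s\n' "$complete" "$duplicated" "$missing"
```

A high duplicated fraction in a MAG can hint at contamination, so it is worth keeping that value alongside the completeness figure.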
Prokka predicts coding sequences (CDS), rRNAs, tRNAs, and other features, assigning functional annotations.
# 1) Create annotation output dir
mkdir -p annotation/prokka
# 2) Annotate each MAG
for bin in bins/DASTool_DASTool_bins/*.fa; do
  name=$(basename "$bin" .fa)
  prokka \
    --outdir annotation/prokka/"$name" \
    --prefix "$name" \
    --cpus 8 \
    --kingdom Bacteria \
    "$bin"
done
- .gff: GFF3 with gene features
- .gbk: GenBank file
- .faa: protein translations (FASTA)
- .ffn: nucleotide CDS (FASTA)
- .tbl: feature table for GenBank
- .txt: summary statistics (number of CDS, rRNA, tRNA)
# Assembly: bin.1.fa
# Contigs: 52 (total length: 3,745,128 bp)
# Genes: 3,650
# CDS: 3,550
# rRNA: 3
# tRNA: 45
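Feature counts like these can also be tallied directly from a GFF file, which is handy as a cross-check or when summarising many MAGs at once. The sketch below runs on a tiny made-up GFF3 file (contig names, coordinates and IDs are all hypothetical) and assumes the annotation is followed by a ##FASTA section, as in Prokka's GFF output.

```shell
# Tiny made-up GFF3 file standing in for a real Prokka .gff output.
tmpdir=$(mktemp -d)
gff="$tmpdir/example.gff"
{
  printf '##gff-version 3\n'
  printf 'contig_1\tProdigal\tCDS\t100\t400\t.\t+\t0\tID=demo_00001\n'
  printf 'contig_1\tProdigal\tCDS\t500\t900\t.\t-\t0\tID=demo_00002\n'
  printf 'contig_1\tAragorn\ttRNA\t1000\t1075\t.\t+\t.\tID=demo_00003\n'
  printf 'contig_2\tbarrnap\trRNA\t10\t1500\t.\t+\t.\tID=demo_00004\n'
  printf '##FASTA\n>contig_1\nACGT\n'
} > "$gff"

# Tally column 3 (feature type); skip comment lines and stop at the FASTA section.
counts=$(awk -F'\t' '
  /^##FASTA/ { exit }
  /^#/       { next }
  { n[$3]++ }
  END { for (t in n) printf "%s\t%d\n", t, n[t] }
' "$gff" | LC_ALL=C sort)
printf '%s\n' "$counts"
```

Pointing `gff` at a real file under annotation/prokka/ should give per-type counts comparable to the .txt summary.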
quality/
├─ checkm/
│  ├─ lineage.ms
│  └─ storage/final_summary.tsv
└─ busco/
   ├─ bin.1/
   │  ├─ short_summary.bin.1.txt
   │  └─ run_bin.1/full_table.tsv
   └─ bin.2/ …
annotation/
└─ prokka/
   ├─ bin.1/
   │  ├─ bin.1.gff
   │  ├─ bin.1.faa
   │  ├─ bin.1.ffn
   │  └─ …
   └─ bin.2/ …
- Use CheckM and BUSCO together for complementary completeness estimates.
- Focus downstream analyses on high-quality MAGs (≥ 90 % complete, ≤ 5 % contaminated).
- Retain Prokka annotations (GFF and protein FASTA) for metabolic reconstruction and comparative genomics.
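One way to use the two completeness estimates together is to join them per bin and flag large disagreements for manual review. The sketch below is illustrative only: the two-column tables, the BUSCO values and the 10-point cutoff are all hypothetical assumptions, not output from the commands above.

```shell
tmpdir=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical per-bin completeness values (bin name, percent), tab-separated.
printf 'bin.1\t98.6\nbin.2\t75.4\n' | sort > "$tmpdir/checkm.tsv"
printf 'bin.1\t96.4\nbin.2\t58.9\n' | sort > "$tmpdir/busco.tsv"

# Join on bin name, then flag bins whose estimates differ by more than 10 points.
flagged=$(join -t "$tab" "$tmpdir/checkm.tsv" "$tmpdir/busco.tsv" |
  awk -F'\t' '{ d = $2 - $3; if (d < 0) d = -d; if (d > 10) print $1 }')
echo "$flagged"
```

Both inputs must be sorted on the bin name for join to work, which is why each table is piped through sort before being written.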