6.2.5 Shotgun Taxonomic Profiling
After QC and (optionally) assembly and binning, the next step in a shotgun metagenomics workflow is to assign taxonomy to your reads or contigs. This section covers four popular tools: Kraken2 (with Bracken), MetaPhlAn3, Kaiju, and Centrifuge.
Tools & Installation
# Create a conda env (recommended)
conda create -n metagenomics -c bioconda \
kraken2 bracken metaphlan kaiju centrifuge
conda activate metagenomics
On Ubuntu, several of these tools are also packaged for apt (e.g. sudo apt install kraken2), though package names and versions vary by release; conda keeps things isolated.
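Before moving on, it can help to confirm that the environment actually exposes each tool. The sketch below is a small helper (the name `check_tools` is made up for this page, not part of any of the packages) that reports which commands are missing from `PATH`.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Typical call after `conda activate metagenomics`:
#   check_tools kraken2 bracken metaphlan kaiju centrifuge
```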
A. Kraken2 & Bracken
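**1. Build or download a Kraken2 database** (e.g. the prebuilt MiniKraken2, ~8 GB, or a custom RefSeq build):
# Build a custom database from the RefSeq bacterial and viral libraries
# (needs far more disk and RAM than the prebuilt MiniKraken2)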
kraken2-build --download-library bacteria \
--db kraken2_db
kraken2-build --download-library viral \
--db kraken2_db
kraken2-build --download-taxonomy \
--db kraken2_db
kraken2-build --build --db kraken2_db
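A finished Kraken2 database directory contains three files: `hash.k2d`, `opts.k2d`, and `taxo.k2d`. As a quick sanity check after a long build, a helper along these lines (the function name `kraken_db_ready` is hypothetical) verifies that all three exist and are non-empty.

```shell
# Check that a Kraken2 database directory holds the three .k2d files
# written by a completed build. Returns 1 at the first missing or empty file.
kraken_db_ready() {
  db=$1
  for f in hash.k2d opts.k2d taxo.k2d; do
    if [ ! -s "$db/$f" ]; then
      printf 'missing or empty: %s/%s\n' "$db" "$f" >&2
      return 1
    fi
  done
}

# Typical call:
#   kraken_db_ready kraken2_db && echo "database looks complete"
```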
2. Classify reads:
mkdir -p kraken2_out
kraken2 \
--db kraken2_db \
--threads 8 \
--report kraken2_out/sampleA.report.tsv \
--output kraken2_out/sampleA.output.tsv \
--paired \
trimmed/SampleA_R1.trimmed.fastq.gz \
trimmed/SampleA_R2.trimmed.fastq.gz
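The command above handles one sample. For a whole directory, one option is to derive sample names from the R1 file names and pair each with its R2 mate. The sketch below assumes the `_R1.trimmed.fastq.gz` / `_R2.trimmed.fastq.gz` naming used on this page; `list_pairs` is an illustrative helper name.

```shell
# Print "sample<TAB>R1<TAB>R2" for every complete read pair in a directory.
# R1 files without a matching R2 are skipped with a warning on stderr.
list_pairs() {
  dir=$1
  for r1 in "$dir"/*_R1.trimmed.fastq.gz; do
    [ -e "$r1" ] || continue   # glob matched nothing
    sample=$(basename "$r1" _R1.trimmed.fastq.gz)
    r2="$dir/${sample}_R2.trimmed.fastq.gz"
    if [ -f "$r2" ]; then
      printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
    else
      printf 'warning: no R2 mate for %s\n' "$r1" >&2
    fi
  done
}

# Typical use: feed each pair to the kraken2 command shown above.
#   list_pairs trimmed | while IFS="$(printf '\t')" read -r sample r1 r2; do
#     kraken2 --db kraken2_db --threads 8 --paired \
#       --report "kraken2_out/$sample.report.tsv" \
#       --output "kraken2_out/$sample.output.tsv" "$r1" "$r2"
#   done
```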
3. Abundance estimation with Bracken:
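# One-time step for a custom database: build the k-mer distribution for your read length
bracken-build -d kraken2_db -t 8 -k 35 -l 150
bracken \
-d kraken2_db \
-i kraken2_out/sampleA.report.tsv \
-o kraken2_out/sampleA.bracken.tsv \
-r 150 \
-l S # -r: read length; -l S: estimate at species level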
4. Outputs:
- sampleA.output.tsv: per-read assignments
- sampleA.report.tsv: Kraken2's hierarchical report
- sampleA.bracken.tsv: corrected species-level abundances
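To get a quick look at the Kraken2 report without opening it in a spreadsheet, the species rows can be pulled out with awk. This sketch assumes the standard six-column report (percent, clade reads, direct reads, rank code, taxid, indented name); `kraken_top_species` is a made-up helper name.

```shell
# Print "name<TAB>clade_reads" for species-rank rows (rank code "S") of a
# Kraken2 report that have at least MIN_READS reads, largest first.
kraken_top_species() {
  report=$1
  min_reads=${2:-10}
  awk -F'\t' -v min="$min_reads" '
    $4 == "S" && $2 + 0 >= min {
      name = $6
      sub(/^ +/, "", name)   # strip the indentation that encodes tree depth
      print name "\t" $2
    }' "$report" | sort -t "$(printf '\t')" -k2,2nr
}

# Typical call:
#   kraken_top_species kraken2_out/sampleA.report.tsv 100
```

The read threshold is a crude filter for spurious low-count hits; a sensible value depends on sequencing depth.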
B. MetaPhlAn3
1. Profile a sample (the marker database is downloaded automatically on first run):
# Example: profile a single paired-end sample
mkdir -p metaphlan_out
metaphlan \
trimmed/SampleA_R1.trimmed.fastq.gz,trimmed/SampleA_R2.trimmed.fastq.gz \
--input_type fastq \
--bowtie2out metaphlan_out/sampleA.bowtie2.bz2 \
--nproc 8 \
> metaphlan_out/sampleA.profile.tsv
2. Merge profiles across samples:
merge_metaphlan_tables.py \
metaphlan_out/*.profile.tsv \
> metaphlan_out/merged_abundance.tsv
3. Outputs:
- sampleA.profile.tsv: relative abundances (%) of clades
- merged_abundance.tsv: abundance matrix (clades in rows, samples in columns)
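A MetaPhlAn profile lists every taxonomic level, so downstream work often starts by keeping only the species rows. The sketch below assumes the MetaPhlAn3 layout, with `#` header lines and `clade_name`, `NCBI_tax_id`, `relative_abundance` as the first three columns; other versions may order columns differently. `metaphlan_species` is an illustrative name.

```shell
# Print "species<TAB>relative_abundance" from a MetaPhlAn profile.
# Keeps rows whose lineage ends at species level (has s__ but no t__ strain).
metaphlan_species() {
  awk -F'\t' '
    /^#/ { next }                          # skip header/comment lines
    $1 ~ /\|s__/ && $1 !~ /\|t__/ {
      n = split($1, parts, "[|]")          # lineage is pipe-separated
      name = parts[n]
      sub(/^s__/, "", name)
      print name "\t" $3                   # column 3 assumed to be abundance
    }' "$1"
}

# Typical call:
#   metaphlan_species metaphlan_out/sampleA.profile.tsv
```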
C. Kaiju & Centrifuge
Kaiju
1. Download/prepare database:
kaiju-makedb -s nr # builds nr/kaiju_db_nr.fmi from NCBI nr, requires >50 GB disk
2. Classify reads:
mkdir -p kaiju_out
kaiju \
-t nodes.dmp -f nr/kaiju_db_nr.fmi \
-i trimmed/SampleA_R1.trimmed.fastq.gz \
-j trimmed/SampleA_R2.trimmed.fastq.gz \
-o kaiju_out/sampleA.kaiju.tsv \
-z 8
# Summarize the per-read assignments into a species-level table
kaiju2table \
-t nodes.dmp -n names.dmp \
-r species \
-o kaiju_out/sampleA.kaiju.summary.tsv \
kaiju_out/sampleA.kaiju.tsv
3. Outputs:
- sampleA.kaiju.tsv: per-read assignments
- sampleA.kaiju.summary.tsv: species-level counts
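A useful first number from the per-read file is the fraction of reads Kaiju could classify. In its default output the first column is `C` (classified) or `U` (unclassified), so a short awk pass is enough. `kaiju_classified_pct` is a hypothetical helper name.

```shell
# Print the percentage of classified reads in a Kaiju per-read output file.
# Column 1 is "C" for classified and "U" for unclassified reads.
kaiju_classified_pct() {
  awk -F'\t' '
    $1 == "C" { c++ }
    $1 == "C" || $1 == "U" { n++ }
    END {
      if (n > 0) printf "%.1f\n", 100 * c / n
      else print "NA"                      # empty or unrecognized input
    }' "$1"
}

# Typical call:
#   kaiju_classified_pct kaiju_out/sampleA.kaiju.tsv
```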
Centrifuge
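1. Build or download a database (this example indexes RefSeq bacteria and viruses; the human genome can be added with a further centrifuge-download call):
mkdir -p db
centrifuge-download -o db/taxonomy taxonomy
centrifuge-download -o db/library -P 8 -d "bacteria,viral" refseq > db/seqid2taxid.map
cat db/library/*/*.fna > db/input-sequences.fna
centrifuge-build -p 8 \
--conversion-table db/seqid2taxid.map \
--taxonomy-tree db/taxonomy/nodes.dmp \
--name-table db/taxonomy/names.dmp \
db/input-sequences.fna \
db/centdb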
2. Classify reads:
mkdir -p centrifuge_out
centrifuge \
-x db/centdb \
-1 trimmed/SampleA_R1.trimmed.fastq.gz \
-2 trimmed/SampleA_R2.trimmed.fastq.gz \
-S centrifuge_out/sampleA.tsv \
--report-file centrifuge_out/sampleA.report.tsv \
-p 8
3. Outputs:
- sampleA.tsv: read-level assignments
- sampleA.report.tsv: aggregated report
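The Centrifuge report is a tab-separated table with one header line, so the most abundant taxa can be listed with standard tools. This sketch assumes the usual column order (name, taxID, taxRank, genomeSize, numReads, numUniqueReads, abundance) and ranks by uniquely assigned reads; `centrifuge_top` is an illustrative name.

```shell
# Print "name<TAB>numUniqueReads" for the top N taxa of a Centrifuge report,
# ranked by uniquely assigned reads (column 6). N defaults to 10.
centrifuge_top() {
  tab=$(printf '\t')
  tail -n +2 "$1" |            # drop the header line
    sort -t "$tab" -k6,6nr |   # numeric sort, largest first
    head -n "${2:-10}" |
    cut -f1,6
}

# Typical call:
#   centrifuge_top centrifuge_out/sampleA.report.tsv 20
```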
Visualization & Downstream
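- Krona: render classifications as an interactive HTML chart, e.g. from Kaiju's per-read output:
kaiju2krona -t nodes.dmp -n names.dmp \
-i kaiju_out/sampleA.kaiju.tsv -o kaiju_out/sampleA.krona.txt
ktImportText -o kaiju_out/sampleA.krona.html kaiju_out/sampleA.krona.txt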
- R phyloseq: import Bracken or MetaPhlAn tables for diversity analyses.
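Before importing into R, it is often convenient to stack the per-sample Bracken outputs into one long table (sample, taxon, reads) that can be pivoted into a count matrix. The sketch below assumes Bracken's standard output columns, with `new_est_reads` in column 6, and the `.bracken.tsv` suffix used on this page; `bracken_long` is a hypothetical helper.

```shell
# Combine Bracken output files into one long table: sample, taxon, reads.
# The sample name is the file name without the .bracken.tsv suffix.
bracken_long() {
  printf 'sample\ttaxon\treads\n'
  for f in "$@"; do
    sample=$(basename "$f" .bracken.tsv)
    # NR > 1 skips each file's header; column 6 is new_est_reads
    awk -F'\t' -v s="$sample" 'NR > 1 { print s "\t" $1 "\t" $6 }' "$f"
  done
}

# Typical call:
#   bracken_long kraken2_out/*.bracken.tsv > kraken2_out/bracken_long.tsv
```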