Shotgun Taxonomic Profiling - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

6.2.5 Shotgun Taxonomic Profiling

After QC and (optionally) assembly/binning, the next step in shotgun metagenomics is to assign taxonomy to your reads or contigs. Here we cover four popular tools: Kraken2 (+ Bracken), MetaPhlAn3, Kaiju, and Centrifuge.


Tools & Installation

# Create a conda env (recommended); Bioconda packages also need the conda-forge channel
conda create -n metagenomics -c conda-forge -c bioconda \
    kraken2 bracken metaphlan kaiju centrifuge krona

conda activate metagenomics

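On Ubuntu, some of these tools can also be installed with sudo apt install (for example kraken2), but package availability and versions vary by release; conda keeps everything isolated in one environment.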
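
Before running anything, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, not part of any of the packages above.

```shell
# Report every tool that is not on PATH; return non-zero if any is missing.
# check_tools is an illustrative helper, not shipped with any of these packages.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (inside the activated environment):
#   check_tools kraken2 bracken metaphlan kaiju centrifuge && echo "all tools found"
```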

A. Kraken2 & Bracken

1. Build or download a Kraken2 database (e.g. MiniKraken or full RefSeq):

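# Build a custom database from RefSeq bacterial and viral genomes.
# This is a large download; a prebuilt index such as MiniKraken2 (~8 GB) is a lighter alternative.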
kraken2-build --download-library bacteria \
              --db kraken2_db
kraken2-build --download-library viral   \
              --db kraken2_db
kraken2-build --download-taxonomy      \
              --db kraken2_db
kraken2-build --build --db kraken2_db

2. Classify reads:

mkdir -p kraken2_out
kraken2 \
  --db kraken2_db \
  --threads 8 \
  --report kraken2_out/sampleA.report.tsv \
  --output kraken2_out/sampleA.output.tsv \
  --paired \
  trimmed/SampleA_R1.trimmed.fastq.gz \
  trimmed/SampleA_R2.trimmed.fastq.gz
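
With more than one sample, the same command can be wrapped in a loop. This sketch assumes every sample follows the <name>_R1.trimmed.fastq.gz / <name>_R2.trimmed.fastq.gz naming used above; r2_for, sample_of and classify_all are hypothetical helper names. Output files are named after the part of the file name before _R1.

```shell
# Derive the R2 path from an R1 path (assumes the _R1/_R2 naming convention).
r2_for() { printf '%s\n' "$1" | sed 's/_R1\./_R2./'; }

# Derive the sample name: file name up to "_R1.".
sample_of() { local b; b=$(basename "$1"); printf '%s\n' "${b%%_R1.*}"; }

# Classify every read pair in a directory.
# $1 = directory with trimmed reads, $2 = Kraken2 database, $3 = output directory
classify_all() {
  local r1 r2 sample
  mkdir -p "$3"
  for r1 in "$1"/*_R1.trimmed.fastq.gz; do
    [ -e "$r1" ] || continue            # glob matched nothing
    r2=$(r2_for "$r1")
    sample=$(sample_of "$r1")
    kraken2 --db "$2" --threads 8 --paired \
      --report "$3/$sample.report.tsv" \
      --output "$3/$sample.output.tsv" \
      "$r1" "$r2"
  done
}

# Usage:
#   classify_all trimmed kraken2_db kraken2_out
```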

3. Abundance estimation with Bracken:

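# One-time step: generate Bracken's k-mer distribution for 150 bp reads
# (-k 35 is Kraken2's default k-mer length)
bracken-build -d kraken2_db -t 8 -k 35 -l 150

bracken \
  -d kraken2_db \
  -i kraken2_out/sampleA.report.tsv \
  -o kraken2_out/sampleA.bracken.tsv \
  -r 150 \
  -l S       # -r: read length, -l S: estimate at species level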

4. Outputs:

  • sampleA.output.tsv: per-read assignments

  • sampleA.report.tsv: Kraken2’s hierarchical report

  • sampleA.bracken.tsv: corrected species-level abundances
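
For a quick look at a Kraken2 report without leaving the shell, the species rows can be pulled out with awk. This is a sketch that assumes the standard six-column report (percent, clade reads, direct reads, rank code, taxid, indented name), i.e. one produced without extra report options; top_species is a hypothetical helper name.

```shell
# Print the N species with the most clade-level reads from a Kraken2 report.
# Output: <clade reads><TAB><species name>
# $1 = report file, $2 = number of rows to show (default 10)
top_species() {
  local report="$1" n="${2:-10}"
  awk -F'\t' '$4 == "S" {
      name = $6
      sub(/^ +/, "", name)              # strip the indentation Kraken2 adds
      print $2 "\t" name
    }' "$report" \
    | sort -t "$(printf '\t')" -k1,1nr \
    | head -n "$n"
}

# Usage:
#   top_species kraken2_out/sampleA.report.tsv 5
```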

B. MetaPhlAn3

1. Profile a sample (the marker database is downloaded automatically on first run):

# Example: classify a single sample
mkdir -p metaphlan_out
metaphlan \
  trimmed/SampleA_R1.trimmed.fastq.gz,trimmed/SampleA_R2.trimmed.fastq.gz \
  --input_type fastq \
  --bowtie2out metaphlan_out/sampleA.bowtie2.bz2 \
  --nproc 8 \
  > metaphlan_out/sampleA.profile.tsv

2. Merge profiles across samples:

merge_metaphlan_tables.py \
  metaphlan_out/*.profile.tsv \
  > metaphlan_out/merged_abundance.tsv

3. Outputs:

  • sampleA.profile.tsv: relative abundances (%) of clades

  • merged_abundance.tsv: abundance matrix (rows are taxa, columns are samples)
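
A MetaPhlAn profile lists every taxonomic level on its own row, so a species-only view needs filtering. The sketch below assumes the profile has a header row starting with #clade_name that contains a relative_abundance column, as MetaPhlAn 3 profiles do; species_abundance is a hypothetical helper name.

```shell
# Print species-level rows of a MetaPhlAn profile as <species><TAB><relative abundance>.
# $1 = profile file
species_abundance() {
  awk -F'\t' '
    /^#clade_name/ {                    # header row: locate the abundance column
      for (i = 1; i <= NF; i++) if ($i == "relative_abundance") col = i
      next
    }
    /^#/ { next }                       # skip other comment lines
    col && $1 ~ /[|]s__[^|]*$/ {        # lineage ends at species level
      n = split($1, parts, "|")
      print parts[n] "\t" $col
    }
  ' "$1"
}

# Usage:
#   species_abundance metaphlan_out/sampleA.profile.tsv
```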

C. Kaiju & Centrifuge

Kaiju

1. Download/prepare database:

kaiju-makedb -s nr   # builds from NCBI nr, requires >50 GB disk

2. Classify reads:

mkdir -p kaiju_out
kaiju \
  -t nodes.dmp -f nr/kaiju_db_nr.fmi \
  -i trimmed/SampleA_R1.trimmed.fastq.gz \
  -j trimmed/SampleA_R2.trimmed.fastq.gz \
  -o kaiju_out/sampleA.kaiju.tsv \
  -z 8

# Summarise the per-read assignments at species level
kaiju2table \
  -t nodes.dmp -n names.dmp \
  -r species \
  -o kaiju_out/sampleA.kaiju.summary.tsv \
  kaiju_out/sampleA.kaiju.tsv

3. Outputs:

  • .kaiju.tsv: per-read assignments

  • .kaiju.summary.tsv: species‐level counts
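
A useful first check on Kaiju's per-read output is the fraction of reads that were classified at all. This sketch relies on the first column of that file being C (classified) or U (unclassified); kaiju_classified_pct is a hypothetical helper name.

```shell
# Print the percentage of reads Kaiju classified, to two decimal places.
# $1 = Kaiju per-read output file
kaiju_classified_pct() {
  awk -F'\t' '
    $1 == "C" { c++ }
    END {
      if (NR > 0) printf "%.2f\n", 100 * c / NR
      else        print "0.00"
    }
  ' "$1"
}

# Usage:
#   kaiju_classified_pct kaiju_out/sampleA.kaiju.tsv
```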

Centrifuge

1. Build or download database:

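# Taxonomy dump, RefSeq bacterial and viral genomes, and the human reference
centrifuge-download -o db/taxonomy taxonomy
centrifuge-download -o db/library -P 8 -m -d "bacteria,viral" refseq > db/seqid2taxid.map
centrifuge-download -o db/library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 \
  -c 'reference genome' refseq >> db/seqid2taxid.map
cat db/library/*/*.fna > db/input-sequences.fna
centrifuge-build -p 8 --conversion-table db/seqid2taxid.map \
  --taxonomy-tree db/taxonomy/nodes.dmp --name-table db/taxonomy/names.dmp \
  db/input-sequences.fna db/centdb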

2. Classify reads:

mkdir -p centrifuge_out
centrifuge \
  -x db/centdb \
  -1 trimmed/SampleA_R1.trimmed.fastq.gz \
  -2 trimmed/SampleA_R2.trimmed.fastq.gz \
  -S centrifuge_out/sampleA.tsv \
  --report-file centrifuge_out/sampleA.report.tsv \
  -p 8

3. Outputs:

  • sampleA.tsv: read‐level assignments

  • sampleA.report.tsv: aggregated report
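
The aggregated Centrifuge report is a tab-separated table with a header row, so ranking taxa takes only a few lines. This sketch looks up the numUniqueReads column by name rather than by position; centrifuge_top is a hypothetical helper name.

```shell
# Print the N taxa with the most uniquely assigned reads from a Centrifuge report.
# Output: <numUniqueReads><TAB><taxon name>
# $1 = report file, $2 = number of rows to show (default 10)
centrifuge_top() {
  local report="$1" n="${2:-10}"
  awk -F'\t' '
    NR == 1 {                           # header: find the numUniqueReads column
      for (i = 1; i <= NF; i++) if ($i == "numUniqueReads") col = i
      next
    }
    col { print $col "\t" $1 }
  ' "$report" \
    | sort -t "$(printf '\t')" -k1,1nr \
    | head -n "$n"
}

# Usage:
#   centrifuge_top centrifuge_out/sampleA.report.tsv 5
```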

Visualization & Downstream

  • Krona: convert Kaiju's per-read output to an interactive HTML chart:
kaiju2krona -t nodes.dmp -n names.dmp \
  -i kaiju_out/sampleA.kaiju.tsv -o kaiju_out/sampleA.krona.txt
ktImportText -o kaiju_out/sampleA.krona.html kaiju_out/sampleA.krona.txt

  • R phyloseq: import Bracken or MetaPhlAn tables for diversity analyses.
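
As a rough sanity check before moving to R, a Shannon diversity index can be computed directly from a count column. This is a shell stand-in for one of the measures phyloseq would report, not a replacement for it. The sketch assumes a tab-separated table with one header row; for a Bracken output, the estimated read counts (new_est_reads) are in column 6. shannon is a hypothetical helper name.

```shell
# Shannon index H = -sum(p * ln p) over the positive counts in one column.
# $1 = tab-separated table with a header row, $2 = 1-based column holding counts
shannon() {
  awk -F'\t' -v col="$2" '
    NR > 1 && $col > 0 { x[NR] = $col; total += $col }
    END {
      for (k in x) { p = x[k] / total; h -= p * log(p) }
      printf "%.4f\n", h
    }
  ' "$1"
}

# Usage:
#   shannon kraken2_out/sampleA.bracken.tsv 6
```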