Metagenomics - igheyas/Bioinformatics GitHub Wiki

create your environment using the correct package names

cd /mnt/c/Users/IAGhe/OneDrive/Documents/Bioinformatics
# make sure you have conda-forge and bioconda in your channels,
# and set strict priority so art actually comes from bioconda:
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --set channel_priority strict

conda create -n metagenomics_toy \
  -c conda-forge \
  -c bioconda \
  python=3.9 \
  biopython \
  art \
  fastp \
  kraken2 \
  megahit \
  prokka
conda activate metagenomics_toy
which art_illumina   # should point at $CONDA_PREFIX/bin/art_illumina

Install Jupyter Notebook

conda install -c conda-forge notebook -y

#Launch the notebook

jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
http://127.0.0.1:8888/tree?token=302876c976feed1000e40e137c0694d0171b67573ecf8a0a

2. Generate three toy “genomes” (1 kb each)

Create a script generate_toy_genomes.py: In a jupyter notebook cell, run the following:

#!/usr/bin/env python3
import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

def random_seq(n):
    return ''.join(random.choices("ACGT", k=n))

records = []
for i in range(1,4):
    seq = random_seq(1000)
    rec = SeqRecord(Seq(seq),
                    id=f"genome{i}",
                    description="toy genome")
    records.append(rec)

with open("data/genomes.fasta", "w") as out:
    SeqIO.write(records, out, "fasta")

print("Wrote 3 toy genomes to data/genomes.fasta")

# make directory and run
mkdir -p data
chmod +x generate_toy_genomes.py
./generate_toy_genomes.py

3. Simulate paired-end Illumina reads

We’ll use ART (HiSeq2500 profile, 100 bp reads, insert 200±10 bp):

mkdir -p data/simulated
art_illumina \
  -ss HS25 \
  -i data/genomes.fasta \
  -p \
  -l 100 \
  -f 10 \
  -m 200 \
  -s 10 \
  -o data/simulated/toy

This produces:

  • data/simulated/toy1.fq

  • data/simulated/toy2.fq

4. Quality control with Fastp

mkdir -p qc
fastp \
  -i data/simulated/toy1.fq \
  -I data/simulated/toy2.fq \
  -o qc/clean_1.fq \
  -O qc/clean_2.fq \
  -h qc/fastp.html \
  -j qc/fastp.json

  • Open qc/fastp.html in your browser to inspect per-base quality, adapter trimming, etc.

5. Taxonomic profiling with Kraken2

5a. Build a toy Kraken2 DB

Step 5a.1. Remove any partial DB and make a fresh one

rm -rf kraken_db
mkdir -p kraken_db

Step 5a.2. Download only the core NCBI taxonomy (names & nodes)

kraken2-build --download-taxonomy --skip-maps --db kraken_db

This pulls just names.dmp & nodes.dmp (few MB), not the giant accession maps.

Step 5a.3. Tag each toy genome with a valid taxID

Let’s pretend:

  • genome1 → taxID 562

  • genome2 → taxID 561

  • genome3 → taxID 562

Run:

sed -e 's/^>genome1/>genome1|kraken:taxid|562|/' \
    -e 's/^>genome2/>genome2|kraken:taxid|561|/' \
    -e 's/^>genome3/>genome3|kraken:taxid|562|/' \
  data/genomes.fasta > data/genomes_taxid.fasta

  • Open up data/genomes_taxid.fasta to confirm your headers now look like:
>genome1|kraken:taxid|562|  toy genome
ACGT…

Step 5a.4. Add your tagged FASTA to the library

kraken2-build --add-to-library data/genomes_taxid.fasta --db kraken_db

  • You should see “added to library” messages.

Step 5a.5. Build the k-mer DB

kraken2-build --build --db kraken_db

-This step indexes only your three 1 kb sequences—takes < 30 s on your laptop.

Step 5b. Classify your QC’d reads

Assuming you did Fastp and have qc/clean_1.fq & qc/clean_2.fq:

#Step 5b: 1) Create the output folder and run Kraken2 classification

# from your project root
mkdir -p kraken_out

kraken2 \
  --db kraken_db \
  --paired qc/clean_1.fq qc/clean_2.fq \
  --report kraken_out/report.txt \
  --output kraken_out/output.txt

This will write:

  • kraken_out/report.txt ← tab-delimited summary

  • kraken_out/output.txt ← per-read assignments

You can verify it’s there with:

ls -lh kraken_out/report.txt

6. Assemble with MEGAHIT

# (make sure you’re in your project root)
megahit \
  -1 qc/clean_1.fq \
  -2 qc/clean_2.fq \
  -o assembly

  • When it finishes you’ll have
assembly/final.contigs.fa

7. Annotate contigs with Prokka

prokka \
  --outdir prokka \
  --prefix toy \
  --cpus 4 \
  assembly/final.contigs.fa

Outputs you’ll care about:

  • prokka/toy.gff ← all feature annotations

  • prokka/toy.faa ← predicted protein sequences

  • prokka/toy.ffn ← predicted CDS nucleotide sequences

8. Parse & visualize in Python

Save the following script as analysis.py in your project folder (notebook cell):

%%writefile analysis.py
#!/usr/bin/env python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO

# 1) Taxonomic report
rep = pd.read_csv(
    "kraken_out/report.txt",
    sep="\t",
    header=None,
    names=["pct","clade_reads","taxon_reads","rank","taxid","name"]
)
top = rep.sort_values("taxon_reads", ascending=False).head(5)
print("\nTop 5 taxa by read count:")
print(top["name","taxon_reads"](/igheyas/Bioinformatics/wiki/"name","taxon_reads"))

plt.figure(figsize=(6,4))
sns.barplot(data=top, y="name", x="taxon_reads", palette="mako")
plt.xlabel("Reads assigned")
plt.ylabel("")
plt.tight_layout()
plt.savefig("kraken_out/top5_taxa.png")
plt.close()

# 2) Assembly stats
contigs = list(SeqIO.parse("assembly/final.contigs.fa", "fasta"))
lengths = np.array([len(c.seq) for c in contigs])
print(f"\nAssembly: {len(lengths)} contigs, {lengths.sum()} bp total")

plt.figure(figsize=(6,4))
sns.histplot(lengths, bins=min(20, len(lengths)), color="steelblue")
plt.xlabel("Contig length (bp)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("assembly/contig_length_hist.png")
plt.close()

# 3) Prokka features (robust parsing)
gff = "prokka/toy.gff"
cds = rrna = trna = 0
with open(gff) as fh:
    for line in fh:
        line = line.strip()
        # skip blank lines, comments, FASTA headers
        if not line or line.startswith("#") or line.startswith(">"):
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue
        feat = cols[2]
        if feat == "CDS":
            cds += 1
        elif feat == "rRNA":
            rrna += 1
        elif feat == "tRNA":
            trna += 1

print(f"\nProkka found: {cds} CDS, {rrna} rRNA, {trna} tRNA")

conda activate metagenomics_toy
conda install -c conda-forge pandas numpy matplotlib seaborn -y

Re-run your analysis script

Now that the report exists, rerun:

python analysis.py

Quick sanity check

If you ever want to double-check your working directory and file paths, you can do:

pwd
ls -R .

and make sure you see:

.
├── qc
│   ├── clean_1.fq
│   └── clean_2.fq
├── kraken_db
├── kraken_out
│   ├── output.txt
│   └── report.txt
├── assembly
│   └── final.contigs.fa
├── prokka
│   └── toy.gff
└── analysis.py