Metagenomics - igheyas/Bioinformatics GitHub Wiki
create your environment using the correct package names
cd /mnt/c/Users/IAGhe/OneDrive/Documents/Bioinformatics
# make sure you have conda-forge and bioconda in your channels,
# and set strict priority so art actually comes from bioconda:
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --set channel_priority strict
conda create -n metagenomics_toy \
-c conda-forge \
-c bioconda \
python=3.9 \
biopython \
art \
fastp \
kraken2 \
megahit \
prokka
conda activate metagenomics_toy
which art_illumina # should point at $CONDA_PREFIX/bin/art_illumina
Install Jupyter Notebook
conda install -c conda-forge notebook -y
#Launch the notebook
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
http://127.0.0.1:8888/tree?token=302876c976feed1000e40e137c0694d0171b67573ecf8a0a
2. Generate three toy “genomes” (1 kb each)
Create a script generate_toy_genomes.py: In a jupyter notebook cell, run the following:
#!/usr/bin/env python3
import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
def random_seq(n):
return ''.join(random.choices("ACGT", k=n))
records = []
for i in range(1,4):
seq = random_seq(1000)
rec = SeqRecord(Seq(seq),
id=f"genome{i}",
description="toy genome")
records.append(rec)
with open("data/genomes.fasta", "w") as out:
SeqIO.write(records, out, "fasta")
print("Wrote 3 toy genomes to data/genomes.fasta")
# make directory and run
mkdir -p data
chmod +x generate_toy_genomes.py
./generate_toy_genomes.py
3. Simulate paired-end Illumina reads
We’ll use ART (HiSeq2500 profile, 100 bp reads, insert 200±10 bp):
mkdir -p data/simulated
art_illumina \
-ss HS25 \
-i data/genomes.fasta \
-p \
-l 100 \
-f 10 \
-m 200 \
-s 10 \
-o data/simulated/toy
This produces:
-
data/simulated/toy1.fq
-
data/simulated/toy2.fq
4. Quality control with Fastp
mkdir -p qc
fastp \
-i data/simulated/toy1.fq \
-I data/simulated/toy2.fq \
-o qc/clean_1.fq \
-O qc/clean_2.fq \
-h qc/fastp.html \
-j qc/fastp.json
- Open qc/fastp.html in your browser to inspect per-base quality, adapter trimming, etc.
5. Taxonomic profiling with Kraken2
5a. Build a toy Kraken2 DB
Step 5a.1. Remove any partial DB and make a fresh one
rm -rf kraken_db
mkdir -p kraken_db
Step 5a.2. Download only the core NCBI taxonomy (names & nodes)
kraken2-build --download-taxonomy --skip-maps --db kraken_db
This pulls just names.dmp & nodes.dmp (few MB), not the giant accession maps.
Step 5a.3. Tag each toy genome with a valid taxID
Let’s pretend:
-
genome1 → taxID 562
-
genome2 → taxID 561
-
genome3 → taxID 562
Run:
sed -e 's/^>genome1/>genome1|kraken:taxid|562|/' \
-e 's/^>genome2/>genome2|kraken:taxid|561|/' \
-e 's/^>genome3/>genome3|kraken:taxid|562|/' \
data/genomes.fasta > data/genomes_taxid.fasta
- Open up data/genomes_taxid.fasta to confirm your headers now look like:
>genome1|kraken:taxid|562| toy genome
ACGT…
Step 5a.4. Add your tagged FASTA to the library
kraken2-build --add-to-library data/genomes_taxid.fasta --db kraken_db
- You should see “added to library” messages.
Step 5a.5. Build the k-mer DB
kraken2-build --build --db kraken_db
-This step indexes only your three 1 kb sequences—takes < 30 s on your laptop.
Step 5b. Classify your QC’d reads
Assuming you did Fastp and have qc/clean_1.fq & qc/clean_2.fq:
#Step 5b: 1) Create the output folder and run Kraken2 classification
# from your project root
mkdir -p kraken_out
kraken2 \
--db kraken_db \
--paired qc/clean_1.fq qc/clean_2.fq \
--report kraken_out/report.txt \
--output kraken_out/output.txt
This will write:
-
kraken_out/report.txt ← tab-delimited summary
-
kraken_out/output.txt ← per-read assignments
You can verify it’s there with:
ls -lh kraken_out/report.txt
6. Assemble with MEGAHIT
# (make sure you’re in your project root)
megahit \
-1 qc/clean_1.fq \
-2 qc/clean_2.fq \
-o assembly
- When it finishes you’ll have
assembly/final.contigs.fa
7. Annotate contigs with Prokka
prokka \
--outdir prokka \
--prefix toy \
--cpus 4 \
assembly/final.contigs.fa
Outputs you’ll care about:
-
prokka/toy.gff ← all feature annotations
-
prokka/toy.faa ← predicted protein sequences
-
prokka/toy.ffn ← predicted CDS nucleotide sequences
8. Parse & visualize in Python
Save the following script as analysis.py in your project folder (notebook cell):
%%writefile analysis.py
#!/usr/bin/env python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from Bio import SeqIO
# 1) Taxonomic report
rep = pd.read_csv(
"kraken_out/report.txt",
sep="\t",
header=None,
names=["pct","clade_reads","taxon_reads","rank","taxid","name"]
)
top = rep.sort_values("taxon_reads", ascending=False).head(5)
print("\nTop 5 taxa by read count:")
print(top["name","taxon_reads"](/igheyas/Bioinformatics/wiki/"name","taxon_reads"))
plt.figure(figsize=(6,4))
sns.barplot(data=top, y="name", x="taxon_reads", palette="mako")
plt.xlabel("Reads assigned")
plt.ylabel("")
plt.tight_layout()
plt.savefig("kraken_out/top5_taxa.png")
plt.close()
# 2) Assembly stats
contigs = list(SeqIO.parse("assembly/final.contigs.fa", "fasta"))
lengths = np.array([len(c.seq) for c in contigs])
print(f"\nAssembly: {len(lengths)} contigs, {lengths.sum()} bp total")
plt.figure(figsize=(6,4))
sns.histplot(lengths, bins=min(20, len(lengths)), color="steelblue")
plt.xlabel("Contig length (bp)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("assembly/contig_length_hist.png")
plt.close()
# 3) Prokka features (robust parsing)
gff = "prokka/toy.gff"
cds = rrna = trna = 0
with open(gff) as fh:
for line in fh:
line = line.strip()
# skip blank lines, comments, FASTA headers
if not line or line.startswith("#") or line.startswith(">"):
continue
cols = line.split("\t")
if len(cols) < 3:
continue
feat = cols[2]
if feat == "CDS":
cds += 1
elif feat == "rRNA":
rrna += 1
elif feat == "tRNA":
trna += 1
print(f"\nProkka found: {cds} CDS, {rrna} rRNA, {trna} tRNA")
conda activate metagenomics_toy
conda install -c conda-forge pandas numpy matplotlib seaborn -y
Re-run your analysis script
Now that the report exists, rerun:
python analysis.py
Quick sanity check
If you ever want to double-check your working directory and file paths, you can do:
pwd
ls -R .
and make sure you see:
.
├── qc
│ ├── clean_1.fq
│ └── clean_2.fq
├── kraken_db
├── kraken_out
│ ├── output.txt
│ └── report.txt
├── assembly
│ └── final.contigs.fa
├── prokka
│ └── toy.gff
└── analysis.py