Metagenomic Assembly & Binning - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
6.2.6 Metagenomic Assembly & Binning
Recovering genomes from metagenomes proceeds by assembling reads into contigs, binning contigs into draft genomes (MAGs), and then refining and dereplicating those bins.
A. Assemblers
1. MEGAHIT (ultra-fast, memory-efficient)
conda install -c bioconda megahit
# Run MEGAHIT on paired trimmed reads (the -o directory must not already exist)
megahit \
-1 trimmed/SampleA_R1.trimmed.fastq.gz \
-2 trimmed/SampleA_R2.trimmed.fastq.gz \
-o assembly/megahit_out \
--min-contig-len 1000 \
-t 16
# Output directory: assembly/megahit_out
# ├─ final.contigs.fa # assembled contigs ≥ 1 kb
# ├─ intermediate_contigs/ # per-k-mer intermediate contigs
# └─ log # run log
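A quick sanity check after assembly is to look at contig count, total length and N50. The following is a minimal sketch using only `awk` and `sort` on a toy FASTA with made-up sequences; `assembly_stats` is a hypothetical helper name, not part of MEGAHIT.

```shell
# Toy FASTA standing in for assembly/megahit_out/final.contigs.fa (made-up data)
workdir=$(mktemp -d)
fasta="$workdir/toy.contigs.fa"
cat > "$fasta" <<'EOF'
>contig_1
ACGTACGTAC
>contig_2
ACGTAC
>contig_3
ACGT
ACGT
EOF

# assembly_stats (hypothetical helper): prints "<n_contigs> <total_bp> <N50>"
assembly_stats() {
  # First awk: one length per record, summing wrapped sequence lines
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
  sort -rn |
  # Second awk: walk lengths from longest to shortest until half the total is covered
  awk '{ lens[NR] = $1; total += $1 }
       END {
         half = total / 2; run = 0
         for (i = 1; i <= NR; i++) {
           run += lens[i]
           if (run >= half) { print NR, total, lens[i]; exit }
         }
       }'
}

stats=$(assembly_stats "$fasta")
echo "contigs total_bp N50: $stats"
```

To use it on a real assembly, call `assembly_stats assembly/megahit_out/final.contigs.fa`.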
2. metaSPAdes (high‐quality, metagenome-aware)
conda install -c bioconda spades
# Run metaSPAdes
spades.py \
--meta \
-1 trimmed/SampleA_R1.trimmed.fastq.gz \
-2 trimmed/SampleA_R2.trimmed.fastq.gz \
-o assembly/metaspades_out \
-t 16 \
-m 60
# Output: assembly/metaspades_out
# ├─ contigs.fasta # assembled contigs
# ├─ scaffolds.fasta # scaffolded sequences
# └─ spades.log, assembly_graph.fastg, etc.
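Unlike the MEGAHIT command above, this metaSPAdes run applies no minimum contig length, so `contigs.fasta` may contain short contigs that binners cannot use. The sketch below shows one way to filter a FASTA by length with `awk`; the toy input, the threshold of 8 bases and the helper name `filter_fasta_by_length` are illustrative assumptions.

```shell
# Toy FASTA standing in for assembly/metaspades_out/contigs.fasta (made-up data)
workdir=$(mktemp -d)
in_fa="$workdir/toy.contigs.fasta"
out_fa="$workdir/toy.contigs.min8.fasta"
cat > "$in_fa" <<'EOF'
>NODE_1
ACGTACGTAC
>NODE_2
ACGTAC
>NODE_3
ACGT
ACGT
EOF

# filter_fasta_by_length (hypothetical helper): keep records with >= min_len bases.
# Wrapped sequence lines are joined, so each kept record is written on two lines.
filter_fasta_by_length() {
  awk -v min_len="$2" '
    function emit() { if (hdr != "" && length(seq) >= min_len) print hdr "\n" seq }
    /^>/ { emit(); hdr = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }' "$1"
}

filter_fasta_by_length "$in_fa" 8 > "$out_fa"
kept=$(grep -c '^>' "$out_fa")
echo "kept $kept contigs"
```

For real data a threshold such as 1000 would match the `--min-contig-len 1000` used for MEGAHIT.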
B. Binning Draft Genomes
Before binning, map reads back to contigs to get depth information:
# 1) Build index on contigs
bowtie2-build assembly/megahit_out/final.contigs.fa assembly/megahit_contigs
# 2) Align reads and convert SAM to BAM (the trailing "-" tells samtools to read stdin)
bowtie2 -x assembly/megahit_contigs \
-1 trimmed/SampleA_R1.trimmed.fastq.gz \
-2 trimmed/SampleA_R2.trimmed.fastq.gz \
-p 8 | samtools view -b -o assembly/megahit.bam -
# 3) Sort & index
samtools sort -@ 8 -o assembly/megahit.sorted.bam assembly/megahit.bam
samtools index assembly/megahit.sorted.bam
# 4) Depth per contig
jgi_summarize_bam_contig_depths \
--outputDepth assembly/depth.txt \
assembly/megahit.sorted.bam
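Before binning it can help to inspect the depth table. The sketch below assumes the layout written for a single BAM file (contigName, contigLen, totalAvgDepth, then a mean and a variance column per BAM) and uses made-up values; the threshold of 5x is an arbitrary illustration, not a recommendation from any tool.

```shell
# Toy depth table standing in for assembly/depth.txt (made-up values)
workdir=$(mktemp -d)
depth="$workdir/depth.txt"
printf 'contigName\tcontigLen\ttotalAvgDepth\tmegahit.sorted.bam\tmegahit.sorted.bam-var\n' > "$depth"
printf 'k141_1\t2000\t10.0\t10.0\t4.0\n' >> "$depth"
printf 'k141_2\t1000\t4.0\t4.0\t1.0\n' >> "$depth"
printf 'k141_3\t1000\t22.0\t22.0\t9.0\n' >> "$depth"

# Length-weighted mean depth across all contigs (header line skipped)
mean_depth=$(awk -F'\t' 'NR > 1 { bp += $2; cov += $2 * $3 }
                         END { if (bp) printf "%.2f", cov / bp }' "$depth")

# Names of contigs whose average depth falls below the chosen threshold
low_depth=$(awk -F'\t' -v min=5 'NR > 1 && $3 < min { print $1 }' "$depth")

echo "mean depth: $mean_depth"
echo "low-depth contigs: $low_depth"
```

Contigs with very low depth carry little coverage signal, so a long list here suggests that binning results should be interpreted with care.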
1. MetaBAT2
conda install -c bioconda metabat2
mkdir -p bins/metabat2
metabat2 \
-i assembly/megahit_out/final.contigs.fa \
-a assembly/depth.txt \
-o bins/metabat2/bin
# Output: bins/metabat2/bin.1.fa, bin.2.fa, …
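Once bins exist, a per-bin summary of contig count and total size gives a first impression of the result. This is a minimal sketch over a toy directory with made-up bins; in practice `bin_dir` would point at `bins/metabat2`.

```shell
# Toy bin directory standing in for bins/metabat2 (made-up sequences)
workdir=$(mktemp -d)
bin_dir="$workdir/bins"
mkdir -p "$bin_dir"
printf '>k141_1\nACGTACGT\n>k141_2\nACGT\n' > "$bin_dir/bin.1.fa"
printf '>k141_3\nACGTAC\n' > "$bin_dir/bin.2.fa"

# One line per bin: name, number of contigs, total bases
summary=$(
  for fa in "$bin_dir"/*.fa; do
    awk -v name="$(basename "$fa" .fa)" '
      /^>/ { n++; next }
      { bp += length($0) }
      END { printf "%s\t%d\t%d\n", name, n, bp }' "$fa"
  done
)
printf 'bin\tcontigs\tbp\n%s\n' "$summary"
```

The same loop works for MaxBin2 output if the glob and the `basename` suffix are changed to `.fasta`.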
2. MaxBin2
conda install -c bioconda maxbin2
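mkdir -p bins/maxbin2
# MaxBin2 expects a two-column abundance file (contig name, depth) without a header,
# so extract contigName and totalAvgDepth from the depth table first
cut -f1,3 assembly/depth.txt | tail -n +2 > assembly/maxbin2_abund.txt
run_MaxBin.pl \
-contig assembly/megahit_out/final.contigs.fa \
-abund assembly/maxbin2_abund.txt \
-out bins/maxbin2/run \
-thread 8
# Output: bins/maxbin2/run.001.fasta, run.002.fasta, …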
3. CONCOCT
conda install -c bioconda concoct
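# CONCOCT works on contigs cut into 10 kb chunks; the BED file records chunk coordinates
cut_up_fasta.py \
assembly/megahit_out/final.contigs.fa \
-c 10000 -o 0 --merge_last \
-b assembly/contigs_10K.bed > assembly/contigs_10K.fa
# Per-chunk coverage from the sorted, indexed BAM produced above
concoct_coverage_table.py \
assembly/contigs_10K.bed \
assembly/megahit.sorted.bam > assembly/concoct_coverage.tsv
# The trailing slash makes -b an output directory rather than a file prefix
concoct \
-c 400 \
--composition_file assembly/contigs_10K.fa \
--coverage_file assembly/concoct_coverage.tsv \
-b assembly/concoct_out/
# Merge chunk-level clusters back to the original contigs
mkdir -p bins
merge_cutup_clustering.py \
assembly/concoct_out/clustering_gt1000.csv > bins/concoct_bins.csv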
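CONCOCT reports its result as a clustering table rather than as FASTA files, so it is worth checking how many contigs ended up in each cluster. The sketch below assumes a two-column CSV with a `contig_id,cluster_id` header and uses made-up assignments.

```shell
# Toy clustering table standing in for bins/concoct_bins.csv (made-up data)
workdir=$(mktemp -d)
csv="$workdir/concoct_bins.csv"
cat > "$csv" <<'EOF'
contig_id,cluster_id
k141_1,0
k141_2,0
k141_3,1
k141_4,2
k141_5,0
EOF

# Count contigs per cluster, skipping the header, sorted by cluster id
cluster_sizes=$(awk -F',' 'NR > 1 { n[$2]++ }
                           END { for (c in n) print c "\t" n[c] }' "$csv" |
                sort -k1,1n)
printf 'cluster\tcontigs\n%s\n' "$cluster_sizes"
```

Clusters holding only one or two contigs are usually too small to represent a genome and can be treated with suspicion.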
C. Refinement & Dereplication
1. DASTool (integrates multiple bin sets)
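conda install -c bioconda das_tool
# DAS Tool takes one tab-separated contig-to-bin table per binner, not lists of FASTA files
# (older releases name the helper script Fasta_to_Scaffolds2Bin.sh)
Fasta_to_Contig2Bin.sh -i bins/metabat2 -e fa > bins/metabat2.contigs2bin.tsv
Fasta_to_Contig2Bin.sh -i bins/maxbin2 -e fasta > bins/maxbin2.contigs2bin.tsv
# CONCOCT: drop the CSV header and convert "contig,cluster" to "contig<TAB>concoct.cluster"
tail -n +2 bins/concoct_bins.csv | awk -F',' '{ print $1 "\tconcoct." $2 }' > bins/concoct.contigs2bin.tsv
# Run DAS Tool (--write_bins exports the refined bins as FASTA; older releases expect "--write_bins 1")
DAS_Tool \
-i bins/metabat2.contigs2bin.tsv,bins/maxbin2.contigs2bin.tsv,bins/concoct.contigs2bin.tsv \
-l metabat2,maxbin2,concoct \
-c assembly/megahit_out/final.contigs.fa \
-o bins/DASTool \
--write_bins \
--threads 8
# Output: bins/DASTool_DASTool_bins/*.fa (refined, non‐redundant)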
2. dRep (dereplicates MAGs across samples/assemblies)
conda install -c bioconda drep
# Dereplicate all bins from DASTool (quality filtering needs CheckM installed or a --genomeInfo table)
dRep dereplicate \
derep_out/ \
-g bins/DASTool_DASTool_bins/*.fa \
--processors 8 \
--completeness 50 \
--contamination 10
# Output in derep_out/: a set of representative, dereplicated genomes
D. Summary of Outputs
assembly/
├─ megahit_out/final.contigs.fa
└─ metaspades_out/contigs.fasta
bins/
├─ metabat2/bin.*.fa
├─ maxbin2/run.*.fasta
├─ concoct_bins.csv
└─ DASTool_DASTool_bins/*.fa
derep_out/
└─ dereplicated_genomes/*.fa
Best Practices:
- Check MAG quality with CheckM or BUSCO before dereplication.
- Use DASTool to leverage strengths of multiple binners.
- dRep clusters similar bins and picks the highest‐quality representative (e.g. highest completeness, lowest contamination).
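The quality check in the first point can be turned into a simple pre-filter. The sketch below assumes a CSV with `genome`, `completeness` and `contamination` columns (a hypothetical layout with made-up values; adapt the column numbers to the actual CheckM or BUSCO report) and applies the same 50 % completeness and 10 % contamination limits used in the dRep command above.

```shell
# Toy quality table (hypothetical file name and made-up values)
workdir=$(mktemp -d)
quality="$workdir/genome_quality.csv"
cat > "$quality" <<'EOF'
genome,completeness,contamination
bin.1.fa,96.2,1.3
bin.2.fa,48.0,0.5
bin.3.fa,72.5,14.8
bin.4.fa,55.1,9.9
EOF

# Keep genomes with completeness >= 50 and contamination <= 10 (header skipped)
passing=$(awk -F',' -v min_comp=50 -v max_cont=10 \
  'NR > 1 && $2 >= min_comp && $3 <= max_cont { print $1 }' "$quality")

echo "bins passing the quality filter:"
echo "$passing"
```

Here `bin.2.fa` fails on completeness and `bin.3.fa` on contamination, so only the other two bins would be carried forward.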