
6.2.6 Metagenomic Assembly & Binning

Recovering genomes from metagenomes proceeds by assembling reads into contigs, binning contigs into draft genomes (MAGs), and then refining and dereplicating those bins.


A. Assemblers

1. MEGAHIT (ultra-fast, memory-efficient)

conda install -c bioconda megahit

# Run MEGAHIT on paired trimmed reads
megahit \
  -1 trimmed/SampleA_R1.trimmed.fastq.gz \
  -2 trimmed/SampleA_R2.trimmed.fastq.gz \
  -o assembly/megahit_out \
  --min-contig-len 1000 \
  -t 16

# Output directory: assembly/megahit_out
# ├─ final.contigs.fa       # assembled contigs ≥ 1 kb
# ├─ log                    # run log
# └─ intermediate_contigs/  # contigs from each k-mer iteration
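
A quick sanity check on the assembly is to count contigs and compute total length and N50. The function below is a minimal sketch using only awk and sort; the name assembly_stats is a hypothetical helper, not part of MEGAHIT.

```shell
# assembly_stats FASTA: print contig count, total bp, and N50
# (hypothetical helper; handles multi-line FASTA records)
assembly_stats() {
  awk '
    /^>/ { if (len) print len; len = 0; next }   # emit length of the previous record
         { len += length($0) }                   # accumulate sequence length
    END  { if (len) print len }
  ' "$1" | sort -rn | awk '
    { lens[NR] = $1; total += $1 }
    END {
      half = total / 2
      # N50: length of the contig at which the running sum reaches half the total
      for (i = 1; i <= NR; i++) {
        run += lens[i]
        if (run >= half) { n50 = lens[i]; break }
      }
      printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
    }'
}

# Usage: assembly_stats assembly/megahit_out/final.contigs.fa
```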

2. metaSPAdes (high‐quality, metagenome-aware)

conda install -c bioconda spades

# Run metaSPAdes
spades.py \
  --meta \
  -1 trimmed/SampleA_R1.trimmed.fastq.gz \
  -2 trimmed/SampleA_R2.trimmed.fastq.gz \
  -o assembly/metaspades_out \
  -t 16 \
  -m 60

# Output: assembly/metaspades_out
# ├─ contigs.fasta          # assembled contigs
# ├─ scaffolds.fasta        # scaffolded sequences
# └─ spades.log, assembly_graph.fastg, etc.
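
The MEGAHIT run above drops contigs shorter than 1 kb through --min-contig-len, while the metaSPAdes command applies no comparable cutoff. To compare or bin the two assemblies on equal terms, short contigs can be filtered afterwards. This is a sketch with a hypothetical helper name; the output file name in the usage line is also an assumption.

```shell
# filter_fasta_minlen FASTA MINLEN: print only records with sequence length >= MINLEN
# (hypothetical helper; each kept sequence is written on a single line)
filter_fasta_minlen() {
  awk -v min="$2" '
    function emit() { if (hdr != "" && length(seq) >= min) print hdr "\n" seq }
    /^>/ { emit(); hdr = $0; seq = ""; next }   # new record: write out the previous one
         { seq = seq $0 }                       # join wrapped sequence lines
    END  { emit() }
  ' "$1"
}

# Usage:
# filter_fasta_minlen assembly/metaspades_out/contigs.fasta 1000 \
#   > assembly/metaspades_out/contigs.min1kb.fasta
```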

B. Binning Draft Genomes

Before binning, map reads back to contigs to get depth information:

# 1) Build index on contigs
bowtie2-build assembly/megahit_out/final.contigs.fa assembly/megahit_contigs

# 2) Align reads and convert to BAM
bowtie2 -x assembly/megahit_contigs \
  -1 trimmed/SampleA_R1.trimmed.fastq.gz \
  -2 trimmed/SampleA_R2.trimmed.fastq.gz \
  -p 8 | samtools view -b -o assembly/megahit.bam -

# 3) Sort & index
samtools sort -@ 8 -o assembly/megahit.sorted.bam assembly/megahit.bam
samtools index assembly/megahit.sorted.bam

# 4) Depth per contig
jgi_summarize_bam_contig_depths \
  --outputDepth assembly/depth.txt \
  assembly/megahit.sorted.bam
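
Before binning it can help to confirm that the depth table looks sensible. The sketch below reports the number of contigs and the length-weighted mean depth. It assumes the tab-separated layout with a header line and the columns contigName, contigLen, totalAvgDepth in positions 1 to 3; the function name is hypothetical.

```shell
# depth_summary DEPTH_TXT: contig count and length-weighted mean of totalAvgDepth
# (hypothetical helper; assumes a header line and the column order
#  contigName, contigLen, totalAvgDepth, ...)
depth_summary() {
  awk -F'\t' '
    NR == 1 { next }                        # skip the header line
    { n++; bp += $2; cov += $2 * $3 }       # weight each depth by contig length
    END {
      if (bp > 0) printf "contigs=%d total_bp=%d mean_depth=%.2f\n", n, bp, cov / bp
      else        print  "contigs=0 total_bp=0 mean_depth=NA"
    }' "$1"
}

# Usage: depth_summary assembly/depth.txt
```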

1. MetaBAT2

conda install -c bioconda metabat2

mkdir -p bins/metabat2
metabat2 \
  -i assembly/megahit_out/final.contigs.fa \
  -a assembly/depth.txt \
  -o bins/metabat2/bin
# Output: bins/metabat2/bin.1.fa, bin.2.fa, …

2. MaxBin2

conda install -c bioconda maxbin2

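mkdir -p bins/maxbin2
# MaxBin2 expects a two-column abundance file (contig, depth) without a header,
# so take contigName and totalAvgDepth from the MetaBAT-style depth table
tail -n +2 assembly/depth.txt | cut -f1,3 > assembly/maxbin2_abund.txt

run_MaxBin.pl \
  -contig assembly/megahit_out/final.contigs.fa \
  -abund assembly/maxbin2_abund.txt \
  -out bins/maxbin2/run \
  -thread 8
# Output: bins/maxbin2/run.*.fasta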

3. CONCOCT

conda install -c bioconda concoct

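# CONCOCT works on contigs cut into 10 kb chunks; the BED file records chunk positions
cut_up_fasta.py \
  assembly/megahit_out/final.contigs.fa \
  -c 10000 -o 0 --merge_last \
  -b assembly/contigs_10K.bed > assembly/contigs_10K.fa

# Per-chunk coverage from the sorted, indexed BAM
concoct_coverage_table.py \
  assembly/contigs_10K.bed \
  assembly/megahit.sorted.bam > assembly/concoct_coverage.tsv

# The trailing slash makes CONCOCT treat -b as an output directory
concoct \
  -c 400 \
  --composition_file assembly/contigs_10K.fa \
  --coverage_file assembly/concoct_coverage.tsv \
  -b assembly/concoct_out/

# Merge chunk assignments back onto the original contigs
merge_cutup_clustering.py \
  assembly/concoct_out/clustering_gt1000.csv > bins/concoct_bins.csv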
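
CONCOCT reports its bins as a contig-to-cluster table rather than as FASTA files, so a quick way to inspect the result is to count contigs per cluster. The sketch below assumes the merged CSV starts with a header line followed by contig,cluster rows; the function name is hypothetical.

```shell
# concoct_bin_sizes CSV: number of contigs per CONCOCT cluster, largest first
# (hypothetical helper; assumes a "contig_id,cluster_id" header line)
concoct_bin_sizes() {
  awk -F',' 'NR > 1 { n[$2]++ } END { for (b in n) print b "\t" n[b] }' "$1" |
    sort -k2,2nr -k1,1n    # by contig count (descending), then cluster id
}

# Usage: concoct_bin_sizes bins/concoct_bins.csv
```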

C. Refinement & Dereplication

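1. DAS Tool (integrates multiple bin sets)

conda install -c bioconda das_tool

# DAS Tool takes tab-separated contig-to-bin tables, not lists of FASTA files.
# (The helper script is named Fasta_to_Scaffolds2Bin.sh in older DAS Tool releases.)
Fasta_to_Contig2Bin.sh -i bins/metabat2 -e fa > bins/metabat2.contig2bin.tsv
Fasta_to_Contig2Bin.sh -i bins/maxbin2 -e fasta > bins/maxbin2.contig2bin.tsv

# CONCOCT CSV (contig_id,cluster_id) -> TSV, dropping the header line
tail -n +2 bins/concoct_bins.csv \
  | awk -F',' '{ print $1 "\tconcoct." $2 }' > bins/concoct.contig2bin.tsv

# Run DAS Tool; --write_bins exports the refined bins as FASTA
# (older releases expect "--write_bins 1")
DAS_Tool \
  -i bins/metabat2.contig2bin.tsv,bins/maxbin2.contig2bin.tsv,bins/concoct.contig2bin.tsv \
  -l metabat2,maxbin2,concoct \
  -c assembly/megahit_out/final.contigs.fa \
  -o bins/DASTool \
  --write_bins \
  --threads 8
# Output: bins/DASTool_DASTool_bins/*.fa  (refined, non-redundant)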

2. dRep (dereplicates MAGs across samples/assemblies)

conda install -c bioconda drep

# Dereplicate all bins from DASTool
dRep dereplicate \
  derep_out/ \
  -g bins/DASTool_DASTool_bins/*.fa \
  --processors 8 \
  --completeness 50 \
  --contamination 10
# Output: derep_out/dereplicated_genomes/*.fa  (one representative genome per cluster)
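
To preview which bins would survive the --completeness 50 and --contamination 10 cutoffs, the same thresholds can be applied to a quality table. The sketch below assumes a CSV with the columns genome, completeness, contamination (the layout dRep accepts through --genomeInfo) and treats boundary values as passing, which is an assumption rather than confirmed dRep behaviour. The function name and the file name in the usage line are hypothetical.

```shell
# passing_genomes CSV MIN_COMP MAX_CONT: names of genomes meeting both cutoffs
# (hypothetical helper; assumes header "genome,completeness,contamination"
#  and treats boundary values as passing)
passing_genomes() {
  awk -F',' -v comp="$2" -v cont="$3" '
    NR > 1 && $2 + 0 >= comp + 0 && $3 + 0 <= cont + 0 { print $1 }
  ' "$1"
}

# Usage: passing_genomes genome_info.csv 50 10
```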

D. Summary of Outputs

assembly/
├─ megahit_out/final.contigs.fa
└─ metaspades_out/contigs.fasta

bins/
├─ metabat2/bin.*.fa
├─ maxbin2/run.*.fasta
├─ concoct_bins.csv
└─ DASTool_DASTool_bins/*.fa

derep_out/
└─ dereplicated_genomes/*.fa

Best Practices:

  • Check MAG quality with CheckM or BUSCO before dereplication.

  • Use DAS Tool to combine the strengths of multiple binners.

  • dRep clusters near-identical genomes and keeps the highest-scoring representative of each cluster; the score favors high completeness, low contamination, and high contiguity (N50).