05 Binning - GeertsManon/EEG_Metagenomics GitHub Wiki

Introduction

Up until now, we've been preparing our data - assembling reads into contigs, calculating coverage, predicting genes, and estimating taxonomy. But we still have one big challenge: our contigs are a jumbled mixture from many different organisms. It's like having puzzle pieces from a dozen different puzzles all mixed together in one box.

Binning is the process of grouping contigs that likely belong to the same organism based on their characteristics. Think of it as sorting those mixed puzzle pieces back into their original boxes. Each "bin" represents our best attempt at reconstructing a single microbial genome from the metagenomic data.

We use two main types of information:

  1. Sequence composition (tetranucleotide frequency & GC content): Contigs from the same organism tend to have similar DNA patterns. Different organisms have different "genomic signatures" - some prefer certain combinations of four nucleotides (tetranucleotides), and some have higher or lower GC content (percentage of G and C bases).

  2. Coverage patterns: In a given sample, contigs from the same organism should have similar coverage depth. If organism A is twice as abundant as organism B, all of organism A's contigs should have roughly twice the coverage of organism B's contigs.

When contigs cluster together based on both sequence composition and coverage, there's a strong likelihood they come from the same genome!
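To make the idea of a "genomic signature" concrete, here is a minimal sketch of how GC content and tetranucleotide frequencies could be computed for a contig, and how two contigs could be compared. The function names and the toy sequences are illustrative assumptions, not part of Anvi'o, and the sketch ignores details that real binning tools usually handle (for example, treating a tetranucleotide and its reverse complement as the same signal).

```python
from collections import Counter
from itertools import product
import math

# All 256 possible tetranucleotides, in a fixed order.
ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def gc_content(seq):
    """Fraction of G and C among the unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_freq(seq):
    """Relative frequency of each tetranucleotide (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in ALL_TETRAMERS)  # skips windows with N etc.
    if total == 0:
        return [0.0] * len(ALL_TETRAMERS)
    return [counts[k] / total for k in ALL_TETRAMERS]


def composition_distance(seq_a, seq_b):
    """Euclidean distance between two tetranucleotide profiles."""
    fa, fb = tetranucleotide_freq(seq_a), tetranucleotide_freq(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


# Toy sequences (hypothetical, far shorter than real contigs).
contig_1 = "ACGTACGTACGTACGT"
contig_2 = "GGGCCCGGGCCCGGGC"
print(gc_content(contig_1), gc_content(contig_2))
print(composition_distance(contig_1, contig_2))
```

A small distance means the two contigs have similar signatures. On its own this is weak evidence, which is why composition is combined with coverage.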

Launching the interactive interface

To start the manual binning interface, run the following command and fill the X's with your personal VSC username:

anvi-interactive -p PROFILE/PROFILE.db -c contigs.db --port XXXXX

What this does:

It will open an interactive session based on different layers of information:

  • -p PROFILE/PROFILE.db: Your profile database (coverage information, sequence composition, clustering information)
  • -c contigs.db: Your contigs database (sequences, genes, annotations, taxonomy)

Remember that you have a nested SSH tunnel active (assuming you didn't close your Terminal or PowerShell window on your local machine). To access Anvi'o's web page, simply copy the address (http://localhost:XXXXX) into your local web browser:

image

In the web browser, click Draw.

image

What you'll see:

  • Central visualization: A circular dendrogram 🅐 illustrating how contigs group together based on their sequence composition. Each contig is represented as a branch in this dendrogram. Contigs with similar genomic signatures are thus placed near each other on the dendrogram.
  • Layers (rings): Different data types displayed as concentric rings around the dendrogram
    • Contig length 🅑
    • Coverage depth (how abundant each contig is) in log scale (this is adjustable if you want) 🅒
    • GC content (percentage of G and C nucleotides) 🅓
    • Taxonomy (ribosomal predictions) 🅔
  • Interactive controls: Left sidebar for adjusting visualization settings, selecting contigs, and creating bins
image

Settings

To facilitate manual binning, adjust the following settings:

  • In the Main tab:
image
  • In the Options tab:
image
  • Then click Draw again, which should result in:
image

The binning tab

Let's go to the tab Bin.

image

Check the checkbox Realtime taxonomy estimation.

A few words:

  • Realtime taxonomy estimation 🅐 - This shows the predicted taxonomic classification for each bin based on the single-copy core genes (SCGs) it contains. Remember when we ran anvi-estimate-scg-taxonomy? By that point, anvi-run-hmms had identified the SCGs across all your contigs and anvi-run-scg-taxonomy had assigned taxonomic labels to them. As you group contigs into bins, Anvi'o examines which SCGs (and their associated taxonomies) are present in each bin and provides a consensus taxonomic prediction. This helps you immediately see what organism you're likely reconstructing - for example, you might see "Bacteria > Pseudomonadota > Gammaproteobacteria > Methylococcales" for a bin containing methylotrophic bacteria.

  • Comp. (Completeness) 🅑 - This estimates what percentage of a complete genome your bin represents. Remember when we ran anvi-run-hmms to identify single-copy core genes? A complete bacterial genome should contain a full set of these SCGs (typically 100-140 markers depending on the collection). If your bin has 120 out of 139 expected SCGs, the completeness would be ~86%. Higher completeness means you've captured more of the organism's genome. Aim for ≥90% for high-quality bins, or ≥50% for medium-quality bins.

  • Red. (Redundancy/Contamination) 🅒 - This estimates how much contamination (sequences from other organisms) is in your bin. Because single-copy core genes should appear exactly once per genome, finding multiple copies suggests your bin contains sequences from more than one organism. If several SCGs appear 2-3 times, your redundancy/contamination might be 5-10%. Lower is better - you want each SCG to appear just once! Aim for <5% for high-quality bins, or <10% for medium-quality bins.
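The arithmetic behind these two numbers can be sketched in a few lines. This is a simplified illustration, not Anvi'o's exact formula: it assumes completeness is the share of expected SCGs found at least once, and redundancy is the share of expected SCGs found more than once. The SCG names and the hit counts are hypothetical placeholders chosen to reproduce the "120 out of 139" example.

```python
def completeness_and_redundancy(scg_hits, expected_scgs):
    """Return (completeness %, redundancy %) for one bin.

    scg_hits: dict mapping SCG name -> number of copies found in the bin.
    expected_scgs: list of SCG names in the marker collection.
    """
    n_expected = len(expected_scgs)
    found = sum(1 for gene in expected_scgs if scg_hits.get(gene, 0) >= 1)
    duplicated = sum(1 for gene in expected_scgs if scg_hits.get(gene, 0) > 1)
    return 100 * found / n_expected, 100 * duplicated / n_expected


# Hypothetical collection of 139 markers, 120 of which are found in the bin.
expected = [f"SCG_{i}" for i in range(139)]
hits = {gene: 1 for gene in expected[:120]}
# Pretend 7 of the found markers occur twice (possible contamination).
for gene in expected[:7]:
    hits[gene] = 2

completeness, redundancy = completeness_and_redundancy(hits, expected)
print(f"Completeness: {completeness:.1f}%  Redundancy: {redundancy:.1f}%")
```

With these inputs, completeness is 120/139 (about 86%) and redundancy is 7/139 (about 5%), which would put the bin in the medium-quality range.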

image

Creating our first bin

Now comes the exciting part - let's create our first bin!

Scan the circular visualization and look for regions where:

  • The dendrogram branches are clearly separated from the rest AND/OR
  • Coverage patterns look different (different heights in the coverage layer, colored in black) AND/OR
  • GC content stands out (different heights in the GC layer, colored in green)

For example, in the image below, observe the region at the lower left: it has notably high coverage (>60x). Since we previously adjusted our settings to cap the maximum coverage display at 60x, any area with coverage above this threshold will appear completely black. The branches cluster tightly together and are separated from neighboring regions. This clear visual pattern is exactly what you want for your first bin!

  1. Hover over the branches
image
  2. Click to select
  3. Check the instant feedback
image

As soon as you make a selection, the Bins panel (left sidebar) immediately shows:

  • Name: Assign a name for this bin (e.g., "Bin_1")
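  • Splits: Number of splits in your selection (e.g., 144). Anvi'o cuts long contigs into smaller pieces called splits, so this number can be higher than the number of contigs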
  • Len: Total length of selected contigs in Mbp (e.g., 3.03M)
  • Comp. (Completeness): Percentage of expected SCGs present (e.g., 98.6% ✅)
  • Red. (Redundancy/Contamination): Percentage of duplicated SCGs (e.g., 0.0% ✅)
  • Predicted taxonomy: Real-time taxonomic classification based on SCGs present (e.g., (s) Methyloglobulus sp016874115)

Evaluating our first bin

Now compare your bin's metrics to the quality standards:

| Quality Tier | Completeness | Contamination |
|---|---|---|
| High Quality (HQ) | ≥90% | <5% |
| Medium Quality (MQ) | ≥50% | <10% |
| Low Quality (LQ) | <50% | OR >10% |

In our example:

  • Completeness: 98.6% ✅ (≥90%)
  • Contamination: 0.0% ✅ (<5%)

🎉 Jackpot! Your first bin is high-quality!

This means you've successfully reconstructed a near-complete, uncontaminated genome from metagenomic contigs! You've just recovered the genome of Methyloglobulus sp016874115 - a methylotrophic bacterium from the DRC cave.
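The quality thresholds used in this section can be written as a small helper, which is handy when you have several bins to grade. This is only a sketch of the tiers as stated on this page; the function name is an assumption, and a contamination of exactly 10% (which the tiers leave unspecified) is treated as low quality here.

```python
def quality_tier(completeness, contamination):
    """Classify a bin from its completeness and contamination (both in %)."""
    if completeness >= 90 and contamination < 5:
        return "HQ"
    if completeness >= 50 and contamination < 10:
        return "MQ"
    return "LQ"


# The first bin from this page: 98.6% complete, 0.0% contamination.
print(quality_tier(98.6, 0.0))
```

Note that a very complete bin can still drop a tier because of contamination: `quality_tier(95, 7)` gives "MQ" and `quality_tier(95, 12)` gives "LQ".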

Creating additional bins

Great work on your first bin! Now, let's continue and reconstruct more genomes from this cave sample.

Step 1: Find 5 high-quality prokaryotic bins

For this practical session, focus on identifying 5 high-quality prokaryotic bins.

Why this target? Remember when we ran anvi-estimate-scg-taxonomy and examined the taxonomy_summary.txt file? That output showed us 13 distinct ribosomal genes from different organisms, suggesting roughly 10-15 genomes are present in this sample. However, not all of them will be easy to bin:

  • Some organisms are low abundance with fragmented assemblies (i.e., incomplete)
  • This sample also contains eukaryotic and viral contigs that won't be detected by bacterial/archaeal SCGs
  • Some closely related strains may be difficult to separate cleanly

⚠️ ⚠️ ⚠️ CRITICAL: Starting a new bin

Before selecting contigs for a new bin, you MUST click the "+" (plus) button in the Bins panel!

Here's why:

  • If you don't click "+", Anvi'o thinks you're modifying your existing bin
  • Your new selection will be added to (or replace) the previous bin
  • This can accidentally contaminate your carefully curated bins!

Step 2: Recalculate taxonomy

Click the Recalculate / Show Taxonomy for Bins button in the Bins panel.

This will update the taxonomy predictions for your bins based on all the SCGs they contain (not just individual contigs).
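To build some intuition for what a consensus across SCGs means, here is a deliberately simplified sketch: a rank-by-rank majority vote over the lineages of the SCGs in a bin. This is not Anvi'o's actual algorithm, and the function name and the `min_fraction` parameter are assumptions made for illustration. The lineage used is the one reported for the first bin on this page.

```python
from collections import Counter


def consensus_lineage(lineages, min_fraction=0.5):
    """Walk down the ranks, keeping a name only while a majority agrees.

    lineages: list of lists, each ordered from domain to species and
    possibly truncated when an SCG could not be resolved further.
    """
    consensus = []
    candidates = [lin for lin in lineages if lin]
    rank = 0
    while candidates:
        names = Counter(lin[rank] for lin in candidates if len(lin) > rank)
        if not names:
            break  # nobody has information at this rank
        name, votes = names.most_common(1)[0]
        if votes / len(candidates) <= min_fraction:
            break  # no majority: stop at the previous rank
        consensus.append(name)
        candidates = [lin for lin in candidates
                      if len(lin) > rank and lin[rank] == name]
        rank += 1
    return consensus


full = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Methylococcales",
        "Methylomonadaceae", "Methyloglobulus", "sp016874115"]

# Two SCGs resolved to species, one only to family: species-level consensus.
print(consensus_lineage([full, full, full[:5]]))
# Most SCGs resolved only to order: the consensus stops at order.
print(consensus_lineage([full, full[:4], full[:4]]))
```

The second call shows how a bin can end up with a shallower label than another bin, even though both come from the same sample.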

❓ Question 1: Do the taxonomy and coverage of your bins correspond to what was predicted in taxonomy_summary.txt? It is sufficient to zoom in on one organism.

❓ Question 2: What do you notice about the level of taxonomic detail for different bins?

❓ Question 3: Why do you think this happens?

💡 Hint: Think about what the reference database contains.

Step 3: Store your bins

Click on Store bin selection and then Store.

Now go back to your terminal on the cluster and press CTRL+C to stop the interactive session. Then, create a summary of your stored bins:

anvi-summarize -p PROFILE/PROFILE.db \
               -c contigs.db \
               -C default \
               -o BIN_SUMMARY

You will find some interesting information in BIN_SUMMARY/bins_summary.txt
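If you prefer to pull the key numbers out of the summary with a script rather than by eye, a minimal sketch could look like the one below. It assumes the file is tab-separated and that the columns are named `bins`, `total_length`, `percent_completion` and `percent_redundancy`. Check the header line of your own file first, because these names are assumptions and may differ between Anvi'o versions. The inline sample is a stand-in built from the Bin_1 values shown on this page, not real program output.

```python
import csv
import io


def load_bin_summary(handle, name_col="bins", length_col="total_length",
                     comp_col="percent_completion",
                     red_col="percent_redundancy"):
    """Read a tab-separated bin summary into a list of dicts.

    The column names are assumptions; adjust them to match your header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    bins = []
    for row in reader:
        bins.append({
            "bin": row[name_col],
            "length_bp": int(row[length_col]),
            "completeness": float(row[comp_col]),
            "redundancy": float(row[red_col]),
        })
    return bins


# Stand-in for the real file (values taken from the Bin_1 example).
sample = io.StringIO(
    "bins\ttotal_length\tpercent_completion\tpercent_redundancy\n"
    "Bin_1\t3030873\t98.6\t0.0\n"
)

bins = load_bin_summary(sample)
for b in bins:
    print(f"{b['bin']}: {b['length_bp']:,} bp, "
          f"{b['completeness']}% complete, {b['redundancy']}% redundant")
```

To read your real summary, pass `open("BIN_SUMMARY/bins_summary.txt")` to `load_bin_summary` instead of the stand-in.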

❓ Question 4: Create a comprehensive table summarizing all your bins. This will serve for the assignment. For each bin, record:

| Bin Name | Taxonomy (full lineage) | Identified to | Completeness (%) | Contamination (%) | Genome Size (bp) | Average coverage of ribosomal S6 gene |
|---|---|---|---|---|---|---|
| Bin_1 | Bacteria; Pseudomonadota; Gammaproteobacteria; Methylococcales; Methylomonadaceae; Methyloglobulus; sp016874115 | species level | 98.6 | 0.0 | 3,030,873 | 62.76x |
| Bin_2 | ... | ... | ... | ... | ... | ... |
| Bin_3 | ... | ... | ... | ... | ... | ... |
| Bin_4 | ... | ... | ... | ... | ... | ... |
| Bin_5 | ... | ... | ... | ... | ... | ... |

❓ Question 5: Compare your table with your neighbors' table. Did they achieve better results in terms of completeness and contamination for certain bins?
