# 04 Pre-binning (GeertsManon/EEG_Metagenomics GitHub Wiki)
⚠️ ⚠️ ⚠️ A quick reminder: an interactive session with 8 threads is required for this part. If you haven't requested one yet, please do so before proceeding. Here are the instructions for requesting an interactive session.
How do you know?
If you see tier2-p-login after your personal VSC username, it means you are on the login node and still need to request an interactive session.
If you instead see rXXiXXnXX after your personal VSC username, you are good to go.
Also make sure you are in the correct folder:
```
cd $VSC_DATA/EEG_metagenomics/exercises
```
## Introduction
Before we begin grouping contigs into bins, it's helpful to examine what is already in our assembly. Think of this as reconnaissance - we want to understand the types of organisms and genes we are working with before trying to reconstruct their genomes.
At this stage, we have thousands of contigs that could belong to dozens of different organisms. Before using coverage (read depth) patterns and sequence composition to group them (the actual binning process), we can use gene-based methods to gain initial insights.
## Gene detection
HMM (Hidden Markov Model) profiles are collections of statistical models that can recognize specific genes or protein families based on their sequence patterns. Gene detection using these models allows us to identify specific genes without knowing which organism they belong to. We'll use two particularly useful sets of genes:
- Single-Copy Core Genes (SCGs): These are housekeeping genes that typically appear exactly once per genome across most bacteria and archaea. By detecting these genes now, we're setting ourselves up for quality assessment after binning. Here's why they're so powerful:
  - Completeness estimation: A complete bacterial genome should contain ~100-140 specific SCGs (depending on the marker set). If your bin has 120 out of 139 expected SCGs, it's ~86% complete.
  - Contamination detection: Because these genes should appear only once per genome, finding multiple copies of the same SCG suggests your bin contains sequences from more than one organism and is likely contaminated.
  - Taxonomic classification: The identified SCG sequences will be compared against the Genome Taxonomy Database (GTDB, https://gtdb.ecogenomic.org/), a reference database containing taxonomic classifications for thousands of bacterial and archaeal genomes.
- Ribosomal RNA genes (rRNAs): HMMs can also detect ribosomal RNA genes (16S, 23S, 5S, ...), which are excellent taxonomic markers. These genes are highly conserved but have just enough variation to distinguish different organisms.
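The completeness and contamination arithmetic described above can be sketched with hypothetical numbers (anvi'o computes the real values internally from actual SCG hits; the counts below are made up):

```bash
# Hypothetical SCG counts for one bin
expected=139     # SCGs expected in a complete genome for this marker set
found=120        # distinct SCGs detected in the bin
duplicated=3     # SCGs detected more than once (a contamination signal)

echo "completeness: $(( 100 * found / expected ))%"    # ~86%
echo "redundancy:   $(( 100 * duplicated / expected ))%"
```

The same logic applies whichever marker set is used; only `expected` changes.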
Why do this before binning? This pre-exploration gives us valuable context. If we know certain contigs contain ribosomal genes or specific SCGs, we can use that information to guide our manual binning decisions. More importantly, by running the HMM searches now, we ensure that when we create bins later, Anvi'o can immediately calculate completeness and contamination statistics based on the SCG profiles.
```
# The following command should only take about 30s:
anvi-run-hmms \
    -c contigs.db \
    --num-threads 8

# The following command will take ~2.5 minutes:
anvi-run-scg-taxonomy \
    -c contigs.db \
    --num-threads 8 \
    --scgs-taxonomy-data-dir $VSC_DATA
```
## Create a nested SSH tunnel
So far, all our computations have been running on the VSC cluster — but the cluster doesn't have a screen (or a Graphical User Interface, GUI). To actually visualize the results (anvi'o opens an interactive web page), we need to bridge the gap between the remote compute node and your local browser. We do this with a nested SSH tunnel: a chain of connections that routes traffic from your laptop → login node → compute node, making it appear as if the anvi'o web server is running locally on your machine.
Step 1: First, start the anvi'o visualization on the compute node. Fill in the X's with your personal VSC username:
```
anvi-display-contigs-stats contigs.db --port XXXXX
```
This will start the anvi'o web server and print the address it is listening on.
Step 2: Now open ⚠️ a new ⚠️ Terminal window (Mac/Linux) or PowerShell (Windows) on your local computer and ⚠️ do not close the terminal where anvi'o is running ⚠️. We will use this new window to set up the SSH tunnel that connects your browser to the cluster.
Step 3: Build the first part of the tunnel by running the following command, and fill in the X's with your personal VSC username:
```
ssh -L XXXXX:localhost:XXXXX [email protected]
```
You will be prompted to open a firewall verification link in your browser.
Step 4: On the login node (in that same terminal, after authentication), run:
```
ssh -L XXXXX:localhost:XXXXX rXXiXXnXX
```
Replace rXXiXXnXX with the name of your compute node (check the prompt of your cluster session, e.g. vscXXXXX@r25i27n01).
You now have a tunnel: local laptop → login node → compute node. Whenever anvi'o starts a web server, you can open it in your local browser at:
```
http://localhost:XXXXX
```
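If your SSH client supports the `-J` (ProxyJump) option, the two hops can alternatively be combined into a single command. This is a sketch with placeholders: use the same port everywhere, your own VSC username, the login hostname you normally connect to, and your compute node name:

```
ssh -L XXXXX:localhost:XXXXX -J vscXXXXX@<login-host> vscXXXXX@rXXiXXnXX
```

The two-step version above works everywhere, so treat this as an optional shortcut.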
❓ Question 1: Which archaeal SCG or ribosomal gene is most frequently found, and how many times across all contigs? And for bacteria?
💡 Tip: Hover over the bar plots.
❓ Question 2: How many times are bacterial and archaeal SecY proteins found across all contigs? Are any protist SecY proteins identified?
❓ Question 3: What is the estimated number of genomes per domain in this dataset?
Step 5: Keep the Terminal/PowerShell window on your local machine running! On the cluster, press CTRL+C to stop the anvi'o server so you can continue with the exercise.
## Creating the Profile Database
Now we need to process the read mapping information and create a profile database using anvi-profile. While the contigs database stores information about the sequences themselves (contigs, genes, annotations), the profile database stores information about the sample - specifically, how the sequencing reads from this sample cover those contigs.
The profile database is essential for the interactive interface. It provides the coverage data that, combined with sequence composition, allows us to visually identify which contigs belong together (see further). Without profiling, we'd have sequences but no information about their abundance in the sample.
Profiling becomes especially powerful when working with multiple samples. By mapping reads from different samples (in this case, water eDNA) onto the same set of contigs, you can compare coverage patterns across samples—such as which contigs have high coverage in which samples. Are the patterns consistent across samples, or are they different? Significant differences suggest that samples from different environments likely contain different microbial communities. However, to keep things simple for now, we'll focus on just one sample (DRC_Cave_A) to learn the basics.
```
anvi-profile \
    -i mappedReadsToContigs.bam \
    -c contigs.db \
    -o PROFILE \
    --num-threads 8 \
    -S DRC_Cave_A \
    --min-contig-length 150 \
    --cluster-contigs
```
What this does:
- `-i mappedReadsToContigs.bam`: Input BAM file containing read alignments
- `-c contigs.db`: Links to your contigs database (so it knows which contigs these reads map to)
- `-o PROFILE`: Output directory where the profile database will be saved
- `-S DRC_Cave_A`: Sample name (this will appear in the interface)
- `--min-contig-length 150`: Ignore very short contigs (< 150 bp) that are unreliable for binning
- `--cluster-contigs`: Automatically perform an initial hierarchical clustering of contigs based on their sequence composition and coverage
What gets calculated:
The profiling step processes your BAM file and calculates:
- Coverage values for each contig (average depth, variability, what percentage of each contig is covered by reads)
- Single-nucleotide variants (SNVs) if present
- Hierarchical clustering of contigs (groups similar contigs together as a starting point)
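The "average depth" part of coverage is conceptually just the mean per-base read depth over a contig. As a toy illustration (the contig name and the three depth values are made up, in `samtools depth`-style columns: contig, position, depth; `anvi-profile` derives the real values from the BAM file):

```bash
# Mean coverage of one toy contig from per-position depths 10, 20, 30
printf 'c1\t1\t10\nc1\t2\t20\nc1\t3\t30\n' \
  | awk -F'\t' '{sum += $3; n++} END {print "mean coverage:", sum/n}'
# prints "mean coverage: 20"
```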
❓ Question 4: How many contigs were dropped?
## Taxonomic annotation
We've now run two complementary commands that work together to identify and classify genes in your data:
- `anvi-run-hmms` identifies which SCGs are present in your contigs.
- `anvi-run-scg-taxonomy` compares each individual SCG sequence against the GTDB reference database and assigns taxonomic classifications to each gene.
Now we need to take this gene-level information and translate it into something more useful: what organisms are in our sample?
This is where anvi-estimate-scg-taxonomy comes in. While the previous steps classified individual genes, anvi-estimate-scg-taxonomy provides a consensus taxonomic prediction for entire contigs or bins by considering all marker genes together. Instead of having scattered gene-level classifications, you'll get clear statements like "this contig most likely belongs to Methyloglobulus" based on agreement across multiple SCGs.
### Step 1: Estimate taxonomy
First, let's estimate taxonomy and save the results to a file we can inspect:
```
anvi-estimate-scg-taxonomy -c contigs.db \
    -p PROFILE/PROFILE.db \
    --metagenome-mode \
    --output-file taxonomy_summary.txt \
    --compute-scg-coverages
```
What this does:
- `-c contigs.db`: Your contigs database (contains the SCGs identified by `anvi-run-hmms`)
- `-p PROFILE/PROFILE.db`: Your profile database (contains coverage information)
- `--metagenome-mode`: Tells Anvi'o to expect multiple organisms in the sample (as opposed to a single isolate genome). This mode analyzes each contig independently rather than assuming that everything belongs to a single organism.
- `--output-file taxonomy_summary.txt`: Saves the taxonomic predictions to a file for inspection
- `--compute-scg-coverages`: Calculates the coverage of each SCG across the sample. This helps determine which taxonomic groups are more abundant: if certain SCGs have high coverage, that organism is likely more prevalent in your cave sample.
What gets calculated:
This command:
- Takes the SCGs identified by HMMs
- Compares their sequences against a reference database of known organisms
- Assigns taxonomic classifications (domain → phylum → class → order → family → genus → species)
- Calculates how abundant each taxonomic group is based on SCG coverage
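A consensus assignment across marker genes is, at its core, a majority vote. As a toy sketch (the genus names and per-gene calls are made up; anvi'o's actual consensus logic is more involved):

```bash
# Three per-gene genus calls for one contig; the consensus is the most frequent one
printf 'Methyloglobulus\nMethyloglobulus\nMethylobacter\n' \
  | sort | uniq -c | sort -rn | head -1 | awk '{print $2}'
# prints "Methyloglobulus"
```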
Let's look at what organisms were detected:
```
cat taxonomy_summary.txt | column -t
```
Important note: This output shows only the most frequently detected ribosomal gene across all contigs - specifically the Ribosomal_S6 gene in this case. Anvi'o selects the single best-represented SCG for taxonomy estimation to provide the most reliable taxonomic signal. While your contigs contain many different SCGs, this focused approach using the most abundant marker gives you a cleaner overview of the dominant organisms in your sample.
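If you want to rank such a table by abundance yourself, a generic sort works. Here on two toy tab-separated rows with made-up names, where the coverage sits in the last (5th) column; in the real `taxonomy_summary.txt` the coverage column comes later, so adjust the `-k` field accordingly:

```bash
# Toy rows shaped like the summary file: gene, identity, domain, genus, coverage
printf 'geneA\t95.5\tBacteria\tMethyloglobulus\t62.8\ngeneB\t99.0\tBacteria\tSulfuricurvum\t10.2\n' > toy_summary.txt

# Numeric sort on the coverage column, highest first; print the top genus
sort -k5,5nr toy_summary.txt | head -1 | cut -f4
# prints "Methyloglobulus"
```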
Understanding the output:
Each row represents one SCG (ribosomal protein gene) that was successfully matched to a known organism:
| Column | Meaning |
|---|---|
| `scg_name` | The specific SCG gene and the contig where it was found |
| `percent_identity` | How similar this gene is to the reference (higher = more confident match) |
| `t_domain` → `t_species` | Full taxonomic lineage from domain down to species |
| `DRC_Cave_A` | Coverage of this SCG in your sample (higher = more abundant organism) |
Example interpretation:
```
scg_name                       percent_identity  t_domain  t_phylum        t_class              t_order          t_family           t_genus          t_species                    DRC_Cave_A
DRC_Cave_A_Ribosomal_S6_21391  95.5              Bacteria  Pseudomonadota  Gammaproteobacteria  Methylococcales  Methylomonadaceae  Methyloglobulus  Methyloglobulus sp016874115  62.75894988066826
```
This tells us:
- ✅ Found a Ribosomal S6 gene on contig 21391
- ✅ 95.5% similar to known Methyloglobulus species
- ✅ Very high coverage (62.76) - this is likely a dominant organism in the cave!
❓ Question 5: Based on the taxonomy_summary.txt file, which bacterial phylum appears most frequently in the results?
❓ Question 6: Which organism has the highest coverage (i.e., is most abundant)?
❓ Question 7: Are there any Ribosomal S6 genes with 100% identity to known species?
### Step 2: Add taxonomy to Profile Database
Now that we've inspected the results and understand what organisms are present, let's store this information in the profile database so it appears in the interactive interface:
```
anvi-estimate-scg-taxonomy -c contigs.db \
    -p PROFILE/PROFILE.db \
    --metagenome-mode \
    --compute-scg-coverages \
    --update-profile-db-with-taxonomy
```
New parameters:
- `--update-profile-db-with-taxonomy`: Store the taxonomic predictions directly in the profile database so they appear as a layer in the interactive interface. Without this flag, results would only be saved to the file.
Why two steps?
Separating these steps allows you to:
- First: Examine the taxonomy results and understand your data
- Then: Commit the results to your database once you're satisfied
This is good practice - always inspect your results before permanently storing them!