CBW‐2025‐BMB‐Module2 - LangilleLab/microbiome_helper GitHub Wiki

Module 2: Marker gene profiling

This tutorial is part of the 2025 CBW Beginner Microbiome Analysis (held in Vancouver, BC, May 26-27). It is based on the Amplicon SOP v2 available on the Microbiome Helper repository and previous workshops designed by Monica Alvaro Fuss, Diana Haider and Robert Beiko.

Author: Robyn Wright

Introduction

Modules 2 and 3 provide a walkthrough of an end-to-end pipeline using the command line interface for the analysis of high-throughput marker gene data. Commonly used marker genes for microbiome analysis include the 16S ribosomal RNA (rRNA) for prokaryotes, 18S rRNA for eukaryotes, and the internal transcribed spacer (ITS) for fungi.

In this tutorial, you can choose between using:

16S dataset from wild blueberry, Vaccinium angustifolium (soil microbiome)
- Variation in Bacterial and Eukaryotic Communities Associated with Natural and Managed Wild Blueberry Habitats
- Metagenomic Functional Shifts to Plant Induced Environmental Changes
18S dataset from plastics incubated in a coastal marine environment (plastisphere)
- Microbial pioneers of plastic colonisation in coastal seawaters
ITS2 dataset from stool samples from pregnant women (gut microbiome)
- Landscape of the gut mycobiome dynamics during pregnancy and its relationship with host metabolism and pregnancy health

You can jump to one of the below sections for the commands needed for processing each of these amplicons. We recommend choosing one of them - 16S is the most widely used and also allows you to easily generate a phylogenetic tree. If you have never done any amplicon analysis before then we recommend choosing the 16S dataset.

In Module 2 we will cover the basics of marker gene analysis from raw reads to filtered feature table. The pipeline described is embedded in the latest version of QIIME2 (Quantitative Insights into Microbial Ecology version 2025.4), which is a popular microbiome bioinformatics platform for microbial ecology built on user-made software packages called plugins that work on QIIME2 artifact or QZA files. Documentation for these plugins can be found in the QIIME 2 user documentation, along with tutorials and other useful information. QIIME2 also provides interpretable visualizations that can be accessed by opening any generated QZV files within QIIME2 View.

16S analysis
18S analysis
ITS analysis

1. 16S

Create a directory for Module 2 inside workspace and create a symlink to the raw FASTQ files and the metadata file.

cd ~/workspace
mkdir bmb_module2/ bmb_module2/16S_analysis
cd bmb_module2/16S_analysis
ln -s ~/CourseData/MIC_data/bmb_raw_data/16S_raw_data raw_data
ln -s ~/CourseData/MIC_data/bmb_raw_data/Blueberry_16S_metadata.tsv .

You will have installed the latest version of QIIME2 in Module 1, so you can activate that environment with the command below:

conda activate qiime2-amplicon-2025.4

If you didn't get to that part, you can use this environment instead:

conda activate qiime2-amplicon-backup

Throughout this module, there are some questions aimed to help your understanding of some of the key concepts. You'll find the answers at the bottom of this page, but no one will be marking them.

1.1. 16S First steps

1.1.1. Inspect raw data

First, let's take a look at the directory containing our raw reads as well as our metadata file.

ls raw_data

head Blueberry_16S_metadata.tsv

Question 1: How many samples are there?

Question 2: Into what group(s) are the samples classified?

1.1.2. Import FASTQs as QIIME2 artifact

To standardize QIIME 2 analyses and to keep track of provenance (i.e. a list of what commands were previously run to produce a file) a special format is used for all QIIME 2 input and output files called an "artifact" (with the extension QZA). The first step is to import the raw reads as a QZA file. We will first create a new directory.

mkdir reads_qza

qiime tools import \
  --type SampleData[PairedEndSequencesWithQuality] \
  --input-path raw_data/ \
  --output-path reads_qza/reads.qza \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt

This might take a minute! If it hasn't come back up with the command prompt that looks something like (qiime2-amplicon-2025.4) ubuntu@ip-10-0-1-248:~/workspace/bmb_module2/16S_analysis$ yet, then it hasn't finished running yet and you'll need to be patient :)

All of the FASTQs are now in the single artifact file reads_qza/reads.qza. This file format can be a little confusing at first, but it is actually just a zipped folder. You can manipulate and explore these files better with the qiime tools utilities (e.g. peek and view).

1.1.3. Trim primers with cutadapt

Screen out reads that do not begin with primer sequence and remove primer sequence from reads using the cutadapt QIIME 2 plugin. The below primers correspond to the 16S V6-V8 region (bacteria-specific primer set). You can see more about different primers and the taxa that they target here.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences reads_qza/reads.qza \
  --p-cores 4 \
  --p-front-f ACGCGHNRAACCTTACC \
  --p-front-r ACGGGCRGTGWGTRCAA \
  --p-discard-untrimmed \
  --p-no-indels \
  --o-trimmed-sequences reads_qza/reads_trimmed.qza

Visualizing your output data is a good idea after any step to make sure nothing unexpected occurred. The following command generates a "visualization" file with the extension QZV.

We still have a few preprocessing requirements to check off our list before denoising, so we can wait until these steps are complete to visualize our data. However, if you would like to see what paired-end reads look like before joining, run the following command and open the QZV file in QIIME2 View.

qiime demux summarize \
  --i-data reads_qza/reads_trimmed.qza \
  --o-visualization reads_qza/reads_trimmed_summary.qzv

Remember that you can see and download these files on the webpage: http://##.uhn-hpc.ca/ (replace ## with your number!)

Question 3: What would happen if you ran this exact command on V4/V5-amplified sequences?

1.2. 16S Denoising the reads into amplicon sequence variants

Different denoising tools require different levels of preprocessing before the actual denoising happens. For example, DADA2 performs read joining and quality filtering as part of the denoising step itself, and can be run directly after trimming the primers. Due to speed considerations, we'll be using Deblur instead, which requires that these steps be carried out separately. Guidelines for running DADA2 can be found here.

1.2.1. Join paired-end reads

Forward and reverse reads can be joined with VSEARCH as shown below. This will generate QZA files for both the joined/merged sequences and unmerged sequences.

qiime vsearch merge-pairs \
  --i-demultiplexed-seqs reads_qza/reads_trimmed.qza \
  --output-dir reads_qza/reads_joined

1.2.2. Filter out low-quality reads

This command will filter out low-quality reads based on the default options.

qiime quality-filter q-score \
  --i-demux reads_qza/reads_joined/merged_sequences.qza \
  --o-filter-stats filt_stats.qza \
  --o-filtered-sequences reads_qza/reads_trimmed_joined_filt.qza

1.2.3. Summarize joined and filtered reads

It is a good idea at this point just to verify that there haven't been any substantial losses of reads, before going through the whole ASV process, at either the joining or quality-filtering steps above. You will also need to select a length to trim back to that maintains the largest/acceptable quantity of reads during denoising.

qiime demux summarize \
  --i-data reads_qza/reads_trimmed_joined_filt.qza \
  --o-visualization reads_qza/reads_trimmed_joined_filt_summary.qzv

Now open the file in QIIME2 View and look at the Overview and Interactive Quality Plot tabs to explore your data and answer the following questions.

Question 4: How long are our forward reads? Why are there no reverse reads in our file?

Question 5: What would be a good trim length for our reads? Remember that there are answers at the bottom of the page if you would like to check this.

1.2.4. Running Deblur

Running the Deblur workflow will correct the raw reads into amplicon sequence variants (ASVs). This denoising tool filters out reads that either do match to known noise or that do not match with low similarity to the expected amplicon region. Note that the below command will retain singletons, which would have been filtered out unless we set --p-min-reads 1, and is for 16S sequences only. For other amplicon regions, you can either use the denoise-other option in the command and specify a reference database of sequences to use for positive filtering (as in the below versions for 18S and ITS) or use DADA2.

You'll need to replace X here with the trim length that was decided in Question 5.

qiime deblur denoise-16S \
  --i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
  --p-trim-length X \
  --p-sample-stats \
  --p-jobs-to-start 4 \
  --p-min-reads 1 \
  --output-dir deblur_output

Note: this command may take a few minutes to run.

1.2.5. Summarizing Deblur output

Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for you to view. We will use this visualization later to determine the the cut-offs for filtering the table below, but for now you should mainly take a look at the visualization to ensure that sufficient reads have been retained after running deblur. This denoising tool filters out reads that either do match to known noise or that do not match with low similarity to the expected amplicon region. If your samples have very low depth after running deblur (compared to the input read depth) this could be a red flag that either you ran the tool incorrectly, you have a lot of noise in your data, or that deblur is inappropriate for your dataset.

qiime feature-table summarize \
  --i-table deblur_output/table.qza \
  --o-visualization deblur_output/deblur_table_summary.qzv

Question 6: What is the mean sequencing depth per sample after denoising?

Question 7: Which sample has the least reads?

1.3. 16S Assign taxonomy to ASVs

You can assign taxonomy to your ASVs using a Naive-Bayes approach implemented in the [scikit learn](http://scikit-learn.org/stable/) Python library and the [SILVA](https://www.arb-silva.de/) or [UNITE](https://unite.ut.ee/) databases. This approach requires that a classifier be trained in advance on a reference database. We recommend users use a widely used classifier to help ensure there are no unexpected issues with the Naive-Bayes model. We previously maintained primer-specific classifiers, which theoretically can provide more accurate classifications, but we no longer do this due to concerns regarding issues with the trained models that are difficult to catch if only a couple people are running them. The full-length 16S/18S classifier can be downloaded from the [QIIME 2 website](https://docs.qiime2.org/2022.11/data-resources/) (silva-138-99-nb-classifier.qza for the latest classifier). Custom classifiers for the ITS region that we have generated from the UNITE database are available as well ([see downloads](http://kronos.pharmacology.dal.ca/public_files/MH/taxa_classifiers/qiime2-2020.8_classifiers) and [commands used to create these files](https://github.com/LangilleLab/microbiome_helper/wiki/Creating-QIIME-2-Taxonomic-Classifiers)):

-   Full ITS - fungi only (classifier_sh_refs_qiime_ver9_99_s_27.10.2022_ITS.qza)
-   Full ITS - all eukaryotes (classifier_sh_refs_qiime_ver9_99_s_all_27.10.2022_ITS.qza)

We're going to use the Greengenes2 classifier. We can download that from the FTP site here - click on the folder sklearn-1.4.2-compatible-nb-classifiers/, and then copy the link for the 2022.10.backbone.full-length.nb.sklearn-1.4.2.qza classifier. (Right-click and select "Copy Link Address". Then we'll use the wget command like we did in Module 1. That will look something like this:

wget link-address

1.3.1. Run taxonomic classification

You can run the taxonomic classification with this command, which is one of the longest running and most memory-intensive command of the tutorial. If you receive an error related to insufficient memory (and if you cannot increase your memory usage) then you can look into the --p-reads-per-batch option and set this to be lower than the default (which is dynamic depending on sample depth and the number of threads) and also try running the command with fewer jobs (e.g. set --p-n-jobs 1).

Because of the memory constraints of this workshop, you can copy the output into your folder:

But here is the code you would use to run the taxonomic classification:

qiime feature-classifier classify-sklearn \
  --i-reads deblur_output/representative_sequences.qza \
  --i-classifier 2022.10.backbone.full-length.nb.sklearn-1.4.2.qza \
  --p-n-jobs 4 \
  --output-dir taxa

Unfortunately we don't actually have enough memory to run this so you'll see a Killed output for this. Instead, we'll copy across the output that we would have got.

mkdir taxa      
cp ~/CourseData/MIC_data/bmb_module2/16S/taxa/classification.qza taxa

As with all QZA files, you can export the output file to take a look at the classifications and confidence scores:

qiime tools export \
  --input-path taxa/classification.qza \
  --output-path taxa

1.3.2 Assess subset of taxonomic assignments with BLAST

The performance of the taxonomic classification is difficult to assess without a gold-standard reference, but nonetheless one basic sanity check is to compare the taxonomic assignments with the top BLASTn hits for certain ASVs. First, generate a QZV file for the denoised representative sequences in QIIME 2 by running:

qiime feature-table tabulate-seqs \
  --i-data deblur_output/representative_sequences.qza \
  --o-visualization deblur_output/representative_sequences.qzv

This QZV file tabulates the denoised sequences. Clicking on the nucleotide sequence links to a BLASTn search for that sequence. By comparing these BLAST hits with the taxonomic assignment of ASVs generated above you can reassure yourself that the taxonomic assignments overall worked correctly. It's a good idea to select ~5 ASVs to BLAST for this validation, which should be from taxonomically different groups, such as different phyla, according to the taxonomic classifier.

You can then check the taxonomy that was assigned to them using your classifier - you can either open up the taxonomy.tsv file and search for the Feature ID/ASV names there, or you can use the grep command like so:

grep "02fda7604175053e87dc11e2598a89e0" taxa/taxonomy.tsv

1.4. 16S Filtering resultant table

Filtering the denoised table is an important step of microbiome data analysis. You can see more details on this process in the QIIME 2 filtering tutorial.

1.4.1. Filter out rare ASVs

Based on the summary visualization created in step 2.5 above you can choose a cut-off for how frequent a variant needs to be (and optionally how many samples need to have the variant) for it to be retained. Here we will remove all ASVs that have a frequency of less than 0.1% of the mean sample depth. This cut-off excludes ASVs that are likely due to MiSeq bleed-through between runs (reported by Illumina to be 0.1% of reads). To calculate this cut-off you would identify the mean sample depth in the visualization created in step 2.5 (deblur_table_summary.qzv), multiply it by 0.001, and round to the nearest integer.

Once you've determined how you would like to filter your table you can do so with this command (X is a placeholder for your choice):

qiime feature-table filter-features \
  --i-table deblur_output/table.qza \
  --p-min-frequency X \
  --p-min-samples 1 \
  --o-filtered-table deblur_output/deblur_table_filt.qza

1.4.2. Filter out contaminant and unclassified ASVs

Once we have assigned taxonomy to our ASVs we can use that information to remove ASVs which are likely contaminants or noise based on the taxonomic labels. Two common contaminants in 16S sequencing data are mitochondrial and chloroplast 16S sequences, which can be removed by excluding any ASV which contains those terms in its taxonomic label. It can also be sometimes useful to exclude any ASV that is unclassified at the phylum level since these sequences could be noise (e.g. possible chimeric sequences). Note that if your data has not been classified against the default database you may need to change p__ to be a string that enables phylum-level assignments to be identified or simply omit that line.

In general though, it can be very informative if your sequencing reads are coming back with significant amounts of unclassified ASVs as it can indicate upstream analysis problems or indicate you are studying a poorly characterized environment where you have a good chance of identifying a lot of novel phyla. Therefore, our recommendation is to not filter out the unclassified sequences by default, but we will do so here.

qiime taxa filter-table \
  --i-table deblur_output/deblur_table_filt.qza \
  --i-taxonomy taxa/classification.qza \
  --p-include p__ \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table deblur_output/deblur_table_filt_contam.qza

1.4.3. (Optional) Exclude low-depth samples

Often certain samples will have quite low depth after these filtering steps, which can be excluded from downstream analyses since they will largely add noise. There is no single cut-off that works best for all datasets, but researchers often use minimum cut-offs within the range of 1000 to 4000 reads. You can also use a cut-off much lower than this if you want to retain all samples except those that failed entirely (e.g. depth < 50 reads).

Ideally you would choose this cut-off after visualizing rarefaction curves to determine at what read depth the richness of your samples plateaus and choose a cut-off as close to this plateau as possible while retaining sufficient sample size for your analyses. We learn more about rarefaction curves in Module 3.

1.4.4. Subset and summarize filtered table

Check output after filtering.

qiime feature-table summarize \
  --i-table deblur_output/deblur_table_filt_contam.qza \
  --o-visualization deblur_output/deblur_table_filt_contam_summary.qzv

Question 8: What is the minimum and maximum sequencing depth across all samples?

Happy? Copy a final table.

cp deblur_output/deblur_table_filt_contam.qza deblur_output/deblur_table_final.qza

Once we have our final filtered table we will need to subset the QZA file containing the ASV sequences to the same set. You can exclude any removed ASVs from the sequence file with this command:

qiime feature-table filter-seqs \
  --i-data deblur_output/representative_sequences.qza \
  --i-table deblur_output/deblur_table_final.qza  \
  --o-filtered-data deblur_output/rep_seqs_final.qza

Finally, you can make a new summary of the final filtered abundance table:

qiime feature-table summarize \
  --i-table deblur_output/deblur_table_final.qza \
  --o-visualization deblur_output/deblur_table_final_summary.qzv

16S Answers

Question 1: How many samples are there?

We have 10 samples. The raw data folder contains one fastq.gz file for each set of the forward and reverse sequences (labelled R1 and R2, respectively) for each of the samples (B-Rtxxx).

Question 2: Into what group(s) are the samples classified?

The samples we are using are classified into two different groups: Forest or Managed sites. The study we're looking at does also look at bulk or rhizosphere samples, but we're just using a small subset of the samples for this workshop.

Question 3: What would happen if you ran this exact command on V4/V5-amplified sequences?

Cutadapt will only trim reads that match the specified primer sequence. Therefore, most reads would be discarded because we are including the --p-discard-untrimmed option.

Question 4: How long are our forward reads? Why are there no reverse reads in our file?

The forward read median length is 405 nucleotides. There are no reverse reads because forward and reverse reads were merged into one sequence during read joining.

Question 5: What would be a good trim length for our reads?

There is no one right answer for this question, but a trim length of 390 nucleotides will maintain most of our sequences apart from any that are really short. You could choose a different length, but this will give different answers further on. Typically, this choice is a trade-off between maintaining a longer sequence length where we're more likely to get finer-resolution taxonomic classification, or trimming more so that we retain more sequences.

Question 6: What is the mean sequencing depth per sample after denoising?

The mean sequencing depth (frequency) across all denoised samples is 6,669.8 reads.

Question 7: Which sample has the least reads?

Sample B-Rt151 has the lowest sequencing depth (4,332 reads).

Question 8: What is the minimum and maximum sequencing depth across all samples?

The final minimum sequencing depth is 3,432 and the maximum sequencing depth is 8,650 reads.

2. 18S

Create for Module 2 inside workspace and create a symlink to the raw FASTQ files and the metadata file.

cd ~/workspace
mkdir bmb_module2/ bmb_module2/18S_analysis
cd bmb_module2/18S_analysis
ln -s ~/CourseData/MIC_data/bmb_raw_data/18S_raw_data raw_data
ln -s ~/CourseData/MIC_data/bmb_raw_data/Plastisphere_18S_metadata.tsv .

You will have installed the latest version of QIIME2 in Module 1, so you can activate that environment with the command below:

conda activate qiime2-amplicon-2025.4

If you didn't get to that part, you can use this environment instead:

conda activate qiime2-amplicon-backup

Throughout this module, there are some questions aimed to help your understanding of some of the key concepts. You'll find the answers at the bottom of this page, but no one will be marking them.

2.1. 18S First steps

2.1.1. Inspect raw data

First, let's take a look at the directory containing our raw reads as well as our metadata file.

ls raw_data

head Plastisphere_18S_metadata.tsv

Question 1: How many samples are there?

Question 2: Into what group(s) are the samples classified?

2.1.2. Import FASTQs as QIIME2 artifact

mkdir reads_qza

qiime tools import \
  --type SampleData[PairedEndSequencesWithQuality] \
  --input-path raw_data/ \
  --output-path reads_qza/reads.qza \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt

This might take a minute! If it hasn't come back up with the command prompt that looks something like (qiime2-amplicon-2025.4) ubuntu@ip-10-0-1-248:~/workspace/bmb_module2/18S_analysis$ yet, then it hasn't finished running yet and you'll need to be patient :)

2.1.3. Trim primers with cutadapt

Screen out reads that do not begin with primer sequence and remove primer sequence from reads using the cutadapt QIIME 2 plugin. The below primers correspond to the 18S V4 region (eukaryote-specific primer set). You can see more about different primers and the taxa that they target here.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences reads_qza/reads.qza \
  --p-cores 4 \
  --p-front-f ATAACAGGTCTGTGATGCCCT \
  --p-front-r CCTTCYGCAGGTTCACCTAC \
  --p-discard-untrimmed \
  --p-no-indels \
  --o-trimmed-sequences reads_qza/reads_trimmed.qza

Visualizing your output data is a good idea after any step to make sure nothing unexpected occurred. The following command generates a "visualization" file with the extension QZV.

qiime demux summarize \
  --i-data reads_qza/reads_trimmed.qza \
  --o-visualization reads_qza/reads_trimmed_summary.qzv

Remember that you can see and download these files on the webpage: http://##.uhn-hpc.ca/ (replace ## with your number!)

Question 3: What would happen if you ran this exact command on 16S V4/V5-amplified sequences?

2.2. 18S Denoising the reads into amplicon sequence variants

2.2.1. Join paired-end reads

Forward and reverse reads can be joined with VSEARCH as shown below. This will generate QZA files for both the joined/merged sequences and unmerged sequences.

qiime vsearch merge-pairs \
  --i-demultiplexed-seqs reads_qza/reads_trimmed.qza \
  --output-dir reads_qza/reads_joined

2.2.2. Filter out low-quality reads

This command will filter out low-quality reads based on the default options.

qiime quality-filter q-score \
  --i-demux reads_qza/reads_joined/merged_sequences.qza \
  --o-filter-stats filt_stats.qza \
  --o-filtered-sequences reads_qza/reads_trimmed_joined_filt.qza

2.2.3. Summarize joined and filtered reads

qiime demux summarize \
  --i-data reads_qza/reads_trimmed_joined_filt.qza \
  --o-visualization reads_qza/reads_trimmed_joined_filt_summary.qzv

Now open the file in QIIME2 View and look at the Overview and Interactive Quality Plot tabs to explore your data and answer the following questions.

Question 4: How long are our forward reads? Why are there no reverse reads in our file?

Question 5: What would be a good trim length for our reads? Remember that there are answers at the bottom of the page if you would like to check this.

2.2.4. Running Deblur

Running the Deblur workflow will correct the raw reads into amplicon sequence variants (ASVs). This denoising tool filters out reads that either do match to known noise or that do not match with low similarity to the expected amplicon region. Note that the below command will retain singletons, which would have been filtered out unless we set --p-min-reads 1, and is for 16S sequences only. While for 16S there is a denoise-16S option that doesn't require any reference sequences, for other amplicon regions (like 18S and ITS), you can use the denoise-other option and specify a reference database of sequences to use for positive filtering. We keep these files available for others on our lab server here.

We'll go ahead and download the 18S sequences. You can do that by right-clicking on the gb203_pr2_all_10_28_99p_clean_prob-rm.fasta file, and clicking "Copy Link Address". You can then use the wget command, like we did in Module 1. So your command should look something like:

wget link-address

Then you can import this into QIIME:

qiime tools import \
   --input-path gb203_pr2_all_10_28_99p_clean_prob-rm.fasta \
   --output-path gb203_pr2_all_10_28_99p_clean_prob-rm.qza \
   --type 'FeatureData[Sequence]'

You'll need to replace X here with the trim length that was decided in Question 5.

qiime deblur denoise-other \
   --i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
   --i-reference-seqs gb203_pr2_all_10_28_99p_clean_prob-rm.qza \
   --p-trim-length X \
   --p-sample-stats \
   --p-jobs-to-start 4 \
   --p-min-reads 1 \
   --output-dir deblur_output

Note: this command may take around ten minutes to run.

2.2.5. Summarizing Deblur output

qiime feature-table summarize \
  --i-table deblur_output/table.qza \
  --o-visualization deblur_output/deblur_table_summary.qzv

Question 6: What is the mean sequencing depth per sample after denoising?

Question 7: Which sample has the least reads?

2.3. 18S Assign taxonomy to ASVs

You can assign taxonomy to your ASVs using a Naive-Bayes approach implemented in the scikit learn Python library and the SILVA or UNITE databases. This approach requires that a classifier be trained in advance on a reference database. We recommend users use a widely used classifier to help ensure there are no unexpected issues with the Naive-Bayes model. We previously maintained primer-specific classifiers, which theoretically can provide more accurate classifications, but we no longer do this due to concerns regarding issues with the trained models that are difficult to catch if only a couple people are running them. The full-length 16S/18S classifier can be downloaded from the QIIME 2 website (silva-138-99-nb-classifier.qza for the latest classifier). Sometimes there are custom classifiers or other specific databases that are good for specific purposes. For 18S, we're going to be using the PR2 (Protist Ribosomal Reference) database. A custom classifier that we've generated is available (see downloads and commands used to create classifiers).

We're going to use the PR2 classifier. As you did above, click on "Copy Link Address" for the file pr2_version_5.1.0_SSU_18S.qza and then use wget to download it. Check how you ran this command before if you're unsure!

2.3.1. Run taxonomic classification

You can run the taxonomic classification with this command, which is one of the longest running and most memory-intensive commands of the tutorial. If you receive an error related to insufficient memory (and if you cannot increase your memory usage) then you can look into the --p-reads-per-batch option and set this to be lower than the default (which is dynamic depending on sample depth and the number of threads) and also try running the command with fewer jobs (e.g. set --p-n-jobs 1).

qiime feature-classifier classify-sklearn \
  --i-reads deblur_output/representative_sequences.qza \
  --i-classifier pr2_version_5.1.0_SSU_18S.qza \
  --p-n-jobs 4 \
  --output-dir taxa

Unfortunately we don't actually have enough memory to run this so you'll see a Killed or Terminated output for this. Instead, we'll copy across the output that we would have got.

mkdir taxa      
cp ~/CourseData/MIC_data/bmb_module2/18S/taxa/classification.qza taxa

As with all QZA files, you can export the output file to take a look at the classifications and confidence scores:

qiime tools export \
  --input-path taxa/classification.qza \
  --output-path taxa

2.3.2 Assess subset of taxonomic assignments with BLAST

qiime feature-table tabulate-seqs \
  --i-data deblur_output/representative_sequences.qza \
  --o-visualization deblur_output/representative_sequences.qzv

grep "5ecf17cbc8af149eb7fbec898b841d72" taxa/taxonomy.tsv

You should see that this sequence has been classified as d__Eukaryota; sg__Archaeplastida; di__Rhodophyta; sd__Eurhodophytina; c__Florideophyceae; o__Ceramiales; f__Rhodomelaceae with confidence ~0.99. If we look at the taxonomy on NCBI BLAST, then we'll see that some of the top hits are Osmundea pinnatifida (identity 99.34%) and Laurencia filiformis (identity 98.01%). A quick google tells me that both of these are in the family Rhodomelaceae, so it seems pretty reasonable that - with two hits with relatively high identity that are from different genera - our classifier would determine that this ASV can be classified at the family level as Rhodomelaceae.

Try taking a look at a few more and seeing how they seem.

2.4. 18S Filtering resultant table

Filtering the denoised table is an important step of microbiome data analysis. You can see more details on this process in the QIIME 2 filtering tutorial.

2.4.1. Filter out rare ASVs

Once you've determined how you would like to filter your table you can do so with this command (X is a placeholder for your choice):

qiime feature-table filter-features \
  --i-table deblur_output/table.qza \
  --p-min-frequency X \
  --p-min-samples 1 \
  --o-filtered-table deblur_output/deblur_table_filt.qza

2.4.2. Filter out contaminant and unclassified ASVs

Once we have assigned taxonomy to our ASVs we can use that information to remove ASVs which are likely contaminants or noise based on the taxonomic labels. Two common contaminants in 16S sequencing data are mitochondrial and chloroplast 16S sequences, which can be removed by excluding any ASV which contains those terms in its taxonomic label. Obviously this is different for 18S, so we don't need to exclude those here. It can also be sometimes useful to exclude any ASV that is unclassified at the phylum level since these sequences could be noise (e.g. possible chimeric sequences). Note that here we actually don't have phylum-level classifications because the PR2 database doesn't use these - it uses levels of domain, supergroup, division, subdivision, class, order, family, genus, and species. So in our taxonomy file you'll see that these are d, sg, di, sd, c, o, f, g, and s, respectively. We'll therefore probably want to use the sg__ instead of p__ like we used for 16S data.

qiime taxa filter-table \
  --i-table deblur_output/deblur_table_filt.qza \
  --i-taxonomy taxa/classification.qza \
  --p-include sg__ \
  --o-filtered-table deblur_output/deblur_table_filt_contam.qza

2.4.3. (Optional) Exclude low-depth samples

2.4.4. Subset and summarize filtered table

Check output after filtering.

qiime feature-table summarize \
  --i-table deblur_output/deblur_table_filt_contam.qza \
  --o-visualization deblur_output/deblur_table_filt_contam_summary.qzv

Question 8: What is the minimum and maximum sequencing depth across all samples?

Happy? Copy a final table.

cp deblur_output/deblur_table_filt_contam.qza deblur_output/deblur_table_final.qza

Once we have our final filtered table we will need to subset the QZA file containing the ASV sequences to the same set. You can exclude any removed ASVs from the sequence file with this command:

qiime feature-table filter-seqs \
  --i-data deblur_output/representative_sequences.qza \
  --i-table deblur_output/deblur_table_final.qza  \
  --o-filtered-data deblur_output/rep_seqs_final.qza

Finally, you can make a new summary of the final filtered abundance table:

qiime feature-table summarize \
  --i-table deblur_output/deblur_table_final.qza \
  --o-visualization deblur_output/deblur_table_final_summary.qzv

18S Answers

Question 1: How many samples are there?

We have 12 samples. The raw data folder contains one fastq.gz file for each set of the forward and reverse sequences (labelled R1 and R2, respectively) for each of the samples (MHxxxx18S).

Question 2: Into what group(s) are the samples classified?

The samples we are using are classified using two different variables: light and time. These are pieces of plastic that were either incubated in a sunny coastal environment (OpenCoast) or in a cave (Cave; the light variable), and they were incubated for 6h or 28h (time).

Question 3: What would happen if you ran this exact command on 16S sequences?

Cutadapt will only trim reads that match the specified primer sequence. Therefore, most reads would be discarded because we are including the --p-discard-untrimmed option.

Question 4: How long are our forward reads? Why are there no reverse reads in our file?

The forward read median length is 331 nucleotides. There are no reverse reads because forward and reverse reads were merged into one sequence during read joining.

Question 5: What would be a good trim length for our reads?

There is no one right answer for this question, but a trim length of 302 nucleotides will maintain most of our sequences apart from any that are really short. You could choose a different length, but this will give different answers further on. Typically, this choice is a trade-off between maintaining a longer sequence length where we're more likely to get finer-resolution taxonomic classification, or trimming more so that we retain more sequences.

Question 6: What is the mean sequencing depth per sample after denoising?

The mean sequencing depth (frequency) across all denoised samples is 6,588.8 reads.

Question 7: Which sample has the least reads?

Sample MH28C318S has the lowest sequencing depth (only 27 reads).

Question 8: What is the minimum and maximum sequencing depth across all samples?

The final minimum sequencing depth is 7 and the maximum sequencing depth is 12,890 reads.

3. ITS

Create for Module 2 inside workspace and create a symlink to the raw FASTQ files and the metadata file.

cd ~/workspace
mkdir bmb_module2/ bmb_module2/ITS_analysis
cd bmb_module2/ITS_analysis
ln -s ~/CourseData/MIC_data/bmb_raw_data/ITS_raw_data raw_data
ln -s ~/CourseData/MIC_data/bmb_raw_data/pregnancy_ITS_metadata.tsv .

You will have installed the latest version of QIIME2 in Module 1, so you can activate that environment with the command below:

conda activate qiime2-amplicon-2025.4

If you didn't get to that part, you can use this environment instead:

conda activate qiime2-amplicon-backup

Throughout this module, there are some questions aimed to help your understanding of some of the key concepts. You'll find the answers at the bottom of this page, but no one will be marking them.

3.1. ITS First steps

3.1.1. Inspect raw data

First, let's take a look at the directory containing our raw reads as well as our metadata file.

ls raw_data

head pregnancy_ITS_metadata.tsv

Question 1: How many samples are there?

Question 2: Into what group(s) are the samples classified?

3.1.2. Import FASTQs as QIIME2 artifact

mkdir reads_qza

qiime tools import \
  --type SampleData[PairedEndSequencesWithQuality] \
  --input-path raw_data/ \
  --output-path reads_qza/reads.qza \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt

This might take a minute! If it hasn't come back up with the command prompt that looks something like (qiime2-amplicon-2025.4) ubuntu@ip-10-0-1-248:~/workspace/bmb_module2/ITS_analysis$ yet, then it hasn't finished running yet and you'll need to be patient :)

3.1.3. Trim primers with cutadapt

Screen out reads that do not begin with primer sequence and remove primer sequence from reads using the cutadapt QIIME 2 plugin. The below primers correspond to the 18S V8-V9 region (eukaryote-specific primer set). You can see more about some other primers and the taxa that they target here.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences reads_qza/reads.qza \
  --p-cores 4 \
  --p-front-f GCATCGATGAAGAACGCAGC \
  --p-front-r TCCTCCGCTTATTGATATGC \
  --p-discard-untrimmed \
  --p-no-indels \
  --o-trimmed-sequences reads_qza/reads_trimmed.qza

Visualizing your output data is a good idea after any step to make sure nothing unexpected occurred. The following command generates a "visualization" file with the extension QZV.

qiime demux summarize \
  --i-data reads_qza/reads_trimmed.qza \
  --o-visualization reads_qza/reads_trimmed_summary.qzv

Remember that you can see and download these files on the webpage: http://##.uhn-hpc.ca/ (replace ## with your number!)

Question 3: What would happen if you ran this exact command on 16S V4/V5-amplified sequences?

3.2. ITS Denoising the reads into amplicon sequence variants

2.1. Join paired-end reads

Forward and reverse reads can be joined with VSEARCH as shown below. This will generate QZA files for both the joined/merged sequences and unmerged sequences.

qiime vsearch merge-pairs \
  --i-demultiplexed-seqs reads_qza/reads_trimmed.qza \
  --output-dir reads_qza/reads_joined

3.2.2. Filter out low-quality reads

This command will filter out low-quality reads based on the default options.

qiime quality-filter q-score \
  --i-demux reads_qza/reads_joined/merged_sequences.qza \
  --o-filter-stats filt_stats.qza \
  --o-filtered-sequences reads_qza/reads_trimmed_joined_filt.qza

3.2.3. Summarize joined and filtered reads

qiime demux summarize \
  --i-data reads_qza/reads_trimmed_joined_filt.qza \
  --o-visualization reads_qza/reads_trimmed_joined_filt_summary.qzv

Now open the file in QIIME2 View and look at the Overview and Interactive Quality Plot tabs to explore your data and answer the following questions.

Question 4: How long are our forward reads? Why are there no reverse reads in our file?

Question 5: What would be a good trim length for our reads? Remember that there are answers at the bottom of the page if you would like to check this.

3.2.4. Running Deblur

Running the Deblur workflow will correct the raw reads into amplicon sequence variants (ASVs). This denoising tool filters out reads that either do match to known noise or that do not match with low similarity to the expected amplicon region. Note that the below command will retain singletons, which would have been filtered out unless we set --p-min-reads 1, and is for 16S sequences only. While for 16S there is a denoise-16S option that doesn't require any reference sequences, for other amplicon regions (like 18S and ITS), you can use the denoise-other option and specify a reference database of sequences to use for positive filtering. We keep these files available for others on our lab server here.

We'll go ahead and download the 18S sequences. You can do that by right-clicking on the UNITE_sh_refs_qiime_ver8_99_s_all_02.02.2019.fasta file, and clicking "Copy Link Address". You can then use the wget command, like we did in Module 1. So your command should look something like:

wget link-address

Then you can import this into QIIME:

qiime tools import \
   --input-path UNITE_sh_refs_qiime_ver8_99_s_all_02.02.2019.fasta \
   --output-path UNITE_sh_refs_qiime_ver8_99_s_all_02.02.2019.qza \
   --type 'FeatureData[Sequence]'

You'll need to replace X here with the trim length that was decided in Question 5.

qiime deblur denoise-other \
   --i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
   --i-reference-seqs UNITE_sh_refs_qiime_ver8_99_s_all_02.02.2019.qza \
   --p-trim-length X \
   --p-sample-stats \
   --p-jobs-to-start 4 \
   --p-min-reads 1 \
   --output-dir deblur_output

Note: this command may take around ten minutes to run.

3.2.5. Summarizing Deblur output

qiime feature-table summarize \
  --i-table deblur_output/table.qza \
  --o-visualization deblur_output/deblur_table_summary.qzv

Question 6: What is the mean sequencing depth per sample after denoising?

Question 7: Which sample has the least reads?

3.3. ITS Assign taxonomy to ASVs

You can assign taxonomy to your ASVs using a Naive-Bayes approach implemented in the scikit learn Python library and the SILVA or UNITE databases. This approach requires that a classifier be trained in advance on a reference database. We recommend users use a widely used classifier to help ensure there are no unexpected issues with the Naive-Bayes model. We previously maintained primer-specific classifiers, which theoretically can provide more accurate classifications, but we no longer do this due to concerns regarding issues with the trained models that are difficult to catch if only a couple people are running them. The full-length 16S/18S classifier can be downloaded from the QIIME 2 website (silva-138-99-nb-classifier.qza for the latest classifier). Custom classifiers for the ITS region that we have generated from the UNITE database are available as well (see downloads and commands used to create these files).

We're going to use one that contains fungal ITS sequences from the UNITE database. As you did above, click on "Copy Link Address" for the file classifier_fungi_ITS_sh_taxonomy_sh_taxonomy_qiime_ver10_99_04.04.2024_dev.qza and then use wget to download it. Check how you ran this command before if you're unsure!

3.3.1. Run taxonomic classification

Because of the memory constraints of this workshop, you can copy the output into your folder:

But here is the code you would use to run the taxonomic classification:

qiime feature-classifier classify-sklearn \
  --i-reads deblur_output/representative_sequences.qza \
  --i-classifier classifier_fungi_ITS_sh_taxonomy_sh_taxonomy_qiime_ver10_99_04.04.2024_dev.qza \
  --p-n-jobs 4 \
  --output-dir taxa

Unfortunately we don't actually have enough memory to run this so you'll see a Killed output for this. Instead, we'll copy across the output that we would have got.

mkdir taxa      
cp ~/CourseData/MIC_data/bmb_module2/ITS/taxa/classification.qza taxa

As with all QZA files, you can export the output file to take a look at the classifications and confidence scores:

qiime tools export \
  --input-path taxa/classification.qza \
  --output-path taxa

3.3.2. Assess subset of taxonomic assignments with BLAST

qiime feature-table tabulate-seqs \
  --i-data deblur_output/representative_sequences.qza \
  --o-visualization deblur_output/representative_sequences.qzv

grep "af38114da4fce5a55693cb937b8991ad" taxa/taxonomy.tsv

You should see that this sequence has been classified as k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Aspergillaceae;g__Aspergillus with confidence ~0.99. If we look at the taxonomy on NCBI BLAST, then we'll see that the top hits are Aspergillus cristatus and Aspergillus chevalieri, both with 99.67% identity. If it's just as good a match to two different species within the same genus, it makes sense why we just have the genus-level classification Aspergillus here, and seems reasonable.

Try taking a look at a few more and seeing how they seem.

3.4. ITS Filtering resultant table

Filtering the denoised table is an important step of microbiome data analysis. You can see more details on this process in the QIIME 2 filtering tutorial.

3.4.1. Filter out rare ASVs

Once you've determined how you would like to filter your table you can do so with this command (X is a placeholder for your choice):

qiime feature-table filter-features \
  --i-table deblur_output/table.qza \
  --p-min-frequency X \
  --p-min-samples 1 \
  --o-filtered-table deblur_output/deblur_table_filt.qza

3.4.2. Filter out contaminant and unclassified ASVs

Once we have assigned taxonomy to our ASVs we can use that information to remove ASVs which are likely contaminants or noise based on the taxonomic labels. Two common contaminants in 16S sequencing data are mitochondrial and chloroplast 16S sequences, which can be removed by excluding any ASV which contains those terms in its taxonomic label. We don't really have that issue with ITS data, however. It can also be sometimes useful to exclude any ASV that is unclassified at the phylum level since these sequences could be noise (e.g. possible chimeric sequences). Note that if your data has not been classified against the default database you may need to change p__ to be a string that enables phylum-level assignments to be identified or simply omit that line.

qiime taxa filter-table \
  --i-table deblur_output/deblur_table_filt.qza \
  --i-taxonomy taxa/classification.qza \
  --p-include p__ \
  --o-filtered-table deblur_output/deblur_table_filt_contam.qza

3.4.3. (Optional) Exclude low-depth samples

3.4.4. Subset and summarize filtered table

Check output after filtering.

qiime feature-table summarize \
  --i-table deblur_output/deblur_table_filt_contam.qza \
  --o-visualization deblur_output/deblur_table_filt_contam_summary.qzv

Question 8: What is the minimum and maximum sequencing depth across all samples?

Question 8 answer: The final minimum sequencing depth is 192 and the maximum sequencing depth is 80,762 reads. This huge difference from the previous step (where it was actually a different sample that had the lowest depth!) would definitely indicate some problems if this were our own project. This was not a project that I was involved with, and this could just be because the ITS region is more difficult to accurately taxonomically classify than the 16S region, but I would definitely want to do some further investigating as to what is going on. We only removed ASVs that didn't have phylum-level classifications, so the first thing that I would do is look at the taxonomy.tsv file. If we open this in Excel and sort based on Column B, we can see that 114/961 total ASVs within our samples have only been classified as Fungi (these are the ones that we would have filtered out). If we just have a quick search of a few of these ASVs in the deblur_table_summary.qzv Feature Detail tab, then we'll see that e.g. 87121182b8f0d093e937d915b2b58b22 is present in 13 samples at a total frequency of 15,889 reads. Another one, 1086009b09b17b9364cb1ae984a552e1 is similar; it is present in 18 samples at a total frequency of 35,799 reads. Based on this, I don't think I'd want to filter these out. So unlike with the other amplicons, I think I'd probably use the unfiltered table here.

Copy a final table.

cp deblur_output/deblur_table_filt.qza deblur_output/deblur_table_final.qza

Once we have our final filtered table we will need to subset the QZA file containing the ASV sequences to the same set. You can exclude any removed ASVs from the sequence file with this command:

qiime feature-table filter-seqs \
  --i-data deblur_output/representative_sequences.qza \
  --i-table deblur_output/deblur_table_final.qza  \
  --o-filtered-data deblur_output/rep_seqs_final.qza

Finally, you can make a new summary of the final filtered abundance table:

qiime feature-table summarize \
  --i-table deblur_output/deblur_table_final.qza \
  --o-visualization deblur_output/deblur_table_final_summary.qzv

ITS Answers

Question 1: How many samples are there?

We have 20 samples. The raw data folder contains one fastq.gz file for each set of the forward and reverse sequences (labelled R1 and R2, respectively) for each of the samples (CRRxxxxxxx).

Question 2: Into what group(s) are the samples classified?

The samples we are using are classified into two different groups: below_20 or above_35. The study we're looking at had information on the age of the pregnant women, so that's what we're looking at here. We also have their exact age in these samples.

Question 3: What would happen if you ran this exact command on V4/V5-amplified sequences?

Cutadapt will only trim reads that match the specified primer sequence. Therefore, most reads would be discarded because we are including the --p-discard-untrimmed option.

Question 4: How long are our forward reads? Why are there no reverse reads in our file?

The forward read median length is 364 nucleotides. There are no reverse reads because forward and reverse reads were merged into one sequence during read joining.

Question 5: What would be a good trim length for our reads?

There is no one right answer for this question, but a trim length of 300 nucleotides will maintain most of our sequences apart from any that are really short. You could choose a different length, but this will give different answers further on. Typically, this choice is a trade-off between maintaining a longer sequence length where we're more likely to get finer-resolution taxonomic classification, or trimming more so that we retain more sequences.

Question 6: What is the mean sequencing depth per sample after denoising?

The mean sequencing depth (frequency) across all denoised samples is 62,566.6 reads.

Question 7: Which sample has the least reads?

Sample CRR1039764 has the lowest sequencing depth (21,453 reads).

CBW‐2025‐BMB‐Module2 - LangilleLab/microbiome_helper GitHub Wiki

Module 2: Marker gene profiling

Introduction

Table of Contents

1. 16S

1.1. 16S First steps

1.1.1. Inspect raw data

1.1.2. Import FASTQs as QIIME2 artifact

1.1.3. Trim primers with cutadapt

1.2. 16S Denoising the reads into amplicon sequence variants

1.2.1. Join paired-end reads

1.2.2. Filter out low-quality reads

1.2.3. Summarize joined and filtered reads

1.2.4. Running Deblur

1.2.5. Summarizing Deblur output

1.3. 16S Assign taxonomy to ASVs

1.3.1. Run taxonomic classification

1.3.2 Assess subset of taxonomic assignments with BLAST

1.4. 16S Filtering resultant table

1.4.1. Filter out rare ASVs

1.4.2. Filter out contaminant and unclassified ASVs

1.4.3. (Optional) Exclude low-depth samples

1.4.4. Subset and summarize filtered table

16S Answers

2. 18S

2.1. 18S First steps

2.1.1. Inspect raw data

2.1.2. Import FASTQs as QIIME2 artifact

2.1.3. Trim primers with cutadapt

2.2. 18S Denoising the reads into amplicon sequence variants

2.2.1. Join paired-end reads

2.2.2. Filter out low-quality reads

2.2.3. Summarize joined and filtered reads

2.2.4. Running Deblur

2.2.5. Summarizing Deblur output

2.3. 18S Assign taxonomy to ASVs

2.3.1. Run taxonomic classification

2.3.2 Assess subset of taxonomic assignments with BLAST

2.4. 18S Filtering resultant table

2.4.1. Filter out rare ASVs

2.4.2. Filter out contaminant and unclassified ASVs

2.4.3. (Optional) Exclude low-depth samples

2.4.4. Subset and summarize filtered table

18S Answers

3. ITS

3.1. ITS First steps

3.1.1. Inspect raw data

3.1.2. Import FASTQs as QIIME2 artifact

3.1.3. Trim primers with cutadapt

3.2. ITS Denoising the reads into amplicon sequence variants

2.1. Join paired-end reads

3.2.2. Filter out low-quality reads

3.2.3. Summarize joined and filtered reads

3.2.4. Running Deblur

3.2.5. Summarizing Deblur output

3.3. ITS Assign taxonomy to ASVs

3.3.1. Run taxonomic classification

3.3.2. Assess subset of taxonomic assignments with BLAST

3.4. ITS Filtering resultant table

3.4.1. Filter out rare ASVs

3.4.2. Filter out contaminant and unclassified ASVs

3.4.3. (Optional) Exclude low-depth samples

3.4.4. Subset and summarize filtered table

ITS Answers