Microbiome Helper 2 Marker gene workflow
Authors: Robyn Wright, Monica Alvaro Fuss, Nidhi Gohil
Modifications by: NA
Based on initial versions by: Gavin Douglas (Amplicon SOP v2 (QIIME2 2018.6)) and André Comeau (PacBio CCS Amplicon SOP v1 (QIIME2))
Please note: We think that everything here should work, but we are still testing/developing this so use with caution :)
In this workflow, we will go from our raw reads (as they come off a sequencer), and take these all the way through to denoised Amplicon Sequence Variants (ASVs), a feature table containing representative sequences, and a phylogenetic tree.
The pipeline described is embedded in the latest version of QIIME2 (Quantitative Insights into Microbial Ecology version 2025.4), which is a popular microbiome bioinformatics platform for microbial ecology built on user-made software packages called plugins that work on QIIME2 artifact or QZA files. Documentation for these plugins can be found in the QIIME 2 user documentation, along with tutorials and other useful information. QIIME2 also provides interpretable visualizations that can be accessed by opening any generated QZV files within QIIME2 View.
Many steps in this workflow are the same whether you have sequenced your samples using Illumina or PacBio. The steps that differ are split to Illumina/PacBio.
There are some questions throughout that are designed to make you think about what the steps you are running are doing. The answers provided are for the Arctic Ocean tutorial data.
If you use this workflow make sure to keep track of the commands you use locally as this page will be updated over time (see "revisions" above for earlier versions).
To standardize QIIME 2 analyses and to keep track of provenance (i.e. a list of what commands were previously run to produce a file) a special format is used for all QIIME 2 input and output files called an "artifact" (with the extension QZA). In many of these steps, we will also create QIIME 2 files that can be viewed on their QIIME 2 View webpage (with the extension QZV).
It is assumed that you already have a folder containing your raw data called `raw_data/`. This will contain either single- or paired-end reads. I would recommend creating this inside a folder that describes your project - if you are using the tutorial data, then this could be inside a folder called `arctic_ocean_illumina` or `arctic_ocean_pacbio`, so that you have e.g. `arctic_ocean_illumina/raw_data`.
It is also assumed that these reads are in demultiplexed FASTQ format. QIIME 2 accepts many different formats, so if your files are not already in this format (e.g. not demultiplexed) you would just need to use slightly different commands for importing your data.
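For example, if your reads were still multiplexed in the EMP format, importing and demultiplexing could look something like the sketch below. This assumes a directory (here called `emp_multiplexed/`) containing `forward.fastq.gz`, `reverse.fastq.gz`, and `barcodes.fastq.gz`, plus a metadata file with a barcode column; the file names and the `barcode-sequence` column name are hypothetical, so check them against your own data.

# Import multiplexed EMP paired-end reads (directory/file names are placeholders)
qiime tools import \
--type EMPPairedEndSequences \
--input-path emp_multiplexed/ \
--output-path emp_multiplexed.qza

# Demultiplex using the barcode sequences listed in your metadata file
qiime demux emp-paired \
--i-seqs emp_multiplexed.qza \
--m-barcodes-file metadata.txt \
--m-barcodes-column barcode-sequence \
--o-per-sample-sequences reads_demux.qza \
--o-error-correction-details demux_details.qza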
You should run the rest of the workflow in a conda environment, which makes sure the correct version of the Python packages required by QIIME 2 are being used. Our Setting up environments for analysis page contains details on how to download and install QIIME2. Note that this workflow should work with the same QIIME2 version as we are using (i.e., in the name of the environment below) - there are not usually major changes between QIIME2 versions, but there may be small changes in the commands needed.
You can activate this conda environment with this command (you may need to swap in `source` for `conda` if you get an error):
conda activate qiime2-amplicon-2025.4
Several commands throughout this workflow can run on multiple cores in parallel. How many cores to use in these cases will be saved to the `NCORES` variable defined below. We set this variable to 1 below, but you can change this to be however many cores you would like to use.
NCORES=1
Visualize sequence quality across raw reads. This is important as a sanity check that your reads are of reasonable quality and to determine how your reads should be trimmed in downstream steps. QIIME 2 comes with a plugin for visualizing read quality, which we will use at a later step. However, when dealing with raw reads the easiest method to use is a combination of FASTQC and MultiQC. Note that these tools are not packaged with QIIME 2, so you will need to install them separately.
This is an important step for identifying outlier samples with especially low quality, unusual read lengths, low read depth, or other anomalous metrics.
Note that we don't actually show the steps for removing any samples here. There are other steps later on where low-quality sequences will be removed, however, if a large number of your samples show very low quality here, this may be an indication that something went wrong with your sequencing and you may want to investigate this further before carrying on with your analysis.
You can run FASTQC with this command (after creating the output directory).
mkdir fastqc_out
fastqc -t $NCORES raw_data/*.fastq.gz -o fastqc_out
If you receive the error `Value "FASTQ" invalid for option threads (number expected)` (where "FASTQ" is an input filename), then make sure you have defined the `NCORES` variable correctly and re-run the command.
FASTQC generates a report for each individual file. To aggregate the summary files into a single report we can run MultiQC with these commands:
multiqc fastqc_out --filename multiqc_report.html
The full report is found within `multiqc_report.html`. You can view this report in a web browser on your local computer. The most important reason to visualize this report is to ensure that your samples are of high quality (based largely on whether the per-base quality is >30 across most of the reads) and that there are no outlier samples.
Remember that you can use the `scp` command to download these files to your local computer.
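For example (the username, server address, and path here are placeholders - swap in your own), run from your local computer:

scp user@remote.server:/home/user/arctic_ocean_illumina/multiqc_report.html .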
You can format the metadata file (which is in a compatible format to the QIIME 1 mapping files) in a spreadsheet and validate the format using a Google Sheets plugin as described on the QIIME 2 website. The only required column is the sample id column, which should be first. All other columns should correspond to sample metadata. Below, we assume that your metadata filename has been assigned to a bash variable called "$METADATA".
This can be done like so (for example if your file is called metadata.txt and is found in the folder /home/user):
METADATA="/home/user/project_name/metadata.txt"
If you are working from the directories/folders that we have suggested, this will likely look like this:
METADATA="/home/user/arctic_ocean_illumina/arctic_study_metadata.txt"
If your file names within `raw_data` follow the expected Illumina format (described below), then you can just import them using the QIIME 2 command. If not, then you should follow the instructions for importing them using a manifest file.
Expected Illumina format: `105CHE6WT_S325_L001_R1_001.fastq.gz`, where each field separated by an `_` character corresponds to:
- The sample name (105CHE6WT)
- The sample number on the run (S325)
- The lane number (L001)
- The read number (R1 for forward and R2 for reverse)
- The set number (001)
Import the raw reads as a QZA file.
mkdir reads_qza
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path raw_data/ \
--output-path reads_qza/reads.qza \
--input-format CasavaOneEightSingleLanePerSampleDirFmt
If your samples are not in this expected Illumina format, you can use a manifest file to import them. This is just a file that tells QIIME 2 the path to each of your files, as well as what each of the samples should be called.
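For reference, a paired-end manifest (the `PairedEndFastqManifestPhred33V2` format used below) is a tab-separated file with one row per sample; the sample names and paths here are hypothetical:

sample-id	forward-absolute-filepath	reverse-absolute-filepath
sample-1	/home/user/arctic_ocean_illumina/raw_data/sample-1_R1.fastq.gz	/home/user/arctic_ocean_illumina/raw_data/sample-1_R2.fastq.gz

A single-end manifest (`SingleEndFastqManifestPhred33V2`) instead has just the columns `sample-id` and `absolute-filepath`.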
We have made a script that makes this file for you if you don't have it. Skip to the import step if you already have one.
First, download the `make_manifest.py` script:
wget http://kronos.pharmacology.dal.ca/public_files/MH2/scripts/make_manifest.py
There are then two different ways to run this. First, you can run it with just the folder that contains your reads and an output file name:
python make_manifest.py -f raw_data/ -o Illumina_manifest.txt
It will automatically detect whether your data is single-end or paired-end based on the presence of `_R1`/`_r1`/`_1` within the file names.
Second, you can run it with a metadata file. The first column of this should contain the sample names that match the file names for your reads. If you also include a column called `sample_rename`, it will rename the samples to the names that you give it when QIIME2 reads them in.
python make_manifest.py -f raw_data/ -o Illumina_manifest_metadata.txt -m arctic_study_metadata_illumina.txt
Import the raw reads as a QZA file.
With paired-end data (i.e. forward and reverse reads):
mkdir reads_qza
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path Illumina_manifest.txt \
--output-path reads_qza/reads.qza \
--input-format PairedEndFastqManifestPhred33V2
With single-end data (e.g. PacBio or other formats where the reads have already been joined or only a single read is sequenced):
mkdir reads_qza
qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path PacBio_manifest.txt \
--output-path reads_qza/reads.qza \
--input-format SingleEndFastqManifestPhred33V2
All of the FASTQs are now in the single artifact file `reads_qza/reads.qza`. This file format can be a little confusing at first, but it is actually just a zipped folder. You can manipulate and explore these files better with the `qiime tools` utilities (e.g. `peek` and `view`).
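For example, `qiime tools peek` reports the UUID, semantic type, and format of an artifact, and because a QZA is just a zip file you can also list its contents directly:

qiime tools peek reads_qza/reads.qza

unzip -l reads_qza/reads.qza | head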
It is often useful to check that the files have imported as we expected. QIIME 2 has some helpful QIIME 2 view files (see above) that we can use to check the number of samples that we've imported, as well as the number of reads that we have in them.
qiime demux summarize \
--i-data reads_qza/reads.qza \
--o-visualization reads_qza/reads_summary.qzv
You can run the demux summarize command on any of the reads files to get a report of the number of reads per sample and quality distribution across the reads. This generates a more basic output compared to FASTQC/MultiQC, but is sufficient for these purposes.
Note that we gave the output file above the extension `.qzv`, since this is a special type of artifact file - a visualization. You can look at the visualization by uploading the file to the QIIME2 View website and clicking on the `Interactive Quality Plot` tab at the top of the page.
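If you are working on a machine with a web browser available, you can also open QZV files locally rather than uploading them (assuming your QIIME 2 install supports this):

qiime tools view reads_qza/reads_summary.qzv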
These next steps are split based on whether you have Illumina or PacBio data.
Screen out reads that do not begin with primer sequence and remove primer sequence from reads using the cutadapt QIIME 2 plugin. The below primers correspond to the 16S V4-V5 region ("universal" primer set).
qiime cutadapt trim-paired \
--i-demultiplexed-sequences reads_qza/reads.qza \
--p-cores $NCORES \
--p-front-f GTGYCAGCMGCCGCGGTAA \
--p-front-r CCGYCAATTYMTTTRAGTTT \
--p-discard-untrimmed \
--p-no-indels \
--o-trimmed-sequences reads_qza/reads_trimmed.qza
If you receive the error `--p-cores option requires an argument`, make sure that you set the `NCORES` bash variable as described above.
If using 16S V6-V8 Bacterial primers, use these options in the above command:
--p-front-f ACGCGHNRAACCTTACC \
--p-front-r ACGGGCRGTGWGTRCAA \
If using 16S V6-V8 Archaeal primers, use these options in the above command:
--p-front-f TYAATYGGANTCAACRCC \
--p-front-r CRGTGWGTRCAAGGRGCA \
If using 18S V4 primers, use these options in the above command:
--p-front-f CYGCGGTAATTCCAGCTC \
--p-front-r AYGGTATCTRATCRTCTTYG \
If using Fungal ITS2 primers, use these options in the above command:
--p-front-f GTGAATCATCGAATCTTTGAA \
--p-front-r TCCTCCGCTTATTGATATGC \
You can see a list of commonly used primers, their targets, and their sequences on our IMR protocols page (scroll down to the "Currently Available Amplicon Targets/Primers (recommended sets in bold)" section).
Visualizing your output data is a good idea after any step to make sure nothing unexpected occurred. The following command generates a "visualization" file with the extension QZV. It is good to check that we still have most of our reads after trimming the primers! If you have very few or no reads left, this suggests that you got the primer sequence wrong in the above `cutadapt` command.
qiime demux summarize \
--i-data reads_qza/reads_trimmed.qza \
--o-visualization reads_qza/reads_trimmed_summary.qzv
At this stage, the 2 main pipelines you can use are based on either Deblur or DADA2. Below we will describe the commands for running Deblur. See here if you are interested in running DADA2.
Currently deblur doesn't support unjoined paired-end reads. Forward and reverse reads can be joined with VSEARCH as shown below. Note that this command is more stringent than other approaches like PEAR when joining reads. If a low proportion of reads are joined at this step then one solution would be to export the data from QIIME2, run PEAR (see the run_pear.pl script that is part of this repository), and then re-import your data. In order to avoid those extra export/import manipulations, another solution is to restart from the raw reads, replacing the initial steps here until the Deblur/DADA2 step below - see our instructions for Alternative Processing of Reads with PEAR.
qiime vsearch merge-pairs \
--i-demultiplexed-seqs reads_qza/reads_trimmed.qza \
--o-merged-sequences reads_qza/reads_trimmed_joined.qza \
--o-unmerged-sequences reads_qza/reads_trimmed_unjoined.qza
This command will filter out low-quality reads based on the default options.
qiime quality-filter q-score \
--i-demux reads_qza/reads_trimmed_joined.qza \
--o-filter-stats filt_stats.qza \
--o-filtered-sequences reads_qza/reads_trimmed_joined_filt.qza
At this point, it is a good idea to verify that there haven't been any substantial losses of reads at either the joining or quality-filtering steps above, before going through the whole ASV process:
qiime demux summarize \
--i-data reads_qza/reads_trimmed_joined_filt.qza \
--o-visualization reads_qza/reads_trimmed_joined_filt_summary.qzv
Run the Deblur workflow to correct reads and get amplicon sequence variants (ASVs). Note that the below command will retain singletons, which would have been filtered out if we had not set `--p-min-reads 1`, and is for 16S sequences only. For other amplicon regions, you can either use the `denoise-other` option in the command and specify a reference database of sequences to use for positive filtering (as in the below versions for 18S and ITS) or use DADA2 as described below (with Deblur still being preferred due to speed considerations).
Note that you will need to trim all sequences to the same length with the `--p-trim-length` option if you get an error. To determine the correct length to trim down to, take a look at the summary visualization generated above and select a length that retains the largest acceptable quantity of reads.
We typically choose this length based on the `Demultiplexed sequence length summary` section of the `Interactive Quality Plot` tab in the `reads_trimmed_joined_filt_summary.qzv` file. In this summary, a subset of your sequences has been sampled to determine the lengths. This will look something like this:
| Total Sequences Sampled | 10000 |
|---|---|
| 2% | 366 nts |
| 9% | 368 nts |
| 25% | 370 nts |
| 50% (median) | 372 nts |
| 75% | 374 nts |
| 91% | 376 nts |
| 98% | 377 nts |
This means that 98% of my sequences (100% - 2%) are at least 366 nts long, 91% (100% - 9%) are at least 368 nts long, and so on. In this case, I would probably choose 366 as my cut-off because 366 still seems long enough that I can get reasonable taxonomic classification and it is within what is expected with my primer set, but I am removing any sequences that are very short. In some cases, you may want to do different things, for example, if you expect some off-target amplification that is much shorter than your amplicon target, you would want to trim to a length that would remove that off-target amplification. This can also be easier to choose for the 16S rRNA gene, where the length of the gene (and therefore the variable regions of the gene) is much more consistent than for e.g. 18S or ITS2.
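If you would like to double-check these lengths outside of the QZV file, one option (a sketch, using standard shell tools) is to unpack the filtered reads and tabulate read lengths directly from the FASTQ files. Note that `qiime tools extract` unpacks the artifact into a folder named after its UUID:

qiime tools extract \
--input-path reads_qza/reads_trimmed_joined_filt.qza \
--output-path extracted_reads

# Print how many reads have each length, across all samples
zcat extracted_reads/*/data/*.fastq.gz | awk 'NR%4==2 {print length($0)}' | sort -n | uniq -c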
Finally, input that length in the Deblur option `--p-trim-length` below and run the command.
qiime deblur denoise-16S \
--i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
--p-trim-length -1 \
--p-sample-stats \
--p-jobs-to-start $NCORES \
--p-min-reads 1 \
--output-dir deblur_output
When running 18S and ITS data with deblur you will need to specify custom reference files, which can be downloaded here.
For the 18S version (note the first command imports the deblur database and only needs to be run once on your system):
qiime tools import \
--input-path /home/shared/rRNA_db/gb203_pr2_all_10_28_99p_clean_prob-rm.fasta \
--output-path /home/shared/rRNA_db/gb203_pr2_all_10_28_99p_clean_prob-rm.qza \
--type 'FeatureData[Sequence]'
qiime deblur denoise-other \
--i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
--i-reference-seqs /home/shared/rRNA_db/gb203_pr2_all_10_28_99p_clean_prob-rm.qza \
--p-trim-length -1 \
--p-sample-stats \
--p-jobs-to-start $NCORES \
--p-min-reads 1 \
--output-dir deblur_output
For the ITS version (note the first command imports the deblur database and only needs to be run once on your system):
qiime tools import \
--input-path /home/shared/rRNA_db/UNITE_sh_refs_qiime_ver8_99_s_all_02.02.2019.fasta \
--output-path /home/shared/rRNA_db/UNITE_sh_refs_qiime_ver8_99_s_all_02.02.2019.qza \
--type 'FeatureData[Sequence]'
qiime deblur denoise-other \
--i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
--i-reference-seqs /home/shared/rRNA_db/UNITE_sh_refs_qiime_ver8_99_s_all_02.02.2019.qza \
--p-trim-length -1 \
--p-sample-stats \
--p-jobs-to-start $NCORES \
--p-min-reads 1 \
--output-dir deblur_output
Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for you to view.
qiime feature-table summarize \
--i-table deblur_output/table.qza \
--o-visualization deblur_output/deblur_table_summary.qzv
We will use this visualization later to determine the cut-offs for filtering the table below, but for now you should mainly take a look at the visualization to ensure that sufficient reads have been retained after running Deblur. This denoising tool filters out reads that either match known noise or that match the expected amplicon region only with low similarity. If your samples have very low depth after running Deblur (compared to the input read depth), this could be a red flag that you ran the tool incorrectly, that you have a lot of noise in your data, or that Deblur is inappropriate for your dataset.
So that the rest of the steps in the workflow can be identical regardless of whether you've used Deblur or DADA2, we'll just copy the Deblur output.
mkdir denoised_output
cp deblur_output/*.qza denoised_output/
At this stage, the 2 main pipelines you can use are based on either Deblur or DADA2. We recommend running DADA2, which now supports PacBio reads; Deblur may not work correctly since it does not currently have a PacBio mode.
Run the DADA2 workflow to remove primers, correct reads, and get amplicon sequence variants (ASVs). Note that the previous version of the PacBio CCS Amplicon SOP included two steps ("Resolve orientation problems" and "Trim primers with cutadapt") prior to running DADA2. Those steps are now included as part of DADA2's `denoise-ccs` mode described below. Also, note that we are not doing any form of initial quality filtering (unlike our Illumina SOP) or truncation in the below command, because CCS reads are already of very high quality when produced (HiFi reads are 99% consensus accuracy). You will probably also want to increase the number of threads used below to the maximum your system has available. If your PacBio data is not CCS in nature, or you have used a lower consensus threshold (i.e. <99%), then we would suggest you add in a quality-filter step. The below primers correspond to the full-length 16S (Bacteria-specific primer set).
mkdir dada2_output
qiime dada2 denoise-ccs \
--i-demultiplexed-seqs reads_qza/reads.qza \
--p-min-len 1200 --p-max-len 1800 \
--p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT \
--p-n-threads $NCORES \
--o-table dada2_output/table.qza \
--o-representative-sequences dada2_output/representative_sequences.qza \
--o-denoising-stats dada2_output/stats.qza \
--verbose
Note we are using linked anchored adapters here to ensure only reads having the primers at the extremities are retained, and a size-range of 1200-1800 nt to allow some indels (there should be relatively few in actuality) but prevent amplicon dimers (we've observed ~1% in PacBio CCS data) from passing through, since the size of the 16S amplicon here is ~1500 nt.
This will take quite a long time to run! If you're in a workshop and would just like the output, you can download it below.
mkdir dada2_output
cd dada2_output
wget http://kronos.pharmacology.dal.ca/public_files/MH2/MH2_AWS_out/arctic_ocean_pacbio/dada2_output/representative_sequences.qza
wget http://kronos.pharmacology.dal.ca/public_files/MH2/MH2_AWS_out/arctic_ocean_pacbio/dada2_output/stats.qza
wget http://kronos.pharmacology.dal.ca/public_files/MH2/MH2_AWS_out/arctic_ocean_pacbio/dada2_output/table.qza
cd ..
If using our full-length 16S Archaea-specific primers (currently in testing), use the following command:
qiime dada2 denoise-ccs \
--i-demultiplexed-seqs reads_qza/raw_reads.qza \
--p-min-len 1200 --p-max-len 1800 \
--p-front TCCGGTTGATCCYGCCGG --p-adapter CRGTGWGTRCAAGGRGCA \
--p-n-threads $NCORES \
--o-table dada2_output/table.qza \
--o-representative-sequences dada2_output/representative_sequences.qza \
--o-denoising-stats dada2_output/stats.qza \
--verbose
If using our full-length 18S primers, use the following command (note the much wider size-range around the ~1800 nt normal size, due to the fact that some species of Eukaryotes have much larger indels in their rRNA than Bacteria do):
qiime dada2 denoise-ccs \
--i-demultiplexed-seqs reads_qza/raw_reads.qza \
--p-min-len 1000 --p-max-len 3000 \
--p-front CTGGTTGATYCTGCCAGT --p-adapter TGATCCTTCTGCAGGTTCACCTAC \
--p-n-threads $NCORES \
--o-table dada2_output/table.qza \
--o-representative-sequences dada2_output/representative_sequences.qza \
--o-denoising-stats dada2_output/stats.qza \
--verbose
If using our full-length fungal ITS primers, use the following command (again, size-range is modified for the proper range to keep variation, but hopefully remove larger dimers; ~600 nt average size):
qiime dada2 denoise-ccs \
--i-demultiplexed-seqs reads_qza/raw_reads.qza \
--p-min-len 300 --p-max-len 900 \
--p-front TAGAGGAAGTAAAAGTCGTAA --p-adapter TCCTCCGCTTWTTGWTWTGC \
--p-n-threads $NCORES \
--o-table dada2_output/table.qza \
--o-representative-sequences dada2_output/representative_sequences.qza \
--o-denoising-stats dada2_output/stats.qza \
--verbose
Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for you to view.
qiime feature-table summarize \
--i-table dada2_output/table.qza \
--o-visualization dada2_output/dada2_table_summary.qzv
We will use this visualization later to determine the cut-offs for filtering the table below, but for now you should mainly take a look at the visualization to ensure that sufficient reads have been retained after running DADA2.
If you see substantial losses of read numbers after this step, then make absolutely certain you are using the correct primer sequences (in the correct linked orientations) and have not filtered out most of your reads due to the size-range restrictions - this would indicate a problem in matching your fragment with the correct parameters (especially important if using custom primers and adjusting all the above values).
So that the rest of the steps in the workflow can be identical regardless of whether you've used Deblur or DADA2, we'll just copy the DADA2 output.
mkdir denoised_output
cp dada2_output/*.qza denoised_output/
You can assign taxonomy to your ASVs using a Naive-Bayes approach implemented in the scikit-learn Python library and one of many databases. We have provided details on downloading these classifiers for several common databases, as well as for building your own. Some databases that we have used:
- SILVA (16S/18S SSU)
- Greengenes 2 (16S)
- UNITE (ITS, fungi and other eukaryotes)
- PR2 (18S, Protist Ribosomal Reference)
- HOMD (16S, Human Oral Microbiome Database)
- GTDB (16S, Genome Taxonomy DataBase)
- MitoFish (12S)
This approach requires that a classifier be trained in advance on a reference database. We recommend using a widely-used classifier to help ensure there are no unexpected issues with the Naive-Bayes model. We previously maintained primer-specific classifiers, which can theoretically provide more accurate classifications, but we no longer do this due to concerns about issues with the trained models that are difficult to catch if only a couple of people are running them. No matter what approach you use, it's a good idea to run a few sanity checks on the output to make sure it worked correctly for your data (see below).
You can see the available downloads on the QIIME 2 resources page. The scikit-learn Naive Bayes classifiers will typically work across multiple QIIME 2 versions - the important things are the database that they are trained on and the scikit-learn version that they use.
Greengenes 2 (scikit-learn 1.4.2):
wget https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes2/2024.09.backbone.full-length.nb.sklearn-1.4.2.qza
SILVA 138 (scikit-learn 1.4.2; note: not recommended for species-level taxonomy):
wget https://data.qiime2.org/classifiers/sklearn-1.4.2/silva/silva-138-99-nb-classifier.qza
GTDB r220 (scikit-learn 1.4.2):
wget https://data.qiime2.org/classifiers/sklearn-1.4.2/gtdb/gtdb_classifier_r220.qza
We also have a few taxonomic classifiers available, some of these we have downloaded from the QIIME 2 website, and others we have trained ourselves. You can see all of these for Sklearn version 1.4.2 here. There should be enough details in all of the file names for you to figure out what they each are.
We will add details here later! For now, you can view a previous version of these commands here.
This is an example for making a HOMD classifier.
Make a folder to work from:
mkdir q2_classifier_homd
cd q2_classifier_homd
Download the relevant files from the HOMD FTP:
wget https://www.homd.org//ftp/16S_rRNA_refseq/HOMD_16S_rRNA_RefSeq/V16.01/HOMD_16S_rRNA_RefSeq_V16.01_full.fasta
wget https://www.homd.org//ftp/16S_rRNA_refseq/HOMD_16S_rRNA_RefSeq/V16.01/HOMD_16S_rRNA_RefSeq_V16.01.qiime.taxonomy
Import them into QIIME2:
qiime tools import --type 'FeatureData[Sequence]' \
--input-path HOMD_16S_rRNA_RefSeq_V16.01_full.fasta \
--output-path HOMD_16S_rRNA_RefSeq_V16.01_seqs.qza
qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \
--input-path HOMD_16S_rRNA_RefSeq_V16.01.qiime.taxonomy \
--output-path HOMD_16S_rRNA_RefSeq_V16.01_tax.qza
Build the classifier:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads HOMD_16S_rRNA_RefSeq_V16.01_seqs.qza \
--i-reference-taxonomy HOMD_16S_rRNA_RefSeq_V16.01_tax.qza \
--o-classifier HOMD_16S_rRNA_RefSeq_V16.01.qza
You can run the taxonomic classification with this command, which is one of the longest-running and most memory-intensive commands of the SOP. If you receive an error related to insufficient memory (and if you cannot increase your memory usage), then you can look into the `--p-reads-per-batch` option and set this to be lower than the default (which is dynamic depending on sample depth and the number of threads), and also try running the command with fewer jobs (e.g. set `--p-n-jobs 1`).
qiime feature-classifier classify-sklearn \
--i-reads denoised_output/representative_sequences.qza \
--i-classifier path_to_the_classifier_you_have_chosen.qza \
--p-n-jobs $NCORES \
--output-dir taxa
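If you do hit memory errors, a reduced-memory variant might look like the below (the batch size of 10000 is just an example value - smaller values use less memory but run more slowly):

qiime feature-classifier classify-sklearn \
--i-reads denoised_output/representative_sequences.qza \
--i-classifier path_to_the_classifier_you_have_chosen.qza \
--p-reads-per-batch 10000 \
--p-n-jobs 1 \
--output-dir taxa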
This will take quite a long time to run! If you're in a workshop and would just like the output, you can download it below.
mkdir taxa
cd taxa
wget http://kronos.pharmacology.dal.ca/public_files/MH2/MH2_AWS_out/arctic_ocean_pacbio/taxa/classification.qza
cd ..
As with all QZA files, you can export the output file to take a look at the classifications and confidence scores. In this case, exporting is required, as we use the exported taxonomy.tsv file later on when exporting all of the final analysis files:
qiime tools export \
--input-path taxa/classification.qza --output-path taxa
The performance of the taxonomic classification is difficult to assess without a gold-standard reference, but nonetheless one basic sanity check is to compare the taxonomic assignments with the top BLASTn hits for certain ASVs.
It is simple to do this with QIIME 2 by running:
qiime feature-table tabulate-seqs --i-data denoised_output/representative_sequences.qza \
--o-visualization denoised_output/representative_sequences.qzv
The file `denoised_output/representative_sequences.qzv` is a QIIME 2 visualization file that you can open in the QIIME 2 viewer. The format makes it easy to BLAST certain ASVs against the NCBI nt database (just click on each of the sequences!). By comparing these BLAST hits with the taxonomic assignment of ASVs generated above you can reassure yourself that the taxonomic assignments overall worked correctly. It's a good idea to select ~5 ASVs to BLAST for this validation, which should be from taxonomically different groups, such as different phyla, according to the taxonomic classifier.
You can then check the taxonomy that was assigned to them using your classifier - you can either open up the taxonomy.tsv file and search for the Feature ID/ASV names there, or you can use the grep command like so:
grep "02fda7604175053e87dc11e2598a89e0" taxa/taxonomy.tsv
Filtering the denoised table is an important step of microbiome data analysis. You can see more details on this process in the QIIME 2 filtering tutorial.
Based on the summary visualization created in the denoising steps above you can choose a cut-off for how frequent a variant needs to be (and optionally how many samples need to have the variant) for it to be retained. One possible choice would be to remove all ASVs that have a frequency of less than 0.1% of the mean sample depth. This cut-off excludes ASVs that are likely due to MiSeq bleed-through between runs (reported by Illumina to be 0.1% of reads). Although advances in Illumina technology mean that this may no longer be the case (and this also may vary for PacBio), this is still what we tend to use.
To calculate this cut-off, you would identify the mean sample depth in the visualization created above (`deblur_output/deblur_table_summary.qzv` or `dada2_output/dada2_table_summary.qzv`), multiply it by 0.001, and round to the nearest integer.
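For example, if the mean sample depth shown in the summary were 7,000 reads (a made-up value for illustration), the cut-off would be 7000 x 0.001 = 7, so you would replace X with 7 in the command below.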
Once you've determined how you would like to filter your table you can do so with this command (X is a placeholder for your choice):
qiime feature-table filter-features \
--i-table denoised_output/table.qza \
--p-min-frequency X \
--p-min-samples 1 \
--o-filtered-table denoised_output/table_filt.qza
Now that we have assigned taxonomy to our ASVs we can use that information to remove ASVs which are likely contaminants or noise based on the taxonomic labels. Two common contaminants in 16S sequencing data are mitochondrial and chloroplast 16S sequences, which can be removed by excluding any ASV which contains those terms in its taxonomic label. It can also be sometimes useful to exclude any ASV that is unclassified at the phylum level since these sequences could be noise (e.g. possible chimeric sequences). Note that if your data has not been classified against the default database, you may need to change `p__` to be a string that enables phylum-level assignments to be identified, or simply omit that line.
In general, though, it can be very informative if your sequencing reads are coming back with significant amounts of unclassified ASVs, as this can indicate upstream analysis problems or that you are studying a poorly characterized environment where you have a good chance of identifying a lot of novel phyla. Therefore, our recommendation is to not filter out the unclassified sequences by default:
qiime taxa filter-table \
--i-table denoised_output/table_filt.qza \
--i-taxonomy taxa/classification.qza \
--p-exclude mitochondria,chloroplast \
--o-filtered-table denoised_output/table_filt_contam.qza
If you aren't trying to remove groups like this (if you have sequenced a different marker gene, for example), then you could just skip this step. Keep in mind that you may also need to modify the file names given to the following commands.
If you do want to exclude all the unclassified ASVs, then run the following instead (with the above caveat that the `--p-include` line might need to be adapted according to the taxonomy database used):
qiime taxa filter-table \
--i-table denoised_output/table_filt.qza \
--i-taxonomy taxa/classification.qza \
--p-include p__ \
--p-exclude mitochondria,chloroplast \
--o-filtered-table denoised_output/table_filt_contam.qza
Often certain samples will have quite low depth after these filtering steps; these samples can be excluded from downstream analyses since they will largely add noise. There is no single cut-off that works best for all datasets, but researchers often use minimum cut-offs within the range of 1000 to 4000 reads. You can also use a cut-off much lower than this if you want to retain all samples except those that failed entirely (e.g. depth < 50 reads). Ideally, you would choose this cut-off after visualizing rarefaction curves to determine at what read depth the richness of your samples plateaus, and choose a cut-off as close to this plateau as possible while retaining sufficient sample size for your analyses.
To perform this rarefaction curve analysis you would first need to summarize the filtered table we produced in the last step:
qiime feature-table summarize \
--i-table denoised_output/table_filt_contam.qza \
--o-visualization denoised_output/table_filt_contam_summary.qzv
From this table you need to determine the maximum depth across your samples. You can then generate the rarefaction curves with this command (where X is a placeholder for the max depth across samples).
qiime diversity alpha-rarefaction \
--i-table denoised_output/table_filt_contam.qza \
--p-max-depth X \
--p-steps 20 \
--p-metrics 'observed_features' \
--o-visualization rarefaction_curves_test.qzv
If the depth of your samples varies quite a lot, then you may find that the `--p-max-depth` is too large to easily see your samples with lower read depths. In this case, I would recommend choosing a lower number (e.g. 10000) and re-running this step with that number.
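For example, re-running with a fixed maximum depth of 10000 would look like this (only the `--p-max-depth` value and the output filename change):

qiime diversity alpha-rarefaction \
--i-table denoised_output/table_filt_contam.qza \
--p-max-depth 10000 \
--p-steps 20 \
--p-metrics 'observed_features' \
--o-visualization rarefaction_curves_test_10000.qzv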
Take a look at these curves to help decide on a minimum depth cut-off for retaining samples. Once you decide on a hard cut-off, you can exclude samples below this cut-off with this command (where `SET_CUTOFF` is a placeholder for the minimum depth you select):
qiime feature-table filter-samples \
--i-table denoised_output/table_filt_contam.qza \
--p-min-frequency SET_CUTOFF \
--o-filtered-table denoised_output/table_final.qza
In many analyses, we might not want to exclude samples at this point. Sometimes we will investigate using a few different cut-offs to see how much these lower depth samples impact our results (if at all), or we might just be running a pilot study so we may be less concerned with whether we have kept (potentially) lower quality samples. In that case, if you do not wish to exclude any samples, then you can simply make a copy of the QZA file with the final table filename (i.e. `cp denoised_output/table_filt_contam.qza denoised_output/table_final.qza`), since this is the filename used for the remaining commands below.
Once we have our final filtered table we will need to subset the QZA of ASV sequences to the same set. You can exclude the low frequency ASVs from the sequence file with this command:
qiime feature-table filter-seqs \
--i-data denoised_output/representative_sequences.qza \
--i-table denoised_output/table_final.qza \
--o-filtered-data denoised_output/rep_seqs_final.qza
Finally, you can make a new summary of the final filtered abundance table:
qiime feature-table summarize \
--i-table denoised_output/table_final.qza \
--o-visualization denoised_output/table_final_summary.qzv
SEPP is one method for placing short sequences into a reference phylogenetic tree. This is a useful way of determining a phylogenetic tree for your ASVs, particularly as short amplicons do not typically make robust trees. As with the taxonomic classifiers, there are several references that could be used; we are now typically using the Greengenes 2 reference with SEPP.
Greengenes 2 (2022.10)
wget http://kronos.pharmacology.dal.ca/public_files/MH/taxa_classifiers/scikit-learn_v1.4.2_classifiers/greengenes_2022.10.backbone.sepp-reference.qza
SILVA v128
wget https://data.qiime2.org/classifiers/sepp-ref-dbs/sepp-refs-silva-128.qza
Greengenes 13_8
wget https://data.qiime2.org/classifiers/sepp-ref-dbs/sepp-refs-gg-13-8.qza
For 16S data you can do this with q2-fragment-insertion using the below command:
qiime fragment-insertion sepp \
--i-representative-sequences denoised_output/rep_seqs_final.qza \
--i-reference-database reference_tree.qza \
--o-tree asvs-tree.qza \
--o-placements insertion-placements.qza \
--p-threads $NCORES
Where `reference_tree.qza` should be replaced with whichever tree you downloaded above.
This will take quite a long time to run! If you're in a workshop and would just like the output, you can download it below.
wget http://kronos.pharmacology.dal.ca/public_files/MH2/MH2_AWS_out/arctic_ocean_pacbio/asvs-tree.qza
wget http://kronos.pharmacology.dal.ca/public_files/MH2/MH2_AWS_out/arctic_ocean_pacbio/insertion-placements.qza
You can specify custom reference files to place other amplicons, but the easiest approach for 18S and ITS data is to instead create a de novo tree as outlined below.
Given the lack of a pre-calculated reference tree for 18S and ITS data (unlike the above 16S tree) to do direct placement, we create the tree de novo below.
We'll need to make a multiple-sequence alignment of the ASVs before running FastTree. First, we'll make a folder for the output files.
mkdir tree_out
We'll use MAFFT to make a de novo multiple-sequence alignment of the ASVs.
qiime alignment mafft --i-sequences denoised_output/rep_seqs_final.qza \
--p-n-threads $NCORES \
--o-alignment tree_out/rep_seqs_final_aligned.qza
Variable positions in the alignment need to be masked before FastTree is run, which can be done with this command:
qiime alignment mask --i-alignment tree_out/rep_seqs_final_aligned.qza \
--o-masked-alignment tree_out/rep_seqs_final_aligned_masked.qza
Finally FastTree can be run on this masked multiple-sequence alignment:
qiime phylogeny fasttree --i-alignment tree_out/rep_seqs_final_aligned_masked.qza \
--p-n-threads $NCORES \
--o-tree tree_out/rep_seqs_final_aligned_masked_tree.qza
FastTree returns an unrooted tree. One basic way to add a root to a tree is to place it at the midpoint of the largest tip-to-tip distance in the tree, which is done with this command:
qiime phylogeny midpoint-root --i-tree tree_out/rep_seqs_final_aligned_masked_tree.qza \
--o-rooted-tree tree_out/rep_seqs_final_aligned_masked_tree_rooted.qza
To keep this output filename consistent with the SOP you can simply make a copy of this output tree.
cp tree_out/rep_seqs_final_aligned_masked_tree_rooted.qza asvs-tree.qza
Other things that we are considering for future versions of this tutorial:
- Ghost tree for ITS2 or others?
- Building a de novo tree - would ssu-align or similar do a better job than mafft?
Lastly, to get the BIOM file (with associated taxonomy) and FASTA file (one per ASV) for your dataset to plug into other programs you can use the commands below.
To export the FASTA:
qiime tools export \
--input-path denoised_output/rep_seqs_final.qza \
--output-path denoised_output_exported
To export the BIOM table (with taxonomy added as metadata):
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' taxa/taxonomy.tsv
qiime tools export \
--input-path denoised_output/table_final.qza \
--output-path denoised_output_exported
biom add-metadata \
-i denoised_output_exported/feature-table.biom \
-o denoised_output_exported/feature-table_w_tax.biom \
--observation-metadata-fp taxa/taxonomy.tsv \
--sc-separated taxonomy
biom convert \
-i denoised_output_exported/feature-table_w_tax.biom \
-o denoised_output_exported/feature-table_w_tax.txt \
--to-tsv \
--header-key taxonomy
To export the tree of your ASVs:
qiime tools export \
--input-path asvs-tree.qza \
--output-path denoised_output_exported
Links to next tutorials, including the whole analysis workflow and basic visualization and statistics in QIIME2. These will be here soon!!