Step 6: Taxonomy assignment - shenjean/diversity GitHub Wiki

In QIIME2, taxonomy is assigned to each representative sequence using a pre-trained Naive Bayes classifier. A list of available classifiers and reference sequences is at https://resources.qiime2.org

V4 515F/806R region

Pre-trained classifiers specifically for the V4 515F/806R region are available for download on the data resources page. Download the sequences and taxonomy corresponding to the 515F/806R region. Make sure all files are downloaded to your working folder (e.g. BocaCiegaBay).

For CIRCE users, download the older classifier trained on the 515F/806R region of the Silva 132 99% OTU sequences, listed at https://docs.qiime2.org/2019.10/data-resources/

wget https://data.qiime2.org/2019.10/common/silva-132-99-515-806-nb-classifier.qza

For other users, download the Greengenes2 2022.10 classifier trained on the 515F/806R region:

wget https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes2/2022.10.backbone.v4.nb.sklearn-1.4.2.qza

Extracting other 16S rRNA gene regions

For other regions or to train your own classifier, download the NR99 full-length sequences (silva-138-99-seqs.qza) and taxonomy (silva-138-99-tax.qza) in the Silva (16S/18S rRNA) section of the data resources page.

wget https://data.qiime2.org/2024.5/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2024.5/common/silva-138-99-tax.qza

Now, we're going to extract our target region using the primer sequences used for library construction.
Taxonomic classification accuracy has been shown to improve when a Naive Bayes classifier is trained on only the region of the target sequences that was sequenced (see [Werner et al., 2012](https://www.ncbi.nlm.nih.gov/pubmed/21716311)).
The pipeline below follows the RESCRIPt tutorial.
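
The 515F/806R primers used in the commands below contain IUPAC degenerate bases (M, H, V, W), each of which stands for a set of nucleotides. As a minimal sketch of what that means for primer matching, the shell function below expands a primer into a regular expression. The name primer_to_regex is hypothetical and is not part of QIIME2, and the sketch models exact matches only.

```shell
# primer_to_regex: hypothetical helper (not a QIIME2 command).
# Expands IUPAC degenerate bases in a primer into bracket expressions.
primer_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Forward primer (515F): M becomes [AC]
primer_to_regex GTGCCAGCMGCCGCGGTAA
# Reverse primer (806R): H, V and W become bracket expressions
primer_to_regex GGACTACHVGGGTWTCTAAT
```

The resulting pattern can be passed to grep -E to spot-check a few sequences by eye. This helper runs instantly and is only for inspecting primers; the actual extraction is done by the QIIME2 command later in this section.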

The extract-reads command below will probably take several hours to run. For CIRCE users with an older version of QIIME2, the best approach is to submit it as a job on the server. First, make sure you are logged in to CIRCE and are in the QIIME2 environment. Then, create the following text file with Notepad on your own computer and transfer it to the server. Alternatively, you can use the nano text editor while logged in to CIRCE. Change the folder path /home/j/jeanlim/BocaCiegaBay below to your own folder path on CIRCE. You can name this file qiime2.sh.

#!/bin/bash

#SBATCH --time=48:00:00
#SBATCH --job-name=qiime2_extract
#SBATCH --output=qiime2_extract.log

cd /home/j/jeanlim/BocaCiegaBay

qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer GTGCCAGCMGCCGCGGTAA \
--p-r-primer GGACTACHVGGGTWTCTAAT \
--p-n-jobs 1 \
--o-reads silva-138-v4.qza

Once the file is created, submit the job by typing sbatch qiime2.sh while logged in to the CIRCE server. You can check the status of your jobs with the squeue command, for example squeue -u jeanlim. Change the username to your own. An example squeue output is shown below:

(qiime2-2019.10) [jeanlim@itn2 ~]$ squeue -u jeanlim
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          22526001     circe qiime2_e  jeanlim  R       6:04      1 svc-3024-9-32

The first column shows the JOBID, which is useful for cancelling a job. To cancel a job, type scancel followed by the JOBID, for example scancel 22526001.
The fifth column (ST) shows the job status, where PD=pending and R=running.
When the job is completed, you will see the output file silva-138-v4.qza in your folder. If you don't see the output file but the job is no longer running, you can view the log file qiime2_extract.log to check for errors.
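
The check described above can be sketched as a small shell function. This is only an illustration: check_output is a hypothetical helper, and it assumes that a failed run leaves a line containing the word "error" in the log, which may not hold for every failure.

```shell
# check_output: hypothetical helper, not a Slurm or QIIME2 command.
# Usage: check_output OUTPUT_FILE LOG_FILE
# Run it only after squeue no longer lists the job.
check_output() {
    if [ -s "$1" ]; then
        # Output file exists and is not empty
        echo "OK: $1 exists"
    elif [ -f "$2" ] && grep -qi 'error' "$2"; then
        # Assumption: failures are logged with the word "error"
        echo "FAILED: errors found in $2"
    else
        echo "MISSING: $1 not found and no error logged in $2"
    fi
}

check_output silva-138-v4.qza qiime2_extract.log
```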

For other users, with a newer version of QIIME2

We’ll set --p-read-orientation 'forward' because the SILVA reference sequences are curated to be in the same “forward” orientation. This lets us process the data more quickly, without having to account for mixed-orientation sequences during the primer search.

qiime feature-classifier extract-reads \
    --i-sequences silva-138-99-seqs.qza \
    --p-f-primer GTGCCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-n-jobs 8 \
    --p-read-orientation 'forward' \
    --o-reads silva-138-v4.qza
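
Because the reference sequences are all in forward orientation, the reverse primer's binding site appears in them as the reverse complement of the primer. The sketch below computes that reverse complement, including IUPAC degenerate bases. The function name revcomp is hypothetical and is not part of QIIME2.

```shell
# revcomp: hypothetical helper (not a QIIME2 command).
# Reverses a sequence with awk, then complements it with tr.
# Degenerate bases map to their complements (e.g. H -> D, V -> B, W -> W).
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""
               for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
               print out }' |
        tr 'ACGTMRWSYKVHDBN' 'TGCAKYWSRMBDHVN'
}

# Reverse complement of the 806R primer
revcomp GGACTACHVGGGTWTCTAAT
```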

Moving forward, you can edit or copy qiime2.sh to run any of the subsequent QIIME2 commands below. For each run, replace the previous QIIME2 command with the new one. For example, you can replace the qiime feature-classifier extract-reads command with the qiime rescript dereplicate command below.
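
Instead of editing the file by hand each time, the job script can be generated. The function below is a rough sketch that prints a script with the same #SBATCH header as qiime2.sh for any command. The name make_job_script and the generated file name are hypothetical, and arguments containing spaces or quotes are not preserved.

```shell
# make_job_script: hypothetical helper that prints a Slurm job script.
# Usage: make_job_script JOB_NAME WORK_DIR COMMAND [ARGS...]
make_job_script() {
    name="$1"
    workdir="$2"
    shift 2
    cat <<EOF
#!/bin/bash

#SBATCH --time=48:00:00
#SBATCH --job-name=${name}
#SBATCH --output=${name}.log

cd ${workdir}

$*
EOF
}

# Example: write a script for a later step, then submit it with sbatch
# make_job_script qiime2_summary /home/j/jeanlim/BocaCiegaBay \
#     qiime feature-table summarize --i-table bacteriatable.qza \
#     --o-visualization bacteriatable.qzv > qiime2_summary.sh
# sbatch qiime2_summary.sh
```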

Dereplicate extracted sequences

qiime rescript dereplicate \
    --i-sequences silva-138-v4.qza \
    --i-taxa silva-138-99-tax.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138-v4-uniq.qza \
    --o-dereplicated-taxa silva-138-tax-v4-uniq.qza

Training the classifier

Now, let's train the classifier based on the region of interest. This is the basic QIIME2 command:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138-v4-uniq.qza \
    --i-reference-taxonomy silva-138-tax-v4-uniq.qza \
    --o-classifier q2-v4classifier.qza

The rescript evaluate-fit-classifier command can be used as a substitute for QIIME2's fit-classifier-naive-bayes command, with the added functionality of evaluating the accuracy of the classifier. However, it takes a long time to run (more than 35 hours and still not finished with 128 GB RAM and 12 processors).

qiime rescript evaluate-fit-classifier \
    --i-sequences silva-138-v4-uniq.qza \
    --i-taxonomy silva-138-tax-v4-uniq.qza \
    --o-classifier rescript-v4-classifier.qza \
    --o-observed-taxonomy rescript-v4-predicted-taxonomy.qza \
    --o-evaluation rescript-v4-classifier-evaluation.qzv

Assign taxonomy to representative sequences

Using a pre-trained or self-trained classifier, assign taxonomy to your representative sequences. This should take 15-30 minutes to run, so be patient.

For CIRCE users,

qiime feature-classifier classify-sklearn \
--i-classifier silva-132-99-515-806-nb-classifier.qza \
--i-reads pe.repseqs.qza \
--o-classification taxonomy.qza

For other users,

qiime feature-classifier classify-sklearn \
  --i-classifier 2022.10.backbone.v4.nb.sklearn-1.4.2.qza \
  --i-reads pe.repseqs.qza \
  --o-classification taxonomy.qza

Alternatively, perform taxonomic classification using BLAST:

wget https://data.qiime2.org/2024.5/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2024.5/common/silva-138-99-tax.qza
qiime feature-classifier classify-consensus-blast \
  --i-query pe.repseqs.qza \
  --i-reference-reads silva-138-99-seqs.qza \
  --i-reference-taxonomy silva-138-99-tax.qza \
  --o-classification taxonomy.qza

Get list of taxonomic names and confidence for each feature

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

View the taxonomic composition of each sample with interactive bar plots.

qiime taxa barplot \
  --i-table pe.dada2.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file metadata.txt \
  --o-visualization taxa-bar-plots.qzv

Filtering the count table by taxonomy

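The commands below keep ASVs with phylum-level annotations (--p-include) and remove ASVs annotated as “archaea”, “chloroplast”, “eukaryota”, or “mitochondria” (--p-exclude). The phylum prefix depends on the classifier: D_1__ for Silva 132 and p__ for the others.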

If you used silva-132-99-515-806-nb-classifier.qza to classify your sequences:

qiime taxa filter-table \
  --i-table pe.dada2.qza \
  --i-taxonomy taxonomy.qza \
  --p-include D_1__ \
  --p-exclude mitochondria,chloroplast,eukaryota,archaea \
  --o-filtered-table bacteriatable.qza

If you used other classifiers to classify your sequences:

qiime taxa filter-table \
  --i-table pe.dada2.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --p-exclude mitochondria,chloroplast,eukaryota,archaea \
  --o-filtered-table bacteriatable.qza
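
To preview which features such a filter would keep, the include/exclude logic can be imitated on an exported taxonomy table. The sketch below is not a replacement for qiime taxa filter-table. It assumes a tab-separated file with one header row and the taxonomy string in the second column, and it assumes that matching is a case-insensitive substring search. The function name preview_filter is hypothetical.

```shell
# preview_filter: hypothetical helper that imitates include/exclude matching.
# Usage: preview_filter TAXONOMY_TSV INCLUDE_TERM EXCLUDE_TERMS_COMMA_SEPARATED
# Prints the IDs (column 1) of features that would be kept.
preview_filter() {
    awk -F'\t' -v inc="$2" -v exc="$3" '
        NR == 1 { next }                              # skip the header row
        {
            taxon = tolower($2)
            # keep only annotations containing the include term
            if (index(taxon, tolower(inc)) == 0) next
            # drop annotations containing any exclude term
            n = split(tolower(exc), terms, ",")
            for (i = 1; i <= n; i++)
                if (index(taxon, terms[i]) > 0) next
            print $1
        }' "$1"
}

# Example (the path assumes taxonomy.qza was exported to taxonomy_export):
# preview_filter taxonomy_export/taxonomy.tsv p__ mitochondria,chloroplast,eukaryota,archaea
```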

Check the taxonomy bar plots again

qiime taxa barplot \
  --i-table bacteriatable.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file metadata.txt \
  --o-visualization bacteria-taxa-bar-plots.qzv

Export taxonomy file

qiime tools export --input-path taxonomy.qza --output-path taxonomy_export
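
The exported table can then be inspected with standard shell tools. As a minimal sketch, the function below lists features whose classifier confidence falls below a threshold. It assumes the export produced a tab-separated file (taxonomy_export/taxonomy.tsv) with one header row and the columns Feature ID, Taxon, and Confidence. The function name low_confidence and the 0.7 threshold in the example are illustrative only.

```shell
# low_confidence: hypothetical helper for an exported taxonomy table.
# Usage: low_confidence TAXONOMY_TSV MIN_CONFIDENCE
# Prints feature ID and confidence for rows below MIN_CONFIDENCE.
low_confidence() {
    awk -F'\t' -v min="$2" '
        NR > 1 && ($3 + 0) < (min + 0) { print $1 "\t" $3 }
    ' "$1"
}

# Example:
# low_confidence taxonomy_export/taxonomy.tsv 0.7
```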

Summarize ASV abundance of bacteria table

qiime feature-table summarize --i-table bacteriatable.qza --o-visualization bacteriatable.qzv

Collapse features by their taxonomy

This example command collapses ASVs to the genus level (level 6):

qiime taxa collapse --i-table bacteriatable.qza --i-taxonomy taxonomy.qza --p-level 6 --o-collapsed-table genustable.qza
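
The level number counts the semicolon-separated ranks of the taxonomy string from the left, so level 6 keeps the first six ranks (genus in a seven-rank taxonomy). The sketch below illustrates this on a single annotation. The function name truncate_taxon and the example annotation are illustrative only, and how QIIME2 treats features annotated to fewer ranks is not modeled here.

```shell
# truncate_taxon: hypothetical helper that keeps the first N ranks
# of a semicolon-separated taxonomy string.
# Usage: truncate_taxon TAXONOMY_STRING LEVEL
truncate_taxon() {
    printf '%s\n' "$1" | cut -d';' -f1-"$2"
}

# Level 6 drops the species rank from a seven-rank annotation
truncate_taxon 'd__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; s__Escherichia_coli' 6
```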

Summarize feature abundance of genus table

qiime feature-table summarize --i-table genustable.qza --o-visualization genustable.qzv