Step 6: Taxonomy assignment - shenjean/diversity GitHub Wiki

In QIIME2, taxonomy is assigned to each reference sequence using a pre-trained Naïve Bayesian classifier. Here is a list of available classifiers and references sequences: https://resources.qiime2.org

Ia. Classifier for 18S rRNA gene from PR2 database

The PR2 database seems to yield better taxonomic assignment (more classified taxa) based on the 18S rRNA gene.

wget https://github.com/pr2database/pr2database/releases/download/v5.1.1/pr2_version_5.1.1_SSU_mothur.tax.gz
wget https://github.com/pr2database/pr2database/releases/download/v5.1.1/pr2_version_5.1.1_SSU_mothur.fasta.gz
gunzip pr2_version_5.1.1_SSU_mothur.tax.gz
gunzip pr2_version_5.1.1_SSU_mothur.fasta.gz
  • Add header line to taxonomy file: ['Feature ID', 'Taxon'] must be the first two header values.

  • Import unzipped files to QIIME2

qiime tools import --type 'FeatureData[Sequence]' --input-path pr2_version_5.1.1_SSU_mothur.fasta --output-path PR2-seqs.qza
qiime tools import --type 'FeatureData[Taxonomy]' --input-path pr2_version_5.1.1_SSU_mothur.tax --output-path PR2-tax.qza
qiime feature-classifier extract-reads --i-sequences PR2-seqs.qza \
--p-f-primer CCAGCASCYGCGGTAATTCC --p-r-primer ACTTTCGTTCTTGATYR \
--p-n-jobs 8 --o-reads PR2-v4-seqs.qza
  • Dereplicate
qiime rescript dereplicate \
    --i-sequences PR2-v4-seqs.qza --i-taxa PR2-tax.qza --p-mode 'uniq' \
    --o-dereplicated-sequences PR2-seqs-v4-uniq.qza \
    --o-dereplicated-taxa PR2-tax-v4-uniq.qza

Ib. Classifier for 16S rRNA gene: V4 515F/806R region from SILVA database

There are pre-trained classifier(s) specifically for the V4 515F/806R region. This is available for download in the data resources page. Download the sequences and taxonomy corresponding to the 515F/806R regions. Make sure all files are downloaded to your working folder (e.g. BocaCiegaBay).

wget https://data.qiime2.org/2019.10/common/silva-132-99-515-806-nb-classifier.qza
  • For other users, download the classifier: Greengenes2 2022.10 from 515F/806R region of sequences
wget https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes2/2022.10.backbone.v4.nb.sklearn-1.4.2.qza

Ic. Train your own classifiers for other 16S rRNA gene regions

For other regions or to train your own classifier, download the NR99 full-length sequences (silva-138-99-seqs.qza) and taxonomy (silva-138-99-tax.qza) in the Silva (16S/18S rRNA) section of the data resources page.

wget https://data.qiime2.org/2024.5/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2024.5/common/silva-138-99-tax.qza
  • Now, we're going to extract our target regions using the primer sequences used for library construction and the amplicon length.
  • It has been shown that taxonomic classification accuracy improves when a Naive Bayes classifier is trained on only the region of the target sequences that was sequenced (see [Werner et al., 212])(https://www.ncbi.nlm.nih.gov/pubmed/21716311)
  • The pipeline below follows the RESCRIPt tutorial

This command will probably take several hours to run. For CIRCE users, with an older version of QIIME2, the best way is to submit and run a job on the server. First, make sure you are logged into CIRCE and make sure you are in the QIIME2 environment. Then, create the following text file with notepad on your own computer, and transfer it to the server. Alternatively, you can use the nano text editor while logged in to CIRCE. Please change the folder path /home/j/jeanlim/BocaCiegaBay below to your own folder path on CIRCE. You can name this file qiime2.sh.

#!/bin/bash

#SBATCH --time=48:00:00
#SBATCH --job-name=qiime2_extract
#SBATCH --output=qiime2_extract.log

cd /home/j/jeanlim/BocaCiegaBay

qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer GTGCCAGCMGCCGCGGTAA \
--p-r-primer GGACTACHVGGGTWTCTAAT \
--p-n-jobs 1 \
--o-reads silva-138-v4.qza

Once the file is created, submit the job to the server by typing sbatch qiime2.sh while logged in to CIRCE server. You can check the status of your queue by typing using the squeue command, for example squeue -u jeanlim. Change the username to your own username. An example squeue output is shown below:

(qiime2-2019.10) [jeanlim@itn2 ~]$ squeue -u jeanlim
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          22526001     circe qiime2_e  jeanlim  R       6:04      1 svc-3024-9-32
  • The first column shows the JOBID, which is useful for cancelling a job. To cancel a job, type scancel followed by the JOBID, for example scancel 22526001.

  • The fourth column (ST) shows the job status, where PD=pending and R=running.

  • When the job is completed, you will see the output file silva-138-v4.qza in your folder. If you don't see the output file but the job is no longer running, you can view the log file qiime2_extract.log to check for errors.

For other users, with a newer version of QIIME2

  • We’ll set --p-read-orientation 'forward', as the SILVA database is curated to be in the same “forward” orientation. This will allow us to process the data more quickly w/o having to account for mixed-orientation sequences during our primer search.
qiime feature-classifier extract-reads \
    --i-sequences silva-138-99-seqs.qza \
    --p-f-primer GTGCCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-n-jobs 8 \
    --p-read-orientation 'forward' \
    --o-reads silva-138-v4.qza

Moving forward, you can edit or create copies of the qiime2.sh to run any of the subsequent QIIME2 commands below. For each run, you will have to replace the old QIIME command each time with a new command. For example, you can replace the qiime feature-classifier extract-reads with the qiime rescript dereplicate command below.

Dereplicate extracted sequences

qiime rescript dereplicate \
    --i-sequences silva-138-v4.qza --i-taxa silva-138-99-tax.qza --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138-v4-uniq.qza \
    --o-dereplicated-taxa  silva-138-tax-v4-uniq.qza

Training the classifier

Now, let's train the classifier based on the region of interest. This is the basic QIIME2 command:

qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138-v4-uniq.qza --i-reference-taxonomy silva-138-tax-v4-uniq.qza \
--o-classifier q2-v4classifier.qza

The rescript evaluate-fit-classifier command can be used as a substitute for QIIME2's fit-classifier-naive-bayes command, with the added functionality of evaluating the accuracy of the classifier. However, this takes a LONG time to run (>35 hours and still running with 128gb RAM and 12 processors).

qiime rescript evaluate-fit-classifier \
    --i-sequences silva-138-v4-uniq.qza --i-taxonomy silva-138-tax-v4-uniq.qza  \
    --o-classifier rescript-v4-classifier.qza \
    --o-observed-taxonomy rescript-v4-predicted-taxonomy.qza \
    --o-evaluation rescript-v4-classifier-evaluation.qzv 

II. Assign taxonomy to representative sequences

Using a pre-trained or self-trained classifier, assign taxonomy to your representative sequences. This should take 15-30 mins to run. Be patient.

For CIRCE users,

qiime feature-classifier classify-sklearn \
--i-classifier silva-132-99-515-806-nb-classifier.qza \
--i-reads pe.repseqs.qza \
--o-classification taxonomy.qza 

For other users,

qiime feature-classifier classify-sklearn \
  --i-classifier 2022.10.backbone.v4.nb.sklearn-1.4.2.qza \
  --i-reads pe.repseqs.qza \
  --o-classification taxonomy.sklearn.qza

Another alternative is to perform taxonomic classification using BLAST:

wget https://data.qiime2.org/2024.5/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2024.5/common/silva-138-99-tax.qza
qiime feature-classifier classify-consensus-blast --i-query pe.repseqs.qza \
--i-reference-reads silva-138-99-seqs.qza --i-reference-taxonomy silva-138-99-tax.qza \
--o-classification taxonomy.blast.qza --o-search-results blast-results.qza

III. Visualize taxonomy data

Get list of taxonomic names and confidence for each feature

qiime metadata tabulate \
  --m-input-file taxonomy.sklearn.qza \
  --o-visualization taxonomy.sklearn.qzv

View the taxonomic composition of each sample with interactive bar plots.

qiime taxa barplot \
  --i-table pe.dada2.qza \
  --i-taxonomy taxonomy.sklearn.qza \
  --m-metadata-file metadata.txt \
  --o-visualization taxa-bar-plots.qzv

IV. Merge taxonomies predicted by different methods

How to merge feature taxonomies

  • len will select the taxonomy with the most elements (e.g., species level will beat genus level)
  • lca will find the least common ancestor and report this consensus taxonomy
  • score will select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframe
  • majority finds the LCA consensus while giving preference to majority labels.
  • super finds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"

For SILVA taxonomy:

qiime rescript merge-taxa \
--i-data taxonomy.blast.qza taxonomy.sklearn.qza \
--p-mode super --o-merged-data taxonomy.merged.qza

For other taxonomies, you will need to modify the --p-rank-handle-regex arguments to recognize rank handles, e.g. For example, ^[dkpcofgs]__ will recognize greengenes or silva rank handles. For eukaryotic databases (e.g. PR2), you may also want to disable the --p-new-rank-handles option to prevent trimming of the taxonomy to the given levels.

qiime rescript merge-taxa --i-data 18S.PR2.blast.taxonomy.qza 18S.PR2.sklearn.taxonomy.qza --p-mode super --o-merged-data 18S.PR2.taxonomy.merged.qza  --p-new-rank-handles disable

Get list of taxonomic names and confidence for each feature

qiime metadata tabulate \
  --m-input-file taxonomy.merged.qza \
  --o-visualization taxonomy.merged.qzv

View the taxonomic composition of each sample with interactive bar plots.

qiime taxa barplot \
  --i-table pe.dada2.qza \
  --i-taxonomy taxonomy.merged.qza \
  --m-metadata-file metadata.txt \
  --o-visualization taxa-merged-bar-plots.qzv

V. Filtering the count table by taxonomy

The command below keeps ASVs with phylum-level annotations and removes ASVs annotated as “archaea”, “chloroplast”, “eukaryota”, and “mitochondria”

If you used silva-132-99-515-806-nb-classifier.qza to classify your sequences:

qiime taxa filter-table \
  --i-table pe.dada2.qza \
  --i-taxonomy taxonomy.qza \
  --p-include D_1__ \
  --p-exclude mitochondria,chloroplast,eukaryota,archaea \
  --o-filtered-table bacteriatable.qza

If you used other classifiers to classify your sequences:

qiime taxa filter-table \
  --i-table pe.dada2.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --p-exclude mitochondria,chloroplast,eukaryota,archaea \
  --o-filtered-table bacteriatable.qza

Check the taxonomy bar plots again

qiime taxa barplot \
  --i-table bacteriatable.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file metadata.txt \
  --o-visualization bacteria-taxa-bar-plots.qzv

VI. Summarize ASV abundance of bacteria table

qiime feature-table summarize \
    --i-table bacteriatable.qza \
    --o-feature-frequencies feature-frequencies.qza \
    --o-sample-frequencies sample-frequencies.qza \
    --o-summary visual summary.qzv

VII. Collapse features by their taxonomy

This example command collapses ASVs to the genus level

qiime taxa collapse --i-table bacteriatable.qza --i-taxonomy taxonomy.qza --p-level 6 --o-collapsed-table genustable.qza

Summarize ASV abundance of genus table

qiime feature-table summarize --i-table genustable.qza --o-feature-frequencies genus-feature-frequencies.qza --o-sample-frequencies genus-sample-frequencies.qza --o-summary genustable.qzv

VIII. Export taxonomy file

qiime tools export --input-path taxonomy.qza --output-path taxonomy_export