Step 6: Taxonomy assignment - shenjean/diversity GitHub Wiki
In QIIME2, taxonomy is assigned to each reference sequence using a pre-trained Naïve Bayesian classifier. Here is a list of available classifiers and references sequences: https://resources.qiime2.org
Ia. Classifier for 18S rRNA gene from PR2 database
The PR2 database seems to yield better taxonomic assignment (more classified taxa) based on the 18S rRNA gene.
- The reference database can be downloaded from: https://github.com/pr2database/pr2database/releases
wget https://github.com/pr2database/pr2database/releases/download/v5.1.1/pr2_version_5.1.1_SSU_mothur.tax.gz
wget https://github.com/pr2database/pr2database/releases/download/v5.1.1/pr2_version_5.1.1_SSU_mothur.fasta.gz
gunzip pr2_version_5.1.1_SSU_mothur.tax.gz
gunzip pr2_version_5.1.1_SSU_mothur.fasta.gz
-
Add header line to taxonomy file: ['Feature ID', 'Taxon'] must be the first two header values.
-
Import unzipped files to QIIME2
qiime tools import --type 'FeatureData[Sequence]' --input-path pr2_version_5.1.1_SSU_mothur.fasta --output-path PR2-seqs.qza
qiime tools import --type 'FeatureData[Taxonomy]' --input-path pr2_version_5.1.1_SSU_mothur.tax --output-path PR2-tax.qza
- Extract primer region 18S rRNA gene V4 primer sequences TAReuk454FWD1 and TAReuk454REV3.1 are from https://doi.org/10.3389/fmicb.2018.01998
qiime feature-classifier extract-reads --i-sequences PR2-seqs.qza \
--p-f-primer CCAGCASCYGCGGTAATTCC --p-r-primer ACTTTCGTTCTTGATYR \
--p-n-jobs 8 --o-reads PR2-v4-seqs.qza
- Dereplicate
qiime rescript dereplicate \
--i-sequences PR2-v4-seqs.qza --i-taxa PR2-tax.qza --p-mode 'uniq' \
--o-dereplicated-sequences PR2-seqs-v4-uniq.qza \
--o-dereplicated-taxa PR2-tax-v4-uniq.qza
Ib. Classifier for 16S rRNA gene: V4 515F/806R region from SILVA database
There are pre-trained classifier(s) specifically for the V4 515F/806R region. This is available for download in the data resources page. Download the sequences and taxonomy corresponding to the 515F/806R regions. Make sure all files are downloaded to your working folder (e.g. BocaCiegaBay).
- For CIRCE users, download the old classifier trained from Silva 132 99% OTUs from 515F/806R region of sequences from https://docs.qiime2.org/2019.10/data-resources/
wget https://data.qiime2.org/2019.10/common/silva-132-99-515-806-nb-classifier.qza
- For other users, download the classifier: Greengenes2 2022.10 from 515F/806R region of sequences
wget https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes2/2022.10.backbone.v4.nb.sklearn-1.4.2.qza
Ic. Train your own classifiers for other 16S rRNA gene regions
For other regions or to train your own classifier, download the NR99 full-length sequences (silva-138-99-seqs.qza) and taxonomy (silva-138-99-tax.qza) in the Silva (16S/18S rRNA) section of the data resources page.
wget https://data.qiime2.org/2024.5/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2024.5/common/silva-138-99-tax.qza
- Now, we're going to extract our target regions using the primer sequences used for library construction and the amplicon length.
- It has been shown that taxonomic classification accuracy improves when a Naive Bayes classifier is trained on only the region of the target sequences that was sequenced (see [Werner et al., 212])(https://www.ncbi.nlm.nih.gov/pubmed/21716311)
- The pipeline below follows the RESCRIPt tutorial
This command will probably take several hours to run. For CIRCE users, with an older version of QIIME2, the best way is to submit and run a job on the server. First, make sure you are logged into CIRCE and make sure you are in the QIIME2 environment. Then, create the following text file with notepad on your own computer, and transfer it to the server. Alternatively, you can use the nano text editor while logged in to CIRCE. Please change the folder path /home/j/jeanlim/BocaCiegaBay below to your own folder path on CIRCE. You can name this file qiime2.sh.
#!/bin/bash
#SBATCH --time=48:00:00
#SBATCH --job-name=qiime2_extract
#SBATCH --output=qiime2_extract.log
cd /home/j/jeanlim/BocaCiegaBay
qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer GTGCCAGCMGCCGCGGTAA \
--p-r-primer GGACTACHVGGGTWTCTAAT \
--p-n-jobs 1 \
--o-reads silva-138-v4.qza
Once the file is created, submit the job to the server by typing sbatch qiime2.sh while logged in to CIRCE server. You can check the status of your queue by typing using the squeue command, for example squeue -u jeanlim. Change the username to your own username. An example squeue output is shown below:
(qiime2-2019.10) [jeanlim@itn2 ~]$ squeue -u jeanlim
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
22526001 circe qiime2_e jeanlim R 6:04 1 svc-3024-9-32
-
The first column shows the JOBID, which is useful for cancelling a job. To cancel a job, type
scancelfollowed by the JOBID, for examplescancel 22526001. -
The fourth column (ST) shows the job status, where PD=pending and R=running.
-
When the job is completed, you will see the output file
silva-138-v4.qzain your folder. If you don't see the output file but the job is no longer running, you can view the log fileqiime2_extract.logto check for errors.
For other users, with a newer version of QIIME2
- We’ll set
--p-read-orientation 'forward', as the SILVA database is curated to be in the same “forward” orientation. This will allow us to process the data more quickly w/o having to account for mixed-orientation sequences during our primer search.
qiime feature-classifier extract-reads \
--i-sequences silva-138-99-seqs.qza \
--p-f-primer GTGCCAGCMGCCGCGGTAA \
--p-r-primer GGACTACHVGGGTWTCTAAT \
--p-n-jobs 8 \
--p-read-orientation 'forward' \
--o-reads silva-138-v4.qza
Moving forward, you can edit or create copies of the qiime2.sh to run any of the subsequent QIIME2 commands below. For each run, you will have to replace the old QIIME command each time with a new command. For example, you can replace the qiime feature-classifier extract-reads with the qiime rescript dereplicate command below.
Dereplicate extracted sequences
qiime rescript dereplicate \
--i-sequences silva-138-v4.qza --i-taxa silva-138-99-tax.qza --p-mode 'uniq' \
--o-dereplicated-sequences silva-138-v4-uniq.qza \
--o-dereplicated-taxa silva-138-tax-v4-uniq.qza
Training the classifier
Now, let's train the classifier based on the region of interest. This is the basic QIIME2 command:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138-v4-uniq.qza --i-reference-taxonomy silva-138-tax-v4-uniq.qza \
--o-classifier q2-v4classifier.qza
The rescript evaluate-fit-classifier command can be used as a substitute for QIIME2's fit-classifier-naive-bayes command, with the added functionality of evaluating the accuracy of the classifier. However, this takes a LONG time to run (>35 hours and still running with 128gb RAM and 12 processors).
qiime rescript evaluate-fit-classifier \
--i-sequences silva-138-v4-uniq.qza --i-taxonomy silva-138-tax-v4-uniq.qza \
--o-classifier rescript-v4-classifier.qza \
--o-observed-taxonomy rescript-v4-predicted-taxonomy.qza \
--o-evaluation rescript-v4-classifier-evaluation.qzv
II. Assign taxonomy to representative sequences
Using a pre-trained or self-trained classifier, assign taxonomy to your representative sequences. This should take 15-30 mins to run. Be patient.
For CIRCE users,
qiime feature-classifier classify-sklearn \
--i-classifier silva-132-99-515-806-nb-classifier.qza \
--i-reads pe.repseqs.qza \
--o-classification taxonomy.qza
For other users,
qiime feature-classifier classify-sklearn \
--i-classifier 2022.10.backbone.v4.nb.sklearn-1.4.2.qza \
--i-reads pe.repseqs.qza \
--o-classification taxonomy.sklearn.qza
Another alternative is to perform taxonomic classification using BLAST:
wget https://data.qiime2.org/2024.5/common/silva-138-99-seqs.qza
wget https://data.qiime2.org/2024.5/common/silva-138-99-tax.qza
qiime feature-classifier classify-consensus-blast --i-query pe.repseqs.qza \
--i-reference-reads silva-138-99-seqs.qza --i-reference-taxonomy silva-138-99-tax.qza \
--o-classification taxonomy.blast.qza --o-search-results blast-results.qza
III. Visualize taxonomy data
Get list of taxonomic names and confidence for each feature
qiime metadata tabulate \
--m-input-file taxonomy.sklearn.qza \
--o-visualization taxonomy.sklearn.qzv
View the taxonomic composition of each sample with interactive bar plots.
qiime taxa barplot \
--i-table pe.dada2.qza \
--i-taxonomy taxonomy.sklearn.qza \
--m-metadata-file metadata.txt \
--o-visualization taxa-bar-plots.qzv
IV. Merge taxonomies predicted by different methods
How to merge feature taxonomies
lenwill select the taxonomy with the most elements (e.g., species level will beat genus level)lcawill find the least common ancestor and report this consensus taxonomyscorewill select the taxonomy with the highest score (e.g., confidence or consensus score). Note that "score" assumes that this score is always contained as the second column in a feature taxonomy dataframemajorityfinds the LCA consensus while giving preference to majority labels.superfinds the LCA consensus while giving preference to majority labels and collapsing substrings into superstrings. For example, when a more specific taxonomy does not contradict a less specific taxonomy, the more specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"
For SILVA taxonomy:
qiime rescript merge-taxa \
--i-data taxonomy.blast.qza taxonomy.sklearn.qza \
--p-mode super --o-merged-data taxonomy.merged.qza
For other taxonomies, you will need to modify the --p-rank-handle-regex arguments to recognize rank handles, e.g. For example, ^[dkpcofgs]__ will recognize greengenes or silva rank handles. For eukaryotic databases (e.g. PR2), you may also want to disable the --p-new-rank-handles option to prevent trimming of the taxonomy to the given levels.
qiime rescript merge-taxa --i-data 18S.PR2.blast.taxonomy.qza 18S.PR2.sklearn.taxonomy.qza --p-mode super --o-merged-data 18S.PR2.taxonomy.merged.qza --p-new-rank-handles disable
Get list of taxonomic names and confidence for each feature
qiime metadata tabulate \
--m-input-file taxonomy.merged.qza \
--o-visualization taxonomy.merged.qzv
View the taxonomic composition of each sample with interactive bar plots.
qiime taxa barplot \
--i-table pe.dada2.qza \
--i-taxonomy taxonomy.merged.qza \
--m-metadata-file metadata.txt \
--o-visualization taxa-merged-bar-plots.qzv
V. Filtering the count table by taxonomy
The command below keeps ASVs with phylum-level annotations and removes ASVs annotated as “archaea”, “chloroplast”, “eukaryota”, and “mitochondria”
If you used silva-132-99-515-806-nb-classifier.qza to classify your sequences:
qiime taxa filter-table \
--i-table pe.dada2.qza \
--i-taxonomy taxonomy.qza \
--p-include D_1__ \
--p-exclude mitochondria,chloroplast,eukaryota,archaea \
--o-filtered-table bacteriatable.qza
If you used other classifiers to classify your sequences:
qiime taxa filter-table \
--i-table pe.dada2.qza \
--i-taxonomy taxonomy.qza \
--p-include p__ \
--p-exclude mitochondria,chloroplast,eukaryota,archaea \
--o-filtered-table bacteriatable.qza
Check the taxonomy bar plots again
qiime taxa barplot \
--i-table bacteriatable.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file metadata.txt \
--o-visualization bacteria-taxa-bar-plots.qzv
VI. Summarize ASV abundance of bacteria table
qiime feature-table summarize \
--i-table bacteriatable.qza \
--o-feature-frequencies feature-frequencies.qza \
--o-sample-frequencies sample-frequencies.qza \
--o-summary visual summary.qzv
VII. Collapse features by their taxonomy
This example command collapses ASVs to the genus level
qiime taxa collapse --i-table bacteriatable.qza --i-taxonomy taxonomy.qza --p-level 6 --o-collapsed-table genustable.qza
Summarize ASV abundance of genus table
qiime feature-table summarize --i-table genustable.qza --o-feature-frequencies genus-feature-frequencies.qza --o-sample-frequencies genus-sample-frequencies.qza --o-summary genustable.qzv
VIII. Export taxonomy file
qiime tools export --input-path taxonomy.qza --output-path taxonomy_export