Creating QIIME 2 Taxonomic Classifiers - LangilleLab/microbiome_helper GitHub Wiki

We use the commands below when creating new QIIME 2 taxonomic classifiers. These commands are based on this QIIME 2 tutorial and are listed here for convenience.

This page lists the current commands used to create custom classifiers. This was only done for the ITS classifiers, because the default QIIME 2 classifier works with both 16S and 18S data. The previous commands used to generate primer-specific classifiers are available here.

First, download the appropriate reference files, which correspond to the UNITE (ver8_99_s_04.02.2020) ITS database files (with and without all eukaryotes).

Importing files

All of the database files (FASTAs and taxonomy tables) need to be imported as QIIME 2 artifacts.

mkdir imported_files

# Fungi-only UNITE release
qiime tools import --type 'FeatureData[Sequence]' \
    --input-path sh_qiime_release_s_04.02.2020/sh_refs_qiime_ver8_99_s_04.02.2020.fasta \
    --output-path imported_files/sh_refs_qiime_ver8_99_s_04.02.2020_ITS.qza
    
qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \
    --input-path sh_qiime_release_s_04.02.2020/sh_taxonomy_qiime_ver8_99_s_04.02.2020.txt \
    --output-path imported_files/sh_taxonomy_qiime_ver8_99_s_04.02.2020.qza


# UNITE release that includes all eukaryotes
qiime tools import --type 'FeatureData[Sequence]' \
    --input-path sh_qiime_release_s_all_04.02.2020/sh_refs_qiime_ver8_99_s_all_04.02.2020.fasta \
    --output-path imported_files/sh_refs_qiime_ver8_99_s_all_04.02.2020_ITS.qza
    
qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \
    --input-path sh_qiime_release_s_all_04.02.2020/sh_taxonomy_qiime_ver8_99_s_all_04.02.2020.txt \
    --output-path imported_files/sh_taxonomy_qiime_ver8_99_s_all_04.02.2020.qza
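Before training, it can be worth confirming that all four artifacts were actually written. The following is a minimal sketch, not part of the original workflow: `check_artifacts` is a hypothetical helper name, and it assumes the output file names used in the import commands above. When `qiime` is on the PATH it also runs `qiime tools validate` on each file; otherwise it only checks that the file exists and is non-empty.

```shell
# check_artifacts DIR FILE...
# Hypothetical helper (not a QIIME 2 command). Reports every artifact that is
# missing or empty and returns non-zero if any check fails.
check_artifacts() {
    ca_dir="$1"
    shift
    ca_status=0
    for ca_name in "$@"; do
        if [ ! -s "$ca_dir/$ca_name" ]; then
            echo "missing or empty: $ca_dir/$ca_name" >&2
            ca_status=1
        elif command -v qiime >/dev/null 2>&1; then
            # Only attempted when QIIME 2 is installed in this environment.
            qiime tools validate "$ca_dir/$ca_name" || ca_status=1
        fi
    done
    return "$ca_status"
}

# Example call, using the file names from the import commands:
#   check_artifacts imported_files \
#       sh_refs_qiime_ver8_99_s_04.02.2020_ITS.qza \
#       sh_taxonomy_qiime_ver8_99_s_04.02.2020.qza \
#       sh_refs_qiime_ver8_99_s_all_04.02.2020_ITS.qza \
#       sh_taxonomy_qiime_ver8_99_s_all_04.02.2020.qza
```

A non-zero return value means at least one artifact should be re-imported before moving on.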

Train Naive Bayes Classifiers

Now that the data are imported, we can train the classifiers with the commands below. Note the & at the end of each command, which runs it in the background. Note also that the ITS classifiers are based on the entire ITS region, and that two different classifiers are created from the UNITE database: one based on all eukaryotes (classifier_sh_refs_qiime_ver8_99_s_all_04.02.2020_ITS.qza) and one based on just fungi (classifier_sh_refs_qiime_ver8_99_s_04.02.2020_ITS.qza).

mkdir taxa_classifiers

# Fungi-only classifier
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads imported_files/sh_refs_qiime_ver8_99_s_04.02.2020_ITS.qza \
  --i-reference-taxonomy imported_files/sh_taxonomy_qiime_ver8_99_s_04.02.2020.qza \
  --o-classifier taxa_classifiers/classifier_sh_refs_qiime_ver8_99_s_04.02.2020_ITS.qza &

# All-eukaryotes classifier
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads imported_files/sh_refs_qiime_ver8_99_s_all_04.02.2020_ITS.qza \
  --i-reference-taxonomy imported_files/sh_taxonomy_qiime_ver8_99_s_all_04.02.2020.qza \
  --o-classifier taxa_classifiers/classifier_sh_refs_qiime_ver8_99_s_all_04.02.2020_ITS.qza &
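Because both training jobs are sent to the background with &, the shell returns immediately and a failed run can go unnoticed. One way to block until both jobs finish and surface their exit codes is the shell's `wait` builtin. This is a sketch only: `wait_all` is a hypothetical helper name, and it assumes the process IDs were captured with `$!` right after each backgrounded command, in the same shell session.

```shell
# wait_all PID...
# Hypothetical helper. Waits for each background job started from this shell
# and returns non-zero if any of them exited with an error.
wait_all() {
    wa_failed=0
    for wa_pid in "$@"; do
        if ! wait "$wa_pid"; then
            echo "background job $wa_pid failed" >&2
            wa_failed=1
        fi
    done
    return "$wa_failed"
}

# Usage pattern (the training commands are the ones listed above):
#   qiime feature-classifier fit-classifier-naive-bayes ... &
#   pid_fungi=$!
#   qiime feature-classifier fit-classifier-naive-bayes ... &
#   pid_all=$!
#   wait_all "$pid_fungi" "$pid_all" && echo "both classifiers trained"
```

`wait` only works on jobs started from the current shell, so the helper has to be called in the same session that launched the training commands.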

Once both background jobs have finished, the taxonomic classifiers are prepared. It's important that you now run sanity checks on these classifiers to ensure they were created correctly. This is best done by comparing the taxonomic assignments these classifiers produce for test input sequences against the assignments from an independent approach. I've written a quick pipeline for running these sanity checks specifically for these amplicon regions, which you can see here.
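As a rough illustration of that comparison (this is not the pipeline linked above), the sketch below counts how often two taxonomy tables agree on the features they share. `compare_taxonomy` is a hypothetical helper name. It assumes both inputs are tab-separated files with a header row, the feature ID in column 1 and the taxon string in column 2, which is the layout of the `taxonomy.tsv` written by `qiime tools export`. The file `test_seqs.qza` in the comments is also a placeholder.

```shell
# compare_taxonomy REFERENCE.tsv CLASSIFIER.tsv
# Hypothetical helper. Prints "<identical assignments>/<shared feature IDs>".
compare_taxonomy() {
    awk -F'\t' '
        FNR == 1  { next }                 # skip the header row of each file
        NR == FNR { ref[$1] = $2; next }   # first file: remember taxon per ID
        ($1 in ref) {                      # second file: compare shared IDs
            shared++
            if (ref[$1] == $2) same++
        }
        END { printf "%d/%d\n", same, shared }
    ' "$1" "$2"
}

# One way to obtain the classifier-side table (test_seqs.qza is a placeholder):
#   qiime feature-classifier classify-sklearn \
#       --i-classifier taxa_classifiers/classifier_sh_refs_qiime_ver8_99_s_04.02.2020_ITS.qza \
#       --i-reads test_seqs.qza \
#       --o-classification test_taxonomy.qza
#   qiime tools export --input-path test_taxonomy.qza --output-path test_taxonomy
#   compare_taxonomy independent_taxonomy.tsv test_taxonomy/taxonomy.tsv
```

Exact string matching is deliberately strict. Differences in rank prefixes or assignment depth between the two approaches will count as disagreements, so a low ratio calls for a closer look at the mismatches rather than an immediate verdict on the classifier.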