Creating primer specific QIIME 2 taxonomic classifiers - LangilleLab/microbiome_helper GitHub Wiki

We used the below commands when creating primers-ecific QIIME2 taxonomic classifiers. These commands are simply based on this QIIME2 tutorial and are listed here for convenience. Note that we no longer maintain primer-specific classifiers.

First, the appropriate reference files need to be downloaded. For the most recent amplicon SOP we used the SILVA v132 99% 16S and 18S FASTAs and taxonomy files and the UNITE (02.02.2019) ITS database files (with and without all eukaryotes). We assume that these FASTAs and taxonomy files are in a folder called raw_files.

Prep taxonomy files

We have run into a number of errors with QIIME 2 classifiers related to the taxonomy string formatting in custom classifiers. Spaces, hash characters, and quotes can all sometimes cause issues when classifying datasets, but only if taxa labels containing these characters are identified (i.e. whether or not an error occurs on the input data). Below are the formatting fixes we have noted so far by trial and error. Note that these commands were only needed to run for the SILVA taxonomy files and not for the UNITE database. The below commands are sed commands that will modify a file so it's a good idea to make a back-up copying of the original taxonomy file before modifying it.

sed -i 's/#//g' FILE
sed -i "s/'/./g" FILE
sed -i 's/ /_/g' FILE

We also needed to add the column names Feature ID and Taxon to the SILVA taxonomy files.

Examples of the taxonomy files we have used previously with the above commands are SILVA_132_QIIME_release/taxonomy/16S_only/99/majority_taxonomy_7_levels.txt and SILVA_132_QIIME_release/taxonomy/18S_only/99/majority_taxonomy_7_levels.txt.

Importing files

All of the database files (FASTAs and taxonomy tables) then need to be imported as QIIME 2 artifacts. Below are the commands we have used for the QIIME 2 2019.7 and 2019.10 SOPs.

mkdir imported_files	
# 16S:	
qiime tools import --type 'FeatureData[Sequence]' \	
    --input-path raw_files/silva_132_99_16S_exported/dna-sequences.fasta \	
    --output-path imported_files/silva_132_99_16S.qza	
    	
    	
qiime tools import --type 'FeatureData[Taxonomy]' \	
  --input-path raw_files/silva_132_99_16S_majority_taxonomy_7_levels_exported/taxonomy.tsv \	
  --output-path imported_files/silva_132_99_16S_majority_taxonomy_7_levels.qza	
# 18S:	
qiime tools import --type 'FeatureData[Sequence]' \	
    --input-path raw_files/silva_132_99_18S_exported/dna-sequences.fasta \	
    --output-path imported_files/silva_132_99_18S.qza	
    	
    	
qiime tools import --type 'FeatureData[Taxonomy]' \	
  --input-path raw_files/silva_132_99_18S_majority_taxonomy_7_levels_exported/taxonomy.tsv \	
  --output-path imported_files/silva_132_99_18S_majority_taxonomy_7_levels.qza

# ITS:	
qiime tools import --type 'FeatureData[Sequence]' \	
    --input-path raw_files/sh_qiime_release_s_02.02.2019/sh_refs_qiime_ver8_99_s_02.02.2019.fasta \	
    --output-path imported_files/sh_refs_qiime_ver8_99_s_02.02.2019_ITS.qza	
    	
qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \	
    --input-path raw_files/sh_qiime_release_s_02.02.2019/sh_taxonomy_qiime_ver8_99_s_02.02.2019.txt \	
    --output-path imported_files/sh_taxonomy_qiime_ver8_99_s_02.02.2019.qza	
qiime tools import --type 'FeatureData[Sequence]' \	
    --input-path raw_files/sh_qiime_release_s_all_02.02.2019/sh_refs_qiime_ver8_99_s_all_02.02.2019.fasta \	
    --output-path imported_files/sh_refs_qiime_ver8_99_s_all_02.02.2019_ITS.qza	
    	
qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat \	
    --input-path raw_files/sh_qiime_release_s_all_02.02.2019/sh_taxonomy_qiime_ver8_99_s_all_02.02.2019.txt \	
    --output-path imported_files/sh_taxonomy_qiime_ver8_99_s_all_02.02.2019.qza

Extract amplified regions

We then need to extract the sequences for each primer pair of interest (this command is very long running for the 16S files).

Note that this step requires careful consideration if you're interested in slicing out regions for non-standard primers. Previously we have run into issues with custom classifiers (see: issue here), which were resolved by using more restrictive settings for qiime feature-classifier extract-reads (see below).

The below commands are used for slicing out the regions based on these primers:

"universal" V4/V5 (515F+926R) GTGYCAGCMGCCGCGGTAA CCGYCAATTYMTTTRAGTTT
Bacterial V3/V4 (341F+805R) CCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC
Bacteria-specific V6/V8 (B969F+BA1406R) ACGCGHNRAACCTTACC ACGGGCRGTGWGTRCA
Archaea-specific V6/V8 (A956F+A1401R) TYAATYGGANTCAACRCC CRGTGWGTRCAAGGRGCA
cyano-specific V3/V4 (CYA359F+CYA781R) GGGGAATYTTCCGCAATGGG GACTACWGGGGTATCTAATCCCWTT
18S V4 (E572F+E1009R) CYGCGGTAATTCCAGCTC AYGGTATCTRATCRTCTTYG
ITS (ITS86F+ITS4R) GTGAATCATCGAATCTTTGAA TCCTCCGCTTATTGATATGC

Note the & after each command which is added so that these commands can be run in the background. Make sure these commands actually finish before moving to the next step! Also note that the reverse ITS primer is not actually in the UNITE database and that there is some talk that using this approach reduces accuracy for the ITS region so this step is omitted for the ITS primers and we retain the full region for taxonomic classification of ITS data. Note that non-default options for --p-identity, --p-min-length, and --p-max-length were set to address the issue mentioned above.

mkdir extracted_reads

qiime feature-classifier extract-reads --i-sequences ../qiime2-2019.7_files/imported_files/silva_132_99_16S.qza \
  --p-f-primer GTGYCAGCMGCCGCGGTAA --p-r-primer CCGYCAATTYMTTTRAGTTT \
    --p-identity 0.9 \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads extracted_reads/silva_132_99_16S_V4.V5_515F_926R_stringent.qza &


qiime feature-classifier extract-reads --i-sequences ../qiime2-2019.7_files/imported_files/silva_132_99_16S.qza \
  --p-f-primer CCTACGGGNGGCWGCAG --p-r-primer GACTACHVGGGTATCTAATCC \
    --p-identity 0.9 \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads extracted_reads/silva_132_99_16S_V3.V4_341F_805R_stringent.qza &

qiime feature-classifier extract-reads --i-sequences ../qiime2-2019.7_files/imported_files/silva_132_99_16S.qza \
  --p-f-primer ACGCGHNRAACCTTACC --p-r-primer ACGGGCRGTGWGTRCAA \
    --p-identity 0.9 \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads extracted_reads/silva_132_99_16S_V6.V8_B969F_BA1406R_stringent.qza &

qiime feature-classifier extract-reads --i-sequences ../qiime2-2019.7_files/imported_files/silva_132_99_16S.qza \
  --p-f-primer TYAATYGGANTCAACRCC --p-r-primer CRGTGWGTRCAAGGRGCA \
    --p-identity 0.9 \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads extracted_reads/silva_132_99_16S_V6.V8_A956F_A1401R_stringent.qza &

qiime feature-classifier extract-reads --i-sequences ../qiime2-2019.7_files/imported_files/silva_132_99_16S.qza \
  --p-f-primer GGGGAATYTTCCGCAATGGG --p-r-primer GACTACWGGGGTATCTAATCCCWTT \
    --p-identity 0.9 \
  --p-min-length 300 \
  --p-max-length 500 \
  --o-reads extracted_reads/silva_132_99_16S_V3.V4_CYA359F_CYA781R_stringent.qza &

Train Naive Bayes Classifiers

Lastly, we need to generate the classifiers themselves, which is performed with the below commands. We will generate a classifier on the entire 16S region as well although that typically is not used. Again note the & at the end of each command to enable them to be run in the background. Also, the ITS classifiers are based on the entire ITS region and that two different classifiers are created based on the UNITE database for either all eukaryotes (sh_refs_qiime_ver8_99_s_all_02.02.2019_ITS.qza) or based on just fungi (sh_refs_qiime_ver8_99_s_02.02.2019_ITS.qza).

mkdir taxa_classifiers

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ../qiime2-2019.7_files/imported_files/sh_refs_qiime_ver8_99_s_02.02.2019_ITS.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/sh_taxonomy_qiime_ver8_99_s_02.02.2019.qza \
  --o-classifier taxa_classifiers/classifier_sh_refs_qiime_ver8_99_s_02.02.2019_ITS.qza &

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ../qiime2-2019.7_files/imported_files/sh_refs_qiime_ver8_99_s_all_02.02.2019_ITS.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/sh_taxonomy_qiime_ver8_99_s_all_02.02.2019.qza \
  --o-classifier taxa_classifiers/classifier_sh_refs_qiime_ver8_99_s_all_02.02.2019_ITS.qza &

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ../qiime2-2019.7_files/imported_files/silva_132_99_16S.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/silva_132_99_16S_majority_taxonomy_7_levels.qza \
  --o-classifier taxa_classifiers/classifier_silva_132_99_16S.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ../qiime2-2019.7_files/imported_files/silva_132_99_18S.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/silva_132_99_18S_majority_taxonomy_7_levels.qza \
  --o-classifier taxa_classifiers/classifier_silva_132_99_18S.qza
  
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_reads/silva_132_99_16S_V6.V8_B969F_BA1406R_stringent.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/silva_132_99_16S_majority_taxonomy_7_levels.qza \
  --o-classifier taxa_classifiers/classifier_silva_132_99_16S_V6.V8_B969F_BA1406R.qza
  
  qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_reads/silva_132_99_16S_V6.V8_A956F_A1401R_stringent.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/silva_132_99_16S_majority_taxonomy_7_levels.qza \
  --o-classifier taxa_classifiers/classifier_silva_132_99_16S_V6.V8_A956F_A1401R.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_reads/silva_132_99_16S_V4.V5_515F_926R_stringent.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/silva_132_99_16S_majority_taxonomy_7_levels.qza \
  --o-classifier taxa_classifiers/classifier_silva_132_99_16S_V4.V5_515F_926R.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_reads/silva_132_99_16S_V3.V4_341F_805R_stringent.qza  \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/silva_132_99_16S_majority_taxonomy_7_levels.qza \
  --o-classifier taxa_classifiers/classifier_silva_132_99_16S_V3.V4_341F_805R.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads extracted_reads/silva_132_99_16S_V3.V4_CYA359F_CYA781R_stringent.qza \
  --i-reference-taxonomy ../qiime2-2019.7_files/imported_files/silva_132_99_16S_majority_taxonomy_7_levels.qza \
  --o-classifier taxa_classifiers/classifier_silva_132_99_16S_V3.V4_CYA359F_CYA781R.qza

Voila - your taxonomic classifiers are prepared! It's important that you now run sanity checks on these classifiers to ensure they were created correctly. This is best done by comparing the taxonomic assignments on test input sequences based on these classifiers to the assignments based on an independent approach. I've written a quick pipeline for running these sanity checks specifically for these amplicon regions, which you can see here.