OTU_classifier.sh - juanravm/MicroSeqProfiler GitHub Wiki
OTU classification and chimera filtering (OTU_classifier.sh)
After the data have been imported and quality filtered, this script clusters reads with high sequence identity into OTUs (Operational Taxonomic Units). Depending on the identity threshold chosen for the study, each OTU is expected to group reads that come from closely related species. The script also removes chimeric reads and classifies the OTUs down to the species level by training a machine learning model (a Naive Bayes classifier) on 16S rRNA reference sequences. You can run this script as shown below:
bash /file_path/OTU_classifier.sh \
--QC_table /file_path/QC-table.qza \
--QC_seqs /file_path/QC-seq.qza \
--metadata_fp /file_path/metadata.tsv \
--perc_identity 0.99 \
--f_primer GTGYCAGCMGCCGCGGTAA \
--r_primer GGACTACNVGGGTWTCTAAT \
--class_seq /file_path/99_otus.fasta \
--class_tax /file_path/99_otu_taxonomy.txt
In this command you must specify the following input arguments:
- --QC_table: Path to the QC-table.qza file
- --QC_seqs: Path to the QC-seq.qza file
- --metadata_fp: Path to the metadata file
- --perc_identity: Identity threshold used to cluster OTUs, given as a fraction from 0 to 1. In this example, 0.99 groups reads that share at least 99% sequence identity
- --f_primer: Forward primer sequence, in 5' -> 3' direction, used for 16S rRNA amplification
- --r_primer: Reverse primer sequence, in 5' -> 3' direction, used for 16S rRNA amplification
- --class_seq: Reference sequences in .fasta format, used to train a Naive Bayes taxonomic classifier
- --class_tax: Taxonomy file that assigns a taxon to each of these reference sequences
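Because a mistyped threshold or primer may only surface late in a long run, it can help to check these values before launching the script. The following is a minimal sketch, not part of OTU_classifier.sh: the function names `valid_perc_identity` and `valid_primer` are hypothetical, and the checks only assume what is stated above (a fraction from 0 to 1, and primers written with IUPAC nucleotide codes).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight checks; these helpers are NOT part of OTU_classifier.sh.

# Succeeds if $1 is a decimal fraction from 0 to 1 (e.g. 0.99, 1, 1.0).
# Values without a leading digit (".99") are rejected on purpose.
valid_perc_identity() {
    local re='^(0(\.[0-9]+)?|1(\.0+)?)$'
    [[ $1 =~ $re ]]
}

# Succeeds if $1 is a non-empty string of uppercase IUPAC nucleotide codes,
# which includes degenerate bases such as Y, M, N, V and W.
valid_primer() {
    local re='^[ACGTRYSWKMBDHVN]+$'
    [[ $1 =~ $re ]]
}

# Example: check the values used in the command shown on this page.
for value in 0.99 99; do
    if valid_perc_identity "$value"; then
        echo "perc_identity $value: ok"
    else
        echo "perc_identity $value: must be a fraction from 0 to 1"
    fi
done

for primer in GTGYCAGCMGCCGCGGTAA GGACTACNVGGGTWTCTAAT; do
    if valid_primer "$primer"; then
        echo "primer $primer: ok"
    else
        echo "primer $primer: contains non-IUPAC characters"
    fi
done
```

A value such as 99 (a percentage rather than a fraction) is rejected, which is the most likely mistake for this argument.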
For this type of analysis, QIIME2 usually recommends reference sequences and taxonomy from the Greengenes or SILVA databases, clustered at the identity percentage used in your study. In our example data, we used the reference sequences and taxonomy clustered at 99% identity from the Greengenes database.
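Since the classifier needs a taxonomy entry for every reference sequence, a quick consistency check of the two reference files can be useful before running the script. The sketch below is an illustration only: `missing_taxonomy_ids` is a hypothetical helper, and it assumes FASTA headers of the form `>ID` and a tab-separated taxonomy file whose first column is the same ID. If your reference files use another layout, adapt it accordingly.

```shell
#!/usr/bin/env bash
# Hypothetical helper; NOT part of OTU_classifier.sh.

# Prints every reference sequence ID that has no entry in the taxonomy file.
# Assumed layout: FASTA headers ">ID [description]" and a tab-separated
# taxonomy file with the ID in the first column.
missing_taxonomy_ids() {
    local fasta=$1 taxonomy=$2
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: taxonomy IDs
        /^>/ {
            id = substr($0, 2)          # drop the leading ">"
            sub(/[ \t].*/, "", id)      # keep the text before any whitespace
            if (!(id in seen)) print id
        }
    ' "$taxonomy" "$fasta"
}

# Demonstration on toy files (made-up IDs and sequences, for illustration only).
tmp_dir=$(mktemp -d)
printf '1\tk__Bacteria; p__Firmicutes\n2\tk__Bacteria; p__Proteobacteria\n' > "$tmp_dir/tax.txt"
printf '>1\nACGT\n>2\nGGCC\n>3 extra\nTTAA\n' > "$tmp_dir/seqs.fasta"
missing_taxonomy_ids "$tmp_dir/seqs.fasta" "$tmp_dir/tax.txt"
rm -r "$tmp_dir"
```

On your own files the call would be `missing_taxonomy_ids /file_path/99_otus.fasta /file_path/99_otu_taxonomy.txt`. An empty output means every sequence has a taxonomy entry.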
This script returns:
- taxa_barplot.qzv: A QIIME2 visualization of the taxonomic composition of the samples. It can be displayed online (for example with QIIME 2 View) or with the qiime tools view /file_path/taxa_barplot.qzv command in a terminal
- OTU_filtered_seqs.qza: The representative, chimera-filtered sequence of each OTU, in QIIME2 artifact format
- OTU_filtered_table.qza: The filtered OTU counts, in QIIME2 artifact format
- OTU_filtered_table.tsv: The filtered OTU counts, in .tsv format
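The .tsv table can be inspected with standard command-line tools. As a minimal sketch, the hypothetical function `summarise_otu_table` below counts the OTUs and the total number of reads. It assumes a BIOM-style export, in which comment and header lines start with `#` and each remaining row holds an OTU identifier followed by one count per sample. The layout actually produced by the script may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper; NOT part of OTU_classifier.sh.

# Prints the number of OTUs and the total read count of a tab-separated table.
# Assumed layout: lines starting with "#" are comments or the header row;
# every other row is "<OTU id><TAB><count sample 1><TAB><count sample 2>...".
summarise_otu_table() {
    awk -F '\t' '
        /^#/ || NF < 2 { next }                       # skip comments, header, blanks
        { otus++; for (i = 2; i <= NF; i++) total += $i }
        END { printf "OTUs: %d\nTotal reads: %d\n", otus, total }
    ' "$1"
}

# Demonstration on a toy table (made-up counts, for illustration only).
toy_table=$(mktemp)
printf '# Constructed from biom file\n#OTU ID\tS1\tS2\n' > "$toy_table"
printf 'otu1\t10.0\t5.0\notu2\t0.0\t3.0\n' >> "$toy_table"
summarise_otu_table "$toy_table"
rm "$toy_table"
```

On real output the call would be `summarise_otu_table /file_path/OTU_filtered_table.tsv`.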
This script also returns a picrust2 directory with an input subdirectory, which contains the files prepared for further analysis with the picrust2.sh script. You can ignore this directory if you do not plan to run picrust2.sh afterwards.