11. Classifiers - DianaCarolinaVergara/16S-rRNA-Analysis GitHub Wiki

Machine Learning

Machine learning refers to automated methods for creating classifiers. A classifier is an algorithm which assigns categories to observations, where an observation is described by numerical values for a set of features. For example, categories could be "baby" and "adult", and the data could be a photograph of a face. Here, features are pixels, and values represent colors [1].

In OTU analysis, observations are samples and categories are metadata states such as healthy / sick, day / night, or, here, zooxanthellate / azooxanthellate. The data describing an observation is the set of OTU counts or frequencies in a sample, i.e. a column in the OTU table. The column can be viewed as a vector, which in machine learning terminology is called a feature vector (the OTUs are the features).
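To make the idea of a feature vector concrete, here is a minimal sketch in plain Python. The table, OTU names, and sample names are hypothetical toy values, and the helper `feature_vector` is introduced only for illustration; it is not part of QIIME.

```python
# Hypothetical toy OTU table: rows are OTUs, columns are samples (read counts).
otu_table = {
    "OTU_1": {"S1": 30, "S2": 0, "S3": 12},
    "OTU_2": {"S1": 10, "S2": 45, "S3": 12},
    "OTU_3": {"S1": 60, "S2": 5, "S3": 0},
}


def feature_vector(table, sample, as_frequency=False):
    """Return the column for `sample` as a list with one entry per OTU."""
    otus = sorted(table)  # fixed OTU order so vectors from different samples line up
    counts = [table[otu][sample] for otu in otus]
    if not as_frequency:
        return counts
    total = sum(counts)
    # Guard against empty samples to avoid dividing by zero.
    return [c / total if total else 0.0 for c in counts]


counts_s1 = feature_vector(otu_table, "S1")
freqs_s1 = feature_vector(otu_table, "S1", as_frequency=True)
print(counts_s1)
print(freqs_s1)
```

Each sample yields one vector of the same length, and a classifier only ever sees these vectors together with the category assigned to each sample.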

Random Forest

We can run a random forest classifier directly within QIIME 2 via the qiime sample-classifier plugin.

https://chmi-sops.github.io/mydoc_qiime2.html

https://github.com/qiime2/q2-sample-classifier

This is supervised learning: the parameters of a classifier are trained on observations that are labeled with their correct categories. It is called "supervised" because the machine needs help (supervision) during the learning phase, but not during the classification phase.

A trained classifier can be used to predict categories of novel samples. This could be used, for example, to create a diagnostic test for a gut disorder using 16S data from stool samples.
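The two phases, training on labeled samples and then classifying an unlabeled one, can be sketched as follows. This is a deliberately simple nearest-centroid classifier used as a stand-in, not a random forest and not what QIIME runs internally; the function names, vectors, and labels are hypothetical.

```python
from collections import defaultdict


def train_centroids(vectors, labels):
    """Learning phase: average the feature vectors of each labeled category."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, vec):
    """Classification phase: no label needed, pick the closest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


# Hypothetical OTU frequency vectors (3 OTUs) with known categories.
train_vectors = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
train_labels = ["healthy", "healthy", "sick", "sick"]

model = train_centroids(train_vectors, train_labels)  # supervised: labels required
novel_sample = [0.65, 0.25, 0.10]                     # new sample, label unknown
prediction = predict(model, novel_sample)
print(prediction)
```

The labels are used only in `train_centroids`; once the model exists, `predict` works from the feature vector alone, which is what makes a diagnostic test on new samples possible.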

The most common use of machine learning in OTU analysis is to answer the questions:

Does the composition of the community change with the sample metadata state (healthy / sick etc.)?

Can metadata states be predicted from OTU counts or frequencies?
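One common way to approach both questions is to hold out samples, predict their metadata state from the remaining samples, and compare the accuracy with a naive baseline. The sketch below uses leave-one-out evaluation with a 1-nearest-neighbour rule as a simple stand-in for a real classifier; the data and function names are hypothetical, and this is not necessarily the evaluation scheme QIIME applies.

```python
def nearest_label(train, vec):
    """Return the label of the training sample closest to `vec` (1-nearest neighbour)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: sq_dist(item[0], vec))[1]


def leave_one_out_accuracy(samples):
    """Hold out each sample in turn, predict it from the rest, and count the hits."""
    hits = 0
    for i, (vec, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        hits += nearest_label(rest, vec) == label
    return hits / len(samples)


# Hypothetical (OTU frequency vector, metadata state) pairs.
samples = [
    ([0.7, 0.2, 0.1], "zooxanthellate"),
    ([0.6, 0.3, 0.1], "zooxanthellate"),
    ([0.8, 0.1, 0.1], "zooxanthellate"),
    ([0.1, 0.2, 0.7], "azooxanthellate"),
    ([0.2, 0.1, 0.7], "azooxanthellate"),
    ([0.1, 0.1, 0.8], "azooxanthellate"),
]

accuracy = leave_one_out_accuracy(samples)

# Baseline: always guess the most common metadata state.
labels = [label for _, label in samples]
baseline = max(labels.count(state) for state in set(labels)) / len(labels)
print(accuracy, baseline)
```

If held-out accuracy is clearly above the baseline, the metadata state can be predicted from the OTU data, which in turn suggests that community composition differs between states. With a handful of samples, as in this toy example, such a result would not be reliable.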

The qiime sample-classifier command takes as input the DADA2 feature table (--i-table) and the metadata table in .txt format (--m-metadata-file). The --m-metadata-column option selects the specific metadata column you want to analyze, here the zooxanthellae column.

The estimator (--p-estimator) used here is the random forest classifier, RandomForestClassifier.

Code:

qiime sample-classifier classify-samples \
  --i-table table-dada2.qza \
  --m-metadata-file sample_metadata.txt \
  --m-metadata-column zooxanthellae \
  --p-optimize-feature-selection \
  --p-parameter-tuning \
  --p-estimator RandomForestClassifier \
  --p-n-estimators 100 \
  --output-dir RandomForest