Processing OTU table
Below are some stats I'm running on that OTU table using RStudio.
1. Format the OTU table
So, the OTU table generated by QIIME2 has this format:
# Constructed from biom file
#OTU ID sample1 sample2 sample3 ...so on
asdlfjd2f 0 42 2
feosif2sf 3 12 155
...so on
To prep the table file for R, first get rid of that first line, `# Constructed from biom file`, and get rid of any spaces in the headers.
OTU_ID sample1 sample2 sample3 ...so on
asdlfjd2f 0 42 2
feosif2sf 3 12 155
...so on
Save this file as whatever you like, but this tutorial explains how to import the table into R as a comma-separated file (`.csv`).
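If you'd rather do that cleanup in R instead of a text editor, here's a rough sketch. It assumes the raw QIIME2 export is a tab-separated file called `plant_otus_qiime2.tsv` (a made-up name, so use whatever yours is called).

```r
# Read the raw export, skipping the "# Constructed from biom file" line
raw <- read.delim("plant_otus_qiime2.tsv", skip = 1, header = TRUE, check.names = FALSE)
names(raw)[1] <- "OTU_ID"   # rename "#OTU ID" so there's no "#" or space in the header
write.csv(raw, file = "plant_otus_qiime2.csv", row.names = FALSE)
```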
2. Importing the OTU table as a matrix
I named the table `otus` in R because I am laaaaaazy.
setwd("/Users/alaina/Documents/1PhD/projects/Plants")
otus <- read.csv("plant_otus_qiime2.csv", sep = ",", header = TRUE, row.names = 1)
sep = ","
tells R columns were separated by commas in the file
header = TRUE
tells R that the first row corresponds to the name of the column (OTU_id, sample1, sample2, etc.)
row.names = 1
tells R that the first column corresponds to the name of the rows (the weirda$$ OTU id names)
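A quick way to double-check the import worked:

```r
dim(otus)    # should be (number of OTUs) x (number of samples)
head(otus)   # eyeball the first few rows of counts
```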
3. Normalizing the data
The OTU table contains counts of sequences belonging to each OTU in each sample. Because each sample may not have the same number of reads, you need to normalize these OTU counts to the total number of reads in the sample so that you can more accurately compare OTU composition between samples.
Rationale: Sample 1 might have 1000 reads, and sample 2 might have 500 reads. If there are 400 sequences of OTU:A in sample 1 and 400 in sample 2, they actually have drastically different composition (OTU:A comprises **40%** of the sequences in sample 1 and **80%** of the sequences in sample 2).
I know, there are more efficient ways to do this, but this honey badger dgaf.
First, you need a vector containing the total number of reads in each sample, the sum of each column.
total_reads <- colSums(otus)
Next, you use the `scale` function to divide the values in each column by that column-sum vector.
norm_otus <- scale(otus, center = FALSE, scale = total_reads)
Here's a little explanation of this function. The first input is the OTU table, which I had conveniently called `otus`. Then I set `center = FALSE` because "centering" the values would result in +/- values based on the average value of the column. I'm not explaining that well, but just know you need to set it to `FALSE`. Next, I set `scale = total_reads`. This tells the function to scale the values of each column by the corresponding value in the vector that you created with the `colSums` function.
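A quick sanity check on the scaling: every sample column of `norm_otus` should now sum to 1.

```r
# Each column was divided by its own total, so these should all come out to 1
colSums(norm_otus)
```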
Because I'm incompetent and didn't know how to replace these normalized values in my original table, I wrote this normalized matrix to a new `.csv` and then opened Excel and copied over the sample IDs and factor columns.
write.csv(norm_otus, file = "plant_otus_normalized.csv")
So now this file looks like dis:
OTU_ID sample1 sample2 sample3 ...so on
asdlfjd2f 0 0.78 0.01
feosif2sf 1 0.22 0.99
...so on
Next, you need to transpose the table so you can start doing stats and visualizations in R. There are ways to transpose via R (there's a sketch of that below), but I just opened the `.csv` file in Excel and copied the whole shebang, then reopened a blank file and copy-transposed it there. And then saved this file as `plant_norm_transposed.csv`. I also renamed the first column from `OTU_ID` to `sample_ID`.
It should look like dis:
sample_ID asdlfjd2f feosif2sf ...so on
sample1 0 1
sample2 0.78 0.22
sample3 0.01 0.99
...so on
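And here's that R sketch, if you want to skip the Excel copy-transpose dance. It assumes `norm_otus` is the scaled matrix from the `scale()` call above.

```r
# Flip the table: samples become rows, OTUs become columns
transposed <- as.data.frame(t(norm_otus))
# Turn the sample names (currently the row names) into a proper first column
transposed <- cbind(sample_ID = rownames(transposed), transposed)
write.csv(transposed, file = "plant_norm_transposed.csv", row.names = FALSE)
```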
To do the fun stats and things, you need to add columns with your meta-data variables. For example:
sample_ID season location asdlfjd2f feosif2sf
sample1 winter Camelot 0 1
sample2 winter Oz 0.78 0.22
sample3 spring Camelot 0.01 0.99
These factors (e.g. temperature, host) will overlay the data points in your plots. Now save this as whatever you like, but this web page will tell you how to import as if you've saved this file as a `.csv`.
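You can also bolt the meta-data on in R instead of Excel. Here's a rough sketch, assuming your sample variables live in a separate file with a matching `sample_ID` column — I'm calling it `plant_metadata.csv` here, which is a made-up name, so swap in your own.

```r
transposed <- read.csv("plant_norm_transposed.csv", header = TRUE)
metadata   <- read.csv("plant_metadata.csv", header = TRUE)   # hypothetical file: sample_ID, season, location, ...
# Match the rows up by sample_ID and stick the tables together
with_vars <- merge(metadata, transposed, by = "sample_ID")
write.csv(with_vars, file = "new_plant_otus_normalized_with_var.csv", row.names = FALSE)
```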
4. Create a distance matrix for stats and figs
Then, I created a distance matrix with the Bray-Curtis method, using the file that had the normalized values and the environmental factors.
Rationale: The main reason for Bray-Curtis versus some other methods, as far as I can recall, is that this distance bases dis/similarity only on the OTUs actually present in the samples (and their relative abundances) and ignores joint absences. Meanwhile, a distance like Euclidean counts two samples as more similar just because they're both missing the same OTUs. (The Jaccard index also ignores joint absences, but it only uses presence/absence, whereas Bray-Curtis uses the abundances.) And as many of you know in ecology, just because we didn't detect it doesn't mean it's not there. So it's a little more accurate to base similarity on shared presence of organisms rather than on shared absences.
To calculate this distance matrix, I used the `vegan` package. If you don't have it already, you can install it. You'll also need to load in your normalized OTU table with the meta-data columns added. I called this table `norm_otus`. One catch: `vegdist()` only wants numbers, so you have to pull out just the OTU columns (and leave the meta-data columns behind) before feeding it in. I also set `row.names = 1` so the sample IDs end up as the row names, and therefore as the labels on the distance matrix.
norm_otus <- read.csv("new_plant_otus_normalized_with_var.csv", sep = ",", header = TRUE, row.names = 1)
install.packages("vegan")
library(vegan)
otu_only <- norm_otus[, sapply(norm_otus, is.numeric)]   # just the OTU abundance columns, meta-data dropped
otu_dis <- vegdist(otu_only, method = "bray")
Once you have this distance matrix, you can run multivariate stats and other interesting things on your dataset. I have some examples here.
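Just to give a taste, here's a minimal sketch of one of those: a PERMANOVA with `adonis2()` from `vegan`, assuming you kept the `season` column from the meta-data example above.

```r
# Does community composition differ by season? (season is the example meta-data column)
adonis2(otu_dis ~ season, data = norm_otus)
```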