Differential Abundance ‐ Options and Code - meyermicrobiolab/Meyer_Lab_Resources GitHub Wiki
Differential abundance tests are common analyses we employ to identify microbial taxa or genes that are enriched or depleted with a given condition (ex - reef, oxygen level, coral species). There are many options when it comes to differential abundance tests, so it can be challenging to know which test to use. This page serves as a resource for different tests our lab has used in the past. Existing research on differential abundance tests comparing many tools can provide some direction on which tool to use. For a good example, see Nearing et al. 2022. Specifically, their table 1 includes an overview of the different tools they tested.
Our lab members have used the following tools: Corncob, DESeq2, and ALDEx2. We have also tried ANCOM-BC (or ANCOM-II?), but struggled to get the tool to work. Under each tool, we will provide example code and pros/cons of the tool.
Contents
- Corncob
- DESeq2
- ALDEx2
Corncob is a great tool for differential abundance. Some of the positive aspects of Corncob:
- built for microbiome-based research (16S, whole genome sequencing (WGS))
- works on a phyloseq object with raw counts making it relatively easy to use
- works with BOTH continuous or categorical variables
- can test both abundance and variance, so you could also ask "is the variance in taxon x changing across a given covariate?"
Some negative to neutral aspects of Corncob:
- Tends to output fewer results than other tools (i.e. DESeq2)
- Results generally better at capturing more abundant taxa
Some additional tutorials/help from the developers of Corncob:
The following is code for using Corncob with a continuous variable. In this case, I was asking the question: What taxa were differentially abundant in response to ammonium enrichment?
Corncob works right on a phyloseq object (must use raw counts), which I first adjust to remove any samples with "NA" values for ammonium. Corncob will not work in continuous mode if there are NA values. Corncob will also not work if you supply relative abundances.
# Remove Ammonium values that are NAs and the associated samples.
sample_data(ps)$missingnh4 <- is.na(sample_data(ps)$nh4)
ps.nh4.narm <- ps %>%
subset_samples(missingnh4 == "FALSE")
# Try renaming all the NA taxa using the fantaxtic package - makes it easier to see the rank of unclassified taxa
ps.nh4.narm <- name_na_taxa(ps.benthic.nh4.narm,
na_label = "Unclassified <tax> (<rank>)")
In the following code, I want to test for the effect of ammonium (nh4
) on the abundance of each taxon, so set the formula_null = ~ 1
(null model) and the formula = ~ nh4
. Since I am not interested in the effect of ammonium on variance of taxa, I set phi.formula
and phi.formula_null
to ~ nh4
. The test is LRT
for likelihood ratio test. The data =
is where you put the phyloseq object. The fdr_cutoff
is the false-discovery rate cutoff, which is generally set at 0.05.
set.seed(1) # Set seed for reproducibility
nh4_narm.da <- differentialTest(formula = ~ nh4,
phi.formula = ~ nh4,
formula_null = ~ 1,
phi.formula_null = ~ nh4,
test = "LRT", boot = FALSE,
data = ps.nh4.narm,
fdr_cutoff = 0.05)
plot(nh4_narm.da) # use the corncob plot function to investigate the output
The plot()
function is limited in corncob. If you want to manipulate the corncob output in prep for graphing, take a look at some functions Cynthia has created for saving the corncob output, and how she applied the function.
Corncob will work with any number of categories. In this example, I am asking What coral microbiome taxa were differentially abundant with coral health state?.
otu <- read.table("ASV_STT2020_filt_decontam.txt",sep="\t",header=TRUE, row.names=1)
taxon <- as.matrix(read.table("taxonomy_STT2020_nospecies_filt_decontam.txt", sep="\t", header=TRUE, row.names=1))
samples <- read.table("metadata_STT2020_filt_decontam.txt",sep="\t",header=T,row.names=1)
ASV = otu_table(otu, taxa_are_rows = FALSE)
TAX = tax_table(taxon)
META = sample_data(samples)
ps <- phyloseq(ASV, TAX, META)
# Set the levels. Corncob will automatically "choose" the control/reference level as the first one.
# So in this case, I need to change the levels so "H" for healthy is the baseline level.
sample_data(ps)$healthstate <- factor(sample_data(ps2)$healthstate,levels=c("H","D"))
ps_mcav = subset_samples(ps, species == "MCAV")
set.seed(1)
mcav.da <- differentialTest(formula = ~ healthstate,
phi.formula = ~ healthstate,
formula_null = ~ 1,
phi.formula_null = ~ healthstate,
test = "Wald", boot = FALSE,
data = ps_mcav,
fdr_cutoff = 0.05)
plot(mcav.da) # check out the results with the corncob plot function
Corncob also works well for "controlling" for different covariates. For example, say I know reef site is also influencing the microbiome of coral hosts. So what if I want to identify taxa changing with healthstate when controlling for reef site?. To do this, change the function with the following below. Now, by adding reef_site
to both the formula_null
and formula
, you are controlling for the effect of reef site.
mcav.da <- differentialTest(formula = ~ healthstate + reef_site,
phi.formula = ~ healthstate + reef_site,
formula_null = ~ 1 + reef_site,
phi.formula_null = ~ healthstate + reef_site,
test = "Wald", boot = FALSE,
data = ps_mcav,
fdr_cutoff = 0.05)
DESeq2 is a widely used tool in microbiome analyses to determine differentially abundant taxa. This method is based on the negative binomial distribution of the data. Some positives aspects of the tool:
- It works on phyloseq objects with raw counts for easy application
- Lots of docomentation, support, and tutorials available
- Tends to output more results than other tools
Some of the cons of DESeq2:
- It was originally developed to test differences in gene expression in RNA-Seq data
- Can only be used to compare abundance between two categorical variables
Citation: https://genomebiology.biomedcentral.com/counter/pdf/10.1186/gb-2010-11-10-r106.pdf
##############################################
## DESeq2 analysis
##############################################
# check to see that you are using the full dataset (ps), not the low abundance ASV filtered dataset (ps5)
ps
# Define the order of the conditions for testing
# In this order, the positive fold change values are increased in Belize
sample_data(ps)$Region<-factor(sample_data(ps)$Region,levels=c("Curaçao","Belize"))
head(sample_data(ps)$Region, 10)
# DESEQ2 analysis on Dcyl Samples
dds = phyloseq_to_deseq2(ps, ~ Region)
dds
#filter rows with very few counts
dds <- dds[ rowSums(counts(dds)) > 5, ]
cts <- counts(dds)
geoMeans <- apply(cts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
dds <- estimateSizeFactors(dds, geoMeans=geoMeans)
dds <- DESeq(dds,test="Wald", fitType="parametric")
res = results(dds, cooksCutoff = FALSE)
alpha = 0.01
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(ps)[rownames(sigtab), ], "matrix"))
head(sigtab)
dim(sigtab)
#save table of results
sig = as(sigtab, "matrix")
write.table(sig,"DESeq2_results.txt",sep="\t",col.names=NA)
sig_dif<- read.table("DESeq2_results.txt", header = TRUE)
colnames(sig_dif)[1]<- "ASV"