Limma DA - mucosal-immunology-lab/microbiome-analysis GitHub Wiki
- Differential abundance testing
Say we now want to see whether there are bacteria that are differentially abundant according to a sample metadata variable we have available. Perhaps we have information about an individual's age at sampling, or we have some grouping information. This information can be input into a custom wrapper function around the popular limma package we provide here, called bio_limma().
The minimum required arguments for bio_limma() are:
- input_data
- 
model_formula_as_stringormetadata_var
- 
coefficients(not required in usingmetadata_var)
- 
input_data: either aphyloseqobject, aSummarizedExperimentobject (containing metabolomics/lipidomics data and a feature naming column calledshortname), or an RNAseqDGEListorEListobject to use for differential abundance testing.
- 
metadata_var: optional - the name of a single column from thesample_datato use for DA testing (e.g.metadata_var = 'group'). NOT required if providing a formula - it will be changed toNULLifmodel_formula_as_stringis also provided.
- 
metadata_condition: optional - a conditional statement about a certain metadata value, e.g. keeping a certain age group only.
- 
metadata_keep_columns: optional - this is typically required when you have complex metadata, and also supply your own model matrix and contrasts matrix. Choose columns from your metadata to retain using a character vector with column names.
- 
model_matrix: optional - best to let the function create the model matrix for you unless you have more complex comparisons in mind. Also supply your own contrast matrix in this case.
- 
model_formula_as_string: just like it sounds - a string containing the model formula you want to use (only works with '+' and not '*' at this stage).
- 
use_contrast_matrix: a boolean selector for whether to use the contrast matrix or a selected coefficient - the function should select the correct option.
- 
coefficients: selection of coefficient(s) you want to be tested. This will depend on the order of variables in the formula, and you can select as many as you'd like (e.g.coefficients = 2, orcoefficients = 3:4).
- 
DGEList_slot(default ='counts'): only used for RNAseq analysis when using a DGEList object as your input data. It tellsbio_limmawhich assay slot to use for analysis. Supply a string giving the name of the assay if it is not'counts'.
- 
factor_reorder_list: optional - a named list containing reordered factor values, e.g.list(Group = c('GroupHealthy', 'GroupTreatment1', 'GroupTreatment2'))
- 
continuous_modifier_list: optional - a named list containing functions to alter continuous variables, e.g.list(Age = function (x) x / 365)to change ages in days to ages in years.
- 
contrast_matrix: optional - best to let the function create the contrast matrix for you unless you have more complex comparisons in mind. Also supply your own model matrix in this case.
- 
adjust_method: optional - the method used to correct for multiple comparisons, listed here.
- 
rownames: optional - a custom vector of names to be used if you don't wish the names to be automatically derived from taxonomy.
- 
tax_id_col: optional - the phyloseq objecttax_tablecolumn you wish to use for naming - this should match the level being tested (and should also match the deepest taxonomic level in the phyloseq object - if you want to test a higher level then agglomerate the data using thephyloseq::tax_glom()function). If you do not provide this, then the function will automatically select the deepest level (i.e. the right-mosttax_tablecolumn that isn't comprised of allNAvalues).
- 
adj_pval_threshold(default = 0.05): the minimum level deemed significant.
- 
logFC_threshold(default = 1): the minimum log-fold change deemed meaningful.
- 
legend_metadata_string: optional - a custom name for colour or fill options.
- 
volc_plot_title: optional - a custom title for the volcano plot (will be reused for the associated bar plots and individual feature plots).
- 
volc_plot_subtitle: optional - a custom subtitle for the volcano plot (will be reused for the associated bar plots and individual feature plots).
- 
use_groups_as_subtitle: ifTRUE, the function will use the group names (or the contrast matrix comparison names) as the volcano plot subtitle.
- 
volc_plot_xlab: optional - a custom x label for the volcano plot.
- 
volc_plot_ylab: optional - a custom y label for the volcano plot.
- 
remove_low_variance_taxa(default = FALSE): optional - if TRUE, the phyloseq OTU table will be checked for feature-wise variance, and all features with zero variance will be removed prior to downstream analysis. Limma may throw an error if most of the features have no variance, so this step is sometimes required for certain datasets.
- 
plot_output_folder: optional - the path to a folder where you would like output plots to be saved. If left blank, no plots will be saved. Will create a new folder if it does not exist only at the final level; i.e. the parent folder must still exist.
- 
plot_file_prefix: optional - a string to attach to the start of the individual file names for your plots. This input is only used if theplot_output_folderargument is also provided.
- 
redo_boxplot_stats(default =FALSE): boxplot statistics can be recalculated usingstat_compare_means()from theggpubrpackage. The default optionFALSEwill only show the limma statistics for the single comparison.
- 
max_feature_plot_pages(default =NULL): if an integer value is given, it will limit the number of feature plots generated – there will still be 12 plots per page. If you have a large number of significant features, setting this argument to something such as 5 or 10 can greatly speed up the function.
- 
use_voom(default = FALSE): this is a used for RNAseq analysis, and will perform voom transformation prior to running thelimma::fitLm()using the input data and model matrix.
- 
force_feature_table_variable(default =NULL): if you provide your own model matrix and contrast matrix,bio_limmawill attempt to determine which metadata column is being referenced when it tries to generate the feature plots. It does so using thestringdistpackage, and does a decent job. However, if you decided to make "pretty" contrasts names that aren't all that similar to the metadata column name,bio_limmamay make the wrong decision. By settingforce_feature_table_variableto a string, you can ensure that get the plots you want.
- 
feature_plot_dot_colours(default =NULL): you can colour your the dots in your feature plots using a categorical variable identified by providing a string with the name of a metadata column. Keep in mind this column should be complete to avoid potential errors (i.e. noNAvalues).
- 
feature_plot_beeswarm_cex(default =1): use this value to determine the horizontal spacing of the dots. Seeggbeeswarmdocumentation for more details.
The function will return a list with different outputs from the function.
- 
input_data: the original OTU or intensity data used to run the analysis.
- 
input_metadata: a data.frame with the original metadata you provided.
- 
test_variables: a data.frame with the subset of metadata variables used for the analysis.
- 
model_matrix: the model matrix generated (or provided) to the function.
- 
constrast_matrixORcoefficients: either the contrast matrix used, or the coefficients selected, depending on the analysis you chose to run.
- 
limma_significant: a list of data.frames containing the signficant taxa determined by the limma function, with the adjusted p-value and logFC threshold selected, for each comparison/coefficient.
- 
limma_all: a list of data.frames containing the significance levels of DA analysis for all taxa for each comparison/coefficient.
- 
volcano_plots: volcano plots for each of the comparisons/coefficients selected.
- 
bar_plots: bar plots combining significant features for each of the comparisons/coefficients selected. The x-axis shows the log2FC value calculated by limma, with feature names on the y-axis, ordered by the effect magnitude.
- 
feature_plots: individual box plots for each feature, for each of the comparisons/coefficients selected. A significance bar will only be shown for the groups being compared, however all groups (if there are more than two) will be plotted for reference.
- 
venn_diagram: a Venn diagram that will show up when you run the function.
If a plot output folder path is provided, for each comparison/coefficient you have selected, three output .pdf files will be generated (provided there is at least 1 significant difference detected):
- 
{plot_file_prefix}_{test_variable + group}_volcplot.pdf: a volcano plot showing all features, with significant feaetures labelled, decreased features in blue and increased features in red.
- 
{plot_file_prefix}_{test_variable + group}_barplot.pdf: a bar plot showing significant features, with the log2FC on the x-axis and feature name on the y-axis. The y-axis is ordered by the log2FC magnitude, with the lowest at the bottom and highest at the top. Negative log2FC features are coloured blue, while positive ones are coloured red. The output plot automatically resizes depending on the number of variables being plotted.
- 
{plot_file_prefix}_{test_variable + group}_featureplots.pdf: individual box plots for each feature, for each of the comparisons/coefficients selected. A significance bar will only be shown for the groups being compared, however all groups (if there are more than two) will be plotted for reference. These plots are arranged with 12 features to a page (3 columns and 4 rows). Multiple pages will be combined into a single output .pdf file if there are more than 12 significant features.
If we have longitudinal microbiome sampling, we may want to know which taxa change with time. As there is likely to be little change on a day-to-day basis, we can also modify the age information from days to years. Furthermore, we can even control for potentially confounding factors like individual variation.
# Run custom limma continuous function for taxa vs age
bact_limma_age <- bio_limma(input_data = bact_kraken2_logCSS,
                            model_formula_as_string = '~ Age + Individual',
                            continuous_modifier_list = list(Age = function (x) x / 365),
                            coefficients = 2,
                            volc_plot_title = 'Differentially Abundant Taxa over Time',
                            volc_plot_xlab = 'log2FC/year')
# View volcano plot
bact_limma_age$volcano_plots$Age
This will produce a volcano plot that looks something like this (the taxa names have been omitted, but will normally appear):
 
In another example looking at differences between treatment groups, the bar plot produced looks like this:
 
An example of the feature plots output for a categorical explanatory variable is shown here:
 
Or an example of the feature plots output for a continuous explanatory variable is shown here:
 
When we split up the phyloseq temporarily for investigation of beta diversity, we saw that there appears to be some good separation of groups at days 7, 14, and 21 post-treatment. Therefore we will probably decide to do some differential abundance testing, and the bio_limma() function above can handle this for us, as well as generate and save all of the relevant exploratory plots we will want initially to inspect the differences between groups.
Because lists are a great way to keep everything organised (and minimise the growing number of variables in your R environment), we will separate the phyloseq object into multiple subsets (one for each time point), and store these within a list called input_data, which itself sits inside a main list that will house both the input data and the results from statistical analyses of that input data – you may want to run several variations with different statistical thresholds or test variable combinations for example. By storing everything in a master list, you keep all the inputs and outputs together.
We will separate the bacteria by time and store these "sub-phyloseq objects" within a container list called input_data. This container object is a sub-list that contains each of the datasets separated by time, and sits inside a "master list" that will also hold any statistical tests we run on the data.
# Split the data by time
bact_time_split <- list(input_data = list(day0 = subset_samples(bact_data_logCSS, days_postTx == '0'),
                                          day7 = subset_samples(bact_data_logCSS, days_postTx == '7'),
                                          day14 = subset_samples(bact_data_logCSS, days_postTx == '14'),
                                          day21 = subset_samples(bact_data_logCSS, days_postTx == '21'),
                                          day28 = subset_samples(bact_data_logCSS, days_postTx == '28')))The data now has the following structure:
bact_time_split # class = list
  |
  |---input_data # class = list
        |
        |---day0 # class = phyloseq
        |---day7 # class = phyloseq
        |---day14 # class = phyloseq
        |---day21 # class = phyloseq
        |---day28 # class = phyloseqFrom here, we can loop through the phyloseq objects contained within the input_data list, and run bio_limma().
# Loop over the custom bio_limma function
bact_time_split$limma_groupDA_ASV <- list() # create an empty list to store outputs
for (day in names(bact_time_split$input_data)) { # loop over the names in 'input_data', assigning each successively to the 'day' variable
  bact_phyloseq <- bact_time_split$input_data[[day]] # retrieve the appropriate phyloseq object
  limma_groupDA <- bio_limma(phyloseq_object = bact_phyloseq,
                             model_formula_as_string = '~ group',
                             coefficients = 2,
                             plot_output_folder = here::here('figures', 'limma_DA', 'treatment_group', 'ASV'), # output plots will be saved here
                             plot_file_prefix = paste0('tx_group_', day),
                             volc_plot_title = paste0('DA ASVs - ', day, ' post-treatment'))
  bact_time_split$limma_groupDA_ASV[[day]] <- limma_groupDA # add the output from the limma function to the list
}After this step, our data now has the following structure:
bact_time_split # class = list
  |
  |---input_data # class = list
  |     |
  |     |---day0 # class = phyloseq
  |     |---day7 # class = phyloseq
  |     |---day14 # class = phyloseq
  |     |---day21 # class = phyloseq
  |     |---day28 # class = phyloseq
  |
  ----limma_groupDA_ASV # class = list
        |
        |---day0 # class = list
        |     |
        |     |---input_data # class = data.frame
        |     |---input_metadata # class = data.frame
        |     |---test_variables # class = data.frame
        |     |---model_matrix # class = double
        |     |---coefficients # class = double
        |     |---limma_significant # class = list
        |     |     |
        |     |     |---groupDrugXYZ # class = data.frame
        |     |
        |     |---limma_all # class = list
        |     |     |
        |     |     |---groupDrugXYZ # class = data.frame
        |     |
        |     |---volcano_plots # class = list
        |     |     |
        |     |     |---groupDrugXYZ # class = ggplot
        |     |
        |     |---bar_plots # class = list
        |     |     |
        |     |     |---groupDrugXYZ # class = ggplot
        |     |
        |     |---feature_plots # class = list
        |     |     |
        |     |     |---groupDrugXYZ # class = list
        |     |           |
        |     |           |---significant_feature1 # class = ggplot
        |     |           |---significant_feature2 # class = ggplot
        |     |           |---etc.
        |     |
        |     |---venn_diagram
        |
        |---day7 # class = list (as per day0)
        |---day14 # class = list (as per day0)
        |---day21 # class = list (as per day0)
        |---day28 #class = list (as per day0)The major benefit of organising our data in this way is that is keeps all analyses we run combined in the same place. We could even choose to organise our data this way even if we weren't splitting the original phyloseq.
In the same way that we added the empty limma_groupDA_ASV list to our master list, we could add another empty list, e.g. limma_geneY_ASV, to look at how microbial abundance may be altered by expression of "gene Y", or any other number of variables.
The main thing is that all analyses relating to a particular phyloseq object or set of split phyloseq objects are contained in the same place. How tidy!
If we have a selection of sample metadata that we are unsure whether or not to include in the model formula for bio_limma(), we can use a wrapper around the limma::selectModel() function called limma_best_model(), provided here.
The function takes the primary variable you want to test, a set of additional variables you want to assess for inclusion, and the phyloseq object for testing. You can also choose between Akaike's Information Criterion (AIC) or Bayesian Information Criterion (BIC) as your selection methods.
- 
input_data: either aphyloseqobject or aSummarizedExperimentobject (containing metabolomics/lipidomics data) that you are assessing model fits for – this should be the same as the object you will use insidebio_limma().
- 
key_variable: the name of thesample_datacolumn of the variable you are interested in testing differential abundance for.
- 
other_parameters: a character vector of othersample_datacolumns you want to assess to determine whether they improve the fit of the model.
- 
selection_metric(default ='bic'): the selection criterion metric you want to use to compare different model fits. The default is the BIC metric, but you can also use AIC (by changing the argument to'aic').
The function will output a list with the following elements:
- 
formula_string: the best model formula as a string (for direct use inbio_limma())
- 
key_var_coef: the coefficients of the model that contain thekey_variable, for direct use inbio_limma().
- 
scores_plot: a plot of the top 10 models. The x-axis of the plot shows the percentage of input for which a given model is the best.
- 
model_scores: adata.framewith the model scores output information.
- 
selection_metric: a string containing the selection criterion metric you used, either'AIC'or'BIC'.
An example of using this function is as follows:
# Determine the best model
bestmodel <- limma_best_model(phyloseq_object = bact_data_logCSS,
                              key_variable = 'gene_expression_IL1a_brain_log',
                              other_parameters = c('treatment', 'injury', 'sex'),
                              selection_metric = 'bic')
bestmodel$scores_plotThe output plot looks like this:
 
The bio_limma() function ideally wants to be given a model formula as a string and the coefficients it should be using. Therefore, we can use the outputs of limma_best_model() directly within the call to bio_limma(), as shown here:
limma_il1a <- bio_limma(input_data = bact_data_logCSS,
                        model_formula_as_string = bestmodel$formula_string,
                        coefficients = bestmodel$key_var_coef)If you used this repository in a publication, please mention its url.
- Copyright (c) 2023 Mucosal Immunology Lab, Monash University, Melbourne, Australia.
- Authors: M. Macowan