
MTX Model 3.0 Tutorial

MaAsLin 3 (Microbiome Multivariable Association with Linear Models) now directly incorporates the MTX model built for metatranscriptomics (MTX) differential gene expression analysis. It integrates feature-specific covariates to determine multivariable associations between metadata and microbial MTX features since RNA expression changes within a microbial community are highly affected by the underlying differences in metagenomic abundances (i.e. gene copy number or the abundance of a given microbe). MaAsLin 3 can adjust for the feature DNA abundance as a continuous covariate for a given RNA feature in the model, allowing for robust differential expression analysis in microbial communities.
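To see why the DNA covariate matters, consider a minimal simulated sketch. The data and variable names here are made up for illustration and are not part of MaAsLin 3. A gene's DNA abundance differs between two groups, but its expression per gene copy does not. A plain linear model on log RNA still reports a group effect, while adding log DNA as a covariate removes most of it.

```r
set.seed(42)
n <- 200
group <- rep(c(0, 1), each = n / 2)           # hypothetical control/case label
log_dna <- 1.5 * group + rnorm(n, sd = 0.5)   # DNA abundance shifts with group
log_rna <- log_dna + rnorm(n, sd = 0.5)       # RNA tracks DNA; expression per copy is unchanged

fit_unadjusted <- lm(log_rna ~ group)             # ignores DNA abundance
fit_adjusted   <- lm(log_rna ~ group + log_dna)   # DNA abundance as a continuous covariate

coef_unadjusted <- unname(coef(fit_unadjusted)["group"])
coef_adjusted   <- unname(coef(fit_adjusted)["group"])
print(c(unadjusted = coef_unadjusted, adjusted = coef_adjusted))
```

This is only the core idea for a single feature; MaAsLin 3 itself also handles prevalence models, random effects, and multiple-testing correction.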

If you use the MTX model, please cite our manuscripts:

William A. Nickols, Jacob T. Nearing, Kelsey N. Thompson, Jiaxian Shen, Curtis Huttenhower. "MaAsLin 3: Refining and extending generalized multivariate linear models for meta-omic association discovery." (In progress).

Yancong Zhang, Kelsey N. Thompson, Curtis Huttenhower, Eric A. Franzosa. "Statistical approaches for differential expression analysis in metatranscriptomics." Bioinformatics, 37.Supplement_1: i34-i41 (2021).

1. Description
2. Installation
3. Running the MTX model

1. Description

In this tutorial, we will walk through most of the steps from the MTX model manuscript. We will compare three MaAsLin 3 runs on MTX data: unadjusted RNA abundances, RNA abundances normalized as a ratio to DNA copy number, and RNA abundances adjusted with DNA abundances as a covariate.

2. Installation

The latest version of MaAsLin 3 can be installed from BiocManager. For MaAsLin 3 to install, you will need R >= 4.3. If your version is older than that, please refer to section Installing R for the first time from the MaAsLin 3 tutorial to download the latest R.
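Before running the installation commands, a quick check of the running R version against the stated minimum can save a failed install. This is a small convenience sketch, not part of the MaAsLin 3 instructions; `r_ok` is a name chosen here.

```r
# Compare the running R version against the minimum stated above (4.3)
r_ok <- getRversion() >= "4.3.0"
if (r_ok) {
  message("R ", as.character(getRversion()), " meets the MaAsLin 3 requirement.")
} else {
  message("R ", as.character(getRversion()), " is older than 4.3; please update R first.")
}
```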

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biobakery/maaslin3")

The files for the MTX model must be downloaded from GitHub:

# Install devtools if not present
if (!require('devtools', character.only = TRUE)) {
  install.packages('devtools')
}

# Install the MaAsLin 3 tutorial files
library("devtools")
install_github("biobakery/maaslin3_tutorial_files")

for (lib in c('maaslin3', 'dplyr', 'ggplot2', 'knitr', 'kableExtra', 'maaslin3TutorialFiles')) {
    suppressPackageStartupMessages(require(lib, character.only = TRUE))
}

3. Running the MTX model

The MTX model in MaAsLin 3 can be run from the command line or as an R function. Both methods require the same arguments, have the same options, and use the same default settings. This tutorial will focus on running the MTX model in R, but the MTX model can be run from the command line in the same way as described in the MaAsLin 3 tutorial.

3.1 Input Files

3.1.1 Required inputs

The MTX model in MaAsLin 3 requires three input files:

A feature table of RNA abundances - we generated this with HUMAnN 2
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible features in this file include RNA abundance of genes, enzymes, or pathways.
- This can be a filepath to a tab-delimited file.
A feature table of DNA abundances, used as the feature-specific covariate
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible data in this file include DNA abundance of genes, enzymes, or pathways.
- This can be a filepath to a tab-delimited file.
Metadata file
- Formatted with metadata variables as columns and samples as rows.
- The transposition of this format is also okay.
- Possible metadata in this file include pH, disease status, or age.
- This can be a filepath to a tab-delimited file.

The data files can contain samples not included in the metadata file, and the metadata file can contain samples not included in the data files. In both cases, samples that do not appear in both files will be removed prior to model construction. Additionally, the sample order within the files does not need to match, as MaAsLin 3 will check this.
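MaAsLin 3 does this matching internally, but it can be useful to preview which samples would be dropped. The following is a base R sketch on made-up toy tables (the `toy_*` objects are hypothetical), assuming sample IDs are stored in the row names:

```r
# Toy tables keyed by sample ID in the row names (values are made up)
toy_features <- data.frame(pwyA = c(0.1, 0.3, 0.6),
                           row.names = c("s1", "s2", "s3"))
toy_metadata <- data.frame(age = c(40, 25, 31),
                           row.names = c("s3", "s1", "s4"))

# Keep only samples present in both tables, then put them in the same order
shared <- intersect(rownames(toy_features), rownames(toy_metadata))
toy_features_matched <- toy_features[shared, , drop = FALSE]
toy_metadata_matched <- toy_metadata[shared, , drop = FALSE]

print(shared)
print(setdiff(rownames(toy_features), shared))  # samples only in the feature table
print(setdiff(rownames(toy_metadata), shared))  # samples only in the metadata
```

The same `intersect` and `setdiff` calls can be applied to the row names of the real input tables once they are loaded.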

3.1.2 Examples of input files

Example input files can be found in the inst/extdata folder of the MaAsLin 3 source or the MaAsLin 3 tutorial files repository. The files provided were generated from the HMP2 data which can be downloaded from https://ibdmdb.org/.

HMP2_pwyRNA.tsv: a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway RNA abundances for all samples.
HMP2_pwyDNA.tsv: a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway DNA abundances for all samples.
HMP2_pwy.RNA_DNA_ratio.tsv: a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file that includes the pathway RNA abundances for all samples, normalized as a ratio to the underlying (matched) DNA abundances.
HMP2_metadata.tsv: a tab-delimited file with samples as rows and metadata as columns. It is a subset of the metadata file that just includes some of the fields.

### RNA abundances
input_data <- system.file(
  'extdata','HMP2_pwyRNA.tsv', package="maaslin3TutorialFiles")
df_input_data = read.table(file = input_data,
                           header = TRUE,
                           sep = "\t", 
                           row.names = 1,
                           stringsAsFactors = FALSE)
df_input_data[1:5, 1:5]

##            X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn   ANAEROFRUCAT_PWY_homolactic_fermentation
## CSM5FZ3T_P                                     0.03156540                                 0.00114574
## CSM5FZ46_P                                     0.00000000                                 0.00000000
## CSM5FZ4C_P                                     0.01669700                                 0.00000000
## CSM5FZ4G_P                                     0.01153230                                 0.00833768
## CSM5FZ4K_P                                     0.00899462                                 0.01277590
##            ANAGLYCOLYSIS_PWY_glycolysis_III     ARGININE_SYN4_PWY_L_ornithine_de_novo_biosyn    ARGSYN_PWY_L_arginine_biosyn_I
## CSM5FZ3T_P                        0.0219371                                      0.004636530                                 0
## CSM5FZ46_P                        0.0483395                                      0.004535580                                 0
## CSM5FZ4C_P                        0.0439542                                      0.009188330                                 0
## CSM5FZ4G_P                        0.0320195                                      0.004589710                                 0
## CSM5FZ4K_P                        0.0532906                                      0.000836979                                 0

# RNA/DNA ratio data 
input_dataratio <- system.file(
 'extdata','HMP2_pwy.RNA_DNA_ratio.tsv', package="maaslin3TutorialFiles")
df_input_dataratio = read.table(file = input_dataratio,
                                header = TRUE,
                                sep = "\t", 
                                row.names = 1,
                                stringsAsFactors = FALSE)
df_input_dataratio[1:5, 1:5]

##            X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn    ANAEROFRUCAT_PWY_homolactic_fermentation
## CSM5FZ3T_P                                      1.5572011                                   0.1598196
## CSM5FZ46_P                                      0.0000000                                   0.0000000
## CSM5FZ4C_P                                      0.8985142                                   0.0000000
## CSM5FZ4G_P                                      0.9351449                                   0.8675594
## CSM5FZ4K_P                                      0.6136112                                   1.3873472
##            ANAGLYCOLYSIS_PWY_glycolysis_III    ARGININE_SYN4_PWY_L_ornithine_de_novo_biosyn    ARGSYN_PWY_L_arginine_biosyn_I
## CSM5FZ3T_P                         1.587988                                      0.40400561                               NaN
## CSM5FZ46_P                         3.912071                                      0.37264242                                 0
## CSM5FZ4C_P                         3.628830                                      0.54072537                               NaN
## CSM5FZ4G_P                         2.226932                                      0.46351343                               NaN
## CSM5FZ4K_P                         4.015689                                      0.07905127                               NaN
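The NaN entries in the ratio table are what R produces for 0/0. Presumably they arise where both the RNA and the DNA abundance of a pathway are zero in a sample, though this is an assumption, since the helper script that built the ratio file is not shown here. A toy sketch of plain R division, with made-up values:

```r
toy_rna <- c(0.02, 0.00, 0.01, 0.00)
toy_dna <- c(0.01, 0.00, 0.00, 0.03)

# Element-wise ratio: a regular value, 0/0, x/0, and 0/x
toy_ratio <- toy_rna / toy_dna
print(toy_ratio)
print(is.nan(toy_ratio))   # flags only the 0/0 entry
```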

# Metadata from the HMP2
input_metadata <- system.file(
  'extdata','HMP2_metadata.tsv', package="maaslin3TutorialFiles")
df_input_metadata = read.table(file             = input_metadata,
                               header           = TRUE,
                               sep              = "\t", 
                               row.names        = 1,
                               stringsAsFactors = FALSE)
df_input_metadata$diagnosis <- 
  factor(df_input_metadata$diagnosis, levels = c('nonIBD', 'UC', 'CD'))
df_input_metadata$dysbiosis_state <- 
  factor(df_input_metadata$dysbiosis_state, levels = c('none', 'dysbiosis_UC', 'dysbiosis_CD'))
df_input_metadata$antibiotics <- 
  factor(df_input_metadata$antibiotics, levels = c('No', 'Yes'))

df_input_metadata[1:5, 1:5]

##            participant_id    site_name week_num    reads diagnosis
## CSM5FZ3N_P          C3001 Cedars-Sinai        0  9961743        CD
## CSM5FZ3R_P          C3001 Cedars-Sinai        2 16456391        CD
## CSM5FZ3T_P          C3002 Cedars-Sinai        0 10511448        CD
## CSM5FZ3V_P          C3001 Cedars-Sinai        6 17808965        CD
## CSM5FZ3X_P          C3002 Cedars-Sinai        2 13160893        CD

# DNA data 
input_dnadata <- system.file(
  'extdata','HMP2_pwyDNA.tsv', package="maaslin3TutorialFiles")
df_input_dnadata = read.table(file             = input_dnadata,
                               header           = TRUE,
                               sep              = "\t", 
                               row.names        = 1,
                               stringsAsFactors = FALSE)
df_input_dnadata[1:5, 1:5]

##          X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn    ANAEROFRUCAT_PWY_homolactic_fermentation
## CSM5FZ4M                                      0.0158099                                  0.00946321
## CSM5MCUO                                      0.0101701                                  0.00440300
## CSM5MCVL                                      0.0167429                                  0.00611800
## CSM5MCVN                                      0.0180019                                  0.00710437
## CSM5MCW6                                      0.0153125                                  0.00257452
##          ANAGLYCOLYSIS_PWY_glycolysis_III    ARG_POLYAMINE_SYN_SP_of_arginine__polyamine_biosyn
## CSM5FZ4M                       0.01486510                                           3.02835e-05
## CSM5MCUO                       0.01175060                                           2.80500e-03
## CSM5MCVL                       0.01127290                                           5.85403e-04
## CSM5MCVN                       0.01096790                                           3.79672e-05
## CSM5MCW6                       0.00577032                                           0.00000e+00
##          ARGININE_SYN4_PWY_L_ornithine_de_novo_biosyn
## CSM5FZ4M                                  0.011894500
## CSM5MCUO                                  0.003355600
## CSM5MCVL                                  0.010223600
## CSM5MCVN                                  0.008524720
## CSM5MCW6                                  0.000781918

3.2 Output files

Running the MTX model in MaAsLin 3 generates the same output files as a standard MaAsLin 3 run. See more details in the MaAsLin 3 manual. All outputs of this tutorial can be found in the MaAsLin 3 tutorial files repository.

3.3 Running models in R

Next, we are going to run the model in three different ways:

Run MaAsLin 3 on the raw RNA abundances from HUMAnN
Run MaAsLin 3 on the RNA/DNA ratios, which were created using a helper script from HUMAnN on paired MGX/MTX data
Run the MTX model in MaAsLin 3 to adjust the raw RNA abundance by the underlying DNA abundances

3.3.1 Raw RNA Abundances with MaAsLin 3

In this first example run on MTX data, we will run MaAsLin 3 on the RNA pathway abundances as characterized by HUMAnN 2 but not normalized by the matched DNA of these samples. The outputs can be viewed in the MaAsLin 3 tutorial files repository.

set.seed(1)
fit_maaslin_rna <- maaslin3(
    input_data = df_input_data,
    input_metadata = df_input_metadata,
    output = 'demo_output_rna',
    fixed_effects = c('diagnosis', 'dysbiosis_state', 'antibiotics', 'age'),
    random_effects = c('participant_id'),
    coef_plot_vars = c('dysbiosis_state dysbiosis_CD', 'diagnosis CD'),
    heatmap_vars = c('dysbiosis_state dysbiosis_UC', 'diagnosis UC', 'age',
                        'antibiotics Yes'))

3.3.2 RNA/DNA Ratios with MaAsLin 3

Next, we will run the same model changing the input to the RNA/DNA ratio data frame. Note that we now set normalization = 'NONE' so that MaAsLin 3 does not total-sum scale the ratios. The outputs can be viewed in the MaAsLin 3 tutorial files repository.

fit_maaslin_ratio <- maaslin3(
    input_data = df_input_dataratio,
    input_metadata = df_input_metadata,
    output = 'demo_output_ratio',
    fixed_effects = c('diagnosis', 'dysbiosis_state', 'antibiotics', 'age'),
    random_effects = c('participant_id'),
    normalization = 'NONE',
    coef_plot_vars = c('dysbiosis_state dysbiosis_CD', 'diagnosis CD'),
    heatmap_vars = c('dysbiosis_state dysbiosis_UC', 'diagnosis UC', 'age',
                    'antibiotics Yes'),
    warn_prevalence = FALSE)
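To see why total-sum scaling is switched off for the ratio input, here is a toy sketch of what the scaling does (made-up values, base R only). Each sample is divided by its row total, so every row sums to 1. That suits abundances, but for RNA/DNA ratios it would make a pathway's value depend on the other pathways in the same sample.

```r
# Toy RNA/DNA ratios: pwyA has the same ratio (2.0) in both samples
toy_ratios <- matrix(c(2.0, 0.5,
                       2.0, 3.5),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("s1", "s2"), c("pwyA", "pwyB")))

# Total-sum scaling: divide each row by its sum
toy_tss <- toy_ratios / rowSums(toy_ratios)

print(toy_ratios)
print(toy_tss)   # pwyA now differs between s1 and s2 although its ratio did not
```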

3.3.3 RNA abundance adjusted by DNA abundance with MaAsLin 3

Finally, we will run the MTX model in MaAsLin 3. We first pass the DNA and RNA abundance tables to the MaAsLin 3 function preprocess_dna_mtx, which total-sum scales both tables and applies the proper transformation to the DNA abundances. For each feature in each sample, this function:

Log 2 transforms the DNA abundance if the DNA abundance is > 0.
Sets the DNA abundance to log2([minimum non-zero relative abundance in the dataset] / 2) if the corresponding RNA abundance is non-zero but the DNA abundance is zero.
Sets the DNA abundance to NA if both are zero, which excludes the sample when fitting the model for the feature.
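The three rules above can be sketched in base R on a toy example. This is an illustration only, not the implementation inside preprocess_dna_mtx: the function name `transform_dna_sketch` is hypothetical, the tables are assumed to already hold relative abundances, and the minimum non-zero value is assumed to be taken over the DNA table.

```r
# Toy relative abundances: samples in rows, features in columns
toy_rna <- matrix(c(0.2, 0.8,
                    0.0, 0.0,
                    0.5, 0.5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("s1", "s2", "s3"), c("pwyA", "pwyB")))
toy_dna <- matrix(c(0.4, 0.6,
                    0.0, 1.0,
                    0.0, 1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = dimnames(toy_rna))

transform_dna_sketch <- function(dna, rna) {
  stopifnot(identical(dimnames(dna), dimnames(rna)))
  # Pseudo-value: log2 of half the smallest non-zero DNA abundance (assumption)
  pseudo <- log2(min(dna[dna > 0]) / 2)
  # Start from NA so that "both zero" entries stay excluded
  out <- matrix(NA_real_, nrow(dna), ncol(dna), dimnames = dimnames(dna))
  out[dna > 0] <- log2(dna[dna > 0])     # rule 1: log2 of positive DNA
  out[dna == 0 & rna > 0] <- pseudo      # rule 2: RNA present, DNA absent
  out                                    # rule 3: both zero remains NA
}

toy_out <- transform_dna_sketch(toy_dna, toy_rna)
print(toy_out)
```

In this toy table, pwyA covers all three cases: s1 is log2-transformed, s2 (both zero) becomes NA, and s3 (RNA present, DNA zero) receives the pseudo-value.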

There is also preprocess_taxa_mtx, which will perform a similar set of operations if per-taxon abundances but not per-gene abundances are available from the DNA data. See ?preprocess_dna_mtx and ?preprocess_taxa_mtx for more details.

Now, we will switch the input_data to the preprocessed RNA table preprocess_out$rna_table and include the preprocessed DNA as the feature-specific covariate with feature_specific_covariate = preprocess_out$dna_table. We also set the name of the covariate for model fitting with feature_specific_covariate_name = 'DNA', and we specify that we do not want to record the associations with the DNA in the outputs and plots by setting feature_specific_covariate_record = FALSE. The outputs can be viewed in the MaAsLin 3 tutorial files repository.

preprocess_out <- preprocess_dna_mtx(df_input_dnadata, df_input_data)

fit_maaslin_mtx_mgx <- maaslin3(
    input_data = preprocess_out$rna_table,
    input_metadata = df_input_metadata,
    output = 'demo_output_mtx_mgx',
    fixed_effects = c('diagnosis', 'dysbiosis_state', 'antibiotics', 'age'),
    random_effects = c('participant_id'),
    feature_specific_covariate = preprocess_out$dna_table,
    feature_specific_covariate_name = 'DNA',
    feature_specific_covariate_record = FALSE,
    coef_plot_vars = c('dysbiosis_state dysbiosis_CD', 'diagnosis CD'),
    heatmap_vars = c('dysbiosis_state dysbiosis_UC', 'diagnosis UC', 'age',
                    'antibiotics Yes'))
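What makes the covariate "feature-specific" is that each feature's model receives its own DNA column. A rough base R sketch of that idea follows, using simulated toy data and plain `lm`. It leaves out the random effects, prevalence models, and multiple-testing correction that MaAsLin 3 applies, and all object names are made up for illustration.

```r
set.seed(7)
n <- 60
group <- factor(rep(c("control", "case"), each = n / 2),
                levels = c("control", "case"))
features <- c("pwyA", "pwyB", "pwyC")

# Toy log2 DNA and RNA tables: samples in rows, features in columns
log_dna <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, features))
log_rna <- log_dna + matrix(rnorm(n * 3, sd = 0.5), nrow = n)

# One model per feature; the DNA covariate changes with the feature being fit
fits <- lapply(features, function(f) {
  dat <- data.frame(rna = log_rna[, f], DNA = log_dna[, f], group = group)
  coef(summary(lm(rna ~ group + DNA, data = dat)))["groupcase",
                                                   c("Estimate", "Std. Error")]
})
per_feature <- data.frame(feature = features, do.call(rbind, fits),
                          check.names = FALSE)
print(per_feature)
```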

3.4 Compare output

Finally, let's use some simple R scripts to compare the results from each model. First, we will look at the number of significant dysbiosis associations. To do this we will use the base R function subset to subset the results to just the ones from the dysbiosis comparisons and table to count the number of pathways that were associated with UC/CD dysbiosis in each model.

#compare the raw counts of features associated with dysbiosis
results_rna = subset(rbind(fit_maaslin_rna$fit_data_abundance$results, 
                           fit_maaslin_rna$fit_data_prevalence$results),
                     metadata == "dysbiosis_state" & qval_individual < 0.1 & is.na(error))
table(results_rna$value, results_rna$model)
#              linear logistic
# dysbiosis_CD     58       37
# dysbiosis_UC      8       11

results_rna_ratio = subset(rbind(fit_maaslin_ratio$fit_data_abundance$results, 
                                 fit_maaslin_ratio$fit_data_prevalence$results), 
                           metadata == "dysbiosis_state" & qval_individual < 0.1 & is.na(error))
table(results_rna_ratio$value, results_rna_ratio$model)
#              linear logistic
# dysbiosis_CD     17        3
# dysbiosis_UC      1        0

results_rna_dna = subset(rbind(fit_maaslin_mtx_mgx$fit_data_abundance$results, 
                               fit_maaslin_mtx_mgx$fit_data_prevalence$results),
                           metadata == "dysbiosis_state" & qval_individual < 0.1 & is.na(error))
table(results_rna_dna$value, results_rna_dna$model)
#              linear logistic
# dysbiosis_CD     42       10
# dysbiosis_UC      1        2

As you can tell, the number of significant associations depended strongly on the model: the raw RNA abundances gave the most results and the RNA/DNA ratios gave the fewest.
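For a quick side-by-side view, the counts reported by the table() calls above can be collected into one matrix and summed per model. The numbers below are copied from those outputs; the row labels are ours.

```r
# Significant dysbiosis associations (CD + UC) copied from the table() outputs above
counts <- rbind(
  "RNA"                    = c(linear = 58 + 8, logistic = 37 + 11),
  "RNA/DNA ratio"          = c(linear = 17 + 1, logistic = 3 + 0),
  "RNA with DNA covariate" = c(linear = 42 + 1, logistic = 10 + 2)
)
totals <- rowSums(counts)
print(counts)
print(totals)
```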

Next, let's look at which features overlapped between the models. We can do this with the intersect call in R:

# features called by the RNA/DNA ratios compared to the Raw RNA abundances 
intersect(results_rna_ratio$feature, results_rna$feature)
# 10 features overlapped

# features called by the Raw RNA abundances compared to the RNA abundances adjusted by the DNA abundances 
intersect(results_rna$feature, results_rna_dna$feature)
# 47 features overlapped

# features called by the RNA/DNA ratios compared to the RNA abundances adjusted by the DNA abundances 
intersect(results_rna_ratio$feature, results_rna_dna$feature)
# 11 features overlapped

From this, we can see that while the models call different total numbers of pathways, the pathways they call overlap substantially.
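To put a number on the overlap, one option is to report the shared count alongside the Jaccard index (shared features divided by the union). The helper below, `overlap_summary`, is a hypothetical convenience function written for this sketch and shown on toy feature names:

```r
overlap_summary <- function(a, b) {
  # Features can appear more than once (e.g. abundance and prevalence), so deduplicate
  a <- unique(a)
  b <- unique(b)
  shared <- intersect(a, b)
  c(n_a = length(a), n_b = length(b), n_shared = length(shared),
    jaccard = length(shared) / length(union(a, b)))
}

toy_a <- c("pwy1", "pwy2", "pwy3", "pwy3")
toy_b <- c("pwy2", "pwy3", "pwy4")
toy_overlap <- overlap_summary(toy_a, toy_b)
print(toy_overlap)
```

With the tutorial objects, the same helper could be called as `overlap_summary(results_rna$feature, results_rna_dna$feature)`.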

Finally, let's plot the top CD dysbiosis results from the MTX model across all the models. Here we first create one object that includes all the results, then subset it to the top 10 pathways in the DNA covariate results.

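# unique() keeps a pathway from being counted twice when it is significant in more than one model or contrast
top_pathways <- head(unique(results_rna_dna$feature[order(results_rna_dna$qval_individual)]), 10)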

# Specify model type
results_rna$model_type = "RNA model"
results_rna_ratio$model_type = "RNA/DNA ratio"
results_rna_dna$model_type = "RNA with DNA covariate"

results = rbind(results_rna, results_rna_dna, results_rna_ratio) # create one object from the results 
results = results[results$feature %in% top_pathways, ] # subset to just the top features in the MaAsLin 3 DNA covariate model
results$model <- ifelse(results$model == 'linear', 'Abundance', 'Prevalence') # Rename abundance/prevalence ('linear'/'logistic' are the values shown by table() above)

# Plot significant CD dysbiosis associations
ggplot(results[results$value == 'dysbiosis_CD',], aes(x = coef, y = feature, color = model_type)) + 
  geom_point(aes(shape = model), size = 3, alpha = 0.8) + 
  geom_errorbar(aes(xmin = coef - stderr, xmax = coef + stderr), width = 0.2) + 
  theme_bw() + 
  theme(axis.title = ggplot2::element_text(size = 16), 
        axis.text = ggplot2::element_text(size = 8)) + 
  xlab('Coefficient +/- SE') + 
  ylab('Pathway') + 
  labs(color = 'Model', shape = 'Association')

Here, you can tell that model choice can influence the effect size, particularly for the prevalence models.