MTX model 3 - biobakery/biobakery GitHub Wiki
MTX Model 3.0 Tutorial
MaAsLin 3 (Microbiome Multivariable Association with Linear Models) now directly incorporates the MTX model built for metatranscriptomics (MTX) differential gene expression analysis. It integrates feature-specific covariates to determine multivariable associations between metadata and microbial MTX features since RNA expression changes within a microbial community are highly affected by the underlying differences in metagenomic abundances (i.e. gene copy number or the abundance of a given microbe). MaAsLin 3 can adjust for the feature DNA abundance as a continuous covariate for a given RNA feature in the model, allowing for robust differential expression analysis in microbial communities.
If you use the MTX model, please cite our manuscripts:
William A. Nickols, Jacob T. Nearing, Kelsey N. Thompson, Jiaxian Shen, Curtis Huttenhower MaAsLin 3: Refining and extending generalized multivariate linear models for meta-omic association discovery. (In progress).
Yancong Zhang, Kelsey N. Thompson, Huttenhower C, Eric A. Franzosa. "Statistical approaches for differential expression analysis in metatranscriptomics." Bioinformatics, 37.Supplement_1: i34-i41 (2021).
Contents
1. Description
In this tutorial, we will walk through most of the steps from the MTX model manuscript. We will compare the output of unadjusted MaAsLin 3 runs on MTX data with ratio-adjusted MTX data using DNA copy number and MTX abundance data adjusted with DNA abundances.
2. Installation
The latest version of MaAsLin 3 can be installed from BiocManager. For MaAsLin 3 to install, you will need R >= 4.3. If your version is older than that, please refer to section Installing R for the first time from the MaAsLin 3 tutorial to download the latest R.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biobakery/maaslin3")
The files for the MTX model must be downloaded from GitHub:
# Install devtools if not present
if (!require('devtools', character.only = TRUE)) {
install.packages('devtools')
}
# Install MaAsLin 3
library("devtools")
install_github("biobakery/maaslin3_tutorial_files")
for (lib in c('maaslin3', 'dplyr', 'ggplot2', 'knitr', 'kableExtra', 'maaslin3TutorialFiles')) {
suppressPackageStartupMessages(require(lib, character.only = TRUE))
}
3. Running the MTX model
The MTX model in MaAsLin 3 can be run from the command line or as an R function. Both methods require the same arguments, have the same options, and use the same default settings. This tutorial will focus on running the MTX model in R, but the MTX model can be run from the command line in the same way as described in the MaAsLin 3 tutorial.
3.1 Input Files
3.1.1 Required inputs
The MTX model in MaAsLin 3 requires three input files:
- A feature table of RNA abundances - we generated this with HUMAnN 2
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible features in this file include RNA abundance of genes, enzymes, or pathways.
- This can be a filepath to a tab-delimited file.
- Covariate DNA data of features file
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible data in this file include DNA abundance of genes, enzymes, or pathways.
- This can be a filepath to a tab-delimited file.
- Metadata file
- Formatted with features as columns and samples as rows.
- The transposition of this format is also okay.
- Possible metadata in this file include pH, disease status, or age.
- This can be a filepath to a tab-delimited file.
The data file can contain samples not included in the metadata file (as is true for the reverse case of more samples in the metadata). For both cases, those samples that are not included in both files will be removed prior to model construction. Additionally, The sample order within the files does not need to match as MaAsLin 3 will double check this.
3.1.2 Examples of input files
Example input files can be found in the inst/extdata
folder of the MaAsLin 3 source or the MaAsLin 3 tutorial files repository. The files provided were generated from the HMP2 data which can be downloaded from https://ibdmdb.org/.
HMP2_pwyRNA.tsv
: a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway RNA abundances for all samples.HMP2_pwyDNA.tsv
: a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway DNA abundances for all samples.HMP2_pwy.RNA_DNA_ratio.tsv
: a tab-delimited file with pathways as columns and samples as rows. It is a subset of the pathway file so it just includes the pathway RNA abundances for all samples and it has been normalized as a ratio of the underlying (matched) DNA abundances.HMP2_metadata.tsv
: a tab-delimited file with samples as rows and metadata as columns. It is a subset of the metadata file that just includes some of the fields.
### RNA abundances
input_data <- system.file(
'extdata','HMP2_pwyRNA.tsv', package="maaslin3TutorialFiles")
df_input_data = read.table(file = input_data,
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_data[1:5, 1:5]
## X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn ANAEROFRUCAT_PWY_homolactic_fermentation
## CSM5FZ3T_P 0.03156540 0.00114574
## CSM5FZ46_P 0.00000000 0.00000000
## CSM5FZ4C_P 0.01669700 0.00000000
## CSM5FZ4G_P 0.01153230 0.00833768
## CSM5FZ4K_P 0.00899462 0.01277590
## ANAGLYCOLYSIS_PWY_glycolysis_III ARGININE_SYN4_PWY_L_ornithine_de_novo_biosyn ARGSYN_PWY_L_arginine_biosyn_I
## CSM5FZ3T_P 0.0219371 0.004636530 0
## CSM5FZ46_P 0.0483395 0.004535580 0
## CSM5FZ4C_P 0.0439542 0.009188330 0
## CSM5FZ4G_P 0.0320195 0.004589710 0
## CSM5FZ4K_P 0.0532906 0.000836979 0
# RNA/DNA ratio data
input_dataratio <- system.file(
'extdata','HMP2_pwy.RNA_DNA_ratio.tsv', package="maaslin3TutorialFiles")
df_input_dataratio = read.table(file = input_dataratio,
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_dataratio[1:5, 1:5]
## X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn ANAEROFRUCAT_PWY_homolactic_fermentation
## CSM5FZ3T_P 1.5572011 0.1598196
## CSM5FZ46_P 0.0000000 0.0000000
## CSM5FZ4C_P 0.8985142 0.0000000
## CSM5FZ4G_P 0.9351449 0.8675594
## CSM5FZ4K_P 0.6136112 1.3873472
## ANAGLYCOLYSIS_PWY_glycolysis_III ARGININE_SYN4_PWY_L_ornithine_de_novo_biosyn ARGSYN_PWY_L_arginine_biosyn_I
## CSM5FZ3T_P 1.587988 0.40400561 NaN
## CSM5FZ46_P 3.912071 0.37264242 0
## CSM5FZ4C_P 3.628830 0.54072537 NaN
## CSM5FZ4G_P 2.226932 0.46351343 NaN
## CSM5FZ4K_P 4.015689 0.07905127 NaN
# Metadata from the HMP2
input_metadata <-system.file(
'extdata','HMP2_metadata.tsv', package="maaslin3TutorialFiles")
df_input_metadata = read.table(file = input_metadata,
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_metadata$diagnosis <-
factor(df_input_metadata$diagnosis, levels = c('nonIBD', 'UC', 'CD'))
df_input_metadata$dysbiosis_state <-
factor(df_input_metadata$dysbiosis_state, levels = c('none', 'dysbiosis_UC', 'dysbiosis_CD'))
df_input_metadata$antibiotics <-
factor(df_input_metadata$antibiotics, levels = c('No', 'Yes'))
df_input_metadata[1:5, 1:5]
## participant_id site_name week_num reads diagnosis
## CSM5FZ3N_P C3001 Cedars-Sinai 0 9961743 CD
## CSM5FZ3R_P C3001 Cedars-Sinai 2 16456391 CD
## CSM5FZ3T_P C3002 Cedars-Sinai 0 10511448 CD
## CSM5FZ3V_P C3001 Cedars-Sinai 6 17808965 CD
## CSM5FZ3X_P C3002 Cedars-Sinai 2 13160893 CD
# DNA data
input_dnadata <- system.file(
'extdata','HMP2_pwyDNA.tsv', package="maaslin3TutorialFiles")
df_input_dnadata = read.table(file = input_dnadata,
header = TRUE,
sep = "\t",
row.names = 1,
stringsAsFactors = FALSE)
df_input_dnadata[1:5, 1:5]
## X1CMET2_PWY_N10_formyl_tetrahydrofolate_biosyn ANAEROFRUCAT_PWY_homolactic_fermentation
## CSM5FZ4M 0.0158099 0.00946321
## CSM5MCUO 0.0101701 0.00440300
## CSM5MCVL 0.0167429 0.00611800
## CSM5MCVN 0.0180019 0.00710437
## CSM5MCW6 0.0153125 0.00257452
## ANAGLYCOLYSIS_PWY_glycolysis_III ARG_POLYAMINE_SYN_SP_of_arginine__polyamine_biosyn
## CSM5FZ4M 0.01486510 3.02835e-05
## CSM5MCUO 0.01175060 2.80500e-03
## CSM5MCVL 0.01127290 5.85403e-04
## CSM5MCVN 0.01096790 3.79672e-05
## CSM5MCW6 0.00577032 0.00000e+00
## ARGININE_SYN4_PWY_L_ornithine_de_novo_biosyn
## CSM5FZ4M 0.011894500
## CSM5MCUO 0.003355600
## CSM5MCVL 0.010223600
## CSM5MCVN 0.008524720
## CSM5MCW6 0.000781918
3.2 Output files
Running the MTX model in MaAsLin 3 generates the same output files as usual. See more details in the MaAsLin 3 manual. All outputs of this tutorial can be found in the MaAsLin 3 tutorial files repository.
3.3 Running models in R
Next, we are going to run the model in three different ways:
- Run MaAsLin 3 on the raw RNA abundances from HUMAnN
- Run MaAsLin 3 on the RNA/DNA ratios, which were created using a helper script from HUMAnN on paired MGX/MTX data
- Run the MTX model in MaAsLin 3 to adjust the raw RNA abundance by the underlying DNA abundances
3.3.1 Raw RNA Abundances with MaAsLin 3
In this first example run on MTX data, we will run MaAsLin 3 on the RNA pathway abundances as characterized by HUMAnN 2 but not normalized by the matched DNA of these samples. The outputs can be viewed in the MaAsLin 3 tutorial files repository.
set.seed(1)
fit_maaslin_rna <- maaslin3(
input_data = df_input_data,
input_metadata = df_input_metadata,
output = 'demo_output_rna',
fixed_effects = c('diagnosis', 'dysbiosis_state', 'antibiotics', 'age'),
random_effects = c('participant_id'),
coef_plot_vars = c('dysbiosis_state dysbiosis_CD', 'diagnosis CD'),
heatmap_vars = c('dysbiosis_state dysbiosis_UC', 'diagnosis UC', 'age',
'antibiotics Yes'))
3.3.2 RNA/DNA Ratios with MaAsLin 3
Next, we will run the same model changing the input to the RNA/DNA ratio data frame. Note that we now set normalization = 'NONE'
so that MaAsLin 3 does not total-sum scale the ratios. The outputs can be viewed in the MaAsLin 3 tutorial files repository.
fit_maaslin_ratio <- maaslin3(
input_data = df_input_dataratio,
input_metadata = df_input_metadata,
output = 'demo_output_ratio',
fixed_effects = c('diagnosis', 'dysbiosis_state', 'antibiotics', 'age'),
random_effects = c('participant_id'),
normalization = 'NONE',
coef_plot_vars = c('dysbiosis_state dysbiosis_CD', 'diagnosis CD'),
heatmap_vars = c('dysbiosis_state dysbiosis_UC', 'diagnosis UC', 'age',
'antibiotics Yes'),
warn_prevalence = F)
3.3.3 RNA abundance adjusted by DNA abundance with MaAsLin 3
Finally, we will run the MTX model in MaAsLin 3. We first put the DNA and RNA abundance files into the MaAsLin 3 function preprocess_dna_mtx
to total sum scale the abundances of both and apply the proper transformation to the DNA abundances. For each sample in each feature, this function:
- Log 2 transforms the DNA abundance if the DNA abundance is >=0.
- Sets the DNA abundance to
log2([minimum non-zero relative abundance in the dataset] / 2)
if the corresponding RNA abundance is non-zero but the DNA abundance is zero. - Sets the DNA abundance to NA if both are zero, which excludes the sample when fitting the model for the feature.
There is also preprocess_taxa_mtx
, which will perform a similar set of operations if per-taxon abundances but not per-gene abundances are available from the DNA data. See ?preprocess_dna_mtx
and ?preprocess_taxa_mtx
for more details.
Now, we will switch the input_data
to the preprocessed RNA table preprocess_out$dna_table
and include the pre-processed DNA as the feature-specific covariate with feature_specific_covariate = preprocess_out$dna_table
. We also set the name of the covariate for model fitting with feature_specific_covariate_name = 'DNA'
and we specify that we do not want to record the associations with the DNA in the outputs and plots by setting feature_specific_covariate_record = FALSE
. The outputs can be viewed in the MaAsLin 3 tutorial files repository.
preprocess_out <- preprocess_dna_mtx(df_input_dnadata, df_input_data)
fit_maaslin_mtx_mgx <- maaslin3(
input_data = preprocess_out$rna_table,
input_metadata = df_input_metadata,
output = 'demo_output_mtx_mgx',
fixed_effects = c('diagnosis', 'dysbiosis_state', 'antibiotics', 'age'),
random_effects = c('participant_id'),
feature_specific_covariate = preprocess_out$dna_table,
feature_specific_covariate_name = 'DNA',
feature_specific_covariate_record = FALSE,
coef_plot_vars = c('dysbiosis_state dysbiosis_CD', 'diagnosis CD'),
heatmap_vars = c('dysbiosis_state dysbiosis_UC', 'diagnosis UC', 'age',
'antibiotics Yes'))
3.4 Compare output
Finally, let's use some simple R scripts to compare the results from each model. First, we will look at the number of significant dysbiosis associations. To do this we will use the base R function subset
to subset the results to just the ones from the dysbiosis comparisons and table
to count the number of pathways that were associated with UC/CD dysbiosis in each model.
#compare the raw counts of features associated with dysbiosis
results_rna = subset(rbind(fit_maaslin_rna$fit_data_abundance$results,
fit_maaslin_rna$fit_data_prevalence$results),
metadata == "dysbiosis_state" & qval_individual < 0.1 & is.na(error))
table(results_rna$value, results_rna$model)
# linear logistic
# dysbiosis_CD 58 37
# dysbiosis_UC 8 11
results_rna_ratio = subset(rbind(fit_maaslin_ratio$fit_data_abundance$results,
fit_maaslin_ratio$fit_data_prevalence$results),
metadata == "dysbiosis_state" & qval_individual < 0.1 & is.na(error))
table(results_rna_ratio$value, results_rna_ratio$model)
# linear logistic
# dysbiosis_CD 17 3
# dysbiosis_UC 1 0
results_rna_dna = subset(rbind(fit_maaslin_mtx_mgx$fit_data_abundance$results,
fit_maaslin_mtx_mgx$fit_data_prevalence$results),
metadata == "dysbiosis_state" & qval_individual < 0.1 & is.na(error))
table(results_rna_dna$value, results_rna_dna$model)
# linear logistic
# dysbiosis_CD 42 10
# dysbiosis_UC 1 2
As you can tell, this number was highly dependent on the model, with most results coming from the raw RNA abundances and the least results from the RNA/DNA ratio.
Next, let's look at which features overlapped between the models. We can do this with the intersect
call in R:
# features called by the RNA/DNA ratios compared to the Raw RNA abundances
intersect(results_rna_ratio$feature, results_rna$feature)
# 10 features overlapped
# features called by the Raw RNA abundances compared to the RNA abundances adjusted by the DNA abundances
intersect(results_rna$feature, results_rna_dna$feature)
# 47 features overlapped
# features called by the RNA/DNA ratios compared to the RNA abundances adjusted by the DNA abundances
intersect(results_rna_ratio$feature, results_rna_dna$feature)
# 11 features overlapped
From this, we can see that while the models are calling different total numbers of pathways, the ones that they are calling are significantly overlapping.
Finally, let's plot the top CD dysbiosis results in the MTX_model, across all the models. Here we first create one object that includes all the results, then subset it to the top 10 pathways in the DNA covariate results.
top_pathways <- results_rna_dna$feature[order(results_rna_dna$qval_individual)][1:10]
# Specify model type
results_rna$model_type = "RNA model"
results_rna_ratio$model_type = "RNA/DNA ratio"
results_rna_dna$model_type = "RNA with DNA covariate"
results = rbind(results_rna, results_rna_dna, results_rna_ratio) # create one object from the results
results = results[results$feature %in% top_pathways, ] # subset to just the top features in the MaAsLin 3 DNA covariate model
results$model <- ifelse(results$model == 'LM', 'Abundance', 'Prevalence') # Rename abundance/prevalence
# Plot significant CD dysbiosis associations
ggplot(results[results$value == 'dysbiosis_CD',], aes(x = coef, y = feature, color = model_type)) +
geom_point(aes(shape = model), size = 3, alpha = 0.8) +
geom_errorbar(aes(xmin = coef - stderr, xmax = coef + stderr), width = 0.2) +
theme_bw() +
theme(axis.title = ggplot2::element_text(size = 16),
axis.text = ggplot2::element_text(size = 8),
) +
xlab('Coefficient +/- SE') +
ylab('Pathway') +
labs(color = 'Model', shape = 'Association')
Here, you can tell that model choice can influence the effect size, particularly for the prevalence models.