macarron - biobakery/biobakery GitHub Wiki

MACARRoN Tutorial

MACARRoN (Metabolome Analysis and Combined Annotation Ranks to pRioritize Novel bioactives) is a computational workflow for the systematic analysis and prioritization of potentially bioactive small molecules from microbial community metabolomes. MACARRoN uses covariance between metabolic features to associate several thousand unknown metabolic features in untargeted metabolomics datasets with annotated metabolites (standards) and transfer biological or functional annotations to the unknowns. Further, for each feature, quantitative indicators of bioactivity such as abundance relative to a covarying annotated metabolite (AVA) and effect-size and q-value of association with a phenotype (or exposure or environmental condition) are evaluated. A combined rank from these indicators of bioactivity is used to prioritize metabolic features. MACARRoN is available as a Bioconductor package called Macarron and can be run via both R and command line.


Contents


1. Overview


2. Setup

2.1 Requirements

  • R (>= 4.2.0)
  • Required CRAN libraries: WGCNA, dynamicTreeCut, ff, plyr, stats, psych, data.table, xml2, RCurl, RJSONIO, logging, optparse
  • Required BioConductor libraries: SummarizedExperiment, BiocParallel, DelayedArray, Maaslin2

2.2 Installation with Bioconductor

Once you have the correct R and Bioconductor version, you can install Macarron with Bioconductor:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Macarron")

3. Prioritizing bioactive metabolites

3.1 Macarron input files

Macarron needs 4 comma-delimited i.e. CSV inputs: (1) metabolic features abundances, (2) metabolic features annotations, (3) sample metadata, and (4) chemical taxonomy. The formatting requirements described below are necessary for successful execution of Macarron.

  1. Metabolic features abundances table
    • Metabolic features must be in rows and samples in columns.
    • First column identifies features; can be indices. Macarron will assign feature names "F1" to "Fn" where n is the total number of features in the abundance and annotation files.
    • Missing data (NA cells) are allowed.
  2. Metabolic features annotation table
    • Must contain features in rows and annotations in columns.
    • First column must identify features.
    • Second column must contain either HMDB ID or PubChem Compound Identifier (CID) e.g. HMDB0304632 or 24749 for glucose. These will be available only for a small fraction of features in untargeted metabolomes resulting in several empty cells.
    • Third column must contain the name of the metabolite.
    • Fourth column must contain a continuous chemical property such as m/z (recommended) or RT or shift/ppm.
    • Other compound metadata can be listed column 5 onwards.
  3. Sample metadata table
    • Must contain samples in rows and metadata in columns.
    • First column must identify samples.
    • Second column must contain categorical metadata relevant to prioritization such as phenotypes, exposures or environments.
  4. Chemical taxonomy table
    • First column must contain the HMDB ID or PubChem CID. IDs must be consistent between annotation and taxonomy tables.
    • Second and third columns must contain chemical subclass and class of the respective metabolite.
    • If you do not have this file, please see instructions under Advanced Topics on how to generate this file using the annotation table and MACARRoN utility decorateID.

In this tutorial, we will be using a subset of the PRISM stool metabolomes of Crohn's disease (inflammatory bowel disorder; IBD) patients and healthy controls to identify metabolic features with potential bioactivity in IBD. The demo input files can be found in the inst/extdata folder of the Macarron source. The files can also be downloaded from the links given below:

If you are running this tutorial as part of a workshop, the input files are already loaded into VM at ~/Tutorials/macarron/input/. You can view the input CSV files with Excel or Libre Office Calc (in VM). Let's take a closer look at each of these files. First, to enter the correct folder and then to view all input files, enter:

cd ~/Tutorials/macarron/
ls input
  1. The metabolic features abundances table contains the median-normalized, mass-spectrometric intensities for 900 metabolic features in 102 IBD (case) and healthy (control) gut metabolomes (i.e. samples). Sample names are used as column headers and the first column identifies the 900 features. To view this file from the terminal, enter
column -t -s "," input/demo_abundances.csv | less -S

Output:

"Feature"  "P7122"   "P7147"   "P7150"   "P7153"   "P7184"    "P7238"   "P7406"   "P7408"   "P7421" 
"1"        1680      56        NA        13500     NA         83515     224115    14418     12559   
"2"        17416048  81563     2544079   2384208   22636      14071918  52818     5017602   7415892 
"3"        3076251   13064     570299    389248    3642       3043520   46207     1112648   1750064 
"4"        7612298   19975373  9692260   1503552   1290966    271435    32024     3663665   311265  
"5"        4079264   7050575   4364569   1238865   1113435    307882    24412     2434061   291352  
"6"        11063     12564     8421      6771      724        NA        NA        17537     12855   
"7"        1455      14866     11400     8878      9206       NA        6178      471       263     
"8"        473       1535      1145      2129      1538       NA        875       1111      1259    

Note: Use arrow keys to navigate and q to quit

  1. The metabolic feature annotations table has the same number of rows and column 1 as the abundances table. For a small subset of the features which have been identified through m/z comparisons with a library of chemical standards, the HMDB ID and metabolite name are available. The LC-MS chromatographic column (Method) used to separate the sample, retention time (RT) and mass-to-charge ratio (mz) are available for all features.
column -t -s "," input/demo_annotations.csv | less -S

Output:

"Feature"  "HMDB_ID"      "Metabolite"                 "mz"       "RT"   "Method"
1          "HMDB0000874"  "tauroursodeoxycholic acid"  498.2898   7.26   "C18-neg"
2          "HMDB0000251"  "taurine"                    124.0072   3.72   "HILIC-neg"
3          "HMDB0000251"  "taurine"                    126.0218   6.1    "HILIC-pos"
4          "HMDB0000921"  "cholestenone"               385.3465   7      "C8-pos"
5          "HMDB0000921"  "cholestenone"               407.3285   7      "C8-pos"
6          ""             ""                           261.0614   7      "HILIC-neg"
7          ""             ""                           651.5886   9.98   "C8-pos"
8          ""             ""                           225.1479   1.77   "HILIC-pos"

Let us see how many metabolic features have been found via the different LC-MS chromatographic columns.

cut -d"," -f6 input/demo_annotations.csv | grep -v "Method" | sort | uniq -c

which yields:

 215 "C18-neg"
 253 "C8-pos"
 184 "HILIC-neg"
 248 "HILIC-pos"
  1. The sample metadata file contains categorical metadata describing the diagnosis (i.e. phenotype) of the stool metabolome donor i.e. either "Control" (for healthy) or "IBD" (for Crohn's disease) in the second column per formatting requirements. Other relevant metadata such as age and antibiotics use are also provided.
column -t -s "," input/demo_metadata.csv | less -S

Output:

"sample"  "diagnosis"  "age"  "antibiotics"
"P7122"   "IBD"        38     "No"
"P7147"   "IBD"        50     "No"
"P7150"   "IBD"        41     "No"
"P7153"   "IBD"        51     "No"
"P7184"   "IBD"        68     "No"
"P7238"   "IBD"        67     "Yes"
"P7406"   "IBD"        59     "No"

Let us see how many IBD patients and heathy controls we have in this study.

cut -d"," -f2 input/demo_metadata.csv | grep -v "diagnosis" | sort | uniq -c 

which yields:

  34 "Control"
  68 "IBD"
  1. The chemical taxonomy file contains the sub class and the class for each of the unique HMDB IDs in the annotation file.
column -t -s "," input/demo_taxonomy.csv | less -S

Output:

"HMDB_ID"      "Sub_Class"                             "Class"
"HMDB0000874"  "Bile acid alcohols and derivatives"    "Steroids and steroid derivatives"
"HMDB0000251"  "Organosulfonic acids and derivatives"  "Organic sulfonic acids and derivatives"
"HMDB0000921"  "Cholestane steroids"                   "Steroids and steroid derivatives"
"HMDB0002121"  "Terpene lactones"                      "Prenol lipids"
"HMDB0013325"  "Fatty acid esters"                     "Fatty Acyls"
"HMDB0008036"  "Glycerophosphocholines"                "Glycerophospholipids"
"HMDB0000908"  "Cholestane steroids"                   "Steroids and steroid derivatives"

  • Which is the largest metabolic feature? Which is the smallest?
  • How many of the metabolic features have been annotated?
  • What is the most common class of metabolites?
  • How many sample donors were taking antibiotics? What's unique about them?

3.2 Running Macarron

We will now apply Macarron to the aforementioned PRISM metabolomes to prioritize IBD-associated bioactive metabolites using the demo input files. The order in which the inputs have to be provided and additional options can be seen with:

MacarronCMD.R --help

If not on the VM, please provide the full path to the Macarron executable.

Output:

Usage: MacarronCMD.R [Inputs]
 <feature_abundances.csv>
 <feature_annotations.csv>
 <sample_metadata.csv>
 <chemical_taxonomy.csv> 

In the command prompt, enter:

MacarronCMD.R input/demo_abundances.csv \
input/demo_annotations.csv \
input/demo_metadata.csv \
input/demo_taxonomy.csv

If not on the VM, please provide the full path to the Macarron executable and the demo input files.

  • When Macarron is running successfully, you will see:
INFO::Creating output folder
INFO::Writing function arguments to log file
INFO::Samples with both metabolic features and metadata: 102
INFO::Total number of metabolic features: 900
INFO::Summarized Experiment created
INFO::Metabolic features that passed prevalence threshold: 869
  • When running via the command line, Macarron is a systematic single function workflow which performs the following steps serially:
    • Prevalence filtering: Selects features that are prevalent in at least one of the phenotypes or conditions of interest.
    • Guilt-by-association: Assigns features to modules based on pairwise correlation in abundances across samples.
    • Evaluation of bioactivity: Evaluates quantitative indicators of bioactivity which include abundance relative to co-clustered standard metabolite (AVA), and effect-size and q-value of differential abundance.
    • Prioritization: Integrates ranks from the aforementioned indicators of bioactivity into a meta-rank and prioritizes features as potential bioactive metabolites based on the meta-rank.

Notes:

  • By default, Macarron will create a folder called Macarron_output in which all files generated by Macarron will be stored. The name of the output folder can be changed with the --output option.
  • The Macarron.log file in the output folder will record the parameter settings and the steps carried out during the analysis.
  • The workflow runs with default settings which are expected to work for most datasets. If you need to modify the default settings e.g. the prevalence threshold or the reference categorical variable for differential abundance analysis, please see Advanced Topics for instructions on the same.

3.3 Macarron output files

If you are running this tutorial on the VM, pre-run output is stored in ~/Tutorials/macarron/demo_output/.

3.3.1 Prioritized bioactives

Three main output files are provided to the user.

  1. prioritized_metabolites_all.csv: This file contains the complete list of potential bioactive metabolites associated with case phenotypes (IBD in this example) ordered by priority score where 1 represents highest priority.
column -t -s "," Macarron_output/prioritized_metabolites_all.csv | less -S

Output:

"Feature_index"  "HMDB_ID"      "Metabolite"                 "mz"       "Priority_score"  "Status"           "Module"  "Anchor"                     "Related_classes"                         "Covaries_with_standard"  "AVA"    "qvalue"              "effect_size"  "RT"   "Method"
"829"            ""             ""                           795.699    1                 "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       40.4382  3.47689090224283e-07  12.4852        7.49   "C8-pos"
"380"            ""             ""                           401.378    0.9988            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       2.9028   3.47689090224283e-07  10.3464        7.5    "C8-pos"
"257"            ""             ""                           773.7169   0.9977            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       15.7132  8.79853106959839e-07  11.5196        7.5    "C8-pos"
"197"            ""             ""                           441.3705   0.9965            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       1.2267   3.47689090224283e-07  9.351          7.5    "C8-pos"
"93"             ""             ""                           401.3779   0.9954            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       1.1977   4.32789652942172e-07  9.5287         8.58   "C8-pos"
"623"            ""             ""                           427.3549   0.9942            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       2.4931   8.7755193308939e-07   9.0981         7.5    "C8-pos"
"15"             ""             ""                           371.3583   0.9931            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       0.9775   5.0047257375422e-07   9.3572         7.5    "C8-pos"
"875"            "HMDB0000908"  "5alpha-Cholestanol"         371.3656   0.9919            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       1        3.09981456651594e-06  8.3768         1.69   "HILIC-pos"
"420"            ""             ""                           389.3688   0.9908            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       1.569    7.19582524474529e-06  7.3144         7.5    "C8-pos"
"710"            ""             ""                           851.7614   0.9896            "depleted in IBD"  9         ""                           ""                                        "0"                       0.5514   8.7755193308939e-07   11.2965        8.02   "C8-pos"
"394"            ""             ""                           404.3889   0.9885            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       11.1613  0.000122289579678283  7.7957         7.5    "C8-pos"
"806"            ""             ""                           243.2107   0.9873            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       0.5184   5.65927947106562e-07  9.0136         7.5    "C8-pos"
"330"            ""             ""                           425.3181   0.9862            "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       0.5718   8.79853106959839e-07  8.2735         7.49   "C8-pos"
"381"            ""             ""                           369.3517   0.985             "depleted in IBD"  16        "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       24.9208  1.35413294278085e-05  5.6839         7.5    "C8-pos"

Let us see what information this file has about the prioritized bioactive metabolites:

head -1 Macarron_output/prioritized_metabolites_all.csv | tr ',' '\n'

which yields:

"Feature_index"
"HMDB_ID"
"Metabolite"
"mz"
"Priority_score"
"Status"
"Module"
"Anchor"
"Related_classes"
"Covaries_with_standard"
"AVA"
"qvalue"
"effect_size"
"RT"
"Method"

The columns are:

  • Feature_index: lists the identifier of the metabolic feature found in column 1 of abundance and annotation tables. In this example, it was serial numbers.
  • HMDB_ID and Metabolite: Public database identifier and metabolite name from columns 2 and 3 of the annotation table.
  • mz: The continuous numerical chemical property from column 4 of the annotation table.
  • Priority_score: Priority score where 1 indicates most prioritized. Rank percentile is derived from the meta-rank of AVA, q-value and effect size.
  • Status: Direction of perturbation (differential abundance) in IBD (phenotype of interest) compared to Control (reference phenotype).
  • Module: ID of the covariance module a metabolic feature is a member of. Module = 0 indicates a singleton i.e., a metabolic feature that is not assigned to any module.
  • Anchor: Metabolic feature that has the highest abundance in any phenotype (IBD or Control) e.g. 5alpha-Cholestanol had the highest abundance among all metabolic features in module 16. In modules containing both annotated (standard) and unannotated features, anchor is the most abundant annotated feature. If a metabolic feature is the only annotated one in its module, it automatically becomes the anchor. In modules containing all unannotated features, the one with the highest abundance is chosen as the anchor. This column helps in functional characterization or hypothesis generation.
  • Related_classes: Chemical taxonomy of the annotated features that covary with a metabolic feature e.g. feature 829 covaries with metabolites that are steroids or their derivatives. This column also helps in functional characterization or hypothesis generation.
  • Covaries_with_standard: 1 (yes) and 0 (no). Column specifies if the metabolic feature covaries with at least one annotated (standard) metabolite.
  • AVA: Abundance versus anchor which is a ratio of the highest abundance (in any phenotype) of a metabolic feature and highest abundance of the covarying anchor. Naturally, the AVA of an anchor is 1.
  • qvalue and effect size: Estimated from the multivariate linear model (in this example: feature ~ diagnosis + age + antibiotics) using MaAsLin2.
  • Remaining columns from the annotation table are appended.
  1. prioritized_metabolites_characterizable.csv: This is a subset of prioritized_metabolites_all.csv and only contains metabolic features which covary with at least one annotated metabolite. The columns are the same as prioritized_metabolites_all.csv.

  2. highly_prioritized_per_module_in_IBD.csv: This file provides a "digested" view of the prioritization output and only shows the top-ranked prioritized metabolites in each module. Modules are listed according to the fraction of prioritized metabolites in them. Priority score cut-off (default: 0.9) and number of features to be displayed per module (default: 10) can be changed. This file will be generated for each case phenotype in the metadata.

column -t -s "," Macarron_output/highly_prioritized_per_module_in_IBD.csv| less -S

Output:

"Row"  "Feature_index"  "HMDB_ID"      "Metabolite"                 "mz"        "mz_vs_Anchor"  "Priority_score"  "Status"           "Module"  "Anchor"                     "Related_classes"                         "Covaries_with_standard"  "AVA"      "qvalue"                "effect_size"  "RT"     "Method"     "phenotype"
1      "875"            "HMDB0000908"  "5alpha-Cholestanol"         "371.3656"  "0"             "0.9919"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "1"        "3.09981456651594e-06"  "8.3768"       "1.69"   "HILIC-pos"  "IBD"
2      "829"            ""             ""                           "795.699"   "424.333"       "1"               "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "40.4382"  "3.47689090224283e-07"  "12.4852"      "7.49"   "C8-pos"     "IBD"
3      "380"            ""             ""                           "401.378"   "30.012"        "0.9988"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "2.9028"   "3.47689090224283e-07"  "10.3464"      "7.5"    "C8-pos"     "IBD"
4      "257"            ""             ""                           "773.7169"  "402.351"       "0.9977"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "15.7132"  "8.79853106959839e-07"  "11.5196"      "7.5"    "C8-pos"     "IBD"
5      "197"            ""             ""                           "441.3705"  "70.005"        "0.9965"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "1.2267"   "3.47689090224283e-07"  "9.351"        "7.5"    "C8-pos"     "IBD"
6      "93"             ""             ""                           "401.3779"  "30.012"        "0.9954"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "1.1977"   "4.32789652942172e-07"  "9.5287"       "8.58"   "C8-pos"     "IBD"
7      "623"            ""             ""                           "427.3549"  "55.989"        "0.9942"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "2.4931"   "8.7755193308939e-07"   "9.0981"       "7.5"    "C8-pos"     "IBD"
8      "15"             ""             ""                           "371.3583"  "-0.007"        "0.9931"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "0.9775"   "5.0047257375422e-07"   "9.3572"       "7.5"    "C8-pos"     "IBD"
9      "420"            ""             ""                           "389.3688"  "18.003"        "0.9908"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "1.569"    "7.19582524474529e-06"  "7.3144"       "7.5"    "C8-pos"     "IBD"
10     "394"            ""             ""                           "404.3889"  "33.023"        "0.9885"          "depleted in IBD"  "16"      "5alpha-Cholestanol"         "Steroids and steroid derivatives"        "1"                       "11.1613"  "0.000122289579678283"  "7.7957"       "7.5"    "C8-pos"     "IBD"
11     ""               ""             ""                           ""          ""              ""                ""                 ""        ""                           ""                                        ""                        ""         ""                      ""             ""       ""           "IBD"
12     "307"            "HMDB0002121"  "carnosol"                   "329.1756"  "0"             "0.9678"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "1"        "0.00104094741048206"   "5.5538"       "10.79"  "C18-neg"    "IBD"
13     "671"            ""             ""                           "329.1756"  "0"             "0.9735"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "15.5505"  "0.00372642569730537"   "7.0985"       "1.68"   "HILIC-neg"  "IBD"
14     "150"            ""             ""                           "345.1704"  "15.995"        "0.9712"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "1.2928"   "0.00202101100639083"   "6.46"         "4.23"   "HILIC-neg"  "IBD"
15     "481"            ""             ""                           "345.1704"  "15.995"        "0.9597"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "5.5624"   "0.00874198844748742"   "6.2214"       "1.67"   "HILIC-neg"  "IBD"
16     "345"            ""             ""                           "361.2018"  "32.026"        "0.954"           "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "0.8071"   "0.00415299665661071"   "5.4986"       "3.58"   "HILIC-neg"  "IBD"
17     "174"            ""             ""                           "329.1756"  "0"             "0.9413"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "31.9902"  "0.0309443946766554"    "6.6436"       "3.04"   "HILIC-neg"  "IBD"
18     "391"            ""             ""                           "345.2069"  "16.031"        "0.9367"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "0.8476"   "0.00779402690926328"   "4.4997"       "12.61"  "C18-neg"    "IBD"
19     "700"            ""             ""                           "353.1725"  "23.997"        "0.9275"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "0.2758"   "0.0074165894505943"    "4.8305"       "3.88"   "C8-pos"     "IBD"
20     "858"            ""             ""                           "566.2555"  "237.08"        "0.9241"          "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "0.1954"   "0.00637303794821807"   "4.9118"       "3.88"   "C8-pos"     "IBD"
21     "293"            ""             ""                           "343.1549"  "13.979"        "0.916"           "depleted in IBD"  "11"      "carnosol"                   "Prenol lipids"                           "1"                       "0.464"    "0.0139268607287863"    "4.1754"       "10.34"  "C18-neg"    "IBD"
22     ""               ""             ""                           ""          ""              ""                ""                 ""        ""           

Interpretation of results

Typically, based on the size of the dataset i.e. total number of features, it is recommended to consider a small percentage (5-10%) of top-ranked metabolites as candidates for further characterization. The functional or biological significance of a prioritized metabolite can be predicted using the columns in the file e.g. F829 is significantly depleted in IBD patients as compared to healthy individuals and co-clusters with 5-alpha-Cholestanol which means their abundances are correlated.

  • The annotated compounds in module 16 are cholestane steroids. What does that say about F829?
  • How will you find out if F829 is biochemically related to 5-alpha-Cholestanol?
  • Which columns in this file can be used to say that the metabolic features in a module could be chemically similar?

3.3.2 Intermediate and accessory files

Macarron provides the use intermediate files to help understand the results better. Enter ls Macarron_output to view all files in the output folder. These include:

  1. Macarron.log: A record of all chosen parameters and steps that were followed during execution.
  2. modules_measures_of_success.csv: Covariation (of metabolic features) likely indicates a similar source and trajectory in the ecosystem. To transfer biologically and functionally meaningful annotations to unannotated features, Macarron clusters features based on covarying abundances into modules. This is based on a previous study which found that covarying metabolites are 15x more likely to belong to the same chemical class. Modules are generated using a minimum module size (MMS) equal to cube root of the total number of prevalent metabolic features. This choice is expected to work for most real world datasets. Macarron evaluates 9 measures of success that collectively capture the "correctness" and chemical homoegeneity of the modules. Additionally, Macarron also provides information on the properties of modules generated using MMS + 5, MMS + 10, MMS -5 and MMS - 10 (e.g., the Minimum module size used for this demo dataset is 10), allowing users to choose an MMS (arg = min_module_size) that is most appropriate for their dataset. The measures of success are as follows:
  • Total modules: Total modules found using MMS.
  • Singletons: Total number of metabolic features that were not assigned to any module at MMS.
  • % Annotated modules: Percentage of modules that contained at least one annotated metabolic feature.
  • % Consistent assignments: Percentage of times the same metabolic feature was assigned to the same module e.g. if three metabolic features represent glucose, they must all be in the same module. This perentage must be high.
  • Max classes per module: The highest number of chemical classes observed in any module. This is evaluated using the chemical taxonomy of covarying annotated features.
  • 90p classes per module: 90th percentile of classes per module; captures the chemical homogeneity of the modules.
  • Max subclasses per module: The highest number of chemical subclasses observed in any module.
  • 90p subclasses per module: 90th percentile of subclasses per module; captures the chemical homogeneity of the modules.
  • % Features in HAM: MACARRoN first finds homogeneously annotated modules (HAMs). These are modules in which >75% annotated features have the same chemical class indicating that they are chemically homogeneous. It then calculates how many features the HAMs account for.

To view this file, enter:

column -t -s "," Macarron_output/modules_measures_of_success.csv | less -S

Output:

""   "Minimum module size (min_module_size)"  "Total modules"  "Singletons"  "% Annotated modules"  "% Consistent assignments"  "Max classes per module"  "90p classes per module"  "Max subclasses per module"  "90p subclasses per module"  "% Features in HAM"
"1"  5                                        71               8             7.04                   100                         1                         1                         1                            1                            10.36
"2"  10                                       30               8             16.67                  100                         1                         1                         1                            1                            13
"3"  15                                       21               8             23.81                  100                         1                         1                         1                            1                            17.49
"4"  20                                       15               28            33.33                  100                         1                         1                         1                            1                            20.94

The user can inspect this file to provide an MMS of their choice using the --min_module_size option in Macarron. In general, a good min_module_size should result in high "% Consistent assignments", small number of singletons, high "% Annotated modules", high "% Features in HAM", and low "Max classes per module", "90p classes per module", "Max subclasses per module", "90p subclasses per module".


  • Why must we care about functional homogeneity of the modules?
  • Why is "number of singletons" a measure of success for module evaluation?

  1. Maaslin2_results: A folder containing the MaAsLin2 differential abundance analysis results which include
  • all_results.tsv: coef, std.err, p-value and q-value of all metabolic features estimated from the linear model
  • significant_results.tsv: subset of all_results.tsv containing only the significant features
  • residual.rds: residuals from the linear model
  • maaslin2.log: log file

4. Advanced topics

4.1 Generating the taxonomy input file

The chemical taxonomy table (see demo_taxonomy.csv) is one of the four required input files. The chemical taxonomy information is used by Macarron to evaluate covariance modules as discussed above. It is necessary to have PubChem CID OR HMDB ID annotations in Column 2 of the annotation table (see demo_annotations.csv) to generate this file. The annotation table can then be used to generate the chemical taxonomy table using the decorateID utility in Macarron.

 # Import annotation table as a dataframe
prism_annotations = system.file("extdata", "demo_annotations.csv", package="Macarron")
annotations_df <- read.csv(file = prism_annotations, row.names = 1)

 # Create taxonomy file
met_taxonomy <- decorateID(annotations_df)
write.csv(met_taxonomy, file="prism_taxonomy.csv")

Note: Macarron currently supports only HMDB ID and PubChem CID as inputs for creating the taxonomy file. Users that do not have an HMDB ID or a PubChem CID available for the annotated metabolites in their dataset may use ClassyFire to obtain taxonomy information. Select Query type as "Chemical" or "IUPAC Name" and proceed with SMILES, InChI or IUPAC name as inputs.

4.2 Choosing the prevalence threshold

Macarron filters out metabolic features that are not prevalent in the dataset. The default prevalence threshold is 0.7 which means features that are observed in 70% samples of at least one phenotype or condition in the metadata variable of choice (default: second column of the metadata input file) will be considered for further analyses. For example, in this demo, metabolic features that were "seen" in 70% IBD or 70% Control samples were retained. If you wish to use another column from the metadata input file that has categorical variables identifying phenotypes or conditions or groups of samples, use the --metadata_variable option followed by the name of that column to set it e.g --metadata_variable "antibiotics".

The --min_prevalence option allows users to choose a prevalence threshold suitable for their dataset. For richer metabolomes such as the stool metabolome, a higher prevalence threshold such as 0.7 may be used. For metabolomes with lower chemical diversity and abundance i.e. soil or water, a lower threshold is recommended to ensure the retention of sufficient number of metabolic features for prioritization. It is also a good idea to examine the prevalence of the annotated metabolites in their dataset before selecting the prevalence threshold.

4.3 Changing default MaAsLin2 options

Macarron uses MaAsLin2 for determining the q-value of differential abundance in a phenotype of interest. For default execution, the phenotype of interest must be a category in the Column 2 of the metadata table e.g. IBD in diagnosis in the demo. This is also the column that is picked by the metadata_variable for identifying the main phenotypes/conditions in any dataset (see Macarron.log file). Further, in the default execution, all columns in the metadata table are considered as fixed effects (covariates) and the alphabetically first categorical variable in each covariate with two categories is considered as the reference. MaAsLin2 requires reference categories to be explicitly defined for all categorical metadata with more than two categories.

Defaults can be changed with the arguments fixed_effects, random_effects and reference. In the demo example, fixed effects and reference can be defined as follows:

In command line

MacarronCMD.R input/demo_abundances.csv \
input/demo_annotations.csv \
input/demo_metadata.csv \
input/demo_taxonomy.csv \
-m "diagnosis" \
-f "diagnosis,age,antibiotics" \
--reference "diagnosis,Control;antibiotics,No"

In R

# If you are running this tutorial as part of a workshop, set working directory to ~/Tutorials/macarron/.
setwd("~/Tutorials/macarron/") 

library(Macarron)
# Specify path to input files
prism_abundances <- system.file(
    'extdata','demo_abundances.csv', package="Macarron")
prism_annotations <-system.file(
    'extdata','demo_annotations.csv', package="Macarron")
prism_metadata <-system.file(
    'extdata','demo_metadata.csv', package="Macarron")
mets_taxonomy <-system.file(
    'extdata','demo_taxonomy.csv', package="Macarron")

prioritized_output = Macarron(
    input_abundances = prism_abundances,
    input_annotations = prism_annotations,
    input_metadata = prism_metadata,
    input_taxonomy = mets_taxonomy,
    metadata_variable = "diagnosis",
    fixed_effects = c("diagnosis","age","antibiotics"),
    reference = c("diagnosis,Control;antibiotics,No"))
⚠️ **GitHub.com Fallback** ⚠️