General data usage tips - borenstein-lab/microbiome-metabolome-curated-data GitHub Wiki

Here are some general tips about how to use the data:

  • Most of the datasets are from case-control studies, i.e. they consist of samples from individuals with a studied disease and samples from "healthy" controls. We call these two (or sometimes more) groups "study groups", and they are reported in each metadata.tsv file. Users should account for these study groups in any analysis they perform.
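  • Some of the datasets are from longitudinal studies, meaning that they include multiple samples per subject. Samples from the same subject are not independent, so depending on the analysis, users may want to handle them differently (for example, by keeping a single sample per subject or by using methods that account for repeated measures).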
  • To relate metabolites across studies, users can use either the HMDB or the KEGG IDs given in the mtb.map tables.
    • Note that some HMDB/KEGG annotations are marked as High.Confidence.Annotation = FALSE, indicating that the metabolite's identification should be treated with caution. See Data processing details for more information about the High.Confidence.Annotation flag.
    • Additionally, metabolite values (or presence/absence) cannot be compared directly across datasets, due to differences between metabolomic platforms. See Limitations for further discussion of this topic.
  • To compare microbial taxa across studies, the genera tables can be used as is. When analyzing only shotgun datasets, the species tables can also be used as is. All genus and species names follow the GTDB taxonomy.
  • A simple example of a cross-study comparison using this data collection can be found in the following R notebook: meta-analysis_of_genus_metabolite_associations.Rmd. The rendered html of the R notebook can be viewed here.
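
Two of the tips above (restricting to high-confidence metabolite annotations shared across studies, and reducing a longitudinal dataset to one sample per subject) can be sketched in a few lines of Python. This is only an illustrative sketch, not the collection's own tooling: apart from High.Confidence.Annotation, the column names used here (HMDB, Sample, Subject, Study.Group) are assumptions and should be checked against the actual mtb.map and metadata.tsv files, and the tables below are made-up toy data. Keeping the first listed sample per subject is just one possible way to handle repeated samples.

```python
import csv
import io


def high_confidence_ids(map_tsv_text, id_column="HMDB"):
    """Return the IDs in `id_column` whose row is flagged as high-confidence.

    `id_column` is an assumed column name; High.Confidence.Annotation is the
    flag described in the tips above (assumed to be stored as TRUE/FALSE).
    """
    reader = csv.DictReader(io.StringIO(map_tsv_text), delimiter="\t")
    ids = set()
    for row in reader:
        value = (row.get(id_column) or "").strip()
        flag = (row.get("High.Confidence.Annotation") or "").strip().upper()
        # Skip missing IDs and annotations flagged as low-confidence.
        if value and value != "NA" and flag == "TRUE":
            ids.add(value)
    return ids


def shared_high_confidence_ids(map_tsv_texts, id_column="HMDB"):
    """Intersect the high-confidence IDs of several mtb.map tables."""
    id_sets = [high_confidence_ids(text, id_column) for text in map_tsv_texts]
    return set.intersection(*id_sets) if id_sets else set()


def first_sample_per_subject(metadata_tsv_text,
                             subject_column="Subject",
                             sample_column="Sample"):
    """Keep the first listed sample of each subject (assumed column names)."""
    reader = csv.DictReader(io.StringIO(metadata_tsv_text), delimiter="\t")
    seen = set()
    kept = []
    for row in reader:
        subject = row[subject_column]
        if subject not in seen:
            seen.add(subject)
            kept.append(row[sample_column])
    return kept


# Toy tables for illustration only; not taken from any real dataset.
map_a = "\n".join([
    "Compound\tHMDB\tKEGG\tHigh.Confidence.Annotation",
    "cmpd_1\tHMDB0000001\tC00001\tTRUE",
    "cmpd_2\tHMDB0000002\tNA\tFALSE",
    "cmpd_3\tNA\tC00003\tTRUE",
])
map_b = "\n".join([
    "Compound\tHMDB\tKEGG\tHigh.Confidence.Annotation",
    "x1\tHMDB0000001\tNA\tTRUE",
    "x2\tHMDB0000002\tC00002\tTRUE",
])
metadata = "\n".join([
    "Sample\tSubject\tStudy.Group",
    "s1\tsubjA\tCase",
    "s2\tsubjA\tCase",
    "s3\tsubjB\tControl",
])

shared = shared_high_confidence_ids([map_a, map_b])
kept_samples = first_sample_per_subject(metadata)
print(sorted(shared))
print(kept_samples)
```

Matching on shared IDs only identifies which metabolites appear in more than one study. As noted above, the measured values themselves are still not directly comparable across datasets.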