Key limitations - borenstein-lab/microbiome-metabolome-curated-data GitHub Wiki

There are several limitations to keep in mind when using this data resource for your own analysis.

Metabolite levels, as well as presence/absence, cannot be directly compared between studies because metabolomics platforms differ. Short-chain fatty acids, for example, are mostly detectable using gas chromatography–mass spectrometry and rarely detectable by liquid chromatography, due to their poor ionization efficiency. So even though they are known to be important microbial metabolites with a substantial impact on host health, they are undetected in approximately half of the datasets in this resource. Hence, the number of datasets in which a metabolite appears should not be used as an indication of its prevalence. Similarly, metabolite values and scales are affected by metabolite properties and by the metabolomic method, so claims such as "metabolite X is more abundant in dataset A than in dataset B" are ill-founded.
Metabolite annotations in untargeted metabolomics vary in their level of confidence. As metabolite annotations in each dataset were provided by the authors of the original publications, we encourage users to review the metabolomic methods reported in the original studies, or our summarized notes about each metabolomic dataset given in Supplementary Table S3. We specifically added a High.Confidence.Annotation flag to the mtb.map tables and set its value to FALSE where lower confidence was indicated in the original data, or in a few other cases described in the metabolomics processing notes section.
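As a minimal sketch of how this flag might be used, the snippet below filters a small mock metabolite map down to its high-confidence annotations. Only the High.Confidence.Annotation column name comes from the text above. The Compound column, the example rows, and the tab-separated layout with TRUE/FALSE strings are assumptions made for illustration, and the real mtb.map tables may be structured differently.

```python
import csv
import io

# Mock stand-in for an mtb.map table. The rows and the "Compound" column
# are hypothetical; only the High.Confidence.Annotation flag is taken
# from this page.
MOCK_MTB_MAP = (
    "Compound\tHigh.Confidence.Annotation\n"
    "feature_001\tTRUE\n"
    "feature_002\tFALSE\n"
    "feature_003\tTRUE\n"
)


def high_confidence_rows(table_text):
    """Return the rows whose High.Confidence.Annotation value is TRUE."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        row
        for row in reader
        if row["High.Confidence.Annotation"].strip().upper() == "TRUE"
    ]


kept = high_confidence_rows(MOCK_MTB_MAP)
kept_names = [row["Compound"] for row in kept]
print(kept_names)
```

Whether to drop lower-confidence annotations or keep them and report the flag alongside results is a judgment call that depends on the analysis.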
Most metabolomic methods result in semi-quantitative data, meaning that metabolite values do not represent absolute concentrations. This implies that different metabolites cannot be compared within the same sample (e.g. we cannot infer that metabolite A is more abundant than metabolite B in a given sample).
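One way to respect this limitation is to compare each metabolite only with itself across samples, for example by standardizing values per metabolite within a single dataset. The sketch below illustrates the idea with made-up numbers. The sample and metabolite names are hypothetical, and z-scoring is just one possible choice, not the processing applied in this resource.

```python
from statistics import mean, stdev

# Made-up semi-quantitative values: metabolite -> {sample: value}.
values = {
    "metabolite_A": {"s1": 1000.0, "s2": 2000.0, "s3": 3000.0},
    "metabolite_B": {"s1": 3.0, "s2": 1.0, "s3": 2.0},
}


def zscore_per_metabolite(table):
    """Standardize each metabolite across samples.

    Assumes every metabolite has at least two samples and is not
    constant (otherwise the standard deviation is undefined or zero).
    """
    scaled = {}
    for metabolite, by_sample in table.items():
        mu = mean(by_sample.values())
        sd = stdev(by_sample.values())  # sample standard deviation
        scaled[metabolite] = {
            sample: (value - mu) / sd for sample, value in by_sample.items()
        }
    return scaled


scaled = zscore_per_metabolite(values)
for metabolite, by_sample in scaled.items():
    print(metabolite, by_sample)
```

After scaling, it is reasonable to say that metabolite_A is relatively high in sample s3 compared with the other samples, but the raw values still do not support saying that metabolite_A is more abundant than metabolite_B in any sample.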
Microbiome profiles, like metabolome profiles, are affected by the metagenomic approach (16S amplicon sequencing vs. shotgun sequencing) and by the sequencing depth, both of which are expected to impact resolution and accuracy. In addition, although we chose specific processing tools (e.g. DADA2 and Kraken) for the metagenomic data, other processing pipelines would likely have resulted in slightly different taxonomic profiles.
Taxonomy assignments for 16S and shotgun data were based on slightly different versions of the Genome Taxonomy Database (GTDB), due to technical constraints. The taxonomic assignment of 16S amplicon sequence variants (ASVs) was performed with DADA2 using the "SBDI Sativa curated 16S GTDB database", which is based on GTDB version R06-RS202. The taxonomic assignment of shotgun reads was performed with Kraken using Struo2's GTDB database for Kraken, which is based on GTDB version R05-RS95. Slight differences in taxonomic assignments are therefore expected, and specific species of interest can be looked up on the GTDB website to verify that they were not affected by version updates.