Matching Spectral Signatures - EBI-Metabolights/SAFERnmr GitHub Wiki

Matching Problems

A primary reason that annotation is difficult is that it is unclear at which level spectra, or the spectral signatures of metabolites, should be compared.

Standard matching approaches

Here are some 'levels' at which this could be done:

  • full-resolution spectrum, element-by-element
    • very detailed
    • zero information loss
    • too specific to be practical
    • alignment issue
  • picked peak (resonance) maxima
    • easily computed features
    • very low data size
    • efficient matching algorithms (Hungarian linear assignment) and scoring (Jaccard)
    • not very specific (features only distinguished by chemical shift axis position)
    • alignment issue
  • ML-generated classification and/or feature extraction
    • needs training data
    • can be difficult to interpret
  • fitted peak shapes
    • difficult optimization problem
    • laborious reference database construction
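
To make the picked-peak level concrete, here is a minimal sketch of Hungarian assignment followed by a Jaccard score. It is not part of SAFER; the function names, the 0.02 ppm tolerance, and the peak lists are assumptions made up for illustration. Note that thresholding after the assignment is a simplification and does not guarantee the largest possible number of in-tolerance pairs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_peaks(query_ppm, ref_ppm, tol=0.02):
    """Pair query and reference peak positions one-to-one within `tol` ppm."""
    query_ppm = np.asarray(query_ppm, dtype=float)
    ref_ppm = np.asarray(ref_ppm, dtype=float)
    # Cost matrix: absolute chemical shift difference for every query/reference pair.
    cost = np.abs(query_ppm[:, None] - ref_ppm[None, :])
    # Hungarian-style linear assignment minimising the total shift difference.
    rows, cols = linear_sum_assignment(cost)
    # Discard assigned pairs that are further apart than the tolerance.
    keep = cost[rows, cols] <= tol
    return list(zip(rows[keep].tolist(), cols[keep].tolist()))

def jaccard_score(n_matched, n_query, n_ref):
    """Matched peaks divided by the size of the union of both peak lists."""
    union = n_query + n_ref - n_matched
    return n_matched / union if union else 0.0

# Hypothetical peak lists (ppm); the values are illustrative only.
query = [2.53, 2.66, 3.20]
ref = [2.54, 2.67, 4.10]
pairs = match_peaks(query, ref)
score = jaccard_score(len(pairs), len(query), len(ref))
print(pairs, score)
```

The only information used is the chemical shift position, which is why this level is efficient but not very specific.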

How SAFER does it

SAFER attempts to combine the latter two approaches. It produces dataset-derived hypotheses about feature shapes and positions by employing the STOCSY and STORM ideas, essentially looking for spectral points which form a coherent, consistent shape of intensities across a subset of spectra. It assumes that strongly statistically associated spectral points which are close to one another are likely to be in the same feature, or are at least indistinguishable from a feature in the dataset. It creates a Spectral Association Tag (SAT) from these points and uses these putative features as the basis for matching. If only half of your feature is clearly represented in your spectra, it will use only that part and/or eliminate the spectra in which the feature wasn't clear (thanks to the STORM approach). If there is overlap across samples, but on average the feature shape is represented, the general shape will be used (thanks to the STOCSY aspect). Thus we use the most specific and complete evidence possible for each feature.
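
The STOCSY part of this idea can be sketched as follows. This is a deliberately simplified illustration, not SAFER's implementation: it omits the STORM-style selection of spectra entirely, and the function name `stocsy_window`, the 0.8 correlation threshold, and the synthetic data are all assumptions.

```python
import numpy as np

def stocsy_window(X, driver, r_min=0.8):
    """Contiguous run of points around `driver` that correlate with it across spectra."""
    X = np.asarray(X, dtype=float)
    # Pearson correlation of every spectral point with the driver point,
    # computed across spectra (rows = spectra, columns = spectral points).
    r = np.corrcoef(X, rowvar=False)[driver]
    start = stop = driver
    # Grow the window outwards while neighbouring points stay well correlated.
    while start > 0 and r[start - 1] >= r_min:
        start -= 1
    while stop < X.shape[1] - 1 and r[stop + 1] >= r_min:
        stop += 1
    return start, stop + 1, r

# Synthetic dataset: one Gaussian peak whose intensity varies between spectra.
rng = np.random.default_rng(0)
n_spectra, n_points = 30, 200
shape = np.exp(-0.5 * ((np.arange(n_points) - 100) / 4.0) ** 2)
conc = rng.uniform(0.5, 2.0, size=n_spectra)
X = conc[:, None] * shape[None, :] + rng.normal(0.0, 0.01, size=(n_spectra, n_points))

start, stop, r = stocsy_window(X, driver=100)
# Average intensity profile inside the window, standing in for a putative feature.
putative_feature = X[:, start:stop].mean(axis=0)
print(start, stop)
```

Points near the peak centre co-vary with the driver across spectra and are grouped together, while the noise-dominated flanks fall below the threshold and end the window.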

With these features in hand, SAFER attempts to locate their most likely positions in a collection of 'reference' spectra using Fast Fourier Transform (FFT)-based cross-correlation, then estimates the feature fit at those positions. This in effect extracts the same feature from the reference spectrum, should it exist. This builds a map between the two sets of spectra (dataset and reference/PCRS).
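
As a rough illustration of the location step, the sketch below slides a feature along a reference spectrum with `scipy.signal.correlate` in FFT mode and scores the best position with a Pearson correlation. The helper `locate_feature`, the synthetic doublet, and the choice of fit score are assumptions for this example, not SAFER's actual code.

```python
import numpy as np
from scipy.signal import correlate

def locate_feature(feature, reference):
    """Return the most likely start index of `feature` in `reference` and a fit score."""
    feature = np.asarray(feature, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Mean-centre the feature so a constant baseline does not drive the result.
    f = feature - feature.mean()
    # 'valid' mode: entry k is the overlap of the feature with reference[k:k+len(feature)].
    xc = correlate(reference, f, mode="valid", method="fft")
    lag = int(np.argmax(xc))
    # Simple fit estimate at that position: Pearson correlation of the two shapes.
    segment = reference[lag:lag + len(feature)]
    fit = float(np.corrcoef(feature, segment)[0, 1])
    return lag, fit

# Synthetic example: a doublet planted at index 137 of a noisy reference spectrum.
idx = np.arange(40)
feature = np.exp(-0.5 * ((idx - 12) / 2.0) ** 2) + np.exp(-0.5 * ((idx - 28) / 2.0) ** 2)
rng = np.random.default_rng(1)
reference = 0.05 + rng.normal(0.0, 0.01, size=500)
reference[137:177] += 3.0 * feature

lag, fit = locate_feature(feature, reference)
print(lag, round(fit, 3))
```

The FFT route computes all lags at once, which matters when many features are compared against many reference spectra.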

Matching is not complete at that point, however. For the purpose of annotation, what most users are interested in is seeing how the features extracted from the PCRSs (reference spectra) map onto their samples. Since the PCRS signal is likely to be the best representation of the true feature shape, we extract it (we call this a 'reference-extracted feature', or 'ref-feature'), then fit it directly to the dataset spectra from which the feature was derived. Importantly, because we extracted this feature from those data already, we know the general position to which the ref-feature should be fit. For each dataset spectrum, the ref-feature fit (backfit) is locally optimized and stored. Finally, we now have spectra matched to references using features (SMRF), which is a map between spectral points in the two datasets.
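
A backfit of this kind could look like the sketch below, which fits a scale and a constant offset by least squares at a few shifts around the known position and keeps the best one. The function `backfit`, the `max_shift` search radius, the offset term, and the synthetic data are assumptions for illustration; SAFER's own optimization may differ.

```python
import numpy as np

def backfit(ref_feature, spectrum, expected_start, max_shift=5):
    """Fit scale and offset of `ref_feature` to `spectrum` near `expected_start`."""
    ref_feature = np.asarray(ref_feature, dtype=float)
    spectrum = np.asarray(spectrum, dtype=float)
    m = len(ref_feature)
    # Design matrix: one column for the feature shape, one for a flat baseline.
    A = np.column_stack([ref_feature, np.ones(m)])
    best = None
    for shift in range(-max_shift, max_shift + 1):
        start = expected_start + shift
        if start < 0 or start + m > len(spectrum):
            continue  # window would run off the spectrum
        segment = spectrum[start:start + m]
        coef, *_ = np.linalg.lstsq(A, segment, rcond=None)
        rss = float(np.sum((A @ coef - segment) ** 2))
        if best is None or rss < best["rss"]:
            best = {"start": start, "scale": float(coef[0]),
                    "offset": float(coef[1]), "rss": rss}
    return best

# Synthetic dataset spectrum: the ref-feature at index 203, scaled by 2, on a baseline.
idx = np.arange(40)
ref_feature = np.exp(-0.5 * ((idx - 20) / 3.0) ** 2)
rng = np.random.default_rng(2)
spectrum = 0.1 + rng.normal(0.0, 0.01, size=400)
spectrum[203:243] += 2.0 * ref_feature

result = backfit(ref_feature, spectrum, expected_start=200)
print(result["start"], round(result["scale"], 2))
```

Because the approximate position is already known, only a small neighbourhood of shifts needs to be searched for each dataset spectrum.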

Note: this mapping is many-to-many, and should not be viewed as a declaration of which spectral points should in fact be compared. The idea is that, on average, the correct comparisons will ultimately outrank incorrect ones and have more 'read depth' so to speak.

A final note on SATs after matching to reference spectra: these relationships are quite like ESTs or RNA-seq reads being mapped to a reference sequence. Specifically, points in complex reference spectra can be grouped empirically using SATs, as shown for this example of citrate peaks:

This doesn't mean that actual NMR peaks are necessarily always captured by these, but, on average, they could lend guidance on how to decompose segments of the reference spectrum for peak-level dereplication.

Additionally, this approach can, in theory, be extended to any spectrum which shares the chemical shift axis of the dataset, such as fraction libraries or 2D spectra like TOCSYs or COSYs.