How FSE works - EBI-Metabolights/SAFERnmr GitHub Wiki

Features in 1D NMR

Features in a 1D NMR spectrum can be defined at several 'levels':

  • full-resolution spectrum, element-by-element
    • very detailed
    • zero information loss
    • too specific to be practical
    • alignment issue
  • picked peak (resonance) maxima
    • easily computed features
    • very low data size
    • efficient matching algorithms (Hungarian linear assignment) and scoring (Jaccard)
    • not very specific (features only distinguished by chemical shift axis position)
    • alignment issue
  • Bins/buckets
    • generally pretty good, but still have alignment issues
    • how many resonances to include?
    • what if different peaks appear in different spectra, but in the same place?
  • ML-generated classification and/or feature extraction
    • needs training data
    • can be difficult to interpret
  • fitted peak shapes
    • difficult optimization problem
    • laborious reference database construction
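
To make the picked-peak level concrete, here is a minimal sketch of matching two peak lists by chemical shift and scoring the match with a Jaccard index. It is an illustration only, not SAFER's code: the function names and the `tol` tolerance (in ppm) are hypothetical, and a simple greedy closest-pair matching stands in for Hungarian linear assignment so that the example needs no dependencies.

```python
def match_peaks(ref_ppm, query_ppm, tol=0.01):
    """Greedily pair peaks whose chemical shifts differ by at most `tol` ppm.

    Assumption: greedy closest-pair matching is used here in place of
    Hungarian linear assignment, purely to keep the sketch short.
    """
    # All candidate pairs within tolerance, closest first.
    candidates = sorted(
        (abs(r - q), i, j)
        for i, r in enumerate(ref_ppm)
        for j, q in enumerate(query_ppm)
        if abs(r - q) <= tol
    )
    used_ref, used_query, pairs = set(), set(), []
    for _, i, j in candidates:
        # Each peak may be matched at most once.
        if i not in used_ref and j not in used_query:
            used_ref.add(i)
            used_query.add(j)
            pairs.append((i, j))
    return pairs


def jaccard_score(ref_ppm, query_ppm, tol=0.01):
    """Matched peaks divided by the number of distinct peaks in both lists."""
    matched = len(match_peaks(ref_ppm, query_ppm, tol))
    union = len(ref_ppm) + len(query_ppm) - matched
    return matched / union if union else 0.0


ref = [1.33, 1.48, 3.05, 4.11]       # made-up peak positions (ppm)
query = [1.335, 3.20, 4.105]
shifted = [q + 0.02 for q in query]  # the same peaks, slightly misaligned

score = jaccard_score(ref, query)
score_shifted = jaccard_score(ref, shifted)
```

With these made-up peak lists, two of the five distinct peaks match, so `score` is 0.4. After a 0.02 ppm shift nothing falls within tolerance and `score_shifted` is zero, which is the alignment issue noted in the list above.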

What we want

Ideally, we want features that are specific enough to form the basis for flexible and robust matching. Simple features aren't very useful, because they carry only chemical shift information, which is not reliable on its own. Compound features are more specific and are rooted in structural information, but using them requires knowing the feature shapes in advance, and we don't have that information.

Some observations and reflections about how humans identify features

  • If a shape consistently appears across samples, it's a pretty good bet that the constituent points belong together in a feature.
  • It doesn't need to be in all the samples, but we should see it at least a few times.
  • Parts of a peak may give enough information. It doesn't need to be pretty, and we'll take what we can get.
  • Other peaks might disrupt a feature. That's okay: we can keep track of those gaps.
  • It's generally very difficult to be sure that a simple feature (singlet) is the same one across spectra.

Some solutions

  • STOCSY (statistical total correlation spectroscopy) gives us a continuous measure of how likely it is that spectral points belong together. We could simply peak-pick a STOCSY covariance profile, but this is disrupted by misalignment, overlap, etc.
  • STORM (subset optimization by reference matching) can refine a STOCSY by selecting the spectra which have the signature of interest. However, it attempts to optimize for a whole signature, and often requires prior knowledge of that signature to work well.
  • While correlation thresholding in STOCSY-based approaches can be highly subjective, there are a few things to keep in mind:
    1. Identifying a threshold which works for all peaks is often difficult, but locally this can be easier.
    2. Sometimes these approaches are frustrating because the entire shape of a peak is not captured. This is less of an issue when shape-based matching is considered. Being open to partial signatures gives flexibility to thresholding.
    3. For every spectral point used as the driver, there is always at least one correlation peak: the one containing the driver itself, whose maximum is 1 by definition. The boundaries of this correlation peak usually correspond to the average boundaries of a resonance.
    4. When a real peak is present and contains more than one resonance, it is highly likely that the second-highest correlation peak in the local STOCSY corresponds to another resonance of the same peak.
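
For reference, a STOCSY for a single driver point reduces to the correlation and covariance of that point with every other spectral point across the set of spectra. The following is a minimal NumPy sketch under the assumption that the spectra are stored as a matrix with one spectrum per row. The `stocsy` function and its argument names are illustrative and are not taken from the SAFER code base.

```python
import numpy as np


def stocsy(X, driver):
    """Correlation and covariance of every spectral point with a driver point.

    X      : (n_spectra, n_points) intensity matrix (assumed layout)
    driver : column index of the driver point
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # centre each spectral point
    d = Xc[:, driver]                    # centred driver intensities
    cov = Xc.T @ d / (X.shape[0] - 1)    # sample covariance with the driver
    sd = Xc.std(axis=0, ddof=1)
    denom = sd * sd[driver]
    # Points with zero variance get a correlation of 0 instead of NaN.
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    return corr, cov


# Synthetic example: one compound with resonances at points 50 and 62,
# and an unrelated compound at point 150 (all values made up).
points = np.arange(200)
gauss = lambda centre, width=3.0: np.exp(-0.5 * ((points - centre) / width) ** 2)
conc_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
conc_b = np.array([3.0, 1.0, 2.0, 2.0, 1.0, 3.0])   # uncorrelated with conc_a
X = np.outer(conc_a, gauss(50) + 0.5 * gauss(62)) + np.outer(conc_b, gauss(150))

corr, cov = stocsy(X, driver=50)
```

In this toy data the correlation is 1 at the driver, close to 1 at the second resonance of the same compound, and close to 0 at the unrelated peak. The covariance profile carries the relative intensities, which is why it is the covariance, and not the correlation, that describes a shape.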

When looking at a STOCSY locally (in a region wide enough to capture a few intra-peak resonances), if we can find the two highest correlation peaks, we get a pretty good guess at where two resonances in the same peak would be. The corresponding covariance at those points describes a hypothesized partial feature shape, or 'protofeature'. Using this protofeature as a more specific starting point than a peak maximum or single resonance, STORM (with some added restrictions) can optimize the shape of these two resonances as if they were a feature. It will also incorporate any other resonances which belong to the peak in question. This is computationally very light, because only a very small region is involved. It is therefore easy to apply indiscriminately to every spectral point in a matter of minutes, documenting at each point the dominant feature shape (if one exists) and the subset of spectra containing it.
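
The protofeature step described above can be sketched as follows. This is a simplified illustration, not SAFER's implementation: the `protofeature` function, the `half_width` window parameter, and the returned dictionary are hypothetical, no correlation threshold or minimum peak separation is applied, and the STORM refinement that follows is not shown.

```python
import numpy as np


def protofeature(X, driver, half_width=30):
    """Find the two highest correlation peaks in a local STOCSY.

    X is an (n_spectra, n_points) matrix (assumed layout). Returns None
    if no second correlation peak exists in the window.
    """
    X = np.asarray(X, dtype=float)
    lo = max(0, driver - half_width)
    hi = min(X.shape[1], driver + half_width + 1)

    # Local STOCSY: correlation and covariance of the window with the driver.
    Xc = X - X.mean(axis=0)
    d = Xc[:, driver]
    W = Xc[:, lo:hi]
    cov = W.T @ d / (X.shape[0] - 1)
    denom = W.std(axis=0, ddof=1) * d.std(ddof=1)
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

    # Interior local maxima of the correlation profile (window coordinates).
    peaks = np.flatnonzero((corr[1:-1] > corr[:-2]) & (corr[1:-1] >= corr[2:])) + 1

    # The driver's own correlation peak (value 1) is the first resonance;
    # the second is the highest remaining correlation peak.
    others = peaks[peaks != driver - lo]
    if others.size == 0:
        return None
    second = lo + int(others[np.argmax(corr[others])])
    return {
        "resonances": (int(driver), second),
        "window": (lo, hi),
        "corr": corr,   # local correlation profile
        "cov": cov,     # local covariance profile: the hypothesized shape
    }


# Synthetic example: a doublet-like peak (points 50 and 62) on top of an
# independently varying flat baseline (all values made up).
points = np.arange(200)
gauss = lambda centre, width=3.0: np.exp(-0.5 * ((points - centre) / width) ** 2)
conc = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
baseline = np.array([3.0, 1.0, 2.0, 2.0, 1.0, 3.0])
X = np.outer(conc, gauss(50) + 0.6 * gauss(62)) + 0.2 * baseline[:, None]

pf = protofeature(X, driver=50)
```

Here `pf["resonances"]` picks out points 50 and 62, and `pf["cov"]` holds the local covariance profile that would seed the refinement. Because the window is small, repeating this for every spectral point stays cheap.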

What results is a collection of hundreds or thousands of putative features which we term Spectral Association Tags (SATs) because of their likeness to Expressed Sequence Tags (ESTs) and RNA-seq reads in genomics.