How FSE works - EBI-Metabolights/SAFERnmr GitHub Wiki

Features in 1D NMR

Features in a 1D NMR spectrum can be defined at several 'levels':

Full-resolution spectrum, element-by-element
- very detailed
- zero information loss
- too specific to be practical
- sensitive to misalignment between spectra
Picked peak (resonance) maxima
- easily computed features
- very low data size
- efficient matching algorithms (Hungarian linear assignment) and scoring (Jaccard)
- not very specific (features are distinguished only by their position on the chemical shift axis)
- sensitive to misalignment between spectra
Bins/buckets
- generally work pretty well, but still have alignment issues
- how many resonances should each bin include?
- what if different peaks appear in different spectra, but in the same place?
ML-generated classification and/or feature extraction
- needs training data
- can be difficult to interpret
Fitted peak shapes
- difficult optimization problem
- laborious reference database construction
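To make the trade-off for picked-peak features concrete, here is a minimal sketch of Jaccard scoring between two peak lists. It is only an illustration: the function name, the tolerance, and the peak positions are all made up, and a greedy nearest-neighbour pairing stands in for Hungarian assignment to keep the example dependency-free.

```python
def jaccard_peak_score(peaks_a, peaks_b, tol=0.01):
    """Jaccard index between two peak lists (in ppm).

    Peaks are paired one-to-one when they lie within `tol` ppm of each other.
    NOTE: greedy pairing is used here as a simple stand-in for Hungarian
    linear assignment; it is an assumption of this sketch.
    """
    a = sorted(peaks_a)
    b = sorted(peaks_b)
    used = set()      # indices in b that are already paired
    matched = 0
    for pa in a:
        best_j, best_d = None, None
        for j, pb in enumerate(b):
            if j in used:
                continue
            d = abs(pa - pb)
            if d <= tol and (best_d is None or d < best_d):
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            matched += 1
    union = len(a) + len(b) - matched
    return matched / union if union else 0.0


# Hypothetical peak lists: two doublets in the reference, one extra peak in the query.
reference = [1.32, 1.34, 4.10, 4.12]
query_aligned = [1.32, 1.34, 4.10, 4.12, 7.20]
# The same query after a 0.05 ppm misalignment.
query_shifted = [p + 0.05 for p in query_aligned]

score_aligned = jaccard_peak_score(reference, query_aligned)
score_shifted = jaccard_peak_score(reference, query_shifted)
print(score_aligned, score_shifted)
```

With these toy values, the aligned query scores well and the shifted one scores zero, even though the peak pattern is identical. That is the alignment issue in miniature: position is the only information a picked peak carries.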

What we want

Ideally, we want features which are specific enough to form the basis for flexible and robust matching. Simple features can be useful, but they only carry chemical shift information, which is not very reliable on its own. Compound features are more specific, and are in fact rooted in structural information, but they require knowledge of feature shapes, and we don't know in advance what those shapes are.

Some observations and reflections about how (humans think) humans identify features

If a shape consistently appears across samples, it's a pretty good bet that the constituent points belong together in a feature.
It doesn't need to be in all the samples, but we should see it at least a few times to know it's real.
Parts of a peak may give enough information - it doesn't necessarily need to be pretty and we'll take what we can get.
Other peaks might disrupt a feature. That's okay - we can keep track of those gaps.
It's generally very difficult to be sure that a simple feature (singlet) is the same one across spectra. Usually, these must be annotated in relationship to other peaks in the molecule.

Some solutions

STOCSY gives us a continuous likelihood that spectral points belong together. We might just peak-pick a STOCSY's covariance profile, but this is disrupted by misalignment, overlap, etc., and leads us back to some of the issues with simple features.
STORM can refine a STOCSY by excluding the spectra which do not appear to have the signature of interest. However, it attempts to optimize for a whole signature, and often requires prior knowledge of a signature to work well.
We can correlation-threshold after STOCSY-based approaches. This is somewhat subjective, and there are a few things to keep in mind:
1. Identifying a threshold which works for all peaks is often difficult, but locally this can be easier.
2. Sometimes these approaches are frustrating because the entire shape of a peak is not captured. This is less of an issue when shape-based matching is considered. Being open to partial signatures allows for more flexibility when thresholding [compare thresholded and peak-picked STOCSY with shape-based comparison].
3. In the STOCSY driven from any spectral point, there is always at least one correlation peak, with a maximum of 1 at the driver point itself (by definition). The boundaries of the correlation peak usually correspond to the average boundaries of a resonance. When they are wider, that gives an indication that things are messy across spectra [show this].
4. When a real peak is present and contains > 1 resonance (i.e. it's a multiplet), it is highly likely that the second highest correlation peak in the local STOCSY will correspond to one of those resonances. First, the intra-peak resonances will experience more similar baseline effects than will distal peaks. Second, although the relative intensities of different peaks sometimes change, it is less likely that intra-peak intensities will change.
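The STOCSY and thresholding ideas above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SAFER implementation: the function name `stocsy`, the 0.9 threshold, and the synthetic spectra are all invented for the example.

```python
from math import sqrt


def stocsy(spectra, driver):
    """Correlation and covariance of every spectral point with a driver point.

    spectra: list of spectra (rows), each a list of intensities on a shared axis.
    driver:  index of the driver point.
    Returns (corr, cov), one value per spectral point.
    """
    n = len(spectra)
    d = [row[driver] for row in spectra]
    d_mean = sum(d) / n
    d_dev = [v - d_mean for v in d]
    d_ss = sum(v * v for v in d_dev)
    corr, cov = [], []
    for j in range(len(spectra[0])):
        col = [row[j] for row in spectra]
        c_mean = sum(col) / n
        c_dev = [v - c_mean for v in col]
        c_ss = sum(v * v for v in c_dev)
        cross = sum(x * y for x, y in zip(d_dev, c_dev))
        cov.append(cross / (n - 1))
        denom = sqrt(d_ss * c_ss)
        # Flat points have no defined correlation; report 0.0 for them.
        corr.append(cross / denom if denom > 0 else 0.0)
    return corr, cov


# Synthetic data (hypothetical): a doublet at points 2 and 4 that scales with
# compound A, and an unrelated singlet at point 6 that scales with compound B.
conc_a = [1.0, 2.0, 3.0, 4.0]
conc_b = [2.0, 1.0, 1.0, 2.0]
shape_a = [0.0, 0.0, 1.0, 0.2, 1.0, 0.0, 0.0, 0.0]
shape_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
spectra = [[sa * a + sb * b for sa, sb in zip(shape_a, shape_b)]
           for a, b in zip(conc_a, conc_b)]

corr, cov = stocsy(spectra, driver=2)
mask = [j for j, r in enumerate(corr) if r >= 0.9]  # threshold is an assumption
print(mask)
```

In this idealized case, the correlation is 1 at the driver and across the whole doublet, and the unrelated singlet drops out. Real spectra add noise, baseline, and overlap, which is why a single global threshold is hard to choose.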

When looking at a STOCSY locally (in a region wide enough to capture a few intra-peak resonances), if we can find the two highest correlation peaks, we have a pretty good guess at where two resonances in the same peak would be (i.e. a feature mask). Meanwhile, the corresponding covariance at those points describes a hypothesized partial feature shape, or 'protofeature'. The protofeature is a more specific starting place than a peak maximum or single resonance, so STORM (with some added restrictions) can optimize the shape of these two resonances as if they were a feature. With these as a seed, the STORM procedure will incorporate any other resonances which belong to the peak in question, provided they have good enough statistical support. This is computationally very light, because only a very small region is involved. It can be applied indiscriminately to every spectral point in a matter of minutes, documenting at each point the dominant feature shape (if one exists) and the subset of spectra containing it.

(Note: there is a lot of room for improvement here, as we extract only one feature shape from each point. We could go back through the non-optimal subset to pull out additional features with no issues. One of the key things about SAFER is that features are simply addresses and shapes, not actual spectral points, so many different versions can exist in the same place.)
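As a rough sketch of the seeding step only (the STORM refinement is not shown), the code below takes a local window around a driver point, finds the two highest peaks in the correlation profile, and reads off the covariance at those points as the protofeature. The function name `protofeature`, the `half_window` parameter, and the synthetic data are assumptions for illustration, not taken from SAFER.

```python
from math import sqrt


def _centered_sums(x, y):
    """Return (sum of cross products, sum of squares of x, sum of squares of y)
    after mean-centering both sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy, sxx, syy


def protofeature(spectra, driver, half_window):
    """Seed a feature from the local STOCSY around `driver`.

    Returns (points, shape): the positions of the two highest correlation
    peaks in the window (the feature mask) and the covariance with the
    driver at those positions (the hypothesized partial shape).
    """
    n = len(spectra)
    lo = max(0, driver - half_window)
    hi = min(len(spectra[0]), driver + half_window + 1)
    d = [row[driver] for row in spectra]
    corr, cov = [], []
    for j in range(lo, hi):
        col = [row[j] for row in spectra]
        sxy, sxx, syy = _centered_sums(d, col)
        corr.append(sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0)
        cov.append(sxy / (n - 1))
    # Strict local maxima of the correlation profile; a window edge only needs
    # to exceed its single neighbour.
    last = len(corr) - 1
    maxima = [i for i in range(len(corr))
              if (i == 0 or corr[i] > corr[i - 1])
              and (i == last or corr[i] > corr[i + 1])]
    top_two = sorted(sorted(maxima, key=lambda i: corr[i], reverse=True)[:2])
    points = [lo + i for i in top_two]
    shape = [cov[i] for i in top_two]
    return points, shape


# Synthetic doublet (hypothetical) on top of a background that varies
# independently of the compound concentration.
conc = [1.0, 2.0, 3.0, 4.0]
background = [2.0, 1.0, 1.0, 2.0]
line_shape = [0.0, 0.5, 1.0, 0.5, 0.1, 0.5, 0.9, 0.5, 0.0]
spectra = [[s * c + 0.3 * b for s in line_shape]
           for c, b in zip(conc, background)]

points, shape = protofeature(spectra, driver=2, half_window=5)
print(points, shape)
```

Here the driver itself is the highest correlation peak, and the second highest is the other line of the doublet, so the mask covers both resonances. In a full pipeline, this two-point seed would be handed to a STORM-style optimization rather than used directly.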

What results is a collection of hundreds or thousands of putative features which we term Spectral Association Tags (SATs) because of their likeness to Expressed Sequence Tags (ESTs) and RNA-seq reads in genomics.
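One way to picture a feature as an address plus a shape, rather than a set of spectral points, is a small record type. The sketch below is purely hypothetical: the class name, the field names, and the use of `None` to mark a gap are illustrative choices, not SAFER's actual data structures.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class SpectralTag:
    """Hypothetical record for one putative feature."""
    start: int                           # index of the first point on the shared axis
    shape: Tuple[Optional[float], ...]   # intensity profile; None marks a gap
    spectra: FrozenSet[int]              # indices of the spectra that contain it

    @property
    def stop(self) -> int:
        """Index one past the last point covered by the shape."""
        return self.start + len(self.shape)

    def overlaps(self, other: "SpectralTag") -> bool:
        """True when both tags cover at least one common axis position."""
        return self.start < other.stop and other.start < self.stop


# Two different features can occupy the same region of the axis, because
# neither one owns the underlying spectral points.
doublet = SpectralTag(start=120, shape=(0.5, 1.0, None, 0.9, 0.5),
                      spectra=frozenset({0, 2, 5}))
singlet = SpectralTag(start=122, shape=(0.3, 1.0, 0.3),
                      spectra=frozenset({1, 2}))

shared_spectra = doublet.spectra & singlet.spectra
print(doublet.overlaps(singlet), sorted(shared_spectra))
```

Keeping the subset of spectra on each record means a later step can revisit only the spectra in which a given shape was actually observed.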

We then do a few filtering and cleanup steps on these to ensure we get the types of features we're looking for. Then, we fine-tune the locations of the features for each spectrum in the dataset.