Parameters File description - EBI-Metabolights/SAFERnmr GitHub Wiki

What the parameter file does

The parameter file stores your parameter settings and passes them to the pipeline in a structured format (YAML) for use by all of its functions. It also lets us index all the SAFER runs we've done by their parameter sets. Parameter names should match the SAFER parameter names exactly. The inputs are interpreted and checked by the valid_pars() function, which in some cases modifies a parameter if needed for formatting. If the function detects any parameter values that are out of the ordinary, it prints warnings to the log/terminal, and it will attempt to assign basic parameters that are not present, where possible.

You'll need the following fields in a .yaml file:

Note: yaml format has some requirements:

alpha characters followed by colon indicate a heading
each space following a newline indicates nesting (or a sub-heading)
alpha characters followed by a colon and then a value (non-whitespace characters) indicate a name-value pair
the file should end in a newline (with no trailing spaces)
two or more elements can be passed to a single param and interpreted as a vector, see e.g. tina: bounds
you can check if a file is yaml-formatted by reading it into R with yaml::yaml.load_file(). I find it useful to have an LLM like ChatGPT check them for me as well.
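The formatting notes above can also be checked mechanically before a run. The following is a minimal Python sketch, not SAFER's valid_pars() and not a YAML parser: it only tests for a trailing newline, tab indentation (which YAML does not allow), and the presence of the top-level sections described on this page. The function name `preflight` and the `EXPECTED_SECTIONS` list are illustrative assumptions.

```python
import re

# Top-level sections described on this page (assumed to be the full set).
EXPECTED_SECTIONS = [
    "dirs", "study", "files", "corrpockets", "storm", "tina",
    "matching", "par", "opts", "galaxy", "debug",
]


def preflight(text):
    """Return a list of human-readable problems found in a params file's text."""
    problems = []
    if not text.endswith("\n"):
        problems.append("file does not end in a newline")
    for n, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {n}: tab used for indentation")
    # Unindented 'name :' lines are treated as top-level headings.
    found = re.findall(r"^([A-Za-z][\w.]*)\s*:", text, flags=re.MULTILINE)
    for section in EXPECTED_SECTIONS:
        if section not in found:
            problems.append(f"missing top-level section: {section}")
    return problems
```

Calling `preflight(open("params.yaml").read())` (with a hypothetical file name) returns an empty list when none of these basic problems are found. A file that passes can still fail YAML parsing or valid_pars() for other reasons.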

Parameter descriptions:

dirs

temp

/Users/mjudge/Documents/ftp_ebi/pipeline_runs_new

description

Local temp directory, where the timestamped run directory will be created. If the last part of this path doesn't exist, it will be created.

description

Top-level filepaths for the pipeline.

study

id

MTBLS1

description

This is the study ID (e.g. Metabolights study ID).

spectrometer.frequency

700

description

What spectrometer frequency were the spectra acquired at in MHz?

description

These are details about the study being analyzed - necessary for indexing results, but not checked if running locally.

files

spectral.matrix

/nfs/production/odonovan/nmr_staging/spectral_matrices/MTBLS1_1r_noesypr1d_spectralMatrix.RDS

description

Filepath to the .RDS spectral matrix file being used. On Galaxy, this can be uploaded or selected. See [specifications](https://github.com/EBI-Metabolights/SAFER-NMR/wiki/Data-processing)

lib.data

/nfs/production/odonovan/nmr_staging/gissmo_ref/data.list_700MHz.RDS

description

Library reference spectra files. See [specifications](https://github.com/EBI-Metabolights/SAFER-NMR/wiki/Reference-Library-Data)

description

File information for pipeline inputs

corrpockets

half.window

0.06

description

1/2 initial tolerance for resonance pairs; this is in PPM

noise.percentile

0.99

description

type of noise cutoff; higher is looser (more noise).

only.region.between

- -1.0
- 11.0

description

Only run corrpocketpairs on this region (e.g. c(-1, 11)). Not necessary, but only protofeatures in this region will be STORM'd.
The bounds can be written on one line as:
!expr c(-1, 11)
or on two in this form:
- -1
- 11

rcutoff

0.5

description

Pearson correlation coefficient (r) cutoff for considering correlation peaks in protofeature extraction.
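To make the cutoff concrete, here is a minimal Python sketch of a Pearson r calculation followed by a threshold test. It assumes the two inputs are intensities at two spectral points measured across the same set of spectra, and that values equal to the cutoff are kept; both are assumptions for illustration, and the function names are hypothetical rather than SAFER's.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def passes_rcutoff(x, y, rcutoff=0.5):
    """True if the correlation between x and y reaches the cutoff."""
    return pearson_r(x, y) >= rcutoff
```

With a permissive value such as 0.5, weakly co-varying points still pass at this stage and are left for the later, stricter steps to filter.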

description

Protofeature generation parameters. These can be set relatively permissively. See [algorithm details](https://github.com/EBI-Metabolights/SAFER-NMR/wiki/How-FSE-works)

storm

correlation.r.cutoff

0.7

description

correlation cutoff (for both subset and profile refinement steps in LOG-STORM)

q

0.01

description

STORM adjusted pvalue

b

1.5

description

STORM b parameter - increasing opens up the search to a wider area. Units: peak widths

number.of.plots

250

description

Plots are typically not generated - deprecated parameter!

description

Parameters for LOG-STORM. See details about the algorithm [here]()

tina

bounds

- -1
- 11

description

Features outside of these boundaries (ppm units) will be discarded before matching.
The bounds can be written on one line as:
!expr c(-1, 11)
or on two in this form:
- -1
- 11

min.subset

5

description

Minimum number of spectra in which a feature must be found to be included. Usually set to 4-5, must be > 3 (or else correlations are not really meaningful)
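The bounds and min.subset filters can be pictured with a short Python sketch. It assumes each feature is a dictionary with hypothetical keys `ppm_min`, `ppm_max` and `spectra` (the spectra in which it was found), and that a feature is discarded if any part of it lies outside the bounds. SAFER's internal data structures and exact rule may differ.

```python
def filter_features(features, bounds=(-1.0, 11.0), min_subset=5):
    """Keep features that lie within bounds and occur in enough spectra."""
    lo, hi = min(bounds), max(bounds)
    kept = []
    for f in features:
        inside = lo <= f["ppm_min"] and f["ppm_max"] <= hi
        enough = len(f["spectra"]) >= min_subset
        if inside and enough:
            kept.append(f)
    return kept
```

For example, a feature spanning 10.9 to 11.2 ppm would be dropped under the default bounds, as would a feature found in only three spectra.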

prom.ratio

0.3

description

Prominence ratio - in order to detect strong local baseline effects in feature shapes, at least one peak must have prominence > prom.ratio * range(feature.intensities).
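The prominence test can be sketched in Python as follows. This uses the common definition of prominence (peak height above the higher of the two lowest points between the peak and the nearest higher point on each side) and reads range(feature.intensities) as max minus min; both are assumptions, and the function names are hypothetical.

```python
def peak_prominences(y):
    """Prominence of every strict local maximum in y, as {index: prominence}."""
    proms = {}
    for i in range(1, len(y) - 1):
        if not (y[i] > y[i - 1] and y[i] > y[i + 1]):
            continue
        # Walk left until a higher point (or the edge), tracking the lowest value.
        left_min = y[i]
        j = i - 1
        while j >= 0 and y[j] <= y[i]:
            left_min = min(left_min, y[j])
            j -= 1
        # Same walk to the right.
        right_min = y[i]
        j = i + 1
        while j < len(y) and y[j] <= y[i]:
            right_min = min(right_min, y[j])
            j += 1
        proms[i] = y[i] - max(left_min, right_min)
    return proms


def passes_prom_ratio(y, prom_ratio=0.3):
    """True if at least one peak has prominence > prom_ratio * (max - min)."""
    span = max(y) - min(y)
    return any(p > prom_ratio * span for p in peak_prominences(y).values())
```

A shape such as `[0, 1, 0, 3, 0]` passes, because its tallest peak rises the full intensity range above its surroundings. A steadily rising ramp with a small bump on it fails, because the bump's prominence is small relative to the range, which is the kind of baseline-dominated shape this filter is meant to catch.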

do.clustering

FALSE

description

Should clustering be attempted to reduce feature shape duplication? Not recommended at present. If used, OPTICS clustering will be employed.

clustering

max.eps

50

description

description

minPts

2

description

description

eps.stepsize

0.01

description

description

description

Parameters for OPTICS clustering of feature shapes. Clustering can be used to reduce matching computations. However, matching scales better now, and this is not necessary and can introduce undesirable effects.

plots

max.plots

600

description

description

filtered.out

TRUE

description

description

filtered.features

TRUE

description

description

cleaned.clusters

TRUE

description

description

description

Whether or not certain sets of plots should be generated. Can help with troubleshooting.

description

Parameters for TINA (TINA Is Not Alignment). This was initially a script for feature de-duplication and combination into clusters, but that is unnecessary as reference matching effectively accomplishes this. Now, since matching scales well, it's more about filtering features and spec-feature extraction (identifying features across spectra).

matching

cluster.profile

representative.feature

description

Not currently in use (leave set to 'representative.feature'). If clustering is done, this determines whether a representative feature shape is used for matching, or whether the weighted.mean of the cluster is used as the shape to match.

ref.sig.SD.cutoff

0.01

description

When interpolating reference spectra, it is useful to filter out very low spectral regions in order to compress the reference data matrix. This parameter controls that cutoff, defined in multiples of the whole-spectrum standard deviation. It only needs to be a rough approximation as most useful shape matching information tends to be higher signal. Exercise caution if real PCRSs are used, but imperfect settings shouldn't affect results too much.

max.hits

5

description

During matching, only the top _max.hits_ convolution peaks between each feature and ref spectrum (fast proxy for cross correlation lags) are assessed for Pearson Correlation Coefficient (r). This should be high enough to allow for several local optima (e.g. 3-5).

r.thresh

0.8

description

The top _max.hits_ convolution peaks are assessed for Pearson's r. Any of these matches with r lower than this threshold are excluded. Note: backfits (matches back to dataset spectra) are limited, so the effective _r.thresh_ may change as a result of jettisoning matches to satisfy that cutoff (_max.backfits_).
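The interplay of max.hits and r.thresh can be illustrated with a small Python sketch. It slides a feature along a reference spectrum, uses a plain sliding dot product as a stand-in for the convolution score, takes the top `max_hits` local maxima of that score, and keeps only those whose Pearson r reaches `r_thresh`. This is a simplified illustration under those assumptions, not SAFER's matching code, and all names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_lag_matches(feature, ref, max_hits=5, r_thresh=0.8):
    """Return [(lag, r)] for the best-scoring lags whose r reaches r_thresh."""
    n = len(feature)
    # Sliding dot product: a cheap score for every candidate lag.
    score = [
        sum(f * ref[lag + k] for k, f in enumerate(feature))
        for lag in range(len(ref) - n + 1)
    ]

    def is_peak(i):
        left = score[i - 1] if i > 0 else -math.inf
        right = score[i + 1] if i < len(score) - 1 else -math.inf
        return score[i] > left and score[i] >= right

    peaks = [i for i in range(len(score)) if is_peak(i)]
    best = sorted(peaks, key=lambda i: score[i], reverse=True)[:max_hits]
    # Only the selected lags get the more expensive correlation check.
    hits = [(lag, pearson_r(feature, ref[lag:lag + n])) for lag in best]
    return [(lag, r) for lag, r in hits if r >= r_thresh]
```

With `feature = [0, 1, 3, 1, 0]` and `ref = [0, 0, 0, 1, 3, 1, 0, 0, 0, 5, 5, 5, 5, 5]`, the highest score sits on the intense plateau at the end of the reference, where the shape correlation is poor, while the true match at lag 2 has a lower score. Setting `max_hits=1` therefore finds nothing, whereas a few hits recover the lag-2 match. This is why the setting should allow for several local optima.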

p.thresh

0.01

description

Adjusted p-value cutoff for cross-correlations (as defined for STORM).

filtering

res.area.threshold

0.25

description

What fraction of a matched resonance (peak) in the reference dataset must be accounted for by a fit feature to be considered 'matched'?

ppm.tol

0.1

description

How far away (+/- ppm) can the center of a match be in the dataset from the matched point on the reference spectrum?

max.backfits

1e+08

description

Upper limit to the number of ref-features that can be back-fitted to the dataset spectra from which the corresponding feature was extracted.

select

random

description

If backfits need to be limited, matches must be discarded. Should they be discarded at 'random', or should higher r-value matches be prioritized ('rval')?
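The two selection modes can be sketched in Python as below. The sketch assumes each match is a dictionary with a hypothetical key `r` holding its Pearson r, and that max.backfits applies to the total number of matches. The `seed` argument is added only to make the random case reproducible in examples.

```python
import random


def limit_backfits(matches, max_backfits, select="random", seed=None):
    """Discard matches until at most max_backfits remain."""
    k = int(max_backfits)  # YAML values such as 1e+08 arrive as floats
    if len(matches) <= k:
        return list(matches)
    if select == "rval":
        # Keep the k matches with the highest correlation.
        return sorted(matches, key=lambda m: m["r"], reverse=True)[:k]
    if select == "random":
        return random.Random(seed).sample(list(matches), k)
    raise ValueError("select must be 'random' or 'rval'")
```

With 'rval', the lowest r among the kept matches acts as a stricter effective r.thresh. With 'random', the kept matches are a sample of everything that passed the original threshold.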

description

match filtering parameters

description

Parameters for matching to PCRSs (reference spectra). More info on [matching]()

par

ncores

48

description

Number of cores to use for computations. It's best to leave a few extra cores on personal machines. Be careful about RAM (expect to use 5-10 GB per core).
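The RAM guidance translates into simple arithmetic, sketched here in Python. The 5-10 GB per core range comes from the description above. The default of two spare cores and the function names are assumptions for illustration.

```python
def ram_estimate_gb(ncores, gb_per_core=(5, 10)):
    """Rough (low, high) RAM estimate in GB for a given core count."""
    low, high = gb_per_core
    return ncores * low, ncores * high


def cores_that_fit(ram_gb, available_cores, gb_per_core=10, spare_cores=2):
    """Largest core count allowed by both RAM and CPU, never below 1."""
    by_ram = int(ram_gb // gb_per_core)
    by_cpu = available_cores - spare_cores
    return max(1, min(by_ram, by_cpu))
```

For instance, `ram_estimate_gb(48)` gives `(240, 480)`, so 48 cores should be budgeted at roughly 240 to 480 GB, while a laptop with 16 GB of RAM is limited to a single core under the conservative 10 GB figure.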

type

FORK

description

what type of parallel process ("FORK" or "PSOCK"; see parallel::makeCluster documentation). Most parallel operations have now been converted to mclapply, so this primarily affects the matching loops.

description

Parallel computing parameters.

opts

npoints

32000

description

Since NMR spectra often contain far more spectral points than necessary, a lot of compute is duplicated and wasted. This is the number of spectral points to interpolate both the dataset and library spectra to in order to save time. < 16k is not recommended, as significant peak shape information loss occurs. Currently, a basic linear interpolation is used.
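As a rough picture of what interpolating to npoints means, here is a minimal linear-interpolation sketch in Python. It assumes a strictly increasing ppm axis (NMR axes are often stored in descending order, so they may need reversing first) and `npoints >= 2`. It is an illustration of basic linear interpolation, not SAFER's implementation.

```python
def resample_linear(ppm, intensity, npoints):
    """Linearly interpolate a spectrum onto npoints evenly spaced ppm values."""
    lo, hi = ppm[0], ppm[-1]
    step = (hi - lo) / (npoints - 1)
    new_ppm = [lo + i * step for i in range(npoints)]
    out = []
    j = 0  # index of the left-hand original point of the current segment
    for x in new_ppm:
        while j < len(ppm) - 2 and ppm[j + 1] < x:
            j += 1
        x0, x1 = ppm[j], ppm[j + 1]
        t = (x - x0) / (x1 - x0)
        out.append(intensity[j] + t * (intensity[j + 1] - intensity[j]))
    return new_ppm, out
```

When the target grid is coarser than the original axis, sharp peaks can fall between grid points and lose height, which is one way to see why very low npoints values degrade peak shape information.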

galaxy

enabled

FALSE

description

Is this going to be run on a Galaxy environment? If so, certain pipeline operations need to change, and this switch will trigger those.

description

description

debug

enabled

FALSE

description

Run in debug mode? This is necessary to throttle features (see below).

throttle_features

100

description

How many features (1:n) should be used for matching? This can keep runs short, but can also result in no matches if very small numbers are used.
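The dependency on debug mode can be summarised in a few lines of Python. This sketch assumes throttling simply keeps the first n features and is ignored unless debug is enabled, as the descriptions above suggest. The function name is hypothetical.

```python
def throttle(features, debug_enabled, throttle_features):
    """Return the first n features in debug mode, otherwise all of them."""
    if debug_enabled and throttle_features:
        return features[: int(throttle_features)]
    return features
```

With debug disabled, the throttle value has no effect and all features go on to matching.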

all.outputs

TRUE

description

Should intermediate outputs be written to a debug directory? These will be zipped if so.

description

things useful for debugging

Example files

A quick run (for testing purposes; change the directories for your machine):


dirs :
 temp : /Users/mjudge/Documents/ftp_ebi/local_outputs/
 # where files will be stored on runtime (relative to working)

study :
 id : MTBLS1
 spectrometer.frequency : 700 # MHz of the dataset

files :
 spectral.matrix : /Users/mjudge/Documents/ftp_ebi/spectral_matrices/MTBLS1_1r_noesypr1d_spectralMatrix.RDS
 lib.data : /Users/mjudge/Documents/ftp_ebi/gissmo/700MHz_tiny.RDS

corrpockets :
 half.window : .03 # 1/2 initial tolerance for resonance pairs; this is in PPM
 noise.percentile : .95 # type of noise cutoff; higher is looser (more noise)
 only.region.between : !expr c(2,4) # only run corrpocketpairs on this region (e.g. c(0,10)) - not necessary
 rcutoff : .75 # don't consider corrpocket peaks whose maximum is < this

storm :
 correlation.r.cutoff : 0.8 # correlation cutoff (for both subset and profile steps)
 q : .01 # STORM adjusted pvalue 
 b : 1.5 # STORM b parameter - increasing opens up the search to a wider area. Units: peak widths
 number.of.plots : 250

tina :
 bounds : !expr c(-1, 11)
 min.subset : 5
 prom.ratio : 0.3
 do.clustering : FALSE 
 clustering : 
  max.eps : 50
  minPts : 2
  eps.stepsize : .01
 plots :
  max.plots : 600
  filtered.out : FALSE
  filtered.features : TRUE
  cleaned.clusters : FALSE

matching :
 cluster.profile : representative.feature # or weighted.mean
 ref.sig.SD.cutoff : 0.01 # fraction of signal standard deviation (signal cutoff)
 max.hits : 5  # number of convolution (cross-correlation) maxima to consider for each pairwise feature - ref spectrum pair
 r.thresh : 0.9 # pearson correlation coefficient (r) cutoff for cross-correlations
 p.thresh : 0.01 # p-value cutoff for cross-correlations
 filtering :
  res.area.threshold : 0.25
  ppm.tol : 0.1
  max.backfits : 1E3
  select : random

par :
 ncores : 2  # number of cores to use for parallelized matching; could set to: !expr parallel::detectCores() - 1
 type : FORK  # what type of parallel process ("FORK" or "PSOCK"; see parallel::makeCluster documentation). Most parallel operations have now been converted to mclapply, so this primarily affects the matching loops.

opts:
  npoints: 32000

galaxy:
  enabled: FALSE

debug:
  enabled: TRUE # this controls whether or not debug params are considered. 
  throttle_features: 4 # limit number of features used in matching (to save time)
  all.outputs: FALSE # save outputs from each step (lots of duplicated data)

A typical run for an HPC environment with 1-2 TB RAM and 48 cores: