Parameters File description - EBI-Metabolights/SAFERnmr GitHub Wiki

What the parameter file does

The parameter file stores your parameter settings and passes them to the pipeline in a systematic format (YAML) for use by all the different functions. It also lets us index all the SAFER runs we've done by their parameter sets. Parameter names should match the SAFER parameter names exactly. The inputs are interpreted and checked by the valid_pars() function, which, in some cases, might modify a parameter if needed for formatting. If the function detects any parameter values that are out of the ordinary, it prints warnings to the log/terminal, and it will attempt to assign basic params that are not present, where possible.

You'll need the following fields in a .yaml file:

Note: the YAML format has some requirements:

  • a name followed by a colon, with nothing after it, indicates a heading
  • leading spaces at the start of a line indicate nesting (a sub-heading); indent with spaces, not tabs
  • a name followed by a colon, a space, and a value indicates a name-value pair
  • the file should end in a newline (no spaces)
  • two or more elements can be passed to a single param and interpreted as a vector; see e.g. tina: bounds
  • you can check whether a file is valid YAML by reading it into R with yaml::yaml.load_file(). I find it useful to have an LLM like ChatGPT check them for me as well.
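
As a rough illustration of the rules above, the fragment below is a sketch assembled from parameter names and values that appear elsewhere on this page; it is not a complete parameter file. The `!expr` tag is the form this page uses to write an R vector on one line, and whether it is evaluated depends on how the pipeline loads the file.

```yaml
tina:                        # heading (top-level section)
  bounds: !expr c(-1, 11)    # vector written on one line
  min.subset: 5              # name-value pair
  plots:                     # nested heading (sub-section)
    max.plots: 600
    filtered.out: TRUE
corrpockets:
  only.region.between:       # the same kind of vector, written on two lines
    - -1
    - 11
```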

Parameter descriptions:

dirs: Top-level filepaths for the pipeline.

  • temp (example: /Users/mjudge/Documents/ftp_ebi/pipeline_runs_new): Local temp directory, where the timestamped run directory will be created. If the last part of this path doesn't exist, it will be created.
  • lib (example: /nfs/production/odonovan/nmr_staging/gissmo_ref): The local directory where reference library files are stored.

study: Details about the study being analyzed. These are necessary for indexing results, but are not checked when running locally.

  • id (example: MTBLS1): The study ID (e.g. MetaboLights study ID).
  • spectrometer.frequency (example: 700): The spectrometer frequency, in MHz, at which the spectra were acquired.

files: File information for pipeline inputs.

  • spectral.matrix (example: /nfs/production/odonovan/nmr_staging/spectral_matrices/MTBLS1_1r_noesypr1d_spectralMatrix.RDS): Filepath to the .RDS spectral matrix file being used. On Galaxy, this can be uploaded or selected. See [specifications](https://github.com/EBI-Metabolights/SAFER-NMR/wiki/Data-processing).
  • lib.data (example: /nfs/production/odonovan/nmr_staging/gissmo_ref/data.list_700MHz.RDS): Library reference spectra files. See [specifications](https://github.com/EBI-Metabolights/SAFER-NMR/wiki/Reference-Library-Data).

corrpockets: Protofeature generation parameters. These can be set relatively permissively. See [algorithm details](https://github.com/EBI-Metabolights/SAFER-NMR/wiki/How-FSE-works).

  • half.window (example: 0.06): Half of the initial tolerance for resonance pairs, in ppm.
  • noise.percentile (example: 0.99): Type of noise cutoff; higher is looser (more noise).
  • only.region.between (example: -1.0, 11.0): Only run corrpocketpairs on this region (e.g. c(-1, 11)). This is not necessary, but only protofeatures in this region will be STORM'd. The bounds can be written on one line as:
      !expr c(-1, 11)
    or on two lines in this form:
      - -1
      - 11
  • rcutoff (example: 0.5): Pearson correlation coefficient (r) cutoff for considering correlation peaks in protofeature extraction.

storm: Parameters for LOG-STORM. See details about the algorithm [here]().

  • correlation.r.cutoff (example: 0.7): Correlation cutoff (for both subset and profile refinement steps in LOG-STORM).
  • q (example: 0.01): STORM adjusted p-value.
  • b (example: 1.5): STORM b parameter. Increasing it opens up the search to a wider area. Units: peak widths.
  • number.of.plots (example: 250): Plots are typically not generated. Deprecated parameter!

tina: Parameters for TINA (TINA Is Not Alignment). This was initially a script for feature de-duplication and combination into clusters, but that is unnecessary as reference matching effectively accomplishes this. Now, since matching scales well, it's more about filtering features and spec-feature extraction (identifying features across spectra).

  • bounds (example: -1, 11): Features outside of these boundaries (ppm units) will be discarded before matching. The bounds can be written on one line as:
      !expr c(-1, 11)
    or on two lines in this form:
      - -1
      - 11
  • min.subset (example: 5): Minimum number of spectra in which a feature must be found to be included. Usually set to 4-5; must be > 3 (or else correlations are not really meaningful).
  • prom.ratio (example: 0.3): Prominence ratio. In order to detect strong local baseline effects in feature shapes, at least one peak must have prominence > prom.ratio * range(feature.intensities).
  • do.clustering (example: FALSE): Should clustering be attempted to reduce feature shape duplication? Not recommended at present. If used, OPTICS clustering will be employed.
  • clustering: Parameters for OPTICS clustering of feature shapes. Clustering can be used to reduce matching computations. However, matching scales better now, so this is not necessary and can introduce undesirable effects.
    • max.eps (example: 50)
    • minPts (example: 2)
    • eps.stepsize (example: 0.01)
  • plots: Whether or not certain sets of plots should be generated. Can help with troubleshooting.
    • max.plots (example: 600)
    • filtered.out (example: TRUE)
    • filtered.features (example: TRUE)
    • cleaned.clusters (example: TRUE)

matching: Parameters for matching to PCRSs (reference spectra). More info on [matching]().

  • cluster.profile (example: representative.feature): Not currently in use (leave set to 'representative.feature'). If clustering is done, this determines whether a representative feature shape is used for matching, or whether the weighted.mean of the cluster is used as the shape to match.
  • ref.sig.SD.cutoff (example: 0.01): When interpolating reference spectra, it is useful to filter out very low-signal spectral regions in order to compress the reference data matrix. This parameter controls that cutoff, defined in multiples of the whole-spectrum standard deviation. It only needs to be a rough approximation, as most useful shape-matching information tends to be in higher-signal regions. Exercise caution if real PCRSs are used, but imperfect settings shouldn't affect results too much.
  • max.hits (example: 5): During matching, only the top _max.hits_ convolution peaks between each feature and ref spectrum (a fast proxy for cross-correlation lags) are assessed for Pearson correlation coefficient (r). This should be high enough to allow for several local optima (e.g. 3-5).
  • r.thresh (example: 0.8): The top _max.hits_ convolution peaks are assessed for Pearson's r. Any matches with r lower than this are excluded. Note: backfits (matches back to dataset spectra) are limited, so the effective _r.thresh_ may change as a result of jettisoning matches to satisfy that cutoff (_max.backfits_).
  • p.thresh (example: 0.01): Adjusted p-value cutoff for cross-correlations (as defined for STORM).
  • filtering: Match filtering parameters.
    • res.area.threshold (example: 0.25): What fraction of a matched resonance (peak) in the reference dataset must be accounted for by a fit feature to be considered 'matched'?
    • ppm.tol (example: 0.1): How far away (+/- ppm) can the center of a match be in the dataset from the matched point on the reference spectrum?
    • max.backfits (example: 1e+08): Upper limit on the number of ref-features that can be back-fitted to the dataset spectra from which the corresponding feature was extracted.
    • select (example: random): If backfits need to be limited, matches must be discarded. Should they be discarded at 'random', or should higher r-value matches be prioritized ('rval')?

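
Both example files on this page use select: random. A sketch of the alternative, assuming you want to cap the number of backfits and keep the higher r-value matches when some must be discarded (the max.backfits value here is arbitrary and chosen only for illustration):

```yaml
matching:
  filtering:
    res.area.threshold: 0.25
    ppm.tol: 0.1
    max.backfits: 1000   # illustrative cap; not a recommended value
    select: rval         # prioritize higher r-value matches instead of 'random'
```

With a cap this low, the note under r.thresh applies: the effective r cutoff may end up higher than the configured r.thresh.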
par: Parallel computing parameters.

  • ncores (example: 48): Number of cores to use for computations. It's best to leave a few extra cores on personal machines. Be careful about RAM (expect to use 5-10 GB per core).
  • type (example: FORK): What type of parallel process to use ("FORK" or "PSOCK"; see parallel::makeCluster documentation). Most parallel operations have now been converted to mclapply, so this primarily affects the matching loops.

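
On a personal machine, one way to leave cores free is to derive ncores from the machine instead of hard-coding it. The quick-run example's comment suggests this form. The sketch assumes the pipeline evaluates `!expr` for this field the same way it does for the bounds parameters; parallel::detectCores() comes from R's parallel package.

```yaml
par:
  ncores: !expr parallel::detectCores() - 1   # leave one core free; subtract more if RAM is tight
  type: FORK                                  # or PSOCK; see parallel::makeCluster documentation
```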
opts:

  • npoints (example: 32000): Since NMR spectra often contain far more spectral points than necessary, a lot of compute is duplicated and wasted. This is the number of spectral points to which both the dataset and library spectra are interpolated in order to save time. < 16k is not recommended, as significant peak shape information loss occurs. Currently, a basic linear interpolation is used.

galaxy:

  • enabled (example: FALSE): Is this going to be run in a Galaxy environment? If so, certain pipeline operations need to change, and this switch will trigger those.

debug: Things useful for debugging.

  • enabled (example: FALSE): Run in debug mode? This is necessary to throttle features (see throttle_features).
  • throttle_features (example: 100): How many features (1:n) should be used for matching? This can keep runs short, but can also result in no matches if very small numbers are used.
  • all.outputs (example: TRUE): Should intermediate outputs be written to a debug directory? These will be zipped if so.

Example files

A quick run (for testing purposes; change the directories for your machine):

dirs :
 temp : /Users/mjudge/Documents/ftp_ebi/local_outputs/
 # where files will be stored on runtime (relative to working)

study :
 id : MTBLS1
 spectrometer.frequency : 700 # MHz of the dataset

files :
 spectral.matrix : /Users/mjudge/Documents/ftp_ebi/spectral_matrices/MTBLS1_1r_noesypr1d_spectralMatrix.RDS
 lib.data : /Users/mjudge/Documents/ftp_ebi/gissmo/700MHz_tiny.RDS

corrpockets :
 half.window : .03 # 1/2 initial tolerance for resonance pairs; this is in PPM
 noise.percentile : .95 # type of noise cutoff; higher is looser (more noise)
 only.region.between : !expr c(2,4) # only run corrpocketpairs on this region (e.g. c(0,10)) - not necessary
 rcutoff : .75 # don't consider corrpocket peaks whose maximum is < this

storm :
 correlation.r.cutoff : 0.8 # correlation cutoff (for both subset and profile steps)
 q : .01 # STORM adjusted pvalue 
 b : 1.5 # STORM b parameter - increasing opens up the search to a wider area. Units: peak widths
 number.of.plots : 250

tina :
 bounds : !expr c(-1, 11)
 min.subset : 5
 prom.ratio : 0.3
 do.clustering : FALSE 
 clustering : 
  max.eps : 50
  minPts : 2
  eps.stepsize : .01
 plots :
  max.plots : 600
  filtered.out : FALSE
  filtered.features : TRUE
  cleaned.clusters : FALSE

matching :
 cluster.profile : representative.feature # or weighted.mean
 ref.sig.SD.cutoff : 0.01 # fraction of signal standard deviation (signal cutoff)
 max.hits : 5  # number of convolution (cross-correlation) maxima to consider for each pairwise feature - ref spectrum pair
 r.thresh : 0.9 # pearson correlation coefficient (r) cutoff for cross-correlations
 p.thresh : 0.01 # p-value cutoff for cross-correlations
 filtering :
  res.area.threshold : 0.25
  ppm.tol : 0.1
  max.backfits : 1E3
  select : random

par :
 ncores : 2  # number of cores to use for parallelized matching; could set to: !expr parallel::detectCores() - 1
 type : FORK  # what type of parallel process ("FORK" or "PSOCK"; see parallel::makeCluster documentation). Most parallel operations have now been converted to mclapply, so this primarily affects the matching loops.

opts:
  npoints: 32000

galaxy:
  enabled: FALSE

debug:
  enabled: TRUE # this controls whether or not debug params are considered. 
  throttle_features: 4 # limit number of features used in matching (to save time)
  all.outputs: FALSE # save outputs from each step (lots of duplicated data)

A typical run for an HPC environment with 1-2 TB of RAM and 48 cores:

dirs:
  temp: ~/safer_runs/MTBLS1_1r_noesypr1d_spectralMatrix.RDS
study:
  id: MTBLS1
  spectrometer.frequency: 700.0
files:
  spectral.matrix: ~/spectral_matrices/MTBLS1_1r_noesypr1d_spectralMatrix.RDS
  lib.data: ~/gissmo_ref/data.list_700MHz.RDS
corrpockets:
  half.window: 0.06
  noise.percentile: 0.99
  only.region.between: ~
  rcutoff: 0.5
storm:
  correlation.r.cutoff: 0.7
  q: 0.01
  b: 1.5
  number.of.plots: 250
tina:
  bounds:
  - -1.0
  - 11.0
  min.subset: 5.0
  prom.ratio: 0.3
  do.clustering: no
  clustering:
    max.eps: 50
    minPts: 2
    eps.stepsize: 0.01
  plots:
    max.plots: 600
    filtered.out: yes
    filtered.features: yes
    cleaned.clusters: yes
matching:
  cluster.profile: representative.feature
  ref.sig.SD.cutoff: 0.01
  max.hits: 5
  r.thresh: 0.8
  p.thresh: 0.01
  filtering:
    res.area.threshold: 0.25
    ppm.tol: 0.1
    max.backfits: 1.0e+08
    select: random
par:
  ncores: 48.0
  type: FORK
opts:
  npoints: 32000
galaxy:
  enabled: no
debug:
  enabled: no
  throttle_features: 100
  all.outputs: yes
