1.1 For Linux users - GenomicSEM/GenomicSEM GitHub Wiki

In some R installations on Linux, a parallel backend is configured automatically and by default uses the maximum number of cores (i.e. 1 thread per core). It is unclear to me whether base R or some other package does this in the background, but it happens before the parallel cluster in GenomicSEM is initiated, so it is unfortunately outside of our control. The cluster in GenomicSEM is created with parallel and foreach, which seem to work outside this scope (i.e. they simply create child processes that are not managed by these backends). Combining the pre-configured parallel backend with this foreach cluster creates far too many threads: the cores argument multiplied by the number of actual CPU cores. For example, a 16-core machine with cores=15 would spawn 16*15=240 R threads.
This in turn causes CPU congestion and a very significant drop in performance, especially on high core-count machines (see test results below).
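
As a rough illustration of the arithmetic above, the sketch below multiplies the cores argument by the CPU count. The function name expected_threads is hypothetical (it is not part of GenomicSEM), and the CPU count defaults to whatever nproc reports on your machine, so treat the result as an estimate rather than an exact prediction.

```shell
#!/bin/bash
# Hypothetical helper (not part of GenomicSEM): estimate how many R threads
# may be spawned when the pre-configured backend is left unlimited.
#   $1 = value passed to the cores argument in GenomicSEM
#   $2 = number of CPU cores (optional; defaults to the output of nproc)
expected_threads() {
  local cores_arg="$1"
  local cpu_cores="${2:-$(nproc)}"
  echo $(( cores_arg * cpu_cores ))
}

# The example from the text: cores=15 on a 16-core machine.
expected_threads 15 16   # prints 240
```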

How to change this may depend on your Linux and/or R build, but as a catch-all you can use the following prior to running R:

export OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 VECLIB_MAXIMUM_THREADS=1

Note that this may change behavior of other programs or packages as well, so it is recommended to do this in a separate session, or from a separate script, e.g. create RunGSEMAnalyses.sh:

#!/bin/bash
export OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 NUMEXPR_NUM_THREADS=1 VECLIB_MAXIMUM_THREADS=1
/usr/lib/R/bin/exec/R --no-echo --no-restore --file=MyGSEMAnalysesRscript.R --args argument1 argument2
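
If you would rather not export the variables into your shell at all, they can also be set for a single command only. The sketch below wraps this in a hypothetical helper, run_single_threaded (not part of GenomicSEM), using the standard env utility. The variables then apply only to the command it launches and to that command's child processes.

```shell
#!/bin/bash
# Hypothetical helper (not part of GenomicSEM): run one command with the
# thread-limiting variables set, leaving the calling shell untouched.
run_single_threaded() {
  env OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 \
      NUMEXPR_NUM_THREADS=1 VECLIB_MAXIMUM_THREADS=1 "$@"
}

# Example usage (assumes Rscript is on your PATH):
# run_single_threaded Rscript MyGSEMAnalysesRscript.R argument1 argument2
```

Because nothing is exported, other programs started from the same session keep their default threading behavior.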

Best performance is achieved by leaving the values for these backends at 1 and maximizing the number of cores in GenomicSEM (only bound by CPU or RAM constraints).

The test results are from a server with 2x64 cores (=2x128 threads) and 2x256GB RAM on Ubuntu 20.04, running userGWAS on 100K SNPs in GenomicSEM v0.0.5. On this system OPENBLAS_NUM_THREADS was the culprit, but as stated previously this may vary between systems.
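
To check whether your own system is affected, you can count the threads of the running R processes while an analysis is in progress. The sketch below is one possible way to do this and is not necessarily how the numbers in the tables were obtained. The function names are hypothetical, and it assumes the procps version of ps and that the R processes have the command name R.

```shell
#!/bin/bash
# Sum per-process thread counts (one number per line on stdin).
sum_threads() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Count threads across all processes whose command name is exactly "R".
# Assumes the procps version of ps (standard on most Linux distributions);
# nlwp is the number of threads (light-weight processes) per process.
count_r_threads() {
  ps -C R -o nlwp= | sum_threads
}

# Example usage, while an analysis is running in another terminal:
# count_r_threads
```

If the total is far larger than the cores argument plus one, a threaded backend is probably still active.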

OPENBLAS_NUM_THREADS unlimited:

cores=   runtime (s)   Number of R threads
1        12,645        257
2        7,577         513
4        4,214         1,024
8        5,347         2,049
12       6,585         3,073
24       5,170         3,100-3,500*

*Note: in the 24-core tests the number of threads varied, likely due to some OS-level limitations.

OPENBLAS_NUM_THREADS=1:

cores=   runtime (s)   Number of R threads
1        10,545        2
2        4,489         3
4        2,559         5
8        1,186         9
12       793           13
24       458           25

Note that the number of threads is always 1 higher than the cores argument because the main R thread is not included in that argument, hence our recommendation to set cores to (at least) 1 fewer than the total number of cores.
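
The recommendation above can be sketched as a small shell helper that derives a value for cores from the CPU count. The name gsem_cores is hypothetical, and how the value reaches your R script (here suggested as a command-line argument) is an assumption about your own setup, not something GenomicSEM prescribes.

```shell
#!/bin/bash
# Hypothetical helper (not part of GenomicSEM): suggest a value for the
# cores argument that leaves one core free for the main R thread.
#   $1 = total number of CPU cores (optional; defaults to the output of nproc)
gsem_cores() {
  local total="${1:-$(nproc)}"
  local cores=$(( total - 1 ))
  # Never suggest fewer than 1 core, e.g. on a single-core machine.
  if [ "$cores" -lt 1 ]; then
    cores=1
  fi
  echo "$cores"
}

# Example usage (assumes your R script reads the value from its arguments):
# Rscript MyGSEMAnalysesRscript.R "$(gsem_cores)"
```

Available RAM may call for a lower value than this, since each child process holds its own copy of the data.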