# Using the bwCluster - Data2Dynamics/d2d GitHub Wiki
## Preface

Even medium-scale dynamical models can require large amounts of computational power for fitting and for calculation of the profile likelihood. Computing an appropriate number of multi-start fits on a single high-end PC can take days or even weeks. The state of Baden-Württemberg provides such resources through the bwHPC concept by granting us access to the bwForCluster MLS&WISO Production.

This article is specific to the high-performance computing infrastructure of the state of Baden-Württemberg and to the Kreutz and Timmer groups. However, the functions mentioned here can probably be adapted to an arbitrary third-party cluster with moderate effort.
## How to get access

Once the cluster entitlement has been granted by the Rechenzentrum (RZ), the two login nodes can be accessed via

```
ssh fr_[RZ-ID]@bwfor.cluster.uni-mannheim.de
ssh fr_[RZ-ID]@bwforcluster.bwservices.uni-heidelberg.de
```

To obtain the entitlement, see the point "Become Coworker of an RV" in the bwHPC Wiki. The Kreutz and Timmer groups have already registered a Rechenvorhaben (RV).
## Some useful commands

- Overview of jobs: `squeue`
- Cancel a job: `scancel [jobId]`
- Cancel all of your jobs: `scancel --user=fr_[yourRzId]`
- View idle cluster resources: `sinfo_t_idle`
## D2D cluster suite

The D2D cluster suite is a collection of functions that allow convenient use of the bwCluster from a local machine, without having to copy files and ssh to the cluster manually. It contains functions for file upload, job submission, review of job status, job cancellation, and file download. An example workflow is described below.
### Preliminaries
To use the D2D cluster suite, some one-time preparations on the cluster are necessary: clone the D2D repository from GitHub to some directory on the cluster, e.g. `~/d2d`, and create a folder in which all files generated during the computational tasks are stored (the cluster working directory, e.g. `~/d2d_work`). Next, some configuration options have to be set. This can be done interactively by calling `arUserConfigBwCluster` without an argument. It will ask for

- your ssh username
- the ssh server (`bwfor.cluster.uni-mannheim.de`)
- the MATLAB version to use on the cluster (`R2019b`)
- the path to D2D on the cluster (`~/d2d/arFramework3`)
- the working directory (`~/d2d_work`)
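An interactive configuration session might then look like the following sketch; the answers shown in the comments are the example values from above, and your RZ-ID will differ:

```matlab
% One-time interactive configuration (run on the local machine);
% the answers below are example values, adapt them to your account.
arUserConfigBwCluster
% ssh username:        fr_[RZ-ID]
% ssh server:          bwfor.cluster.uni-mannheim.de
% MATLAB version:      R2019b
% path to D2D:         ~/d2d/arFramework3
% cluster working dir: ~/d2d_work
```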
### Upload and job submission

For the available cluster functions, see the section below. In this example, a multi-start optimization with 20 runs will be conducted. Uploading all necessary files and submitting the cluster jobs is possible via the function

```matlab
arSendJobBwCluster(name, functionString, [cjId], [uploadFiles])
```

- `name` specifies a subfolder of the cluster working directory to which all files are uploaded
- `functionString` is a string that contains the actual function call to be executed on the cluster
- `cjId` is an identifier for the computing job that will be submitted. It is stored in `ar.config.cluster` as well as in a backup file
- `uploadFiles` is a boolean variable indicating whether `arSendJobBwCluster` should only submit the computing job or also upload the files

For this example, a suitable function call can look like this:

```matlab
arSendJobBwCluster('myProj', 'arFitLhsBwCluster(20,2)', 'myProj_r01', true)
```

To submit another computing job for the same workspace but with, e.g., different parameter settings, one can now simply make the desired changes and then submit another job with a new `cjId`, without uploading the files again:

```matlab
arSendJobBwCluster('myProj', 'arFitLhsBwCluster(20,2)', 'myProj_r02', false)
```
### Job status and results download

The results can be downloaded to the local results folder by calling

```matlab
arDownloadResultsBwCluster('myProj_r01', 'myProj_r01_clusterResults')
```

This function creates a new folder named `myProj_r01_clusterResults` inside the results directory. Before initiating the download, it first checks whether all computations have finished by internally calling `arJobStatusBwCluster('myProj_r01')`.
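Putting the two functions together, a typical check-then-download sequence could look like this sketch (job and folder names are the examples from above):

```matlab
% Check whether all cluster jobs for this computing job have finished
arJobStatusBwCluster('myProj_r01')

% When everything is done, fetch the results into a new local results folder
arDownloadResultsBwCluster('myProj_r01', 'myProj_r01_clusterResults')
```

The explicit status call is optional, since the download function performs the same check internally; it is useful for polling long-running jobs without triggering a download attempt.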
## D2D cluster functions

The D2D workflow on the cluster is similar to that on the knechte/ruprechte. First, upload (or clone) D2D and your D2D workspace to the cluster, e.g. by using the `scp` command. Then log in to one of the login nodes, load the MATLAB module via

```
module load math/matlab/R2019b
```

and type `matlab` to start MATLAB. Now add D2D to the MATLAB path via `addpath`.
### Multi-start fitting

To perform multi-start fitting on the bwFor cluster, use the function `arFitLhsBwCluster`. The usage is similar to calling `arFitLHS` on a local machine. First, load your workspace on the cluster, then run `arFitLhsBwCluster`. You can specify the queue and walltime for the cluster jobs in the arguments of that function; type `help arFitLhsBwCluster` for further information.

`arFitLhsBwCluster` will create a number of MATLAB instances, each of which performs a small batch of the total number of multi-start fitting runs in parallel. Because each of those instances writes its results into a different directory, a results-collection function must be run after all jobs have finished:
```matlab
addpath('myD2Dpath')
arLoad('myWorkspace')
% start jobs on the standard queue with 1 hour walltime and 10 fits per job
collectfun = arFitLhsBwCluster(1000, 10, 'standard', '01:00:00');
% wait for all jobs to be finished (!), then run the results collection
collectfun();
% results will be stored in 'myWorkspace'
```
To achieve optimal performance, choose the number of jobs per node as the number of cores divided by the number of conditions of your model. This can be achieved by adding a local copy of `arClusterConfig` to your workspace, in which you can change the value of `conf.n_inNode`.
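As a sketch of this rule of thumb: suppose a node has 16 cores (a hypothetical value; check your cluster's hardware specification) and your model has 4 conditions. In the local copy of `arClusterConfig` one would then set:

```matlab
% hypothetical example: 16 cores per node, model with 4 conditions
conf.n_inNode = 16 / 4;   % 4 jobs per node, each using 4 cores
```

This way, the parallelized condition evaluations of all jobs on a node together saturate its cores without oversubscribing them.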
### Profile likelihood

Profile likelihood calculation can also be accelerated by using the cluster. The function `pleBwCluster` shares the logic of `arFitLhsBwCluster` and can compute the left and right branch of every profile in parallel. It can thus speed up profile likelihood calculation by a factor of up to `2 * [# of fitted parameters]` compared to using `ple`.