Using the bwCluster

Preface

Even medium-scale dynamical models can require large amounts of computational power for fitting and for calculating the profile likelihood. Computing an appropriate number of multi-start fits on a single high-end PC can take days or even weeks. The state of Baden-Württemberg provides such resources through the bwHPC concept by granting us access to the bwForCluster MLS&WISO Production.

This article is specific to the high performance computing infrastructure of the state of Baden-Württemberg and the Kreutz & Timmer groups. However, the functions mentioned here can probably be adapted to an arbitrary third-party cluster with moderate effort.

How to get access

The two login nodes can be accessed via

ssh fr_[RZ-ID]@bwfor.cluster.uni-mannheim.de
ssh fr_[RZ-ID]@bwforcluster.bwservices.uni-heidelberg.de

once the cluster entitlement has been granted by the Rechenzentrum (RZ). To obtain the entitlement, see the bwHPC Wiki entry "Become Coworker of an RV". The Kreutz and Timmer groups have already registered a Rechenvorhaben (RV).

Some useful commands

  • Overview of jobs: squeue
  • Cancel a job: scancel [jobId]
  • Cancel all jobs: scancel --user=fr_[yourRzId]
  • View cluster usage: sinfo_t_idle

D2D cluster suite

The D2D cluster suite is a collection of functions that allows using the bwCluster conveniently from a local machine, without copying files or logging in via ssh manually. It contains functions for file upload, job submission, job status review, job cancellation, and file download. An example workflow is described below.

Preliminaries

To use the D2D cluster suite, some one-time preparations on the cluster are necessary: clone the D2D repository from GitHub to some directory on the cluster, e.g. ~/d2d, and create a folder in which all files generated during the computational tasks are stored (the cluster working directory, e.g. ~/d2d_work). Next, it is necessary to set some configuration options. This can be done interactively by calling arUserConfigBwCluster without an argument (see the sketch after this list). It will ask for

  • your ssh username
  • the ssh server (bwfor.cluster.uni-mannheim.de)
  • the MATLAB version to use on the cluster (R2019b)
  • the path to D2D on the cluster (~/d2d/arFramework3)
  • the working directory (~/d2d_work)
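
A minimal sketch of the interactive configuration, with the answers from the list above (the RZ-ID fr_xy123 is a hypothetical placeholder):

% run once in a local MATLAB session to store the cluster configuration
arUserConfigBwCluster
%   ssh username:      fr_xy123
%   ssh server:        bwfor.cluster.uni-mannheim.de
%   MATLAB version:    R2019b
%   path to D2D:       ~/d2d/arFramework3
%   working directory: ~/d2d_work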

Upload and job submission

For the available cluster functions, see the section below. In this example, a multi-start optimization with 20 runs will be conducted. Uploading all necessary files and submitting the cluster jobs is possible via the function

arSendJobBwCluster(name, functionString, [cjId], [uploadFiles])

  • name specifies a subfolder in the cluster working directory to which all files are uploaded
  • functionString is a string that contains the actual function call on the cluster
  • cjId is an identifier for the computing job that will be submitted. It is stored in ar.config.cluster as well as in a backup file
  • uploadFiles is a boolean variable indicating whether arSendJobBwCluster should only submit the computing job or also upload the files

For this example, a suitable function call can look like this:

arSendJobBwCluster('myProj', 'arFitLhsBwCluster(20,2)', 'myProj_r01', true)

To submit another computing job for the same workspace but, e.g., different parameter settings, one can now simply make the desired changes and then submit another job with a new cjId, without uploading the files again:
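
arSendJobBwCluster('myProj', 'arFitLhsBwCluster(20,2)', 'myProj_r02', false)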

Job status and results download

The results can be downloaded to the local results folder by calling

arDownloadResultsBwCluster('myProj_r01', 'myProj_r01_clusterResults')

Before initiating the download, arDownloadResultsBwCluster first checks whether all computations have finished by internally calling arJobStatusBwCluster('myProj_r01'). It then creates a new folder named 'myProj_r01_clusterResults' inside the local results directory.
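
arJobStatusBwCluster can presumably also be called on its own to check the progress of a computing job. A minimal sketch of the complete check-and-download step:

% check whether all cluster jobs of 'myProj_r01' have finished
arJobStatusBwCluster('myProj_r01')

% download the results into the local folder 'myProj_r01_clusterResults'
arDownloadResultsBwCluster('myProj_r01', 'myProj_r01_clusterResults')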

D2D cluster functions

The D2D workflow on the cluster is similar to that on the knechte/ruprechte. First, upload (or clone) D2D and your D2D workspace to the cluster, e.g. by using the scp command. Then log in to one of the login nodes, load the MATLAB module by

module load math/matlab/R2019b

and type matlab to start MATLAB. Now add d2d to the MATLAB path via addpath.
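
For example, assuming the clone location from the Preliminaries section:

addpath('~/d2d/arFramework3')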

Multi-start fitting

To perform multi-start fitting on the bwFor cluster, use the function arFitLhsBwCluster. The usage is similar to calling arFitLHS on a local machine. First, load your workspace on the cluster, then run arFitLhsBwCluster. You can specify the queue and walltime for the cluster jobs in the arguments of that function. Type help arFitLhsBwCluster for further information.

arFitLhsBwCluster will create a number of MATLAB instances that each perform a small batch of the multi-start fits in parallel. Because each of these instances writes its results to a different directory, a results collection function must be run after all jobs have finished:

addpath('myD2Dpath')
arLoad('myWorkspace')

% start jobs on standard queue with 1 hour walltime and 10 fits per job
collectfun = arFitLhsBwCluster(1000, 10, 'standard', '01:00:00')

% wait for all jobs to finish (!), then run the results collection
collectfun()

% results will be stored in 'myWorkspace'

To achieve optimal performance, choose the number of jobs per node as the number of cores divided by the number of conditions of your model. This can be achieved by adding a local copy of arClusterConfig to your workspace, in which you change the value of conf.n_inNode, as sketched below.
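
A minimal sketch, assuming a hypothetical node with 16 cores and a model with 4 conditions:

% in the local copy of arClusterConfig:
% 16 cores / 4 conditions = 4 parallel jobs per node (hypothetical numbers)
conf.n_inNode = 4;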

Profile likelihood

Profile likelihood calculation can also be accelerated by using the cluster. The function pleBwCluster shares the logic of arFitLhsBwCluster and can compute the left and right branch of every profile in parallel. It can thus speed up profile likelihood calculation by a factor of up to 2 * [# of fitted parameters] compared to using ple.
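
A minimal sketch, assuming pleBwCluster can be called without arguments once the workspace is loaded (see help pleBwCluster for the actual arguments, e.g. queue and walltime):

addpath('myD2Dpath')
arLoad('myWorkspace')

% compute the left and right branches of all profiles in parallel on the cluster
pleBwCluster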