Handling Memory Limits on HPC Clusters
Computing Job Memory Requirements
The most significant data memory use in seismic processing is normally the block of waveform data dask/spark have to move around in the cluster. There are three things you should estimate before running a large data set on a cluster:
- What is the nominal size of the data objects being moved around? For TimeSeries it is the number of samples (npts) times the bytes per sample. In MsPASS all sample data is stored as `real*8/double` floating point numbers, so the size is `8*npts`. For Seismogram objects with three components the size is `3*8*npts`. For all ensembles the chunk size is the nominal number of members per ensemble times the nominal number of bytes per member. For memory use estimates it is best to treat "nominal" as the maximum expected. e.g. if you are working with TimeSeriesEnsemble data with a maximum of 3000 members, each a fixed-length time window of 20000 samples, a rough number is `3000*20000*8 = 480` Mbytes. A good rough estimate of the additional memory used for Metadata and other data can be had, from my experience, by adding roughly 10% (480 + 48, or roughly 530 Mbytes for the example). A sketch of this arithmetic appears after this list.
- The way dask and spark buffer data in a map operation remains a mystery to me (glp). I have found that you should multiply the value from item 1 by a factor of 4 or 5 to reduce the chances of memory problems. An additional mystery is how this interacts with the number of partitions. Our User Manual may be incorrect in saying memory use scales with partition size. I suspect dask, at least, is doing some kind of buffering and not trying to load an entire partition all at once. That, however, may be wrong. It remains one of the big current flaws in using map operators with dask bags.
- The processing workflow also needs to move around any large data objects that appear as arguments in a parallel workflow. The most common in MsPASS are the normalization objects that cache normalizing metadata. A more extreme example is the plane wave migration code we are in the process of adapting to MsPASS, where things like travel time grids need to be passed around. In any case, the point is that if the data passed as arguments are significant, add an estimate of the size of all the arguments to your memory estimate. A potential solution to this issue with dask is their concept of an actor. An actor in dask distributed may be a way to move large objects shared by all map components only once; it might actually be more important for speed than for memory use. See the sketch after this list.
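To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The ensemble dimensions are the example numbers from item 1; the 10% Metadata overhead and the factor of 4 are the rough empirical guesses from items 1 and 2, not measured values.

```python
# Back-of-the-envelope memory estimate following the three items above.
BYTES_PER_SAMPLE = 8   # all MsPASS sample data are stored as real*8/double

def ensemble_bytes(n_members, npts, components=1):
    """Nominal size in bytes of the sample data in one ensemble."""
    return n_members * components * npts * BYTES_PER_SAMPLE

nominal = ensemble_bytes(n_members=3000, npts=20000)  # TimeSeriesEnsemble example
nominal *= 1.10          # ~10% extra for Metadata and other baggage
buffered = 4 * nominal   # factor of 4-5 to allow for dask/spark buffering
arg_bytes = 0            # add the size of any large objects passed as arguments

total = buffered + arg_bytes
print(f"nominal ensemble ~{nominal / 1e6:.0f} MB; "
      f"plan for roughly {total / 1e9:.1f} GB of working memory")
```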
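The last bullet mentions dask actors as one possibility. A related, simpler device that dask.distributed already provides is `client.scatter` with `broadcast=True`, which copies a large shared argument to every worker once so that map tasks only carry a lightweight handle. The sketch below only illustrates that device; `normalize_one` and `matcher` are placeholder names, not part of the MsPASS API.

```python
# Sketch: ship one large shared argument to the workers only once.
from dask.distributed import Client

def normalize_one(d, matcher):
    # placeholder for a map function that uses the cached metadata
    return d + matcher["shift"]

if __name__ == "__main__":
    client = Client()            # Client(scheduler_file=...) on a real cluster
    matcher = {"shift": 1.0}     # stands in for a large cached object
    # broadcast=True copies the object to every worker once; tasks receive
    # only a small Future handle instead of a full serialized copy each.
    matcher_on_workers = client.scatter(matcher, broadcast=True)
    futures = [client.submit(normalize_one, d, matcher_on_workers) for d in range(5)]
    print(client.gather(futures))
```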
Slurm Memory Limits
Slurm is a batch job submission system widely used on HPC systems. It is used at TACC and Indiana University where, at least up to now, most MsPASS development has happened. To schedule a batch job, some Slurm configurations require you to state how much memory the job will need. That is usually the case when the system shares nodes among multiple users. The problem is complicated by the fact that Slurm has to handle everything from big-memory serial jobs to jobs requesting many nodes with only a small memory requirement. It has two ways to define memory requirements that are very different:
- The more common --mem is a request for memory PER NODE. If your job requests multiple nodes, every node in the allocation gets the same limit (e.g. if you use 4 nodes and have --mem=50G, each node will run in a memory block of 50G and the job as a whole can use 200G). This is the way most MsPASS jobs should be run to avoid confusion.
- The alternative is --mem-per-cpu, which Slurm's documentation describes as the memory required per allocated CPU. What that means is complicated by the interplay in Slurm of -n, defined as the number of tasks to be launched, -N, defined as the node count for the job, and --cpus-per-task, defined as the number of cpus required per task. That is confusing because Slurm supports heterogeneous job scheduling. For the standard configuration we have been using, where the "primary node" runs the dbserver, scheduler, and front end and all other nodes are pure workers, a simple model is to set -N and -n equal. In that case --cpus-per-task is the same on every node and the total memory request is N*(cpus-per-task)*(mem-per-cpu). e.g. the following would request 4 nodes, with each node using 20 cpus and 200 GB of memory, for a total of 800 GB of cluster memory (a large request even by modern standards):
#SBATCH -N 4
#SBATCH -n 4
#SBATCH --mem-per-cpu 10G
#SBATCH --cpus-per-task 20
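For comparison, here is a sketch of roughly the same request written with --mem (a per-node limit) instead of --mem-per-cpu. A real job script also needs site-specific directives (partition/queue, account, time limit) that are omitted here.

```
#SBATCH -N 4
#SBATCH -n 4
#SBATCH --cpus-per-task 20
#SBATCH --mem 200G
```

Because --mem applies per node, each of the 4 nodes again gets 200 GB, for the same 800 GB total.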