Submitting jobs on the Rockfish

Here are step-by-step directions for submitting maps on the MARCC. Currently the data must be pulled on idmodeling and copied over to the MARCC, where only the Stan model will be run.

See the MARCC documentation for reference. If you don't have an account yet, you should start by requesting one here.

Log in to the MARCC from the terminal

ssh -X [email protected]

See the System Access section of that page for reference.

Directory setup

The file system on the MARCC has scratch and data directories, which are shared across the group, and a home directory, which is user-specific (more on the file system [here](no such website yet)). We keep all GitHub repositories in the data directory.

Create your data subdirectory

On the login node, run:

cd /data/aazman1/
mkdir $USER
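
A quick optional check that the directory was created where expected (nothing more than a listing):

ls -ld /data/aazman1/$USER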

Clone cholera-mapping-pipeline dev branch

On the login node, run:

cd /data/aazman1/$USER
git clone https://github.com/HopkinsIDD/cholera-mapping-pipeline.git -b dev

Clone cholera-mapping-output branch

On the login node, run:

cd /data/aazman1/$USER
ml git-lfs
git clone https://github.com/HopkinsIDD/cholera-mapping-output.git -b <output branch>
cd cholera-mapping-output
git lfs install
git lfs pull
ln -s /data/aazman1/$USER/cholera-mapping-output /data/aazman1/$USER/cholera-mapping-pipeline/Analysis/data
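
If you want to confirm that the clone, the LFS pull, and the symlink all worked, an optional check (using the same paths as above) is:

ls -l /data/aazman1/$USER/cholera-mapping-pipeline/Analysis/data
git -C /data/aazman1/$USER/cholera-mapping-output lfs ls-files | head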

Clone cholera-configs

On the login node, run:

cd /data/aazman1/$USER
git clone https://github.com/HopkinsIDD/cholera-configs.git
ln -s /data/aazman1/$USER/cholera-configs /data/aazman1/$USER/cholera-mapping-pipeline/Analysis/configs

Fake cholera-covariates

We don't need the full covariates data for MARCC submissions, so we only create an empty placeholder for it.

cd /data/aazman1/$USER/cholera-mapping-pipeline/
mkdir /data/aazman1/$USER/cholera-mapping-pipeline/Layers
touch /data/aazman1/$USER/cholera-mapping-pipeline/Layers/covariate_dictionary.yml

Module setup

MARCC uses a module system that lets users load individual software packages with the command ml (as described here). Some R packages have specific version requirements to work together (sf, rgdal, rgeos, etc.), for which specific setups have been prepared by the MARCC maintainers. See a submission script for the modules we use.

If you load modules manually, remember to run module purge before loading any modules in a submission script. MARCC loads a set of default modules for each user, and to keep runs consistent across users, our scripts assume that no default modules are loaded.
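
For example, to see what is loaded by default and then start from a clean environment (standard Lmod commands):

ml            # with no arguments, lists the currently loaded modules
module purge  # unload everything so your script controls exactly which modules are loaded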

First time installation

Submit the stan setup submission script

This is to set up the Stan platform for the pipeline.

sbatch /data/aazman1/$USER/cholera-mapping-pipeline/submission_Shell_scripts/marcc_install_cmdstan.sh
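
Once this job finishes, a quick way to confirm that CmdStan was built in the expected location (the same path used later for CMDSTAN_LOCATION) is:

ls /data/aazman1/$USER/cmdstan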

Submit the taxdat-related_packages installation script

This is to install all the dependencies for the taxdat package.

sbatch /data/aazman1/$USER/cholera-mapping-pipeline/submission_Shell_scripts/marcc_install_taxdat-related_packages.sh

Submit the taxdat installation/reinstallation script

This installs (or reinstalls) the taxdat package. You will also need to resubmit this script after pulling new changes to the package.

sbatch /data/aazman1/$USER/cholera-mapping-pipeline/submission_Shell_scripts/marcc_reinstall_taxdat.sh
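
You can track these installation jobs with the standard Slurm commands, for example:

squeue -u $USER      # jobs still queued or running
sacct -X -u $USER    # today's jobs and their final states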

Every time

Create an sbatch script

SBATCH arguments

#!/bin/bash
#SBATCH --job-name <descriptive name>
#SBATCH --time=<how long to run max>
#SBATCH --mem=<amount of memory>
#SBATCH --ntasks=<number of chains>
#SBATCH --partition=<defq or highmem>
#SBATCH --account=aazman1
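
As a filled-in example (the values here are only illustrative; the time, memory, and partition choices follow the suggested headers at the bottom of this page):

#!/bin/bash
#SBATCH --job-name stan_CIV_2016_2020
#SBATCH --time=2-23:59:59
#SBATCH --mem=10G
#SBATCH --ntasks=4
#SBATCH --partition=defq
#SBATCH --account=aazman1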

Set up modules

export GCC_VERSION=9.3.0
export R_VERSION=4.0.2
module purge

ml gcc/$GCC_VERSION
ml openmpi
ml gdal
ml r/$R_VERSION
ml r-magrittr
ml r-optparse
ml r-yaml
ml r-rprojroot
ml r-purrr
ml r-jsonlite
ml r-dplyr
ml r-tidyverse
ml r-stringi
ml r-rstan

Set environment variables

export CHOLERA_ON_MARCC=TRUE
export CHOLERA_PIPELINE_DIRECTORY=/data/aazman1/$USER/cholera-mapping-pipeline/
export CHOLERA_CONFIG_DIRECTORY=/data/aazman1/$USER/cholera-configs/no_covariate_production_2016_2020/2016_2020_country
export CHOLERA_CONFIG=$CHOLERA_CONFIG_DIRECTORY/config_CIV_2016_2020.yml
export R_LIBRARY_DIRECTORY=$HOME/rlibs/cmp/$R_VERSION/gcc/$GCC_VERSION/
export CMDSTAN_LOCATION=/data/aazman1/$USER/cmdstan

Actually run something

Rscript -e "library(withr, lib.loc='$R_LIBRARY_DIRECTORY'); library(processx, lib.loc='$R_LIBRARY_DIRECTORY'); cmdstanr::set_cmdstan_path('$CMDSTAN_LOCATION'); source('Analysis/R/set_parameters.R')"
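
With the header, module loads, environment variables, and the Rscript call above saved into a single Shell script, the usual Slurm workflow is to submit it and then follow its log (the script name here is just a placeholder):

sbatch run_mapping_CIV_2016_2020.sh   # hypothetical script name
tail -f slurm-<jobid>.out             # slurm-<jobid>.out is Slurm's default output file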

Use pre-made Shell scripts for job submission

You can find all the pre-made Shell scripts in the submission_Shell_scripts folder and submit the one you need.
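
For example (same directory as in the installation steps above; pick whichever script matches the job you want to run):

cd /data/aazman1/$USER/cholera-mapping-pipeline/submission_Shell_scripts
ls                      # list the available submission scripts
sbatch <script name>.sh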

Optimize CPU and memory use

Our experience

The Rockfish technicians can generally extend the 3-day time limit to 5 or 6 days only once, so we can't keep asking for longer run times. It is therefore worth continually optimizing our workflow and programs so that we rarely exceed the time limit.

It is also worth checking the CPU and memory efficiency of completed runs to learn how many cores and how much memory the programs actually need. Consider using the following command:

seff <slurm-job-id>
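
If seff is not available, sacct reports similar information from the Slurm accounting database, for example:

sacct -j <slurm-job-id> --format=JobID,Elapsed,TotalCPU,AllocCPUS,MaxRSS,ReqMem,State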

To help gauge how much resource a Stan model run of the mapping pipeline needs, here are the efficiency statistics for previously completed runs (PTC 2023 ID-01 RUN):

| Run | Nodes | Cores per node | CPU efficiency | Memory efficiency |
| --- | --- | --- | --- | --- |
| CMR ID-01 2011-2015 | 1 | 4 | 98.17% | 20.78% of 30.00 GB |
| CMR ID-01 2016-2020 | 1 | 4 | 96.65% | 17.18% of 30.00 GB |
| KEN ID-01 2011-2015 | 1 | 4 | 71.66% | 25.43% of 30.00 GB |
| KEN ID-01 2016-2020 | 1 | 4 | 94.69% | 26.32% of 30.00 GB |
| MWI ID-01 2016-2020 | 1 | 4 | 82.78% | 3.11% of 30.00 GB |
| UGA ID-01 2011-2015 | 1 | 4 | 97.38% | 14.54% of 30.00 GB |
| UGA ID-01 2016-2020 | 1 | 4 | 98.90% | 11.22% of 30.00 GB |

Suggested Shell script headers

The memory efficiency figures above suggest that 10 GB of memory should suffice when running the Stan models.

The Rockfish technicians initially suggested using all 48 CPUs for each run, but this later proved unhelpful. The test run 13888447_1 used all CPUs by submitting a Shell script with the following header settings.

#!/bin/bash
#SBATCH --job-name multi_cpu
#SBATCH --time=2-23:59:59
#SBATCH --mem=12G
#SBATCH --array=1%1           #if running one job at a time
#SBATCH --ntasks=1            #clarify that there's only one job to run
#SBATCH --cpus-per-task=48    #for each job, how many cores/CPUs should be used
#SBATCH --partition=defq
#SBATCH --account=aazman1

However, this run was only about 5% complete after three days and had very low CPU efficiency, which indicates that the current Stan model cannot automatically make use of more than 4 cores for parallel computation:

Job ID: 13888447
Array Job ID: 13888447_1
Cluster: slurm
User/Group: kzou7/aazman1
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 48
CPU Utilized: 00:00:03
CPU Efficiency: 0.00% of 144-00:22:24 core-walltime
Job Wall-clock time: 3-00:00:28
Memory Utilized: 2.90 GB
Memory Efficiency: 24.15% of 12.00 GB

Using 4 CPUs/cores is still recommended, however, because the Stan model runs 4 chains in parallel. Based on this experience, the following Shell script header is suggested for running the Stan model:

#!/bin/bash
#SBATCH --job-name submit_multiple
#SBATCH --time=2-23:59:59
#SBATCH --mem=10G
#SBATCH --cpus-per-task=4  #how many cores to use
#SBATCH --array=0-43%44    #how many parallel runs 
#SBATCH --partition=defq
#SBATCH --account=aazman1
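
For reference, a common way to pair an --array header like this with the per-country configs is to let each array task pick one config file by its index. The following is only a minimal sketch under the directory layout used above (it is not the project's own submission script); the module loads, remaining environment variables, and Rscript call from earlier sections still need to be added:

#!/bin/bash
#SBATCH --job-name submit_multiple
#SBATCH --time=2-23:59:59
#SBATCH --mem=10G
#SBATCH --cpus-per-task=4
#SBATCH --array=0-43%44
#SBATCH --partition=defq
#SBATCH --account=aazman1

# Directory of per-country configs (as set earlier on this page)
CONFIG_DIR=/data/aazman1/$USER/cholera-configs/no_covariate_production_2016_2020/2016_2020_country

# Let array task i run the i-th config file (0-based, matching --array=0-43)
CONFIGS=($CONFIG_DIR/*.yml)
export CHOLERA_CONFIG=${CONFIGS[$SLURM_ARRAY_TASK_ID]}

echo "Array task $SLURM_ARRAY_TASK_ID will run config $CHOLERA_CONFIG"
# ... module loads, other exports, and the Rscript call from the sections above go here ...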