Submitting maps on the MARCC (old) - HopkinsIDD/cholera-mapping-pipeline GitHub Wiki

Here are step-by-step directions for submitting maps on the MARCC. Currently the data must be pulled on idmodeling and copied over to the MARCC, where only the Stan model is run.

If you don't have an account yet, you should start by requesting one here.

Log in to the MARCC from the terminal

Follow login directions here.

ssh -X login.marcc.jhu.edu -l [email protected]

Module setup

The file system on MARCC has a scratch/ and data/ directories which are user-specific, and a work directory that is accessible by all users in the group (more on the file system here). For now all mapping-related runs have been done from the work/cholera-mapping-pipeline directory.

MARCC uses modules to make software packages available; each module is loaded individually with the command ml (as described here). Some R packages need specific versions to work together (sf, rgdal, rgeos, etc.), and MARCC maintainers have prepared specific setups for them. To install and use the packages required by taxdat, load the following modules:

ml stack/0.3
ml r-sf
ml r-rgdal
ml r-curl
ml r-mgcv
ml r-purrr
ml r-magrittr
ml r-ggplot2
ml r-openssl
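
Because this module list is needed in more than one place, it can help to keep it in a single helper. The sketch below is only an illustration: load_mapping_modules is a hypothetical name (not part of MARCC or the pipeline), and the function simply calls ml once per module from the list above, stopping with a message if ml is not available.

```shell
# Modules from the list above, space-separated so a plain for-loop can walk them.
MAPPING_MODULES="stack/0.3 r-sf r-rgdal r-curl r-mgcv r-purrr r-magrittr r-ggplot2 r-openssl"

# load_mapping_modules is a hypothetical helper name used for illustration.
load_mapping_modules() {
  # ml is provided by the cluster's module system; stop if it is missing.
  if ! command -v ml >/dev/null 2>&1; then
    echo "ml not found: are you on a MARCC node?" >&2
    return 1
  fi
  for module in $MAPPING_MODULES; do
    ml "$module" || { echo "Failed to load $module" >&2; return 1; }
  done
}
```

The file defining the function has to be sourced (not executed) so that the loaded modules affect the current shell; after that, calling load_mapping_modules replaces typing the nine ml lines.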

First time installation

The very first time you use MARCC you will have to install the packages needed to run the model. rstan requires the package V8, which needs an additional step. Package installation must be done in an interactive job session (not directly on the login node). The steps to install R packages are:

Request an interactive job (this is done on the debug partition to avoid wait time):

interact -p debug -c 4 -t 1:00:0

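Load the modules needed for installation (this list differs slightly from the one under Module setup; for example, it adds r-devtools):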

ml stack/0.3
ml r-sf
ml r-rgdal
ml r-curl
ml r-mgcv
ml r-purrr
ml r-magrittr
ml r-stringr
ml r-tibble
ml r-openssl
ml r-devtools

Set the environment variable that allows V8 to be installed:

export DOWNLOAD_STATIC_LIBV8=1

Enter R and install packages needed to run run_stan_model.R (TODO: check if list is complete)

R
install.packages(c("dplyr","rstan","ISOCodes","tidyr","httr","optparse","yaml","lubridate"))
devtools::document("packages/taxdat")
install.packages("packages/taxdat", type = "source", repos = NULL)
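
Since the package list above is flagged as possibly incomplete, a quick check from the shell can confirm that everything installed so far at least loads. This is a sketch under assumptions: check_r_packages is a hypothetical helper, and the list it tests is just the packages named in the install step plus taxdat.

```shell
# Packages named in the install step above, plus taxdat itself.
REQUIRED_PKGS="dplyr rstan ISOCodes tidyr httr optparse yaml lubridate taxdat"

# check_r_packages is a hypothetical helper: returns 0 if every package is found,
# 1 if any is missing, and 2 if Rscript is not on the PATH.
check_r_packages() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found: load the modules first" >&2
    return 2
  fi
  # requireNamespace() returns FALSE instead of stopping when a package is absent.
  Rscript -e '
    pkgs <- commandArgs(trailingOnly = TRUE)
    ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    if (!all(ok)) {
      cat("Missing:", paste(pkgs[!ok], collapse = ", "), "\n")
      quit(status = 1)
    }
    cat("All", length(pkgs), "packages found\n")
  ' $REQUIRED_PKGS
}
```

Run check_r_packages inside the interactive session after loading the modules; a "Missing:" line lists what still needs to be installed.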

Set up git-lfs for your MARCC account in the Linux terminal. This step needs to be performed one time for each user.

mkdir -p ~/bin
scp /scratch/users/jcombar1/JUNK06/usr/bin/git-lfs $HOME/bin
git lfs install

You should have received a return statement that git LFS has been initialized. If you're not sure, you can try typing git lfs. If the program installed correctly, you should receive some information about how to use this command. If the program did not install correctly, you should receive a statement that git lfs is not a valid command.
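
The check described above can also be scripted. The sketch below assumes that git locates the git-lfs binary through the PATH, so it reports whether ~/bin is on the PATH and whether git lfs version succeeds; the variable names are illustrative only.

```shell
# Is ~/bin on the PATH? (The copy step above puts git-lfs there.)
case ":$PATH:" in
  *":$HOME/bin:"*) bin_on_path=yes ;;
  *)               bin_on_path=no ;;
esac

# "git lfs version" only succeeds when git can find the git-lfs binary.
if git lfs version >/dev/null 2>&1; then
  lfs_status=installed
else
  lfs_status=missing
fi

echo "~/bin on PATH: $bin_on_path; git-lfs: $lfs_status"
if [ "$bin_on_path" = no ]; then
  echo 'Consider adding this line to ~/.bash_profile: export PATH="$HOME/bin:$PATH"'
fi
```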

If they do not already exist in the work/git folder, clone the cholera mapping GitHub repositories into work/git. This only needs to be done once per group account (i.e. once for all of the IDD group), as the work folder is shared across all accounts in the same group. Below are the commands for cloning via SSH. Replace the git@github.com:... arguments with the appropriate target if you intend to clone via HTTPS instead. (See the GitHub docs on cloning a repository for more details.)

Note that the cholera-mapping-output repo is quite large and may take a long time to clone.

git clone git@github.com:hopkinsidd/cholera-mapping-pipeline.git
git clone git@github.com:hopkinsidd/cholera-mapping-output.git cholera-mapping-pipeline/Analysis/data
git clone git@github.com:hopkinsidd/cholera-configs.git cholera-mapping-pipeline/Analysis/configs

28 April 2021 update: cholera-mapping-output is in the upper level work directory and there is a symbolic link in git/cholera-mapping-pipeline/Analysis/data
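
To confirm that the symbolic link mentioned in the update is in place, something like the following can be used. It is a minimal sketch: check_data_link is a hypothetical helper that succeeds only when the given path is a symbolic link whose target exists.

```shell
# check_data_link is a hypothetical helper used for illustration.
check_data_link() {
  link_path=$1
  # -L tests that the path itself is a link; -e follows it and tests the target.
  if [ -L "$link_path" ] && [ -e "$link_path" ]; then
    echo "$link_path -> $(readlink "$link_path")"
  else
    echo "$link_path is not a working symbolic link" >&2
    return 1
  fi
}

# Example, run from the work directory (path taken from the note above):
# check_data_link git/cholera-mapping-pipeline/Analysis/data
```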

Every-time setup

Load the modules listed under the Module setup section above.
Update all GitHub repositories and check that they are on the correct branches.
Enter R from work/git/cholera-mapping-pipeline and reinstall taxdat. (N.B. there are two folders with similar names: work/git/cholera-mapping-pipeline is the correct one and work/cholera-mapping-pipeline is old.)

R
devtools::document("packages/taxdat")
install.packages("packages/taxdat", type = "source", repos = NULL)
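
The repository step of the every-time setup (update, then check branches) can be sketched as two small shell helpers. Both names, show_branch and update_repo, are hypothetical, and the example paths simply follow the locations described on this page; which branch counts as "correct" is left for you to judge from the output.

```shell
# show_branch is a hypothetical helper: prints "<dir>: <branch>" for one clone.
show_branch() {
  repo_dir=$1
  # The command substitution runs in a subshell, so the cd does not leak out.
  branch=$(cd "$repo_dir" 2>/dev/null && git rev-parse --abbrev-ref HEAD 2>/dev/null) || {
    echo "$repo_dir: not a readable git repository" >&2
    return 1
  }
  echo "$repo_dir: $branch"
}

# update_repo is also hypothetical: fast-forward only, so the pull stops
# rather than merging when the local branch has diverged.
update_repo() {
  ( cd "$1" && git pull --ff-only )
}

# Example, run from the work directory:
# for repo in git/cholera-mapping-pipeline git/cholera-mapping-pipeline/Analysis/configs cholera-mapping-output; do
#   update_repo "$repo" && show_branch "$repo"
# done
```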

Submitting jobs

MARCC uses the SLURM queuing system. The bash script below runs each config file as a separate job array element. You only need to replace <userid> with your MARCC user id and, if necessary, adjust the array specification to match the number of configuration files in the directory (note that indexing starts at 0).

The largest partition on MARCC is shared, but it has a job runtime limit of 72 h. To run longer jobs, use the unlimited partition (more information on partitions here).

#!/bin/bash
#SBATCH --job-name=run_stan_%j.job
#SBATCH --output=cholera-mapping-output/logs/run_stan_%A_%a.log
#SBATCH --time=3-00:00
#SBATCH --mem=10G
#SBATCH -c 4
#SBATCH --array=0-70
#SBATCH -p shared

ml stack/0.3
ml r-sf
ml r-rgdal
ml r-curl
ml r-mgcv
ml r-purrr
ml r-magrittr
ml r-stringr
ml r-tibble

echo "Beginning of script"
date
TAXDIR='/home-2/<userid>/work/git/cholera-mapping-pipeline'
CONFIGDIR=$TAXDIR/Analysis/configs
CONFIG_SUBDIR=single_year_configs
# R launcher (assumed to be the Rscript provided by the loaded modules)
RSCRIPT=Rscript
export LTO=-pg

cd "$TAXDIR" || exit 1

# Sort the config list so every array task sees the same index-to-file mapping
CONFIGS=($(find "$CONFIGDIR/$CONFIG_SUBDIR" -type f | sort))
if [ "$SLURM_ARRAY_TASK_COUNT" -eq "${#CONFIGS[@]}" ]
then
  # One config per array task, selected by the task's index
  CONFIG=${CONFIGS[$SLURM_ARRAY_TASK_ID]}
  echo "$RSCRIPT" "$TAXDIR/Analysis/R/set_parameters.R" -c "$CONFIG"
  "$RSCRIPT" "$TAXDIR/Analysis/R/set_parameters.R" -c "$CONFIG" -d "$TAXDIR" -l "$TAXDIR/Layers" -m TRUE || exit 1
else
  # The array size does not match the number of config files: stop here
  echo "Expected $SLURM_ARRAY_TASK_COUNT configs, found ${#CONFIGS[@]}"
  exit 1
fi

echo "End of script"
date
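
Because the array specification has to match the number of config files, it can be computed at submission time instead of being edited by hand. The sketch below assumes two hypothetical helpers, list_configs and array_spec, and a hypothetical file name run_stan.sh for the job script. An --array option given on the sbatch command line takes precedence over the #SBATCH --array line in the script. The index preview matches what the job runs only if the job script lists the configs in the same sorted order.

```shell
# list_configs prints the config files in a stable, sorted order, one per line.
list_configs() {
  find "$1" -type f | sort
}

# array_spec is a hypothetical helper: prints "0-(N-1)" for the N configs found.
array_spec() {
  # tr removes the padding some wc implementations add around the count.
  n=$(list_configs "$1" | wc -l | tr -d ' ')
  if [ "$n" -eq 0 ]; then
    echo "No config files in $1" >&2
    return 1
  fi
  echo "0-$((n - 1))"
}

# Example (replace <userid>; run_stan.sh is a hypothetical script name):
# CONFIG_PATH=/home-2/<userid>/work/git/cholera-mapping-pipeline/Analysis/configs/single_year_configs
# list_configs "$CONFIG_PATH" | awk '{ print NR - 1 ": " $0 }'   # preview index -> config
# sbatch --array="$(array_spec "$CONFIG_PATH")" run_stan.sh
```

For a directory with 71 config files, array_spec prints 0-70, which is the range hard-coded in the script above.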

Below is the old SLURM script that was used. Note that $1 refers to a command line argument to specify the config directory.

#!/bin/bash
#SBATCH --job-name=run_stan_%j.job
#SBATCH --output=logs/run_stan_%A_%a.log
#SBATCH --time=7-00:00
#SBATCH --mem=10G
#SBATCH -c 4
#SBATCH --array=0-45
#SBATCH -p shared

ml stack/0.3
ml r-sf
ml r-rgdal
ml r-curl
ml r-mgcv
ml r-purrr
ml r-magrittr
ml r-stringr
ml r-tibble

echo "Beginning of script"
date
TAXDIR='/home-2/<userid>/work/git/cholera-mapping-pipeline'
CONFIGDIR=$1
export LTO=-pg

cd $TAXDIR
# CONFIGNAMES=$(ls $TAXDIR/$CONFIGDIR)
CONFIGNAMES=($(ls $TAXDIR/$CONFIGDIR | tr ' ' '\n'))

echo Rscript $TAXDIR/Analysis/R/set_parameters.R -c $TAXDIR/$CONFIGDIR/${CONFIGNAMES[$SLURM_ARRAY_TASK_ID]}
Rscript $TAXDIR/Analysis/R/set_parameters.R -c $TAXDIR/$CONFIGDIR/${CONFIGNAMES[$SLURM_ARRAY_TASK_ID]} -d $TAXDIR -l $TAXDIR/Layers -m TRUE || exit 1
echo "End of script"
date