RCC Guide - ganong-noel/lab_manual GitHub Wiki

Getting started

  1. Request an account here. Your account should be associated with pi-ganong.
  2. Connect to the RCC using the guide here
  3. Clone the repo using SSH (GitHub guide). Although git-lfs is installed by default, you may need to run git lfs pull in repos that use git-lfs
  4. A guide on how to run jobs is here
  5. Midway has a number of R packages pre-installed. Note that you cannot update packages from within an interactive session. Instead, launch an R session from a login node and install packages there as needed. For example, to update dplyr:
module load R
R
install.packages("dplyr")
q()
  6. Do not try to run parallelised computations from within RStudio in an interactive session. This will just lead to memory errors and sadness.
  7. Your home directory is /home/<CNetId>
  8. For easier file transfers between the RCC and your computer, you can mount the RCC drive as a network drive on your computer. This lets you interact with files on the RCC exactly as you would with files on your own computer. Instructions here.

Working on the RCC

The easiest way to work on the RCC is through an interactive session. This enables one to use RStudio on the RCC in the same way that it is used on a local computer. For this, I would recommend connecting to the RCC with ThinLinc.

  1. Open and connect with ThinLinc (instructions here)
  2. Open the Terminal in ThinLinc: Applications>System Tools>Terminal
  3. In the terminal, navigate to the repo directory - for example cd repo/strategic. This will be your working directory once the interactive session starts.
  4. Start an interactive job with the command sinteractive --account=pi-ganong --mem=32g --time=04:00:00
    1. All our computing units are requested for pi-ganong, so this is the only option for this argument.
    2. You can request more memory, but if you need more than 64GB you must additionally specify --partition=bigmem2. The maximum memory that can be requested is 512GB.
    3. The time is specified as hh:mm:ss. The maximum time that can be requested is 32h. There is no cost to requesting more time than needed, since the counter stops as soon as you end the interactive session, but keep in mind that queuing time will increase.
  5. Once the terminal finishes loading, you are in an interactive session. Next, load the programs you will use.
  6. To start RStudio, first enter the command module load rstudio, then launch it with rstudio. RStudio can now be used as on your personal desktop.
  7. To start Stata, the commands are module load xstata and xstata.
  8. RStudio can be closed normally (like on a desktop). Typing exit in the terminal will close the interactive session.
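Collected in one place, the interactive-session workflow from the steps above looks like this (the repo path is an example; these commands only work in a terminal on Midway):

```shell
cd repo/strategic                                            # step 3: go to the repo
sinteractive --account=pi-ganong --mem=32g --time=04:00:00   # step 4: request a session
module load rstudio                                          # step 6: load the module
rstudio                                                      # step 6: launch RStudio
exit                                                         # step 8: end the session
```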

Running a job on the RCC

Jobs on the RCC are scheduled using a batch script. An example of such a script is here. These scripts always have a header that specifies the parameters of the job. These are:

  1. --job-name: the name that will be used in the emails reporting the job status (more on exit codes below)
  2. --account: same as for interactive jobs, the account used to pay for the computing units
  3. --time: the time that should be allocated to this job
  4. --partition: this will be broadwl for most jobs and bigmem2 for anything with 64GB or more memory
  5. --nodes: the number of compute nodes requested - we always use only one
  6. --mail-type: which emails should be sent out. With ALL you will get two emails: one when the job starts and one when it finishes.
  7. --mem: the memory requested

Most of the time the only two arguments that need to be adjusted are --time and --mem.
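As a sketch, a header for a hypothetical job needing more than 64GB of memory would combine these arguments like so (job name, time, and memory are made-up values; note the bigmem2 partition, required once --mem exceeds 64GB):

```shell
#!/bin/bash
#SBATCH --job-name=big_merge      # hypothetical job name
#SBATCH --account=pi-ganong
#SBATCH --time=08:00:00           # adjust to the expected run time
#SBATCH --partition=bigmem2      # required for memory above 64GB
#SBATCH --nodes=1
#SBATCH --mail-type=ALL
#SBATCH --mem=128G
```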

In the script linked above we source file paths from analysis/source/CRISM/sh_scripts/file_names.sh. This makes changing directories easy: this one file controls the paths in all R scripts.

Similar to interactive jobs, we need to load the software that we want to use. In this case, we are using R so we run module load R. Note that we don't need RStudio here since we aren't using any user interface.

The actual running of a script is done by running, for example, Rscript --quiet --no-restore --no-save analysis/source/CRISM/R_scripts/14_ph_reg_data.R $scratch_dir $crism_data $corelogic_dir $output_dir. The flags tell R to suppress startup messages and not to restore or save the workspace. The path to a file is always relative to the directory from which you execute the command, so in this case the terminal needs to be in /home/<CNetId>/repo/strategic to find the file analysis/source/CRISM/R_scripts/14_ph_reg_data.R. $crism_data $corelogic_dir $output_dir are variables sourced from file_names.sh and passed into the R script.
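A minimal sketch of this hand-off, with all paths and variable names replaced by hypothetical stand-ins: the batch script sources a variables file, then passes the values as positional arguments (which an R script would read with commandArgs(trailingOnly = TRUE)). Here a shell function stands in for the Rscript call so the sketch is self-contained:

```shell
#!/bin/bash
# Hypothetical stand-in for file_names.sh
cat > file_names.sh <<'EOF'
scratch_dir=/scratch/demo
crism_data=/project/demo/crism
EOF

# Pull the path variables into the current shell
source file_names.sh

# Stand-in for the Rscript call: just report the positional arguments received
print_args() { echo "arg1=$1 arg2=$2"; }
print_args "$scratch_dir" "$crism_data"
```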

The job can then be started from the terminal using the command sbatch strategic/analysis/source/CRISM/sh_scripts/R_run_ph_reg.sh. The terminal will print Submitted batch job <batch job number>. Note that while three scripts are called from the one batch file, this is still only one job.

Once the job has started, an email will be sent to your uchicago.edu address with a subject like Slurm Job_id=18964351 Name=ph_reg Began, Queued time 00:00:46. Sometimes the queued time will be very short, as here, but depending on the resources requested (time and memory) and the load on Midway it can take multiple hours. Once the job is done you will get an email with Slurm Job_id=18964351 Name=ph_reg Ended, Run time 19:06:45, COMPLETED, ExitCode 0.

If you are calling many batch scripts that need to be executed in a specific order (for example, to construct data), it is possible to write a shell script that calls batch scripts like here. This is a shell script and not a batch script, so this file is run with sh analysis/source/CRISM/master_default.sh.
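One common way to have Slurm itself enforce the order is to chain jobs with --dependency=afterok:<job id>, so each job starts only if the previous one succeeded. The sketch below is a dry run that only prints the sbatch commands such a master script would issue (the script names and job ids are made up); in a real script you would capture each id with prev_id=$(sbatch --parsable ...) instead of echoing:

```shell
#!/bin/bash
# Dry-run sketch of chaining batch jobs with Slurm dependencies.
prev_id=""
fake_id=100   # stands in for the ids sbatch --parsable would return
for script in 01_build.sh 02_clean.sh 03_analyze.sh; do
  if [ -z "$prev_id" ]; then
    # First job has nothing to wait for
    echo "sbatch --parsable $script"
  else
    # Later jobs start only after the previous job exits successfully
    echo "sbatch --parsable --dependency=afterok:$prev_id $script"
  fi
  prev_id=$fake_id
  fake_id=$((fake_id + 1))
done
```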

The simplest use case for a batch job is a script that takes a very long time and that you want to run in the background on the RCC, so it keeps running even when your own machine is off. Assume you are working on uieip and need to run a job for about (but less than) 12 hours. Then you would write a batch script of this form:

#!/bin/bash
#SBATCH --job-name=full_run_of_cps_fun
#SBATCH --account=pi-ganong
#SBATCH --time=12:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --mail-type=ALL
#SBATCH --mem=32G


module load R
Rscript --quiet --no-restore --no-save issues/issue_1_cps_fun/full_cps_fun.R

Then save this file as issues/issue_1_cps_fun/run_full.sh and run sbatch issues/issue_1_cps_fun/run_full.sh.

Note that you can submit many jobs at once. There is no need to wait for one job to finish before submitting another.

Debugging on the RCC

If you don't receive an email with COMPLETED, ExitCode 0, you will instead get one with FAILED, ExitCode X. Depending on the exit code, the debugging method can be quite different.

The most common failure is FAILED, ExitCode 1. This is a generic code and usually means there is a problem in the (R) script that you tried to run. By default, Midway writes log files that capture the output that would normally show up in the RStudio console. You can print the log file in your terminal with the command cat slurm-<batch job number>.out. The batch job number will also be in the email reporting the exit code. In many cases, the log file shows where the code crashed and the error message shows what went wrong. If this isn't helpful, a good next step is to start an interactive session and try to run the script on a small sample interactively.
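For long logs, grep can narrow things down to the error lines. The sketch below fabricates a short log file for illustration (the job number and log contents are made up; on Midway the slurm-<batch job number>.out file is created automatically in the directory you submitted from):

```shell
#!/bin/bash
# Fabricated log file standing in for the real Slurm output
printf 'Loading data...\nError in fread(path) : File does not exist\nExecution halted\n' > slurm-18964351.out

# -n prefixes each match with its line number, so you can jump to the crash
grep -n "Error" slurm-18964351.out
```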

Another common exit code is FAILED, ExitCode 135 or FAILED, ExitCode 139. This usually means that your job ran out of memory. The simplest solution is to restart the job with more memory. Sometimes this error shows up as OUT_OF_MEMORY without any exit code, or as the more generic NODE_FAIL, ExitCode 0. Note that NODE_FAIL, ExitCode 0 can also mean something else went wrong.

Another type of exit code is TIMEOUT, ExitCode 0. As the name suggests, the job stopped before finishing because it ran out of time. This failure can show up with another exit code, but you can usually recognize it because the run time is almost identical to the requested time, for example: Slurm Job_id=15065566 Name=ph_reg Failed, Run time 18:00:01, TIMEOUT, ExitCode 0.

For any exotic exit codes, you can always send a message to [email protected] or google "exit code X slurm".

Requesting computing units

Private, see here
