RCC Guide - uchicago-bfi-gnlab/lab_manual GitHub Wiki

Your main reference for RCC documentation is the RCC user guide. Please consult this guide whenever you have more specific questions not addressed here.

Getting started

  1. Request an account here. Your account should be associated with pi-ganong.
  2. Connect to the RCC (midway2 will be your default if you are using Peter's allocation) using SSH as described here
  3. Clone the repo in your SSH session (SSH guide).
    • Note that you cannot stay logged in to your GitHub account on the RCC. Every time you clone, pull, or push on the RCC, you will be prompted for your GitHub username and password. In the password field, enter your personal access token.
  4. You are now on a login node of the RCC. This node is not where you will run jobs. You can set up your Python and R environments and packages from here.
    • To start Python, type module load python in your login node, and then create your environment from there. For an example environment, you can copy the RDFO environment from rdfo_env.yml.
    • Midway has a number of R packages pre-installed. Note that you cannot update packages from within an interactive session. Instead, launch an R session from a login node and install packages there as needed. For example, to update dplyr:
module load R
R
install.packages("dplyr")
q()
  5. Do not try to run parallelised computations from within RStudio in an interactive session. This will just lead to memory errors.
  6. Your home directory is /home/<CNetId>.
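
The Python environment setup from step 4 can be sketched as below. This is a minimal sketch, not the lab's official procedure: it assumes that module load python makes conda available on the login node, and ENV_FILE, ENV_PATH, and the run helper are hypothetical placeholders. By default the script only prints the commands it would run; set DRY_RUN=0 to execute them.

```shell
#!/bin/bash
# Sketch: create a Python environment on a login node from an environment file.
# ENV_FILE and ENV_PATH are hypothetical placeholders; adjust them to your setup.
ENV_FILE="${ENV_FILE:-rdfo_env.yml}"
ENV_PATH="${ENV_PATH:-$HOME/envs/rdfo}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = also execute them

# Print a command, and execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Load Python (assumed to provide conda) and build the environment at ENV_PATH.
setup_env() {
    run module load python
    run conda env create --prefix "$ENV_PATH" --file "$ENV_FILE"
}

setup_env
```

Once the environment exists, its path is what you pass to source activate when starting Jupyter in an interactive session.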

Working on the RCC

There are two types of jobs on the RCC: interactive jobs and batch jobs. Interactive jobs let you use RStudio or Jupyter Notebook on the RCC in the same way as on a local computer. Batch jobs run a script in the background without a user interface.

Interactive Sessions

  1. Start an interactive job with the command sinteractive --account=pi-ganong. You can further customize parameters for the request. Details on interactive jobs here.
    • All our computing units are requested for pi-ganong so this is the only option for this argument
    • You can request more memory, but if you need more than 64GB you need to additionally specify a partition. See here for the different partitions available on the RCC nodes.
    • The time is specified by hh:mm:ss. The maximum time that can be requested is 32h. There is no cost in requesting more time than needed as the counter stops as soon as you finish the interactive session, but keep in mind that queuing time will increase.
  2. Once the terminal stops loading you are in an interactive session. Now we need to load the programs we will use.
    • Note: bigger requests will land you in the queue for longer
  3. To start RStudio, first run module load rstudio, then start RStudio with rstudio. RStudio can now be used as on your personal desktop.
  4. To start Jupyter, run the following commands in your interactive session terminal (replacing the path to your Python environment). This will generate two links. Copy and paste these URLs into your browser: one of them will open your Jupyter session. Note: you must be using the campus Wi-Fi or VPN for this to work.
module load python
source activate <your_env_path>
HOST_IP=`/sbin/ip route get 8.8.8.8 | awk '{print $7;exit}'`
jupyter-notebook --no-browser --ip=$HOST_IP
  5. To start Stata, the commands are module load xstata and xstata.
  6. Note: interactive sessions will stop running after some amount of time if they detect you are "idle" (not actively using the window where code is running). Be mindful of this, and try to run longer workflows in batch mode when you can.
  7. Typing exit in the terminal will close the interactive session.
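
A customized request from step 1 might be assembled as in the sketch below. build_sinteractive is a hypothetical helper, not an RCC tool, and the sketch assumes that sinteractive accepts the same --time, --mem, and --partition flags as the batch header. It only prints the command; run the printed command on a login node.

```shell
#!/bin/bash
# Sketch: assemble an sinteractive request with custom time, memory, and partition.
# build_sinteractive is a hypothetical helper; it prints the command, it does not run it.
build_sinteractive() {
    local walltime="$1" mem_gb="$2" partition="${3:-}"

    # The time must be given as hh:mm:ss.
    case "$walltime" in
        [0-9][0-9]:[0-5][0-9]:[0-5][0-9]) ;;
        *)
            echo "time must be hh:mm:ss, got: $walltime" >&2
            return 1
            ;;
    esac

    # Per the guide, requests above 64GB need an explicit partition.
    if [ "$mem_gb" -gt 64 ] && [ -z "$partition" ]; then
        echo "more than 64GB requested: specify a partition" >&2
        return 1
    fi

    local cmd="sinteractive --account=pi-ganong --time=$walltime --mem=${mem_gb}G"
    if [ -n "$partition" ]; then
        cmd="$cmd --partition=$partition"
    fi
    echo "$cmd"
}

# Example: a 4-hour session with 16GB of memory.
build_sinteractive 04:00:00 16
```

Keeping the request small where possible matters because, as noted above, bigger requests wait longer in the queue.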

Batch Jobs

Batch jobs on the RCC are scheduled with a batch script in a .sh file. An example of such a script is here. More documentation on batch jobs here. These scripts always start with a header that specifies the parameters of the job. The parameters are:

  1. --job-name: the name that will be used when sending the exit code (more on exit codes below)
  2. --account: same as for interactive jobs, the account that is used to pay for the computing units
  3. --time: the time that should be allocated to this job. The maximum is 36 hours
  4. --partition: this will be broadwl for most jobs and bigmem2 for anything with 64GB or more memory
  5. --nodes: the number of nodes requested - we always use only one
  6. --mail-type: which emails should be sent out. With ALL you will get two emails, one when the job starts (after queuing) and one when the job finishes.
  7. --mem: the memory requested
  • Most of the time the only two arguments that need to be adjusted are --time and --mem.
  • In the linked script above we source file paths from analysis/source/CRISM/sh_scripts/file_names.sh. This is done to simplify changing directories. This one file controls the paths in all R-scripts.
  • Similar to interactive jobs, we need to load the software that we want to use. In this case, we are using R so we run module load R. Note that we don't need RStudio here since we aren't using any user interface.
  • The script itself is run with, for example, Rscript --quiet --no-restore --no-save analysis/source/CRISM/R_scripts/14_ph_reg_data.R $scratch_dir $crism_data $corelogic_dir $output_dir.
  • The job can then be started from the terminal by using the command sbatch strategic/analysis/source/CRISM/sh_scripts/R_run_ph_reg.sh. This will return in the terminal Submitted batch job <batch job number>.
    • Note that while three scripts are called from the one example batch file, this is still only one job.
  • Once the job has started, an email will be sent to your uchicago.edu address with a subject such as Slurm Job_id=18964351 Name=ph_reg Began, Queued time 00:00:46.
    • Sometimes the queued time will be super short, like here, but depending on the resources requested (time and memory) and usage of Midway this can take multiple hours.
  • Once the job is done you will get an email with Slurm Job_id=18964351 Name=ph_reg Ended, Run time 19:06:45, COMPLETED, ExitCode 0.
  • If you are calling many batch scripts that need to be executed in a specific order (for example, to construct data), it is possible to write a shell script that calls batch scripts like here.
    • This is a shell script and not a batch script, so this file is run with sh analysis/source/CRISM/master_default.sh.
  • The simplest use case for a batch job is a script that takes a very long time and that you want to run in the background on the RCC rather than on your own machine. Suppose you are working on uieip and need to run a job for about (but less than) 12 hours. Then you would write a batch script of this form:
#!/bin/bash
#SBATCH --job-name=full_run_of_cps_fun
#SBATCH --account=pi-ganong
#SBATCH --time=12:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --mail-type=ALL
#SBATCH --mem=32G


module load R
Rscript --quiet --no-restore --no-save issues/issue_1_cps_fun/full_cps_fun.R
  • Then save this file as issues/issue_1_cps_fun/run_full.sh and run sbatch issues/issue_1_cps_fun/run_full.sh.
  • Note that you can submit many jobs at once. There is no need to wait until one job has finished before submitting another.
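
Because sbatch returns as soon as a job is queued, a shell script that submits several batch scripts does not by itself enforce their order. The sketch below shows one possible way to do so with Slurm job dependencies. It is an assumption for illustration, not necessarily how the linked master script works; extract_job_id and submit_chain are hypothetical helpers, and the batch script names are supplied by you on the command line.

```shell
#!/bin/bash
# Sketch: submit batch scripts so that each one starts only after the previous
# one has completed successfully.

# Extract the numeric job ID from sbatch's "Submitted batch job <id>" message.
extract_job_id() {
    echo "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Submit all scripts given as arguments, in order, as a dependency chain.
submit_chain() {
    local prev_id="" msg script
    for script in "$@"; do
        if [ -z "$prev_id" ]; then
            msg="$(sbatch "$script")" || return 1
        else
            # afterok: start only if the previous job finished with exit code 0.
            msg="$(sbatch --dependency=afterok:"$prev_id" "$script")" || return 1
        fi
        prev_id="$(extract_job_id "$msg")"
        echo "$script -> job $prev_id"
    done
}

# Only submit when Slurm is available and scripts were given on the command line.
if command -v sbatch >/dev/null 2>&1 && [ "$#" -gt 0 ]; then
    submit_chain "$@"
fi
```

Saved under a name of your choice, the script would be run from a login node with the batch scripts listed in the order they should execute. If an earlier job fails, the later jobs in the chain will not start.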

Debugging on the RCC

  • In case you don't receive an email with COMPLETED, ExitCode 0, you will get an email with FAILED, ExitCode X. Depending on the exit code the debugging method can be quite different.
  • The most common failed exit code is FAILED, ExitCode 1. This is a generic code and means most of the time there is a problem in the script that you tried to run.
    • You can save console output and error output from a batch job by specifying #SBATCH --output=<out_file_name>.out for the console output and #SBATCH --error=<err_file_name>.err for the error output.
    • Note that, by default, Python buffers console output and writes it in batches, so you cannot check your .out file in real time. To bypass this, run the script with python -u <my_script>.py, which forces console output to be written to your .out file in real time.
  • Another common exit code is FAILED, ExitCode 135 or FAILED, ExitCode 139. This will mean that your job ran out of memory. The simplest solution here is to restart the job with more memory. Sometimes this error will show up as OUT_OF_MEMORY without any ExitCode or the more generic NODE_FAIL, ExitCode 0. Note that NODE_FAIL, ExitCode 0 can also mean something else went wrong.
  • Another type of exit code is TIMEOUT, ExitCode 0. As the name suggests here the job stopped before finishing because it ran out of time. It can be that this error shows up with another exit code but you can usually recognize this failure by the similarity of run time and requested time, for example, like here: Slurm Job_id=15065566 Name=ph_reg Failed, Run time 18:00:01, TIMEOUT, ExitCode 0.
  • For any exotic exit codes, you can always send a message to [email protected] or google "exit code X slurm".
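
The rules of thumb above can be summarized in a small lookup, sketched below. explain_slurm_subject is a hypothetical helper that takes the subject line of a Slurm notification email and prints a hint; it encodes only the cases described in this section and is not a complete list of Slurm states.

```shell
#!/bin/bash
# Sketch: map the subject line of a Slurm notification email to a debugging hint.
# explain_slurm_subject is a hypothetical helper mirroring the rules of thumb above.
explain_slurm_subject() {
    local subject="$1"
    # Patterns are checked in order, so the memory codes 135 and 139 are
    # tested before the generic exit code 1.
    case "$subject" in
        *"COMPLETED, ExitCode 0")
            echo "Job finished successfully." ;;
        *TIMEOUT*)
            echo "Ran out of time: resubmit with a larger --time." ;;
        *OUT_OF_MEMORY*|*"ExitCode 135"|*"ExitCode 139")
            echo "Likely out of memory: resubmit with a larger --mem." ;;
        *NODE_FAIL*)
            echo "Node failure: possibly out of memory, but other causes are possible." ;;
        *"FAILED, ExitCode 1")
            echo "Generic failure: check the script and its .out and .err files." ;;
        *)
            echo "Unrecognized state: search for the exit code or ask RCC support." ;;
    esac
}

# Example: the timeout email quoted above.
explain_slurm_subject "Slurm Job_id=15065566 Name=ph_reg Failed, Run time 18:00:01, TIMEOUT, ExitCode 0"
```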

Requesting computing units

Private, see here
