RCC Guide - uchicago-bfi-gnlab/lab_manual GitHub Wiki
Your main reference for RCC documentation is the RCC user guide. Please consult this guide whenever you have more specific questions not addressed here.
- Request an account here. Your account should be associated with `pi-ganong`.
- Connect to the RCC (`midway2` will be your default if you are using Peter's allocation) using SSH as described here.
- Clone the repo in your SSH session (SSH guide).
- Note that you cannot permanently log into your GitHub account in the RCC. Every time you clone, pull, or push something in the RCC, you will be prompted for your GitHub username and password. In the password field, provide your secure access token.
- You are now in a login node in the RCC. This node is not where you will run jobs. You can set up your Python and R environments and packages from here.
- To start Python, type `module load python` in your login node, and then create your environment from there. For an example environment, you can copy the RDFO environment from `rdfo_env.yml`.
- Midway has a number of R packages pre-installed. Note that you cannot update packages from within an interactive session. Instead, launch an R session from a login node and install packages there as needed. For example, to update dplyr:
module load R
R
install.packages("dplyr")
q()
- Do not try to run parallelised computations from within RStudio in an interactive session. This will just lead to memory errors.
- Your home directory is `/home/<CNetId>`.
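Because packages have to be installed from a login node rather than from inside a job, a setup script can check where it is running before it installs anything. The following is a minimal sketch, assuming that Slurm sets the `SLURM_JOB_ID` environment variable inside interactive and batch jobs (standard Slurm behavior, but worth confirming on Midway). The function name `on_login_node` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds when the shell is NOT inside a Slurm job.
# Assumption: Slurm exports SLURM_JOB_ID inside interactive and batch jobs.
on_login_node() {
    [ -z "${SLURM_JOB_ID:-}" ]
}

if on_login_node; then
    echo "Login node: OK to install R or Python packages here."
else
    echo "Inside Slurm job ${SLURM_JOB_ID}: install packages from a login node instead."
fi
```

This only inspects the environment, so it is safe to run anywhere, including on your own machine.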
There are two types of jobs in the RCC: interactive jobs and batch jobs. Interactive jobs let you use RStudio or Jupyter Notebook on the RCC in the same way that you would on a local computer.
- Start an interactive job with the command `sinteractive --account=pi-ganong`. You can further customize parameters for the request. Details on interactive jobs here.
- All our computing units are requested for `pi-ganong`, so this is the only option for `--account`.
- You can request more memory, but if you need more than 64GB you also need to specify a partition. See here for the different partitions available on the RCC nodes.
- The time is specified as hh:mm:ss. The maximum time that can be requested is 32h. There is no cost to requesting more time than needed, as the counter stops as soon as you finish the interactive session, but keep in mind that queuing time will increase.
- Once the terminal stops loading you are in an interactive session. Now we need to load the programs we will use.
- Note: bigger requests will land you in the queue for longer
- To start RStudio, first enter the command `module load rstudio`, and then start RStudio with `rstudio`. RStudio can now be used as on your personal desktop.
- To start Jupyter, run the following commands in your interactive session terminal (replacing the path to your Python environment). This will generate two links. Copy and paste these URLs into your browser: one of them will open your Jupyter session. Note: you must be using the campus Wi-Fi or VPN for this to work.
module load python
source activate <your_env_path>
HOST_IP=`/sbin/ip route get 8.8.8.8 | awk '{print $7;exit}'`
jupyter-notebook --no-browser --ip=$HOST_IP
- To start Stata, the commands are `module load xstata` and `xstata`.
- Note: interactive sessions will stop running after some amount of time if they detect you are "idle" (not actively using the window where code is running). Be mindful of this, and try to run longer workflows in batch mode when you can.
- Typing `exit` in the terminal will close the interactive session.
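The rules above for requesting an interactive session (time in hh:mm:ss, an extra partition for requests above 64GB) can be sketched as a small helper that prints the command instead of running it. This is an illustration only: the function name `build_sinteractive_cmd` is hypothetical, it assumes `sinteractive` accepts `--time`, `--mem` and `--partition` in the same way as `sbatch`, and it assumes `bigmem2` is the right partition for large-memory requests.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) an sinteractive command for a given
# memory request in GB and a time limit in hh:mm:ss.
# Assumption: sinteractive accepts --time, --mem and --partition like sbatch.
build_sinteractive_cmd() {
    local mem_gb="$1"
    local time_limit="$2"

    # Reject time limits that are not in hh:mm:ss form.
    case "$time_limit" in
        [0-9][0-9]:[0-5][0-9]:[0-5][0-9]) ;;
        *)
            echo "time must be hh:mm:ss, got: $time_limit" >&2
            return 1
            ;;
    esac

    local cmd="sinteractive --account=pi-ganong --time=${time_limit} --mem=${mem_gb}G"

    # Requests above 64GB also need a partition (assumed here: bigmem2).
    if [ "$mem_gb" -gt 64 ]; then
        cmd="${cmd} --partition=bigmem2"
    fi
    echo "$cmd"
}

# Print two example requests: a small one and a large-memory one.
build_sinteractive_cmd 16 02:00:00
build_sinteractive_cmd 128 08:00:00
```

Printing the command first makes it easy to check the request before it enters the queue; copy the printed line into the terminal to actually start the session.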
Batch jobs on the RCC are scheduled by using a batch script in a .sh file. An example of such a script is here. More documentation on batch jobs here. These scripts always have a header that specifies the parameters of the job. Those are
- `--job-name`: the name that will be used when sending the exit code (more on exit codes below)
- `--account`: same as for interactive jobs, the account which is used to pay for the computing units
- `--time`: the time that should be allocated to this job. The maximum is 36 hours
- `--partition`: this will be `broadwl` for most jobs and `bigmem2` for anything with 64GB or more memory
- `--nodes`: the number of nodes requested; we always use only one
- `--mail-type`: which emails should be sent out. With `ALL` you will get two emails, one when the job starts (after queuing) and one when the job finishes.
- `--mem`: the memory requested
- Most of the time the only two arguments that need to be adjusted are `--time` and `--mem`.
- In the linked script above we source file paths from `analysis/source/CRISM/sh_scripts/file_names.sh`. This is done to simplify changing directories. This one file controls the paths in all R scripts.
- Similar to interactive jobs, we need to load the software that we want to use. In this case, we are using `R`, so we run `module load R`. Note that we don't need RStudio here since we aren't using any user interface.
- The actual running of a script is done by running, for example, `Rscript --quiet --no-restore --no-save analysis/source/CRISM/R_scripts/14_ph_reg_data.R $scratch_dir $crism_data $corelogic_dir $output_dir`.
- The job can then be started from the terminal by using the command `sbatch strategic/analysis/source/CRISM/sh_scripts/R_run_ph_reg.sh`. This will return `Submitted batch job <batch job number>` in the terminal.
- Note that while three scripts are called from the one example batch file, this is still only one job.
- Once the job has started, an email will be sent to your `uchicago.edu` email with a subject such as `Slurm Job_id=18964351 Name=ph_reg Began, Queued time 00:00:46`.
- Sometimes the queued time will be very short, as here, but depending on the resources requested (time and memory) and usage of Midway, queuing can take multiple hours.
- Once the job is done you will get an email with `Slurm Job_id=18964351 Name=ph_reg Ended, Run time 19:06:45, COMPLETED, ExitCode 0`.
- If you are calling many batch scripts that need to be executed in a specific order (for example, to construct data), it is possible to write a shell script that calls batch scripts like here.
- This is a shell script and not a batch script, so this file is run with `sh analysis/source/CRISM/master_default.sh`.
- The simplest use case for a batch job is a script that takes a very long time and that you want to run in the background on the RCC rather than on your own machine. Assume you are working on `uieip` and need to run a job for about (but less than) 12 hours. Then you would write a batch script of this form:
#!/bin/bash
#SBATCH --job-name=full_run_of_cps_fun
#SBATCH --account=pi-ganong
#SBATCH --time=12:00:00
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --mail-type=ALL
#SBATCH --mem=32G
module load R
Rscript --quiet --no-restore --no-save issues/issue_1_cps_fun/full_cps_fun.R
- Then save this file as `issues/issue_1_cps_fun/run_full.sh` and run `sbatch issues/issue_1_cps_fun/run_full.sh`.
- Note that you can submit many jobs at once. There is no need to wait until a job has finished to submit another one.
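Since `--time` and `--mem` are usually the only header values that change between jobs, the batch script can also be generated from a template. The following is a minimal sketch under that assumption; the function name `write_batch_script` and its argument order are hypothetical, and the remaining header values simply repeat the example from this guide.

```shell
#!/bin/bash
# Hypothetical helper: write a batch script for one R script, varying only
# the job name, time limit and memory. The other header values repeat the
# example in this guide.
write_batch_script() {
    local job_name="$1"   # e.g. full_run_of_cps_fun
    local time_limit="$2" # hh:mm:ss
    local mem="$3"        # e.g. 32G
    local r_script="$4"   # path to the R script to run
    local out_file="$5"   # where to save the batch script

    cat > "$out_file" <<EOF
#!/bin/bash
#SBATCH --job-name=${job_name}
#SBATCH --account=pi-ganong
#SBATCH --time=${time_limit}
#SBATCH --partition=broadwl
#SBATCH --nodes=1
#SBATCH --mail-type=ALL
#SBATCH --mem=${mem}
module load R
Rscript --quiet --no-restore --no-save ${r_script}
EOF
}

# Example: write the script to a temporary file and print it for review.
tmp_script="$(mktemp)"
write_batch_script full_run_of_cps_fun 12:00:00 32G \
    issues/issue_1_cps_fun/full_cps_fun.R "$tmp_script"
cat "$tmp_script"
# On the RCC you would then submit it with: sbatch "$tmp_script"
```

The sketch only writes and prints the file, so it can be tried on any machine; submission with `sbatch` is left as a manual step.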
- In case you don't receive an email with `COMPLETED, ExitCode 0`, you will get an email with `FAILED, ExitCode X`. Depending on the exit code, the debugging method can be quite different.
- The most common failed exit code is `FAILED, ExitCode 1`. This is a generic code and most of the time means there is a problem in the script that you tried to run.
- You can save console output and error output from a batch job by specifying `#SBATCH --output=<out_file_name>.out` for the console output and `#SBATCH --error=<err_file_name>.err` for the error output.
- Note that, by default, Python buffers console output, so you would not be able to check your `.out` file in real time. To bypass this, run a script with `python -u <my_script>.py`, which forces console output to be written to your `.out` file in real time.
- Another common exit code is `FAILED, ExitCode 135` or `FAILED, ExitCode 139`. This typically means that your job ran out of memory. The simplest solution here is to restart the job with more memory. Sometimes this error will show up as `OUT_OF_MEMORY` without any ExitCode, or as the more generic `NODE_FAIL, ExitCode 0`. Note that `NODE_FAIL, ExitCode 0` can also mean something else went wrong.
- Another type of exit code is `TIMEOUT, ExitCode 0`. As the name suggests, the job stopped before finishing because it ran out of time. This error can show up with another exit code, but you can usually recognize it because the run time is close to the requested time, for example: `Slurm Job_id=15065566 Name=ph_reg Failed, Run time 18:00:01, TIMEOUT, ExitCode 0`.
- For any exotic exit codes, you can always send a message to RCC support or google "exit code X slurm".
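The exit-code rules above can be summarized as a small lookup on the subject line of the Slurm email. This is a sketch for illustration: the function name `slurm_hint` and the wording of the hints are hypothetical, and the mapping only restates the cases described in this guide.

```shell
#!/bin/bash
# Hypothetical helper: map the state in a Slurm email subject to the
# debugging hint given in this guide.
slurm_hint() {
    local subject="$1"
    case "$subject" in
        *"COMPLETED, ExitCode 0")
            echo "ok: job finished" ;;
        *TIMEOUT*)
            echo "timeout: resubmit with a larger --time" ;;
        *OUT_OF_MEMORY*|*"ExitCode 135"|*"ExitCode 139")
            echo "memory: resubmit with a larger --mem" ;;
        *NODE_FAIL*)
            echo "node failure: possibly memory, possibly something else" ;;
        *"FAILED, ExitCode 1")
            echo "script error: check the .out and .err files" ;;
        *)
            echo "unknown: contact RCC support or search for the exit code" ;;
    esac
}

# Example subjects taken from the emails quoted in this guide.
slurm_hint "Slurm Job_id=18964351 Name=ph_reg Ended, Run time 19:06:45, COMPLETED, ExitCode 0"
slurm_hint "Slurm Job_id=15065566 Name=ph_reg Failed, Run time 18:00:01, TIMEOUT, ExitCode 0"
```

The patterns are checked in order, so a timeout is reported as a timeout even though its exit code is 0.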
Private, see here