Columbia HPC

Using Columbia's HPC

Ginsburg HPC Cluster User Guide

Registering for Ginsburg account

Please note that you are not automatically given access to the Ginsburg HPC cluster. If you want to access Ginsburg, please let Robbie know; he will write to the HPC administrator (cc'ing you) to request that you be granted access.

Logging In

You will need to use SSH (Secure Shell) to access the cluster. Windows users can use PuTTY or Cygwin. macOS users can use the built-in Terminal application or Cyberduck.

Users log in to the cluster's submit node, located at ginsburg.rcs.columbia.edu (or the shorter form burg.rcs.columbia.edu). If logging in from a command line, type:

$ ssh <UNI>@ginsburg.rcs.columbia.edu

OR

$ ssh <UNI>@burg.rcs.columbia.edu

Once prompted, you need to provide your usual Columbia password.

To log in with PuTTY, enter [email protected] as the host, then provide your password when prompted.
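Optionally, a host alias on your local machine saves typing. This is a minimal sketch of an OpenSSH ~/.ssh/config entry (the alias name burg is just an example; replace <UNI> with your UNI):

# ~/.ssh/config on your local machine (not on the cluster)
Host burg
    HostName burg.rcs.columbia.edu
    User <UNI>

After adding this, ssh burg is equivalent to the full ssh command above.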

Submit Account

You must specify your account whenever you submit a job to the cluster.

Account: ehsmsph (Environmental Health Sciences, Mailman School of Public Health)

It is important to switch from the login node (the node you originally log in to) to a compute node before running jobs. This command requests a 2-hour interactive session on a compute node under the ehsmsph account:

$ srun --pty -t 0-2:00 -A ehsmsph /bin/bash
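If you need more than the defaults for interactive work, you can request cores and memory explicitly. A sketch (the core count and memory here are illustrative values, not requirements):

$ srun --pty -t 0-2:00 -c 4 --mem=16G -A ehsmsph /bin/bash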

If you will use R, make sure an R module is loaded in your cluster environment. (Note that module load loads an existing installation; it does not install R.)

To see available versions: module avail R
To load R: module load R/4.4.2
To confirm: R --version
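Cluster users typically cannot write to the system R library, so R packages go into a personal library. A minimal sketch (the package name is just an example):

# in an R session on the cluster (after module load R/4.4.2)
install.packages("data.table", repos = "https://cloud.r-project.org")
# If the system library is not writable, R will offer to create a
# personal library (e.g., under ~/R/); accept the prompt.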

Linking your GitHub to the cluster environment

There are likely multiple ways to do this; this is what Wil did.

1. Generate an SSH key on the cluster:

$ ssh-keygen -t rsa -b 4096 -C "[email protected]"

When prompted: "Enter file in which to save the key" — just press Enter to accept the default: ~/.ssh/id_rsa You can optionally enter a passphrase (recommended for security) or leave it blank. You should now have: ~/.ssh/id_rsa → your private key (never share) ~/.ssh/id_rsa.pub → your public key (safe to share)

2. Add your public key to GitHub. First, copy the public key:

$ cat ~/.ssh/id_rsa.pub

Then:

Go to https://github.com/settings/keys
Click "New SSH key"
Title: something meaningful, like "Cluster SSH Key"
Key: paste the full output from the cat command
Click "Add SSH key"

3. Test the connection. Verify GitHub recognizes your key:

$ ssh -T [email protected]

If it works, you’ll see: Hi your-username! You've successfully authenticated...

4. Configure Git:

$ git config --global user.name "Your Full Name"
$ git config --global user.email "[email protected]"

5. Use GitHub! Clone repos using the SSH URL:

$ git clone [email protected]:your-username/your-repo.git

From the cluster you can now push:

$ git add .
$ git commit -m "Update results"
$ git push origin main

and pull:

$ git pull origin main

6. In R (locally), if not already done, set up the Git connection and clone the repo.

First, copy the SSH URL from GitHub: go to your GitHub repo (e.g., https://github.com/sparklabnyc/resources), click "Code", choose the SSH tab, and copy the URL: [email protected]:sparklabnyc/resources.git

Then, in RStudio, clone the repo: File → New Project → Version Control → Git

Repository URL: paste the SSH link ([email protected]:...)
Project Directory Name: e.g., resources
Create Project As Subdirectory Of: e.g., ~/Documents/code/
Click Create Project.

RStudio will:

Clone the GitHub repo
Create a new .Rproj file
Set up the Git tab in the RStudio toolbar

Now you should be able to commit and push locally from RStudio to GitHub; those changes can then be pulled onto the cluster.
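Putting it together, a typical round trip looks like this (a sketch; the repo path and branch name are examples):

# on your local machine (RStudio terminal or shell)
git add .
git commit -m "Update analysis script"
git push origin main

# on the cluster
cd ~/your-repo
git pull origin main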

Submit Shell Scripts

The shell script below runs an R script via Slurm. The output will be written to a log file in the output directory set in the script.

#!/bin/bash
# name_of_the_shellscript.sh
# Slurm script to run R program

#SBATCH -A ehsmsph               # Your group account name
#SBATCH -J my_r_job              # The job name (replace with your own)
#SBATCH -p serial_requeue        # Partition to submit to
#SBATCH -c 4                     # The number of CPU cores to use. Max 32.
#SBATCH -t 2-12:30               # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=12G        # The memory the job will use per CPU core

module load R/4.3.1

# Create output dir and a timestamp for the log file name
OUTPUT_DIR="/burg/home/path/to/your/project"
mkdir -p ${OUTPUT_DIR}
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Command to execute R code using srun
srun Rscript name_of_Rscript.R > ${OUTPUT_DIR}/routput_${TIMESTAMP} 2>&1
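For reference, name_of_Rscript.R can be any R script; a minimal sketch whose contents are entirely hypothetical, just confirming the job ran:

# name_of_Rscript.R
cat("Job started at:", format(Sys.time()), "\n")
# ... your analysis code here ...
cat("Job finished at:", format(Sys.time()), "\n")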

A further example script from Wil. You need to use the nano command in Linux to edit text and create an .sh script (it is recommended to type it out and create it in R first). This .sh script was created in nano, and it runs the R script tthm_ginsburg_Step1.R, so that R script needs to exist in your cluster environment; in Wil's case it was pulled from GitHub.

#!/bin/bash

#SBATCH -A ehsmsph
#SBATCH --job-name=tthm_step1
#SBATCH -N 1
#SBATCH -c 8
#SBATCH --time=8:00:00
#SBATCH --mem=256G
#SBATCH --output=tthm_step1_%j.out
#SBATCH --error=tthm_step1_%j.err

# Load R module
module load R/4.4.2

# Navigate to the Git project folder
cd ~/CWSDBP/CWSDBP_git

# Run the R script
Rscript tthm_ginsburg_Step1.R

Job Submission
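The submission example below refers to a simple test script. A minimal sketch of such a helloworld.sh (the contents are illustrative):

#!/bin/sh
#SBATCH -A ehsmsph         # Your group account name
#SBATCH -J HelloWorld      # The job name
#SBATCH -c 1               # One CPU core
#SBATCH -t 0-0:05          # Runtime in D-HH:MM (5 minutes)
#SBATCH --mem-per-cpu=1G   # Memory per core

echo "Hello World"
date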

If this script is saved as helloworld.sh you can submit it to the cluster with:

$ sbatch helloworld.sh

This job will create one output file named slurm-####.out, where the #'s are replaced by the job ID assigned by Slurm. If all goes well, the file will contain the words "Hello World" and the current date and time.

See the Slurm Quick Start Guide for a more in-depth introduction on using the Slurm scheduler.

In the Wil example, running sbatch run_tthm_step1.sh on the cluster submits the job; use squeue -u $USER to see if the job is still running.
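A few standard Slurm commands for monitoring and managing jobs (the job ID is a placeholder):

$ squeue -u $USER      # list your pending and running jobs
$ sacct -j <jobid>     # accounting/state info, including finished jobs
$ scancel <jobid>      # cancel a job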

Tips for operating Slurm

Job Arrays

Multiple copies of a job can be submitted by using a job array. The --array option can be used to specify the job indexes Slurm should apply.

--array=<indexes> (short form: -a): Submit a job array, e.g., #SBATCH -a 1-4. See below for discussion of job arrays.

A real example in the shell script file:

#SBATCH --array=1-100            # The total number of array tasks
# To limit how many tasks run at the same time, add a throttle instead,
# e.g. #SBATCH --array=1-100%20 runs at most 20 tasks at once

# Extract the array index
array_index=$SLURM_ARRAY_TASK_ID

We can also derive a different index from the array index; e.g., with an offset of 1000, task 1 processes index 1001 and task 100 processes index 1100:

desired_index=$((array_index + 1000))

Command to execute R code using srun

srun Rscript /burg/home/name_of_Rscript.R $desired_index >> ${OUTPUT_DIR}/routput${desired_index} 2>&1

Or

Rscript --vanilla name_of_Rscript.R ${SLURM_ARRAY_TASK_ID} > routput${SLURM_ARRAY_TASK_ID} 2>&1

After making this change, in your R script, you should use:

args <- commandArgs(trailingOnly=TRUE)
index <- as.numeric(args[1])  # retrieve the 1st argument

You can also print the args to the standard output to double check

cat("Arguments received:", args, "\n")

Extract the failed tasks into an array

Sometimes certain array tasks fail or are incomplete due to memory or compute limits. We can apply the code below to retrieve the failed task IDs into an array:

# Step 1: Extract the failed tasks into an array
FAILED_TASKS=($(sacct -j job_id --format=JobID,State | grep FAILED | awk 'NR % 3 == 1' | awk -F'[_]' '{print $2}' | awk '{print $1}'))

# Step 2: Convert the array to a comma-separated string
FAILED_TASKS_STR=$(IFS=,; echo "${FAILED_TASKS[*]}")

# Step 3: Resubmit only the failed tasks. Note that #SBATCH directives inside
# the script are not variable-expanded, so pass the array on the command line:
sbatch --array=${FAILED_TASKS_STR} name_of_the_shellscript.sh
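An alternative sketch for Step 1 using sacct's built-in filters (these are standard sacct options; adjust the job ID):

# -X reports only the job allocation (skips .batch/.extern steps),
# --state=FAILED filters to failed tasks, --parsable2 gives clean output
FAILED_TASKS=($(sacct -j <job_id> -X --state=FAILED --format=JobID --parsable2 --noheader | awk -F'_' '{print $2}'))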