Lab: ISAAC‐NG Intro - mestato/EPP622_2024 GitHub Wiki

Get set up on ISAAC

The following steps should already be complete, but just in case:

  • Navigate to https://portal.acf.utk.edu/accounts/request
  • Click on “I have a UT NetID”
  • Authenticate with NetID, password and Duo two factor
  • A form will then be presented with information pre-filled from the University. At least two fields need to be updated by hand, such as Salutation (Mr., Dr., etc.) and Citizenship. Look for the required fields marked with an *
  • Once the form is filled out click through to the next item
  • Type the project name ISAAC-UTK0318 (with the alphabetic characters in uppercase) to request to be added. That should be it.

Notes

  • You should never run jobs on the login node! It is only for setting up your scripts and launching your jobs through the scheduler.
  • Keep the documentation handy!
  • The user portal lets you see which projects you belong to, where you can store data, and how much space you have
  • You should be a part of ISAAC-UTK0318 and see storage at /lustre/isaac/proj/UTK0318
  • By default, the SLURM scheduler uses the directory from which the job was submitted as the working directory
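That working-directory default matters because any relative path in your script resolves against wherever you ran sbatch. A minimal local demonstration of the same behavior (directory names here are illustrative):

```shell
# Relative paths resolve against the current working directory --
# which is exactly what SLURM preserves for your job by default.
mkdir -p demo
cd demo
echo "hello" > out.txt   # lands in ./demo/out.txt
cd ..
cat demo/out.txt         # prints: hello
```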

Log into ISAAC Next Gen. After you enter your password, it will send you a Duo push.

ssh <yourusername>@login.isaac.utk.edu

The software system works just like Spack, only the command is "module". (It is actually Spack under the hood, but module is the command used in the documentation.) Let's see what is available.

module avail

BWA is something we have used before; let's see if it's installed

module avail bwa
module load bwa
bwa
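Besides avail and load, a few other module subcommands are handy. These are standard in environment-modules/Lmod setups; run module help on ISAAC for the authoritative list:

```shell
module list          # show which modules are currently loaded
module unload bwa    # remove bwa from your environment
module purge         # unload everything and start from a clean slate
```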

Go to the project directory

cd /lustre/isaac/proj/UTK0318/

You will see a directory set up for our practice. cd into it and create a directory for your lab

cd analysis
mkdir <yourusername>
cd <yourusername>
mkdir results

I've already downloaded our old solenopsis data, the genome, and indexed the genome with bwa.

Simple example - single sbatch command on the command line

You can load the software into your environment and then run the job; the job will inherit your environment (i.e., the software will still be loaded)

Let's just get a quick test command going

echo Worked! > results/output.txt

Let's try to run that through the scheduler.

sbatch -n 1 -N 1 -A ISAAC-UTK0318 -p condo-epp622 -q condo -t 00:01:00 --wrap="echo Worked! > results/sbatch_output.txt"

The flags tell the job scheduler all about your job, including what resources it needs. This can be tricky: you need to know how many threads and how much RAM your job will use so that you can request a sufficient amount.
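As a sketch, here is the same kind of one-liner with the common resource flags annotated; --mem is a standard sbatch option for requesting RAM, and the values and output file name here are only illustrative:

```shell
# -J          job name shown in squeue
# -n / -N     number of tasks / number of nodes
# -c          CPUs (threads) per task
# --mem       RAM per node (illustrative value)
# -A, -p, -q  account, partition, and QOS for our class condo
# -t          wall-clock limit (HH:MM:SS)
sbatch -J test -n 1 -N 1 -c 1 --mem=1G \
    -A ISAAC-UTK0318 -p condo-epp622 -q condo -t 00:01:00 \
    --wrap="echo Worked! > results/sbatch_output2.txt"
```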

To see where your job is in the queue

squeue -u <yourusername>

Did it work? Do you see results/sbatch_output.txt? What is the slurm file that got created?
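A few ways to inspect what happened (the job ID below is a placeholder; sacct is SLURM's standard accounting command):

```shell
cat results/sbatch_output.txt   # should contain: Worked!
ls slurm-*.out                  # the job's stdout/stderr, named slurm-<jobid>.out
sacct -j <jobid>                # accounting record: state, elapsed time, exit code
```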

Now let's try a real command. Increase the time to 10 minutes and run bwa mem through the scheduler with an sbatch script.

Simple example - single command in an sbatch script

It's more typical, and more readable, to create a submission script

In simple-bwa.qsh, put

#!/bin/bash
#SBATCH -J bwa
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:10:00

module load bwa

bwa mem \
-o results/SRR6922141_1.sam \
/lustre/isaac/proj/UTK0318/reference/solenopsis_invicta_genome_chr_3.fna \
/lustre/isaac/proj/UTK0318/raw_data/GBS_reads/SRR6922141_1.fastq

The directives at the top tell the job scheduler all about your job, just like the flags did. bwa mem by default only needs one thread, so we'll keep that.
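If you later give bwa mem more threads, keep its -t flag in sync with the cpus-per-task request so you actually use what you asked for. A sketch (thread count and output name are illustrative):

```shell
#SBATCH --cpus-per-task=4

bwa mem \
    -t 4 \
    -o results/SRR6922141_1.sam \
    /lustre/isaac/proj/UTK0318/reference/solenopsis_invicta_genome_chr_3.fna \
    /lustre/isaac/proj/UTK0318/raw_data/GBS_reads/SRR6922141_1.fastq
```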

Submit the script

sbatch simple-bwa.qsh

You can again track progress with squeue and look at the output files.

Multiple jobs - single sbatch command inside a for loop

We learned about for loops in class, and we can use them here too.

Let's build a for loop in an sbatch script called loop-bwa.qsh:

#!/bin/bash
#SBATCH -J bwa_loop
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00

module load bwa

for FILE in /lustre/isaac/proj/UTK0318/raw_data/GBS_reads/*fastq
do
    BASE=$( basename $FILE )
    bwa mem \
        -o results/${BASE%%.fastq}.sam \
        /lustre/isaac/proj/UTK0318/reference/solenopsis_invicta_genome_chr_3.fna \
        ${FILE}
done
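The ${BASE%%.fastq} expansion strips the .fastq suffix so each output file name mirrors its input. You can try the expansions on their own at the command line:

```shell
FILE=/lustre/isaac/proj/UTK0318/raw_data/GBS_reads/SRR6922141_1.fastq
BASE=$( basename "$FILE" )   # strip the directory part
echo "$BASE"                 # SRR6922141_1.fastq
echo "${BASE%%.fastq}"       # SRR6922141_1  (suffix removed)
```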

Note that a for loop won't work if you want the jobs to run in parallel: if you put all the commands in the background, the main script will finish, the scheduler will think the job is done, and your background processes will be killed before they complete. Instead, we are going to use a task array (next class).