Lab: ISAAC-NG Continued

Multiple jobs - array script

Return to your analysis directory from last lab.

cd /lustre/isaac/proj/UTK0318/analysis/<your directory>

Task arrays are great, but the scheduler assumes all data can be referred to by a convenient range of numbers, in this case 1 to 4. Let's see how this works. In array-example.qsh, put

#!/bin/bash
#SBATCH -J bwa
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:10:00
#SBATCH --array=1-4

echo "${SLURM_ARRAY_TASK_ID}" > ${SLURM_ARRAY_TASK_ID}.txt

What is SLURM_ARRAY_TASK_ID in this script and what is it doing?

Click to see the answer

The SLURM_ARRAY_TASK_ID variable is set by Slurm when the job is submitted with sbatch. It takes each value in the range given to --array, so even though the command (echo in this example) appears only once in the script above, it is run once for each and every task ID.
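
As an aside, --array is not limited to a simple range; Slurm also accepts comma-separated lists, step sizes, and a limit on how many tasks run at once. A few illustrative header lines (not needed for this lab):

#SBATCH --array=1,3,5      # only task IDs 1, 3, and 5
#SBATCH --array=0-15:4     # task IDs 0, 4, 8, and 12 (step of 4)
#SBATCH --array=1-4%2      # task IDs 1-4, but at most 2 running at a time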

Note that we are using --cpus-per-task=4; we'll investigate later whether these CPUs are split amongst the tasks or whether each task receives 4 CPUs.
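
If you want to peek for yourself, each array task gets its own environment, so adding a line like the following to the script would record what each task sees (SLURM_CPUS_PER_TASK is set because we requested --cpus-per-task=4):

echo "task ${SLURM_ARRAY_TASK_ID} sees ${SLURM_CPUS_PER_TASK} cpus" > ${SLURM_ARRAY_TASK_ID}.cpus.txt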

Submit the script

sbatch array-example.qsh

New files are created, showing the array of 4 jobs, along with new slurm output file names. Check them out. What do you see?

Click to see the answer

There should be 4 .txt files named 1 through 4, and 4 slurm .out files with the job ID and the corresponding SLURM_ARRAY_TASK_ID in the filename.
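
Concretely, the directory should now contain something like the following (the job ID will differ):

1.txt  2.txt  3.txt  4.txt
slurm-<jobid>_1.out  slurm-<jobid>_2.out  slurm-<jobid>_3.out  slurm-<jobid>_4.out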

But our actual sequence files aren't named 1-4. We can make aliases OR we can use some more clever sed.

Start by putting those names in a file:

ls /lustre/isaac/proj/UTK0318/raw_data/GBS_reads/*fastq > filenames.txt

You can check it worked with nano or cat.
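
For example:

cat filenames.txt
wc -l filenames.txt    # should report 4 lines, one path per fastq file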

Now let's tweak the submission script. First, we'll use sed to take the number provided by ${SLURM_ARRAY_TASK_ID} and pull out the corresponding line of filenames.txt.

Try it out quickly and see how it behaves:

sed -n -e "2 p" filenames.txt
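
Here -n suppresses sed's default output and "2 p" prints only line 2, so the command pulls out the second path in filenames.txt. To mimic what an array task will do, you can set the variable by hand (the value 3 is just an example):

SLURM_ARRAY_TASK_ID=3
sed -n -e "${SLURM_ARRAY_TASK_ID} p" filenames.txt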

Next, we'll use sed again to create the output filename (using a regex and the input filename). In array-bwa.qsh, put

#!/bin/bash
#SBATCH -J bwa
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:10:00
#SBATCH --array=1-4

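# Pull the line of filenames.txt whose line number matches this task's array index.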
infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" filenames.txt)

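# Build the output name from the input name: strip the directory, then swap the _1.fastq suffix for .sam.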
outfile=$( basename $infile | sed 's/_1.fastq/.sam/' )

mkdir -p array_results

module load bwa

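# Align this task's fastq file to the Solenopsis invicta chromosome 3 reference; -o writes the SAM output to the given path.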
bwa mem \
-o ./array_results/${outfile} \
/lustre/isaac/proj/UTK0318/reference/solenopsis_invicta_genome_chr_3.fna \
${infile}

This will request 4 simultaneous jobs of bwa mem, each with --cpus-per-task=4.

Submit and see if it worked.

What happens when you use squeue --me?

Click to see the answer

You should have 4 running jobs, each with an identical JOBID prefix and a suffix corresponding to the SLURM_ARRAY_TASK_ID. Cool!

You should have results in your array_results directory.
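
A quick look at the output, for example:

ls -lh array_results/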

Challenge - putting it all together

Now that you've successfully used sbatch for individual and parallel jobs, it's time to put the skills you've acquired to the test! We currently have large .sam files, and we want to convert them to .bam files.

Complete the following:

  1. Create an appropriate SBATCH header for a file called array-samtools.qsh.
  2. Check to see which versions of samtools are available on ISAAC-NG.
  3. Have your new sbatch script load the newest version of samtools.
  4. Create a way to connect the .sam files in array_results to a task array (using the filenames.txt approach).
  5. Name the output files with a .bam extension.
  6. Run samtools view to produce a .bam file in the array_results directory.

A few hints!

  • You can re-use much of the previous array-bwa.qsh file.
  • You can re-use the filenames.txt approach from before, using the full paths from your .sam results.
realpath array_results/*sam > filenames.txt
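  • To see which versions of samtools are installed (challenge item 2), module avail is the usual way; module spider samtools also works on Lmod systems.
module avail samtools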

Converting sam to bam:

samtools view -S -b <input.sam> > <output.bam>

Answer!

#!/bin/bash
#SBATCH -J samtools_task_array
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:10:00
#SBATCH --array=1-4

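# Regenerate the list of .sam files. Each array task writes the same list, which is redundant but harmless; you could also run this once before submitting.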
realpath ./array_results/*.sam > filenames.txt

infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" filenames.txt)

outfile=$( basename $infile | sed 's/.sam/.bam/' )

module load samtools/1.16.1-gcc

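# Convert SAM to BAM: -b requests BAM output; -S marks the input as SAM (modern samtools auto-detects the format and ignores -S).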
samtools view \
  -S \
  -b \
  ${infile} > ./array_results/${outfile}
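
Once the array finishes, a quick sanity check (with the samtools module loaded) might look like the following; samtools quickcheck prints nothing when the files are intact:

ls -lh ./array_results/*.bam
samtools quickcheck ./array_results/*.bam && echo "all bam files look ok"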

More tips and tricks for using ISAAC-NG and UTK HPSC resources

If you want a simple interactive session, the following salloc Slurm command will request a node for you:

salloc --account ISAAC-UTK0318 --partition=condo-epp622 --qos=condo --nodes=1 --cpus-per-task=2 --mem=8G --time=0-03:00:00

This will produce a message similar to "Nodes il1340 are ready for job", but the node will differ depending on what's available. To connect, you can use:

ssh <node name from above>

Once you connect, you'll see that the prompt shows your username@node. To exit, simply type exit. You may have to type exit a second time or scancel the allocation to free up the resources.

Alternatively, srun will accomplish the same thing as salloc, but it connects you to a node immediately, without a separate ssh step.

srun --account ISAAC-UTK0318 --partition=condo-epp622 --qos=condo --nodes=1 --cpus-per-task=2 --mem=8G --time=0-03:00:00 --pty /bin/bash

A list of useful commands and topics to mention:

  • scancel
  • showpartitions
  • sinfo
  • squeue -o "%i %P %j %C %m %q" --me
  • scontrol show job <job_id>
  • sacctmgr show qos where name=campus
  • transferring data with dtn[12]
  • tour of resources
    • home/scratch/projects
  • Open OnDemand
    • Open OnDemand RStudio
    • Open OnDemand Jupyter
    • Open OnDemand shell
    • Open OnDemand file browser
    • Globus