# Lab: ISAAC-NG Continued
Return to your analysis directory from last lab.
```
cd /lustre/isaac/proj/UTK0318/analysis/<your directory>
```
## Task arrays

Task arrays are great, but the scheduler assumes all data can be referred to by a convenient range of numbers, in this case 1 to 4. Let's see how this works. In `array-example.qsh`, put
```bash
#!/bin/bash
#SBATCH -J bwa
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:10:00
#SBATCH --array=1-4

echo "${SLURM_ARRAY_TASK_ID}" > ${SLURM_ARRAY_TASK_ID}.txt
```
What is `SLURM_ARRAY_TASK_ID` in this script and what is it doing?
Click to see the answer
The `SLURM_ARRAY_TASK_ID` variable is created when the job is run with `sbatch`. Its value depends on the range specified in `--array`, and even though it appears only once in the script above, the command (`echo` in this example) will run once for each and every task ID.
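To see this in action, Slurm sets a few related variables in each array task. Here's a minimal sketch you could add to the script (these are standard Slurm variables, but you can confirm what is set on ISAAC-NG with `env | grep SLURM_ARRAY`):

```bash
# each array task gets its own copy of these variables
echo "array job ${SLURM_ARRAY_JOB_ID}: task ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MIN}-${SLURM_ARRAY_TASK_MAX}"
```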
Note that we are using `--cpus-per-task=4`; we'll investigate later whether these CPUs are split amongst the tasks or whether each task receives 4 CPUs.
Submit the script
```
sbatch array-example.qsh
```
New files are created, showing the array of 4 jobs, along with new slurm output files. Check them out. What do you see?
Click to see the answer
There should be four `.txt` files, named `1.txt` through `4.txt`, and four slurm `.out` files with the job ID and the corresponding `SLURM_ARRAY_TASK_ID` in the file name.
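A quick check (assuming the default slurm output naming of `slurm-<job_id>_<task_id>.out` for array jobs):

```
cat {1..4}.txt      # each file should contain its own task ID
ls slurm-*_*.out    # one slurm output file per array task
```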
But our actual sequence files aren't named 1-4. We can make aliases OR we can use some more clever sed.
Start by putting those names in a file:
```
ls /lustre/isaac/proj/UTK0318/raw_data/GBS_reads/*fastq > filenames.txt
```
You can check it worked with `nano` or `cat`.
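For example, to confirm it has one line per fastq file (matching `--array=1-4`):

```
cat filenames.txt
wc -l filenames.txt   # expect 4 lines
```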
Now let's tweak the submission script. First, we'll use `sed` to take the number provided by `${SLURM_ARRAY_TASK_ID}` and pull out the corresponding line from `filenames.txt`.
Try it out quickly and see how it behaves:

```
sed -n -e "2 p" filenames.txt
```
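While you're there, you can also preview the rename step we'll use for the output file name. The fastq name below is just a hypothetical example; substitute a path from your `filenames.txt`:

```
# hypothetical file name, for illustration only
basename /lustre/isaac/proj/UTK0318/raw_data/GBS_reads/sample_A_1.fastq | sed 's/_1.fastq/.sam/'
# prints: sample_A.sam
```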
Next, we'll use `sed` again to create the output filename (using a regex and the input filename). In `array-bwa.qsh`, put
```bash
#!/bin/bash
#SBATCH -J bwa
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:10:00
#SBATCH --array=1-4

# pull this task's input file out of filenames.txt, then build the output name from it
infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" filenames.txt)
outfile=$( basename $infile | sed 's/_1.fastq/.sam/' )

mkdir -p array_results

module load bwa

bwa mem \
    -o ./array_results/${outfile} \
    /lustre/isaac/proj/UTK0318/reference/solenopsis_invicta_genome_chr_3.fna \
    ${infile}
```
This will request 4 simultaneous `bwa mem` jobs, each with `--cpus-per-task=4`.
Submit and see if it worked.
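As before:

```
sbatch array-bwa.qsh
```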
What happens when you use `squeue --me`?
Click to see the answer
You should have 4 running jobs, each with an identical `JOBID` prefix and a suffix corresponding to the `SLURM_ARRAY_TASK_ID`. Cool!

You should have results in your `array_results` directory.
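One way to spot-check the output (adjust to your own file names):

```
ls -lh array_results/            # expect one .sam file per input fastq
head -n 5 array_results/*.sam    # SAM files begin with @SQ/@PG header lines
```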
Now that you've successfully used sbatch for individual and parallel jobs, it's time to put the skills you've acquired to the test! We currently have large `.sam` files, and we want to convert them to `.bam` files.
Complete the following:
- Create an appropriate SBATCH header for a file called `array-samtools.qsh`.
- Check to see which versions of samtools are available on ISAAC-NG.
- Have your new sbatch script load the newest version of samtools.
- Create a way to connect the `.sam` files in `array_results` to a task array (using the `filenames.txt` approach).
- Rename the output files with `.bam`.
- Run `samtools view` to produce a `.bam` file in the `array_results` directory.
A few hints!
- You can re-use much of the previous `array-bwa.qsh` file.
- You can re-use the `filenames.txt` approach from before, using the full paths from your `.sam` results:

  ```
  realpath array_results/*sam > filenames.txt
  ```
- Converting `.sam` to `.bam`:

  ```
  samtools view -S -b <input.sam> > <output.bam>
  ```
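To see which samtools builds are installed, something like this should work (module names on ISAAC-NG may differ):

```
module avail samtools
# or search the whole module tree:
module spider samtools
```

One possible solution for `array-samtools.qsh`: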
```bash
#!/bin/bash
#SBATCH -J samtools_task_array
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:10:00
#SBATCH --array=1-4

# rebuild filenames.txt with the .sam paths (every array task writes the same
# list; you could also run this once before submitting, as in the hint above)
realpath ./array_results/*.sam > filenames.txt

# pick this task's input and derive the .bam output name
infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" filenames.txt)
outfile=$( basename $infile | sed 's/.sam/.bam/' )

module load samtools/1.16.1-gcc

samtools view \
    -S \
    -b \
    ${infile} > ./array_results/${outfile}
```
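Submit it and check the results:

```
sbatch array-samtools.qsh
ls -lh array_results/*.bam   # .bam files should be noticeably smaller than the .sam files
```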
If you want a simple interactive session, the following `salloc` slurm command will request a node for you:

```
salloc --account ISAAC-UTK0318 --partition=condo-epp622 --qos=condo --nodes=1 --cpus-per-task=2 --mem=8G --time=0-03:00:00
```
This will produce a message similar to `Nodes il1340 are ready for job`, but the node will be different depending on what's available. To connect you can use:

```
ssh <node name from above>
```
Once you connect, you'll see the prompt shows your username@node. To exit, simply type `exit`. You may have to type `exit` a second time or `scancel` the allocation to free up the resource.
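If the allocation lingers after you exit, you can cancel it by job ID, for example:

```
squeue --me        # find the job ID of the interactive allocation
scancel <job_id>
```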
Alternatively, `srun` will accomplish the same thing as `salloc`, but it will connect you to a node immediately, without needing `ssh`.

```
srun --account ISAAC-UTK0318 --partition=condo-epp622 --qos=condo --nodes=1 --cpus-per-task=2 --mem=8G --time=0-03:00:00 --pty /bin/bash
```
A list of useful commands and topics to mention:
- `scancel`
- `showpartitions`
- `sinfo`
- `squeue -o "%i %P %j %C %m %q" --me`
- `scontrol show job <job_id>`
- `sacctmgr show qos where name=campus`
- transferring data with dtn[12]
- tour of resources
  - home/scratch/projects
  - Open OnDemand
    - Open OnDemand RStudio
    - Open OnDemand Jupyter
    - Open OnDemand shell
    - Open OnDemand file browser
  - Globus