02. Working Within the Queue

#------------------------------------------------------------------------------#

WORKING WITHIN THE QUEUE

#------------------------------------------------------------------------------#

Working with an interactive session allows you to interact directly with the file system, but many of the jobs we will be running take hours if not days, so keeping a terminal open is not feasible. Instead, you will need to become familiar with working in the queuing system (aka scheduler) that's built into HPCC. For this exercise, you will NOT get an interactive session. But don't get used to that. Generally, you should get an interactive session for most everything you do on the system other than simply submitting a job.

To do this, we use submission scripts.

Transfer the following script to your gge folder,

cp /home/daray/counter.sh ~/gge

and then go to that directory.

cd ~/gge

Before you run the script, peek at the queue with

squeue

A very long list will appear. It's long because it lists every active and queued job from every user.

Generally, we don't care about anything others are running. So, let's only look at what you're running.

squeue -u [eraider]

The -u option tells squeue to show only jobs belonging to a specific user (you, in this case).
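If you'd rather not type your eraider each time, the shell can usually fill it in for you via the $USER environment variable (this assumes your login name on HPCC matches your eraider, which it normally does):

squeue -u $USER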

Assuming you don't have other jobs running, you'll see nothing but the header line.

     JOBID  PARTITION  PRIORI  ST      USER      NAME        TIME  NODES   CPUS      QOS  NODELIST(REASON)

Now, run the counter script using:

sbatch counter.sh
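When the scheduler accepts the job, sbatch prints the new job's ID. It looks something like this (the ID itself is just an example):

Submitted batch job 2968376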

and look at the queue again. The details will be different but it should look something like this:

login-20-25:$ squeue -u [eraider]
   2968376     nocona    3581   R     daray   counter        6:08      1      1   normal  cpu-23-12

Each column means something. Here are the column headers again, since the filtered output above doesn't show them.

     JOBID  PARTITION  PRIORI  ST      USER      NAME        TIME  NODES   CPUS      QOS  NODELIST(REASON)

This shows that your job is running on whatever compute node it was assigned to. Important things to notice:

  • 'JOBID' for this job is 2968376. What's yours? It will be important if you want to kill this job or keep up with what's happening with it.
  • 'PARTITION' The partition (part of HPCC) on which you set your job running.
  • 'PRIORI' is the priority status in the queue. The higher the number, the better for getting your job run.
  • 'ST' Your job's status.
    • 'R' (running) is good.
    • 'PD' (pending) is still good, but you gotta wait your turn in the queue.
    • 'CG' (completing). Your job is ending without finishing. This usually means something was wrong with your script.
  • 'USER' That's you. You can figure this one out. You can view processes other people are running, but that's not important right now.
  • 'NAME' Try to name your jobs something easily identifiable, but also note that you'll only see up to 8 characters.
  • 'TIME' Time your script has been running.
  • 'NODES' The number of nodes assigned to this task.
  • 'CPUS' The number of CPUs (processors) assigned to this task.
  • 'QOS' Quality of service. This has information on why your job may be being held up or other data.
  • 'NODELIST(REASON)' Tells you which specific node(s) you're using or the reason your job is being held.
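If you'd like only a subset of these columns, squeue can build custom output with its -o option. A minimal sketch using standard squeue format codes (%i job ID, %j name, %t state, %M elapsed time, %R nodelist/reason; see man squeue for the rest):

squeue -u [eraider] -o "%.10i %.12j %.4t %.10M %R"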

More information is available if you type man squeue

View counter.sh in less:

less counter.sh

#!/bin/bash
#SBATCH --job-name=counter
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1


for i in `seq 1 1000`;
do
    echo $i
    sleep 1
done

The first 7 lines are the submission script 'header'. Most of this stuff is standard. You'll only change a couple of things for your jobs.

  • '#!/bin/bash' Sets the command interpreter path. In this case, we're using bash and that interpreter is located at /bin/bash.
  • '#SBATCH --job-name=counter' Name your job 'counter'.
  • '#SBATCH --output=%x.%j.out' Sets the name of the standard output file; %x expands to the job name and %j to the job ID, so you'd get something like 'counter.1915516.out'.
  • '#SBATCH --error=%x.%j.err' Sets the name of the standard error file, for example 'counter.1915516.err'.
  • '#SBATCH --partition=nocona' Instructs the scheduler to use the partition named 'nocona'. You could also choose quanah or ivy.
  • '#SBATCH --nodes=1' Instructs the scheduler to use one node (on nocona, a node consists of 128 processors).
  • '#SBATCH --ntasks-per-node=1' Instructs the scheduler to use one task per node (aka one processor from each node).

There are other possible lines to specialize this setup, but this class doesn't really need to go into them in detail.
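That said, here is a sketch of a few directives you may run across later. These are standard sbatch options (see man sbatch); nothing in this exercise requires them, and the values below are made up:

#SBATCH --time=02:00:00              # maximum wall-clock time (2 hours here)
#SBATCH --mem-per-cpu=4G             # memory per processor
#SBATCH --mail-type=END,FAIL         # email when the job ends or fails
#SBATCH --mail-user=you@example.edu  # hypothetical address; use your own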

The next five lines tell the script what to do:

for i in `seq 1 1000`;
do
    echo $i
    sleep 1
done

This sets up a variable, 'i', that ranges from 1 to 1000, counting by 1's. Each pass through the loop prints the current value to the screen ('echo $i'), waits one second ('sleep 1'), and then moves on to the next number. It continues until 1000 is reached.
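As an aside, seq can also count by a different increment (its middle argument), and bash has a built-in C-style loop that does the same job without seq. A quick sketch, with arbitrary values:

for i in `seq 0 5 100`; do echo $i; done     # prints 0, 5, 10, ... 100
for ((i=1; i<=10; i++)); do echo $i; done    # same idea, pure bash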

To see what's happening, we can look at the standard output file. To find out what that file is named, use ls.

You should see at least two new files called counter.<job-ID>.out and counter.<job-ID>.err. The job-ID will be the value you got from the first column of the squeue results.

To see what's happening in this file, use tail with the -f (follow) option.

tail -f counter.<job-ID>.out

You should see a growing list of numbers. The actual numbers will depend on how far into the run the program has gotten when you issue your command. Below is what you'd see if you started to follow the file after the script had already counted to '19'. Watch it for a few seconds and see what happens.

:$ tail -f counter.1915525.out
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

To exit this tail command use ctrl+c.
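If you only want a snapshot of the file rather than a live feed, tail's -n option prints a set number of lines from the end and returns you straight to the prompt:

tail -n 20 counter.<job-ID>.out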

Now that you've gotten the idea, we can kill this job. After all, there's no need to use up processors that someone else could be using.

Anytime you want to kill a job just type

scancel <job-ID>
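scancel can also take a user name instead of a job ID, which cancels every job you have queued or running at once; handy, but use it with care (this is a standard scancel option; see man scancel):

scancel -u [eraider]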

You can also view any interactive sessions you might have active using squeue. Grab 10 processors on nocona with an interactive login.

interactive -p nocona -c 10

Notice that these options are similar to the last three lines of the submission script header.

Now check the queue. You'll see something like this.

cpu-4-7:$ squeue -u [eraider]
           2968515    nocona INTERACT    daray  R       0:15      1 cpu-25-60

Your interactive session is named INTERACT (remember, the queue will only give you the first 8 characters of the job name).

To exit your interactive session, type exit or kill that process with scancel <job-ID>.

#------------------------------------------------------------------------------#

FOR YOU TO DO

#------------------------------------------------------------------------------#

Alter the counter.sh to cause it to:

  • count to 500 rather than 1000.
  • use 5 processors rather than 1.
  • run under the name 'countermod' rather than 'counter'.

Run the script.

Under Assignment 2 - The queue on Blackboard, upload your modified version of the submission script. Then, after the job is complete (how would you know?), upload the countermod.<job-ID>.out output file. You'll need to download it to your local computer and upload it to Blackboard.
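One possible way to answer the 'how would you know?' question, sketched here assuming the standard Slurm tools are available: if squeue -u [eraider] no longer lists the job, it has left the queue, and the accounting command sacct (if enabled on your cluster) reports its final state:

sacct -j <job-ID>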
