Computation management tools - lorenzo-arcioni/HPC-T-Annotator GitHub Wiki

If you are using a workload manager to run HPC-T-Annotator, the following process management tools can be useful.

Monitoring and error checking

While the computation is running, you can monitor its status with the monitor.sh script.

./monitor.sh

If you are using Slurm, it prints a table like this:

JOBID    PARTITION        NAME             USER      STATE    TIME     TIME_LIMIT  NODES  NODELIST(REASON)
7032348  g100_all_serial  PA_proc-control  larcioni  RUNNING  0:04     4:00:00     1      login10
7032347  g100_usr_prod    PA_proc-3        larcioni  PENDING  0:00     4:00:00     1      (Priority)
7032346  g100_usr_prod    PA_proc-2        larcioni  PENDING  0:00     4:00:00     1      (Priority)
7032345  g100_usr_prod    PA_proc-0        larcioni  PENDING  0:00     4:00:00     1      (Priority)
7032344  g100_usr_prod    PA_proc-4        larcioni  RUNNING  56:30    4:00:00     1      Node2
7032343  g100_usr_prod    PA_proc-1        larcioni  RUNNING  1:54:43  4:00:00     1      Node1

It does not report an estimated completion time, because that cannot be determined in advance: it depends on the system's workload.
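For a quick overview of a long job table, the rows can be counted by state. The following is a minimal sketch, not part of HPC-T-Annotator: the function name summarize_states is hypothetical, and it assumes the column layout of the sample table above (header on the first line, STATE in the fifth column).

```shell
#!/usr/bin/env bash
# Count jobs per STATE in a squeue-style table read from standard input.
# Assumption: the first line is the header and STATE is the fifth
# whitespace-separated column, as in the sample table above.
summarize_states() {
    awk 'NR > 1 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage, assuming monitor.sh prints a table in the format shown above:
#   ./monitor.sh | summarize_states
```

Applied to the sample table, this would report three PENDING and three RUNNING jobs.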

There is also a built-in error-checking tool that tests the input sequences, the partial input and output files, and the final output, so that we can be sure all input sequences were analysed.
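The idea behind one such check can be illustrated by hand. The sketch below is not the built-in tool; it assumes the input is in FASTA format (one header line starting with ">" per sequence), and the function names count_seqs and check_split, as well as the file names in the usage comment, are hypothetical. It only verifies that the partial inputs together contain as many sequences as the original input.

```shell
#!/usr/bin/env bash
# Count FASTA records (header lines starting with '>') in the given files.
count_seqs() {
    awk '/^>/ { n++ } END { print n + 0 }' "$@"
}

# Succeed only if the partial inputs together hold as many records as the
# original input. Usage: check_split ORIGINAL PART...
check_split() {
    if [ "$#" -lt 2 ]; then
        echo "usage: check_split ORIGINAL PART..." >&2
        return 2
    fi
    local original=$1
    shift
    local expected actual
    expected=$(count_seqs "$original")
    actual=$(count_seqs "$@")
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $actual sequences"
    else
        echo "MISMATCH: expected $expected, found $actual" >&2
        return 1
    fi
}

# Usage (hypothetical file names):
#   check_split input.fasta part_0.fasta part_1.fasta
```

A matching count does not prove that every sequence was analysed; it only rules out sequences lost while splitting the input.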

Another useful script checks for Slurm errors and re-executes the processes that resulted in an error. For example, to change the time limit in all partial scripts and then re-run the failed ones, run the following commands:

find . -name script.sh -exec sed -i "s/#SBATCH --time=3:00:00/#SBATCH --time=5:00:00/" {} \;
./slurm_error_checker.sh
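The sed command above only matches scripts whose current limit is exactly 3:00:00. As a minimal sketch under that observation, the hypothetical helper set_time_limit below replaces whatever value follows "#SBATCH --time=" and keeps a .bak copy of each edited script. It assumes the directive sits on its own line, as in the command above.

```shell
#!/usr/bin/env bash
# Set the Slurm time limit in every script.sh below a directory, whatever
# the current value is. Usage: set_time_limit NEW_LIMIT [DIR]
# A backup of each edited file is written next to it as script.sh.bak.
set_time_limit() {
    local new_limit=$1
    local dir=${2:-.}
    find "$dir" -name script.sh -exec \
        sed -i.bak "s/^#SBATCH --time=.*/#SBATCH --time=${new_limit}/" {} \;
}

# Usage:
#   set_time_limit 5:00:00
#   ./slurm_error_checker.sh
```

The backups make it easy to compare or restore the previous limits; delete them once the re-run has succeeded.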

Abort the computation

To cancel the entire computation, terminate all of its processes with the cancel.sh script.

./cancel.sh
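Before cancelling, it can help to preview which jobs belong to the computation. The sketch below does not reproduce cancel.sh, whose internals are not described here. It assumes the job table format shown in the monitoring section (JOBID in column 1, NAME in column 3) and job names starting with PA_proc, as in that sample. The function name print_cancel_commands is hypothetical; scancel is the standard Slurm command for cancelling a job by ID.

```shell
#!/usr/bin/env bash
# Print one "scancel JOBID" line for every job in a squeue-style table
# (read from standard input) whose NAME starts with the given prefix.
# Nothing is cancelled: the commands are only printed for review.
# Assumptions: JOBID is column 1 and NAME is column 3; the default prefix
# "PA_proc" is taken from the sample table.
print_cancel_commands() {
    local prefix=${1:-PA_proc}
    awk -v p="$prefix" 'NR > 1 && index($3, p) == 1 { print "scancel", $1 }'
}

# Usage, assuming monitor.sh prints the table shown earlier:
#   ./monitor.sh | print_cancel_commands
```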