Computation management tools - lorenzo-arcioni/HPC-T-Annotator GitHub Wiki
If you are using a workload manager to run HPC-T-Annotator, the following process management tools can be useful.
Monitoring and error checking
While the computation is running, you can monitor its status with the script monitor.sh.
./monitor.sh
It returns a table like the following (this example is from Slurm):
JOBID | PARTITION | NAME | USER | STATE | TIME | TIME_LIMIT | NODES | NODELIST(REASON) |
---|---|---|---|---|---|---|---|---|
7032348 | g100_all_serial | PA_proc-control | larcioni | RUNNING | 0:04 | 4:00:00 | 1 | login10 |
7032347 | g100_usr_prod | PA_proc-3 | larcioni | PENDING | 0:00 | 4:00:00 | 1 | (Priority) |
7032346 | g100_usr_prod | PA_proc-2 | larcioni | PENDING | 0:00 | 4:00:00 | 1 | (Priority) |
7032345 | g100_usr_prod | PA_proc-0 | larcioni | PENDING | 0:00 | 4:00:00 | 1 | (Priority) |
7032344 | g100_usr_prod | PA_proc-4 | larcioni | RUNNING | 56:30 | 4:00:00 | 1 | Node2 |
7032343 | g100_usr_prod | PA_proc-1 | larcioni | RUNNING | 1:54:43 | 4:00:00 | 1 | Node1 |
It does not report an estimated completion time, because this cannot be determined in advance: it depends on the system's workload.
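When many partial jobs are queued, a per-state count can be easier to read than the full table. The following is a minimal sketch, not part of HPC-T-Annotator: `count_states` is a hypothetical helper, and it assumes that the job names start with `PA_proc`, as in the table above, and that `squeue` is available on the system.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with HPC-T-Annotator.
# Reads "NAME STATE" pairs on stdin and prints one "STATE count" line
# per state, considering only jobs whose name starts with PA_proc.
count_states() {
    awk '$1 ~ /^PA_proc/ { n[$2]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a Slurm system (assumption: squeue is on PATH):
#   squeue -u "$USER" -h -o "%j %T" | count_states
```

For the six jobs in the example table above, this would print `PENDING 3` and `RUNNING 3`.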
There is also a built-in error-checking tool that tests the input sequences, the partial input and output files, and the final output, to ensure that all input sequences were analysed.
Another useful script checks for Slurm errors and allows processes that ended in an error to be re-executed. For example, to change the execution time limit in all partial scripts and re-execute them, run the following commands:
find . -name script.sh -exec sed -i "s/#SBATCH --time=3:00:00/#SBATCH --time=5:00:00/" {} \;
./slurm_error_checker.sh
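The sed command above only matches scripts whose limit is exactly 3:00:00. If the partial scripts carry a different or unknown limit, a variant that replaces whatever value follows `--time=` may be more convenient. This is a sketch under assumptions: `set_time_limit` is a hypothetical helper, GNU sed is assumed (for `-i` and `-E`), and the directive is assumed to sit on its own line starting with `#SBATCH --time=`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with HPC-T-Annotator.
# $1: new time limit, e.g. 5:00:00
# $2: directory to search for script.sh files (default: current directory)
set_time_limit() {
    # Replace everything after "#SBATCH --time=" on that line,
    # regardless of the value that was there before.
    find "${2:-.}" -name script.sh -exec \
        sed -i -E "s/^(#SBATCH --time=).*/\1$1/" {} +
}

# Example usage, followed by the re-execution step from the text:
#   set_time_limit 5:00:00 .
#   ./slurm_error_checker.sh
```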
Abort the computation
You can cancel the entire computation, terminating all of its processes, with the script cancel.sh.
./cancel.sh
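To inspect which jobs belong to the computation before cancelling anything, the job IDs can be listed by name prefix. The sketch below is not how cancel.sh is implemented; `select_job_ids` is a hypothetical helper that assumes the `PA_proc` job-name prefix seen in the monitoring table and a Slurm system with `squeue` and `scancel`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with HPC-T-Annotator.
# Reads "JOBID NAME" pairs on stdin and prints the IDs of the jobs
# whose name starts with the given prefix.
# $1: job-name prefix (default: PA_proc)
select_job_ids() {
    awk -v prefix="${1:-PA_proc}" 'index($2, prefix) == 1 { print $1 }'
}

# List the matching job IDs (assumption: squeue is on PATH):
#   squeue -u "$USER" -h -o "%i %j" | select_job_ids
# Cancel them manually (xargs -r is a GNU extension):
#   squeue -u "$USER" -h -o "%i %j" | select_job_ids | xargs -r scancel
```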