TITAN's batch queue - ATLAS-Titan/misc GitHub Wiki
Titan batch queue
The aim of this section is to investigate TITAN's batch queue. To do this, we first characterize users' requests (i.e., the size of the jobs in terms of number of nodes and wall-time) and then look at TITAN's response (i.e., the queue time).
The following data has been extracted directly from TITAN's log files. These files contain records that follow the so-called "wiki format" of Moab accounting. We refer to the Moab accounting guide for further details.
Since we were interested in the batch queue, we selected only the records that belong to that queue (records from the debug and killable queues have been ignored). Furthermore, since we were not interested in the behaviour of users with extra privileges, we excluded all the records in which the job size does not fit the limits indicated in TITAN's official user guide. Finally, we excluded jobs marked with the flag "RESTARTABLE" because we observed that their duration can exceed the wall-time. In the future, we also plan to remove the records that belong to the backfill submission mode.
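The selection rules above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `JobRecord` fields are hypothetical names (not the actual Moab accounting field names), and `EXAMPLE_LIMITS` holds placeholder values that should be replaced with the limits from the user guide.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    # Hypothetical field names; map them to the real accounting fields.
    job_id: str
    queue: str                      # e.g. "batch", "debug", "killable"
    nodes: int                      # number of nodes requested
    walltime_min: int               # requested wall-time, in minutes
    flags: frozenset = frozenset()  # job flags, e.g. {"RESTARTABLE"}


# Placeholder limits as (min_nodes, max_nodes, max_walltime_min).
# Assumption: take the real values from the user guide.
EXAMPLE_LIMITS = [(1, 125, 120), (126, 312, 360)]


def within_limits(job, limits):
    """True if the request fits at least one (nodes, wall-time) limit."""
    return any(lo <= job.nodes <= hi and job.walltime_min <= max_wt
               for lo, hi, max_wt in limits)


def select_jobs(records, limits):
    """Keep batch jobs that are not RESTARTABLE and respect the limits."""
    return [job for job in records
            if job.queue == "batch"
            and "RESTARTABLE" not in job.flags
            and within_limits(job, limits)]
```

Keeping the limits in a separate table makes it easy to rerun the selection if the queue policy changes.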
We considered the time interval from 2016-06-01 to 2017-01-15. We read a total of 3662110 records, from which we were able to extract information about 14395 different jobs (~65.91 jobs per day on average). These jobs were distributed over the bins as depicted in the plot below.

The exact values are the following:
- First bin: 9
- Second bin: 96
- Third bin: 261
- Fourth bin: 389
- Fifth bin: 13640
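As a quick check, the per-bin counts listed above can be turned into shares of the total. The snippet below is a minimal sketch that uses only the numbers reported in the list; the dictionary keys are just labels chosen for illustration.

```python
# Per-bin job counts as reported in the list above.
bin_counts = {"first": 9, "second": 96, "third": 261, "fourth": 389, "fifth": 13640}

total = sum(bin_counts.values())

# Share of each bin, as a percentage of all the jobs.
shares = {name: 100.0 * count / total for name, count in bin_counts.items()}

for name, pct in shares.items():
    print(f"{name:>6} bin: {bin_counts[name]:6d} jobs ({pct:5.2f}%)")
```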
The plot shows that ~95% of batch jobs belong to the fifth bin. To get a better picture of the job distribution, we provide the cumulative distribution of the jobs as a function of the number of nodes (note that both the x and y axes are scaled for the sake of readability).

The plot shows that ~70% of batch jobs ask for between 1 and 8 nodes, and ~25% ask for between 45 and 128 nodes. To give a hint about the sparsity of the requests, we provide the frequency as well (note that the y axis is scaled for the sake of readability).
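Both the frequency and the cumulative distribution can be derived from the list of requested node counts. The following is a minimal sketch, assuming the node counts have already been extracted into a plain list; the function names are hypothetical and the sample data is made up for illustration.

```python
from collections import Counter
from itertools import accumulate


def node_distribution(node_counts):
    """Return (sizes, frequency, cumulative_fraction), sorted by job size."""
    freq = Counter(node_counts)
    sizes = sorted(freq)
    counts = [freq[size] for size in sizes]
    total = sum(counts)
    # Running sum of the counts, normalized to the total number of jobs.
    cumulative = [running / total for running in accumulate(counts)]
    return sizes, counts, cumulative


def fraction_between(node_counts, lo, hi):
    """Fraction of jobs asking for between lo and hi nodes (inclusive)."""
    return sum(lo <= n <= hi for n in node_counts) / len(node_counts)


# Made-up sample, not data from the logs.
sample = [1, 1, 2, 8, 64]
sizes, counts, cumulative = node_distribution(sample)
print(sizes, counts, cumulative)
print(fraction_between(sample, 1, 8))
```

Node counts that never appear in the data are simply absent from `sizes`, which is how the sparsity of the support shows up.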

The plot underlines the decay of job requests as the number of nodes increases, but it also points out that the support of the distribution is compact up to ~1000 nodes. Beyond that, the support becomes extremely sparse. In other words, for small job sizes users tend to behave randomly (nearly every node count is requested), whereas large jobs are more deterministic (only a few specific node counts are requested).
To relate the number of nodes to the wall-time, we provide a plot that depicts the average wall-time as a function of the number of nodes.

The plot shows a trend in which the larger the number of nodes, the higher the average wall-time. However, we can notice that, except for the fifth bin, the average wall-time is always quite far from the maximum wall-time allowed for the bin (TODO: draw the limit inside the plot).
In the next plot we provide the frequency of the jobs as a function of the wall-time, from which it is possible to notice that only a small part of the jobs ask for more than 120 minutes. Additionally, no jobs ask for a wall-time between 721 and 1430 minutes.

In the next plot we provide the cumulative number of hours requested by users and the actual execution time as a function of the number of nodes (note that the x axis is scaled for the sake of readability).

First, we can observe that batch jobs requested a total of ~22K hours over the period considered, whereas only 8K hours were actually used. Second, the contribution of large jobs becomes larger when the real execution time is considered. This suggests that small jobs overestimate their execution time by asking for more minutes than needed. This conjecture is confirmed by the next plot, where the difference between execution time (ET) and wall-time (WT) is analyzed as a function of the bin (note that the azure histogram refers to the y axis on the right, whereas the purple and green ones refer to the axis on the left).

We can observe that small jobs overestimate the wall-time (jobs in the fifth and fourth bins use 45% and 50% of it, respectively), whereas large jobs tend to use at least 75% of the time available. It is also worth mentioning that the smaller the jobs inside a bin, the larger the gap between the average wall-time and the maximum wall-time allowed. For example, jobs in the fifth bin have an average wall-time of ~90 minutes against a maximum wall-time of 120 minutes. In contrast, jobs in the third bin ask for ~200 minutes on average, although they could potentially run for 720 minutes.
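The comparison between ET and WT can be sketched as a per-bin usage ratio. This is a minimal example under one assumption: the ratio is computed as total execution time over total requested wall-time within each bin, which may differ from the way the percentages above were obtained. The tuple layout and the sample values are hypothetical.

```python
from collections import defaultdict


def usage_by_bin(jobs):
    """jobs: iterable of (bin_name, walltime_min, execution_min) tuples.

    Returns, for each bin, total execution time / total requested wall-time.
    """
    requested = defaultdict(float)
    used = defaultdict(float)
    for bin_name, walltime_min, execution_min in jobs:
        requested[bin_name] += walltime_min
        used[bin_name] += execution_min
    # Skip bins with no requested time to avoid dividing by zero.
    return {b: used[b] / requested[b] for b in requested if requested[b] > 0}


# Made-up sample, not data from the logs.
sample = [("fifth", 100, 40), ("fifth", 100, 50), ("first", 200, 150)]
for bin_name, ratio in usage_by_bin(sample).items():
    print(f"{bin_name}: {ratio:.0%} of the requested wall-time used")
```

Summing before dividing weights long jobs more heavily than averaging the per-job ratios would; either choice is defensible, but the two give different numbers.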
Queue time
The plot below depicts the average queue time as a function of the bin.

The plot shows an astonishing drop in queue time for jobs in the first bin (in which, at least in principle, the user can ask for the entire machine). This drop might be a consequence of killable jobs, whose execution can be terminated by the system at any time (NOTE: this is my understanding from the user guide; it needs verification). Another aspect worth mentioning is that the second bin has an average waiting time smaller than those of the third and fourth bins. This suggests that the time to completion of several tasks could be reduced by using a single large job instead of several small jobs.
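The average queue time per bin discussed above can be sketched as follows, assuming each job carries a submission timestamp and a start timestamp in epoch seconds. The tuple layout is hypothetical and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def average_queue_time(jobs):
    """jobs: iterable of (bin_name, submit_epoch_s, start_epoch_s) tuples.

    Returns the average queue time per bin, in minutes.
    """
    waits = defaultdict(list)
    for bin_name, submit, start in jobs:
        if start >= submit:  # skip records with inconsistent timestamps
            waits[bin_name].append((start - submit) / 60.0)
    return {bin_name: mean(values) for bin_name, values in waits.items()}


# Made-up sample, not data from the logs.
sample = [("second", 0, 600), ("second", 0, 1800), ("fourth", 0, 7200)]
print(average_queue_time(sample))
```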
