ProPE Job monitoring specifications
Below is the job-related information as used by ClusterCockpit. I find it important to separate data related to job accounting from information required for job performance monitoring. There can be links between the two, but in my opinion they should not be mixed up.
- job id
- user id
- project
- cluster name
- number of nodes - (redundant but useful for simplifying queries)
- job state - (running, aborted, finished successfully, etc.)
- start time - epoch time in s
- stop time - epoch time in s
- walltime - (to evaluate used vs requested time)
- node and CPU list - node and CPU IDs that specify the used compute resources (if nodes are only used exclusively, the CPU list can be omitted)
- tag list - array of tags; tags are pairs of <tag type>:<tag name>, both can be any string
Optional:
- duration - seconds (redundant but useful to have this available for queries)
- job script
Performance footprints in the form of job-wide metric averages. These are useful to analyse or sort jobs according to HPM metrics. The footprint metrics can be freely configured in ClusterCockpit, but one could also settle on a fixed set. A sketch of a complete job record follows the list below.
- mem_capacity_avg
- flops_any_avg
- mem_bw_avg
- ib_bw_avg
- lustre_bw_avg
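As an illustration, here is a minimal sketch of such a job record as a Python dict. The field names follow the lists above; the concrete values, units, and the node/CPU list representation are made-up examples rather than a fixed ClusterCockpit schema.

```python
# Hypothetical job record combining accounting metadata and the
# performance footprint; all values are invented for illustration.
job = {
    "job_id": 4091782,
    "user_id": "ab12cdef",
    "project": "prope",
    "cluster": "emmy",
    "num_nodes": 2,                   # redundant but useful for queries
    "job_state": "finished",          # running, aborted, finished, ...
    "start_time": 1532512800,         # epoch time in s
    "stop_time": 1532523600,          # epoch time in s
    "walltime": 43200,                # requested walltime (units assumed: s)
    "duration": 10800,                # optional, redundant to stop - start
    "nodes": {                        # node -> CPU IDs; the CPU list can be
        "e0101": [0, 1, 2, 3],        # omitted for exclusively used nodes
        "e0102": [0, 1, 2, 3],
    },
    "tags": [                         # <tag type>:<tag name> pairs
        {"type": "project", "name": "prope"},
        {"type": "app", "name": "gromacs"},
    ],
    # performance footprint: job-wide metric averages
    "mem_capacity_avg": 42.5,   # GB per node (assumed unit)
    "flops_any_avg": 120.3,     # GFlop/s per node (assumed unit)
    "mem_bw_avg": 38.1,         # GB/s per node (assumed unit)
    "ib_bw_avg": 0.9,           # GB/s per node (assumed unit)
    "lustre_bw_avg": 0.1,       # GB/s per node (assumed unit)
}
```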
The user-related information consists of the following (a small sketch follows the list):
- user id - string
- user name - string
- email - string
- is active - boolean, indicates if user is active or if this is an inactive account
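A corresponding user record could look like this (again only a sketch; the values are invented):

```python
# Hypothetical user record matching the fields listed above.
user = {
    "user_id": "ab12cdef",
    "user_name": "Jane Doe",
    "email": "jane.doe@example.org",
    "is_active": True,   # False for inactive accounts
}
```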
I removed additional fields such as group and project, as I believe a performance project usually goes together with a single user. This could be discussed, though. The current handling in ClusterCockpit would add a tag project:<project name> to the job, so all jobs of a project can be grouped.
In the ProPE project the focus is on metrics that mainly quantify resource utilization.
The list below specifies the name of each metric, what it means, the smallest granularity it is valid for, and how its values can be acquired. If one uses likwid-perfctr for measuring HPM metrics, all metrics below can be acquired with two performance groups: MEM_DP and FLOPS_SP (a sketch of how flops_any can be derived from them follows the list).
- cpu_used - CPU core utilization (between 0 and 1) / cpu level / kernel fs
- ipc - avg ipc of active cores (cores executing instructions) / cpu level / HPM
- mem_used - memory capacity used / node level / kernel fs
- mem_bw - memory bandwidth / socket level / HPM
- flops_any - total flop rate with DP flops scaled up / cpu level / HPM
- rapl_power - CPU power consumption / socket level / HPM
- lustre_bw - total lustre fs bandwidth / node level / kernel fs
- ib_bw - total infiniband or omnipath bandwidth / node level / kernel fs
- gpu_used - GPU utilization / GPU level / NVML (NVIDIA GPUs only)
- gpu_mem_used - GPU memory capacity used / GPU level / NVML (NVIDIA GPUs only)
- gpu_power - GPU power consumption / GPU level / NVML (NVIDIA GPUs only)
- clock - avg core frequency / cpu level / HPM
- flops_sp - SP flop rate / cpu level / HPM
- flops_dp - DP flop rate / cpu level / HPM
- eth_read_bw - Ethernet read bandwidth / node level / kernel fs
- eth_write_bw - Ethernet write bandwidth / node level / kernel fs
- lustre_read_bw - Lustre read bandwidth / node level / kernel fs
- lustre_write_bw - Lustre write bandwidth / node level / kernel fs
- lustre_read_req - Lustre read requests / node level / kernel fs
- lustre_write_req - Lustre write requests / node level / kernel fs
- lustre_inodes - Lustre inodes / node level / kernel fs
- lustre_accesses - Lustre open/close operations / node level / kernel fs
- lustre_fsync - Lustre fsync / node level / kernel fs
- lustre_create - Lustre create / node level / kernel fs
- ib_read_bw - Infiniband, Omnipath read bandwidth / node level / kernel fs
- ib_write_bw - Infiniband, Omnipath write bandwidth / node level / kernel fs
- ib_congestion - Infiniband, Omnipath congestion / node level / kernel fs
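As a small reasoning aid, here is a sketch of how flops_any could be combined from the SP and DP flop rates provided by the two likwid performance groups mentioned above. The factor of 2 for "DP flops scaled up" is an assumption about the intended scaling; the exact metric names reported by likwid-perfctr are not used here.

```python
def flops_any(flops_sp, flops_dp):
    """Combine SP and DP flop rates into one comparable rate.

    'DP flops scaled up' is read here as counting one DP flop as two
    SP-equivalent flops, so jobs become comparable regardless of the
    floating-point precision they use.
    """
    return flops_sp + 2.0 * flops_dp

# Example: 80 GFlop/s SP plus 20 GFlop/s DP -> 120 GFlop/s flops_any
print(flops_any(80.0, 20.0))
```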
InfluxDB uses its own nomenclature for the database schema, but there are corresponding structures in relational-database terms. In InfluxDB a database is structured into measurements; a measurement has tags (strings) and fields (numbers). A measurement in InfluxDB is similar to a table in SQL, a tag is a column with an index optimized for queries, and fields are regular columns without an index.
For low-overhead reporting and storage of metrics in InfluxDB it makes sense to group related metrics into one measurement. All fields in one measurement point must have the same timestamp. For the timestamp granularity, seconds should be the right choice.
One could, for example, use the topological entities within a node as measurements (a sketch of writing such points follows the list):
- cpu
  - tags: host, cpu
  - fields: load, cpi, flops_any, clock
- socket
  - tags: host, socket
  - fields: rapl_power, mem_bw
- node
  - tags: host
  - fields: mem_used, lustre_bw, ib_bw
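A minimal sketch of writing such points with the influxdb Python client (InfluxDB 1.x API). The database name, host name, and all metric values are placeholders; the measurement layout is the one proposed above.

```python
import time

from influxdb import InfluxDBClient  # influxdb-python client (InfluxDB 1.x)

client = InfluxDBClient(host="localhost", port=8086, database="prope")
now = int(time.time())  # one timestamp per reporting interval, in seconds

points = [
    {   # per-CPU measurement: tags identify the topological entity
        "measurement": "cpu",
        "tags": {"host": "e0101", "cpu": "0"},
        "time": now,
        "fields": {"load": 1.0, "cpi": 0.8, "flops_any": 2.1, "clock": 2.4},
    },
    {   # per-socket measurement
        "measurement": "socket",
        "tags": {"host": "e0101", "socket": "0"},
        "time": now,
        "fields": {"rapl_power": 95.0, "mem_bw": 38.5},
    },
    {   # per-node measurement
        "measurement": "node",
        "tags": {"host": "e0101"},
        "time": now,
        "fields": {"mem_used": 42.5, "lustre_bw": 0.1, "ib_bw": 0.9},
    },
]

# write with seconds precision, matching the timestamp granularity above
client.write_points(points, time_precision="s")
```

Batching the per-CPU, per-socket, and per-node points into one write call keeps the reporting overhead low.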
As there are many additional I/O- and network-related metrics, it could make sense to create extra measurements for them:
- network
  - tags: host
  - fields: ib_read_bw, ib_write_bw, eth_read_bw, eth_write_bw
- fileIO
  - tags: host
  - fields: lustre_read_bw, lustre_write_bw, lustre_read_requests, lustre_write_requests, lustre_create, lustre_open, lustre_close, lustre_seek, lustre_fsync
The following InfluxDB measurements are currently used in Dresden's ProPE database (one measurement per data source; a sketch of querying a job footprint from this schema follows the list):
- cpu
  - tags: hostname, cpu
  - fields: used
- infiniband
  - tags: hostname
  - fields: bw
- likwid_cpu
  - tags: hostname, cpu
  - fields: cpi, flops_any
- likwid_socket
  - tags: hostname, cpu
  - fields: mem_bw, rapl_power
- lustre[_scratch|highiops] (Dresden has two Lustre file systems)
  - tags: hostname
  - fields: read_bw, write_bw, read_requests, write_requests, create, open, close, seek, fsync
- memory
  - tags: hostname
  - fields: used
- nvml
  - tags: hostname, gpu
  - fields: gpu_used, mem_used, power, temp
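To connect this schema back to the job footprints described at the top, here is a hedged sketch of computing flops_any_avg for one job from the likwid_cpu measurement. The database name, node names, and time window are placeholders, and the simple mean over all samples and CPUs is only one possible definition of the footprint average.

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="prope")

# job metadata as provided by job accounting (all values are placeholders)
start, stop = 1532512800, 1532523600          # epoch times in s
hosts = ["taurusi6001", "taurusi6002"]        # node list of the job

# restrict to the job's nodes and its runtime window
host_filter = " OR ".join("hostname = '%s'" % h for h in hosts)
query = (
    "SELECT mean(flops_any) FROM likwid_cpu "
    "WHERE (%s) AND time >= %ds AND time <= %ds" % (host_filter, start, stop)
)

result = client.query(query)
for point in result.get_points():
    print("flops_any_avg:", point["mean"])
```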