common standards - RRZE-HPC/DFG-PE GitHub Wiki

Discussion of common standards for the monitoring infrastructure

Metrics

Metric set

To ensure some level of interoperability between the different software stacks, we should probably start with a common set of metrics that need to be collected (or maybe two sets: one mandatory and one optional). In the ProfiT-HPC project, we have tried to formalize a rather extensive set of metrics, so it may make sense to start from that set and see how it goes.

There is an online version of the specification (still a work in progress, so please bear with us) on a server in Hanover. An example data set (artificial, basically serving as a placeholder) is also available.

To avoid data duplication, we can discuss changes and improvements here; they will be reflected in the new version of the specification as soon as a consensus emerges.

The metric and metadata specifications created within the ProPE project are documented here and can serve as an additional basis for discussion.

Metric format

In ProfiT-HPC we use the JSON format (as shown in the example above). Should it become a bottleneck in the future, there are plenty of options to explore: a faster JSON parser, a binary JSON format, an additional compression layer, or a different serialization approach altogether. If the serialization module is decoupled from the metric collection, the required code changes should be minimal.
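As a minimal sketch of that decoupling (not the ProfiT-HPC implementation), the collector below produces plain Python dictionaries and a small codec table decides how they are turned into bytes. The codec names, the `collect_sample` function, and all field names are assumptions made up for illustration; only the standard-library `json` and `gzip` modules are used.

```python
import gzip
import json
from typing import Any, Callable, Dict, Tuple

# Hypothetical codec registry: name -> (encode, decode) pair.
Codec = Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]


def _json_encode(record: Any) -> bytes:
    return json.dumps(record, sort_keys=True).encode("utf-8")


def _json_decode(payload: bytes) -> Any:
    return json.loads(payload.decode("utf-8"))


CODECS: Dict[str, Codec] = {
    "json": (_json_encode, _json_decode),
    # Same JSON payload with an additional compression layer on top.
    "json+gzip": (
        lambda record: gzip.compress(_json_encode(record)),
        lambda payload: _json_decode(gzip.decompress(payload)),
    ),
}


def collect_sample() -> dict:
    # Placeholder collector; the field names are illustrative only.
    return {
        "node": "node001",
        "metric": "cpu_load",
        "value": 0.75,
        "timestamp": 1500000000,
    }


def serialize(record: Any, codec: str = "json") -> bytes:
    encode, _ = CODECS[codec]
    return encode(record)


def deserialize(payload: bytes, codec: str = "json") -> Any:
    _, decode = CODECS[codec]
    return decode(payload)
```

Switching the wire format then means changing the `codec` argument (or registering another pair in `CODECS`), while `collect_sample` stays untouched.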

In general, we would advocate a structured data format over something like InfluxDB's line protocol, because data validation is easier and the metric data itself is inherently structured.
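To illustrate the validation argument, here is a small sketch that checks a parsed JSON record against a list of required fields and types. The schema is hypothetical and does not reproduce the ProfiT-HPC specification; a real deployment would more likely rely on a schema language than on hand-written checks.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "node": str,
    "metric": str,
    "value": (int, float),
    "timestamp": int,
}


def validate_record(record):
    """Return a list of problems; an empty list means the record is valid."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            errors.append(f"wrong type for field: {name}")
    return errors


good = json.loads(
    '{"node": "node001", "metric": "cpu_load", "value": 0.75, "timestamp": 1500000000}'
)
bad = json.loads('{"node": "node001", "value": "high", "timestamp": 1500000000}')

print(validate_record(good))  # no problems reported
print(validate_record(bad))   # reports the missing metric and the non-numeric value
```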

Interfaces

The most common components (at least at the Dresden Workshop) were/are:

  • Collectors
  • Storage
  • Analytics
  • Visualization/Notification/Archiving

There are multiple data paths that can be considered, for example:

  • Collector -> Storage -> Visualization (e.g. post-mortem introspection)
  • Collector -> Analytics -> Visualization (e.g. for failed jobs)
  • Collector -> Visualization (maybe useful for real-time monitoring)

To allow interoperability, every component should be decoupled and encapsulated behind a reasonably thin, standardized interface, and, to allow the different data paths, these interfaces should be compatible.
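The three data paths listed above can be sketched as follows, assuming every component consumes and produces the same kind of record stream. The component names, record fields, and the `run_path` helper are hypothetical and only meant to show why compatible interfaces make the paths interchangeable.

```python
from typing import Callable, Iterable, List

Record = dict


def collector() -> Iterable[Record]:
    # Stand-in collector emitting two artificial records.
    yield {"job": "1", "metric": "cpu_load", "value": 0.2}
    yield {"job": "2", "metric": "cpu_load", "value": 0.9}


class Storage:
    """Keeps every record it sees and passes it on unchanged."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def __call__(self, records: Iterable[Record]) -> Iterable[Record]:
        for record in records:
            self.records.append(record)
            yield record


def analytics(records: Iterable[Record]) -> Iterable[Record]:
    # Toy analysis: forward only records with a high load value.
    for record in records:
        if record["value"] > 0.5:
            yield record


def visualization(records: Iterable[Record]) -> List[str]:
    # Toy "visualization": render each record as one text line.
    return [f'{r["job"]}: {r["metric"]}={r["value"]}' for r in records]


def run_path(source: Iterable[Record], *stages: Callable):
    """Feed the source through the given stages in order."""
    data = source
    for stage in stages:
        data = stage(data)
    return data


storage = Storage()
stored_view = run_path(collector(), storage, visualization)      # Collector -> Storage -> Visualization
analysed_view = run_path(collector(), analytics, visualization)  # Collector -> Analytics -> Visualization
live_view = run_path(collector(), visualization)                 # Collector -> Visualization
```

Because every stage accepts and returns the same record stream, a path is just a different ordering of the same building blocks.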

One possible approach, as considered in ProfiT-HPC, is a REST-like interface. It has the benefits of being simple, well known, easily combined with JSON data (but not limited to it), and universal enough for this use case.
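A minimal sketch of what such a read-only interface for a Storage component could look like, using only the Python standard library. The `/jobs` and `/jobs/<id>` paths, the in-memory `STORE`, and the port are assumptions for illustration and do not describe the ProfiT-HPC interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store standing in for a real storage backend.
STORE = {
    "1": {"job": "1", "metrics": {"cpu_load": 0.2}},
    "2": {"job": "2", "metrics": {"cpu_load": 0.9}},
}


def handle_get(path, store=STORE):
    """Map a request path to a (status code, JSON body) pair."""
    parts = [p for p in path.split("/") if p]
    if parts == ["jobs"]:
        # Collection resource: list the known job identifiers.
        return 200, json.dumps(sorted(store))
    if len(parts) == 2 and parts[0] == "jobs" and parts[1] in store:
        # Item resource: return the stored record for one job.
        return 200, json.dumps(store[parts[1]])
    return 404, json.dumps({"error": "not found"})


class ReadOnlyHandler(BaseHTTPRequestHandler):
    # Only GET is implemented, so the interface stays read-only.
    def do_GET(self):
        status, body = handle_get(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    # Blocks until interrupted; call explicitly to start the server.
    HTTPServer(("127.0.0.1", port), ReadOnlyHandler).serve_forever()
```

Keeping the routing in the plain function `handle_get` means it can be exercised without a network socket; calling `serve()` exposes the same logic over HTTP.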

Again, an example read-only implementation of the Storage component (neither final nor optimized) is available online. The data itself is rather artificial and should not be used as a baseline for performance discussions.