Workshop HPCMASPA Sept 2017

Introduction

The HPCMASPA (Monitoring and Analysis for High Performance Computing Systems Plus Applications) workshop took place as part of the IEEE Cluster 2017 conference. There were three sessions on different topics:

  • Infrastructures and Analysis
  • Application-related
  • Networks + Panel

Thomas Gruber (Röhl) from RRZE reported on the LIKWID Monitoring Stack.

Session 1: Infrastructures and Analysis

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Authors

Park, Hukerikar, Adamson, and Engelmann from ORNL, USA

Summary

A data-analysis framework based on Cassandra (NoSQL DB), Spark (analysis), and a custom web frontend (D3 + HTML5 canvas) for analyzing log messages from the system and from applications. To be analyzed, applications must write their data to a log file (only application aborts are mentioned explicitly). The focus is only partly on performance data and more on text-based data (a view of the most interesting words that appear in the log data).
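
As an illustration of the kind of analysis this pipeline enables, here is a minimal sketch that computes word frequencies over log messages with Spark. It assumes the logs have already been ingested into Cassandra and are read via the spark-cassandra-connector; the keyspace, table, and column names are hypothetical, not those of the paper.

```python
# Minimal sketch: word-frequency analysis of system log messages with Spark.
# Keyspace/table/column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, lower, split

spark = SparkSession.builder.appName("log-word-frequencies").getOrCreate()

# Read the ingested log messages from Cassandra (requires the
# spark-cassandra-connector package on the Spark classpath).
logs = (spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="syslogs", table="console_messages")
        .load())

# Tokenize each message and count the most frequent words, a stand-in for
# the "most interesting words" view mentioned above.
word_counts = (logs
               .select(explode(split(lower(col("message")), r"\s+")).alias("word"))
               .groupBy("word")
               .count()
               .orderBy(col("count").desc()))

word_counts.show(20)
```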

Abstract

Today’s high-performance computing (HPC) systems are heavily instrumented, generating logs containing information about abnormal events, such as critical conditions, faults, errors and failures, system resource utilization, and about the resource usage of user applications. These logs, once fully analyzed and correlated, can produce detailed information about the system health, root causes of failures, and analyze an application’s interactions with the system, providing valuable insights to domain scientists and system administrators. However, processing HPC logs requires a deep understanding of hardware and software components at multiple layers of the system stack. Moreover, most log data is unstructured and voluminous, making it more difficult for system users and administrators to manually inspect the data. With rapid increases in the scale and complexity of HPC systems, log data processing is becoming a big data challenge. This paper introduces a HPC log data analytics framework that is based on a distributed NoSQL database technology, which provides scalability and high availability, and the Apache Spark framework for rapid in-memory processing of the log data. The analytics framework enables the extraction of a range of information about the system so that system administrators and end users alike can obtain necessary insights for their specific needs. We describe our experience with using this framework to glean insights from the log data about system behavior from the Titan supercomputer at the Oak Ridge National Laboratory.

Data Mining-based Analysis of HPC Center Operations

Authors

Klinkenberg, Terboven, Lankes, and Müller from RWTH Aachen

Summary

This paper presents an analysis for predicting system failures based on system temperature, fan speeds, power consumption, and network load. The architecture combines OpenTSDB (time-series DB) and Hadoop. Several points in time are aggregated into a frame. The labels of the frames in the test data are derived from the lock-up events in the batch system. A variety of statistics is computed for each frame. Several methods were considered for the analysis (logistic regression, decision tree, random forest, support vector machine, and multilayer perceptron). High precision, accuracy, and recall were achieved on the test data.
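
The frame-based prediction approach can be sketched in a few lines with scikit-learn. The feature set, frame length, and labels below are placeholders standing in for the monitoring data and batch-system lock-up events used in the paper.

```python
# Sketch of frame-based failure prediction: slice a metric time series into
# fixed-length frames, compute per-frame statistics as features, label frames
# (failure followed / healthy), and cross-validate classifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def frame_features(series, frame_len=60):
    """Compute simple descriptive statistics per frame of a 1-D metric series."""
    n_frames = len(series) // frame_len
    frames = series[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.column_stack([frames.mean(axis=1), frames.std(axis=1),
                            frames.min(axis=1), frames.max(axis=1)])

# Placeholder data: random values stand in for real node metrics and for
# labels derived from batch-system lock-up events.
rng = np.random.default_rng(0)
X = frame_features(rng.normal(size=60 * 500))
y = rng.integers(0, 2, size=X.shape[0])

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    scores = cross_val_score(clf, X, y, cv=10, scoring="precision")
    print(type(clf).__name__, round(scores.mean(), 3))
```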

Abstract

Size and complexity of contemporary High Performance Computing (HPC) systems increases permanently. While the reliability of a single component and compute node is high, the huge amount of components comprising these systems results in the fact that defects happen regularly. This drives the need to manage failure situations. Common issues are component failures or node soft lock-ups that typically lead to crashes of the user jobs that are scheduled on the affected node, and may cause undesired downtime. One approach to mitigate the impact of such problems is to predict node failures with a sufficient lead time in order to take proactive measures. However, accurate prediction is a challenging task. The literature describes several approaches that focus on gathering and analyzing system event logs in order to create prediction models. In this paper, we present a different approach by using descriptive statistics and supervised machine learning to create a prediction model from monitoring data. Our approach is based on the assumption, that features of a certain time frame before a critical event (i. e., a failure or soft lock-up) can serve as an indicator. Consequently, our model is trained with monitoring data from critical and healthy time frames. The evaluation with standard monitoring data collected from the HPC systems at RWTH Aachen University shows that our classifier is able to locate potentially failing nodes with a 10-fold cross precision of 98 % and recall of 91 %.

Monitoring Infrastructure: The Challenges of Moving Beyond Petascale

Authors

Bonnie, Illescas, and Mason from LANL, USA

Summary

The paper describes the redesign of a monitoring system to be ready for exascale. LDMS is used on the hosts of the system; the data is sent via aggregators to collectors. These then forward the data to RabbitMQ and to persistent storage. Zenoss is used as the frontend. It is being considered whether an additional data-analysis cluster will be necessary for exascale machines.
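
To illustrate the collector-to-RabbitMQ hop of this pipeline, a minimal sketch using the pika client follows. The exchange name, routing key, and message layout are assumptions made for illustration, not LANL's actual configuration.

```python
# Minimal sketch: a collector forwarding an aggregated node sample to RabbitMQ.
# Exchange, routing key, and message format are illustrative assumptions.
import json
import time

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="monitoring", exchange_type="topic")

sample = {
    "host": "node0001",
    "timestamp": time.time(),
    "metrics": {"cpu_util": 0.83, "mem_free_kb": 1203456},
}
channel.basic_publish(exchange="monitoring",
                      routing_key="metrics.node0001",
                      body=json.dumps(sample))
connection.close()
```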

Abstract

Scaling clusters is no longer the only struggle in moving towards exascale in HPC. While scaling components such as the network and file systems is a widely accepted need, monitoring, on the other hand, is often left behind in the procurement of these large systems. Monitoring is often quite an afterthought that is expected to be incorporated in existing infrastructure. While that often works for small systems, even petascale systems are starting to push the capabilities of monitoring infrastructure and their ability to collect and analyze complete system wide logs. The need and desire to do more cross-component relations will only get more complex with scale. Preparing for monitoring an exascale class machine is no small task. This paper presents the current redesign of our commodity monitoring infrastructure, the upgraded sub-system monitoring for Trinity, and ideas and concepts for moving towards exascale class monitoring.

Holistic Measurement Driven System Assessment

Authors

Jha, Brandt, Gentile, Kalbarczyk, Bauer, Enos, Showerman, Kaplan, Bode, Greiner, Bonnie, Mason, Iyer, and Kramer from UIUC, SNL, NCSA, Cray, NERSC, and LANL (all USA)

Summary

High-level talk on how monitoring can be improved.

Abstract

In high-performance computing systems, application performance and throughput are dependent on a complex interplay of hardware and software subsystems and variable workloads with competing resource demands. Data-driven insights into the potentially widespread scope and propagation of impact of events, such as faults and contention for shared resources, can be used to drive more effective use of resources, for improved root cause diagnosis, and for predicting performance impacts. We present work developing integrated capabilities for holistic monitoring and analysis to understand and characterize propagation of performance-degrading events. These characterizations can be used to determine and invoke mitigating responses by system administrators, applications, and system software.

Session 2: Application-related

Job Storage Performance Monitoring on Sonexion with Project Caribou

Authors

Flaskerud and Schumann from Cray (USA)

Summary

Describes how performance data of the Lustre backend servers can be measured on a per-job basis.

Abstract

This paper discusses the motivation and implementation for Cray’s Project Caribou. Project Caribou enables users to correlate HPC job performance with Lustre file systems through collected metrics and events. We will discuss use cases, the sources of metrics that are collected, correlation, and how the data is visualized. Additional topics to include events and alerts that are available, as well as data retention and reduction challenges anticipated at scale.

LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses

Authors

Röhl, Eitzinger, Hager, and Wellein from the Regionales Rechenzentrum Erlangen (RRZE)

Summary

Abstract

System monitoring is an established tool to measure the utilization and health of HPC systems. Usually system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological cases, provides instant performance feedback to the users, offers initial data to judge on the optimization potential of applications and helps to build a statistical foundation about application specific system usage. The LIKWID monitoring stack is a modular framework built on top of the LIKWID tools library. It aims at enabling job specific performance monitoring using HPM data, system metrics and application-level data for small to medium sized commodity clusters. Moreover, it is designed to integrate in existing monitoring infrastructures to speed up the change from pure system monitoring to job-aware monitoring.
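
As a rough illustration of job-aware hardware performance monitoring (not the actual LIKWID Monitoring Stack code), the sketch below runs an application under likwid-perfctr and tags the measurement with the batch job ID. The performance group, core list, and output handling are simplified assumptions.

```python
# Hedged sketch: run an application under likwid-perfctr and attach the
# batch-system job ID to the measurement so it can be pushed to a
# monitoring backend. Group, core list, and parsing are illustrative only.
import os
import subprocess

job_id = os.environ.get("SLURM_JOB_ID", "unknown")  # assumes a Slurm batch system

cmd = ["likwid-perfctr", "-C", "0-3", "-g", "MEM_DP", "-O", "./a.out"]
result = subprocess.run(cmd, capture_output=True, text=True)

record = {
    "job_id": job_id,
    "hostname": os.uname().nodename,
    "raw_output": result.stdout,  # CSV-style output because of -O; parsing omitted
}
print(record["job_id"], record["hostname"])
```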

lo2s – Multi-Core System and Application Performance Analysis for Linux

Authors

Ilsche, Schoene, Bielert, Gocht, and Hackenberg from Technische Universität Dresden

Summary

Sampling tool for measuring hardware performance counters at function level. Closely integrated with Vampir and other tools.

Abstract

In this paper we present lo2s – a lightweight performance monitoring tool to sample applications as well as the executing system. It enables the user to analyze the performance of a parallel application without requiring the time-consuming and error-prone process of application instrumentation. The collected performance data is complemented with various metric data, i.e., perf counters, kernel tracepoints, model specific registers, and custom metric data provided by plugins. Comprehensive visualization is enabled by compatibility with established tools.

YAViT (Yet Another Viz Tool): Raising the Level of Abstraction in End-User HPC Interactions

Authors

Aaziz, Panthi, and Cook from New Mexico State University, USA

Summary

Abstract

Because data collection in HPC systems happens on the nodes and is easily related to the job running on the node, tools presenting the data and subsequent analyses to the user generally present them at the job level. Our position is that this is the wrong level of abstraction and thus limits the value of the analyses, often dissuading users from using any of the offered tools. In this paper we present the position that tools need to present analyses at the level users are interested in, which is their applications.

Session 3: Networks + Panel

PFAnalyzer: A Toolset for Analyzing Application-aware Dynamic Interconnects

Authors

Takahashi, Date, Khureltulga, Kido, and Shimojo from the Cybermedia Center, Osaka University, Japan

Summary

Abstract

Recent rapid scale out of high performance computing systems has rapidly and continuously increased the scale and complexity of the interconnects. As a result, current static and over-provisioned interconnects are becoming cost-ineffective. Against this background, we have been working on the integration of network programmability into the interconnect control, based on the idea that dynamically controlling the packet flow in the interconnect according to the communication pattern of applications can increase the utilization of interconnects and improve application performance. Interconnect simulators come in handy especially when investigating the performance characteristics of interconnects with different topologies and parameters. However, little effort has been put towards the simulation of packet flow in dynamically controlled interconnects, while simulators for static interconnects have been extensively researched and developed. To facilitate analysis on the performance characteristics of dynamic interconnects, we have developed PFAnalyzer. PFAnalyzer is a toolset composed of PFSim, an interconnect simulator specialized for dynamic interconnects, and PFProf, a profiler. PFSim allows interconnect researchers and designers to investigate congestion in the interconnect for an arbitrary cluster configuration and a set of communication patterns collected by PFProf. PFAnalyzer is used to demonstrate how dynamically controlling the interconnects can reduce congestion and potentially improve the performance of applications.

Understanding Performance Variability on the Aries Dragonfly Network

Authors

Groves, Gu, and Wright from NERSC, USA

Summary

Abstract

This work evaluates performance variability in the Cray Aries dragonfly network and characterizes its impact on MPI Allreduce. The execution time of Allreduce is limited by the performance of the slowest participating process, which can vary by more than an order of magnitude. We utilize counters from the network routers to provide a better understanding of how competing workloads can influence performance. Specifically, we examine the relationships between message size, process counts, Aries counters and the Allreduce communication-time. Our results suggest that competing traffic from other jobs can significantly impact performance on the Aries Dragonfly Network. Furthermore, we show that Aries network counters are a valuable tool, explaining up to 70% of the performance variability for our experiments on a large-scale production system.
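
To make the "explaining up to 70% of the performance variability" statement concrete, the toy sketch below regresses Allreduce times on counter-like features and reports R², the fraction of variance explained. The data is synthetic; the paper uses actual Aries router counters and a different analysis.

```python
# Toy sketch: relate network-counter features to MPI_Allreduce times and
# report the fraction of variance explained (R^2). Data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
counters = rng.normal(size=(1000, 4))  # stand-ins for router counter features
allreduce_time = (2.0 + counters @ np.array([0.5, 0.3, 0.0, 0.1])
                  + rng.normal(scale=0.5, size=1000))

model = LinearRegression().fit(counters, allreduce_time)
print("R^2:", round(r2_score(allreduce_time, model.predict(counters)), 2))
```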

Measuring Minimum Switch Port Metric Retrieval Time and Impact for Multi-Layer Infiniband Fabrics

Authors

Aguilar, Allan, and Polevitzky from SAIC and SNL, USA

Summary

Abstract

In this work, we seek to gain an understanding of the InfiniBand network processing limitations that might exist in gathering performance metric information from InfiniBand switches using our new LDMS ibfabric sampler. The limitations studied consist of delays in gathering InfiniBand metric information from a specific switch device due to the switch’s processor response delays or RDMA contention for network bandwidth.
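
A small sketch of this kind of measurement: timing how long it takes to read the port counters of one switch with the standard perfquery diagnostic. The LID and port number are placeholders, and this only illustrates the retrieval-delay measurement, not the LDMS ibfabric sampler itself.

```python
# Sketch: time the retrieval of InfiniBand port counters from a single switch
# using the perfquery diagnostic. LID and port are placeholders.
import subprocess
import time

SWITCH_LID = 42  # placeholder switch LID
PORT = 1         # placeholder port number

start = time.perf_counter()
result = subprocess.run(["perfquery", str(SWITCH_LID), str(PORT)],
                        capture_output=True, text=True)
elapsed = time.perf_counter() - start
print(f"perfquery returned {result.returncode}, took {elapsed * 1000:.2f} ms")
```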

Panel: Application Performance Insights from System + Application Level Monitoring

Participants: Schumann, Röhl, Ilsche, Park