Final Report

Abstract

The Mochi project explores a software-defined storage approach for composing storage services that provides new levels of functionality, performance, and reliability for science applications at extreme scale. One key component of Mochi is the use of OpenFabrics Interfaces (OFI), a framework focused on exporting fabric communication services to applications. Libfabric is a core component of OFI. We have developed a test suite, fabtsuite, to assess libfabric for the features that the Mochi project requires.

Introduction

The Mochi project [1] provides a set of software building blocks for rapid construction of custom, composable HPC data services. These building blocks rely upon the libfabric [2] network transport library, by way of the Mercury RPC library [3], for high-performance network support on most HPC platforms.

Mochi and Mercury impose specific requirements on network transport libraries, and libfabric’s OFI interface meets all of these requirements. We have found, however, that many of the features that we rely upon are not as thoroughly tested as the features that are more routinely used in other environments (e.g., by message-passing libraries). The objective of the Mochi libfabric test suite, fabtsuite, is to evaluate the specific libfabric features required by Mochi in a controlled, reproducible manner, with minimal external dependencies, so that stakeholders can quickly assess the suitability of different providers for use in HPC data services.

Problem

We identify seven features that Mochi requires but that are not thoroughly tested in exascale HPC environments.

FI_WAIT_FD File Descriptor

FI_WAIT_FD provides a file descriptor that users can block on (without busy spinning) until there are events to process in the libfabric interface. This allows data services to minimize compute resource usage when they are idle while still responding in a timely manner to new requests.
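
As an illustration, the sketch below blocks on the descriptor with poll(2). The function and variable names are hypothetical (this is not fabtsuite code), the completion queue is assumed to have been opened with wait_obj = FI_WAIT_FD, and error handling is abbreviated.

#include <poll.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Block until the CQ has a completion, sleeping on the FI_WAIT_FD
 * descriptor instead of busy-polling. */
static int wait_for_completion(struct fid_fabric *fabric, struct fid_cq *cq)
{
    struct fi_cq_entry entry;
    struct fid *fids[1] = { &cq->fid };
    struct pollfd pfd = { .events = POLLIN };
    ssize_t n;
    int rc;

    /* Ask the CQ for the file descriptor backing its wait object. */
    rc = fi_control(&cq->fid, FI_GETWAIT, &pfd.fd);
    if (rc != 0)
        return rc;

    for (;;) {
        /* Reap a completion if one is already available. */
        n = fi_cq_read(cq, &entry, 1);
        if (n > 0)
            return 0;
        if (n < 0 && n != -FI_EAGAIN)
            return (int)n;

        /* Sleep only when the provider says it is safe to block. */
        if (fi_trywait(fabric, fids, 1) == FI_SUCCESS)
            poll(&pfd, 1, -1);
    }
}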

fi_cancel() Function

fi_cancel() is used to cancel pending asynchronous operations and permit the reuse of memory buffers that were previously associated with those operations. This is necessary for graceful shutdown of persistent services that have pending buffers posted. It is also necessary for fault tolerant use cases, in which a process must cancel pending operations before re-issuing them to a secondary server.
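
The sketch below shows one way this pattern can look. The helper name and its arguments are hypothetical assumptions rather than fabtsuite code, and a real program would also handle completions for other outstanding operations.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Cancel the pending operation posted with `op_context` and wait for its
 * FI_ECANCELED completion so the associated buffer can be reused. */
static int cancel_pending_op(struct fid_ep *ep, struct fid_cq *cq,
                             void *op_context)
{
    struct fi_cq_entry entry;
    struct fi_cq_err_entry err = { 0 };
    ssize_t n;

    /* Request cancellation of the asynchronous operation. */
    n = fi_cancel(&ep->fid, op_context);
    if (n != 0)
        return (int)n;

    for (;;) {
        n = fi_cq_read(cq, &entry, 1);
        if (n == -FI_EAVAIL) {
            /* The canceled operation is reported on the CQ error path. */
            n = fi_cq_readerr(cq, &err, 0);
            if (n == 1 && err.err == FI_ECANCELED &&
                err.op_context == op_context)
                return 0;              /* buffer is now safe to reuse */
        } else if (n < 0 && n != -FI_EAGAIN) {
            return (int)n;             /* unexpected CQ error */
        }
        /* Successful completions (n > 0) for other operations would be
         * handled by the caller in a real program. */
    }
}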

Cross-Job Communication

HPC applications or collections of processes are typically launched with a parallel job launcher, such as mpiexec, aprun, or srun. For many data service use cases, we need to be able to launch service daemons separately from applications. The separately launched collections of processes must be able to communicate with each other, exchanging both messages and RMA transfers.

Graceful Resource Cleanup

Mercury uses connectionless RDM endpoints in libfabric, but some providers internally allocate resources or memory buffers on a per-peer basis. We need to make sure that these resources are correctly released when those peers exit. In the data service use case, a single daemon may be relatively long-lived as a large number of clients connect to it and exit over time.

Multithreading

HPC systems are trending toward many cores sharing a single network interface. Data services will likely use multithreading to make more efficient use of these systems. We need to ensure that multithreaded workloads operate correctly and do not suffer significant performance degradation.

Vectored I/O

HPC applications readily generate structured data. For example, a column of a multidimensional array may describe particles in a molecular dynamics code that generates trillions of rows. Describing such non-contiguous requests with a list of offset-length vectors, and then transferring each request in a single operation, gives the network transport a chance to deliver the data with lower latency and higher bandwidth.
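
As a hedged illustration (the helper name, buffer layout, and registration descriptors below are assumptions, not fabtsuite code), a vectored RMA write with libfabric can look like this:

#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* RMA-write two non-contiguous local segments to one contiguous remote
 * region with a single vectored call.  The buffers, registration
 * descriptors, remote address, and key are assumed to be set up by the
 * caller. */
static ssize_t write_two_segments(struct fid_ep *ep, fi_addr_t peer,
                                  void *seg0, void *seg1, size_t seglen,
                                  void *desc0, void *desc1,
                                  uint64_t remote_addr, uint64_t remote_key,
                                  void *context)
{
    /* Describe the non-contiguous source as an offset-length list. */
    struct iovec iov[2] = {
        { .iov_base = seg0, .iov_len = seglen },
        { .iov_base = seg1, .iov_len = seglen },
    };
    void *desc[2] = { desc0, desc1 };

    /* One operation, one completion, for both segments. */
    return fi_writev(ep, iov, desc, 2, peer, remote_addr, remote_key, context);
}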

MPI Interoperability

When an application uses both MPI and libfabric, MPI may use libfabric in a way that conflicts with the application's use of libfabric.

Solution

One Program, Two Applications

We have developed fabtsuite, which consists of a single test program that runs either as a transmitter (fabtput) or as a receiver (fabtget).

[Figure: fabtget.svg]

Using command-line options, a user can select different operating modes to compare single- and multi-threaded (MT) libfabric operation, compare contiguous (fi_write) and vector (fi_writev) transfers, and so on.

option fabtget fabtput
cancel -c -c
contiguous -g
reregister -r -r
wait -w -w
session -n -n, -k
thread -p 'i-j' -p 'i-j'

fabtput RDMA-writes multiple non-contiguous buffer segments at once using a single fi_writemsg(3) call per transfer.
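
A minimal sketch of such a call is shown below. The wrapper name and parameters are hypothetical, and the local and remote I/O vectors are assumed to be prepared elsewhere.

#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Bundle a list of local segments and a list of remote regions into a
 * single RMA write.  iov/desc describe the local gather list and rma_iov
 * the remote regions; all are assumed to be prepared by the caller. */
static ssize_t write_gather(struct fid_ep *ep, fi_addr_t peer,
                            const struct iovec *iov, void **desc, size_t niov,
                            const struct fi_rma_iov *rma_iov, size_t nrma,
                            void *context)
{
    struct fi_msg_rma msg = {
        .msg_iov = iov,          /* local segments to gather */
        .desc = desc,            /* matching registration descriptors */
        .iov_count = niov,
        .addr = peer,
        .rma_iov = rma_iov,      /* remote target regions */
        .rma_iov_count = nrma,
        .context = context,
    };

    /* FI_COMPLETION asks the provider to generate a completion entry. */
    return fi_writemsg(ep, &msg, FI_COMPLETION);
}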

Single Local Test Script

We wrote a single test script that exercises all options on a single machine. This script helps users verify that the two applications can communicate properly.

The script uses several environment variables to adjust test parameters.

Name Purpose
FABTSUITE_CANCEL_TIMEOUT Cancel the transfer after the given number of seconds. The default is 2 seconds.
FABTSUITE_RANDOM_FAIL Fail randomly. The default is no.
FI_MR_CACHE_MAX_SIZE Setting this to 0 disables the memory-registration cache.

The script tests a set of command-line options in two phases: 1) get and 2) put. The following table summarizes the option set for each phase.

option set get phase put phase
default Y Y
cacheless Y Y
reregister Y Y
cacheless,reregister Y Y
wait Y Y
contiguous N Y
contiguous,reregister N Y
contiguous,reregister,cacheless N Y

As the table shows, the script tests the contiguous option in the put phase only. When the cacheless option is tested, the FI_MR_CACHE_MAX_SIZE environment variable is set to 0.

Command-line options are supplied either to fabtget during the get phase or to fabtput during the put phase. For example, if fabtget starts with the -w option, fabtput runs without it. The one exception is the cancel test, which requires both commands to be given the -c option.

Multiple Scripts for Multi-node Test using CTest

CTest

CTest is a free testing framework that is integrated with CMake. A CTest test is any command that returns an exit code; it does not matter how the command is issued or what it runs.

The goal of using CTest is to allow users to compile the test suite with CMake, run the front-end CTest script on a login node, and dispatch batch jobs to the compute nodes to execute individual tests. This architecture makes it easy to split off individual test cases for more detailed debugging and analysis.

CTest allows users to run a specific test easily. For example, you can run only the wait feature test using the -I option.

$ ctest -I 2,2
Test project /home/user/fabtsuite/build
    Start 2: FI_WAIT_FD
1/1 Test #2: FI_WAIT_FD .......................   Passed   74.32 sec

100% tests passed, 0 tests failed out of 1

The -I option can specify a range of tests as well. For example, if you use -I 2,3, it will run both wait and cancel feature tests.

Multi-node Test

We wrote one test script for single-node testing and several parallel batch-job scripts for multi-node testing. All scripts can be run through CTest with the single command make test after compiling the test suite with make. For batch jobs, we support both PBS and SLURM.

Each multi-node test job first submits a fabtget job on one node and then one or more fabtput jobs on other nodes. The diagram below illustrates the multi-node testing workflow.

[Figure: mnode.svg]

Experiment

We tested five features on one local system (A) for the single-node test and on two ECP systems (B, C) for the multi-node test. The following table summarizes the pass (P) / fail (F) results.

Feature A B C
wait P P P
cancel P F F
cross P F P
thread P P P
vector P P P
MPI-IO N N N

The MPI-IO feature test, which refers to MPI interoperability, is not yet implemented (N). However, we include a provisional script in CTest for future work.

At this stage, the test suite is functionally correct and can execute on a baseline TCP provider with libfabric 1.15.

Discussion

The wait test requires a slightly longer time allocation than the other tests. Otherwise, the receiver job will not finish on time or generate output.

The cancel test failure on system C was caused by an interrupt signal:

fabtget: caught a signal, exiting.
real 1.99
user 0.53
sys 1.38
1
srun: error: SystemB: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=186257.0

The cross test failure on system B is due to a job queue restriction that does not allow jobs to run on more than 3 machines.

Resource leaks were checked once using AddressSanitizer, although this check is not part of the test suite.

Conclusion

Although libfabric provides an extensive set of unit tests, our test suite provides an alternative set of exascale-oriented tests for ECP systems.

Our work suggests that the next steps are to get the framework itself working with a broader collection of libraries and additional (RDMA-capable) providers, to employ it in existing Mochi regression testing, and to expand the test cases over time.

Future Work

Additional work is needed to validate the test suite and its results on other versions of libfabric and with other libfabric providers.

Our test suite is fully open source, and anyone can extend it. One can add a new command-line option to test a new feature, or add new C code that uses MPI APIs to test MPI interoperability.

We provide sample test scripts only for ECP systems that use PBS and SLURM. They can be modified to run on other schedulers, such as Cobalt.

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

Glossary

ECP: Exascale Computing Project

MPI: Message Passing Interface

MR: Memory Registration

RDMA: Remote Direct Memory Access

RMA: Remote Memory Access

PBS: Portable Batch System

RPC: Remote Procedure Call

SLURM: Simple Linux Utility for Resource Management

TCP: Transmission Control Protocol

References

  1. Mochi: https://mochi.readthedocs.io/
  2. libfabric: https://ofiwg.github.io/libfabric/
  3. Mercury: https://mercury-hpc.github.io/