Measuring Runtime Performance

This page describes a methodology for benchmarking PMIx launch performance. It will be updated regularly with examples and notes on best practices for benchmarking PMIx-based application launch and for projecting to exascale+ systems.

Some dimensions of analysis:

  • Node scaling
  • PPN (Processes-per-node) scaling
  • Memory footprint

Current versions tested

  • PMIx Version: 2.1.1 release (or 2.1.x branch HEAD)
  • Open MPI Version: v3.1.x branch (DVM launch)
  • PMIx Reference Server: master, includes PMIx v3.0alpha

How to test the "direct launch" mode

In the direct-launch use-case, the underlying resource manager (RM) or RTE is directly responsible for launching all processes – i.e., there is no intermediate launcher such as mpiexec. The user invokes a tool (typically on a non-compute, or “head”, node) that connects to a system-level PMIx server hosted by the RM/RTE, constructs a PMIx_Spawn command, and communicates that command to the server.

The system-level PMIx server “uplifts” the spawn request to its host RM daemon for processing. The RM subsequently initiates the launch process, sending the application launch command to its daemons on the allocated nodes, which then start their local client processes.
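For reference, the sketch below shows roughly what such a tool does using the PMIx tool API. This is only an illustrative sketch assuming a PMIx v2-style tool interface - it is not the source of any actual launcher, error handling is minimal, and the executable and process count are placeholders.

/* Hypothetical spawn-tool sketch: connect to the system-level PMIx
 * server on this node and ask it to spawn an application. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <pmix_tool.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_info_t info;
    pmix_app_t app;
    char nspace[PMIX_MAX_NSLEN+1];
    bool flag = true;

    /* request a connection to the system-level server */
    PMIX_INFO_LOAD(&info, PMIX_CONNECT_TO_SYSTEM, &flag, PMIX_BOOL);
    if (PMIX_SUCCESS != PMIx_tool_init(&myproc, &info, 1)) {
        fprintf(stderr, "could not connect to a system-level PMIx server\n");
        return 1;
    }

    /* describe the application to be launched */
    PMIX_APP_CONSTRUCT(&app);
    app.cmd = strdup("/bin/true");              /* placeholder executable */
    app.argv = (char**)calloc(2, sizeof(char*));
    app.argv[0] = strdup("/bin/true");
    app.maxprocs = 4;                           /* placeholder process count */

    /* the server "uplifts" this request to its host RM for launch */
    if (PMIX_SUCCESS != PMIx_Spawn(NULL, 0, &app, 1, nspace)) {
        fprintf(stderr, "spawn request failed\n");
    } else {
        printf("spawned job nspace: %s\n", nspace);
    }

    PMIX_APP_DESTRUCT(&app);
    PMIx_tool_finalize();
    return 0;
}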

Testing direct launch is therefore relatively easy - one simply launches the MPI app using the relevant native launcher (e.g., srun). Note that OMPI v3.1 uses PMIx v2.1, which means you will need to run against either the ORTE Distributed Virtual Machine (DVM), the PMIx reference server (PRSVR), or Slurm 17.11 (configured with PMIx v2.1).
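For example, with Slurm 17.11 built against PMIx v2.1, a direct launch looks something like the following (node and process counts are placeholders):

$ srun --mpi=pmix -N 16 -n 1024 ./mpi_init_finalize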

Note that the PRSVR is identical to the DVM - it is simply provided as a standalone code base, minus the MPI and SHMEM layers found in OMPI. Thus, instructions on its use are identical to those of the DVM, replacing "orte-dvm" with the "psrvr" command.

How to launch the ORTE DVM

The ORTE DVM launches a persistent daemon on each compute node. Once the persistent daemons are established, applications can be launched without incurring the overhead of daemon startup. This emulates a direct launch mode in a PMIx-enabled RM environment. The easiest way to use it when you have sole use of the nodes is to launch it with:

$ orte-dvm --system-server

If you are not running in an allocation, then you should add a hostfile or dash-host argument to specify the resources. This will launch a daemon on each node in the allocation, and create a rendezvous file in the top-level session directory.
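For example, with a hostfile (path is a placeholder):

$ orte-dvm --system-server --hostfile <path-to-hostfile>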

Note that only one DVM with the --system-server option can be running on a given node at a time, as the rendezvous files would otherwise overwrite each other. Compute nodes can be shared by multiple DVMs.

There are several MCA parameters that relate to launch performance - these must be given to the DVM (as opposed to the application start command line) since they impact DVM behavior (an example invocation follows the list):

  • odls_base_num_threads - Number of threads to use for spawning local procs. Defaults to zero, meaning that all local procs will be started by the main ORTE thread (and therefore, serialized). Experimentation indicates that a value of 4 was beneficial for fully-loaded KNL machines.
  • oob_base_num_progress_threads - Number of independent progress threads for OOB messages on each interface. Defaults to zero, meaning that all messages use the main ORTE thread for progress. When running at large scale, the launch message will be sent by each daemon to a number of children equal to the fanout value (e.g., the radix of the routed/radix component, which defaults to 64). Adding OOB progress threads allows these messages to be sent in parallel. Experimentation indicates that a value of 8 was beneficial for large DVMs.
  • routed_radix - Radix to be used for routed radix tree. Defaults to 64. The bigger the number, the fewer routing steps to reach the full allocation. Keeping the routing depth (i.e., the number of routing steps to span the cluster) at 3 or less is important to launch performance. Note that one should definitely increase the number of oob progress threads as the radix increases.
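For example, a plausible DVM invocation for a large allocation, using the values suggested above, would be:

$ orte-dvm --system-server --mca odls_base_num_threads 4 --mca oob_base_num_progress_threads 8 --mca routed_radix 64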

Running applications with the DVM or Reference Server

Regardless of whether you are using the ORTE DVM or the PRSVR, applications are executed using the prun command. prun must be executed on the same node as the local DVM/PRSVR master, and will automatically rendezvous with it (so no rendezvous cmd line options need be given). The prun cmd line is identical to that of mpirun.

NOTE: the biggest usage difference here is to remember that mpirun cmd line options that impact the behavior of mpirun itself and/or its daemons must be included on the DVM cmd line, and not on the cmd line of prun.

Command line options applying to the application itself can be specified on the prun cmd line and will apply solely to that specific application. This includes dash-host options, mapping directives, and (most importantly) directives pertaining to the async modex and the MPI init/finalize barriers. Most relevant to the performance testing (an example prun invocation follows the list):

  • pmix_base_async_modex - Use asynchronous modex mode. Defaults to FALSE. If no data is being collected, then the modex fence is simply not executed. If data is being collected, then the modex fence is executed in the background.
  • pmix_base_collect_data - Collect all data during modex. Defaults to TRUE.
  • async_mpi_init - Do not execute a barrier operation at the end of MPI_Init. Defaults to FALSE, meaning that a barrier will be executed. Note that if the modex is executing in the background, then the code will block while waiting for it to complete and a separate fence operation will not be executed as there is no reason to do two barriers.
  • async_mpi_finalize - Do not execute a barrier operation at the beginning of MPI_Finalize. Defaults to FALSE.
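For example, to run the MPI test with the asynchronous modex and without the MPI_Init barrier, these options go on the prun cmd line (process count and program name are placeholders):

$ prun -np 1024 --mca pmix_base_async_modex 1 --mca async_mpi_init 1 ./mpi_init_finalize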

Test Programs

Four types of applications:

  1. /bin/true : Baseline launch performance
  2. pmix_init_finalize : PMIx_Init/PMIx_Finalize (should be really fast; a minimal sketch follows this list)
  3. mpi_init_finalize : MPI_Init/MPI_Finalize (need a PML/BTL that supports async modex)
  4. pmix_perf : Microbenchmark for PMIx operations (e.g., put, get, fence, commit), also gathers memory consumption information
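As a rough illustration of item 2, a pmix_init_finalize-style program is essentially the following (a sketch only, not the actual test source):

/* Minimal PMIx client: initialize and finalize only. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    rc = PMIx_Init(&myproc, NULL, 0);   /* register with the local PMIx server */
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %d\n", rc);
        return 1;
    }
    rc = PMIx_Finalize(NULL, 0);        /* deregister from the server */
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}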

How to test with /bin/true

TODO: Josh

How to test with pmix_init_finalize

TODO: Josh

How to test with mpi_init_finalize

TODO: Josh

How to test with pmix_perf

pmix_perf is located in <pmix-root>/contrib/perf_tools/

Building

  • Set PMIX_BASE in the Makefile to point at your PMIx installation
  • Run make
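For example (the install prefix is a placeholder):

$ cd <pmix-root>/contrib/perf_tools
$ # edit the Makefile so that PMIX_BASE points at your PMIx install prefix
$ make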

Running

  • pmix_perf has the following options available:
$ ./pmix_intra_perf -h
Usage:
  ./pmix_intra_perf [options]              start the benchmark

  -s, --key-size=<size>     size of the key's submitted
  -c, --key-count=<size>    number of keys submitted to local and remote parts
  -d, --direct-modex        use direct modex if available
  --debug                   force all processes to print out the timings themself
  • The -s option determines the key size; for evaluation purposes I suggest using a key size of 50B (so it will be consistent across all measurements). Please update here if any specific sizes are preferred.
  • The -c option determines the number of keys per rank. To stay close to reality, the number of keys should be relatively small. For example, currently with Open MPI and pml/ucx only one key is submitted (see the modex internals study). I'd speculate that if BTLs are used the number of keys may be around 5, so we can measure with both key counts: 1 and 5.
  • The -d option turns on direct modex; by default full modex mode is used.

Suggested MPI command line

np=<number of procs>
mpirun -np $np --bind-to core `pwd`/pmix_intra_perf <pmix-perf-options>

Note that binding to core or hwthread (depending on the configuration) gives significantly better results, at least in my experiments (see ...).
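Putting the suggestions above together (process count is a placeholder):

$ np=1024
$ mpirun -np $np --bind-to core `pwd`/pmix_intra_perf -s 50 -c 1
$ mpirun -np $np --bind-to core `pwd`/pmix_intra_perf -s 50 -c 5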

How to test with the scaling.pl script

TODO: Ralph
