profiling - noma/dm-heom GitHub Wiki

Built-in Profiling

Especially for the distributed applications, HEOM provides a variety of built-in profiling features that give insight into runtime, memory, and communication characteristics. All output goes into a profiling sub-directory relative to the current working directory.

The applications print summary tables at the end of the run and produce multiple files, some of which are intended for further processing by scripts rather than for direct reading.

Application Output

The summary tables summarise the profiling data by showing some key metrics for the best and the worst performing rank of the job. Which rank was best and which was worst is determined by comparing the time spent in *solver step* (see below). The worst rank limits the performance of the whole job and is thus the more interesting column; comparison with the best performing rank reveals imbalances, which indicate a sub-optimal graph partitioning. The shown runtime sums are the accumulated timer values over all counted calls, and the average (avg) refers to all counted timings of the shown rank.
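How the best and the worst rank are found is not visible in the output itself; as an illustration only (an assumption, not necessarily how dm-heom implements it), each rank's accumulated solver step time could be reduced with MPI_MINLOC / MPI_MAXLOC:

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // stand-in for this rank's accumulated 'solver step' timer value
    double solver_step_sum = 170.0 + (rank % 10);

    struct value_rank { double value; int rank; };
    value_rank local{ solver_step_sum, rank }, best{}, worst{};

    // MPI_DOUBLE_INT matches a (double, int) pair for MINLOC/MAXLOC reductions
    MPI_Reduce(&local, &best,  1, MPI_DOUBLE_INT, MPI_MINLOC, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("best: rank %d (%.2f s), worst: rank %d (%.2f s)\n",
                    best.rank, best.value, worst.rank, worst.value);

    MPI_Finalize();
    return 0;
}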

----------------------------------------------------------------------------------------------------
                                       Solver Runtime Summary
----------------------------------------------------------------------------------------------------
                              |         |   sum (best) |  sum (worst) |   avg (best) |   avg (worst)
timer                         |   count |     rank 341 |     rank 424 |     rank 341 |      rank 424
----------------------------------------------------------------------------------------------------
solver step                   |      99 |     178.10 s |     179.95 s |   1798.99 ms |    1817.71 ms
  heom_ode kernel             |     396 |       0.51 s |       0.53 s |      1.29 ms |       1.34 ms
  rk_weighted_add kernel      |     396 |       0.09 s |       0.10 s |      0.24 ms |       0.24 ms
  hierarchy_norm kernel       |       0 |       0.00 s |       0.00 s |      0.00 ms |       0.00 ms
----------------------------------------------------------------------------------------------------

The solver runtime summary currently provides 4 time statistics.

  • solver step is the duration of 1 physical time step, including everything necessary like running computational kernels and doing network communication. It is the most important runtime metric for characterising the application performance.
  • heom_ode kernel is the runtime of the main computation, i.e. evaluating the ordinary differential equation of the HEOM model.
  • rk_weighted_add kernel is a generic numerics kernel used by the Runge-Kutta integration.
  • hierarchy_norm kernel computes a norm of the hierarchy; it is currently not used in production.

When using a Runge-Kutta 4 method, the ODE is evaluated four times per solver step and four weighted add operations are needed, so these kernels are counted four times as often as the solver step. Every evaluation of the ODE also requires a prior neighbour exchange, which has its own table (see below).
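The resulting call counts and averages follow directly from that structure; a minimal, self-contained sketch (the functions below are empty stand-ins for the profiled operations, not dm-heom code):

#include <cstdio>

// hypothetical stand-ins for the profiled operations; the real work happens
// in OpenCL kernels and MPI calls and is not reproduced here
void neighbour_exchange() {}
void heom_ode_kernel() {}
void rk_weighted_add_kernel() {}

int main()
{
    const int solver_steps = 99;        // 'solver step' count from the table
    int ode_calls = 0, add_calls = 0;

    for (int step = 0; step < solver_steps; ++step) {
        // classical Runge-Kutta 4: four stages per physical time step,
        // each stage preceded by a halo update of the hierarchy graph
        for (int stage = 0; stage < 4; ++stage) {
            neighbour_exchange();
            heom_ode_kernel();          ++ode_calls;
            rk_weighted_add_kernel();   ++add_calls;
        }
    }

    // prints 396 for both, matching the kernel counts above; the shown
    // averages are sum / count, e.g. 178.10 s / 99 steps = 1798.99 ms
    std::printf("heom_ode: %d calls, rk_weighted_add: %d calls\n",
                ode_calls, add_calls);
    return 0;
}
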
----------------------------------------------------------------------------------------------------
                                 Neighbour Exchange Runtime Summary
----------------------------------------------------------------------------------------------------
                              |         |   sum (best) |  sum (worst) |   avg (best) |   avg (worst)
timer                         |   count |     rank 341 |     rank 424 |     rank 341 |      rank 424
----------------------------------------------------------------------------------------------------
read buffer                   |     396 |       0.24 s |       0.30 s |      0.60 ms |       0.75 ms
neighbour exchange            |     396 |     177.17 s |     178.94 s |    447.40 ms |     451.86 ms
write buffer                  |     396 |       0.03 s |       0.03 s |      0.08 ms |       0.08 ms
----------------------------------------------------------------------------------------------------

The neighbour exchange runtime summary lists 3 runtime statistics:

  • read buffer is the time spent transferring the data to be sent from OpenCL device memory into host memory. If zero copy is configured in the OpenCL configuration file, the buffers are mapped into host memory instead and no copy takes place (see the sketch after this list). Whether or not zero copy was configured can be found in the OpenCL Runtime Configuration table printed during initialisation of the application.
  • neighbour exchange is the time spent in the actual neighbour exchange operation. The current communicator implementation uses MPI_Neighbor_alltoallw() (see the example after this list). Due to the hierarchy graph's highly connected structure, this is usually the most expensive part of a solver step.
  • write buffer is the time spent transferring the data received during the neighbour exchange from host memory into OpenCL device memory. If zero copy is configured, this is just a memory unmap operation.
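Two kinds of work happen around every exchange: buffer transfers between OpenCL device and host memory, and the collective communication itself. The sketch below is illustrative only; the helper function and its parameters are hypothetical, but the two OpenCL calls are the standard map and copy paths that the zero copy setting chooses between.

#include <CL/cl.h>

// hypothetical helper: make data from an OpenCL buffer readable on the host;
// with zero copy the buffer is mapped (no transfer), otherwise its contents
// are copied into a host staging area
void* read_halo_buffer(cl_command_queue queue, cl_mem buffer, size_t size,
                       bool zero_copy, void* host_staging)
{
    if (zero_copy) {
        // map: returns a host pointer, released later via clEnqueueUnmapMemObject()
        return clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ,
                                  0, size, 0, nullptr, nullptr, nullptr);
    }
    // copy: blocking device-to-host transfer
    clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, size, host_staging,
                        0, nullptr, nullptr);
    return host_staging;
}

The exchange itself is a single collective over a distributed graph communicator whose neighbourhood reflects the hierarchy graph partitioning. A self-contained toy example of MPI_Neighbor_alltoallw() on a ring of ranks (the real neighbourhood, counts, and datatypes are derived from the partitioning and are not shown here):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // toy neighbourhood: the left and right neighbour in a ring
    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;
    int neighbours[2] = { left, right };

    // graph communicator describing who exchanges data with whom
    MPI_Comm graph_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, neighbours, MPI_UNWEIGHTED,
                                   2, neighbours, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &graph_comm);

    double send[2] = { (double)rank, (double)rank };
    double recv[2] = { 0.0, 0.0 };
    int counts[2] = { 1, 1 };
    MPI_Aint displs[2] = { 0, sizeof(double) };
    MPI_Datatype types[2] = { MPI_DOUBLE, MPI_DOUBLE };

    // one collective call performs the whole neighbour exchange
    MPI_Neighbor_alltoallw(send, counts, displs, types,
                           recv, counts, displs, types, graph_comm);

    std::printf("rank %d received %g (left) and %g (right)\n",
                rank, recv[0], recv[1]);

    MPI_Comm_free(&graph_comm);
    MPI_Finalize();
    return 0;
}
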
----------------------------------------------------------------------------------------------------
                                     Memory Allocation Summary
----------------------------------------------------------------------------------------------------
                                                  |         |       size (best) |       size (worst)
timer                                             |   count |          rank 341 |           rank 424 
----------------------------------------------------------------------------------------------------
local instance size                               |       1 |        374.00 MiB |         397.52 MiB 
  hierarchy buffer size                           |       1 |        372.08 MiB |         395.48 MiB 
    computed hierarchy nodes                      |       1 |        130.60 MiB |         128.53 MiB 
    halo hierarchy nodes                          |       1 |        241.48 MiB |         266.94 MiB 
max. heap via aligned::allocate(..)               |      11 |        374.02 MiB |         397.54 MiB 
max. OpenCL buffer allocations                    |      17 |       2234.56 MiB |        2375.06 MiB 
----------------------------------------------------------------------------------------------------

The memory allocation summary shows different memory sizes and tracked allocations/deallocations.

  • local instance size is the size of the local hierarchy state within host memory.
  • hierarchy buffer size is just the memory for the matrices representing the hierarchy vertices of the local graph partition.
  • computed hierarchy nodes is the part of the hierarchy buffer used for vertices that are updated/computed on the specific rank.
  • halo hierarchy nodes is the part of the hierarchy buffer used for halo vertices, which are needed by the computed vertices and received from other ranks during the neighbour exchange.
  • max. heap via aligned::allocate(..) is the maximum number and amount of memory allocated via HEOM's built-in memory allocation facilities for aligned host memory.
  • max. OpenCL buffer allocations is the maximum number and amount of allocated OpenCL device memory. Depending on the OpenCL implementation and buffer allocation flags, OpenCL might use additional, uncounted host memory.

main(): rank 0: application runtime:      198.20 s

The last line is the application runtime as measured by rank 0. Comparing it with the accumulated time (sum) spent in solver step provides an estimate of the initialisation cost (which is negligible for long runs).
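These relations can be checked with simple arithmetic on the printed values; a short sketch using numbers copied from the tables above (nothing is measured here, and the figures are specific to this example run):

#include <cstdio>

int main()
{
    // memory allocation summary, rank 341: the hierarchy buffer is the sum
    // of its computed and halo parts (372.08 MiB in the table); the local
    // instance size (374.00 MiB) adds a small amount of bookkeeping on top
    const double computed_mib = 130.60;
    const double halo_mib     = 241.48;
    std::printf("hierarchy buffer: %.2f MiB, halo share: %.0f %%\n",
                computed_mib + halo_mib,
                100.0 * halo_mib / (computed_mib + halo_mib));

    // runtime: application runtime (rank 0) minus the accumulated solver
    // step time of the worst rank estimates initialisation and other overhead
    const double application_runtime_s = 198.20;
    const double solver_step_sum_s     = 179.95;
    std::printf("non-solver time: ~%.1f s\n",
                application_runtime_s - solver_step_sum_s);
    return 0;
}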

Generated Files

Detailed runtime statistics for manual or scripted evaluation are generated for every rank. For a job with n ranks, the data for rank i (0 to n - 1) can be found in:

solver_statistics_rank_<i>.dat
neighbour_exchange_statistics_rank_<i>.dat

The files are tab-separated tables with a header and contain statistics over all measured individual timings, e.g. average, min, max, median, variance, standard error, etc. The solver runtime tables additionally contain some physics values from the hierarchy graph, while the neighbour exchange tables also contain the volume of data sent and received by the specific rank, and the number of source and destination ranks.
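A minimal sketch of such scripted processing, assuming only the layout stated above (tab-separated columns with one header line); it prints a single column, selected by whatever column name is found in the header:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// split one line of the file into its tab-separated fields
std::vector<std::string> split_tabs(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;
    while (std::getline(stream, field, '\t'))
        fields.push_back(field);
    return fields;
}

int main(int argc, char** argv)
{
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <statistics_file.dat> <column name>\n";
        return 1;
    }

    std::ifstream file(argv[1]);
    std::string line;

    // locate the requested column in the header line
    std::getline(file, line);
    const std::vector<std::string> header = split_tabs(line);
    std::size_t column = 0;
    while (column < header.size() && header[column] != argv[2])
        ++column;
    if (column == header.size()) {
        std::cerr << "no such column: " << argv[2] << "\n";
        return 1;
    }

    // print that column for every data row
    while (std::getline(file, line)) {
        const std::vector<std::string> fields = split_tabs(line);
        if (column < fields.size())
            std::cout << fields[column] << "\n";
    }
    return 0;
}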

neighbour_exchange_matrix.dat

Is a human-readable table that contains the volume of data transferred between ranks during each neighbour exchange operation. The first column is the sending rank, so each row shows how much data is sent by that rank to every other rank. The last column is the sum, i.e. the total amount of data sent by that rank. Each of the other columns can be read as the data received by the rank named in the header from the ranks of the respective rows, and the last row is the sum of the data received by the rank of the respective column. The last entry of the last row is the total of all transferred data, i.e. both the sum of the column above it and the sum of the row to its left. Each transfer is only counted once, so the total volume = sum(sent data) = sum(received data).

neighbour_exchange_matrix_raw.dat

Contains the inner part of that table, i.e. just the sizes in bytes, for scripted processing.
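A small sketch of scripted processing for this file, assuming the layout described above (a plain matrix of byte counts, row = sending rank, column = receiving rank); it recomputes the sums that appear in the human-readable table:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    // run from within the profiling output directory
    std::ifstream file("neighbour_exchange_matrix_raw.dat");
    std::vector<std::vector<long long>> sent; // sent[row][col] in bytes

    std::string line;
    while (std::getline(file, line)) {
        std::istringstream row(line);
        std::vector<long long> values;
        long long bytes;
        while (row >> bytes)
            values.push_back(bytes);
        if (!values.empty())
            sent.push_back(values);
    }

    long long total = 0;
    std::vector<long long> received(sent.empty() ? 0 : sent.front().size(), 0);
    for (std::size_t r = 0; r < sent.size(); ++r) {
        long long sent_by_r = 0;
        for (std::size_t c = 0; c < sent[r].size(); ++c) {
            sent_by_r   += sent[r][c];
            received[c] += sent[r][c];
        }
        total += sent_by_r;
        std::cout << "rank " << r << " sent " << sent_by_r << " bytes\n";
    }
    for (std::size_t c = 0; c < received.size(); ++c)
        std::cout << "rank " << c << " received " << received[c] << " bytes\n";

    // each transfer is counted once: total = sum(sent) = sum(received)
    std::cout << "total volume: " << total << " bytes\n";
    return 0;
}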

Additional Files

Each process copies its proc/curproc/status file as proc_status_rank_<n> into the profiling subdirectory. At least on Linux, it contains the thread count and CPU mask, which can be useful to understand how many OpenCL threads per process have been created and how they were mapped to the hardware, e.g.

# assuming 8 MPI ranks per node, compactly mapped, check out the first node via
grep ^Threads profile/proc_status* | head -n 8
grep ^Cpus_allowed_list profile/proc_status* | head -n 8