Benchmarking

Performance metrics can be measured using benchmark_ham_offload.cpp.

Supported options:
  -h [ --help ]                Shows this message
  -f [ --filename ] arg        filename(-prefix) for results
  -r [ --runs ] arg (=1000)    number of identical inner runs for which the 
                               average time will be computed
  --warmup-runs arg (=1)       number of additional warmup runs before times 
                               are measured
  -s [ --size ] arg (=1048576) size of transferred data in bytes (multiple of 4)
  -a [ --allocate ]            benchmark memory allocation/deallocation on 
                               target
  -i [ --copy-in ]             benchmark data copy to target
  -o [ --copy-out ]            benchmark data copy from target
  -c [ --call ]                benchmark function call on target
  -m [ --call-mul ]            benchmark function call (multiplication) on 
                               target
  -y [ --async ]               perform benchmark function calls asynchronously
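
To make the option semantics concrete, here is a rough sketch of what the call benchmark (-c) amounts to: an empty function is offloaded synchronously --warmup-runs times without measurement, then --runs times while timing, and the average per-call time is reported. This is an illustrative sketch, not the actual benchmark_ham_offload.cpp code; it assumes the HAM-Offload API from ham/offload.hpp (offload::sync and the f2f macro), and the empty kernel and the timing loop are made up for this example.

```cpp
// Illustrative sketch only, not the actual benchmark_ham_offload.cpp code.
// It assumes the HAM-Offload API from ham/offload.hpp (offload::sync and the
// f2f macro); the empty kernel and the timing loop are made up for this example.
#include <chrono>
#include <cstddef>
#include <iostream>

#include "ham/offload.hpp"

using namespace ham;

void fun() {} // empty kernel: offloading it measures pure call overhead (-c)

int main(int argc, char* argv[])
{
	offload::node_t target = 1;   // first offload target
	const size_t warmup_runs = 1; // --warmup-runs
	const size_t runs = 1000;     // --runs

	// warmup runs: executed, but not measured
	for (size_t i = 0; i < warmup_runs; ++i)
		offload::sync(target, f2f(&fun));

	// measured runs: report the average time per synchronous offloaded call
	auto start = std::chrono::high_resolution_clock::now();
	for (size_t i = 0; i < runs; ++i)
		offload::sync(target, f2f(&fun));
	auto stop = std::chrono::high_resolution_clock::now();

	auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
	std::cout << "average call time: " << ns / static_cast<double>(runs) << " ns" << std::endl;
	return 0;
}
```

With -y, the same calls would be issued asynchronously (presumably via offload::async and waiting on the returned futures); in the same spirit, -i/-o measure data transfers to and from the target and -a measures buffer allocation/deallocation on the target.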

NEC SX-Aurora TSUBASA:

# measure offload cost: -c
./build.vh/benchmark_ham_offload_vedma_vh --ham-cpu-affinity 0 --ham-process-count 2 --ham-veo-ve-nodes 0 --ham-veo-ve-lib ./build.ve/veorun_benchmark_ham_offload_vedma_ve --warmup-runs 100 -r 100000 -c

HAM-Offload v0.2 and older

There are benchmarks for HAM-Offload and Intel LEO. The LEO benchmarks are not built by default and must be built explicitly via:

$ b2 toolset=intel variant=release -j8 benchmark_intel_leo 

Scripts for automated benchmarking and generating figures exist and will be available soon.

When benchmarking, pinning processes and threads to hardware threads is important for reproducible results. For multi-socket hosts with Xeon Phi accelerators, it also matters which hardware thread is chosen on each side. The best communication path is between a CPU and the accelerator directly connected to its PCIe root complex. In a system with two CPUs and one accelerator, one of the CPUs will have better communication performance with the accelerator (especially regarding latency). In general, the mapping between CPUs and accelerators should reflect the topology of the PCIe interconnect.

The fields physical id and core id in /proc/cpuinfo show how hardware threads, cores, and CPUs map to each other. The CPU (NUMA node) to which a Xeon Phi is connected can be found in /sys/class/mic/mic<number>/device/numa_node. On the Xeon Phi, the OS core should be avoided: it is the last physical core, whose 4 hardware threads map to the first and the last three logical cores, e.g. 0, 241, 242, 243 on a 7xxx Xeon Phi with 61 physical cores.
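
As a concrete way to check and apply this mapping, the sketch below is a hypothetical helper (not part of HAM-Offload or its benchmarks): it reads the NUMA node mic0 is attached to from sysfs and pins the calling thread to a chosen hardware thread via sched_setaffinity, which is essentially what CPU-affinity pinning boils down to. The chosen hardware thread 8 is only an assumption matching the two-socket example used below.

```cpp
// Hypothetical helper, not part of HAM-Offload or its benchmarks.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

#include <fstream>
#include <iostream>

// Pin the calling thread to a single hardware thread (like taskset -c <cpu>).
bool pin_to_cpu(int cpu)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set) == 0; // pid 0 = calling thread
}

int main()
{
	// NUMA node (i.e. host CPU socket) mic0 is attached to
	std::ifstream numa("/sys/class/mic/mic0/device/numa_node");
	int node = -1;
	numa >> node;
	std::cout << "mic0 is attached to NUMA node " << node << std::endl;

	// Assumption for the example system below: hardware thread 8 is the
	// first core of the second socket; derive this from /proc/cpuinfo.
	if (!pin_to_cpu(8))
		std::cerr << "sched_setaffinity failed" << std::endl;
	return 0;
}
```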

In the following example, we assume that mic0 is connected to the second 8-core CPU of the host system.

For measuring the kernel offloading overhead via (Intel) MPI, run the following:

$ mpirun -n 1 -host localhost -env I_MPI_PIN_PROCESSOR_LIST=8 bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_mpi -c -r 1000000 : -n 1 -host mic0  -env I_MPI_PIN_PROCESSOR_LIST=0 bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_mpi

For measuring the kernel offloading overhead via SCIF:

$ bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 0 --ham-cpu-affinity 8 -c -r 1000000 &
$ ssh mic0 env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH `pwd`/bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 1 --ham-cpu-affinity 1

Result (times are in ns):

HAM-Offload function call runtime: 
name	average	median	min	max	variance	std_error	relative_std_error	conf95_error	relative_conf95_error	count
call:	1.786359e+03	1.678500e+03	1.626000e+03	3.181820e+05	1.260685e+05	1.122802e-01	6.285422e-05	2.200692e-01	1.231943e-04	10000000
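
For reference, the derived columns follow the usual definitions: std_error = sqrt(variance / count), conf95_error = 1.96 * std_error, and the relative_* columns are the respective error divided by the average (the numbers above are consistent with this, e.g. sqrt(1.260685e+05 / 1e7) ≈ 0.1123). The sketch below is a hypothetical post-processing helper, not the benchmark's own statistics code:

```cpp
// Hypothetical post-processing sketch: computes the columns of the result
// table from a vector of per-run times in ns. Not the benchmark's own code.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct stats { double average, median, min, max, variance, std_error, conf95_error; };

stats compute(std::vector<double> t)
{
	const size_t n = t.size();
	std::sort(t.begin(), t.end());
	double sum = 0.0, sq = 0.0;
	for (double x : t) { sum += x; sq += x * x; }
	stats s;
	s.average  = sum / n;
	s.median   = (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);
	s.min      = t.front();
	s.max      = t.back();
	s.variance = (sq - n * s.average * s.average) / (n - 1); // sample variance
	s.std_error    = std::sqrt(s.variance / n);              // standard error of the mean
	s.conf95_error = 1.96 * s.std_error;                     // 95% confidence half-width
	return s;
}

int main()
{
	std::vector<double> times = {1626.0, 1678.5, 1786.4}; // toy data
	stats s = compute(times);
	std::cout << s.average << " +/- " << s.conf95_error << " ns (95% CI), "
	          << "relative: " << s.conf95_error / s.average << std::endl;
	return 0;
}
```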

Pinned to the wrong CPU (--ham-cpu-affinity 0 on the host):

$ bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 0 --ham-cpu-affinity 0 -c -r 1000000 &
$ ssh mic0 env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH `pwd`/bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 1 --ham-cpu-affinity 1

Result (times are in ns):

HAM-Offload function call runtime: 
name	average	median	min	max	variance	std_error	relative_std_error	conf95_error	relative_conf95_error	count
call:	2.968820e+03	2.881500e+03	2.086000e+03	2.955695e+06	9.997552e+05	3.161890e-01	1.065033e-04	6.197305e-01	2.087464e-04	10000000

For the help screen, run:

$ bin/intel-linux/release/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 0 -h &
$ ssh mic0 env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH `pwd`/bin/intel-linux/release_mic/inlining-on/threading-multi/benchmark_ham_offload_scif --ham-process-count 2 --ham-address 1

Supported options:
  -h [ --help ]                Shows this message
  -f [ --filename ] arg        filename(-prefix) for results
  -r [ --runs ] arg (=1000)    number of identical inner runs for which the 
                               average time will be computed
  --warmup-runs arg (=1)       number of additional warmup runs before times 
                               are measured
  -s [ --size ] arg (=1048576) size of transferred data in bytes (multiple of 4)
  -a [ --allocate ]            benchmark memory allocation/deallocation on 
                               target
  -i [ --copy-in ]             benchmark data copy to target
  -o [ --copy-out ]            benchmark data copy from target
  -c [ --call ]                benchmark function call on target
  -m [ --call-mul ]            benchmark function call (multiplication) on 
                               target
  -y [ --async ]               perform benchmark function calls asynchronously