NUMA Awareness in Performance Test

This page is a place to work on the documentation of NUMA awareness in HPX. The topic was brought up on our IRC channel and a transcript is posted below. We would like to eventually turn this transcript into a more concise explanation of the subject:

<mariomulansky> i have a question about the paper: http://stellar.cct.lsu.edu/pubs/isc2012.pdf
<mariomulansky> I'm trying to reproduce Fig. 3
<heller> ok
<mariomulansky> but my single-socket performance is lower and the 8-socket performance is higher
<heller> sure, that's expected ;)
<mariomulansky> than the one reported in the graph
<heller> err
<heller> you lost me ...
<heller> are you running the jacobi examples?
<mariomulansky> Fig. 3 in the paper reports memory bandwidth from STREAM
<heller> ahh, yes of course
<mariomulansky> so i'm running STREAM on lyra, but i get different numbers - namely a gain of a factor of ~9 when going from 1 to 48 threads
<heller> ok
<mariomulansky> i like your numbers more, what should i do to get those ;)
<heller> did you set interleaved numa memory placement for the 48 threads run?
<mariomulansky> how? and would that make it faster?
<heller> no
<heller> that would make the 48 thread run slower :P
<heller> the stream benchmark has perfect NUMA placement
<heller> by default
<heller> that means, there is no inter socket communication
<heller> which makes it faster
<mariomulansky> ah right
<heller> that shouldn't matter for the 1 thread run
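
Note: the reason the OpenMP STREAM gets "perfect" NUMA placement by default is Linux's first-touch page placement policy: each page lands on the NUMA node of the thread that first writes to it. STREAM initializes its arrays in a parallel loop with the same schedule as the measurement kernels, so each thread later accesses mostly node-local memory. A minimal sketch of the idea (not the actual stream.c source):

```c
/* Sketch of first-touch placement, assuming Linux's default policy and a
 * static OpenMP schedule (this is not the actual stream.c code). */
#define N 20000000
static double a[N], b[N], c[N];

void first_touch_init(void)
{
    /* Each thread writes "its" chunk of the arrays first, so the pages of
     * that chunk are allocated on the thread's local NUMA node.  The later
     * copy/scale/add/triad kernels use the same loop schedule and therefore
     * access mostly node-local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.0;
    }
}
```
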
<mariomulansky> what is the bandwidth between nodes there?
<heller> the bandwidth between the nodes is negligible
<heller> it's a few GB/s
<heller> the bottleneck is the bandwidth to main memory
<heller> you can measure the NUMA traffic with likwid-perfctr
<mariomulansky> ah well that's what i mean - the bandwidth to memory from other NUMA domains
<heller> you should also set the array size to 20,000,000 (stream.c line 57)
<heller> that is determined by the bandwidth of one memory controller
<heller> which is the maximum achievable bandwidth for one socket
<mariomulansky> ok, so did you set interleaved memory placement?
<mariomulansky> i'm running with 10,000,000
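
Note: the array size is a compile-time constant near the top of stream.c (the exact line number and macro name depend on the STREAM version; older versions call it N, newer ones STREAM_ARRAY_SIZE). The arrays must be much larger than the combined last-level caches, otherwise you measure cache bandwidth rather than memory bandwidth. Something along these lines:

```c
/* stream.c -- array size (macro name and line number vary between STREAM
 * versions; 20,000,000 doubles is roughly 160 MB per array, well beyond
 * the last-level caches). */
#define STREAM_ARRAY_SIZE 20000000
```
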
<heller> the bandwidth reported for the copy test needs to be multiplied by 1.5
<mariomulansky> why? is that the one you used?
<heller> yeah
<heller> well, because you have two loads and one store
<mariomulansky> i see
<heller> I don't remember the exact reason ...
<heller> why it needs to be multiplied by 1.5
<heller> something to do with caches ...
<mariomulansky> boah...
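
Note: the usual explanation for the factor of 1.5 on the copy result (a sketch, assuming a write-allocate cache, which matches the "two loads and one store" above): STREAM counts 16 bytes of traffic per element of c[i] = a[i], but on a write-allocate cache the store first fetches the destination cache line from memory, adding another 8-byte read per element:

```
counted by STREAM:  1 load (a[i]) + 1 store (c[i])                           = 16 bytes/element
actual traffic:     2 loads (a[i] + write-allocate of c[i]) + 1 store (c[i]) = 24 bytes/element

actual bandwidth ≈ reported copy bandwidth × 24/16 = reported × 1.5
```
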
<mariomulansky> if i don't pin the threads in the 48-thread runs, the performance gets much worse
<heller> compare: numactl --interleave=0-7 likwid-pin -c 0-47 ./stream_omp
<heller> with: likwid-pin -c 0-47 ./stream_omp
<mariomulansky> numactl: command not found
<heller> mariomulansky: /home/heller/bin/numactl
<heller> use that one
<mariomulansky> permission denied
<heller> one sec
<heller> try again
<heller> and compare:
<heller> /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
<heller> numactl --interleave=0,1 /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
<heller> numactl --membind=0 /usr/local/bin/likwid-perfctr -g MEM -C S0:0@S1:0 ./stream_omp
<mariomulansky> $ ls /home/heller/bin/numactl
<mariomulansky> ls: cannot access /home/heller/bin/numactl: Permission denied
<heller> look at "| Memory bandwidth [MBytes/s] | 9075.92 | 8847.24 |"
<heller> for the different cores
<heller> grmp
<heller> one sec
<heller> mariomulansky: please try again
<heller> mariomulansky: those three commands execute the same benchmark, the only difference is how the memory is placed on the different NUMA domains
<heller> topology matters :P
<mariomulansky> i see
<heller> mariomulansky: also, the perfctr likwid tool will report the correct bandwidth ;)
<@wash> one sec, will just install numactl
<heller> wash: thanks
<heller> mariomulansky: also, try the above commands with the NUMA or NUMA2 performance group (after the -g switch)
<heller> and observe
<mariomulansky> ok i see
<mariomulansky> thanks a lot, this is way more complicated than i would like it to be
<mariomulansky> i have to go now
<mariomulansky> maybe you can tell me what settings you used for Fig. 3 :)
<mariomulansky> thanks wash !
<heller> mariomulansky: didn't i already?
<heller> what is it you are missing?
<heller> i used interleaved memory binding
<heller> which places the memory in a round robin fashion
<mariomulansky> ah ok, those are your settings
<mariomulansky> ok
<heller> where the arguments to --interleave are only the numbers of the NUMA domains involved
<heller> so, for a twelve-thread run (only two NUMA domains), you only do --interleave=0,1
<heller> if you do --interleave=0-7
<heller> you'll see an increase of performance
<heller> because more memory controllers are used
<mariomulansky> i see - makes sense
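
Note: to make the --interleave arguments concrete (assuming lyra's layout as described in this conversation, 48 cores across 8 NUMA domains, with threads pinned to consecutive cores via likwid-pin):

```
48 cores / 8 NUMA domains = 6 cores per domain

12 threads pinned to cores 0-11  ->  NUMA domains 0 and 1  ->  numactl --interleave=0,1
48 threads pinned to cores 0-47  ->  NUMA domains 0-7      ->  numactl --interleave=0-7
```
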
<heller> makes perfect sense once you've gone through the pain of looking at those performance counters with the stream benchmark ;)
<heller> but it's a good exercise
<heller> everyone in this channel should have done this at least once ;)
<mariomulansky> so even if i run on one thread i can use memory from other sockets
<heller> sadly, no one except me did that :(
<heller> yes
<mariomulansky> with numactl
<heller> yup
<mariomulansky> ok
<heller> note, numactl is just a tool that uses the libnuma user level API
<mariomulansky> i will look into that further later
<mariomulansky> now i have to go
<mariomulansky> thanks a lot!
<mariomulansky> bye
<heller> you could even manually place your memory with this (documented) API
<heller> enjoy!
<mariomulansky> thanks :)
<mariomulansky> ciao
<heller> aserio: btw, you could pick that conversation up and put it into some kind of documentation form ;)
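
As mentioned in the transcript, numactl is a thin wrapper around the libnuma user-level API, which can also be used to place memory manually from inside a program. A minimal sketch (assuming the libnuma development headers are installed; compile with -lnuma):

```c
/* Minimal libnuma placement sketch: allocate one buffer bound to NUMA node 0
 * and another interleaved across all nodes.
 * Compile with:  cc numa_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    size_t bytes = 20000000 * sizeof(double);

    /* All pages of this buffer come from node 0 (like numactl --membind=0). */
    double *local = numa_alloc_onnode(bytes, 0);

    /* Pages of this buffer are placed round-robin across all nodes
     * (like numactl --interleave=all). */
    double *interleaved = numa_alloc_interleaved(bytes);

    if (!local || !interleaved) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

    numa_free(local, bytes);
    numa_free(interleaved, bytes);
    return EXIT_SUCCESS;
}
```

The --membind and --interleave options used in the commands above correspond roughly to these two allocation calls.
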
