Using PAPI on Intel Processors - STEllAR-GROUP/hpx GitHub Wiki

On Aug 31, 2016, at 6:42 PM, Stephane Eranian wrote:

Hi Phil,

On Tue, Aug 30, 2016 at 3:31 AM, Philip Mucci wrote:

Hi folks,

In some of my work, I frequently run into folks having problems with native Intel events. As most of you know, I like to harp on people to basically ignore preset events these days, because they foster misunderstanding: hardware just isn't the same as when PAPI was first written, and the presets abstract those differences away. However, native events aren't a panacea… One still often needs to RT(f)M in order to fully understand what one is seeing. To save you that hassle, I'm providing this bit of info... as I find reading Intel docs right up there with having to read that Ayn Rand or Joel Osteen novel that crazy friends give you.

This is a message to a client who has had issues on HSX aka Haswell-EP and JKT aka Sandy Bridge EP processors, most of which is in common with much of the E5 processor line. I suppose this should be turned into a FAQ entry on the PAPI page, but that depends on your comments, which are most welcome.

Note that the native events below are in Intel 'parlance' with the '.' qualifier. libpfm now accepts these fully, thanks to the work of my good friend and perf-dude extraordinaire, Mr. Stephane Eranian of Google.

Regards,

Phil

Below is my list of events that I suspect are not mapped correctly. These events remain consistently screwy for all of the applications that I've looked at so far, so it's not real application behavior.

SandyBridge (SNBEP aka JKT):

mem_load_uops_llc_miss_retired.remote_dram
mem_load_uops_retired.l1_hit
mem_load_uops_retired.l2_hit
mem_load_uops_retired.llc_hit
mem_load_uops_retired.llc_miss
mem_load_uops_llc_hit_retired.xsnp_hit
mem_load_uops_llc_hit_retired.xsnp_hitm
mem_load_uops_llc_hit_retired.xsnp_miss

All of the above have errata on the E5 processor. The errata are BT241 (undercounts) and BT243 (unreliable/corruption). The former is a hardware bug; the latter is a byproduct of hyperthreading.

See pages 82 and 83 of http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf

There is a workaround for BT241, but it increases L3 and main memory latencies. It also requires permissions that most regular users don't have, i.e. writing to /dev/cpu_dma_latency and /sys/pci, since one has to tweak some bits in MSRs and the PCI bus space.

Yes, this is the late GO (Global Observability) bug. The kernel does not do anything on this one simply because the tradeoff is severe given the performance loss of the workaround. It is left to each user to decide if they can tolerate the slowdown while measuring.

Workarounds exist in the pmu-tools latego.py script. https://github.com/andikleen/pmu-tools

Make sure you disable them after you count, otherwise you are hosing your machine's performance!

$ latego.py enable mem_load_uops_retired.llc_miss
do papi stuff
$ latego.py disable mem_load_uops_retired.llc_miss

For hyperthreading, one can reduce the problem by making sure the per-thread mask contains only one of the two threads on the same core. Use numactl or taskset ahead of time, and make sure you understand the mappings. HT siblings are usually the high-order processor numbers. But the problem is still there… the only foolproof way is to disable HT in the BIOS…
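To make the "understand the mappings" step concrete, here is a minimal sketch that picks one logical CPU per physical core. It assumes the usual Linux sysfs layout, where each `cpuN/topology/thread_siblings_list` file lists the hyperthreads sharing a core; the helper names (`parse_cpu_list`, `one_cpu_per_core`, `read_sibling_lists`) are made up for this illustration and are not part of PAPI or pmu-tools.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,16" or "0-1" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU from each group of HT siblings."""
    return sorted({parse_cpu_list(s)[0] for s in sibling_lists if s.strip()})


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every CPU (assumed Linux sysfs layout)."""
    paths = Path(root).glob("cpu[0-9]*/topology/thread_siblings_list")
    return [p.read_text() for p in paths]


if __name__ == "__main__":
    cpus = one_cpu_per_core(read_sibling_lists())
    # The printed list can be handed to taskset, e.g. taskset -c <list> ./app
    print(",".join(str(c) for c in cpus))
```

Pinning to this list only keeps the measured threads off each other's sibling; anything else the OS schedules on the idle siblings can still disturb the counters, which is why disabling HT in the BIOS remains the only foolproof option.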

Due to this erratum, the Local Memory Read / Load Retired PerfMon events listed below may undercount.

MEM_LOAD_UOPS_RETIRED.LLC_HIT 
MEM_LOAD_UOPS_RETIRED.LLC_MISS*
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM*
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM*
MEM_TRANS_RETIRED.LOAD_LATENCY*

The undercount of these events can be partially resolved (but not eliminated) by setting MSR_PEBS_NUM_ALT. PEBS Accuracy Enable (MSR 39CH; bit 0) to 1. When using the events marked with an asterisk, set the Direct-to-core disable field (Bus 1; Device 14; Function 0; Offset 84; bit 1) to 1 for Local memory reads and (Bus 1; Device 8; Function 0; Offset 80; bit 1) to 1 and (Bus 1; Device 9; Function 0; Offset 80; bit 1) to 1 for Remote memory reads. The improved accuracy comes at the cost of a reduction in performance; this workaround generally should not be used during normal operation.
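Since the workaround is not meant to stay on during normal operation, it can be handy to check whether the PEBS Accuracy Enable bit is still set after a measurement. The following is a read-only sketch, assuming the Linux `msr` driver is loaded (`/dev/cpu/<n>/msr`) and the script runs as root; `read_msr` and `workaround_active` are hypothetical helper names, and only the MSR address and bit number come from the erratum text above.

```python
import os
import struct

MSR_PEBS_NUM_ALT = 0x39C        # MSR address quoted in the erratum
PEBS_ACCURACY_ENABLE_BIT = 0    # bit 0, per the erratum


def bit_is_set(value, bit):
    """Return True if the given bit of value is 1."""
    return bool((value >> bit) & 1)


def read_msr(cpu, msr):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (root + msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the MSR address.
        (value,) = struct.unpack("<Q", os.pread(fd, 8, msr))
    finally:
        os.close(fd)
    return value


def workaround_active(cpu=0):
    """Report whether PEBS Accuracy Enable is currently set on one CPU."""
    return bit_is_set(read_msr(cpu, MSR_PEBS_NUM_ALT), PEBS_ACCURACY_ENABLE_BIT)


if __name__ == "__main__":
    try:
        print("PEBS accuracy workaround active:", workaround_active(0))
    except OSError as err:
        print("could not read MSR (need root and the msr module?):", err)
```

This only inspects the MSR half of the workaround on a single CPU; it does not look at the PCI configuration bits, and it deliberately never writes anything.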

When operating with SMT enabled, a memory at-retirement performance monitoring event (from the list below) may be dropped or may increment an enabled event on the corresponding counter with the same number on the physical core's other thread rather than the thread experiencing the event. Processors with SMT disabled in BIOS are not affected by this erratum.

The list of affected memory at-retirement events is as follows:

MEM_UOP_RETIRED.LOADS
MEM_UOP_RETIRED.STORES
MEM_UOP_RETIRED.LOCK
MEM_UOP_RETIRED.SPLIT 
MEM_UOP_RETIRED.STLB_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_RETIRED.LLC_MISS
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM
MEM_LOAD_UOPS_RETIRED.L2_MISS

Yes, this is the infamous HT bug causing cross-HT counter corruption. If any of these events is measured on counterX in one HT, then counterX on the sibling HT may get corrupted. For this problem, we have developed a kernel workaround which has been accepted into the Linux 4.1 kernel. There will be a presentation on this work at SC16. The workaround avoids the corruption on the sibling counter, but it does not correct the leak from the corrupting counter. For all I know, this workaround may have been backported by Red Hat and other distros to older kernels.
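A quick way to guess whether a machine has the workaround is to compare the running kernel release against 4.1. The sketch below does just that; the function names are invented for this example, and because of possible distro backports a "False" result is only a hint, not proof that the workaround is missing.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string like '4.1.0-rc3'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def has_ht_workaround(release):
    """True if the release is at least 4.1, where the workaround was merged."""
    return kernel_version(release) >= (4, 1)


def running_kernel_has_ht_workaround():
    """Apply the same check to the kernel this script is running on."""
    return has_ht_workaround(platform.release())
```

For example, `has_ht_workaround("3.10.0-327.el7.x86_64")` returns `False` purely on the version number, even though such a vendor kernel might carry a backport.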

fp_comp_ops_exe.sse_scalar_single
fp_comp_ops_exe.sse_packed_single

As far as these go, there are no known issues with them from Intel AFAICT. If Mr. Bandwidth aka the famous John McCalpin is lurking here, he might have something to add. Some dated microbenchmarks seem to validate their counting. https://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops

I believe the FLOPS events were fixed in Broadwell and clearly documented in their event files here.

Haswell (HSX):

cycle_activity.cycles_l1d_pending
cycle_activity.stalls_l1d_pending

For these events, there is likely a bug in the released kernel scheduling it on the wrong counter. See https://github.com/andikleen/pmu-tools/issues/18

Yes, and it was fixed in Linux 4.0.

mem_load_uops_l3_hit_retired.xsnp_hit
mem_load_uops_l3_hit_retired.xsnp_hitm
mem_load_uops_l3_hit_retired.xsnp_miss
mem_load_uops_l3_miss_retired.remote_dram
mem_load_uops_l3_miss_retired.remote_fwd
mem_load_uops_l3_miss_retired.remote_hitm

Here again, there are two errata, HSM26 (this time no workaround) and HSM30 (hyperthreading). http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-mobile-specification-update.pdf

Reproduced here below:

Certain Local Memory Read / Load Retired PerfMon Events May Undercount

Due to this erratum, the Local Memory Read / Load Retired PerfMon events listed below may undercount.

MEM_LOAD_UOPS_RETIRED.L3_HIT (Event D1H Umask 04H)
MEM_LOAD_UOPS_RETIRED.L3_MISS (Event D1H Umask 20H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS (Event D2H Umask 01H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT (Event D2H Umask 02H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM (Event D2H Umask 04H)
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE (Event D2H Umask 08H)
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM (Event D3H Umask 01H)
MEM_TRANS_RETIRED.LOAD_LATENCY (Event CDH Umask 01H)
PAGE_WALKER_LOADS.DTLB_L3 (Event BCH Umask 14H)
PAGE_WALKER_LOADS.ITLB_L3 (Event BCH Umask 24H)
PAGE_WALKER_LOADS.DTLB_Memory (Event BCH Umask 18H)
PAGE_WALKER_LOADS.ITLB_Memory (Event BCH Umask 28H)

The affected events may undercount, resulting in inaccurate memory profiles. Intel has observed undercounts by as much as 40%.
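The "Event xxH Umask yyH" pairs in the list above are what ends up in the counter's event-select register, and perf accepts them directly as a raw code of the form `rUUEE` (umask byte followed by event byte). Here is a small sketch that derives that raw code from the erratum's notation; `perf_raw_code` is a name invented for this example, and it covers only the event and umask fields, not extra settings such as the latency threshold that MEM_TRANS_RETIRED.LOAD_LATENCY needs.

```python
import re

# Matches the notation used in the spec update, e.g. "(Event D1H Umask 04H)".
_ENCODING = re.compile(
    r"Event\s+([0-9A-F]+)H\s+Umask\s+([0-9A-F]+)H", re.IGNORECASE
)


def perf_raw_code(spec_text):
    """Turn 'Event D1H Umask 04H' into a perf raw event string."""
    m = _ENCODING.search(spec_text)
    if m is None:
        raise ValueError(f"no event/umask encoding found in {spec_text!r}")
    event = int(m.group(1), 16)
    umask = int(m.group(2), 16)
    # Event select occupies bits 0-7 and the umask bits 8-15.
    return f"r{(umask << 8) | event:04x}"
```

With this, `perf_raw_code("MEM_LOAD_UOPS_RETIRED.L3_HIT (Event D1H Umask 04H)")` gives `r04d1`, which can be passed to `perf stat -e` to cross-check what a symbolic event name resolves to.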

Performance Monitor Counters May Produce Incorrect Results

When operating with SMT enabled, a memory at-retirement performance monitoring event (from the list below) may be dropped or may increment an enabled event on the corresponding counter with the same number on the physical core's other thread rather than the thread experiencing the event. Processors with SMT disabled in BIOS are not affected by this erratum.

The list of affected memory at-retirement events is as follows:

MEM_UOP_RETIRED.LOADS
MEM_UOP_RETIRED.STORES
MEM_UOP_RETIRED.LOCK
MEM_UOP_RETIRED.SPLIT
MEM_UOP_RETIRED.STLB_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.L3_HIT
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_RETIRED.L3_MISS
MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM
MEM_LOAD_UOPS_RETIRED.L2_MISS

Due to this erratum, certain performance monitoring events will produce unreliable results during hyper-threaded operation.

Fixed by a kernel workaround in Linux 4.1.

uops_issued_single_mul

This event is missing a period; it's called uops_issued.single_mul. This is very likely a kernel scheduling bug, although I don't know if anyone's ever tested this event, and the Intel documentation does not clarify what packed means here or whether it applies to x87, SSE or AVX. So its usefulness is TBD.

This event is not marked with any constraints in the official event table. Are you saying it always counts to 0?

Hope this helps. Not sure I can post on the PAPI mailing list. If not, please forward to this list. Thanks.


Ptools-perfapi mailing list
[email protected]
http://lists.eecs.utk.edu/mailman/listinfo/ptools-perfapi
