perf - yszheda/wiki GitHub Wiki
References
Without sudo
sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
sudo sh -c 'echo kernel.perf_event_paranoid=1 > /etc/sysctl.d/local.conf'
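The two commands above relax the setting for the running kernel and persist it across reboots. A minimal sketch for checking the level currently in effect (-1 is most permissive; 2, a common default, blocks most profiling for unprivileged users):

```shell
# Read the current restriction level from procfs.
level=$(cat /proc/sys/kernel/perf_event_paranoid)
echo "perf_event_paranoid=$level"
```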
--per-thread
Tutorial
perf record
Sampling with perf report
Sample analysis with perf report --sort=dso
# Events: 1K cycles
#
# Overhead Shared Object
# ........ ..............................
#
38.08% [kernel.kallsyms]
28.23% libxul.so
3.97% libglib-2.0.so.0.2800.6
3.72% libc-2.13.so
3.46% libpthread-2.13.so
2.13% firefox-bin
1.51% libdrm_intel.so.1.0.0
1.38% dbus-daemon
1.36% [drm]
[...]
perf record -a sleep 5
perf report --sort=cpu
# Events: 354 cycles
#
# Overhead CPU
# ........ ...
#
65.85% 1
34.15% 0
perf top
Live analysis with perf top
-------------------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:     260 irqs/sec  kernel:61.5%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)
-------------------------------------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________________ ___________________________________________________________
80.00 23.7% read_hpet [kernel.kallsyms]
14.00 4.2% system_call [kernel.kallsyms]
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]
7.00 2.1% i8042_interrupt [kernel.kallsyms]
7.00 2.1% strcmp [kernel.kallsyms]
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so
6.00 1.8% fget_light [kernel.kallsyms]
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so
5.00 1.5% native_sched_clock [kernel.kallsyms]
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko
perf bench
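perf bench bundles several microbenchmark suites (scheduler, memory, and others). Two real examples, sketched with a guard so the snippet degrades gracefully if perf is not installed:

```shell
# Run two of the built-in benchmark suites, if perf is available.
if command -v perf >/dev/null 2>&1; then
    perf bench sched pipe   # context-switch cost between two tasks
    perf bench mem memcpy   # memcpy throughput at various sizes
else
    echo "perf not installed; skipping"
fi
ran=1
```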
Benchmarking with perf bench
Documents
4.4 Stack Traces
Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events, as seen in the earlier sshd symbols example. There are three ways to fix this: using DWARF data to unwind the stack, using last branch record (LBR) if available (a processor feature), or returning the frame pointers.
5.3. User-Level Statically Defined Tracing (USDT)
Often you need to compile the application yourself using a --with-dtrace flag.
PEBS (Intel's Precise Event Based Sampling)
CPU Microarchitecture
The frontend and backend metrics refer to the CPU pipeline, and are also based on stall counts. The frontend processes CPU instructions, in order. It involves instruction fetch, along with branch prediction, and decode. The decoded instructions become micro-operations (uops) which the backend processes, and it may do so out of order. For a longer summary of these components, see Shannon Cepeda's great posts on frontend and backend.
The backend can also process multiple uops in parallel; for modern processors, three or four. Along with pipelining, this is how IPC can become greater than one, as more than one instruction can be completed ("retired") per CPU cycle.
Stalled cycles per instruction is similar to IPC, but inverted: it counts only stalled cycles, which will be for memory or resource bus access. This makes it easy to interpret: stalls are latency; reduce stalls. I really like it as a metric, and hope it becomes as commonplace as IPC/CPI. Let's call it SCPI.
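perf stat reports the raw inputs for SCPI directly. A minimal sketch; the `stalled-cycles-*` event aliases are generic and are not exposed on every CPU or virtualized environment, so the command is allowed to fail quietly:

```shell
# Count cycles, instructions, and stall events for a short workload;
# perf stat prints "stalled cycles per insn" alongside the counts.
if command -v perf >/dev/null 2>&1; then
    perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend \
        sleep 1 2>&1 || true
fi
done=1
```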
So for x86-based processors, the front-end does two main things: fetch instructions (from where program binaries are stored in memory or the caching system), and decode them into micro-operations. As part of the fetching process, the front-end must also predict the targets of branch instructions (if-type statements) when they are encountered, so that it knows where to grab the next instruction from. All sorts of specialized logic and hardware work together to perform these functions: a branch predictor, a specialized micro-operation cache, particular decoders for both simple and complex instructions, and more. All these bits of hardware contribute toward the front-end's main goal of supplying work, in the form of micro-operations, to the back-end. The Sandy Bridge front-end is capable of delivering 4 uops per cycle (or processor clock-tick) to the back-end.
In order to make the best use of its resources, the back-end uses its own bookkeeping system to keep track of each micro-operation, the pieces of data it requires, and its execution status. Then it executes the micro-operations in any order, according to when a micro-operation has all its data ready and when the execution resources are available. The execution resources that the back-end is keeping track of are called execution units.
6.8. eBPF
Stack Unwinder
ARM
multi-thread
perf record -g -F 99 -s ./your_program
perf report -T
perf report -T --tid=$TID
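To get a value for `$TID`, list the threads of the target process. A sketch using the current shell (`$$`) so the example is self-contained; substitute the PID of your profiled program:

```shell
# -L lists threads; TID is the per-thread ID that perf report can filter on.
threads=$(ps -L -o pid,tid,comm -p $$)
echo "$threads"
```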
Flame Graph
- Flame Graphs
- The Flame Graph: This visualization of software execution is a new necessity for performance profiling and debugging.