perf - yszheda/wiki GitHub Wiki

References

Without sudo

Run perf without root-rights

sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'
sudo sh -c 'echo kernel.perf_event_paranoid=1 > /etc/sysctl.d/local.conf'
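
What each level of this setting allows can be summarized in a small helper. This is a minimal sketch: `describe_paranoid` is a hypothetical function name, the level meanings are paraphrased from the kernel's sysctl documentation, and some distributions patch in stricter values above 2 that are not covered here.

```shell
#!/bin/sh
# describe_paranoid LEVEL: print what an unprivileged user may do at that level.
# Hypothetical helper, not part of perf. Restrictions are cumulative: each level
# also includes the restrictions of the levels below it.
describe_paranoid() {
    level=$1
    if [ "$level" -le -1 ]; then
        echo "no restrictions: (almost) all events allowed for all users"
    elif [ "$level" -eq 0 ]; then
        echo "raw and ftrace function tracepoint access disallowed"
    elif [ "$level" -eq 1 ]; then
        echo "CPU-wide event access disallowed; per-process profiling still allowed"
    else
        echo "kernel profiling disallowed; only user-space measurements allowed"
    fi
}

# Report the live setting when the sysctl file is readable (Linux only).
paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    current=$(cat "$paranoid_file")
    echo "perf_event_paranoid=$current: $(describe_paranoid "$current")"
fi
```

With the value 1 set above, system-wide collection such as `perf record -a` would still need root, while profiling your own processes should work without it.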

What restriction is perf_event_paranoid == 1 actually putting on x86 perf?

`--per-thread`

perf stat: Introduce --per-thread option

Tutorial

https://perf.wiki.kernel.org/index.php/Tutorial

Sampling with `perf record`

Sample analysis with `perf report`

perf report --sort=dso

# Events: 1K cycles
#
# Overhead                   Shared Object
# ........  ..............................
#
    38.08%  [kernel.kallsyms]
    28.23%  libxul.so
     3.97%  libglib-2.0.so.0.2800.6
     3.72%  libc-2.13.so
     3.46%  libpthread-2.13.so
     2.13%  firefox-bin
     1.51%  libdrm_intel.so.1.0.0
     1.38%  dbus-daemon
     1.36%  [drm]
     [...]

perf record -a sleep 5
perf report --sort=cpu

# Events: 354  cycles
#
# Overhead  CPU
# ........  ...
#
   65.85%  1
   34.15%  0

Live analysis with `perf top`

perf top
-------------------------------------------------------------------------------------------------------------------------------------------------------
  PerfTop:     260 irqs/sec  kernel:61.5%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)
-------------------------------------------------------------------------------------------------------------------------------------------------------

            samples  pcnt function                       DSO
            _______ _____ ______________________________ ___________________________________________________________

              80.00 23.7% read_hpet                      [kernel.kallsyms]
              14.00  4.2% system_call                    [kernel.kallsyms]
              14.00  4.2% __ticket_spin_lock             [kernel.kallsyms]
              14.00  4.2% __ticket_spin_unlock           [kernel.kallsyms]
               8.00  2.4% hpet_legacy_next_event         [kernel.kallsyms]
               7.00  2.1% i8042_interrupt                [kernel.kallsyms]
               7.00  2.1% strcmp                         [kernel.kallsyms]
               6.00  1.8% _raw_spin_unlock_irqrestore    [kernel.kallsyms]
               6.00  1.8% pthread_mutex_lock             /lib/i386-linux-gnu/libpthread-2.13.so
               6.00  1.8% fget_light                     [kernel.kallsyms]
               6.00  1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so
               5.00  1.5% native_sched_clock             [kernel.kallsyms]
               5.00  1.5% drm_addbufs_sg                 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko

Benchmarking with `perf bench`

Documents

http://www.brendangregg.com/perf.html

4.4 Stack Traces

Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events, like seen in the earlier sshd symbols example. There are three ways to fix this: either using dwarf data to unwind the stack, using last branch record (LBR) if available (a processor feature), or returning the frame pointers.
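
The three fixes correspond to the `fp`, `dwarf`, and `lbr` modes of `perf record --call-graph`. The sketch below only prints the matching command lines so they can be compared side by side; `callgraph_cmd` is a hypothetical helper name and `./your_program` is a placeholder. For the frame-pointer mode, the program would typically need to be built with `-fno-omit-frame-pointer` (GCC and Clang).

```shell
#!/bin/sh
# callgraph_cmd METHOD PROGRAM: print the perf record command line for one
# stack unwinding method. Hypothetical helper; it does not run perf itself.
callgraph_cmd() {
    case "$1" in
        fp|dwarf|lbr)
            echo "perf record --call-graph $1 -- $2"
            ;;
        *)
            echo "unknown method: $1 (expected fp, dwarf, or lbr)" >&2
            return 1
            ;;
    esac
}

callgraph_cmd fp ./your_program      # frame pointers
callgraph_cmd dwarf ./your_program   # DWARF debug info, unwound in user space
callgraph_cmd lbr ./your_program     # last branch record, needs CPU support
```

Which mode works best depends on the hardware and on how the binary was built, so it may be worth trying more than one on the same workload.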

5.3. User-Level Statically Defined Tracing (USDT)

Often you need to compile the application yourself using a --with-dtrace flag.

PEBS (Intel's Precise Event Based Sampling)

CPU Microarchitecture

The frontend and backend metrics refer to the CPU pipeline, and are also based on stall counts. The frontend processes CPU instructions, in order. It involves instruction fetch, along with branch prediction, and decode. The decoded instructions become micro-operations (uops) which the backend processes, and it may do so out of order. For a longer summary of these components, see Shannon Cepeda's great posts on frontend and backend.

The backend can also process multiple uops in parallel; for modern processors, three or four. Along with pipelining, this is how IPC can become greater than one, as more than one instruction can be completed ("retired") per CPU cycle.

Stalled cycles per instruction is similar to IPC (inverted), however, only counting stalled cycles, which will be for memory or resource bus access. This makes it easy to interpret: stalls are latency, reduce stalls. I really like it as a metric, and hope it becomes as commonplace as IPC/CPI. Let's call it SCPI.
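
As a worked example of these ratios, the sketch below computes IPC, CPI, and SCPI from three counter values. The numbers are illustrative rather than measured, `ratio` is a hypothetical helper name, and a single stalled-cycles count is assumed (perf may report frontend and backend stalls separately).

```shell
#!/bin/sh
# ratio NUM DEN: print NUM/DEN rounded to two decimals, or "n/a" when DEN is 0.
# Hypothetical helper built on awk.
ratio() {
    awk -v n="$1" -v d="$2" 'BEGIN {
        if (d == 0) print "n/a"
        else printf "%.2f\n", n / d
    }'
}

# Illustrative counter values, not real measurements.
cycles=1000000
instructions=2500000
stalled_cycles=400000

echo "IPC:  $(ratio "$instructions" "$cycles")"          # instructions per cycle
echo "CPI:  $(ratio "$cycles" "$instructions")"          # cycles per instruction
echo "SCPI: $(ratio "$stalled_cycles" "$instructions")"  # stalled cycles per instruction
```

With these inputs IPC is 2.50, which is only possible because several instructions can retire per cycle, while SCPI is 0.16.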

Pipeline Speak: Learning More About Intel® Microarchitecture Codename Sandy Bridge

So for x86-based processors, the front-end does two main things - fetch instructions (from where program binaries are stored in memory or the caching system), and decode them into micro-operations. As part of the fetching process, the front-end must also predict the targets of branch instructions (if-type statements) when they are encountered, so that it knows where to grab the next instruction from. All sorts of specialized logic and hardware work together to do these functions - a branch predictor, a specialized micro-operation cache, particular decoders for both simple and complex instructions, and more. All these bits of hardware contribute toward the front-end's main goal of supplying work - in the form of micro-operations - to the back-end. The Sandy Bridge Front-end is capable of delivering 4 uops per cycle (or processor clock-tick) to the back.

Pipeline Speak, Part 2: The Second Part of the Sandy Bridge Pipeline

In order to make the best use of its resources, the back-end uses its own bookkeeping system to keep track of each micro-operation, the pieces of data it requires, and its execution status. Then it executes the micro-operations in any order - according to when a micro-operation has all its data ready and when the execution resources are available. The execution resources that the back-end is keeping track of are called execution units.

6.8. eBPF

Stack Unwinder

What is the overhead of using Intel Last Branch Record?

ARM

Cross-compiling perf for ARM

multi-thread

Thread Utilization profiling on linux

perf record -g -F 99 -s ./your_program
perf report -T
perf report -T --tid=$TID

Flame Graph

Memory Leak (and Growth) Flame Graphs

Tools
