linux perf - animeshtrivedi/notes GitHub Wiki
Bookmarks
- [Main] perf Examples : https://www.brendangregg.com/perf.html
- [Further talks] https://www.brendangregg.com/linuxperf.html
Perf locking analysis
- Perf man page: https://man7.org/linux/man-pages/man1/perf-lock.1.html
- Example: https://lwn.net/Articles/897387/ (perf lock: New lock contention tracepoints support (v3))
Perf metric groups
- https://lpc.events/event/17/contributions/1514/attachments/1335/2675/LPC%202023%20Linux%20Perf%20Metrics%20Talk%20Slides.pdf (local copy: file:///home/atr/Downloads/2023-LPC-Linux%20Perf%20Tool%20Metrics%202023%20Linux%20Perf%20Metrics%20Talk%20Slides.pdf)
Composite event classes; to see how they are calculated, run:
$ sudo perf list --details
[...]
Metric Groups:
branch_prediction:
branch_misprediction_ratio
[Execution-Time Branch Misprediction Ratio (Non-Speculative)]
[d_ratio(ex_ret_brn_misp, ex_ret_brn)]
data_fabric:
all_remote_links_outbound
[Approximate: Outbound data bytes for all Remote Links for a node (die)]
[remote_outbound_data_controller_0 + remote_outbound_data_controller_1 + remote_outbound_data_controller_2 + remote_outbound_data_controller_3]
nps1_die_to_dram
[Approximate: Combined DRAM B/bytes of all channels on a NPS1 node (die)]
[dram_channel_data_controller_0 + dram_channel_data_controller_1 + dram_channel_data_controller_2 + dram_channel_data_controller_3 +
dram_channel_data_controller_4 + dram_channel_data_controller_5 + dram_channel_data_controller_6 + dram_channel_data_controller_7]
decoder:
macro_ops_dispatched
[Macro-ops Dispatched]
[de_dis_cops_from_decoder.disp_op_type.any_integer_dispatch + de_dis_cops_from_decoder.disp_op_type.any_fp_dispatch]
l2_cache:
all_l2_cache_accesses
[All L2 Cache Accesses]
[l2_request_g1.all_no_prefetch + l2_pf_hit_l2 + l2_pf_miss_l2_hit_l3 + l2_pf_miss_l2_l3]
all_l2_cache_hits
[All L2 Cache Hits]
[l2_cache_req_stat.ic_dc_hit_in_l2 + l2_pf_hit_l2]
all_l2_cache_misses
[All L2 Cache Misses]
[l2_cache_req_stat.ic_dc_miss_in_l2 + l2_pf_miss_l2_hit_l3 + l2_pf_miss_l2_l3]
ic_fetch_miss_ratio
[Instruction Cache (32B) Fetch Miss Ratio]
[d_ratio(ic_tag_hit_miss.instruction_cache_miss, ic_tag_hit_miss.all_instruction_cache_accesses)]
l2_cache_accesses_from_l2_hwpf
[L2 Cache Accesses from L2 HWPF]
[l2_pf_hit_l2 + l2_pf_miss_l2_hit_l3 + l2_pf_miss_l2_l3]
l2_cache_misses_from_l2_hwpf
[L2 Cache Misses from L2 Cache HWPF]
[l2_pf_miss_l2_hit_l3 + l2_pf_miss_l2_l3]
op_cache_fetch_miss_ratio
[Op Cache (64B) Fetch Miss Ratio]
[d_ratio(op_cache_hit_miss.op_cache_miss, op_cache_hit_miss.all_op_cache_accesses)]
l3_cache:
l3_read_miss_latency
[Average L3 Read Miss Latency (in core clocks)]
[xi_sys_fill_latency * 16 / xi_ccx_sdp_req1]
tlb:
l1_itlb_misses
[L1 ITLB Misses]
[bp_l1_tlb_miss_l2_tlb_hit + bp_l1_tlb_miss_l2_tlb_miss]
[...]
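Several of the metrics above are built from the d_ratio() helper, which is just a division guarded against a zero denominator. A minimal illustration with hypothetical counter values (not real measurements):

```shell
# d_ratio(a, b) is a/b, with 0 if b == 0.
# Counter values below are made up for illustration only.
ex_ret_brn_misp=1200   # retired mispredicted branches (hypothetical)
ex_ret_brn=100000      # retired branches (hypothetical)
awk -v a="$ex_ret_brn_misp" -v b="$ex_ret_brn" \
    'BEGIN { print (b ? a / b : 0) }'
# prints 0.012
```

With these numbers, branch_misprediction_ratio would come out as 0.012, i.e. 1.2% of retired branches were mispredicted.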
Metric groups only work with perf stat (not with perf record; for recording you need to use the raw events):
sudo perf stat -M
-M, --metrics <metric/metric group list>
monitor specified metrics or metric groups (separated by ,)
So, for monitoring the L2 CPU cache of a running memset workload:
atr@f20u24:~/src/python-fio-automation/tmp/l2-profile$ sudo perf stat -M all_l2_cache_accesses,all_l2_cache_misses,all_l2_cache_hits -p `pidof memset` -- sleep 10
Performance counter stats for process id '5866':
729,753 l2_pf_miss_l2_l3 # 13335140259.00 all_l2_cache_accesses (66.65%)
1,496,264,279 l2_pf_hit_l2 (66.65%)
3,596,871,668 l2_pf_miss_l2_hit_l3 (66.65%)
8,241,274,559 l2_request_g1.all_no_prefetch (66.65%)
741,878 l2_pf_miss_l2_l3 # 7508684886.00 all_l2_cache_misses (33.42%)
7,167,909,193 l2_pf_miss_l2_hit_l3 (33.42%)
340,033,815 l2_cache_req_stat.ic_dc_miss_in_l2 (33.42%)
1,494,462,593 l2_pf_hit_l2 # 9448133571.00 all_l2_cache_hits (66.77%)
7,953,670,978 l2_cache_req_stat.ic_dc_hit_in_l2 (66.77%)
10.002230649 seconds time elapsed
Formulas for the metrics above:
all_l2_cache_accesses
[All L2 Cache Accesses]
[l2_request_g1.all_no_prefetch + l2_pf_hit_l2 + l2_pf_miss_l2_hit_l3 + l2_pf_miss_l2_l3]
all_l2_cache_hits
[All L2 Cache Hits]
[l2_cache_req_stat.ic_dc_hit_in_l2 + l2_pf_hit_l2]
all_l2_cache_misses
[All L2 Cache Misses]
[l2_cache_req_stat.ic_dc_miss_in_l2 + l2_pf_miss_l2_hit_l3 + l2_pf_miss_l2_l3]
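As a sanity check, the composite value perf printed can be reproduced from the raw counters in the first run above:

```shell
# all_l2_cache_accesses = l2_request_g1.all_no_prefetch + l2_pf_hit_l2
#                       + l2_pf_miss_l2_hit_l3 + l2_pf_miss_l2_l3
# Numbers copied from the perf stat output above.
echo $(( 8241274559 + 1496264279 + 3596871668 + 729753 ))
# prints 13335140259, matching the "13335140259.00 all_l2_cache_accesses" value
```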
The (66.65%) annotations mean the events were multiplexed and only counted for part of the run. To get rid of that, I dropped one metric group so all counters fit at once, and added -r 3 to repeat the measurement and report the variance:
atr@f20u24:~/src/python-fio-automation/tmp/l2-profile$ sudo perf stat -M all_l2_cache_accesses,all_l2_cache_hits -p `pidof memset` -r 3 -- sleep 10
Performance counter stats for process id '5909' (3 runs):
700,043 l2_pf_miss_l2_l3 # 19014233156.00 all_l2_cache_accesses ( +- 1.59% )
1,981,499,478 l2_pf_hit_l2 ( +- 0.04% )
7,866,034,241 l2_pf_miss_l2_hit_l3 ( +- 0.03% )
9,165,999,394 l2_request_g1.all_no_prefetch ( +- 0.03% )
1,981,499,664 l2_pf_hit_l2 # 1981502953.00 all_l2_cache_hits ( +- 0.04% )
3,289 l2_cache_req_stat.ic_dc_hit_in_l2 ( +- 37.91% )
10.001910 +- 0.000144 seconds time elapsed ( +- 0.00% )
atr@f20u24:~/src/python-fio-automation/tmp/l2-profile$
Generating Flame Graphs
https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
# git clone https://github.com/brendangregg/FlameGraph # or download it from github
# cd FlameGraph
# perf record -F 99 -a -g -- sleep 60
# perf script | ./stackcollapse-perf.pl > out.perf-folded
# ./flamegraph.pl out.perf-folded > perf.svg
# firefox perf.svg # or chrome, etc.
There is a script here: python-fio-automation/flame-graphs/sh-generate-graph.sh
July 30th, 2024
It mostly works out of the box for many things. I have not tested it comprehensively.
How to compile with RH-DEB issues
- I cannot build a deb package, so I just untar the source code that was generated during the build process and build the tool in there. The code is untarred in
/usr/src/linux/
- There were some issues in the make, but it was mostly OK; see here
Here is the command sequence for building the Linux perf tool:
sudo apt install libzstd1 libdwarf-dev libdw-dev binutils-dev libcap-dev libelf-dev libnuma-dev python3 python3-dev libssl-dev libunwind-dev libdwarf-dev zlib1g-dev liblzma-dev libaio-dev libtraceevent-dev debuginfod libpfm4-dev libslang2-dev systemtap-sdt-dev libperl-dev binutils-dev libbabeltrace-dev libiberty-dev libzstd-dev libpopt-dev libnl-3-dev libcapstone-dev
# I gave up building others as they take much time.
make tools/perf
Example run for collecting and mapping traces:
atr@u24clean:~/tmp$ sudo /usr/src/linux-6.9.0-atr-2024-07-05/tools/perf/perf record -e cycles -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.011 MB perf.data (10 samples) ]
atr@u24clean:~/tmp$ sudo /usr/src/linux-6.9.0-atr-2024-07-05/tools/perf/perf report -n -m --stdio --full-source-path --source -s symbol,srcline
# To display the perf.data header info, please use --header/--header-only options.
#
addr2line /usr/lib/debug/boot/vmlinux-6.9.0-atr-2024-07-05: could not read first record
addr2line /usr/lib/debug/boot/vmlinux-6.9.0-atr-2024-07-05: could not read first record
#
# Total Lost Samples: 0
#
# Samples: 10 of event 'cycles'
# Event count (approx.): 1178318
#
# Overhead Samples Symbol Source:Line IPC [IPC Coverage]
# ........ ............ .............................. .......................................................... ....................
#
26.36% 1 [k] pud_val /home/atr/src/linux/mm/memory.c:473 - -
24.27% 1 [k] ___slab_alloc /home/atr/src/linux/mm/slub.c:3488 - -
19.48% 1 [k] mas_store_prealloc mas_store_prealloc+102 - -
13.58% 1 [k] kmem_cache_alloc_bulk /home/atr/src/linux/mm/slub.c:4623 - -
8.07% 1 [.] dl_main ./elf/./setup-vdso.h:38 - -
4.39% 1 [k] irqentry_exit_to_user_mode irqentry_exit_to_user_mode+352 - -
2.02% 1 [k] psi_group_change /home/atr/src/linux/kernel/sched/psi.c:238 - -
0.92% 1 [k] pwq_tryinc_nr_active /home/atr/src/linux/kernel/workqueue.c:1808 - -
0.91% 1 [k] _raw_spin_lock_irq /home/atr/src/linux/./arch/x86/include/asm/atomic.h:115 - -
0.00% 1 [k] native_write_msr /home/atr/src/linux/./arch/x86/include/asm/jump_label.h:27 - -
#
# (Tip: Show estimate cycles per function and IPC in annotate use perf record -b ... ; perf report --total-cycles)
#
atr@u24clean:~/tmp$
NOTE: addr2line takes a long time, so be patient.
Commands from ZR:
perf record -e instructions -F 99 -o $PERF_RES/${e}.perf.out CMD-OF-WHAT-TO-RECORD
perf report --vmlinux $VMLINUX -n -m --stdio --full-source-path --source -s symbol -i $PERF_RES/${e}.perf.out >> $PERF_PARSED_LIST/perf_parsed_${e}.txt
For sudo not finding perf in its path, I made aliases:
alias sudo='sudo '
alias perf='/usr/src/linux-6.9.0-atr-2024-07-05/tools/perf/perf'
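The trailing space in the sudo alias is what makes this work: per bash's alias rules, if an alias value ends in a blank, the next command word is also checked for alias expansion, so perf expands under sudo. A stand-in demonstration (echo replaces the real sudo, and the perf path is hypothetical):

```shell
shopt -s expand_aliases          # alias expansion is off by default in scripts
alias sudo='echo RUN '           # stand-in for sudo; note the trailing space
alias perf='/custom/path/perf'   # hypothetical custom perf location
sudo perf --version              # expands to: echo RUN /custom/path/perf --version
```

Without the trailing space, the second word stays literally "perf" and sudo would look it up in its own (restricted) PATH.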
Unresolved
Notes taken on 5.12 kernel
- [Resolved] Full path expansion of the kernel symbols. It used to be there; now it just shows the binary file location, not the source file location where the symbol comes from. Somehow I remember -P used to do that.
atr@node1:/home/atr/4tb/file-io$ sudo perf kallsyms -v pick_next_entity
/proc/{kallsyms,modules} inconsistency while looking for "[bpf]" module!
[the same warning is printed eight times]
pick_next_entity: [kernel] [kernel.kallsyms] 0xffffffffbd0e0fb0-0xffffffffbd0e1100 (0xffffffffbd0e0fb0-0xffffffffbd0e1100)
atr@node1:/home/atr/4tb/file-io$
See: https://github.com/iovisor/bcc/issues/2872
Perf resources
- https://perf.wiki.kernel.org/index.php/Main_Page
- https://www.brendangregg.com/perf.html
- For the currently running kernel code, build perf from ./tools/perf/ with make
- [not tested] compiling from ubuntu sources: https://gist.github.com/joaomlneto/46f82b1f0ed5b4b0a6d0b3ec2ff4d0a5
- http://www.linux-kongress.org/2010/slides/lk2010-perf-acme.pdf
- understanding perf output: https://stackoverflow.com/questions/27742462/understanding-linux-perf-report-output
Further setup when symbols are not found: https://stackoverflow.com/questions/10933408/how-can-i-get-perf-to-find-symbols-in-my-program
1. compile the code with -g
2. run perf record --call-graph dwarf -F 97 /path/to/my/program
3. echo 0 > /proc/sys/kernel/kptr_restrict
Setup
Link the kernel source tree to /usr/src/linux
; perf also picks up the vmlinux file from there.
Recording the instructions
sudo perf record -e instructions -a -F 99 -g -v --vmlinux /usr/src/linux/vmlinux -n -- ./perf-file-creat-test /mnt/sdb/atr/file-io/ 10000
- -n (--no-samples) stopped it from sampling at the -F 99 frequency, hence there were no samples
- With -g, the profile contains cumulative (call-graph) symbols
Removing the call-graph information leaves the individual symbols:
sudo perf record -e instructions -a -v --vmlinux /usr/src/linux/vmlinux -- ./perf-file-creat-test /mnt/sdb/atr/file-io/ 10000
Removing the call graph and using the --stdio interface makes it more intuitive to see how the counts accumulate.
Report
sudo perf report -k /usr/src/linux/vmlinux -n -m --demangle-kernel --full-source-path --source --stdio
Annotation (somewhat ugly!)
This shows the annotated code, a bit:
sudo perf annotate -k /usr/src/linux/vmlinux -n -m --source --stdio
Generating full path symbols
Sorting based on srcline generates the complete kernel source path:
sudo perf report --vmlinux /usr/src/linux/vmlinux -n -m --stdio --full-source-path --source -s symbol,srcline
Example:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 52K of event 'instructions'
# Event count (approx.): 3364033240
#
# Overhead Samples Symbol Source:Line IPC [IPC Coverage]
# ........ ............ ........................................... .................................................................................................. ....................
#
2.18% 930 [k] fscrypt_match_name /mnt/sdb/atr/src/linux/fs/crypto/fname.c:488 - -
2.06% 854 [k] fscrypt_match_name /mnt/sdb/atr/src/linux/fs/crypto/fname.c:505 - -
1.40% 886 [k] _raw_spin_lock /mnt/sdb/atr/src/linux/./arch/x86/include/asm/atomic.h:202 - -
1.27% 552 [k] __ext4_check_dir_entry /mnt/sdb/atr/src/linux/fs/ext4/dir.c:108 - -
0.98% 429 [k] ext4_find_dest_de /mnt/sdb/atr/src/linux/fs/ext4/namei.c:1948 - -
0.88% 394 [k] __ext4_check_dir_entry /mnt/sdb/atr/src/linux/fs/ext4/dir.c:71 - -
0.78% 321 [k] ext4_search_dir /mnt/sdb/atr/src/linux/fs/ext4/namei.c:1394 - -
0.77% 452 [k] update_nohz_stats /mnt/sdb/atr/src/linux/kernel/sched/fair.c:8362 - -
0.72% 493 [k] jbd2_journal_write_revoke_records /mnt/sdb/atr/src/linux/fs/jbd2/revoke.c:549 - -
0.70% 497 [k] _raw_spin_lock_irqsave /mnt/sdb/atr/src/linux/./arch/x86/include/asm/atomic.h:202 - -
0.70% 517 [k] read_tsc /mnt/sdb/atr/src/linux/./arch/x86/include/asm/msr.h:234 - -
0.55% 385 [k] native_sched_clock /mnt/sdb/atr/src/linux/./arch/x86/include/asm/msr.h:205 - -
0.53% 224 [k] memcmp /mnt/sdb/atr/src/linux/lib/string.c:938 - -
0.52% 223 [k] fscrypt_match_name /mnt/sdb/atr/src/linux/fs/crypto/fname.c:493 - -
0.50% 215 [k] ext4_find_dest_de /mnt/sdb/atr/src/linux/fs/ext4/namei.c:1955 - -
0.46% 305 [k] __switch_to_asm /mnt/sdb/atr/src/linux/arch/x86/entry/entry_64.S:253 - -
0.45% 191 [k] fscrypt_match_name /mnt/sdb/atr/src/linux/./include/linux/fortify-string.h:235 - -
0.44% 300 [k] jbd2_clear_buffer_revoked_flags /mnt/sdb/atr/src/linux/fs/jbd2/revoke.c:494 - -
0.44% 187 [k] ext4_find_dest_de /mnt/sdb/atr/src/linux/fs/ext4/namei.c:1375 - -
0.39% 159 [k] ext4_search_dir /mnt/sdb/atr/src/linux/fs/ext4/namei.c:1360 - -
0.37% 229 [k] _raw_read_lock /mnt/sdb/atr/src/linux/./include/asm-generic/qrwlock.h:79 - -
0.35% 150 [k] __ext4_check_dir_entry /mnt/sdb/atr/src/linux/fs/ext4/dir.c:88 - -
0.35% 146 [k] __ext4_check_dir_entry /mnt/sdb/atr/src/linux/fs/ext4/dir.c:77 - -
0.34% 304 [k] menu_select /mnt/sdb/atr/src/linux/drivers/cpuidle/governors/menu.c:236 - -
0.33% 142 [k] fscrypt_match_name /mnt/sdb/atr/src/linux/fs/crypto/fname.c:496 - -