eBPF profiling - animeshtrivedi/notes GitHub Wiki
code and bookmarks
- [atr] code examples: https://github.com/animeshtrivedi/ebpf-example
- https://github.com/iovisor/bcc/tree/master/examples
- https://www.brendangregg.com/ebpf.html
- Perf with eBPF, https://www.brendangregg.com/perf.html#eBPF
- BCC events reference guide: https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md
- Python tools example: https://github.com/iovisor/bcc/tree/master/examples/tracing
- [ntehrany] setup bpftrace, https://github.com/nicktehrany/notes/wiki/bpftrace
Ubuntu: sudo apt-get install bpfcc-tools
(This gets all the *-bpfcc tools, so a tool XXX from https://github.com/iovisor/bcc/tree/master/examples/tracing becomes XXX-bpfcc in home.)
bpftools
Compile from source for a particular kernel version. If the build stops with fatal error: readline/readline.h, install the readline headers:
sudo apt-get install libreadline-dev
# Then the build fails in runqslower:
DESCEND runqslower
Couldn't find kernel BTF; set VMLINUX_BTF to specify its location.
make[1]: *** [Makefile:77: /home/animesh.trivedi/src/linux/tools/bpf/runqslower/.output//vmlinux.h] Error 1
make: *** [Makefile:122: runqslower] Error 2
Remove the pre-installed packages: https://github.com/iovisor/bcc/issues/3993#issuecomment-1228217609
apt purge bpfcc-tools libbpfcc python3-bpfcc
wget https://github.com/iovisor/bcc/releases/download/v0.25.0/bcc-src-with-submodule.tar.gz
tar xf bcc-src-with-submodule.tar.gz
cd bcc/
apt install -y python-is-python3
apt install -y bison build-essential cmake flex git libedit-dev libllvm11 llvm-11-dev libclang-11-dev zlib1g-dev libelf-dev libfl-dev python3-distutils
apt install -y checkinstall
# From here on you can follow the upstream instructions (linked below)
mkdir build
cd build/
cmake -DCMAKE_INSTALL_PREFIX=/usr -DPYTHON_CMD=python3 ..
make
checkinstall
https://github.com/iovisor/bcc/blob/master/INSTALL.md#ubuntu---source
On Ubuntu 24 (make sure to use LLVM 18 throughout):
sudo apt install -y zip bison build-essential cmake flex git libedit-dev \
libllvm18 llvm-18-dev libclang-18-dev python3 zlib1g-dev libelf-dev libfl-dev python3-setuptools \
liblzma-dev libdebuginfod-dev arping netperf iperf libpolly-18-dev python-is-python3
Then clone and install
git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake ..
make
sudo make install
Tracing framework (choices, formats)
https://blogs.oracle.com/linux/post/taming-tracepoints-in-the-linux-kernel
# show available events
sudo cat /sys/kernel/debug/tracing/available_events
atr@cordova:~$ sudo ls -l /sys/kernel/debug/tracing/events/ | wc -l
151
atr@cordova:~$ sudo cat /sys/kernel/debug/tracing/available_events | wc -l
2618
# The two counts differ because events/ is a directory tree: the top level holds one directory per
# subsystem, and each event has its own directory, events/<subsystem>/<event>/, containing a format file.
# Show the format of one event:
sudo cat /sys/kernel/debug/tracing/events/xhci-hcd/xhci_setup_device/format
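The relation between the subsystem count and the event count above can be checked with a small script. This is a minimal sketch, assuming every line of available_events has the form subsystem:event; the function name count_events_by_subsystem and the sample text are made up for illustration (the event names are ones that appear elsewhere in these notes).

```python
from collections import Counter


def count_events_by_subsystem(text):
    """Count tracepoint events per subsystem.

    Assumes each non-empty line looks like "subsystem:event", which is
    the layout of /sys/kernel/debug/tracing/available_events.
    """
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blanks and anything that is not subsystem:event
        subsystem, _event = line.split(":", 1)
        counts[subsystem] += 1
    return counts


# Illustrative sample; on a real system read the tracefs file (as root) instead.
sample = """\
xhci-hcd:xhci_setup_device
sched:sched_switch
sched:sched_wakeup
"""

counts = count_events_by_subsystem(sample)
print(len(counts), "subsystems,", sum(counts.values()), "events")
```

On a real machine the text would come from the available_events file; the number of keys then corresponds to the subsystem directories and the sum of the values to the number of events.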
Some hints on how to compile the C/eBPF program directly
https://github.com/anakryiko/bpf-ringbuf-examples/tree/main
Get function "anything" histogram
I am taking the size as an example: see the bitehist.py file in the bcc GitHub repository: https://github.com/iovisor/bcc/blob/master/examples/tracing/bitehist.py
Get function execution time distribution
atr@f20u24:~/src/ebpf-probes-traces$ sudo /usr/share/bcc/tools//funclatency -d 10 memset_probe2
Tracing 1 functions for "memset_probe2"... Hit Ctrl-C to end.
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 849777 |**** |
512 -> 1023 : 7156535 |****************************************|
1024 -> 2047 : 12235 | |
2048 -> 4095 : 69 | |
4096 -> 8191 : 1394 | |
8192 -> 16383 : 801 | |
16384 -> 32767 : 74 | |
32768 -> 65535 : 20 | |
65536 -> 131071 : 12 | |
131072 -> 262143 : 1 | |
262144 -> 524287 : 2 | |
524288 -> 1048575 : 3 | |
avg = 572 nsecs, total: 4590811412 nsecs, count: 8021354
Detaching...
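To read the histogram above, it helps to see how a value maps to its power-of-two bucket. The following is a minimal sketch that mirrors the printed bucket boundaries (0 -> 1, 2 -> 3, 4 -> 7, ...); the helper name log2_bucket is hypothetical and this is not the code funclatency itself uses.

```python
def log2_bucket(value):
    """Return the (low, high) power-of-two bucket that holds value.

    Mirrors the bucket boundaries printed above: 0 -> 1, 2 -> 3, 4 -> 7, ...
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    if value <= 1:
        return (0, 1)
    k = value.bit_length() - 1  # largest k with 2**k <= value
    return (2 ** k, 2 ** (k + 1) - 1)


# Numbers copied from the funclatency summary line above.
total_nsecs = 4590811412
count = 8021354
avg = total_nsecs // count  # integer division of total by count

print(avg, log2_bucket(avg))
```

The integer average lands in the same 512 -> 1023 bucket that holds the bulk of the samples in the run above.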
kernel symbols which are non-traceable or probe-able
List the kernel symbols (not every one of them can be probed):
less /proc/kallsyms
There are function names with a .constprop suffix or a __pfx prefix. What do these symbols mean:
https://people.redhat.com/~jolawren/klp-compiler-notes/livepatch/compiler-considerations.html
What to do about them? https://github.com/iovisor/bcc/issues/4261
If not, changing static inline void to void would resolve this.
On my own out-of-tree (OOT) nullblk, this did work.
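To get a feeling for how many such symbols exist, the kallsyms listing can be scanned for the two naming patterns mentioned above. This is a minimal sketch, assuming the usual "address type name [module]" line layout; the function names, labels, and sample lines (including the addresses) are made up for illustration.

```python
def classify_symbol(name):
    """Label a kernel symbol name by the naming patterns mentioned above."""
    if name.startswith("__pfx"):
        return "pfx"
    if ".constprop" in name:
        return "constprop"
    return "plain"


def parse_kallsyms(text):
    """Yield (name, label) for each "address type name [module]" line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # not a symbol line
        yield fields[2], classify_symbol(fields[2])


# Hypothetical sample lines in /proc/kallsyms layout (addresses are made up).
sample = """\
ffffffff81000010 t __pfx_my_helper
ffffffff81000020 t my_helper
ffffffff81000100 t my_helper.constprop.0
"""

for name, label in parse_kallsyms(sample):
    print(f"{label:10} {name}")
```

On a real system the text would be the contents of /proc/kallsyms; the labels only reflect the symbol name and say nothing about whether a probe can actually be attached.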
[July 2024] Example sessions refresher
Tracking io_uring performance on tmpfs
dump CPU profiles with fio
sudo profile-bpfcc -p `pidof -d, fio` -F 99 10 &> fast_stack
How to get off-CPU time histograms and stacks
Histogram:
atr@u24clean:~/tmp$ sudo cpudist-bpfcc -O -p 6271 10 1 2>/dev/null
Tracing off-CPU time... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 100 | |
2 -> 3 : 112 | |
4 -> 7 : 20752 | |
8 -> 15 : 1342784 |****************************************|
16 -> 31 : 12664 | |
32 -> 63 : 454 | |
64 -> 127 : 143 | |
128 -> 255 : 83 | |
256 -> 511 : 3 | |
512 -> 1023 : 1 | |
atr@u24clean:~/tmp$ sudo cpudist-bpfcc -O -p 6290 10 1 2>/dev/null
Tracing off-CPU time... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 1298518 |**************** |
2 -> 3 : 3098098 |****************************************|
4 -> 7 : 34802 | |
8 -> 15 : 7021 | |
16 -> 31 : 564 | |
32 -> 63 : 36 | |
64 -> 127 : 8 | |
128 -> 255 : 6 | |
256 -> 511 : 11 | |
512 -> 1023 : 1 | |
CPU stack histograms
Here is an example for the fio process. -d, makes pidof use ',' as the delimiter of its output.
sudo profile-bpfcc -p `pidof -d, fio` -F 99 10 &> fast_stack
Workqueue dump
/usr/src/linux-6.9.0-atr-2024-07-05/tools/workqueue$ ./wq_dump.py
A collection of system tools to benchmark Linux with eBPF
It seems like when perf is compiled from source it does not include eBPF tracepoint events.
Showing all supported tracepoint events on node2, 5.17.59. Also, sudo gives a different list than the normal user.
zebin@node2:~$ sudo perf list sched:*
List of pre-defined events (to be used in -e):
sched:sched_kthread_stop [Tracepoint event]
sched:sched_kthread_stop_ret [Tracepoint event]
sched:sched_kthread_work_execute_end [Tracepoint event]
sched:sched_kthread_work_execute_start [Tracepoint event]
sched:sched_kthread_work_queue_work [Tracepoint event]
sched:sched_migrate_task [Tracepoint event]
sched:sched_move_numa [Tracepoint event]
sched:sched_pi_setprio [Tracepoint event]
sched:sched_process_exec [Tracepoint event]
sched:sched_process_exit [Tracepoint event]
sched:sched_process_fork [Tracepoint event]
sched:sched_process_free [Tracepoint event]
sched:sched_process_hang [Tracepoint event]
sched:sched_process_wait [Tracepoint event]
sched:sched_stat_blocked [Tracepoint event]
sched:sched_stat_iowait [Tracepoint event]
sched:sched_stat_runtime [Tracepoint event]
sched:sched_stat_sleep [Tracepoint event]
sched:sched_stat_wait [Tracepoint event]
sched:sched_stick_numa [Tracepoint event]
sched:sched_swap_numa [Tracepoint event]
sched:sched_switch [Tracepoint event]
sched:sched_wait_task [Tracepoint event]
sched:sched_wake_idle_without_ipi [Tracepoint event]
sched:sched_wakeup [Tracepoint event]
sched:sched_wakeup_new [Tracepoint event]
sched:sched_waking [Tracepoint event]
zebin@node2:~$ sudo perf list syscalls:*
List of pre-defined events (to be used in -e):
syscalls:sys_enter_accept [Tracepoint event]
syscalls:sys_enter_accept4 [Tracepoint event]
syscalls:sys_enter_access [Tracepoint event]
syscalls:sys_enter_acct [Tracepoint event]
syscalls:sys_enter_add_key [Tracepoint event]
syscalls:sys_enter_adjtimex [Tracepoint event]
syscalls:sys_enter_alarm [Tracepoint event]
syscalls:sys_enter_arch_prctl [Tracepoint event]
syscalls:sys_enter_bind [Tracepoint event]
syscalls:sys_enter_bpf [Tracepoint event]
syscalls:sys_enter_brk [Tracepoint event]
syscalls:sys_enter_capget [Tracepoint event]
...
Counting number of system calls per second
https://www.brendangregg.com/blog/2014-07-03/perf-counting.html
zebin@node2:~$ sudo perf stat -e 'syscalls:sys_enter_*' -a sleep 5 | awk '{sum+=$1}; END {print sum}'
Performance counter stats for 'system wide':
3 syscalls:sys_enter_socket
0 syscalls:sys_enter_socketpair
0 syscalls:sys_enter_bind
0 syscalls:sys_enter_listen
1 syscalls:sys_enter_accept4
0 syscalls:sys_enter_accept
3 syscalls:sys_enter_connect
0 syscalls:sys_enter_getsockname
0 syscalls:sys_enter_getpeername
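It generates the output but does not summarize it. perf stat prints its counters to stderr, so nothing reaches awk through the pipe; adding 2>&1 before the pipe should let awk see the counts.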
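The summing step can also be done on saved perf stat output. This is a minimal sketch, assuming each counter line starts with a count followed by the event name and that a thousands separator, if present, is a comma; the function name sum_syscall_counts is hypothetical, and the sample lines are taken from the output above.

```python
import re

# A count (digits, optionally with comma separators) followed by a syscalls event name.
COUNT_LINE = re.compile(r"^\s*([\d,]+)\s+(syscalls:\S+)")


def sum_syscall_counts(perf_stat_text):
    """Sum the per-event counts found in perf stat output text."""
    total = 0
    for line in perf_stat_text.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            total += int(match.group(1).replace(",", ""))
    return total


# Lines copied from the perf stat output above.
sample = """\
Performance counter stats for 'system wide':
3 syscalls:sys_enter_socket
0 syscalls:sys_enter_bind
1 syscalls:sys_enter_accept4
3 syscalls:sys_enter_connect
"""

total = sum_syscall_counts(sample)
print("total syscalls:", total)
```

To feed it real data, the stderr of perf stat can be redirected to a file first (for example with 2> perf.txt) and the file contents passed to the function; dividing the total by the sleep duration gives calls per second.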
https://kubernetes.io/blog/2017/12/using-ebpf-in-kubernetes/
https://lwn.net/Articles/740157/
System instrumentation
While reading about bpftrace:
- Reference guide: https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
- Manual: https://github.com/iovisor/bpftrace/blob/master/man/adoc/bpftrace.adoc
Setup
- make sure headers are installed. The 5.12 kernel I compiled is missing headers.
atr@node1:~$ sudo bpftrace --version
bpftrace v0.9.4
atr@node1:~$ which bpftrace
/usr/bin/bpftrace
atr@node1:~$
Example small run:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_nanosleep { printf("%s is sleeping.\n", comm); }'
The -e flag gives the program to execute. It uses an awk-like execution model: probe /filter/ { action }.
How to look for a probe
bpftrace -l '*sleep*'
How to look for the tracepoint signature?
atr@node1:~$ sudo bpftrace -lv tracepoint:syscalls:sys_enter_nanosleep
tracepoint:syscalls:sys_enter_nanosleep
int __syscall_nr;
struct __kernel_timespec * rqtp;
struct __kernel_timespec * rmtp;
atr@node1:~$
Question: where does comm come from? It is one of the builtins, see: https://github.com/iovisor/bpftrace/blob/master/man/adoc/bpftrace.adoc#builtins
Tracepoints have a clear signature and are well maintained; kprobes are not. For a kprobe you need to look up the signature of the probed kernel function and use that.
Including headers
bpftrace --include ./header.h
bpftrace -I ./folder/
Filtering example with kprobe
Filter for file reads of exactly "X" bytes (512 here)
bpftrace -e 'kprobe:vfs_read /arg2 == 512/ { printf("%s small read: %d byte buffer\n", comm, arg2); }'
vfs_read signature for the v5.12 kernel: https://elixir.bootlin.com/linux/v5.12.19/source/fs/read_write.c#L476
The third parameter is the count. Argument numbering starts from 0, so it is arg2, and that is what we filter on.
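The mapping from a C signature to the kprobe argN names can be sketched as below. This is only an illustration with simple string handling, assuming plain parameters (it would not cope with function-pointer parameters); the helper name kprobe_arg_names is hypothetical and not part of bpftrace.

```python
import re


def kprobe_arg_names(signature):
    """Map each C parameter name in a signature to its bpftrace argN name."""
    params = signature[signature.index("(") + 1 : signature.rindex(")")]
    mapping = {}
    for index, param in enumerate(params.split(",")):
        # The parameter name is the last identifier in the declaration.
        name = re.findall(r"[A-Za-z_]\w*", param)[-1]
        mapping[name] = f"arg{index}"
    return mapping


# vfs_read signature as in the v5.12 source linked above.
sig = "ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)"
for name, arg in kprobe_arg_names(sig).items():
    print(f"{arg}: {name}")
```

For vfs_read this places count at index 2, which is why the filter above compares arg2.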
Now I want to filter on the process name, using the builtin comm:
bpftrace -e 'kprobe:vfs_read /comm == "my_name"/ { printf("%s read: %d byte buffer\n", comm, arg2); }'
With tracepoints how to reference arguments
Use the args-> construct.
root@node1:/home/atr# bpftrace -lv tracepoint:syscalls:sys_enter_openat
tracepoint:syscalls:sys_enter_openat
int __syscall_nr;
int dfd;
const char * filename;
int flags;
umode_t mode;
root@node1:/home/atr#
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
Attaching 1 probe...
snmpd /proc/diskstats
snmpd /proc/stat
snmpd /proc/vmstat
Navigating structs as arguments
Include the header file
# cat path.bt
#include <linux/path.h>
#include <linux/dcache.h>
kprobe:vfs_open
{
    printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name));
}
# bpftrace path.bt
Attaching 1 probe...
open path: dev
open path: if_inet6
open path: retrans_time_ms
[...]
Links
- Kprobe kernel documentation: https://www.kernel.org/doc/Documentation/kprobes.txt
Questions
- What is the difference between bpftrace and bpftool? bpftool is missing on node1, I don't know why.