BPF - hpaluch/hpaluch.github.io GitHub Wiki
BPF
BPF stands for Berkeley Packet Filter - originally framework to provide flexible packet filtering
for tcpdump
and similar tools - it used its own bytecode, however Linux eBPF is now able to convert
bytecode to real CPU instructions (so expressions are evaluated at native CPU speed).
Today it become universal framework to trace kernel.
Please see https://www.brendangregg.com/ebpf.html for official quick overview.
BCC versus BPF Trace:
- explained on https://github.com/iovisor/bcc/pull/4173/files
- Quote: bpftrace is at github.com/iovisor/bpftrace, and is a high-level front-end for BPF tracing, which uses libraries from bcc. bpftrace is ideal for ad hoc instrumentation with powerful custom one-liners and short scripts, whereas bcc is ideal for complex tools and daemons.
- BCC makes BPF programs easier to write, with kernel instrumentation in C (and includes a C wrapper around LLVM), and front-ends in Python and lua. It is suited for many tasks, including performance analysis and network traffic control.
- bpftrace programs generally looks lot of like
awk
and have extension.bt
- BCC tools are mostly written in Python, but the whole structure looks to me quite clumsy - there are long multi-line strings of BPF code and later lot of python glue...
Unfortunately there is some tensions where new examples for bptrace
are no longer accepted and
users are pointed to bcc
as can be seen on:
- https://github.com/bpftrace/bpftrace/issues/2616
- KVM example not merged:
Installing BPF tools
Example is for Proxmox VE 8.2.4, kernel 6.8.12-1-pve, which is more-or-less Debian 12:
Disclaimer: BPF uses deep internals of Linux kernel (there is no official stable API). SO it is quite normal that existing script will brake with current kernel...
There are actually two sets of examples that can be installed with:
apt-get install bpfcc-tools
apt-get install bpftrace
First command will install nice example scripts under /usr/sbin/*-bpfcc
Later command install /usr/sbin/*.bt
Let's try first example globally snooping which new processes were exec-uted (must be run as root):
$ /usr/sbin/execsnoop.bt
Attaching 3 probes...
TIME(ms) PID ARGS
4093 9736 nft -j -f -
4093 9736 nft -j -f -
4093 9736 nft -j -f -
4101 9737 nft -j -f -
4101 9737 nft -j -f -
4101 9737 nft -j -f -
6766 9740 iptables-save
6782 9741 iptables-restore -T filter -n
6797 9742 ip6tables-save
When you look inside you can see that source code is very short:
#!/usr/bin/env bpftrace
/*
* execsnoop.bt Trace new processes via exec() syscalls.
* For Linux, uses bpftrace and eBPF.
* ...
*/
BEGIN
{
printf("%-10s %-5s %s\n", "TIME(ms)", "PID", "ARGS");
}
tracepoint:syscalls:sys_enter_exec*
{
printf("%-10u %-5d ", elapsed / 1e6, pid);
join(args->argv);
}
You can list used tracepoints with:
$ bpftrace -l 'tracepoint:syscalls:sys_enter_exec*'
tracepoint:syscalls:sys_enter_execve
tracepoint:syscalls:sys_enter_execveat
NOTE: Still I'm not sure why there is message Attaching 3 probes...
When it is broken
Here is example from same Proxmox VE environment:
$ /usr/sbin/swapin.bt
/usr/sbin/swapin.bt:1-16: WARNING: swap_readpage is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
Attaching 2 probes...
cannot attach kprobe, probe entry may not exist
ERROR: Error attaching probe: 'kprobe:swap_readpage'
WARNING!
- although it looks obvious that
kprobe:swap_readpage
was replaced withkprobe:swap_read_page
, is it NOT the case - as can be found on: https://lore.kernel.org/all/[email protected]/T/
- it was actually replaced by
kprobe:swap_read_folio
! - as can be verified on BCC version of swapin.py: https://github.com/iovisor/bcc/commit/36b40102cc94f537cdbcea35c5cc766a53643c1d
So proper patch is this:
diff -u broken/swapin.bt swapin.bt
--- broken/swapin.bt 2024-08-10 09:58:50.396672650 +0200
+++ swapin.bt 2024-08-12 16:35:47.276154899 +0200
@@ -13,7 +13,7 @@
* 26-Jan-2019 Brendan Gregg Created this.
*/
-kprobe:swap_readpage
+kprobe:swap_read_folio
{
@[comm, pid] = count();
}
First verify with vmstat -y 1
that there are is non-zero column si
(swap in).
Then run patched version of swapin.bt
:
$ ./swapin.bt
@[pve-firewall, 1617]: 2
16:52:49
@[pvestatd, 1625]: 24
16:52:50
@[pvedaemon worke, 1647]: 1
@[pveproxy worker, 1658]: 1
@[server, 1515]: 1
When it is really broken
Let's try more advanced example:
$ /usr/sbin/biosnoop.bt
/usr/sbin/biosnoop.bt:21-23: WARNING: blk_account_io_start is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:23-24: WARNING: __blk_account_io_start is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:30-32: WARNING: blk_account_io_done is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:32-33: WARNING: __blk_account_io_done is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
Attaching 6 probes...
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:__blk_account_io_done, skipping.
cannot attach kprobe, Invalid argument
WARNING: could not attach probe kprobe:blk_account_io_done, skipping.
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:__blk_account_io_start, skipping.
cannot attach kprobe, Invalid argument
WARNING: could not attach probe kprobe:blk_account_io_start, skipping.
TIME(ms) DISK COMM PID LAT(ms)
Ctrl-C
In such case there is no real replacement candidate and we have to resort to standard option, e.g., using Google:
- https://github.com/iovisor/bcc/issues/4261
- there is reference to reaking kernel commit: https://github.com/torvalds/linux/commit/450b7879e34517c3ebc3a35a53806fe40e60fac2
- what is even worse, that global function is replaced with
static
- so it will be no longer reachable... - proper patch was submitted here: https://www.spinics.net/lists/linux-block/msg90434.html
- but not sure when (and if) it reaches standard kernel...
- this commit should contain fixed version
It seems that Proxmox kernel has these new tracepoints (-v
will reveal tracepoint arguments):
$ uname -r
6.8.12-1-pve # <<< kernel version
# replaces kprobe:blk_account_io_start, kprobe:__blk_account_io_start
$ bpftrace -l 'tracepoint:*block_io_start*' -v
tracepoint:block:block_io_start
dev_t dev
sector_t sector
unsigned int nr_sector
unsigned int bytes
char rwbs[8]
char comm[16]
__data_loc char[] cmd
# replaces probe:blk_account_io_done, kprobe:__blk_account_io_done
$ bpftrace -l 'tracepoint:*block_io_done' -v
tracepoint:block:block_io_done
dev_t dev
sector_t sector
unsigned int nr_sector
unsigned int bytes
char rwbs[8]
char comm[16]
__data_loc char[] cmd
There is open pull request that should fix it:
Custom script - I/O sector alignment
Goal: write script that will check I/O sector alignment.
We will tracepoint:block:block_io_start
for that.
First inspect data of tracepoint:block:block_io_start
:
$ bpftrace -lv tracepoint:block:block_io_start
tracepoint:block:block_io_start
dev_t dev
sector_t sector
unsigned int nr_sector
unsigned int bytes
char rwbs[8]
char comm[16]
__data_loc char[] cmd
Also:
$ cat /sys/kernel/debug/tracing/events/block/block_io_start/format
name: block_io_start
ID: 1257
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:dev_t dev; offset:8; size:4; signed:0;
field:sector_t sector; offset:16; size:8; signed:0;
field:unsigned int nr_sector; offset:24; size:4; signed:0;
field:unsigned int bytes; offset:28; size:4; signed:0;
field:char rwbs[8]; offset:32; size:8; signed:0;
field:char comm[16]; offset:40; size:16; signed:0;
field:__data_loc char[] cmd; offset:56; size:4; signed:0;
print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd), (unsigned long long)REC->sector, REC->nr_sector, REC->comm
Did not find a way how to quickly use that fmt
to avoid rewriting it into bpftrace
...
Quick oneline test:
$ bpftrace -e 'tracepoint:block:block_io_start { printf("sector=%x nr=%u bytes=%u comm=%s\n", args->sector,args->nr_sector, args->bytes, args->comm); }'
sector=0 nr=0 bytes=8 comm=kworker/u4:22
sector=0 nr=0 bytes=8 comm=kworker/u4:22
sector=212e3288 nr=8 bytes=4096 comm=jbd2/dm-1-8
sector=210e2aa0 nr=0 bytes=0 comm=jbd2/dm-1-8
sector=212e32a0 nr=8 bytes=4096 comm=kworker/0:7
sector=200800 nr=256 bytes=131072 comm=vgs
sector=200a00 nr=256 bytes=131072 comm=vgs
sector=200800 nr=256 bytes=131072 comm=vgs
sector=200800 nr=256 bytes=131072 comm=lvs
sector=200a00 nr=256 bytes=131072 comm=lvs
Hmm, but why there is sometimes bytes=0?
Another problem:
- unable to get meaningful major,minor from dev:
tracepoint:block:block_io_start
{
// from include/linux/kdev_t.h
$major = ( (args->dev) & 0xfff00) >> 8;
$minor = ( (args->dev) & 0xff) | (( (args->dev) >> 12) & 0xfff00);
//$major = (args->dev) >> 8;
//$minor = (args->dev) & 0xff;
printf("sz=%d dev=0x%llx dev=%u,%u ", sizeof(args->dev), args->dev, $major, $minor);
printf("sector=%x nr=%u bytes=%u comm=%s rwbs=%s\n",
args->sector,args->nr_sector, args->bytes, args->comm,args->rwbs);
}
Getting:
sz=4 dev=0x800000 dev=0,2048 sector=318c910 nr=32 bytes=16384 comm=kworker/u4:0 rwbs=W
sz=4 dev=0xb00000 dev=0,2816 sector=0 nr=0 bytes=8 comm=kworker/u4:0 rwbs=N
sz=4 dev=0xb00000 dev=0,2816 sector=0 nr=0 bytes=8 comm=kworker/u4:0 rwbs=N
Being so desperate also tried (under Fedora install dnf install perf
instead of linux-perf
):
$ apt-get install linux-perf
$ perf list | fgrep block_rq
$ perf trace -e 'block:block_io_start'
0.000 kworker/u4:0-d/18423 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:0", cmd: "")
1216.080 jbd2/dm-1-8/635 block:block_io_start(dev: 8388608, sector: 556762200, nr_sector: 8, bytes: 4096, rwbs: "WSM", comm: "jbd2/dm-1-8", cmd: "")
1216.626 jbd2/dm-1-8/635 block:block_io_start(dev: 8388608, sector: 554663064, rwbs: "FWFSM", comm: "jbd2/dm-1-8", cmd: "")
1227.444 kworker/0:7-ev/1050 block:block_io_start(dev: 8388608, sector: 556762264, nr_sector: 8, bytes: 4096, rwbs: "WFSM", comm: "kworker/0:7", cmd: "")
2047.993 kworker/u4:0-d/18423 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:0", cmd: "")
4095.984 kworker/u4:22-/230 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:22", cmd: "")
4800.087 kworker/u4:22-/230 block:block_io_start(dev: 8388608, sector: 78316928, nr_sector: 16, bytes: 8192, rwbs: "W", comm: "kworker/u4:22", cmd: "")
6143.982 kworker/u4:22-/230 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:22", cmd: "")
6152.624 vgs/18976 block:block_io_start(dev: 8388608, sector: 2099200, nr_sector: 256, bytes: 131072, rwbs: "R", comm: "vgs", cmd: "")
6153.935 vgs/18976 block:block_io_start(dev: 8388608, sector: 2099712, nr_sector: 256, bytes: 131072, rwbs: "R", comm:
$ perf trace -e 'block:block_rq_complete'
0.000 :0/0 block:block_rq_complete(dev: 11534336, sector: -1, rwbs: "N", cmd: "")
2049.028 pve-firewall/1613 block:block_rq_complete(dev: 11534336, sector: -1, rwbs: "N", cmd: "")
2278.587 :0/0 block:block_rq_complete(dev: 8388608, sector: 2099200, nr_sector: 256, rwbs: "R", cmd: "")
2356.410 :0/0 block:block_rq_complete(dev: 8388608, sector: 2099200, nr_sector: 256, rwbs: "R", cmd: "")
3676.119 :0/0 block:block_rq_complete(dev: 8388608, sector: 556755032, nr_sector: 264, rwbs: "WSM", cmd: "")
3693.873 :0/0 block:block_rq_complete(dev: 8388608, sector: -1, rwbs: "FF", cmd: "")
3694.072 :0/0 block:block_rq_complete(dev: 8388608, sector: 554656096, rwbs: "WSM", cmd: "")
3694.316 kworker/u4:22-/230 block:block_rq_complete(dev: 8388608, sector: 556755296, nr_sector: 8, rwbs: "WSM", cmd: "")
Find kernel calls from process
Here is example from Fedora - installing Windows server 2022 in LibVirt + KVM:
cd /usr/share/bcc/tools
./syscount -p PID_OF_QEMU
...
racing syscalls, printing top 10... Ctrl+C to quit.
^C[17:16:35]
SYSCALL COUNT
clock_gettime 1021651
ioctl 753043
futex 256742
gettimeofday 192409
ppoll 74836
write 7517
read 3994
poll 2385
recvmsg 960
writev 671
Or:
cd /usr/share/bcc/tools
./ext4slower -p PID_OF_QEMU
...
IME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
17:20:22 worker 5315 W 686080 9947504 12.87 win2k22.qcow2
17:20:22 worker 5315 W 753152 9948176 14.61 win2k22.qcow2
17:20:22 worker 5315 W 88576 9949524 11.28 win2k22.qcow2
17:20:22 worker 5315 R 131072 5088416 19.31 en-us_windows_server_2022_updated_july_2
17:20:24 worker 5315 S 0 0 488.64 win2k22.qcow2
17:20:25 worker 5315 S 0 0 877.78 win2k22.qcow2
17:20:26 worker 5315 R 131072 5100576 13.92 en-us_windows_server_2022_updated_july_2
17:20:26 worker 5315 R 131072 5106976 11.32 en-us_windows_server_2022_updated_july_2
17:20:28 worker 5315 R 4096 3260976 13.75 win2k22.qcow2
...
Now I have to understand what that means...
Flame graphs: quick start
Here is my Quick Start how to make Flame graphs (using Gregg's) guides.
Tested Host environment:
- Proxmox VE 8.2.4
- old dual-core Opteron X2 (Generation 2), MSI-7250 MB
- 8 GB RAM
- AHCI, 4TB disk Seagate IronWolf
NOTE: Host CPU does not support HAP (Hardware assisted paging) so any access to new page is costly (causes trap and host must setup so called "Shadow pages").
Guest VM booted:
- Windows Server 2022, BIOS, 1 CPU, 4GB RAM, 35GB disk Virtio BLK on ext4/raw, disabled Network (to stop Windows update affecting tests)
Install:
apt-get install bpftrace bpfcc-tools git-core
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/brendangregg/FlameGraph.git
mkdir -p ~/bpf
CPU Flame graph
How to get CPU Flame graph
I start VM (with id=108 in example) in Proxmox and run while VM is booting:
cd ~/bpf
id=108
ulimit -n 65536
profile-bpfcc -F 99 -adf --stack-storage-size $(( 2 * 1024 * 1024 )) 120 > boot-$id.profile-folded
~/projects/FlameGraph/flamegraph.pl boot-$id.profile-folded > /usr/share/pve-manager/images/cpu-$id.svg
echo "Image URL: https://`hostname -i`:8006/pve2/images/cpu-$id.svg"
And point your browser to https://PROXMO_IP:8006/pve2/images/cpu-108.svg
.
Here is example graph:
NOTE: note how kvm_mmu_page_fault
dominates this chart.
NOTE: GitHub blocks JavaScript in SVG image so it is not interactive - you have to publish it somewhere, where is correct ContentType to make it interactive.
Off-CPU Flame graph
Quick Off-CPU chart, following:
cd ~/bpf
id=108
ulimit -n 65536
/usr/sbin/offcputime-bpfcc -df 120 > off-cpu-$id.stacks
~/projects/FlameGraph/flamegraph.pl --color=io \
--title="Off-CPU Time Flame Graph" \
--countname=us < off-cpu-$id.stacks > /usr/share/pve-manager/images/off-cpu-$id.svg
echo "Image URL: https://`hostname -i`:8006/pve2/images/off-cpu-$id.svg"
Here is example graph:
NOTE: GitHub blocks JavaScript in SVG image so it is not interactive - you have to publish it somewhere, where is correct ContentType to make it interactive.