BPF - hpaluch/hpaluch.github.io GitHub Wiki

BPF

BPF stands for Berkeley Packet Filter - originally framework to provide flexible packet filtering for tcpdump and similar tools - it used its own bytecode, however Linux eBPF is now able to convert bytecode to real CPU instructions (so expressions are evaluated at native CPU speed).

Today it become universal framework to trace kernel.

Please see https://www.brendangregg.com/ebpf.html for official quick overview.

BCC versus BPF Trace:

  • explained on https://github.com/iovisor/bcc/pull/4173/files
  • Quote: bpftrace is at github.com/iovisor/bpftrace, and is a high-level front-end for BPF tracing, which uses libraries from bcc. bpftrace is ideal for ad hoc instrumentation with powerful custom one-liners and short scripts, whereas bcc is ideal for complex tools and daemons.
  • BCC makes BPF programs easier to write, with kernel instrumentation in C (and includes a C wrapper around LLVM), and front-ends in Python and lua. It is suited for many tasks, including performance analysis and network traffic control.
  • bpftrace programs generally looks lot of like awk and have extension .bt
  • BCC tools are mostly written in Python, but the whole structure looks to me quite clumsy - there are long multi-line strings of BPF code and later lot of python glue...

Unfortunately there is some tensions where new examples for bptrace are no longer accepted and users are pointed to bcc as can be seen on:

Installing BPF tools

Example is for Proxmox VE 8.2.4, kernel 6.8.12-1-pve, which is more-or-less Debian 12:

Disclaimer: BPF uses deep internals of Linux kernel (there is no official stable API). SO it is quite normal that existing script will brake with current kernel...

There are actually two sets of examples that can be installed with:

apt-get install bpfcc-tools
apt-get install bpftrace

First command will install nice example scripts under /usr/sbin/*-bpfcc Later command install /usr/sbin/*.bt

Let's try first example globally snooping which new processes were exec-uted (must be run as root):

$ /usr/sbin/execsnoop.bt

Attaching 3 probes...
TIME(ms)   PID   ARGS
4093       9736  nft -j -f -
4093       9736  nft -j -f -
4093       9736  nft -j -f -
4101       9737  nft -j -f -
4101       9737  nft -j -f -
4101       9737  nft -j -f -
6766       9740  iptables-save
6782       9741  iptables-restore -T filter -n
6797       9742  ip6tables-save

When you look inside you can see that source code is very short:

#!/usr/bin/env bpftrace
/*
 * execsnoop.bt   Trace new processes via exec() syscalls.
 *                For Linux, uses bpftrace and eBPF.
 * ...
 */
 BEGIN
{
        printf("%-10s %-5s %s\n", "TIME(ms)", "PID", "ARGS");
}

tracepoint:syscalls:sys_enter_exec*
{
        printf("%-10u %-5d ", elapsed / 1e6, pid);
        join(args->argv);
}

You can list used tracepoints with:

$ bpftrace -l 'tracepoint:syscalls:sys_enter_exec*'

tracepoint:syscalls:sys_enter_execve
tracepoint:syscalls:sys_enter_execveat

NOTE: Still I'm not sure why there is message Attaching 3 probes...

When it is broken

Here is example from same Proxmox VE environment:

$ /usr/sbin/swapin.bt

/usr/sbin/swapin.bt:1-16: WARNING: swap_readpage is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
Attaching 2 probes...
cannot attach kprobe, probe entry may not exist
ERROR: Error attaching probe: 'kprobe:swap_readpage'

WARNING!

So proper patch is this:

diff -u broken/swapin.bt swapin.bt
--- broken/swapin.bt    2024-08-10 09:58:50.396672650 +0200
+++ swapin.bt   2024-08-12 16:35:47.276154899 +0200
@@ -13,7 +13,7 @@
  * 26-Jan-2019  Brendan Gregg   Created this.
  */
 
-kprobe:swap_readpage
+kprobe:swap_read_folio
 {
         @[comm, pid] = count();
 }

First verify with vmstat -y 1 that there are is non-zero column si (swap in). Then run patched version of swapin.bt:

$ ./swapin.bt

@[pve-firewall, 1617]: 2
16:52:49
@[pvestatd, 1625]: 24
16:52:50
@[pvedaemon worke, 1647]: 1
@[pveproxy worker, 1658]: 1
@[server, 1515]: 1

When it is really broken

Let's try more advanced example:

$ /usr/sbin/biosnoop.bt

/usr/sbin/biosnoop.bt:21-23: WARNING: blk_account_io_start is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:23-24: WARNING: __blk_account_io_start is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:30-32: WARNING: blk_account_io_done is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
/usr/sbin/biosnoop.bt:32-33: WARNING: __blk_account_io_done is not traceable (either non-existing, inlined, or marked as "notrace"); attaching to it will likely fail
Attaching 6 probes...
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:__blk_account_io_done, skipping.
cannot attach kprobe, Invalid argument
WARNING: could not attach probe kprobe:blk_account_io_done, skipping.
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:__blk_account_io_start, skipping.
cannot attach kprobe, Invalid argument
WARNING: could not attach probe kprobe:blk_account_io_start, skipping.
TIME(ms)     DISK    COMM             PID    LAT(ms)
 Ctrl-C

In such case there is no real replacement candidate and we have to resort to standard option, e.g., using Google:

It seems that Proxmox kernel has these new tracepoints (-v will reveal tracepoint arguments):


$ uname -r

6.8.12-1-pve # <<< kernel version

# replaces kprobe:blk_account_io_start, kprobe:__blk_account_io_start
$ bpftrace -l 'tracepoint:*block_io_start*' -v

tracepoint:block:block_io_start
    dev_t dev
    sector_t sector
    unsigned int nr_sector
    unsigned int bytes
    char rwbs[8]
    char comm[16]
    __data_loc char[] cmd

# replaces probe:blk_account_io_done, kprobe:__blk_account_io_done

$ bpftrace -l 'tracepoint:*block_io_done' -v

tracepoint:block:block_io_done
    dev_t dev
    sector_t sector
    unsigned int nr_sector
    unsigned int bytes
    char rwbs[8]
    char comm[16]
    __data_loc char[] cmd

There is open pull request that should fix it:

Custom script - I/O sector alignment

Goal: write script that will check I/O sector alignment.

We will tracepoint:block:block_io_start for that.

First inspect data of tracepoint:block:block_io_start:

$ bpftrace -lv tracepoint:block:block_io_start
tracepoint:block:block_io_start
    dev_t dev
    sector_t sector
    unsigned int nr_sector
    unsigned int bytes
    char rwbs[8]
    char comm[16]
    __data_loc char[] cmd

Also:

$ cat /sys/kernel/debug/tracing/events/block/block_io_start/format
name: block_io_start
ID: 1257
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:dev_t dev;	offset:8;	size:4;	signed:0;
	field:sector_t sector;	offset:16;	size:8;	signed:0;
	field:unsigned int nr_sector;	offset:24;	size:4;	signed:0;
	field:unsigned int bytes;	offset:28;	size:4;	signed:0;
	field:char rwbs[8];	offset:32;	size:8;	signed:0;
	field:char comm[16];	offset:40;	size:16;	signed:0;
	field:__data_loc char[] cmd;	offset:56;	size:4;	signed:0;

print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd), (unsigned long long)REC->sector, REC->nr_sector, REC->comm

Did not find a way how to quickly use that fmt to avoid rewriting it into bpftrace...

Quick oneline test:

$ bpftrace -e 'tracepoint:block:block_io_start { printf("sector=%x nr=%u bytes=%u comm=%s\n", args->sector,args->nr_sector, args->bytes, args->comm); }'

sector=0 nr=0 bytes=8 comm=kworker/u4:22
sector=0 nr=0 bytes=8 comm=kworker/u4:22
sector=212e3288 nr=8 bytes=4096 comm=jbd2/dm-1-8
sector=210e2aa0 nr=0 bytes=0 comm=jbd2/dm-1-8
sector=212e32a0 nr=8 bytes=4096 comm=kworker/0:7
sector=200800 nr=256 bytes=131072 comm=vgs
sector=200a00 nr=256 bytes=131072 comm=vgs
sector=200800 nr=256 bytes=131072 comm=vgs
sector=200800 nr=256 bytes=131072 comm=lvs
sector=200a00 nr=256 bytes=131072 comm=lvs

Hmm, but why there is sometimes bytes=0?

Another problem:

  • unable to get meaningful major,minor from dev:
tracepoint:block:block_io_start
{
   // from include/linux/kdev_t.h
   $major = ( (args->dev) & 0xfff00) >> 8;
   $minor = ( (args->dev) & 0xff) | (( (args->dev) >> 12) & 0xfff00);
   //$major = (args->dev) >> 8;
   //$minor = (args->dev) & 0xff;
   printf("sz=%d dev=0x%llx dev=%u,%u ", sizeof(args->dev), args->dev, $major, $minor);
   printf("sector=%x nr=%u bytes=%u comm=%s rwbs=%s\n",
         args->sector,args->nr_sector, args->bytes, args->comm,args->rwbs);
}

Getting:

sz=4 dev=0x800000 dev=0,2048 sector=318c910 nr=32 bytes=16384 comm=kworker/u4:0 rwbs=W
sz=4 dev=0xb00000 dev=0,2816 sector=0 nr=0 bytes=8 comm=kworker/u4:0 rwbs=N
sz=4 dev=0xb00000 dev=0,2816 sector=0 nr=0 bytes=8 comm=kworker/u4:0 rwbs=N

Being so desperate also tried (under Fedora install dnf install perf instead of linux-perf):

$ apt-get install linux-perf
$ perf list | fgrep block_rq
$ perf trace -e 'block:block_io_start'

     0.000 kworker/u4:0-d/18423 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:0", cmd: "")
  1216.080 jbd2/dm-1-8/635 block:block_io_start(dev: 8388608, sector: 556762200, nr_sector: 8, bytes: 4096, rwbs: "WSM", comm: "jbd2/dm-1-8", cmd: "")
  1216.626 jbd2/dm-1-8/635 block:block_io_start(dev: 8388608, sector: 554663064, rwbs: "FWFSM", comm: "jbd2/dm-1-8", cmd: "")
  1227.444 kworker/0:7-ev/1050 block:block_io_start(dev: 8388608, sector: 556762264, nr_sector: 8, bytes: 4096, rwbs: "WFSM", comm: "kworker/0:7", cmd: "")
  2047.993 kworker/u4:0-d/18423 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:0", cmd: "")
  4095.984 kworker/u4:22-/230 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:22", cmd: "")
  4800.087 kworker/u4:22-/230 block:block_io_start(dev: 8388608, sector: 78316928, nr_sector: 16, bytes: 8192, rwbs: "W", comm: "kworker/u4:22", cmd: "")
  6143.982 kworker/u4:22-/230 block:block_io_start(dev: 11534336, bytes: 8, rwbs: "N", comm: "kworker/u4:22", cmd: "")
  6152.624 vgs/18976 block:block_io_start(dev: 8388608, sector: 2099200, nr_sector: 256, bytes: 131072, rwbs: "R", comm: "vgs", cmd: "")
  6153.935 vgs/18976 block:block_io_start(dev: 8388608, sector: 2099712, nr_sector: 256, bytes: 131072, rwbs: "R", comm:

$ perf trace -e 'block:block_rq_complete'

     0.000 :0/0 block:block_rq_complete(dev: 11534336, sector: -1, rwbs: "N", cmd: "")
  2049.028 pve-firewall/1613 block:block_rq_complete(dev: 11534336, sector: -1, rwbs: "N", cmd: "")
  2278.587 :0/0 block:block_rq_complete(dev: 8388608, sector: 2099200, nr_sector: 256, rwbs: "R", cmd: "")
  2356.410 :0/0 block:block_rq_complete(dev: 8388608, sector: 2099200, nr_sector: 256, rwbs: "R", cmd: "")
  3676.119 :0/0 block:block_rq_complete(dev: 8388608, sector: 556755032, nr_sector: 264, rwbs: "WSM", cmd: "")
  3693.873 :0/0 block:block_rq_complete(dev: 8388608, sector: -1, rwbs: "FF", cmd: "")
  3694.072 :0/0 block:block_rq_complete(dev: 8388608, sector: 554656096, rwbs: "WSM", cmd: "")
  3694.316 kworker/u4:22-/230 block:block_rq_complete(dev: 8388608, sector: 556755296, nr_sector: 8, rwbs: "WSM", cmd: "")

Find kernel calls from process

Here is example from Fedora - installing Windows server 2022 in LibVirt + KVM:

cd /usr/share/bcc/tools
./syscount -p PID_OF_QEMU

...

racing syscalls, printing top 10... Ctrl+C to quit.
^C[17:16:35]
SYSCALL                   COUNT
clock_gettime           1021651
ioctl                    753043
futex                    256742
gettimeofday             192409
ppoll                     74836
write                      7517
read                       3994
poll                       2385
recvmsg                     960
writev                      671

Or:

cd /usr/share/bcc/tools
./ext4slower -p PID_OF_QEMU

...
IME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
17:20:22 worker         5315   W 686080  9947504    12.87 win2k22.qcow2
17:20:22 worker         5315   W 753152  9948176    14.61 win2k22.qcow2
17:20:22 worker         5315   W 88576   9949524    11.28 win2k22.qcow2
17:20:22 worker         5315   R 131072  5088416    19.31 en-us_windows_server_2022_updated_july_2
17:20:24 worker         5315   S 0       0         488.64 win2k22.qcow2
17:20:25 worker         5315   S 0       0         877.78 win2k22.qcow2
17:20:26 worker         5315   R 131072  5100576    13.92 en-us_windows_server_2022_updated_july_2
17:20:26 worker         5315   R 131072  5106976    11.32 en-us_windows_server_2022_updated_july_2
17:20:28 worker         5315   R 4096    3260976    13.75 win2k22.qcow2
...

Now I have to understand what that means...

Flame graphs: quick start

Here is my Quick Start how to make Flame graphs (using Gregg's) guides.

Tested Host environment:

  • Proxmox VE 8.2.4
  • old dual-core Opteron X2 (Generation 2), MSI-7250 MB
  • 8 GB RAM
  • AHCI, 4TB disk Seagate IronWolf

NOTE: Host CPU does not support HAP (Hardware assisted paging) so any access to new page is costly (causes trap and host must setup so called "Shadow pages").

Guest VM booted:

  • Windows Server 2022, BIOS, 1 CPU, 4GB RAM, 35GB disk Virtio BLK on ext4/raw, disabled Network (to stop Windows update affecting tests)

Install:

apt-get install bpftrace bpfcc-tools git-core
mkdir -p ~/projects
cd ~/projects
git clone https://github.com/brendangregg/FlameGraph.git
mkdir -p ~/bpf

CPU Flame graph

How to get CPU Flame graph

I start VM (with id=108 in example) in Proxmox and run while VM is booting:

cd ~/bpf

id=108
ulimit -n 65536
profile-bpfcc -F 99 -adf --stack-storage-size $(( 2 * 1024 * 1024 )) 120 > boot-$id.profile-folded
~/projects/FlameGraph/flamegraph.pl boot-$id.profile-folded > /usr/share/pve-manager/images/cpu-$id.svg
echo "Image URL: https://`hostname -i`:8006/pve2/images/cpu-$id.svg"

And point your browser to https://PROXMO_IP:8006/pve2/images/cpu-108.svg.

Here is example graph:

files/cpu-108.svg

Click for SVG graph

NOTE: note how kvm_mmu_page_fault dominates this chart.

NOTE: GitHub blocks JavaScript in SVG image so it is not interactive - you have to publish it somewhere, where is correct ContentType to make it interactive.

Off-CPU Flame graph

Quick Off-CPU chart, following:

cd ~/bpf

id=108

ulimit -n 65536
/usr/sbin/offcputime-bpfcc -df 120 > off-cpu-$id.stacks
~/projects/FlameGraph/flamegraph.pl --color=io \
       	--title="Off-CPU Time Flame Graph" \
       	--countname=us < off-cpu-$id.stacks > /usr/share/pve-manager/images/off-cpu-$id.svg
echo "Image URL: https://`hostname -i`:8006/pve2/images/off-cpu-$id.svg"

Here is example graph:

files/off-cpu-108.svg

Click for SVG graph

NOTE: GitHub blocks JavaScript in SVG image so it is not interactive - you have to publish it somewhere, where is correct ContentType to make it interactive.

Resources