bpftrace - nicktehrany/notes GitHub Wiki
This documents the guide how I installed, setup, and use the bpftrace
tracing language for BPF
to trace NVMe calls to storage devices. I put all source code to manually build (i.e., anything I clone from git) to a src
directory into my home. Hence, if you see src
in the commands, replace this with wherever you put your code. I also install everything globally, since I'm alone in the VM.
Note things got so messed up that the VM didn't work anymore, therefore I had to install it again. Likely the steps here will not help you then, then skip directly to the tracing section.
Installation
For installing we can largely follow the install guide here, however I ran into some issues.
Firstly we need to enable configurations in the kernel build (I'm using a custom Kernel build, see the final config here). The configurations are
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_FUNCTION_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_HAVE_KPROBES=y
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
CONFIG_DEBUG_FS=y
From there we build and boot the kernel, and then setup the bpf libraries. Install the requirements:
sudo apt-get install -y libbpfcc-dev
sudo apt-get update
sudo apt-get install -y \
bison \
cmake \
flex \
g++ \
git \
libelf-dev \
zlib1g-dev \
libfl-dev \
systemtap-sdt-dev \
binutils-dev \
libcereal-dev \
llvm-12-dev \
llvm-12-runtime \
libclang-12-dev \
clang-12 \
libpcap-dev \
libgtest-dev \
libgmock-dev \
asciidoctor
git clone https://github.com/iovisor/bpftrace
cd bpftrace
mkdir build && cd build
NOTE: Can skip from here, second time around I installed all packages globally and built bpftrace
from sources without issues.
Skip this, only here for reference
However, I ran into the following error when building bpftrace
user@stosys:~/src/bpftrace/build$ cmake .. -DKERNEL_INCLUDE_DIRS=/tmp/libbpf/include/uapi/
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found LibBcc: /usr/lib/x86_64-linux-gnu/libbcc.so
-- Performing Test LIBBCC_ATTACH_KPROBE_SIX_ARGS_SIGNATURE
-- Performing Test LIBBCC_ATTACH_KPROBE_SIX_ARGS_SIGNATURE - Success
-- Performing Test LIBBCC_ATTACH_UPROBE_SEVEN_ARGS_SIGNATURE
-- Performing Test LIBBCC_ATTACH_UPROBE_SEVEN_ARGS_SIGNATURE - Failed
-- Looking for bcc_usdt_foreach
-- Looking for bcc_usdt_foreach - not found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
Please install the libbpf development package (missing: LIBBPF_LIBRARIES
LIBBPF_INCLUDE_DIRS)
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
cmake/FindLibBpf.cmake:31 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
CMakeLists.txt:103 (find_package)
-- Configuring incomplete, errors occurred!
See also "/home/user/src/bpftrace/build/CMakeFiles/CMakeOutput.log".
See also "/home/user/src/bpftrace/build/CMakeFiles/CMakeError.log".
Therefore, we need to install it manually. As I'm using ubuntu server, it doesn't package libbpf
(see this issue), we'll install it manually as is done in the docker build script.
git clone https://github.com/libbpf/libbpf.git ~/src/libbpf
# Checking out the latest stable release
git checkout v1.0.0
cd src
make -j
# Install everything (globally)
PREFIX=/usr/local/ LIBDIR=/usr/local/lib sudo make install install_uapi_headers
Now we can go back to bpftrace
and build it
cd ~/src/bpftrace/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
sudo make install
Now we can test if everything is installed and working correctly
# Syscall counter
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
But we get some more errors now
user@stosys:~/src/zoned_bpftrace$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
bpftrace: error while loading shared libraries: libbpf.so.1: cannot open shared object file: No such file or directory
But we know it's installed and can manually preload it
user@stosys:~/src/zoned_bpftrace$ whereis libbpf.so
libbpf: /usr/lib64/libbpf.so /usr/lib64/libbpf.a
Then we can just run (and add sudo
to run as root)
sudo LD_PRELOAD=/usr/lib64/libbpf.so bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
which works successfully and shows an output such as (must hit Ctrl-C to stop tracing and dump data)
Attaching 1 probe...
^C
@[sudo]: 7
@[tmux: server]: 7
@[gmain]: 10
@[multipathd]: 16
@[irqbalance]: 34
@[bpftrace]: 48
Now we can test bpftrace
scripts for ZNS devices, making sure these work. See the scripts here.
user@stosys:~/src/zoned_bpftrace$ sudo LD_PRELOAD=/usr/lib64/libbpf.so bpftrace active_zone_tracker.bt
definitions.h:2:10: fatal error: 'linux/nvme.h' file not found
and we are missing header files, since we have a custom kernel, these are not installed. I went back to my qemu startup script
and added port forwarding so that I can ssh into the qemu host with ssh -p 8888 user@localhost
and copy the needed header files from the kernel build to the system.
Firstly, on the qemu system we make a directory to keep track of what header files we copy.
mkdir ~/src/include
Copying a single header file to the VM
Then we can copy the files from host to the qemu. The header files are in the include/
directory of the Kernel build.
# Copy nvme header file
scp -P 8888 include/linux/nvme.h [email protected]:/home/user/src/include/
Back on the qemu system we can then symlink the header file to the /usr/include
dir. We use symlinks in case we change the header files, then we only need to scp
to the VM (unlikely we change these header files, but to be sure).
sudo ln -sf ~/src/include/nvme.h /usr/include/linux/nvme.h
Copying all header files to the VM
I realized quickly that just adding a single header file at a time gets painful very quickly when they require each other recursively. Therefore, it is easier to simply add all header files from the kernel build to the host and link compilations against these. First we have to copy all header files from the host to the VM
scp -rP 8888 include/* [email protected]:/home/user/src/include/
Now rebuild bpftrace
, linking against a custom include path failed for me, therefore I went back to symlink the needed headers files that we just copied from the host to the VM. A bit painful to do this one by one and recompile every time.
Files that were needed:
linux/nvme.h
linux/blkdev.h
linux/blk_types.h
linux/bvec.h'
linux/highmem.h
linux/bug.h
asm/bug.h
and many more
Copying the entire Kernel build
I got tired of adding them one by one so I went the simplest route, which is creating a new device where I put the entire kernel build, mount it in the VM and install everything. I'm likely not going to touch any of the header files for now so this should work fine. Then simply mount the device with the kernel build and install everything
sudo mkdir /mnt/kernel
# Mount with whatever device you have
sudo mkfs.ext4 /dev/nvme2n1
sudo mount /dev/nvme2n1 /mnt/kernel
sudo chown -R user /mnt/kernel
Then copy the kernel from the host to the VM (I exclude my qemu setup stuff as it's all the backing images for the nvme devices)
rsync -e 'ssh -p 8888' --progress -a --exclude={'qemu/','.git/'} f2fs/ [email protected]:/mnt/kernel/
Then we can install all the header files
sudo make headers_install
Now we run into more errors of an older libc version
user@stosys:/mnt/kernel$ sudo make headers_install
HOSTCC scripts/unifdef
scripts/basic/fixdep: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by scripts/basic/fixdep)
make[1]: *** [scripts/Makefile.host:95: scripts/unifdef] Error 1
make[1]: *** Deleting file 'scripts/unifdef'
make: *** [Makefile:1296: scripts_unifdef] Error 2
We are on version 2.31 (run ldd --version
to see the version), and no updates are available, therefore we need to manually get the newer version. We'll install it in the ~/src
directory to avoid messing up the current libc install if things go south.
cd ~/src
mkdir glibc && cd glibc
wget http://ftp.gnu.org/gnu/libc/glibc-2.33.tar.gz
tar -xvzf glibc-2.33.tar.gz
mkdir build
mkdir glibc-2.33-install
cd build
~/src/glibc/glibc-2.33/configure --prefix=$HOME/src/glibc/glibc-2.33-install
make -j3
make install
After building and installing we can check if it is setup correctly
~/src/glibc/glibc-2.33-install/bin/ldd --version
note now any command in the VM is seg faulting, might have messed up somewhere something (although I thought I was not installing things all over the place). Gonna reformat the base image for ubuntu and go again (updating ubuntu version as well, avoid this issue altogether).
Back to the issue
After the prior work failed, now we have an updated VM but are still missing the header files. Now we will simply add the flag to include the directory (again copied from the kernel build to the VM) in the CMake. I reinstalled everything, only building bpftrace
from sources (see the earlier commands for that) and missing header files I installed globally by compiling the kernel on the VM and installing the headers. Note, headers are backwards compatible, so newer header files for an older kernel (whenever I'm not running the custom kernel build) are no problem. In order to run bpftrace
we just need to include the header files in the command with the -I /path/to/header/file
option. We can specify as many as we want
The only issue is the kernel building asm-generic
instead of asm
, hence it fails to still find some header files. We solve this by simply creating a symlink for asm
pointing to asm-generic
. NOTE This is not recommended as they are different things, however I run this in a VM and have multiple header files for the 5.19 Kernel, so messing things up isn't too bad at this point. Plus it's only needed for very few things, unlikely it'll break something I need.
sudo ln -sf /usr/src/linux-headers-5.19.0-051900/include/asm-generic/ /usr/src/linux-headers-5.19.0-051900/include/asm
Tracing
As I still ran into issues when my qemu is booting with a custom Kernel running bpftrace
constantly fails (even bpftrace --info
seg faults), I installed the full compiled Kernel in the VM and just booted the entire entire kernel. Now everything works, and we can now build our tracing pipeline for ZNS. Seems odd that there is something wrong with the setup of custom kernel in qemu, even if everything is installed and manually linked, but I guess every time we want to trace we need to compile and install the Kernel.
From this point are starting my notes on bpftrace and coding with it. See my full implementation for ZNS tracing here.
Sector to Zone Starting LBA conversion
We need to do a bitwise AND on the sector number with the zone MASK in order to the get the starting LBA of the zone (ZLBAS). Append command have the ZLBAS as the sector number, therefore the bitwise AND will not change the value. However, write (nvme write) commands require the LBA to write at (the host issues the LBA as opposed to ZNS handling LBA on append), where the bitwise AND with the MASK changes the value to the ZLBAS. Hence, the code looks like:
$nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));
$secnum = $nvme_cmd->rw.slba;
// Bitwise And to get zone starting LBA with zone MASK
$zlbas = ($secnum & 0xffc0000);
Commands
We utilize two nvme commands, of which each structure can be checked.
nvme_cmd_setup
root@stosys:/home/user/src# cat /sys/kernel/debug/tracing/events/nvme/nvme_setup_cmd/format
name: nvme_setup_cmd
ID: 1556
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char disk[32]; offset:8; size:32; signed:1;
field:int ctrl_id; offset:40; size:4; signed:1;
field:int qid; offset:44; size:4; signed:1;
field:u8 opcode; offset:48; size:1; signed:0;
field:u8 flags; offset:49; size:1; signed:0;
field:u8 fctype; offset:50; size:1; signed:0;
field:u16 cid; offset:52; size:2; signed:0;
field:u32 nsid; offset:56; size:4; signed:0;
field:bool metadata; offset:60; size:1; signed:0;
field:u8 cdw10[24]; offset:61; size:24; signed:0;
print fmt: "nvme%d: %sqid=%d, cmdid=%u, nsid=%u, flags=0x%x, meta=0x%x, cmd=(%s %s)", REC->ctrl_id, nvme_trace_disk_name(p, REC->disk), REC->qid, REC->cid, REC->nsid, REC->flags, REC->metadata, ((REC->opcode) == nvme_fabrics_command ? __print_symbolic(REC->fctype, { nvme_fabrics_type_property_set, "nvme_fabrics_type_property_set" }, { nvme_fabrics_type_connect, "nvme_fabrics_type_connect" }, { nvme_fabrics_type_property_get, "nvme_fabrics_type_property_get" }) : ((REC->qid) ? __print_symbolic(REC->opcode, { nvme_cmd_flush, "nvme_cmd_flush" }, { nvme_cmd_write, "nvme_cmd_write" }, { nvme_cmd_read, "nvme_cmd_read" }, { nvme_cmd_write_uncor, "nvme_cmd_write_uncor" }, { nvme_cmd_compare, "nvme_cmd_compare" }, { nvme_cmd_write_zeroes, "nvme_cmd_write_zeroes" }, { nvme_cmd_dsm, "nvme_cmd_dsm" }, { nvme_cmd_resv_register, "nvme_cmd_resv_register" }, { nvme_cmd_resv_report, "nvme_cmd_resv_report" }, { nvme_cmd_resv_acquire, "nvme_cmd_resv_acquire" }, { nvme_cmd_resv_release, "nvme_cmd_resv_release" }, { nvme_cmd_zone_mgmt_send, "nvme_cmd_zone_mgmt_send" }, { nvme_cmd_zone_mgmt_recv, "nvme_cmd_zone_mgmt_recv" }, { nvme_cmd_zone_append, "nvme_cmd_zone_append" }) : __print_symbolic(REC->opcode, { nvme_admin_delete_sq, "nvme_admin_delete_sq" }, { nvme_admin_create_sq, "nvme_admin_create_sq" }, { nvme_admin_get_log_page, "nvme_admin_get_log_page" }, { nvme_admin_delete_cq, "nvme_admin_delete_cq" }, { nvme_admin_create_cq, "nvme_admin_create_cq" }, { nvme_admin_identify, "nvme_admin_identify" }, { nvme_admin_abort_cmd, "nvme_admin_abort_cmd" }, { nvme_admin_set_features, "nvme_admin_set_features" }, { nvme_admin_get_features, "nvme_admin_get_features" }, { nvme_admin_async_event, "nvme_admin_async_event" }, { nvme_admin_ns_mgmt, "nvme_admin_ns_mgmt" }, { nvme_admin_activate_fw, "nvme_admin_activate_fw" }, { nvme_admin_download_fw, "nvme_admin_download_fw" }, { nvme_admin_ns_attach, "nvme_admin_ns_attach" }, { nvme_admin_keep_alive, "nvme_admin_keep_alive" }, { nvme_admin_directive_send, "nvme_admin_directive_send" }, { nvme_admin_directive_recv, "nvme_admin_directive_recv" }, { nvme_admin_dbbuf, "nvme_admin_dbbuf" }, { nvme_admin_format_nvm, "nvme_admin_format_nvm" }, { nvme_admin_security_send, "nvme_admin_security_send" }, { nvme_admin_security_recv, "nvme_admin_security_recv" }, { nvme_admin_sanitize_nvm, "nvme_admin_sanitize_nvm" }, { nvme_admin_get_lba_status, "nvme_admin_get_lba_status" }))), ((REC->opcode) == nvme_fabrics_command ? nvme_trace_parse_fabrics_cmd(p, REC->fctype, REC->cdw10) : ((REC->qid) ? nvme_trace_parse_nvm_cmd(p, REC->opcode, REC->cdw10) : nvme_trace_parse_admin_cmd(p, REC->opcode, REC->cdw10)))
nvme_complete_rq
root@stosys:/home/user/src# cat /sys/kernel/debug/tracing/events/nvme/nvme_complete_rq/format
name: nvme_complete_rq
ID: 1555
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char disk[32]; offset:8; size:32; signed:1;
field:int ctrl_id; offset:40; size:4; signed:1;
field:int qid; offset:44; size:4; signed:1;
field:int cid; offset:48; size:4; signed:1;
field:u64 result; offset:56; size:8; signed:0;
field:u8 retries; offset:64; size:1; signed:0;
field:u8 flags; offset:65; size:1; signed:0;
field:u16 status; offset:66; size:2; signed:0;
print fmt: "nvme%d: %sqid=%d, cmdid=%u, res=%#llx, retries=%u, flags=0x%x, status=%#x", REC->ctrl_id, nvme_trace_disk_name(p, REC->disk), REC->qid, REC->cid, REC->result, REC->retries, REC->flags, REC->status
Struct Information
As we have to do a lot of casting to command structs (e.g., to struct request
for BIO requests), we can check the fields of the structs easily with bpftrace
.
sudo bpftrace -lv "struct request"
nvme_opcode
We need to identify which type of command is being run by the opcode that the command has.
enum nvme_opcode {
nvme_cmd_flush = 0x00,
nvme_cmd_write = 0x01,
nvme_cmd_read = 0x02,
nvme_cmd_write_uncor = 0x04,
nvme_cmd_compare = 0x05,
nvme_cmd_write_zeroes = 0x08,
nvme_cmd_dsm = 0x09,
nvme_cmd_verify = 0x0c,
nvme_cmd_resv_register = 0x0d,
nvme_cmd_resv_report = 0x0e,
nvme_cmd_resv_acquire = 0x11,
nvme_cmd_resv_release = 0x15,
nvme_cmd_zone_mgmt_send = 0x79,
nvme_cmd_zone_mgmt_recv = 0x7a,
nvme_cmd_zone_append = 0x7d,
}
req_opf
We use the request operation flag to identify zone reset commands. These flags can similarly be used to track zone management commands (e.g., opening/closing zones), however we do not utilize these here.
enum req_opf {
/* read sectors from the device */
REQ_OP_READ = 0,
/* write sectors to the device */
REQ_OP_WRITE = 1,
/* flush the volatile write cache */
REQ_OP_FLUSH = 2,
/* discard sectors */
REQ_OP_DISCARD = 3,
/* securely erase sectors */
REQ_OP_SECURE_ERASE = 5,
/* write the zero filled sector many times */
REQ_OP_WRITE_ZEROES = 9,
/* Open a zone */
REQ_OP_ZONE_OPEN = 10,
/* Close a zone */
REQ_OP_ZONE_CLOSE = 11,
/* Transition a zone to full */
REQ_OP_ZONE_FINISH = 12,
/* write data at the current zone write pointer */
REQ_OP_ZONE_APPEND = 13,
/* reset a zone write pointer */
REQ_OP_ZONE_RESET = 15,
/* reset all the zone present on the device */
REQ_OP_ZONE_RESET_ALL = 17,
/* Driver private requests */
REQ_OP_DRV_IN = 34,
REQ_OP_DRV_OUT = 35,
REQ_OP_LAST,
};
Troubleshooting
In case things do not work, we provide our full bpftrace setup information. Make sure yours is equal in configuration.
user@stosys:~/src$ sudo bpftrace --info
System
OS: Linux 5.19.0-051900-generic #202207312230 SMP PREEMPT_DYNAMIC Sun Jul 31 22:34:11 UTC 2022
Arch: x86_64
Build
version: v0.15.0-95-ga5760
LLVM: 12.0.1
ORC: v2
foreach_sym: yes
unsafe uprobe: no
bfd: yes
bcc_usdt_addsem: yes
bcc bpf_attach_uprobe refcount: yes
bcc library path resolution: yes
libbpf btf dump: yes
libbpf btf dump type decl: yes
libbpf bpf_prog_load: no
libbpf bpf_map_create: no
libdw (DWARF support): no
Kernel helpers
probe_read: yes
probe_read_str: yes
probe_read_user: yes
probe_read_user_str: yes
probe_read_kernel: yes
probe_read_kernel_str: yes
get_current_cgroup_id: yes
send_signal: yes
override_return: yes
get_boot_ns: yes
dpath: yes
skboutput: no
Kernel features
Instruction limit: 1000000
Loop support: yes
btf: yes
map batch: yes
uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes
Map types
hash: yes
percpu hash: yes
array: yes
percpu array: yes
stack_trace: yes
perf_event_array: yes
Probe types
kprobe: yes
tracepoint: yes
perf_event: yes
kfunc: yes
iter:task: yes
iter:task_file: yes
kprobe_multi: no
raw_tp_special: yes
Debugging
In order to debug the script if modifying it, enable the logging in order to trace events with all their outputs, and we provide some sample commands that are beneficial to issue individual writes/reads/resets. The -s
flag in all provides the starting LBA, and is given in 512B, -z
indicates the size in bytes.
# Write the first zone at LBA 0
user@stosys:~/src$ echo "hello" | sudo nvme write /dev/nvme9n1 -s 0 -z 4096
write: Success
# Append to the first zone
user@stosys:~/src/zoned_bpftrace$ echo "hello world" | sudo nvme zns zone-append /dev/nvme9n1 -z 4096
Success appended data to LBA 8
# Read the first 8 LBAs of zone 0
user@stosys:~/src$ sudo nvme read /dev/nvme9n1 -z 4096
hello
read: Success
# Write to the second zone which has lbas at 0x40000
user@stosys:~/src$ echo "hello" | sudo nvme write /dev/nvme9n1 -s 262144 -z 4096
write: Success
# Read the second zone
write: Success
user@stosys:~/src/zoned_bpftrace$ sudo nvme read /dev/nvme9n1 -s 262144 -z 4096
hello
read: Success
# Reset the second zone
user@stosys:~/src$ sudo nvme zns reset-zone /dev/nvme9n1 -s 262144
zns-reset-zone: Success, action:4 zone:40000 all:0 nsid:1