bpftrace

This documents how I installed, set up, and use the bpftrace tracing language for BPF to trace NVMe calls to storage devices. I put all source code that I build manually (i.e., anything I clone from git) into a src directory in my home directory. Hence, if you see src in the commands, replace it with wherever you put your code. I also install everything globally, since I'm alone in the VM.

Note: at some point things got so messed up that the VM no longer worked and I had to reinstall it. If that happens to you, the steps here will likely not help; skip directly to the tracing section.

Installation

For the installation we can largely follow the install guide here; however, I ran into some issues.

First, we need to enable some configuration options in the kernel build (I'm using a custom kernel build; see the final config here). The required options are:

CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_FUNCTION_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_HAVE_KPROBES=y
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
CONFIG_DEBUG_FS=y
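
To verify that a running kernel actually has these options enabled, we can grep the installed config (a quick sanity check; which of the two files exists depends on the distribution and on whether CONFIG_IKCONFIG_PROC is set):

# Check the config of the currently running kernel
grep -E 'CONFIG_BPF(_SYSCALL|_JIT|_EVENTS)?=|CONFIG_KPROBES=' /boot/config-$(uname -r)

# Or, if the kernel exposes its config via procfs
zcat /proc/config.gz | grep CONFIG_BPF_SYSCALL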

From there we build and boot the kernel, and then set up the BPF libraries. Install the requirements:

sudo apt-get install -y libbpfcc-dev
sudo apt-get update
sudo apt-get install -y \
  bison \
  cmake \
  flex \
  g++ \
  git \
  libelf-dev \
  zlib1g-dev \
  libfl-dev \
  systemtap-sdt-dev \
  binutils-dev \
  libcereal-dev \
  llvm-12-dev \
  llvm-12-runtime \
  libclang-12-dev \
  clang-12 \
  libpcap-dev \
  libgtest-dev \
  libgmock-dev \
  asciidoctor

git clone https://github.com/iovisor/bpftrace
cd bpftrace
mkdir build && cd build

NOTE: You can skip from here; the second time around I installed all packages globally and built bpftrace from source without issues.

Skip this, only here for reference

However, I ran into the following error when building bpftrace

user@stosys:~/src/bpftrace/build$ cmake .. -DKERNEL_INCLUDE_DIRS=/tmp/libbpf/include/uapi/
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found LibBcc: /usr/lib/x86_64-linux-gnu/libbcc.so
-- Performing Test LIBBCC_ATTACH_KPROBE_SIX_ARGS_SIGNATURE
-- Performing Test LIBBCC_ATTACH_KPROBE_SIX_ARGS_SIGNATURE - Success
-- Performing Test LIBBCC_ATTACH_UPROBE_SEVEN_ARGS_SIGNATURE
-- Performing Test LIBBCC_ATTACH_UPROBE_SEVEN_ARGS_SIGNATURE - Failed
-- Looking for bcc_usdt_foreach
-- Looking for bcc_usdt_foreach - not found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Please install the libbpf development package (missing: LIBBPF_LIBRARIES
  LIBBPF_INCLUDE_DIRS)
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  cmake/FindLibBpf.cmake:31 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  CMakeLists.txt:103 (find_package)


-- Configuring incomplete, errors occurred!
See also "/home/user/src/bpftrace/build/CMakeFiles/CMakeOutput.log".
See also "/home/user/src/bpftrace/build/CMakeFiles/CMakeError.log".

Therefore, we need to install libbpf manually. Since I'm using Ubuntu server, which doesn't package libbpf (see this issue), we install it from source as is done in the docker build script.

git clone https://github.com/libbpf/libbpf.git ~/src/libbpf
cd ~/src/libbpf

# Checking out the latest stable release
git checkout v1.0.0
cd src
make -j

# Install everything (globally)
PREFIX=/usr/local/ LIBDIR=/usr/local/lib sudo make install install_uapi_headers
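
Before rebuilding bpftrace, an optional sanity check that the library and headers landed where we asked (the paths follow the PREFIX/LIBDIR used above):

# Headers live under the install prefix; run sudo ldconfig first if the
# library does not show up in the linker cache yet
ls /usr/local/include/bpf/
ldconfig -p | grep libbpf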

Now we can go back to bpftrace and build it

cd ~/src/bpftrace/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
sudo make install

Now we can test if everything is installed and working correctly

# Syscall counter
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

But we get some more errors now

user@stosys:~/src/zoned_bpftrace$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' 
bpftrace: error while loading shared libraries: libbpf.so.1: cannot open shared object file: No such file or directory

But we know it's installed and can manually preload it

user@stosys:~/src/zoned_bpftrace$ whereis libbpf.so
libbpf: /usr/lib64/libbpf.so /usr/lib64/libbpf.a

Then we can just run (and add sudo to run as root)

sudo LD_PRELOAD=/usr/lib64/libbpf.so bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

which works successfully and shows output such as the following (hit Ctrl-C to stop tracing and dump the data)

Attaching 1 probe...
^C

@[sudo]: 7
@[tmux: server]: 7
@[gmain]: 10
@[multipathd]: 16
@[irqbalance]: 34
@[bpftrace]: 48
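
As an aside, instead of preloading libbpf on every invocation, the library directory can be registered with the dynamic linker once (a sketch, assuming /usr/lib64 is not already in the loader's search path, which is why the library was not found above):

# Add /usr/lib64 to the dynamic linker search path and rebuild its cache
echo "/usr/lib64" | sudo tee /etc/ld.so.conf.d/libbpf.conf
sudo ldconfig

# After this, plain `sudo bpftrace ...` should find libbpf.so.1 without LD_PRELOAD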

Now we can test the bpftrace scripts for ZNS devices and make sure they work. See the scripts here.

user@stosys:~/src/zoned_bpftrace$ sudo LD_PRELOAD=/usr/lib64/libbpf.so bpftrace active_zone_tracker.bt
definitions.h:2:10: fatal error: 'linux/nvme.h' file not found

We are missing header files; since we have a custom kernel, these are not installed. I went back to my qemu startup script and added port forwarding so that I can ssh into the qemu guest with ssh -p 8888 user@localhost and copy the needed header files from the kernel build into the VM.
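
For reference, the port forwarding itself is just qemu's user-mode networking with a hostfwd rule (a sketch; merge it into whatever -netdev/-device options your startup script already uses):

# Forward host port 8888 to the guest's SSH port 22
-netdev user,id=net0,hostfwd=tcp::8888-:22 \
-device virtio-net-pci,netdev=net0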

Firstly, on the qemu system we make a directory to keep track of what header files we copy.

mkdir ~/src/include

Copying a single header file to the VM

Then we can copy the files from the host to the qemu guest. The header files are in the include/ directory of the kernel build.

# Copy nvme header file
scp -P 8888 include/linux/nvme.h user@localhost:/home/user/src/include/

Back on the qemu system we can then symlink the header file into the /usr/include directory. We use symlinks in case we change the header files; then we only need to scp them to the VM again (it's unlikely we'll change these header files, but just to be sure).

sudo ln -sf ~/src/include/nvme.h /usr/include/linux/nvme.h

Copying all header files to the VM

I realized quickly that adding a single header file at a time gets painful very quickly, since they require each other recursively. Therefore, it is easier to simply copy all header files from the kernel build to the VM and link compilations against these. First we copy all header files from the host to the VM:

scp -rP 8888 include/* user@localhost:/home/user/src/include/

Now rebuild bpftrace. Linking against a custom include path failed for me, so I went back to symlinking the needed header files that we just copied from the host to the VM. It is a bit painful to do this one by one and recompile every time.

Files that were needed:

linux/nvme.h
linux/blkdev.h
linux/blk_types.h
linux/bvec.h
linux/highmem.h
linux/bug.h
asm/bug.h
and many more

Copying the entire Kernel build

I got tired of adding them one by one, so I went the simplest route: creating a new device that holds the entire kernel build, mounting it in the VM, and installing everything from there. I'm likely not going to touch any of the header files for now, so this should work fine. Simply mount the device for the kernel build and install everything:

sudo mkdir /mnt/kernel

# Format and mount whatever device you have
sudo mkfs.ext4 /dev/nvme2n1
sudo mount /dev/nvme2n1 /mnt/kernel

sudo chown -R user /mnt/kernel

Then copy the kernel from the host to the VM (I exclude my qemu setup stuff as it's all the backing images for the nvme devices)

rsync -e 'ssh -p 8888' --progress -a --exclude={'qemu/','.git/'} f2fs/ user@localhost:/mnt/kernel/

Then we can install all the header files

sudo make headers_install

Now we run into more errors because of an older libc version

user@stosys:/mnt/kernel$ sudo make headers_install
  HOSTCC  scripts/unifdef
scripts/basic/fixdep: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by scripts/basic/fixdep)
make[1]: *** [scripts/Makefile.host:95: scripts/unifdef] Error 1
make[1]: *** Deleting file 'scripts/unifdef'
make: *** [Makefile:1296: scripts_unifdef] Error 2

We are on version 2.31 (run ldd --version to check), and no updates are available, so we need to get the newer version manually. We'll install it in the ~/src directory to avoid messing up the current libc installation if things go south.

cd ~/src
mkdir glibc && cd glibc
wget http://ftp.gnu.org/gnu/libc/glibc-2.33.tar.gz
tar -xvzf glibc-2.33.tar.gz
mkdir build
mkdir glibc-2.33-install
cd build
~/src/glibc/glibc-2.33/configure --prefix=$HOME/src/glibc/glibc-2.33-install
make -j3
make install

After building and installing we can check if it is set up correctly

~/src/glibc/glibc-2.33-install/bin/ldd --version

Note: at this point every command in the VM segfaults; I must have messed something up somewhere (although I thought I was not installing things all over the place). I'm going to reformat the Ubuntu base image and start over (updating the Ubuntu version as well, to avoid this issue altogether).

Back to the issue

After the prior work failed, we now have an updated VM but are still missing the header files. Now we will simply add a flag to include the directory (again copied from the kernel build to the VM) in the CMake invocation. I reinstalled everything, building only bpftrace from source (see the earlier commands for that); missing header files I installed globally by compiling the kernel on the VM and installing its headers. Note that headers are backwards compatible, so newer header files on an older kernel (whenever I'm not running the custom kernel build) are no problem. To run bpftrace we just need to point it at the header directories with the -I /path/to/headers option; we can specify as many as we want.
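
For example, a run with the include directories passed explicitly could look like the following (the exact paths are illustrative; point them at wherever your headers ended up):

sudo bpftrace \
  -I /usr/src/linux-headers-5.19.0-051900/include \
  -I /usr/src/linux-headers-5.19.0-051900/arch/x86/include \
  active_zone_tracker.bt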

The only remaining issue is that the kernel build ships asm-generic instead of asm, so some header files are still not found. We solve this by simply creating a symlink for asm pointing to asm-generic. NOTE: this is not recommended, as they are different things; however, I run this in a VM and have multiple header files for the 5.19 kernel, so messing things up isn't too bad at this point. Plus, it's only needed for very few things, so it's unlikely to break something I need.

sudo ln -sf /usr/src/linux-headers-5.19.0-051900/include/asm-generic/ /usr/src/linux-headers-5.19.0-051900/include/asm

Tracing

As I still ran into issues when qemu boots with a custom kernel (running bpftrace constantly fails; even bpftrace --info segfaults), I installed the fully compiled kernel in the VM and just booted that kernel. Now everything works, and we can build our tracing pipeline for ZNS. It seems odd that there is something wrong with the setup of a custom kernel in qemu, even when everything is installed and manually linked, but I guess every time we want to trace we need to compile and install the kernel.

From this point on are my notes on bpftrace and coding with it. See my full implementation for ZNS tracing here.

Sector to Zone Starting LBA conversion

We need to do a bitwise AND of the sector number with the zone MASK in order to get the starting LBA of the zone (ZLBAS). Append commands have the ZLBAS as the sector number, so the bitwise AND does not change the value. However, write (nvme write) commands carry the LBA to write at (the host issues the LBA, as opposed to appends, where the ZNS device handles the LBA), and there the bitwise AND with the MASK changes the value to the ZLBAS. Hence, the code looks like:

$nvme_cmd = (struct nvme_command *)*(arg1+sizeof(struct request));

$secnum = $nvme_cmd->rw.slba;

// Bitwise And to get zone starting LBA with zone MASK
$zlbas = ($secnum & 0xffc0000);
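
As a quick sanity check of the mask: with zones of 0x40000 LBAs (as used in the debugging section below), a write issued at LBA 0x40008 gives 0x40008 & 0xffc0000 = 0x40000, i.e. the starting LBA of the second zone, while an append already carries 0x40000 as its slba and is left unchanged by the AND.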

Commands

We utilize two nvme tracepoints; the format of each can be checked as shown below.

nvme_setup_cmd

root@stosys:/home/user/src# cat /sys/kernel/debug/tracing/events/nvme/nvme_setup_cmd/format
name: nvme_setup_cmd
ID: 1556
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char disk[32];    offset:8;       size:32;        signed:1;
        field:int ctrl_id;      offset:40;      size:4; signed:1;
        field:int qid;  offset:44;      size:4; signed:1;
        field:u8 opcode;        offset:48;      size:1; signed:0;
        field:u8 flags; offset:49;      size:1; signed:0;
        field:u8 fctype;        offset:50;      size:1; signed:0;
        field:u16 cid;  offset:52;      size:2; signed:0;
        field:u32 nsid; offset:56;      size:4; signed:0;
        field:bool metadata;    offset:60;      size:1; signed:0;
        field:u8 cdw10[24];     offset:61;      size:24;        signed:0;

print fmt: "nvme%d: %sqid=%d, cmdid=%u, nsid=%u, flags=0x%x, meta=0x%x, cmd=(%s %s)", REC->ctrl_id, nvme_trace_disk_name(p, REC->disk), REC->qid, REC->cid, REC->nsid, REC->flags, REC->metadata, ((REC->opcode) == nvme_fabrics_command ? __print_symbolic(REC->fctype, { nvme_fabrics_type_property_set, "nvme_fabrics_type_property_set" }, { nvme_fabrics_type_connect, "nvme_fabrics_type_connect" }, { nvme_fabrics_type_property_get, "nvme_fabrics_type_property_get" }) : ((REC->qid) ? __print_symbolic(REC->opcode, { nvme_cmd_flush, "nvme_cmd_flush" }, { nvme_cmd_write, "nvme_cmd_write" }, { nvme_cmd_read, "nvme_cmd_read" }, { nvme_cmd_write_uncor, "nvme_cmd_write_uncor" }, { nvme_cmd_compare, "nvme_cmd_compare" }, { nvme_cmd_write_zeroes, "nvme_cmd_write_zeroes" }, { nvme_cmd_dsm, "nvme_cmd_dsm" }, { nvme_cmd_resv_register, "nvme_cmd_resv_register" }, { nvme_cmd_resv_report, "nvme_cmd_resv_report" }, { nvme_cmd_resv_acquire, "nvme_cmd_resv_acquire" }, { nvme_cmd_resv_release, "nvme_cmd_resv_release" }, { nvme_cmd_zone_mgmt_send, "nvme_cmd_zone_mgmt_send" }, { nvme_cmd_zone_mgmt_recv, "nvme_cmd_zone_mgmt_recv" }, { nvme_cmd_zone_append, "nvme_cmd_zone_append" }) : __print_symbolic(REC->opcode, { nvme_admin_delete_sq, "nvme_admin_delete_sq" }, { nvme_admin_create_sq, "nvme_admin_create_sq" }, { nvme_admin_get_log_page, "nvme_admin_get_log_page" }, { nvme_admin_delete_cq, "nvme_admin_delete_cq" }, { nvme_admin_create_cq, "nvme_admin_create_cq" }, { nvme_admin_identify, "nvme_admin_identify" }, { nvme_admin_abort_cmd, "nvme_admin_abort_cmd" }, { nvme_admin_set_features, "nvme_admin_set_features" }, { nvme_admin_get_features, "nvme_admin_get_features" }, { nvme_admin_async_event, "nvme_admin_async_event" }, { nvme_admin_ns_mgmt, "nvme_admin_ns_mgmt" }, { nvme_admin_activate_fw, "nvme_admin_activate_fw" }, { nvme_admin_download_fw, "nvme_admin_download_fw" }, { nvme_admin_ns_attach, "nvme_admin_ns_attach" }, { nvme_admin_keep_alive, "nvme_admin_keep_alive" }, { nvme_admin_directive_send, "nvme_admin_directive_send" }, { nvme_admin_directive_recv, "nvme_admin_directive_recv" }, { nvme_admin_dbbuf, "nvme_admin_dbbuf" }, { nvme_admin_format_nvm, "nvme_admin_format_nvm" }, { nvme_admin_security_send, "nvme_admin_security_send" }, { nvme_admin_security_recv, "nvme_admin_security_recv" }, { nvme_admin_sanitize_nvm, "nvme_admin_sanitize_nvm" }, { nvme_admin_get_lba_status, "nvme_admin_get_lba_status" }))), ((REC->opcode) == nvme_fabrics_command ? nvme_trace_parse_fabrics_cmd(p, REC->fctype, REC->cdw10) : ((REC->qid) ? nvme_trace_parse_nvm_cmd(p, REC->opcode, REC->cdw10) : nvme_trace_parse_admin_cmd(p, REC->opcode, REC->cdw10)))
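
As a quick sanity check of this tracepoint, the fields above can be read directly in a one-liner (a minimal sketch, not part of the ZNS scripts):

sudo bpftrace -e 'tracepoint:nvme:nvme_setup_cmd
{
    // disk is a char[32]; opcode, qid and nsid come straight from the format above
    printf("%s qid=%d opcode=0x%x nsid=%u\n",
        str(args->disk), args->qid, args->opcode, args->nsid);
}'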

nvme_complete_rq

root@stosys:/home/user/src# cat /sys/kernel/debug/tracing/events/nvme/nvme_complete_rq/format
name: nvme_complete_rq
ID: 1555
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char disk[32];    offset:8;       size:32;        signed:1;
        field:int ctrl_id;      offset:40;      size:4; signed:1;
        field:int qid;  offset:44;      size:4; signed:1;
        field:int cid;  offset:48;      size:4; signed:1;
        field:u64 result;       offset:56;      size:8; signed:0;
        field:u8 retries;       offset:64;      size:1; signed:0;
        field:u8 flags; offset:65;      size:1; signed:0;
        field:u16 status;       offset:66;      size:2; signed:0;

print fmt: "nvme%d: %sqid=%d, cmdid=%u, res=%#llx, retries=%u, flags=0x%x, status=%#x", REC->ctrl_id, nvme_trace_disk_name(p, REC->disk), REC->qid, REC->cid, REC->result, REC->retries, REC->flags, REC->status

Struct Information

As we have to do a lot of casting to command structs (e.g., to struct request for BIO requests), we can check the fields of the structs easily with bpftrace.

sudo bpftrace -lv "struct request"
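
Similarly, -l with a wildcard lists the available probes, which is handy for finding the nvme tracepoints used throughout this document:

sudo bpftrace -l 'tracepoint:nvme:*'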

nvme_opcode

We need to identify which type of command is being run from the opcode that the command carries.

enum nvme_opcode {
    nvme_cmd_flush          = 0x00,
    nvme_cmd_write          = 0x01,
    nvme_cmd_read           = 0x02,
    nvme_cmd_write_uncor    = 0x04,
    nvme_cmd_compare        = 0x05,
    nvme_cmd_write_zeroes   = 0x08,
    nvme_cmd_dsm            = 0x09,
    nvme_cmd_verify         = 0x0c,
    nvme_cmd_resv_register  = 0x0d,
    nvme_cmd_resv_report    = 0x0e,
    nvme_cmd_resv_acquire   = 0x11,
    nvme_cmd_resv_release   = 0x15,
    nvme_cmd_zone_mgmt_send = 0x79,
    nvme_cmd_zone_mgmt_recv = 0x7a,
    nvme_cmd_zone_append    = 0x7d,
};
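
For instance, to count only regular writes and zone appends per disk, the opcode can be matched in a predicate (a sketch using the values above):

sudo bpftrace -e 'tracepoint:nvme:nvme_setup_cmd
/args->opcode == 0x01 || args->opcode == 0x7d/
{
    // 0x01 = nvme_cmd_write, 0x7d = nvme_cmd_zone_append
    @cmds[str(args->disk), args->opcode] = count();
}'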

req_opf

We use the request operation flag to identify zone reset commands. These flags can similarly be used to track zone management commands (e.g., opening/closing zones); however, we do not utilize those here.

enum req_opf {
    /* read sectors from the device */
    REQ_OP_READ     = 0,
    /* write sectors to the device */
    REQ_OP_WRITE        = 1,
    /* flush the volatile write cache */
    REQ_OP_FLUSH        = 2,
    /* discard sectors */
    REQ_OP_DISCARD      = 3,
    /* securely erase sectors */
    REQ_OP_SECURE_ERASE = 5,
    /* write the zero filled sector many times */
    REQ_OP_WRITE_ZEROES = 9,
    /* Open a zone */
    REQ_OP_ZONE_OPEN    = 10,
    /* Close a zone */
    REQ_OP_ZONE_CLOSE   = 11,
    /* Transition a zone to full */
    REQ_OP_ZONE_FINISH  = 12,
    /* write data at the current zone write pointer */
    REQ_OP_ZONE_APPEND  = 13,
    /* reset a zone write pointer */
    REQ_OP_ZONE_RESET   = 15,
    /* reset all the zone present on the device */
    REQ_OP_ZONE_RESET_ALL   = 17,

    /* Driver private requests */
    REQ_OP_DRV_IN       = 34,
    REQ_OP_DRV_OUT      = 35,

    REQ_OP_LAST,
};
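
As an illustration of using these values, a zone reset can be spotted at the block layer by masking the low bits of cmd_flags (a sketch; it assumes kernel BTF resolves struct request, and the probe point blk_mq_start_request is my choice for illustration, not necessarily what the linked scripts use):

sudo bpftrace -e 'kprobe:blk_mq_start_request
{
    $rq = (struct request *)arg0;
    $op = $rq->cmd_flags & 0xff;   // REQ_OP_MASK: the low bits hold the req_opf value
    if ($op == 15)                 // REQ_OP_ZONE_RESET
    {
        printf("zone reset at sector %llu from %s\n", $rq->__sector, comm);
    }
}'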

Troubleshooting

In case things do not work, here is our full bpftrace setup information; make sure your configuration matches.

user@stosys:~/src$ sudo bpftrace --info
System
  OS: Linux 5.19.0-051900-generic #202207312230 SMP PREEMPT_DYNAMIC Sun Jul 31 22:34:11 UTC 2022
  Arch: x86_64

Build
  version: v0.15.0-95-ga5760
  LLVM: 12.0.1
  ORC: v2
  foreach_sym: yes
  unsafe uprobe: no
  bfd: yes
  bcc_usdt_addsem: yes
  bcc bpf_attach_uprobe refcount: yes
  bcc library path resolution: yes
  libbpf btf dump: yes
  libbpf btf dump type decl: yes
  libbpf bpf_prog_load: no
  libbpf bpf_map_create: no
  libdw (DWARF support): no

Kernel helpers
  probe_read: yes
  probe_read_str: yes
  probe_read_user: yes
  probe_read_user_str: yes
  probe_read_kernel: yes
  probe_read_kernel_str: yes
  get_current_cgroup_id: yes
  send_signal: yes
  override_return: yes
  get_boot_ns: yes
  dpath: yes
  skboutput: no

Kernel features
  Instruction limit: 1000000
  Loop support: yes
  btf: yes
  map batch: yes
  uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes

Map types
  hash: yes
  percpu hash: yes
  array: yes
  percpu array: yes
  stack_trace: yes
  perf_event_array: yes

Probe types
  kprobe: yes
  tracepoint: yes
  perf_event: yes
  kfunc: yes
  iter:task: yes
  iter:task_file: yes
  kprobe_multi: no
  raw_tp_special: yes

Debugging

To debug the script when modifying it, enable its logging so that events are traced with all of their outputs. Below are some sample commands that are useful for issuing individual writes/reads/resets. In all of them, the -s flag gives the starting LBA (in units of 512B sectors) and -z the size in bytes.

# Write the first zone at LBA 0
user@stosys:~/src$ echo "hello" | sudo nvme write /dev/nvme9n1 -s 0 -z 4096
write: Success

# Append to the first zone
user@stosys:~/src/zoned_bpftrace$ echo "hello world" | sudo nvme zns zone-append /dev/nvme9n1 -z 4096
Success appended data to LBA 8

# Read the first 8 LBAs of zone 0 
user@stosys:~/src$ sudo nvme read /dev/nvme9n1 -z 4096
hello
read: Success

# Write to the second zone which has lbas at 0x40000
user@stosys:~/src$ echo "hello" | sudo nvme write /dev/nvme9n1 -s 262144 -z 4096
write: Success

# Read the second zone
user@stosys:~/src/zoned_bpftrace$ sudo nvme read /dev/nvme9n1 -s 262144 -z 4096
hello
read: Success

# Reset the second zone
user@stosys:~/src$ sudo nvme zns reset-zone /dev/nvme9n1 -s 262144
zns-reset-zone: Success, action:4 zone:40000 all:0 nsid:1
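
While issuing these commands, it helps to have a tracer running in a second terminal; even a plain one-liner on the setup tracepoint (see the formats above) shows every command arriving at the driver. A minimal sketch (adjust or drop the disk-name filter for your device):

sudo bpftrace -e 'tracepoint:nvme:nvme_setup_cmd
/str(args->disk) == "nvme9n1"/
{
    printf("%s issued opcode=0x%x on %s\n", comm, args->opcode, str(args->disk));
}'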