ZNS

Resources

  1. Zoned Storage Documentation. https://zonedstorage.io/
  2. ZNS: Avoiding the Block Interface Tax for Flash-based SSDs. https://www.usenix.org/conference/atc21/presentation/bjorling

Overview

  • Address space is divided into zones, and zones must be written sequentially; a per-zone write pointer keeps track of the position of the next write. Data in a zone cannot be overwritten in place, the zone must first be erased using a zone reset.
  • SSDs implement zoned namespaces to reduce write amplification, reduce their onboard DRAM needs and improve QoS. The NVMe Zoned Namespace (ZNS) protocol adds this to the NVMe interface standard. Shingled Magnetic Recording (SMR) HDDs can also implement zoned storage and use the SCSI Zoned Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces.
  • Zoned storage device support was added to Linux in kernel 4.10 (disk drivers, file systems, device mapper drivers), based on the Zoned Block Device (ZBD) abstraction.
  • Ways of using zoned storage from applications (see the sketch after this list):
    1. If there is no need for the OS or file system to manage the zoned storage, the application can issue zone management commands directly through a passthrough interface (e.g. with the libzbc library).
    2. It can rely on the kernel ZBD support to handle the device, which then provides regular POSIX system calls (the zoned access constraints, sequential writes and erase before rewrite, must still be handled by the application).
    3. It can use a more advanced, ZBD-compliant file system that hides all the constraints from the application (manages them itself), or a device mapper driver that exposes the device as a regular block device.
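
For option 1, libzbc ships small command line tools that wrap the passthrough interface; a rough sketch (assuming a host-managed SMR drive at /dev/sdb, an example device not used elsewhere in these notes):

# Query device information and the zone layout through the libzbc passthrough tools
sudo zbc_info /dev/sdb
sudo zbc_report_zones /dev/sdb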

Recap, Storage interfaces

  • SCSI (Small Computer System Interface) is a set of standards for data transfer between a computer and its peripherals. It was designed for older devices such as HDDs, so it is not optimal for flash-based devices since it assumes HDD-like access patterns; it is therefore a legacy protocol. It is broader than ATA, as it can handle many storage device types (CD, DVD, tape, etc.).
  • ATA (Advanced Technology Attachment) is the standard for connecting storage devices to the computer, and was also designed for HDDs.
  • SATA (Serial ATA) is the bus interface for connecting host bus adapters to storage devices: SATA is the connection interface and ATA is the protocol.
  • NVMe (Non-Volatile Memory Express) is the specification for accessing non-volatile memory (typically attached over PCIe). It eliminates the SCSI bottleneck of having only one command queue, as it supports up to 64K queues that can each hold 64K commands simultaneously.

SMR

  • SMR is used in HDDs to increase areal density. With SMR there are no gaps between the tracks on the platter (as opposed to conventional magnetic recording, CMR, which needs gaps to account for track misregistration). Tracks are instead written in an overlapping manner (think of roof shingles): data is written sequentially and then partially overlapped by the next track, so more data fits on each magnetic surface. The recording head is only partially moved after writing, as the next track partially overlaps the previous one; reading uses the part of the track that is still "visible" (the part not overlapped). This is what saves space, since the entire track does not need to remain exposed to be read. Groups of overlapping tracks are called bands.
  • Data therefore needs to be written sequentially, for overwriting the entire band needs to be rewritten. Random reads still perform as with CMR, making it good for large data storage and sequential workloads.
  • There are different command interface models:
    1. Host managed allows only sequential write workloads and gives control to the host. There is no backwards compatibility with legacy storage stacks; the host must handle the write constraints. The device is split into many zones, mostly sequential zones plus a few conventional zones (which can be written randomly and are typically used for metadata). Recovery from out-of-order writes must be handled by the host (a write pointer on the device keeps track of the next write location and an error is returned for invalid writes).
    2. Drive managed handles the sequential write constraints internally, so the application can issue both sequential and random writes.
    3. Host aware offers backwards compatibility with regular HDDs while giving the host the same control as the host managed model.
  • With the SCSI and ATA standards there are ZBC and ZAC, respectively, with the following zone management commands (a blkzone sketch follows this list):
    • REPORT ZONES for discovery of the zone organization; returns a list of zone descriptors with starting LBA (logical block address), size, type and condition, as well as the current position of the write pointer (in the host managed/aware models).
    • RESET ZONE WRITE POINTER resets the write pointer of a zone; all data in the zone is discarded and the zone can be written from the beginning again.
    • OPEN ZONE indicates to the drive that resources should be kept available for writing to this zone until it is closed or fully written.
    • CLOSE ZONE closes an open zone, releasing those resources.
    • FINISH ZONE moves the write pointer of a zone to the end to prevent any further writing to the zone.
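
On Linux these zone management operations are also exposed through the blkzone utility from util-linux; a quick sketch (device name and offsets are examples, and the open/close/finish subcommands need a recent util-linux):

sudo blkzone report /dev/sdb                              # REPORT ZONES
sudo blkzone open   --offset 524288 --count 1 /dev/sdb    # OPEN ZONE (offset in 512 B sectors)
sudo blkzone close  --offset 524288 --count 1 /dev/sdb    # CLOSE ZONE
sudo blkzone finish --offset 524288 --count 1 /dev/sdb    # FINISH ZONE
sudo blkzone reset  --offset 524288 --count 1 /dev/sdb    # RESET ZONE WRITE POINTER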

ZNS SSDs

  • The SSD is grouped into zones, which can be read in any order but must be written sequentially. With the zone abstraction the host can align its writes into sequential writes, improving performance and optimizing data placement on the storage medium. Media reliability is still managed by the device, as with conventional devices.
  • ZNS does not support the random-write zones that ZBC and ZAC do; since NVMe supports multiple namespaces, a device can instead expose another namespace that provides that I/O access.
  • It also includes a ZONE CAPACITY attribute that indicates the number of usable blocks within a zone.
  • It may define a limit on the number of active zones (zones that are open, currently being written, or partially written). This prevents an application from taking up all the zones; once it reaches its limit it must reset or finish some of its active zones.
  • Since NVMe can reorder write operations (that the host issued sequentially) across its queues, which would turn them into invalid random writes, there is a ZONE APPEND operation: the host specifies the first logical block of a zone as the write position and the device controller writes the data within that zone at the current write pointer. This allows submitting simultaneous append operations that the device can process in any order; the write position of each request is returned by the append operation, so the host always knows where the data landed. (These per-namespace attributes can be inspected as sketched below.)
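
These values can be inspected with nvme-cli; a short sketch assuming a ZNS namespace at /dev/nvme0n1:

# Zone size and the maximum active/open resource limits (MAR/MOR)
sudo nvme zns id-ns /dev/nvme0n1 -H
# Controller-level ZNS info, e.g. the Zone Append Size Limit (ZASL)
sudo nvme zns id-ctrl /dev/nvme0n1
# Per-zone capacity, write pointer and state
sudo nvme zns report-zones /dev/nvme0n1 -d 4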

Paper notes [2]

  • Current SSDs keep the old block interface, which comes with the penalty of over-provisioning, DRAM usage for page mappings, garbage collection, and complex host software trying to minimize garbage collection.
  • Zones have a writable zone capacity attribute which divides the zone into a writable and a non-writable part. This keeps the zone size aligned to a power of 2.
  • A full FTL is no longer needed, as it was mainly responsible for supporting random writes and their penalties. Write amplification goes away, which also eliminates the need for over-provisioning.
  • With block-interface SSDs the block size is chosen so that data can be striped across several flash dies (typically 16-128 dies), giving a writable zone of hundreds of MBs. Therefore low I/O queue depths should be used for achieving the smallest zone size.
  • Devices can typically keep 8-32 zones active, bounded by device resources for handling failures (i.e. power capacitors to persist data in case of power failure). This can be increased if the number of capacitors is increased or data movement to DRAM is optimized with a write-back cache.
  • The Linux kernel Zoned Block Device (ZBD) subsystem provides a single zoned storage API abstracting over all zoned storage device types (exposed to both the kernel and user space); see the sysfs sketch below.
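
The ZBD subsystem also exposes the zone model through sysfs; a quick sketch for inspecting it (attribute names as in recent kernels, device name is an example):

cat /sys/block/nvme0n1/queue/zoned             # none, host-aware or host-managed
cat /sys/block/nvme0n1/queue/nr_zones          # number of zones
cat /sys/block/nvme0n1/queue/chunk_sectors     # zone size in 512 B sectors
cat /sys/block/nvme0n1/queue/max_open_zones    # 0 means no limit reported
cat /sys/block/nvme0n1/queue/max_active_zones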

Kernel Support

  • Different POSIX I/O paths:
    1. File Access Interface: the file system is modified to handle the sequential write constraints (random writes on files are transformed by the fs); this applies to ZBD-compliant file systems. Legacy file systems go through a device mapper driver which, as stated earlier, exposes the storage device as a regular block device.
    2. Raw Block Access Interface: raw block device file access (again, with legacy applications using a device mapper driver).
  • Additional interfaces for applications that already comply with sequential write constraints:
    1. File Access Interface, implemented in zonefs (sequential writes still responsibility of application)
    2. Zoned Raw Block Access Interface, no driver to handle constraints, application will open device file for the device and can retrieve info and do management.
    3. Passthrough Device Access Interface, provided by the SCSI and NVMe drivers to send commands directly to the device with only minimal kernel interference. The application must handle the device constraints.
  • Current kernel support (5.9 at the time of writing) requires all zones to be of the same size (even though that is not a requirement of the device) and the zone size must be a power of 2 logical blocks.
  • The kernel page cache does not guarantee that dirty pages are flushed to the device in sequential order, so use direct I/O (write with the O_DIRECT flag); see the sketch below.
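
A minimal sketch of a sequential, page-cache-bypassing write to the start of a zoned device (assuming the first zone is empty; nullb0 is just an example target):

# Write 1 MiB sequentially from the start of the first zone, bypassing the page cache
sudo dd if=/dev/zero of=/dev/nullb0 bs=4096 count=256 oflag=direct
# Writes must land at a zone's write pointer; check it afterwards
sudo blkzone report --count 1 /dev/nullb0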

Setup Commands

Commands for setting up emulation of a ZNS SSD with QEMU follow below.

wget https://download.qemu.org/qemu-6.1.0.tar.xz
tar xvJf qemu-6.1.0.tar.xz
cd qemu-6.1.0
mkdir build
cd build
../configure --target-list=x86_64-softmmu --enable-kvm --enable-linux-aio --enable-trace-backends=log --disable-werror --disable-gtk
make -j

QEMU start script (create the backing files for the devices with the corresponding sizes; a sketch for creating them follows the script):

#!/bin/bash
set -e

QEMU_HOME=/home/nty/src/qemu-6.1.0/
U20_IMG=/home/nty/xfs/nty/ubuntu-20.04-zns-for-nick.qcow
ZNS=/home/nty/xfs/nty/znsssd-1G.img
#ZNS=/home/nty/xfs/nty/znsssd-8M.img
NVME=/home/nty/xfs/nty/nvmessd-1G.img

sudo $QEMU_HOME/build/qemu-system-x86_64 -name qemuzns -m 4G --enable-kvm -cpu host -smp 2 \
        -hda $U20_IMG \
        -net user,hostfwd=tcp::8888-:22,hostfwd=tcp::3333-:3000 -net nic \
        -drive file=$ZNS,id=zns-device,format=raw,if=none \
        -drive file=$NVME,id=nvme-device,format=raw,if=none \
        -device nvme,serial=zns-dev,id=nvme2,zoned.zasl=5\
        -device nvme,drive=nvme-device,serial=nvme-dev,physical_block_size=4096,logical_block_size=4096\
        -device nvme-ns,drive=zns-device,bus=nvme2,nsid=1,logical_block_size=4096,physical_block_size=4096,zoned=true,zoned.zone_size=8M,zoned.zone_capacity=7M,zoned.max_open=0,zoned.max_active=0,uuid=5e40ec5f-eeb6-4317-bc5e-c919796a5f79
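
The backing files referenced above can be created as sparse raw images; a sketch assuming the same paths and sizes as in the script:

# Sparse raw backing files for the ZNS and conventional NVMe devices
truncate -s 1G /home/nty/xfs/nty/znsssd-1G.img
truncate -s 1G /home/nty/xfs/nty/nvmessd-1G.img
# or equivalently
qemu-img create -f raw /home/nty/xfs/nty/znsssd-1G.img 1G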

Start a VNC server, create an ssh tunnel, and connect with remmina on localhost (or just ssh to localhost):

# update name and port
ssh node -L port:localhost:port

# Pay attention to username! Use what the vm has set up as login credential
ssh -p 8888 atr@localhost

Now running some simple commands with nvme cli

atr@stosys-qemu-vm:~$ echo "hello world" | sudo nvme zns zone-append /dev/nvme0n1 -z 4096
Success appended data to LBA 0
atr@stosys-qemu-vm:~$ sudo nvme zns report-zones /dev/nvme0n1 -d 8
nr_zones: 8
SLBA: 0x0        WP: 0x1        Cap: 0x100      State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x100      WP: 0x100      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x200      WP: 0x200      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x300      WP: 0x300      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x400      WP: 0x400      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x500      WP: 0x500      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x600      WP: 0x600      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x700      WP: 0x700      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
atr@stosys-qemu-vm:~$ sudo nvme read /dev/nvme0n1 -z 4096
hello world
read: Success
atr@stosys-qemu-vm:~$ sudo nvme zns close-zone /dev/nvme0n1
zns-close-zone: Success, action:1 zone:0 all:0 nsid:1
atr@stosys-qemu-vm:~$ sudo nvme zns report-zones /dev/nvme0n1 -d 1
nr_zones: 8
SLBA: 0x0        WP: 0x1        Cap: 0x100      State: CLOSED       Type: SEQWRITE_REQ   Attrs: 0x0
atr@stosys-qemu-vm:~$ sudo nvme zns reset-zone /dev/nvme0n1 -a
zns-reset-zone: Success, action:4 zone:0 all:1 nsid:1
atr@stosys-qemu-vm:~$ sudo nvme zns report-zones /dev/nvme0n1 -d 1
nr_zones: 8
SLBA: 0x0        WP: 0x0        Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0

Performance profiling with perf, comparing ZNS and SSD with f2fs

Using f2fs since it has support for zoned devices. This is a log of all the commands for setting everything up and running the profiling.

Installing f2fs, creating fs on the device, and mounting it.

git clone https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
cd f2fs-tools
./configure
make
# Add mkfs/ to $PATH (see Section "Setting up on node3")

# Changing the I/O scheduler to work with zoned devices
echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler

# make the f2fs file system (make sure the zoned and regular SSDs are big enough and the zoned
# device has > 0 segments per section: with no segments mkfs fails with a floating point exception,
# and with a too small regular nvme SSD there are fsync errors because pages are lost, shown in dmesg)
sudo mkfs.f2fs -f -m -c /dev/nvme0n1 /dev/nvme1n1
sudo mount -t f2fs /dev/nvme1n1 /mnt/f2fs/

# For f2fs on regular ssd (to do comparison later)
sudo mkfs.f2fs /dev/nvme1n1
sudo mount /dev/nvme1n1 /mnt/f2fs/

Setting up perf for performance profiling:

# Couldn't find packages or deb files for perf-5.12 (and linux-tools-5.12.0+) so building from sources
wget https://github.com/torvalds/linux/archive/refs/tags/v5.12.tar.gz
tar xvf v5.12.tar.gz
cd linux-5.12/tools/
# one missing package
sudo apt install flex
make -C perf
sudo make perf_install
# needed to re-source the shell config (or log out and back in) for perf to be picked up

4 Configuration Benchmark on /dev/nullb*

First set up the null block device, once as a regular block device and then as a zoned block device, using the script from the Zoned Storage Documentation [1]. We need to enable/disable zoned support in the script (set the echo to 0 or 1), then run it with the command below (a configfs sketch of what the zoned script does follows it):

# Use nullblk.sh for regular block and nullblk_zoned.sh for zoned block device
sudo ./nullblk.sh 4096 64 16 48
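
For reference, the zoned variant of the script roughly boils down to the following configfs steps (a sketch, assuming the arguments mean sector size, zone size in MB, number of conventional zones and number of sequential zones):

sudo modprobe null_blk nr_devices=0
sudo mkdir /sys/kernel/config/nullb/nullb0
echo 4096 | sudo tee /sys/kernel/config/nullb/nullb0/blocksize
echo 1 | sudo tee /sys/kernel/config/nullb/nullb0/zoned
echo 64 | sudo tee /sys/kernel/config/nullb/nullb0/zone_size          # MB
echo 16 | sudo tee /sys/kernel/config/nullb/nullb0/zone_nr_conv
echo $(((16 + 48) * 64)) | sudo tee /sys/kernel/config/nullb/nullb0/size   # total size in MB
echo 1 | sudo tee /sys/kernel/config/nullb/nullb0/memory_backed
echo 1 | sudo tee /sys/kernel/config/nullb/nullb0/power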

For installing libs in $HOME/local, check the instructions for running on node3 below this section. The run commands stay the same (so use the section right under this one).

Configuration 1: Regular Block Device with f2fs

This config will serve as the baseline: a regular block device with f2fs mounted on it.

sudo mkfs.f2fs /dev/nullb0
sudo mount -t f2fs /dev/nullb0 /mnt/f2fs

# Run the benchmark with db_bench
./db_bench --db=/mnt/f2fs --benchmarks=fillseq,fillrandom,readseq,readrandom --key_size=16 --value_size=100 --num=1000000 --reads=1000000 --use_direct_reads --use_direct_io_for_flush_and_compaction --compression_type=none

Or use the script that automates running the benchmark multiple times on each of the configs.

./bench.sh -m /mnt/f2fs -c 1

Configuration 2: dm-zoned device mapper with fs

Installing dm-zoned-tools and the required device mapper libraries

git clone https://github.com/westerndigitalcorporation/dm-zoned-tools
cd dm-zoned-tools

#missing package
sudo apt install libblkid-dev libkmod-dev libudev-dev libdevmapper-dev

sh ./autogen.sh
./configure
sudo make install

Now setting up the device mapper (delete the previous device since it is not zoned, or make a new one, and update the nullblk.sh script to use zoned devices by setting echo 1 on zoned)

sudo ./nullblk_zoned.sh 4096 64 16 48

sudo modprobe dm-zoned
sudo dmzadm --format /dev/nullb0
sudo dmzadm --start /dev/nullb0
sudo mkfs.f2fs /dev/dm-0
sudo mount -t f2fs /dev/dm-0 /mnt/f2fs/

# run benchmark
./bench.sh -m /mnt/f2fs -c 2
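
To confirm the dm-zoned target came up correctly, a quick check with standard tools (a sketch, not part of the original setup):

lsblk                  # /dev/dm-0 should appear on top of nullb0
sudo dmsetup table     # lists the dm-zoned target and its size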

Cleanup:

sudo umount /dev/dm-0
sudo dmzadm --stop /dev/nullb0
sudo ./nullblk_delete.sh 0

Configuration 3: f2fs with zoned support enabled

Note: f2fs took up 1.9GB when mounted on a 2GB device, therefore I had to increase the size (number of zones) to get some usable space (did this for all benchmarks in the new iteration). Not sure why f2fs needs ~2GB of metadata?!

sudo ./nullblk_zoned.sh 4096 64 16 48

sudo mkfs.f2fs -f -m /dev/nullb0
sudo mount -t f2fs /dev/nullb0 /mnt/f2fs

# run benchmark
./bench.sh -m /mnt/f2fs -c 3

Configuration 4: Zoned device with rocksdb and ZenFS plugin

First install ZenFS (see "Setting up rocksdb and zenfs" under "Setting up on node3" below).

sudo ./nullblk_zoned.sh 4096 64 16 48

# from rocksdb dir
sudo ./plugin/zenfs/util/zenfs mkfs --zbd=nullb0 --aux-path=/tmp/zone-aux

# run the benchmark
./bench.sh -m nullb0 -c 4
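
bench.sh is assumed to pass the ZenFS filesystem URI through to db_bench; a direct invocation might look roughly like this (a sketch, flags are illustrative):

sudo ./db_bench --fs_uri=zenfs://dev:nullb0 --benchmarks=fillrandom --key_size=16 --value_size=100 --num=1000000 --compression_type=none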

Setting up on node3

Setting up and installing libraries for the user in $HOME/local

mkdir local

# to .bashrc add 
export PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:/home/nty/src/f2fs-tools/mkfs/:/home/nty/src/linux-5.4/tools/perf/:$PATH 
export LD_LIBRARY_PATH=/home/nty/local/lib/:/home/nty/local/usr/local/lib/:$LD_LIBRARY_PATH
export CPATH=:/home/nty/local/include/:/home/nty/local/usr/local/include/:
export PKG_CONFIG_PATH=/home/nty/local/usr/local/lib/pkgconfig/:/home/nty/local/lib/pkgconfig:$PKG_CONFIG_PATH

Setting up rocksdb and zenfs

# Need libzbd
git clone https://github.com/westerndigitalcorporation/libzbd
cd libzbd
sh ./autogen.sh
./configure --prefix=$HOME/local
make install

# libzbc
git clone https://github.com/westerndigitalcorporation/libzbc
cd libzbc
sh ./autogen.sh
./configure --prefix=$HOME/local
make install

# Rocksdb with zenfs
git clone https://github.com/facebook/rocksdb.git
cd rocksdb

# checkout the correct version
git clone https://github.com/westerndigitalcorporation/zenfs plugin/zenfs

Now the messy part: the Makefile has to be modified since the AM_LINK rule was not picking up the library paths

nty@node3:/home/nty/src/rocksdb$ git diff
diff --git a/Makefile b/Makefile
index 6c056c4ec..1a1c7ba4c 100644
--- a/Makefile
+++ b/Makefile
@@ -225,8 +225,8 @@ LIB_SOURCES += utilities/env_librados.cc
 LDFLAGS += -lrados
 endif

-AM_LINK = $(AM_V_CCLD)$(CXX) -L. $(patsubst lib%.a, -l%, $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^)) $(EXEC_LDFLAGS) -o $@ $(LDFLAGS) $(COVERAGEFLAGS)
-AM_SHARE = $(AM_V_CCLD) $(CXX) $(PLATFORM_SHARED_LDFLAGS)$@ -L. $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^) $(LDFLAGS) -o $@
+AM_LINK = $(AM_V_CCLD)$(CXX) -L/home/nty/local/lib -L/home/nty/local/usr/local/lib/ -L. $(patsubst lib%.a, -l%, $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^)) $(EXEC_LDFLAGS) -o $@ $(LDFLAGS) $(COVERAGEFLAGS)
+AM_SHARE = $(AM_V_CCLD) $(CXX) $(PLATFORM_SHARED_LDFLAGS)$@ -L/home/nty/local/lib -L/home/nty/local/usr/local/lib/ -L. $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^) $(LDFLAGS) -o $@

 # Detect what platform we're building on.
 # Export some common variables that might have been passed as Make variables

Now the actual install

DEBUG_LEVEL=0 ROCKSDB_PLUGINS=zenfs PREFIX=/home/nty/local/ make -j4 db_bench install

# and building zenfs
cd plugin/zenfs/util

Again modify the Makefile

nty@node3:/home/nty/src/rocksdb/plugin/zenfs/util$ git diff
diff --git a/util/Makefile b/util/Makefile
index e544685..4ab323a 100644
--- a/util/Makefile
+++ b/util/Makefile
@@ -11,7 +11,7 @@ LIBS = $(shell pkg-config --static --libs rocksdb)
 all: $(TARGET)

 $(TARGET): $(TARGET).cc
-       $(CXX) $(CPPFLAGS) -o $(TARGET) $< $(LIBS)
+       $(CXX) $(CPPFLAGS) -L/home/nty/local/lib/ -L/home/nty/local/usr/local/lib/ -Wl,-rpath=/home/nty/local/lib -o $(TARGET) $< $(LIBS)

 clean:
        $(RM) $(TARGET)

Now build it

make

Installing f2fs

f2fs-tools can be built without installing; it also has path issues in its Makefile and rather than dealing with those, the binary can just be run from the build directory (see the f2fs-tools build steps earlier in these notes).

Installing dm-zoned

# Need some packages before
sudo apt install libkmod-dev 
# sticking to already installed libudev version
sudo apt install libudev-dev=245.4-4ubuntu3.11
sudo apt install libdevmapper-dev

git clone https://github.com/westerndigitalcorporation/dm-zoned-tools
cd dm-zoned-tools
sh ./autogen.sh
./configure --prefix=$HOME/local

make # note no need to install actually since it would go into sbin, which we do not want to include directly in user PATH
# in .bashrc
export PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:$PATH


NOTES:
wget https://raw.githubusercontent.com/torvalds/linux/v5.12/include/uapi/linux/blkzoned.h # blkzoned.h from the matching kernel version, placed inside /home/nty/local/include

Installing libnvme and nvme-cli

libnvme

git clone https://github.com/linux-nvme/libnvme
cd libnvme
meson .build --prefix=/home/nty/local
cd .build
ninja
meson install

nvme-cli

git clone https://github.com/linux-nvme/nvme-cli
cd nvme-cli
make

# Add it to path
export PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:/home/nty/src/f2fs-tools/mkfs/:/home/nty/src/linux-5.4/tools/perf/:/home/nty/src/nvme-cli/:$PATH

Installing util-linux

This is needed for blkzone capacity and maybe other utilities (BUT ONLY COMPILING BLKZONE FOR NOW).

git clone https://github.com/util-linux/util-linux
cd util-linux
./autogen.sh
./configure
make
# Then add it to PATH
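
Once built, the blkzone binary sits in the top of the util-linux source tree; a quick check that the new capacity subcommand works (a sketch, assuming the zoned namespace /dev/nvme1n2 used later in these notes):

sudo ./blkzone capacity /dev/nvme1n2   # reports the usable capacity of each zone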

Running on node3

Things typically need to run as sudo, so the env variables pointing at the $HOME/local/* dirs have to be passed through to sudo

# For rocksdb pass LD_LIBRARY_PATH
sudo env LD_LIBRARY_PATH=/home/nty/local/lib/:/home/nty/local/usr/local/lib/:$LD_LIBRARY_PATH ./db_bench ....

# For dm-mapper
sudo env PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:/home/nty/src/f2fs-tools/mkfs/:$PATH dmzadm --format /dev/nullb0
sudo env PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:/home/nty/src/f2fs-tools/mkfs/:$PATH dmzadm --start /dev/nullb0
sudo env PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:/home/nty/src/f2fs-tools/mkfs/:$PATH dmzadm --stop /dev/nullb0

# For mkfs.f2fs
sudo env PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:/home/nty/src/f2fs-tools/mkfs/:$PATH mkfs.f2fs /dev/nullb0

# OR just add an alias to make stuff easier
alias sudo='sudo env "LD_LIBRARY_PATH=/home/nty/local/lib/:/home/nty/local/usr/local/lib/:$LD_LIBRARY_PATH" "PATH=/home/nty/local/bin/:/home/nty/src/dm-zoned-tools/:/home/nty/src/f2fs-tools/mkfs/:$PATH"'

Running with perf profiling

Need to install perf for kernel 5.4, from sources.

Gives this error when running make -C perf in linux-5.4/tools

Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/unistd.h' differs from latest version at 'arch/arm64/include/uapi/asm/unistd.h'
diff -u tools/arch/arm64/include/uapi/asm/unistd.h arch/arm64/include/uapi/asm/unistd.h

...

util/srcline.c:200:7: error: implicit declaration of function ‘bfd_get_section_flags’; did you mean ‘bfd_set_section_flags’? [-Werror=implicit-function-declaration]
200 |  if ((bfd_get_section_flags(abfd, section) & SEC_ALLOC) == 0)
|       ^~~~~~~~~~~~~~~~~~~~~
|       bfd_set_section_flags
util/srcline.c:200:7: error: nested extern declaration of ‘bfd_get_section_flags’ [-Werror=nested-externs]
util/srcline.c:204:8: error: implicit declaration of function ‘bfd_get_section_vma’; did you mean ‘bfd_set_section_vma’? [-Werror=implicit-function-declaration]
204 |  vma = bfd_get_section_vma(abfd, section);
|        ^~~~~~~~~~~~~~~~~~~
|        bfd_set_section_vma
util/srcline.c:204:8: error: nested extern declaration of ‘bfd_get_section_vma’ [-Werror=nested-externs]
util/srcline.c:205:9: error: implicit declaration of function ‘bfd_get_section_size’; did you mean ‘bfd_set_section_size’? [-Werror=implicit-function-declaration]
205 |  size = bfd_get_section_size(section);
|         ^~~~~~~~~~~~~~~~~~~~
|         bfd_set_section_size
util/srcline.c:205:9: error: nested extern declaration of ‘bfd_get_section_size’ [-Werror=nested-externs]

This is due to binutils having changed its API in version 2.34 (check the version with ld -v). See the bug report, bug tracker, and workaround.

We fix this by doing

diff -u old_srcline.c srcline.c
--- old_srcline.c       2021-10-18 14:24:06.116438919 +0200
+++ srcline.c   2021-10-18 14:23:48.880311405 +0200
@@ -197,12 +197,12 @@
        if (a2l->found)
                return;

-       if ((bfd_get_section_flags(abfd, section) & SEC_ALLOC) == 0)
+       if ((bfd_section_flags(section) & SEC_ALLOC) == 0)
                return;

        pc = a2l->addr;
-       vma = bfd_get_section_vma(abfd, section);
-       size = bfd_get_section_size(section);
+       vma = bfd_section_vma(section);
+       size = bfd_section_size(section);

        if (pc < vma || pc >= vma + size)
                return;

Then just run make -C perf and add it to the $PATH.
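
For reference, a profiling run wrapping db_bench might look roughly like this (a sketch; flags and output names are illustrative, and the sudo env caveats from "Running on node3" above still apply):

# Record call graphs while the benchmark runs, then inspect the report
sudo perf record -g -o zns-f2fs.data ./db_bench --db=/mnt/f2fs --benchmarks=fillrandom --num=1000000
sudo perf report -i zns-f2fs.data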

Fio Benchmark

git clone https://github.com/axboe/fio
cd fio
./configure --prefix=$HOME/local

# If fio says "libzbc no" then check Troubleshooting at the bottom on how to enable!

make
make install

Running now (the configs need to be set up and mounted first; instructions are in the section "4 Configuration Benchmark on /dev/nullb*" above)

# For regular block device run this on nullblk
sudo fio --name=zns-fio --filename=/dev/nullb0 --direct=1 --size=1G --ioengine=libaio --iodepth=8 --rw=write --bs=4K  --runtime=30s --time_based --thread=1

# For zoned block device, figure out starting LBA of first sequential zone, use this as --offset for fio
sudo blkzone report /dev/nullb0
sudo fio --name=zns-fio --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=2097152 --size=1G --ioengine=libaio --iodepth=8 --rw=write --bs=4K --runtime=30s --time_based --thread=1

# Config-[1-3]
sudo fio --name=zns-fio --filename=/dev/zns-bench/fio-bench --direct=1 --size=1G --ioengine=libaio --iodepth=8 --rw=write --bs=4K  --runtime=30s --time_based --thread=1

Running 4-config Benchmarks on node3 with real ZNS hardware

This set of commands covers the setup and running of the 4-config benchmarks on real ZNS hardware on node3. It includes some basic nvme commands to check I/O performance and get to know the device, as well as the fio and db_bench benchmarks.

/dev/nvme1n2 the zoned device

# doing some writing with nvme commands (-z is block_size)
nty@node3:~$ echo "hello world" | sudo nvme zns zone-append /dev/nvme1n2 -z 4096
Success appended data to LBA 0
nty@node3:~$ sudo nvme read /dev/nvme1n2 -z 4096
hello world
read: Success
# Now checking zone status
nty@node3:~$ sudo nvme zns report-zones /dev/nvme1n2 -d 5
nr_zones: 3688
SLBA: 0x0        WP: 0x1        Cap: 0x43500    State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x80000    WP: 0x80000    Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x100000   WP: 0x100000   Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x180000   WP: 0x180000   Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x200000   WP: 0x200000   Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
# Now resetting zone and recheck status
nty@node3:~$ sudo blkzone reset --count 1 /dev/nvme1n2
nty@node3:~$ sudo nvme zns report-zones /dev/nvme1n2 -d 5
nr_zones: 3688
SLBA: 0x0        WP: 0x0        Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x80000    WP: 0x80000    Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x100000   WP: 0x100000   Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x180000   WP: 0x180000   Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x200000   WP: 0x200000   Cap: 0x43500    State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0

Now running fio on the zoned device; note that --size needs to be at least the zone size.

sudo fio --name=zns-fio --filename=/dev/nvme1n2 --direct=1 --zonemode=zbd --size=$((4194304*512)) --ioengine=libaio --iodepth=2 --rw=write --bs=4K --runtime=30s --time_based --thread=1

/dev/nvme1n1 the conventional zones

sudo fio --name=zns-fio --filename=/dev/nvme1n1 --direct=1 --size=$((4194304*512)) --ioengine=libaio --iodepth=2 --rw=write --bs=4K --runtime=30s --time_based --thread=1

Config-[1-3]

# Do all setup for the config and mount it on /mnt/zns-bench
sudo fio --name=zns-fio --filename=/mnt/zns-bench/fio_bench --direct=1 --size=$((4194304*512)) --ioengine=libaio --iodepth=2 --rw=write --bs=4K --runtime=30s --time_based --thread=1

# Config 2: device mapper over 2 namespaces:
# a single conventional device followed by the zoned device(s), but only one regular block device is allowed
sudo dmzadm --format /dev/nvme0n1 /dev/nvme2n2
sudo dmzadm --start /dev/nvme0n1 /dev/nvme2n2
sudo mkfs.f2fs /dev/dm-0
sudo mount -t f2fs /dev/dm-0 /mnt/zns-bench/

# Config 3: For creating f2fs with 2 namespaces (conventional and zoned)
sudo mkfs.f2fs -f -m -c /dev/nvme2n2 /dev/nvme0n1
sudo mount -t f2fs /dev/nvme0n1 /mnt/zns-bench/

Resizing namespaces on zoned device

sudo nvme delete-ns /dev/nvme2 -n 2
# Create 100GB namespace
sudo nvme create-ns /dev/nvme2 -s 209715200 -c 209715200 -b 512 --csi=2
sudo nvme attach-ns /dev/nvme2 -n 2 -c 0

# For the conventional 4G namespace
sudo nvme create-ns /dev/nvme2 -s 8388608 -c 8388608 -b 512 --csi=0
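
Afterwards the namespace layout can be verified with nvme-cli (a quick check, not from the original notes):

sudo nvme list                 # controllers, namespaces and their sizes
sudo nvme list-ns /dev/nvme2   # namespace IDs present on the controller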

Troubleshooting

Creating/Deleting null_blk

# Might have to enable memory backing
sudo mkdir /sys/kernel/config/nullb/nullb0
echo 1 | sudo tee /sys/kernel/config/nullb/nullb0/memory_backed

# for removing (do this for all /dev/nullb*)
echo 0 | sudo tee /sys/kernel/config/nullb/nullb0/power
sudo rmdir /sys/kernel/config/nullb/nullb0
sudo rmmod null_blk

fio not finding libzbc

Simple fix: the library path needs to be linked manually in configure and the Makefile

nty@node3:/home/nty/src/fio$ git diff
diff --git a/Makefile b/Makefile
index 4ae5a371..f8b04097 100644
--- a/Makefile
+++ b/Makefile
@@ -218,7 +218,7 @@ ifdef CONFIG_IME
 endif
 ifdef CONFIG_LIBZBC
   libzbc_SRCS = engines/libzbc.c
-  libzbc_LIBS = -lzbc
+  libzbc_LIBS = -L/home/nty/local/lib -lzbc
   ENGINES += libzbc
 endif

diff --git a/configure b/configure
index 84ccce04..4c19b60e 100755
--- a/configure
+++ b/configure
@@ -2548,7 +2548,7 @@ int main(int argc, char **argv)
 }
 EOF
 if test "$libzbc" != "no" ; then
-  if compile_prog "" "-lzbc" "libzbc"; then
+  if compile_prog "" "-L/home/nty/local/lib -lzbc" "libzbc"; then
     libzbc="yes"
     if ! check_min_lib_version libzbc 5; then
       libzbc="no"