zns - animeshtrivedi/notes GitHub Wiki

The kernel build is badly broken; I tried multiple things and am now doing a deb package build to see if that succeeds.

Unlike SMR disks, NVMe ZNS devices do not mix conventional and sequential-write-required zones in the same namespace: http://zonedstorage.io/introduction/zns/ . A single drive or controller can have multiple namespaces, one with conventional zones and another with ZNS zones.

http://zonedstorage.io/introduction/zns/

  • ZAC/ZBC commands: zone size is the total number of logical blocks in a zone

  • NVMe ZNS adds zone capacity: the number of usable logical blocks within each zone, with capacity <= size. Why this distinction? GC issues? Other internal provisioning?

    • This new attribute was introduced to let the zone size remain a power-of-two number of logical blocks (facilitating easy logical-block-to-zone-number conversions) while allowing an optimized mapping of a zone's storage capacity to the underlying media characteristics. For instance, in the case of a flash-based device, the zone capacity can be aligned to the size of the flash erase blocks without requiring that the device implement power-of-two-sized erase blocks.
  • Hence, the usable size of an NVMe ZNS namespace is the sum of the individual zone "capacities", not the sum of the zone sizes.
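The size/capacity split above can be sketched numerically. A minimal Python illustration (all constants here are made-up example values, not from a real device):

```python
# Illustrative sketch: zone size is a power of two (cheap shift/mask math),
# while zone capacity (usable blocks) may be smaller to match the media.
ZONE_SIZE = 0x20000        # zone size in logical blocks (power of two)
ZONE_CAP = 0x1B000         # usable blocks per zone, capacity <= size
ZONE_SHIFT = ZONE_SIZE.bit_length() - 1

def lba_to_zone(lba):
    # power-of-two zone size => zone number is a shift, no division needed
    return lba >> ZONE_SHIFT

def lba_in_zone_capacity(lba):
    # offsets in [capacity, size) are a hole that cannot store data
    return (lba & (ZONE_SIZE - 1)) < ZONE_CAP

def usable_namespace_blocks(nr_zones):
    # usable space is the sum of capacities, not the sum of sizes
    return nr_zones * ZONE_CAP

print(lba_to_zone(0x40001))           # -> 2
print(lba_in_zone_capacity(0x1B000))  # -> False (past the capacity)
```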

  • Limits:

    • General zoned-storage limit: the maximum number of zones that can simultaneously be in the implicit open or explicit open condition
    • NVMe ZNS limit: active zones, i.e., zones that are in the implicit open, explicit open, or closed condition; there is a limit on the maximum number of "active" zones.
      • the maximum number of active zones imposes a limit on the number of zones that an application can choose for storing data. How do you deal with it? Isn't this the maximum amount of data storage possible?
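A toy sketch of the active-zone accounting (a hypothetical state machine, not a real driver): FULL (finished) zones stop counting as active, so the limit caps how many zones are written concurrently, not the total amount of stored data.

```python
# Sketch (assumption: simplified zone state model): the active-zone limit
# caps concurrently open/closed zones. Finishing a zone frees a slot, so an
# application can store far more than MAX_ACTIVE * zone_capacity overall.
MAX_ACTIVE = 2

class Zones:
    def __init__(self, nr):
        self.state = ["EMPTY"] * nr      # EMPTY -> OPEN/CLOSED -> FULL

    def active(self):
        return sum(s in ("OPEN", "CLOSED") for s in self.state)

    def open_zone(self, z):
        if self.state[z] == "EMPTY" and self.active() >= MAX_ACTIVE:
            raise RuntimeError("would exceed max active zones")
        self.state[z] = "OPEN"

    def finish(self, z):
        self.state[z] = "FULL"           # no longer counts as active

zs = Zones(4)
zs.open_zone(0); zs.open_zone(1)
try:
    zs.open_zone(2)                      # third open zone: rejected
except RuntimeError as e:
    print(e)
zs.finish(0)                             # frees an active-zone slot
zs.open_zone(2)                          # now allowed
print(zs.active())                       # -> 2
```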
  • NVMe ZNS: Zone Append -- a special "nameless write" type of command that lets the device accept multiple outstanding requests and report the written location back to the host, instead of the host telling the device where to write.

    • Based on the command completions of zone appends, the host can discover the write order. If the host needs to force a certain order, it must issue writes one by one at the current zone write pointer offset; otherwise the writes fail.
      • So in the normal case, tracking the write pointer offset is the responsibility of the host? I think it can also be queried from the device.
  • ZONE APPEND is a cool feature. The point is that the device does the block allocation for you, freeing the CPU. This shows up in the performance gains, see https://www.youtube.com/watch?v=9yVWb3rbces. Build a tx log, or other applications? Build a single-machine Tango design.
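A pure-Python simulation of the zone-append idea (no NVMe involved; the class and names are illustrative only):

```python
# Sketch of zone append: the host queues appends without choosing an LBA;
# the device serializes them internally, and each completion reports where
# the data actually landed.
class Zone:
    def __init__(self, start, cap):
        self.start, self.cap = start, cap
        self.wp = start                  # write pointer, owned by the device

    def append(self, nblocks):
        if self.wp + nblocks > self.start + self.cap:
            raise IOError("zone full")
        written_at = self.wp             # device picks the location
        self.wp += nblocks
        return written_at                # reported back in the completion

z = Zone(start=0x20000, cap=0x20000)
# host fires several appends; each completion tells it the placement
locations = [z.append(8) for _ in range(3)]
print([hex(l) for l in locations])       # -> ['0x20000', '0x20008', '0x20010']
```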

  • Write Ordering Control -- so now the new problem with zone writing on SCSI devices: if the kernel reorders the writes, then the offsets will be wrong and the writes will fail. To avoid this, there is a serialization point in the kernel, which is currently only enabled with the mq-deadline scheduler, not noop, http://zonedstorage.io/getting-started/prerequisite/

    • How does this work with NVMe? With deep queues?
    • util-linux on Ubuntu 18 is missing the blkzone command
  • What is libzbc? --> libzbc is a user-space library providing functions for manipulating ZBC and ZAC disks.

Adding util-linux

git clone https://github.com/karelzak/util-linux.git 
# install autopoint, then the usual: ./configure --prefix=/home/atr/local; make; make install 
# make install fails at: chgrp tty /home/atr/local/bin/wall 
# (because atr is not part of the tty group) 
# add an existing user to an existing group, https://askubuntu.com/questions/79565/how-to-add-existing-user-to-an-existing-group 
usermod -a -G tty atr 
# only comes into effect after logging out :( 
# for now I have added the build path to $PATH, which works 
# I installed this for the /home/atr/vu/github/util-linux/blkzone command (and other new utilities?)

Setting up the practical part

# modprobe null_blk nr_devices=1 zoned=1 # works 
# modinfo 
...
parm:           no_sched:No io scheduler (int)
parm:           submit_queues:Number of submission queues (int)
parm:           home_node:Home node for the device (int)
parm:           queue_mode:Block interface to use (0=bio,1=rq,2=multiqueue)
parm:           gb:Size in GB (int)
parm:           bs:Block size (in bytes) (int)
parm:           nr_devices:Number of devices to register (uint)
parm:           blocking:Register as a blocking blk-mq driver device (bool)
parm:           shared_tags:Share tag set between devices for blk-mq (bool)
parm:           irqmode:IRQ completion handler. 0-none, 1-softirq, 2-timer
parm:           completion_nsec:Time in ns to complete a request in hardware. Default: 10,000ns (ulong)
parm:           hw_queue_depth:Queue depth for each hardware queue. Default: 64 (int)
parm:           use_per_node_hctx:Use per-node allocation for hardware context queues. Default: false (bool)
parm:           zoned:Make device as a host-managed zoned block device. Default: false (bool)
parm:           zone_size:Zone size in MB when block device is zoned. Must be power-of-two: Default: 256 (ulong)
parm:           zone_nr_conv:Number of conventional zones when block device is zoned. Default: 0 (uint)

Features

atr@atr-XPS-13:~$ cat /sys/kernel/config/nullb/features
memory_backed,discard,bandwidth,cache,badblocks,zoned,zone_size
atr@atr-XPS-13:~$ 

Creating scripts and configuration

# location of the script, http://zonedstorage.io/getting-started/nullblk/ 
atr@atr-XPS-13:~/local/bin$ ./create-nullblk-zone.sh 4096 64 4 8 

atr@atr-XPS-13:~$ sudo /home/atr/vu/github/util-linux//blkzone report /dev/nullb1 
  start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000100000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000120000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000140000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000160000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
# example of the configuration space:
atr@atr-XPS-13:~$ cat /sys/block/nullb0/queue/zoned
host-managed
atr@atr-XPS-13:~$ 
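As a sanity check on the report above: blkzone prints start/len/wptr in 512-byte sectors, so len 0x020000 corresponds to the 64 MB zone size passed to the create script:

```python
# blkzone report fields are in 512-byte sectors
SECTOR = 512
zone_len_sectors = 0x020000           # "len" field from the report above
zone_bytes = zone_len_sectors * SECTOR
print(zone_bytes // (1024 * 1024))    # -> 64, matching the 64 MB zone size
```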

Day 2

Installed the nvme-cli tools; the old ones do not have support for ZNS.

/home/atr/vu/github/storage/nvme-cli
atr@atr-XPS-13:~/vu/github/storage/nvme-cli$ nvme zns 
nvme-1.14
usage: nvme zns <command> [<device>] [<args>]

The '<device>' may be either an NVMe character device (ex: /dev/nvme0) or an
nvme block device (ex: /dev/nvme0n1).

Zoned Namespace Command Set

The following are all implemented sub-commands:
  id-ctrl             Retrieve ZNS controller identification
  id-ns               Retrieve ZNS namespace identification
  zone-mgmt-recv      Sends the zone management receive command
  zone-mgmt-send      Sends the zone management send command
  report-zones        Retrieve the Report Zones report
  close-zone          Closes one or more zones
  finish-zone         Finishes one or more zones
  open-zone           Opens one or more zones
  reset-zone          Resets one or more zones
  offline-zone        Offlines one or more zones
  set-zone-desc       Attaches zone descriptor extension data
  zone-append         Writes data and metadata (if applicable), appended to the end of the requested zone
  changed-zone-list   Retrieves the changed zone list log
  version             Shows the program version
  help                Display this help

See 'nvme zns help <command>' for more information on a specific command
atr@atr-XPS-13:~/vu/github/storage/nvme-cli$ 

With the newly compiled nvme commands

The nullb0 device is not recognized as an NVMe device; hence, no interaction is possible with the nvme zns commands. 

Setting up RocksDB (ZenFS) with ZNS

Links

sudo apt install autoconf
sudo apt-get install libgflags-dev
sudo apt-get install libtool
sudo apt install autoconf-archive

The autoconf-archive package is confusing... without it I got errors like:

atr@stosys-qemu-vm:/home/atr/src/libzbd$ ./configure 
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether gcc understands -c and -o together... yes
checking whether make supports the include directive... yes (GNU style)
checking dependency style of gcc... gcc3
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking minix/config.h usability... no
checking minix/config.h presence... no
checking for minix/config.h... no
checking whether it is safe to define __EXTENSIONS__... yes
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
checking for ar... ar
checking the archiver (ar) interface... ar
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/bin/sed
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc object... ok
checking for sysroot... no
checking for a working dd... /usr/bin/dd
checking how to truncate binary pipes... /usr/bin/dd bs=4096 count=1
checking for mt... mt
checking if mt is a manifest tool... no
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc supports -fno-rtti -fno-exceptions... no
checking for gcc option to produce PIC... -fPIC -DPIC
checking if gcc PIC flag -fPIC -DPIC works... yes
checking if gcc static flag -static works... yes
checking if gcc supports -c -o file.o... yes
checking if gcc supports -c -o file.o... (cached) yes
checking whether the gcc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
./configure: line 9226: AX_PTHREAD: command not found
checking for rpmbuild... notfound
checking for rpm... notfound
checking for libgen.h... no
configure: error: Couldn't find libgen.h
atr@stosys-qemu-vm:/home/atr/src/libzbd$ locate libgen.h
/usr/include/libgen.h

Hint: https://github.com/ANSSI-FR/libapn/issues/1

Compilation and run commands

# Update the Makefile to locate the path of the library 
atr@atr-XPS-13:~/vu/github/storage/rocksdb$ git diff 
diff --git a/Makefile b/Makefile
index 8571facaa..67dda5456 100644
--- a/Makefile
+++ b/Makefile
@@ -252,8 +252,8 @@ LIB_SOURCES += utilities/env_librados.cc
 LDFLAGS += -lrados
 endif
 
-AM_LINK = $(AM_V_CCLD)$(CXX) -L. $(patsubst lib%.a, -l%, $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^)) $(EXEC_LDFLAGS) -o $@ $(LDFLAGS) $(COVERAGEFLAGS)
-AM_SHARE = $(AM_V_CCLD) $(CXX) $(PLATFORM_SHARED_LDFLAGS)$@ -L. $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^) $(LDFLAGS) -o $@
+AM_LINK = $(AM_V_CCLD)$(CXX) -L/home/atr/local/lib -L/home/atr/local/usr/local/lib/ -L. $(patsubst lib%.a, -l%, $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^)) $(EXEC_LDFLAGS) -o $@ $(LDFLAGS) $(COVERAGEFLAGS)
+AM_SHARE = $(AM_V_CCLD) $(CXX) $(PLATFORM_SHARED_LDFLAGS)$@ -L/home/atr/local/lib -L/home/atr/local/usr/local/lib/ -L. $(patsubst lib%.$(PLATFORM_SHARED_EXT), -l%, $^) $(LDFLAGS) -o $@
 
 # Detect what platform we're building on.
 # Export some common variables that might have been passed as Make variables
@@ -1455,7 +1455,7 @@ librocksdb_env_basic_test.a: $(OBJ_DIR)/env/env_basic_test.o $(LIB_OBJECTS) $(TE
        $(AM_V_at)$(AR) $(ARFLAGS) $@ $^
 
 db_bench: $(OBJ_DIR)/tools/db_bench.o $(BENCH_OBJECTS) $(TESTUTIL) $(LIBRARY)
-       $(AM_LINK)
+       $(AM_LINK) -Wl,-rpath,/home/atr/local/lib/ -Wl,-rpath,/home/atr/local/usr/local/lib/
 
 trace_analyzer: $(OBJ_DIR)/tools/trace_analyzer.o $(ANALYZE_OBJECTS) $(TOOLS_LIBRARY) $(LIBRARY)
        $(AM_LINK)
atr@atr-XPS-13:~/vu/github/storage/rocksdb$ 

Building and running

$ DEBUG_LEVEL=0 ROCKSDB_PLUGINS=zenfs DESTDIR=/home/atr/local/ make -j db_bench install
$ sudo ./db_bench --fs_uri=zenfs://dev:nullb0 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction  --compression_type=none

Some more issues: if I do not do an install, then compilation fails like this:

atr@stosys-qemu-vm:/home/atr/src/rocksdb/plugin/zenfs/util$ make
Package rocksdb was not found in the pkg-config search path.
Perhaps you should add the directory containing `rocksdb.pc'
to the PKG_CONFIG_PATH environment variable
No package 'rocksdb' found
Package rocksdb was not found in the pkg-config search path.
Perhaps you should add the directory containing `rocksdb.pc'
to the PKG_CONFIG_PATH environment variable
No package 'rocksdb' found
g++   -o zenfs zenfs.cc 
zenfs.cc:15:10: fatal error: rocksdb/file_system.h: No such file or directory
   15 | #include <rocksdb/file_system.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make: *** [Makefile:12: zenfs] Error 1
atr@stosys-qemu-vm:/home/atr/src/rocksdb/plugin/zenfs/util$ cd -

What did not work

Non-standard build, compile, and link paths

so much pain, so much pain

  • The gcc option is -I (I modified it in the Makefile, but somehow it did not get picked up properly)
  • export CPATH=:/home/atr/local/include/:/home/atr/local/usr/local/include/: (put this in the bashrc)

RocksDB compilation guide:

https://github.com/facebook/rocksdb/blob/master/INSTALL.md

$ DEBUG_LEVEL=0 ROCKSDB_PLUGINS=zenfs make -j db_bench # (I skipped the install step, but included db_bench) 

How to include shared library execution path for a loader

Some part of the bash/sudo environment can be preserved with sudo env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" zbd (note: env takes separate VAR=value arguments, not a ;-separated list)

atr@node3:/home/atr/src/storage/rocksdb$ ldd ./plugin/zenfs/util/zenfs 
	linux-vdso.so.1 (0x00007fff8d54c000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc3b17b4000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc3b17ae000)
	libgflags.so.2.2 => /lib/x86_64-linux-gnu/libgflags.so.2.2 (0x00007fc3b1783000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fc3b1767000)
	libzbd-1.3.0.so => /home/atr/local/lib/libzbd-1.3.0.so (0x00007fc3b175e000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc3b157b000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc3b142c000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc3b1411000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc3b121f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fc3b1ca1000)
atr@node3:/home/atr/src/storage/rocksdb$ sudo ldd ./plugin/zenfs/util/zenfs 
	linux-vdso.so.1 (0x00007ffd17f9b000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb2f8f13000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb2f8f0d000)
	libgflags.so.2.2 => /lib/x86_64-linux-gnu/libgflags.so.2.2 (0x00007fb2f8ee2000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fb2f8ec6000)
	libzbd-1.3.0.so => not found
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb2f8ce3000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb2f8b94000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb2f8b79000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb2f8987000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb2f9400000)

I had to modify the Makefile in zenfs/util/Makefile to explicitly put in the path with the loader flags

-Wl,-rpath,/home/atr/local/lib/ -Wl,-rpath,/home/atr/local/usr/local/lib/ 

atr@atr-XPS-13:~/vu/github/storage/rocksdb/plugin/zenfs$ git diff 
diff --git a/util/Makefile b/util/Makefile
index 3bd0ea1..e3dc5b4 100644
--- a/util/Makefile
+++ b/util/Makefile
@@ -9,7 +9,7 @@ LIBS = $(shell pkg-config --static --libs rocksdb)
 all: $(TARGET)
 
 $(TARGET): $(TARGET).cc
-       $(CC) $(CPPFLAGS)  -o $(TARGET) $< $(LIBS)
+       $(CC) $(CPPFLAGS) -L/home/atr/local/lib/ -L/home/atr/local/usr/local/lib/ -Wl,-rpath=/home/atr/local/lib  -o $(TARGET) $< $(LIBS)
 
 clean:
        $(RM) $(TARGET)
atr@atr-XPS-13:~/vu/github/storage/rocksdb/plugin/zenfs$ 

Again there was a bit of a mess about where to put these flags in the Makefile; the AM_LINK flags from the RocksDB Makefile did not pick it up.

PKG config

These build and compile dependencies are picked up via pkg-config (unsuccessfully by the zenfs build):

export PKG_CONFIG_PATH=/home/atr/local/usr/local/lib/pkgconfig/:$PKG_CONFIG_PATH

What is the device name?

./plugin/zenfs/util/zenfs mkfs --zbd=/dev/<zoned block device> --aux-path=<path to store LOG and LOCK files>

What name should you pass here? There is a typo in the documentation; it should be just --zbd=<zoned block device> (without the /dev/ prefix).

Also, there is a size issue:

atr@stosys-qemu-vm:/home/atr/src/rocksdb$ ~/src/zns-resources/scripts/create-nullblk-zone.sh 
Usage: /home/atr//src/zns-resources/scripts/create-nullblk-zone.sh <sect size (B)> <zone size (MB)> <nr conv zones> <nr seq zones>
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ~/src/zns-resources/scripts/create-nullblk-zone.sh 4096 1 0 8 
Created /dev/nullb0
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ./plugin/zenfs/util/zenfs mkfs --zbd=nullb0 --aux-path=/home/atr/rocksdb-aux-path/ 
Error: aux path exists
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ./plugin/zenfs/util/zenfs mkfs --zbd=nullb0 --aux-path=/home/atr/rocksdb-aux-path-nullb0/ 
Failed to open zoned block device: nullb0, error: Not implemented: To few zones on zoned block device (32 required)
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ~/src/zns-resources/scripts/destroy-nullblk-zone.sh 
Usage: /home/atr//src/zns-resources/scripts/destroy-nullblk-zone.sh <nullb ID>
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ~/src/zns-resources/scripts/destroy-nullblk-zone.sh nullb0 
/dev/nullbnullb0: No such device
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ~/src/zns-resources/scripts/destroy-nullblk-zone.sh 0 
Destroyed /dev/nullb0
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ~/src/zns-resources/scripts/create-nullblk-zone.sh 4096 1 0 32 
Created /dev/nullb0
atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ./plugin/zenfs/util/zenfs mkfs --zbd=nullb0 --aux-path=/home/atr/rocksdb-aux-path-nullb0/ 
INFO: For ZBD nullb0, device scheduler is set to mq-deadline.
ZenFS file system created. Free space: 29 MB
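The reported 29 MB of free space is consistent with ZenFS reserving a few zones for metadata (an assumption here, suggested by the "metadata zone being pushed" debug lines later in these notes):

```python
# Rough accounting (assumption: ZenFS reserves 3 zones for metadata, as the
# debug output further down shows 3 metadata zones being pushed)
zone_mb = 1          # zone size passed to create-nullblk-zone.sh
nr_zones = 32        # sequential zones created
metadata_zones = 3
print((nr_zones - metadata_zones) * zone_mb)  # -> 29, matching "Free space: 29 MB"
```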

Creation of the aux file?

atr@node3:/home/atr/src/storage/rocksdb$ sudo ./db_bench --fs_uri=zenfs://dev:nullb0 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction
 I am here withthe path nullb0 
HERE zdb_open with /dev/nullb0 ret code 3 
Error: Not implemented: IO error: No such file or directory: While mkdir if missing: ~/tmp/: No such file or directory: zenfs://dev:nullb0
# At this point, I recreated the zenfs setup 
atr@node3:/home/atr/src/storage/rocksdb$ sudo ./plugin/zenfs/util/zenfs mkfs --zbd=nullb0 --aux-path=/home/atr/tmp/zns/  --force
INFO: For ZBD nullb0, device scheduler is set to mq-deadline.
ZenFS file system created. Free space: 1856 MB
atr@node3:/home/atr/src/storage/rocksdb$ sudo ./db_bench --fs_uri=zenfs://dev:nullb0 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction 
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
open error: Invalid argument: Compression type Snappy is not linked with the binary.

Disable compression setup for now

https://github.com/facebook/rocksdb/issues/761

./db_bench --fs_uri=zenfs://dev:nullb0 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction  --compression_type=none 
atr@node3:/home/atr/src/storage/rocksdb$ sudo ./db_bench --fs_uri=zenfs://dev:nullb0 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction  --compression_type=none 
 I am here withthe path nullb0 
HERE zdb_open with /dev/nullb0 ret code 3 
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 6.19
Date:       Sat May  1 09:56:17 2021
CPU:        40 * Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
CPUCache:   14080 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [rocksdbtest/dbbench]
fillrandom   :       3.258 micros/op 306884 ops/sec;   33.9 MB/s
atr@node3:/home/atr/src/storage/rocksdb$

Doing an ls -l on zenfs; it is kind of primitive and broken

atr@node3:/home/atr/src/storage/rocksdb/plugin/zenfs$ sudo ./util/zenfs list --zbd nullb0 --path rocksdbtest/dbbench/ 
HDHDHDHD
 I am here withthe path nullb0 
HERE zdb_open with /dev/nullb0 ret code 3 
           0	LOCK                            
       31418	LOG                             
Failed to get size of file 000009.sst
$ # you can see more details in the zenfs log file about which files are written on zenfs 
$ less /tmp/zenfs_nullb0_2021-05-04_09\:28\:20.log
atr@node3:/home/atr/src/storage/rocksdb/plugin/zenfs$ cat /tmp/zenfs_nullb0_2021-05-04_09\:28\:20.log | grep "New writable file"
2021/05/04-09:28:20.262233 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000000.dbtmp direct: 0
2021/05/04-09:28:20.262809 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/MANIFEST-000001 direct: 0
2021/05/04-09:28:20.263134 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000001.dbtmp direct: 0
2021/05/04-09:28:20.266392 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/MANIFEST-000004 direct: 0
2021/05/04-09:28:20.266729 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000004.dbtmp direct: 0
2021/05/04-09:28:20.267530 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000005.log direct: 0
2021/05/04-09:28:20.268298 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/OPTIONS-000006.dbtmp direct: 0
2021/05/04-09:28:20.301842 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000000.dbtmp direct: 0
2021/05/04-09:28:20.302332 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/MANIFEST-000001 direct: 0
2021/05/04-09:28:20.302628 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000001.dbtmp direct: 0
2021/05/04-09:28:20.305607 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/MANIFEST-000004 direct: 0
2021/05/04-09:28:20.305901 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000004.dbtmp direct: 0
2021/05/04-09:28:20.306770 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/000005.log direct: 0
2021/05/04-09:28:20.307551 7f170499aac0 [DEBUG] New writable file: rocksdbtest/dbbench/OPTIONS-000006.dbtmp direct: 0
2021/05/04-09:28:29.584170 7f16fa993700 [DEBUG] New writable file: rocksdbtest/dbbench/000008.log direct: 0
2021/05/04-09:28:29.584808 7f16fc196700 [DEBUG] New writable file: rocksdbtest/dbbench/000009.sst direct: 1
2021/05/04-09:28:38.192693 7f16fa993700 [DEBUG] New writable file: rocksdbtest/dbbench/000010.log direct: 0
2021/05/04-09:28:38.193608 7f16fc196700 [DEBUG] New writable file: rocksdbtest/dbbench/000011.sst direct: 1
atr@node3:/home/atr/src/storage/rocksdb/plugin/zenfs$ 

What was written and where after the RocksDB test

DB log

atr@node3:/home/atr/src/storage/rocksdb$ sudo ./db_bench --fs_uri=zenfs://dev:nullb0 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction --compression_type=none
[atr] Allocating a new NewZenFS from uri = zenfs://dev:nullb0 and dev = nullb0 
[atr] readonly = 0 read_f_ 4 read_direct_f (why a second pointer) 5 and write_f_ 6 , info.mode is ZBD_DM_HOST_MANAGED 
[atr] total number of zones reported are 40 
 	 [all] m = 0 and i = 0 , addr 00000000000000 and wp 00000067108864 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 1 , addr 00000067108864 and wp 00000134217728 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 2 , addr 00000134217728 and wp 00000201326592 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 3 , addr 00000201326592 and wp 00000268435456 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 4 , addr 00000268435456 and wp 00000335544320 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 5 , addr 00000335544320 and wp 00000402653184 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 6 , addr 00000402653184 and wp 00000469762048 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 7 , addr 00000469762048 and wp 00000536870912 capacity is 00000067108864 len is 00000067108864 
	 [all] m = 0 and i = 8 , addr 00000536870912 and wp 00000537174016 capacity is 00000067108864 len is 00000067108864 
		 [atr] metadata zone being pushed, start 00000536870912 and write pointers 00000537174016 written data: 00000000303104
	 [all] m = 1 and i = 9 , addr 00000603979776 and wp 00000603979776 capacity is 00000067108864 len is 00000067108864 
		 [atr] metadata zone being pushed, start 00000603979776 and write pointers 00000603979776 written data: 00000000000000
	 [all] m = 2 and i = 10 , addr 00000671088640 and wp 00000671088640 capacity is 00000067108864 len is 00000067108864 
		 [atr] metadata zone being pushed, start 00000671088640 and write pointers 00000671088640 written data: 00000000000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 6.19
Date:       Tue May  4 09:41:39 2021
CPU:        40 * Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
CPUCache:   14080 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
WARNING: Optimization is disabled: benchmarks unnecessarily slow
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [rocksdbtest/dbbench]
fillrandom   :      19.850 micros/op 50378 ops/sec;    5.6 MB/s
$

From the log, what was eventually written and where

2021/05/04-09:28:40.083910 7f170499aac0 ZenFS shutting down
2021/05/04-09:28:40.083918 7f170499aac0 [DEBUG] Zone 0x2C000000 used capacity: 304 bytes (0 MB)
2021/05/04-09:28:40.083923 7f170499aac0 [DEBUG] Zone 0x30000000 used capacity: 6152 bytes (0 MB)
2021/05/04-09:28:40.083927 7f170499aac0 [DEBUG] Zone 0x34000000 used capacity: 44053053 bytes (42 MB)
2021/05/04-09:28:40.083930 7f170499aac0 [DEBUG] Zone 0x38000000 used capacity: 58307167 bytes (55 MB)
2021/05/04-09:28:40.083936 7f170499aac0   Files:
2021/05/04-09:28:40.083942 7f170499aac0     rocksdbtest/dbbench/000009.sst                sz: 43993188 lh: 3
2021/05/04-09:28:40.083946 7f170499aac0           Extent 0 {start=0x38000000, zone=14, len=43993188} 
2021/05/04-09:28:40.083952 7f170499aac0     rocksdbtest/dbbench/000010.log                sz: 14313979 lh: 2
2021/05/04-09:28:40.083955 7f170499aac0           Extent 0 {start=0x3a9f5000, zone=14, len=14313979} 
2021/05/04-09:28:40.083959 7f170499aac0     rocksdbtest/dbbench/000011.sst                sz: 44053053 lh: 3
2021/05/04-09:28:40.083963 7f170499aac0           Extent 0 {start=0x34000000, zone=13, len=44053053} 
2021/05/04-09:28:40.083967 7f170499aac0     rocksdbtest/dbbench/CURRENT                   sz: 16 lh: 0
2021/05/04-09:28:40.083970 7f170499aac0           Extent 0 {start=0x30000000, zone=12, len=16} 
2021/05/04-09:28:40.083974 7f170499aac0     rocksdbtest/dbbench/IDENTITY                  sz: 37 lh: 0
2021/05/04-09:28:40.083977 7f170499aac0           Extent 0 {start=0x2c000000, zone=11, len=37} 
2021/05/04-09:28:40.083981 7f170499aac0     rocksdbtest/dbbench/MANIFEST-000004           sz: 267 lh: 0
2021/05/04-09:28:40.083984 7f170499aac0           Extent 0 {start=0x2c003000, zone=11, len=57} 
2021/05/04-09:28:40.083987 7f170499aac0           Extent 1 {start=0x2c004000, zone=11, len=104} 
2021/05/04-09:28:40.083991 7f170499aac0           Extent 2 {start=0x2c005000, zone=11, len=106} 
2021/05/04-09:28:40.083995 7f170499aac0     rocksdbtest/dbbench/OPTIONS-000007            sz: 6136 lh: 0
2021/05/04-09:28:40.083998 7f170499aac0           Extent 0 {start=0x30001000, zone=12, len=6136} 
2021/05/04-09:28:40.084001 7f170499aac0 Sum of all files: 97 MB of data 
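As a quick check, the per-file sizes from the shutdown log add up to the reported total:

```python
# File sizes (bytes) taken from the ZenFS shutdown log above
sizes = {
    "000009.sst": 43993188,
    "000010.log": 14313979,
    "000011.sst": 44053053,
    "CURRENT": 16,
    "IDENTITY": 37,
    "MANIFEST-000004": 267,
    "OPTIONS-000007": 6136,
}
total = sum(sizes.values())
print(total // (1024 * 1024))   # -> 97, matching "Sum of all files: 97 MB"
```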

How are the different libraries put together?

  • libzbd - libzbd is a user library providing functions for manipulating zoned block devices. http://zonedstorage.io/projects/libzbd/
    • Unlike the libzbc library, libzbd does not implement direct command access to zoned block devices. Rather, libzbd uses the kernel provided zoned block device interface based on the ioctl() system call. A direct consequence of this is that libzbd will only allow access to zoned block devices supported by the kernel running. This includes both physical devices such as hard-disks supporting the ZBC and ZAC standards, as well as all logical block devices implemented by various device drivers such as nullblk and device mapper drivers.
    • Hence, we get whatever the running kernel supports, even if the device supports newer commands and standards.
    • zenfs uses this, https://github.com/westerndigitalcorporation/zenfs
      • Zenfs says that: ZenFS depends on libzbd and Linux kernel 5.4 or later to perform zone management operations. To use ZenFS on SSDs with Zoned Namespaces kernel 5.9 or later is required.
    • libzbd also has a nice GUI visualizer
    • The sudo blkzone report /dev/nullb0 command seems broken in the zone sizes it reports.
atr@node3:/home/atr/src/storage/rocksdb$ sudo blkzone report /dev/nullb0
  start: 0x000000000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000020000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000040000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000060000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000080000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x0000a0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x0000c0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x0000e0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000100000, len 0x020000, wptr 0x000250 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000120000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000140000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000160000, len 0x020000, wptr 0x000030 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000180000, len 0x020000, wptr 0x000018 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0001a0000, len 0x020000, wptr 0x015018 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0001c0000, len 0x020000, wptr 0x01bce0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0001e0000, len 0x020000, wptr 0x01d7f8 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000200000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000220000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000240000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000260000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000280000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0002a0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0002c0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0002e0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000300000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000320000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000340000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000360000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000380000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0003a0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0003c0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0003e0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000400000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000420000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000440000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000460000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000480000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0004a0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0004c0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x0004e0000, len 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
atr@node3:/home/atr/src/storage/rocksdb$ 
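A quick sanity check of the geometry in the report above (a sketch; blkzone prints values in 512-byte sector units): every zone is 0x20000 sectors, so each zone is 64 MiB, and because the zone size is a power of two the zone index of any sector is a cheap shift.

```python
# Zone geometry implied by the blkzone report above (values in 512 B sectors).
ZONE_SECTORS = 0x20000                  # the "len" field of every zone
SECTOR = 512

zone_bytes = ZONE_SECTORS * SECTOR
assert zone_bytes == 64 * 1024 * 1024   # 64 MiB zones

# Power-of-two zone size -> zone number is a shift, not a division.
shift = ZONE_SECTORS.bit_length() - 1   # 17

def zone_of(sector):
    return sector >> shift

assert zone_of(0x000100000) == 8        # first SEQ_WRITE_REQUIRED zone
assert zone_of(0x0004e0000) == 39       # last zone in the report
```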

fio support

What fio supports: http://zonedstorage.io/benchmarking/fio/

Preparing to install fio with ZNS support and libzbd; first, building libzbc:

git clone [email protected]:westerndigitalcorporation/libzbc.git
cd libzbc
atr@node1:/home/atr/src/storage/libzbc$ sh ./autogen.sh 

atr@node1:/home/atr/src/storage/libzbc$ ./configure --prefix=/home/atr/local/
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
./configure: line 2920: AX_RPM_INIT: command not found
./configure: line 2922: syntax error near unexpected token no,
./configure: line 2922: AX_CHECK_ENABLE_DEBUG(no, _DBG_)
atr@node1:/home/atr/src/storage/libzbc$ sudo apt install autoconf-archive
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  autoconf-archive
0 upgraded, 1 newly installed, 0 to remove and 124 not upgraded.
Need to get 665 kB of archives.
After this operation, 5,894 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 autoconf-archive all 20190106-2.1ubuntu1 [665 kB]
Fetched 665 kB in 0s (9,783 kB/s)        
Selecting previously unselected package autoconf-archive.
(Reading database ... 221809 files and directories currently installed.)
Preparing to unpack .../autoconf-archive_20190106-2.1ubuntu1_all.deb ...
Unpacking autoconf-archive (20190106-2.1ubuntu1) ...
Setting up autoconf-archive (20190106-2.1ubuntu1) ...
Processing triggers for install-info (6.7.0.dfsg.2-5) ...
atr@node1:/home/atr/src/storage/libzbc$ sh ./autogen.sh 
libtoolize: putting auxiliary files in AC_CONFIG_AUX_DIR, 'build-aux'.
libtoolize: copying file 'build-aux/ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
configure.ac:33: installing 'build-aux/compile'
configure.ac:25: installing 'build-aux/missing'
lib/Makefile.am: installing 'build-aux/depcomp'
atr@node1:/home/atr/src/storage/libzbc$ ./configure --prefix=/home/atr/local/
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
Not trying to build rpms for your system (use --enable-rpm-rules to override) 
checking whether to enable debugging... no
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes

atr@node1:/home/atr/src/storage/libzbc$ make install 
----------------------------------------------------------------------
Libraries have been installed in:
   /home/atr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the '-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the 'LD_RUN_PATH' environment variable
     during linking
   - use the '-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to '/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
 /usr/bin/mkdir -p '/home/atr/local/lib/pkgconfig'
 /usr/bin/install -c -m 644 libzbc.pc '/home/atr/local/lib/pkgconfig'
 /usr/bin/mkdir -p '/home/atr/local/include/libzbc'
 /usr/bin/install -c -m 644 ../include/libzbc/zbc.h '/home/atr/local/include/libzbc'
make[2]: Leaving directory '/home/atr/src/storage/libzbc/lib'
make[1]: Leaving directory '/home/atr/src/storage/libzbc/lib'
Making install in tools
make[1]: Entering directory '/home/atr/src/storage/libzbc/tools'
Making install in .
make[2]: Entering directory '/home/atr/src/storage/libzbc/tools'
make[3]: Entering directory '/home/atr/src/storage/libzbc/tools'
 /usr/bin/mkdir -p '/home/atr/local/bin'
  /bin/bash ../libtool   --mode=install /usr/bin/install -c zbc_info zbc_report_zones zbc_reset_zone zbc_open_zone zbc_close_zone zbc_finish_zone zbc_read_zone zbc_write_zone zbc_set_write_ptr zbc_set_zones gzbc gzviewer '/home/atr/local/bin'
libtool: install: /usr/bin/install -c .libs/zbc_info /home/atr/local/bin/zbc_info
libtool: install: /usr/bin/install -c .libs/zbc_report_zones /home/atr/local/bin/zbc_report_zones
libtool: install: /usr/bin/install -c .libs/zbc_reset_zone /home/atr/local/bin/zbc_reset_zone
libtool: install: /usr/bin/install -c .libs/zbc_open_zone /home/atr/local/bin/zbc_open_zone
libtool: install: /usr/bin/install -c .libs/zbc_close_zone /home/atr/local/bin/zbc_close_zone
libtool: install: /usr/bin/install -c .libs/zbc_finish_zone /home/atr/local/bin/zbc_finish_zone
libtool: install: /usr/bin/install -c .libs/zbc_read_zone /home/atr/local/bin/zbc_read_zone
libtool: install: /usr/bin/install -c .libs/zbc_write_zone /home/atr/local/bin/zbc_write_zone
libtool: install: /usr/bin/install -c .libs/zbc_set_write_ptr /home/atr/local/bin/zbc_set_write_ptr
libtool: install: /usr/bin/install -c .libs/zbc_set_zones /home/atr/local/bin/zbc_set_zones
libtool: install: /usr/bin/install -c .libs/gzbc /home/atr/local/bin/gzbc
libtool: install: /usr/bin/install -c .libs/gzviewer /home/atr/local/bin/gzviewer
 /usr/bin/mkdir -p '/home/atr/local/share/man/man8'
 /usr/bin/install -c -m 644 info/zbc_info.8 report_zones/zbc_report_zones.8 reset_zone/zbc_reset_zone.8 open_zone/zbc_open_zone.8 close_zone/zbc_close_zone.8 finish_zone/zbc_finish_zone.8 read_zone/zbc_read_zone.8 write_zone/zbc_write_zone.8 set_write_ptr/zbc_set_write_ptr.8 set_zones/zbc_set_zones.8 gui/gzbc.8 viewer/gzviewer.8 '/home/atr/local/share/man/man8'
make[3]: Leaving directory '/home/atr/src/storage/libzbc/tools'
make[2]: Leaving directory '/home/atr/src/storage/libzbc/tools'
make[1]: Leaving directory '/home/atr/src/storage/libzbc/tools'
make[1]: Entering directory '/home/atr/src/storage/libzbc'
make[2]: Entering directory '/home/atr/src/storage/libzbc'
make[2]: Nothing to be done for 'install-exec-am'.
make[2]: Nothing to be done for 'install-data-am'.
make[2]: Leaving directory '/home/atr/src/storage/libzbc'
make[1]: Leaving directory '/home/atr/src/storage/libzbc'
atr@node1:/home/atr/src/storage/libzbc$ 

Now set up the fio compilation.

atr@node3:/home/atr/src/storage/fio$ sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K 
animesh: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=8
fio-3.26-39-g6308
Starting 1 process
/dev/nullb0: rounded down io_size from 4294967296 to 2147483648
Jobs: 1 (f=0): [f(1)][100.0%][w=522MiB/s][w=134k IOPS][eta 00m:00s]
animesh: (groupid=0, jobs=1): err= 0: pid=973937: Tue May  4 12:37:48 2021
  write: IOPS=136k, BW=531MiB/s (557MB/s)(2048MiB/3855msec); 32 zone resets
    slat (usec): min=4, max=444, avg= 6.64, stdev= 2.05
    clat (nsec): min=1633, max=516246, avg=51482.56, stdev=13817.98
     lat (usec): min=7, max=524, avg=58.24, stdev=15.55
    clat percentiles (usec):
     |  1.00th=[   43],  5.00th=[   44], 10.00th=[   44], 20.00th=[   45],
     | 30.00th=[   45], 40.00th=[   45], 50.00th=[   49], 60.00th=[   50],
     | 70.00th=[   50], 80.00th=[   52], 90.00th=[   66], 95.00th=[   95],
     | 99.00th=[  103], 99.50th=[  104], 99.90th=[  122], 99.95th=[  159],
     | 99.99th=[  225]
   bw (  KiB/s): min=511640, max=569544, per=100.00%, avg=547178.29, stdev=19402.23, samples=7
   iops        : min=127910, max=142384, avg=136794.57, stdev=4850.23, samples=7
  lat (usec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=68.81%
  lat (usec)   : 100=29.19%, 250=1.97%, 500=0.01%, 750=0.01%
  cpu          : usr=11.11%, sys=36.22%, ctx=524329, majf=0, minf=28
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=2048MiB (2147MB), run=3855-3855msec

Disk stats (read/write):
  nullb0: ios=31/494194, merge=0/0, ticks=1/1904, in_queue=0, util=96.87%
atr@node3:/home/atr/src/storage/fio$ 
  • The size determines the amount to be written, and hence the number of zones reset.
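The arithmetic checks out against the run above (a sketch, using the 64 MiB zone size from the earlier blkzone report): the io_size rounded down to 2 GiB spans exactly the 32 zones fio reports resetting, and the 512 MiB offset lands on the first sequential zone.

```python
zone_bytes = 0x20000 * 512            # 64 MiB zones, from the blkzone report
io_size = 2147483648                  # "rounded down io_size ... to 2147483648"
assert io_size // zone_bytes == 32    # matches "32 zone resets"

# The 512 MiB offset is zone-aligned and starts at zone 8,
# the first SEQ_WRITE_REQUIRED zone in the blkzone report.
offset = 512 * 1024 * 1024
assert offset % zone_bytes == 0
assert offset // zone_bytes == 8
```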

FIO example combinations (from shell history; note that some early entries are broken, e.g. a missing space between --offset and --size, or misspelled ioengine names)

 1071  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=140660178944 --size=$((512*1024*1024))
 1073  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=1G 
 1074  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=2G 
 1075  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=20G 
 1076  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=20G --ioengine=libaio --iodepth=8 --rw=write --bs=256K 
 1077  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=20G --ioengine=io_uring --iodepth=8 --rw=write --bs=256K 
 1078  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=1G --ioengine=io_uring --iodepth=8 --rw=write --bs=256K 
 1079  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=1G --ioengine=io_uring --iodepth=8 --rw=write --bs=64M 
 1080  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=1G --ioengine=io_uring --iodepth=8 --rw=write --bs=128K 
 1081  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=1G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K 
 1084  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024))--size=512M --ioengine=io_uring --iodepth=8 --rw=write --bs=4K 
 1085  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=512M --ioengine=io_uring --iodepth=8 --rw=write --bs=4K 
 1086  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=1G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K 
 1087  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K 
 1088  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K 
 1089  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K --time_based 
 1090  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K --help
 1091  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K  runtime=30s --time_based
 1092  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K  --runtime=30s --time_based
 1102  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=io_uring --iodepth=8 --rw=write --bs=4K  --runtime=30s --time_based
 1103  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=10s --time_based
 1104  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=2 --rw=write --bs=4K  --runtime=10s --time_based
 1105  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=2 --rw=write --bs=4K  --runtime=10s --time_based --thread=2
 1106  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=20s --time_based --thread=2
 1107  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=20s --time_based --thread=4
 1108  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=20s --time_based --thread=1
 1146  cd /home/atr/src/storage/fio
 1147  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=20s --time_based --thread=1 
 1148  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=20s --time_based --thread=1x 
 1149  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=4G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=20s --time_based --thread=2
 1150  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=write --bs=4K  --runtime=20s --time_based --thread=2
 1151  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=write --bs=4K  --runtime=20s --time_based --thread=1
 1152  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1153  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=read --bs=4K  --runtime=30s --time_based --thread=1
 1171  cd /home/atr/src/storage/fio
 1172  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=read --bs=4K  --runtime=30s --time_based --thread=1 
 1205  cd /home/atr/src/storage/fio
 1206  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=read --bs=4K  --runtime=30s --time_based --thread=1
 1208  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1266  cd /home/atr/src/storage/fio 
 1267  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1271  #sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=64 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1272  sudo ./fio --name=animesh --filename=/dev/ram0 --direct=1 --offset=$((0*1024*1024)) --size=1G --ioengine=libaio --iodepth=64 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1273  sudo ./fio --name=animesh --filename=/dev/ram0 --direct=1 --offset=$((0*1024*1024)) --size=1G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1274  sudo ./fio --name=animesh --filename=/dev/ram0 --direct=1 --offset=$((0*1024*1024)) --size=1G --ioengine=libaio --iodepth=1 --rw=read --bs=4K  --runtime=30s --time_based --thread=1
 1275  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=libaio --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1276  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=iouring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1277  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=uring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1278  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=ioring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1279  sudo ./fio --name=animesh --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=io_uring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1280  sudo ./fio --name=animesh --numjobs=2 --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=io_uring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread=1
 1281  sudo ./fio --name=animesh --numjobs=2 --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=io_uring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --thread
 1282  sudo ./fio --name=animesh --numjobs=2 --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=io_uring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based 
 1283  sudo ./fio --name=animesh --numjobs=2 --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=io_uring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --group_reporting 
 1284  sudo ./fio --name=animesh --numjobs=2 --filename=/dev/nullb0 --direct=1 --zonemode=zbd --offset=$((512*1024*1024)) --size=2G --ioengine=io_uring --iodepth=1 --rw=write --bs=4K  --runtime=30s --time_based --group_reporting --thread

2nd round with NVMe ZNS in QEMU

What fio uses

Compiling libnvme

The test/Makefile is broken:

atr@atr-XPS-13:~/vu/github/storage/libnvme$ git diff 
diff --git a/test/Makefile b/test/Makefile
index 1620622..aafd2de 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -1,5 +1,5 @@
 CFLAGS ?= -g -O2
-override CFLAGS += -Wall -D_GNU_SOURCE -L../src/ -I../src/ -luuid
+override CFLAGS += -Wall -D_GNU_SOURCE -L../src/ -I../src/
 
 include ../Makefile.quiet
 
@@ -23,10 +23,10 @@ all: $(all_targets)
 CXXFLAGS ?= -lstdc++
 
 %: %.cc
-       $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) $(CXXFLAGS) -o $@ $< -lnvme
+       $(QUIET_CC)$(CXX) $(CFLAGS) $(LDFLAGS) $(CXXFLAGS) -o $@ $< -lnvme
 
 %: %.c
-       $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< -lnvme
+       $(QUIET_CC)$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $< -lnvme -luuid
 
 clean:
        rm -f $(all_targets)
atr@atr-XPS-13:~/vu/github/storage/libnvme$ 

QEMU boot

#!/bin/bash 

#qemu-img create -f raw znsssd.img 16777216
echo "needs qemu 6.0.0 or above" 

sudo /home/atr/src/qemu-6.0.0/build/qemu-system-x86_64 -name qemuzns -m 4G --enable-kvm -cpu host -smp 2 \
	-hda /home/atr/xfs/images/ubuntu-20.04-zns.qcow \
	-net user,hostfwd=tcp::7777-:22,hostfwd=tcp::2222-:2000 -net nic \
	-drive file=/home/atr/xfs/images/znsssd.img,id=znsd,format=raw,if=none \
	-drive file=/home/atr/xfs/images/nvmessd.img,id=nvmd,format=raw,if=none \
	-device nvme,drive=nvmd,serial=1234,physical_block_size=4096,logical_block_size=4096\
	-device nvme,serial=baz,id=nvme2,zoned.zasl=7\
	-device nvme-ns,id=ns2,drive=znsd,nsid=2,logical_block_size=4096,physical_block_size=4096,zoned=true,zoned.zone_size=131072,zoned.zone_capacity=131072,zoned.max_open=0,zoned.max_active=0,bus=nvme2


# https://github.com/qemu/qemu/blob/master/hw/nvme/ctrl.c
# * - `zoned.zasl`
# *   Indicates the maximum data transfer size for the Zone Append command. Like
# *   `mdts`, the value is specified as a power of two (2^n) and is in units of
# *   the minimum memory page size (CAP.MPSMIN). The default value is 0 (i.e.
# *   defaulting to the value of `mdts`).
# *

#  * Setting `zoned` to true selects Zoned Command Set at the namespace.
# * In this case, the following namespace properties are available to configure
# * zoned operation:
# *     zoned.zone_size=<zone size in bytes, default: 128MiB>
# *         The number may be followed by K, M, G as in kilo-, mega- or giga-.
# *
# *     zoned.zone_capacity=<zone capacity in bytes, default: zone size>
# *         The value 0 (default) forces zone capacity to be the same as zone
# *         size. The value of this property may not exceed zone size.
# *
# *     zoned.descr_ext_size=<zone descriptor extension size, default 0>
# *         This value needs to be specified in 64B units. If it is zero,
# *         namespace(s) will not support zone descriptor extensions.
# *
# *     zoned.max_active=<Maximum Active Resources (zones), default: 0>
# *         The default value means there is no limit to the number of
# *         concurrently active zones.
# *
# *     zoned.max_open=<Maximum Open Resources (zones), default: 0>
# *         The default value means there is no limit to the number of
# *         concurrently open zones.
# *
# *     zoned.cross_read=<enable RAZB, default: false>
# *         Setting this property to true enables Read Across Zone Boundaries.
# */
# -- older version 
sudo /home/atr/src/qemu-6.0.0/build/qemu-system-x86_64 -name qemuzns -m 4G --enable-kvm -cpu host -smp 2 \
	-hda /home/atr/xfs/images/ubuntu-20.04-zns.qcow \
	-net user,hostfwd=tcp::7777-:22,hostfwd=tcp::2222-:2000 -net nic \
	-drive file=/home/atr/xfs/images/znsssd.img,id=mynvme,format=raw,if=none \
	-device nvme,serial=baz,id=nvme2,zoned.zasl=7\
	-device nvme-ns,id=ns2,drive=mynvme,nsid=2,logical_block_size=4096,physical_block_size=4096,zoned=true,zoned.zone_size=131072,zoned.zone_capacity=131072,zoned.max_open=0,zoned.max_active=0,bus=nvme2
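The namespace geometry these options produce can be checked by hand (a sketch; the 16777216-byte image size comes from the commented qemu-img line above, and per the QEMU comments zoned.zone_size is in bytes, so 131072 means 128 KiB zones):

```python
img_bytes  = 16777216      # qemu-img create -f raw znsssd.img 16777216
zone_bytes = 131072        # zoned.zone_size (bytes)
lb_bytes   = 4096          # logical_block_size

assert zone_bytes // lb_bytes == 0x20   # zsze reported by `nvme zns id-ns`
assert img_bytes // zone_bytes == 128   # nr_zones reported in sysfs
assert zone_bytes // 512 == 256         # chunk_sectors (512 B sector units)
```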

All code is at https://github.com/animeshtrivedi/zns-resources

All nvme-cli examples

http://zonedstorage.io/projects/zns/

They have basic r/w tests with zone appends as well. This worked in the VM:

atr@stosys-qemu-vm:/home/atr/src/zns-resources/scripts$ cat /sys/block/nvme0n1/queue/zoned
host-managed
atr@stosys-qemu-vm:/home/atr/src/zns-resources/scripts$ cat /sys/block/nvme0n1/queue/chunk_sectors
256
atr@stosys-qemu-vm:/home/atr/src/zns-resources/scripts$ cat /sys/block/nvme0n1/queue/nr_zones 
128
atr@stosys-qemu-vm:/home/atr/src/zns-resources/scripts$ sudo nvme zns id-ctrl /dev/nvme0n1 
NVMe ZNS Identify Controller:
zasl    : 7
atr@stosys-qemu-vm:/home/atr/src/zns-resources/scripts$ sudo nvme zns id-ns /dev/nvme0n1
ZNS Command Set Identify Namespace:
zoc     : 0
ozcs    : 0
mar     : 0xffffffff
mor     : 0xffffffff
rrl     : 0
frl     : 0
lbafe  0: zsze:0x20 zdes:0
lbafe  1: zsze:0x20 zdes:0 (in use)
atr@stosys-qemu-vm:/home/atr/src/zns-resources/scripts$ echo "hello world" | sudo nvme zns zone-append /dev/nvme0n1 -z 4096
Success appended data to LBA 2
atr@stosys-qemu-vm:/home/atr/src/zns-resources/scripts$ 
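Per the QEMU comment earlier (zasl is a power of two in units of CAP.MPSMIN), zasl=7 bounds the data transfer of a single Zone Append. Assuming the common 4 KiB minimum memory page size, that works out to 512 KiB per append:

```python
zasl = 7                      # from `nvme zns id-ctrl` above
MPSMIN = 4096                 # assumed minimum memory page size (CAP.MPSMIN)

max_append = (1 << zasl) * MPSMIN
assert max_append == 512 * 1024   # 512 KiB per Zone Append command
```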

Understanding libnvme and nvme-cli

  • They have common, but renamed, structure definitions in:
    • nvme-cli/linux/nvme.h
    • libnvme/src/nvme/types.h
  • nvme-cli has support for SUBMIT_IO and PASSTHROUGH (passthrough goes directly to the device)
    • SUBMIT_IO needs support from the kernel for a particular feature and command
  • libnvme only has passthrough, as it does not use the SUBMIT_IO command set

Where the QEMU NVMe controller parameters are defined

https://github.com/qemu/qemu/blob/master/hw/nvme/ctrl.c#L23

Example of a feature ID query

atr@node3:/home/atr/src/zns-resources$ sudo nvme get-feature -H -f 1  /dev/nvme1n1  
get-feature:0x1 (Arbitration), Current value:00000000
	High Priority Weight   (HPW): 1
	Medium Priority Weight (MPW): 1
	Low Priority Weight    (LPW): 1
	Arbitration Burst       (AB): 1
atr@node3:/home/atr/src/zns-resources$ 

Summarizing notes from the development cycle May 17-25th on NVMe/ZNS

All code is in the zns-example repo.

  • The NVMe equivalent of the TRIM command is Deallocate, and Write Uncorrectable artificially introduces errors so that certain LBA ranges cannot be read.
  • There is a bit of a mess between how libnvme is developed (missing error codes) and the examples in nvme-cli. The current zns-example repo has code copied from all over the place.
  • An LBA is an address in +1 increments (block indexes), not in LBA_SIZE byte offsets.
  • nvme-cli code is in nvme_ioctl.[ch] files.
    • It also has a passthrough function but it is not used anywhere.
  • libnvme has types.h with many definitions, and the remaining logic in ioctl.[ch] files. It only uses passthrough.
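The LBA point above in code form (a sketch, using the 4096-byte LBA size from the VM namespace as an illustration): an LBA is a block index that advances by 1 per logical block, and byte offsets come from multiplying by the LBA size.

```python
LBA_SIZE = 4096                    # illustrative; e.g. lbads=12 on the VM namespace

def lba_to_byte_offset(lba):
    return lba * LBA_SIZE          # byte offset into the namespace

# Writing one 4 KiB block at LBA 2 starts at byte 8192,
# and the next block is LBA 3 (+1), not LBA 2 + 4096.
assert lba_to_byte_offset(2) == 8192
assert lba_to_byte_offset(3) - lba_to_byte_offset(2) == LBA_SIZE
```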

    Why are there structures padded to 4096?

The 4096-byte structures are response buffers for the (always 64 B) command. See the CSI namespace and controller identify examples in main.cpp.

General comments

  • NVMe commands have a common structure, with DW10–DW15 available for command-specific customization. See section 4.2 and figure 105 for the common command structure in the NVMe 1.4 base specification.

  • Zones can be managed using the Zone Management send/recv commands. See sections 4.3 and 4.4. There is working code. I have not tried supporting the extended report attributes.

  • ZNS read/write commands are the same as in the NVMe base specification. Only zone management and appends are new, along with the associated logic for managing zones and their state transitions.
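To make the command layout concrete, here is a sketch of the 64-byte submission entry (field placement follows figure 105 of the NVMe 1.4 base spec; the function and field names are illustrative, not a real library API): the command is 16 dwords, with DW0 carrying the opcode and command ID, DW1 the NSID, and DW10–DW15 free for command-specific use.

```python
import struct

def build_sqe(opcode, cid, nsid, cdw10=0, cdw11=0, cdw12=0,
              cdw13=0, cdw14=0, cdw15=0):
    """Pack a 64 B NVMe submission queue entry (16 dwords, DW0-DW15)."""
    dwords = [0] * 16
    dwords[0] = (cid << 16) | opcode   # DW0: CID in [31:16], opcode in [7:0]
    dwords[1] = nsid                   # DW1: namespace identifier
    dwords[10:16] = [cdw10, cdw11, cdw12, cdw13, cdw14, cdw15]
    return struct.pack('<16I', *dwords)

sqe = build_sqe(opcode=0x06, cid=1, nsid=1, cdw10=0x05)
assert len(sqe) == 64              # the command is always 64 B
```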

How do I find out the transfer units for an NVMe device?

The LBA size can be extracted from the Identify Namespace command, like:

atr@node3:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme id-ns /dev/nvme1n1 
NVME Identify Namespace 1:
nsze    : 0x209a97b0
ncap    : 0x209a97b0
nuse    : 0x209a97b0
nsfeat  : 0
nlbaf   : 0
flbas   : 0
mc      : 0
dpc     : 0
dps     : 0
nmic    : 0
rescap  : 0
fpi     : 0
dlfeat  : 0
nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0
noiob   : 0
nvmcap  : 0
mssrl   : 0
mcl     : 0
msrc    : 0
anagrpid: 0
nsattr	: 0
nvmsetid: 1
endgid  : 1
nguid   : 00000000000000000000000000000000
eui64   : 0000000000000000
lbaf  0 : ms:0   lbads:9  rp:0x2 (in use)  # This is 9, 2^9 = 512 bytes 
atr@node3:/home/atr/src/zns-resources/zns-rw-example$
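Since `lbads` is a log2 value, the actual LBA size and the namespace capacity fall out with a shift. A small sketch using the numbers from the dump above:

```c
#include <stdint.h>

/* lbads is log2 of the LBA size in bytes. */
uint64_t lba_size_bytes(uint8_t lbads)
{
    return 1ull << lbads;
}

/* nsze is the namespace size in logical blocks, so total bytes = nsze << lbads. */
uint64_t ns_capacity_bytes(uint64_t nsze, uint8_t lbads)
{
    return nsze << lbads;
}
```

With `lbads = 9` this gives 512-byte LBAs, and `nsze = 0x209a97b0` blocks times 512 bytes gives the raw namespace capacity.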

Another example, from the VM:

atr@stosys-qemu-vm:~$ sudo nvme id-ns /dev/nvme0n1 
NVME Identify Namespace 1:
nsze    : 0x800
ncap    : 0x800
nuse    : 0x800
nsfeat  : 0x14
nlbaf   : 1
flbas   : 0x1
mc      : 0
dpc     : 0
dps     : 0
nmic    : 0
rescap  : 0
fpi     : 0
dlfeat  : 1
nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0
noiob   : 0
nvmcap  : 0
npwg    : 0
npwa    : 0
npdg    : 0
npda    : 0
nows    : 0
mssrl   : 0
mcl     : 0
msrc    : 0
anagrpid: 0
nsattr	: 0
nvmsetid: 0
endgid  : 0
nguid   : 00000000000000000000000000000000
eui64   : 0000000000000000
lbaf  0 : ms:0   lbads:9  rp:0 
lbaf  1 : ms:0   lbads:12 rp:0 (in use)
atr@stosys-qemu-vm:~$

It has two supported LBA sizes, 512 bytes and 4096 bytes. The latter is in use.

I also print this in the ZNS example code

a specific device name is passed : /dev/nvme1n1 
device /dev/nvme1n1 opened successfully 3 
nsze    : 0x1000
ncap    : 0x1000
nuse    : 0x1000
nsfeat  : 0x14
  [4:4] : 0x1	NPWG, NPWA, NPDG, NPDA, and NOWS are Supported
  [3:3] : 0	NGUID and EUI64 fields if non-zero, Reused
  [2:2] : 0x1	Deallocated or Unwritten Logical Block error Supported
  [1:1] : 0	Namespace uses AWUN, AWUPF, and ACWU
  [0:0] : 0	Thin Provisioning Not Supported

nlbaf   : 1
flbas   : 0x1
  [4:4] : 0	Metadata Transferred in Separate Contiguous Buffer
  [3:0] : 0x1	Current LBA Format Selected

mc      : 0
  [1:1] : 0	Metadata Pointer Not Supported
  [0:0] : 0	Metadata as Part of Extended Data LBA Not Supported

dpc     : 0
  [4:4] : 0	Protection Information Transferred as Last 8 Bytes of Metadata Not Supported
  [3:3] : 0	Protection Information Transferred as First 8 Bytes of Metadata Not Supported
  [2:2] : 0	Protection Information Type 3 Not Supported
  [1:1] : 0	Protection Information Type 2 Not Supported
  [0:0] : 0	Protection Information Type 1 Not Supported

dps     : 0
  [3:3] : 0	Protection Information is Transferred as Last 8 Bytes of Metadata
  [2:0] : 0	Protection Information Disabled

nmic    : 0
  [0:0] : 0	Namespace Multipath Not Capable

rescap  : 0
  [7:7] : 0	Ignore Existing Key - Used as defined in revision 1.2.1 or earlier
  [6:6] : 0	Exclusive Access - All Registrants Not Supported
  [5:5] : 0	Write Exclusive - All Registrants Not Supported
  [4:4] : 0	Exclusive Access - Registrants Only Not Supported
  [3:3] : 0	Write Exclusive - Registrants Only Not Supported
  [2:2] : 0	Exclusive Access Not Supported
  [1:1] : 0	Write Exclusive Not Supported
  [0:0] : 0	Persist Through Power Loss Not Supported

fpi     : 0
  [7:7] : 0	Format Progress Indicator Not Supported

dlfeat  : 1
  [4:4] : 0	Guard Field of Deallocated Logical Blocks is set to 0xFFFF
  [3:3] : 0	Deallocate Bit in the Write Zeroes Command is Not Supported
  [2:0] : 0x1	Bytes Read From a Deallocated Logical Block and its Metadata are 0x00

nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0
noiob   : 0
nvmcap  : 0
npwg    : 0
npwa    : 0
npdg    : 0
npda    : 0
nows    : 0
mssrl   : 128
mcl     : 128
msrc    : 127
anagrpid: 0
nsattr	: 0
nvmsetid: 0
endgid  : 0
nguid   : 00000000000000000000000000000000
eui64   : 0000000000000000
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best 
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
size of vs is 3712 
number of LBA formats? 1 (a zero based value) 
the LBA size is 4096
...

So, the LBA size is the minimum transfer unit. There are more data structures associated with the controller too, like

  • Maximum Data Transfer Unit (MDTS)
  • CAP.MPSMIN
Getting MDTS
atr@stosys-qemu-vm:~$ sudo nvme id-ctrl /dev/nvme0n1 
NVME Identify Controller:
vid       : 0x1b36
ssvid     : 0x1af4
sn        : 1234                
mn        : QEMU NVMe Ctrl                          
fr        : 1.0     
rab       : 6
ieee      : 525400
cmic      : 0
mdts      : 7
cntlid    : 0
ver       : 0x10400
rtd3r     : 0
rtd3e     : 0
oaes      : 0x100
ctratt    : 0
rrls      : 0
cntrltype : 1
fguid     : 
crdt1     : 0
crdt2     : 0
crdt3     : 0
oacs      : 0xa
acl       : 3
aerl      : 3
frmw      : 0x3
lpa       : 0x7
elpe      : 0
npss      : 0
avscc     : 0
apsta     : 0
wctemp    : 343
cctemp    : 373
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
edstt     : 0
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0
mntmt     : 0
mxtmt     : 0
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 256
oncs      : 0x15d
fuses     : 0
fna       : 0
vwc       : 0x7
awun      : 0
awupf     : 0
icsvscc     : 0
nwpc      : 0
acwu      : 0
sgls      : 0x10001
mnan      : 0
subnqn    : nqn.2019-08.org.qemu:1234
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps    0 : mp:25.00W operational enlat:16 exlat:4 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
atr@stosys-qemu-vm:~$

See the NVMe Identify Controller data structure, figure 251, byte 77, in the 1.4 specification.

MDTS is 7; the value is in units of the minimum memory page size (CAP.MPSMIN) and is reported as a power of two (2^n).

See below: CAP.MPSMIN is 4096 bytes, so the MDTS for this device is 2^7 x 4 KiB = 128 x 4 KiB = 512 KiB.

So what is the CAP.MPSMIN size?

Memory Page Size Minimum (MPSMIN): This field indicates the minimum host memory page size that the controller supports. The minimum memory page size is (2 ^ (12 + MPSMIN)). The host shall not configure a memory page size in CC.MPS that is smaller than this value.
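Putting the two fields together, the maximum transfer size follows from the formulae quoted above:

```c
#include <stdint.h>

/* CAP.MPSMIN: minimum host memory page size = 2^(12 + MPSMIN) bytes. */
uint64_t mpsmin_bytes(uint8_t mpsmin)
{
    return 1ull << (12 + mpsmin);
}

/* MDTS: maximum data transfer = 2^MDTS units of the minimum page size.
 * (An MDTS of 0 means no limit; not handled in this sketch.) */
uint64_t mdts_bytes(uint8_t mdts, uint8_t mpsmin)
{
    return (1ull << mdts) * mpsmin_bytes(mpsmin);
}
```

For the QEMU device above (mdts = 7, MPSMIN = 0, i.e. 4096-byte pages), this gives 128 x 4 KiB = 512 KiB.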

This information is in the Offset 0h: CAP – Controller Capabilities register. See section 3.1.1 in the NVMe specification (figure 69). How do I read this register? With the nvme-cli command:

atr@stosys-qemu-vm:~$ sudo nvme show-regs -H /dev/nvme1 
cap     : 4018200f0107ff
	Controller Memory Buffer Supported (CMBS): The Controller Memory Buffer is Not Supported
	Persistent Memory Region Supported (PMRS): The Persistent Memory Region is Not Supported
	Memory Page Size Maximum         (MPSMAX): 65536 bytes
	Memory Page Size Minimum         (MPSMIN): 4096 bytes
	Boot Partition Support              (BPS): No
	Command Sets Supported              (CSS): NVM command set is Supported
	                                           One or more I/O Command Sets are Supported
	NVM Subsystem Reset Supported     (NSSRS): No
	Doorbell Stride                   (DSTRD): 4 bytes
	Timeout                              (TO): 7500 ms
	Arbitration Mechanism Supported     (AMS): Weighted Round Robin with Urgent Priority Class is not supported
	Contiguous Queues Required          (CQR): Yes
	Maximum Queue Entries Supported    (MQES): 2048

version : 10400
	NVMe specification 1.4

cc      : 460061
	I/O Completion Queue Entry Size (IOCQES): 16 bytes
	I/O Submission Queue Entry Size (IOSQES): 64 bytes
	Shutdown Notification              (SHN): No notification; no effect
	Arbitration Mechanism Selected     (AMS): Round Robin
	Memory Page Size                   (MPS): 4096 bytes
	I/O Command Set Selected           (CSS): All supported I/O Command Sets
	Enable                              (EN): Yes

csts    : 1
	Processing Paused               (PP): No
	NVM Subsystem Reset Occurred (NSSRO): No
	Shutdown Status               (SHST): Normal operation (no shutdown has been requested)
	Controller Fatal Status        (CFS): False
	Ready                          (RDY): Yes

nssr    : 0
	NVM Subsystem Reset Control (NSSRC): 0

intms   : 0
	Interrupt Vector Mask Set (IVMS): 0

intmc   : 0
	Interrupt Vector Mask Clear (IVMC): 0

aqa     : 1f001f
	Admin Completion Queue Size (ACQS): 32
	Admin Submission Queue Size (ASQS): 32

asq     : 11112d000
	Admin Submission Queue Base (ASQB): 11112d000

acq     : 111362000
	Admin Completion Queue Base (ACQB): 111362000

cmbloc  : 0
	Controller Memory Buffer feature is not supported

cmbsz   : 0
	Controller Memory Buffer feature is not supported

bpinfo  : 0
	Boot Partition feature is not supported

bprsel  : 0
	Boot Partition feature is not supported

bpmbl   : 0
	Boot Partition feature is not supported

cmbmsc	: 0
	Controller Base Address         (CBA): 0
	Controller Memory Space Enable (CMSE): 0
	Capabilities Registers Enabled  (CRE): CMBLOC and CMBSZ registers are NOT enabled

cmbsts	: 0
	Controller Base Address Invalid (CBAI): 0

pmrcap  : 0
	Controller Memory Space Supported                   (CMSS): Referencing PMR with host supplied addresses is Not Supported
	Persistent Memory Region Timeout                   (PMRTO): 0
	Persistent Memory Region Write Barrier Mechanisms (PMRWBM): 0
	Persistent Memory Region Time Units                (PMRTU): PMR time unit is 500 milliseconds
	Base Indicator Register                              (BIR): 0
	Write Data Support                                   (WDS): Write data to the PMR is not supported
	Read Data Support                                    (RDS): Read data from the PMR is not supported
pmrctl  : 0
	Enable (EN): PMR is Disabled
pmrsts  : 0
	Controller Base Address Invalid (CBAI): 0
	Health Status                   (HSTS): Normal Operation
	Not Ready                       (NRDY): The Persistent Memory Region is Not Ready to process PCI Express memory read and write requests
	Error                            (ERR): 0
pmrebs  : 0
	PMR Elasticity Buffer Size Base  (PMRWBZ): 0
	Read Bypass Behavior                     : memory reads not conflicting with memory writes in the PMR Elasticity Buffer MAY bypass those memory writes
	PMR Elasticity Buffer Size Units (PMRSZU): Bytes
pmrswtp : 0
	PMR Sustained Write Throughput       (PMRSWTV): 0
	PMR Sustained Write Throughput Units (PMRSWTU): Bytes/second
pmrmscl	: 0
	Controller Base Address         (CBA): 0
	Controller Memory Space Enable (CMSE): 0

pmrmscu	: 0
	Controller Base Address         (CBA): 0
atr@stosys-qemu-vm:~$

Figure 78, section 3.1.5 shows the register (CC) that can be used to change these values. I have not tried changing these values yet. It does not look like it is supported directly by the nvme-cli command. For example, you can set the Memory Page Size (MPS) in bits 10:07.

How to show the NVMe version used and supported

See above, the controller register dump also shows the supported NVMe version. We have

  • on Dell XPS version : 10300 NVMe specification 1.3
  • on Node 3 version : 10000 NVMe specification 1.0
  • Inside the Ubuntu VM version : 10400 NVMe specification 1.4

How to read the NVMe controller PCIe/NVMe registers

With the command nvme show-regs -H /dev/nvme1 (needs the character device)

CNS data structures and CSI extensions

Section 5.15 has a set of identify commands and the associated components that should reply to them. There is the CNS field (figure 245) with its associated values in figure 248. The two important ones are namespace (0) and controller (1). The namespace response data structure is defined in figure 249 and is 4096 bytes; the controller one is in figure 251.

The values in figure 248 are not aware of the ZNS extensions. Hence, libnvme has new non-standard placeholder values for CNS (in ioctl.h):

enum nvme_identify_cns {
	NVME_IDENTIFY_CNS_NS					= 0x00,
	NVME_IDENTIFY_CNS_CTRL					= 0x01,
	NVME_IDENTIFY_CNS_NS_ACTIVE_LIST			= 0x02,
	NVME_IDENTIFY_CNS_NS_DESC_LIST				= 0x03,
	NVME_IDENTIFY_CNS_NVMSET_LIST				= 0x04,
	NVME_IDENTIFY_CNS_CSI_NS				= 0x05, /* XXX: Placeholder until assigned */
	NVME_IDENTIFY_CNS_CSI_CTRL				= 0x06, /* XXX: Placeholder until assigned */
	NVME_IDENTIFY_CNS_ALLOCATED_NS_LIST			= 0x10,
	NVME_IDENTIFY_CNS_ALLOCATED_NS				= 0x11,
	NVME_IDENTIFY_CNS_NS_CTRL_LIST				= 0x12,
	NVME_IDENTIFY_CNS_CTRL_LIST				= 0x13,
	NVME_IDENTIFY_CNS_PRIMARY_CTRL_CAP			= 0x14,
	NVME_IDENTIFY_CNS_SECONDARY_CTRL_LIST			= 0x15,
	NVME_IDENTIFY_CNS_NS_GRANULARITY			= 0x16,
	NVME_IDENTIFY_CNS_UUID_LIST				= 0x17,
	NVME_IDENTIFY_CNS_CSI_ALLOCATED_NS			= 0x18, /* XXX: Placeholder until assigned */
};

You can see the new placeholders. They also use a new Command Set Identifier (CSI):

/**
 * enum nvme_csi - Defined command set indicators
 * @NVME_CSI_NVM:	NVM Command Set Indicator
 */
enum nvme_csi {
	NVME_CSI_NVM			= 0,
	NVME_CSI_ZNS			= 2,
};

The combination of CNS and CSI is now used to identify the controller and namespace. This is how the identify command is packed

int nvme_identify(int fd, enum nvme_identify_cns cns, __u32 nsid, __u16 cntid,
		  __u16 nvmsetid, __u8 uuidx, __u8 csi, void *data)
{
	__u32 cdw10 = NVME_SET(cntid, IDENTIFY_CDW10_CNTID) |
			NVME_SET(cns, IDENTIFY_CDW10_CNS);
	__u32 cdw11 = NVME_SET(nvmsetid, IDENTIFY_CDW11_NVMSETID) |
			NVME_SET(csi, IDENTIFY_CDW11_CSI);
	__u32 cdw14 = NVME_SET(uuidx, IDENTIFY_CDW14_UUID);

See figure 245 (DWord10) and 246 (DWord11) for packing the identify command. The funny thing is that NVME_CSI_NVM is artificial, as it has the value zero. Hence, an all-zero DWord11 means a regular NVMe device.
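A sketch of what the `NVME_SET` packing above expands to, using the bit positions from the libnvme code (CNS in CDW10 bits 07:00, CNTID in bits 31:16; NVMSETID in CDW11 bits 15:00, CSI in bits 31:24):

```c
#include <stdint.h>

/* CDW10: CNTID in bits 31:16, CNS in bits 07:00. */
uint32_t identify_cdw10(uint16_t cntid, uint8_t cns)
{
    return ((uint32_t)cntid << 16) | cns;
}

/* CDW11: CSI in bits 31:24, NVMSETID in bits 15:00. */
uint32_t identify_cdw11(uint8_t csi, uint16_t nvmsetid)
{
    return ((uint32_t)csi << 24) | nvmsetid;
}
```

For a ZNS namespace identify, `identify_cdw10(0, 0x05)` and `identify_cdw11(2, 0)` produce the two dwords; and indeed a plain NVM-command-set identify leaves CDW11 all zero.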

Why are we doing this? To identify whether a device is a ZNS device or a regular NVMe device.

So now, there are a few possible combinations:

  • Using the standard defined combination: NVME_IDENTIFY_CNS_NS and NVME_IDENTIFY_CNS_CTRL (and the implicit CSI_NVM)
    • They return the well-defined struct nvme_id_ns (figure 249) and struct nvme_id_ctrl (figure 251) as their responses. This only works on standard NVMe devices.
    • In QEMU, which supports these CSI extensions, we get two different data structures for the controller combinations
      • NVME_IDENTIFY_CNS_CTRL with NVME_CSI_NVM = struct nvme_id_ctrl_nvm (I am not sure what this is modeled after?) (works with normal QEMU NVMe and ZNS devices)
      • NVME_IDENTIFY_CNS_CTRL with NVME_CSI_ZNS = struct nvme_zns_id_ctrl (defined in the ZNS specification at figure 10, section 3.1.2) (works with normal QEMU NVMe and ZNS devices)
    • For QEMU namespaces:
      • NVME_IDENTIFY_CNS_NS with NVME_CSI_NVM = struct nvme_id_ns (I am not sure if the NVM CSI has another ds?) (works with normal QEMU NVMe and ZNS devices)
      • NVME_IDENTIFY_CNS_NS with NVME_CSI_ZNS = struct nvme_zns_id_ns (defined in the ZNS specification at figure 10, section 3.1.2) (fails with a normal NVMe device, works with ZNS devices). This failure of the ZNS Identify Namespace is currently how I figure out whether a device is ZNS.
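That detection reduces to a probe. A sketch of the pattern; `identify_probe` stands in for the real passthrough identify call (returning the NVMe completion status, 0 on success), and `mock_probe` is a hypothetical stand-in used only to exercise the logic:

```c
#include <stdbool.h>

/* Signature of a probe that issues Identify Namespace with a given CSI
 * and returns the NVMe completion status (0 = success). */
typedef int (*identify_probe)(unsigned nsid, unsigned csi);

/* A namespace is treated as ZNS iff the CSI=ZNS (2) identify succeeds. */
bool ns_is_zns(unsigned nsid, identify_probe probe)
{
    return probe(nsid, /* CSI_ZNS = */ 2) == 0;
}

/* Hypothetical mock standing in for the real passthrough call: pretends
 * namespace 1 is ZNS and every other namespace is plain NVM. */
static int mock_probe(unsigned nsid, unsigned csi)
{
    if (csi != 2)
        return 0;                 /* CSI=NVM identify always succeeds */
    return nsid == 1 ? 0 : 0x10b; /* non-zero status: not a ZNS namespace */
}
```

The real implementation would pass a probe that wraps `nvme_identify()` with `NVME_IDENTIFY_CNS_CSI_NS` and `NVME_CSI_ZNS`.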

QEMU NVMe ZNS devices are not persistent

A QEMU VM restart unconditionally re-initializes the zones inside the device. See here: https://github.com/qemu/qemu/blob/master/hw/nvme/ns.c#L235

This could be a small project to fix. There are clear shutdown and init functions that could be used to write the zone metadata and data out on shutdown and read them back on init.

Zone Append Example

atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns zone-append -zslba 0 -z 4096 -d ./4096slba0 -f /dev/nvme1n1 
Success appended data to LBA 2
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns zone-append -zslba 0 -z 4096 -d ./4096slba0 -f /dev/nvme1n1 
Success appended data to LBA 3
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns zone-append -zslba 0 -z 4096 -d ./4096slba0 -f /dev/nvme1n1 
Success appended data to LBA 4
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns zone-append -zslba 0 -z 4096 -d ./4096slba0 -f /dev/nvme1n1 
Success appended data to LBA 5
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns zone-append -zslba 0 -z 4096 -d ./4096slba0 -f /dev/nvme1n1 
Success appended data to LBA 6
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns zone-append -zslba 0 -z 4096 -d ./4096slba0 -f /dev/nvme1n1 
Success appended data to LBA 7
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns zone-append -zslba 0 -z 4096 -d ./4096slba0 -f /dev/nvme1n1 
Success appended data to LBA 8
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns report-zones /dev/nvme1n1 | less
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$ sudo nvme zns report-zones /dev/nvme1n1 | head -5 
nr_zones: 128
SLBA: 0x0        WP: 0x9        Cap: 0x20       State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x20       WP: 0x20       Cap: 0x20       State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x40       WP: 0x40       Cap: 0x20       State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x60       WP: 0x60       Cap: 0x20       State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
atr@stosys-qemu-vm:/home/atr/src/zns-resources/zns-rw-example$
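The append semantics visible in the transcript (each command returns the LBA it landed at, then the write pointer advances) can be modeled as:

```c
#include <stdint.h>

struct zone {
    uint64_t slba;  /* starting LBA of the zone */
    uint64_t wp;    /* current write pointer */
    uint64_t cap;   /* usable blocks in the zone */
};

/* Zone Append: the device picks the write location (the current WP),
 * returns it to the host, and advances the WP by the appended blocks.
 * Returns (uint64_t)-1 if the append would exceed the zone capacity. */
uint64_t zone_append(struct zone *z, uint64_t nblocks)
{
    if (z->wp + nblocks > z->slba + z->cap)
        return (uint64_t)-1;
    uint64_t written_at = z->wp;
    z->wp += nblocks;
    return written_at;
}
```

On this 4096-byte-LBA namespace, each 4096-byte append is one block, which is why consecutive commands report LBA 2, 3, 4, … and the report shows `WP: 0x9` afterwards.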

Using a sudo alias to resolve path issues

alias sudo='sudo env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"'

QEMU

atr@atr-XPS-13:~/vu/github/animeshtrivedi/qemu$ ./configure --target-list=x86_64-softmmu --enable-kvm --enable-linux-aio --enable-trace-backends=log --disable-werror 

All QEMU NVMe config

atr@node3:/home/atr/src/zns-resources/scripts$ qemu-system-x86_64 -device nvme,help
nvme options:
  acpi-index=<uint32>    -  (default: 0)
  addr=<int32>           - Slot and optional function number, example: 06.0 or 06 (default: -1)
  aer_max_queued=<uint32> -  (default: 64)
  aerl=<uint8>           -  (default: 3)
  bootindex=<int32>
  cmb_size_mb=<uint32>   -  (default: 0)
  discard_granularity=<size> -  (default: 4294967295)
  drive=<str>            - Node name or ID of a block device to use as a backend
  failover_pair_id=<str>
  logical_block_size=<size> - A power of two between 512 B and 2 MiB (default: 0)
  max_ioqpairs=<uint32>  -  (default: 64)
  mdts=<uint8>           -  (default: 7)
  min_io_size=<size>     -  (default: 0)
  msix_qsize=<uint16>    -  (default: 65)
  multifunction=<bool>   - on/off (default: false)
  num_queues=<uint32>    -  (default: 0)
  opt_io_size=<size>     -  (default: 0)
  physical_block_size=<size> - A power of two between 512 B and 2 MiB (default: 0)
  pmrdev=<link<memory-backend>>
  rombar=<uint32>        -  (default: 1)
  romfile=<str>
  romsize=<uint32>       -  (default: 4294967295)
  serial=<str>
  share-rw=<bool>        -  (default: false)
  smart_critical_warning=<uint8>
  subsys=<link<nvme-subsys>>
  use-intel-id=<bool>    -  (default: false)
  vsl=<uint8>            -  (default: 7)
  write-cache=<OnOffAuto> - on/off/auto (default: "auto")
  x-pcie-extcap-init=<bool> - on/off (default: true)
  x-pcie-lnksta-dllla=<bool> - on/off (default: true)
  zoned.zasl=<uint8>     -  (default: 0)
atr@node3:/home/atr/src/zns-resources/scripts$
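Note that the zoned-namespace parameters live on the separate `nvme-ns` device rather than on the controller shown above. A sketch of how a 1 MB-zone, 4096-byte-LBA, 1 GB device (as used in the RocksDB run below) could be created; parameter names are taken from QEMU's `hw/nvme/ns.c` around the 6.x releases, so verify them against your build with `qemu-system-x86_64 -device nvme-ns,help`:

```shell
# Backing file for the namespace data (1 GiB)
qemu-img create -f raw zns.raw 1G

# ZNS namespace attached to an nvme controller; other VM arguments elided
qemu-system-x86_64 ... \
  -drive file=zns.raw,id=znsdrv,format=raw,if=none \
  -device nvme,serial=zns-dev,id=nvme1 \
  -device nvme-ns,drive=znsdrv,bus=nvme1,nsid=1,zoned=true,logical_block_size=4096,physical_block_size=4096,zoned.zone_size=1M,zoned.zone_capacity=1M
```

With these numbers the namespace has 1024 zones of 256 LBAs each, matching the `nr_zones: 1024` / `Cap: 0x100` report below.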

RocksDB setup

The device is too small:

atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ./db_bench --fs_uri=zenfs://dev:nvme1n1 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction  --compression_type=none 
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 6.20
Date:       Tue May 25 13:09:47 2021
CPU:        8 * Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
CPUCache:   16384 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [rocksdbtest/dbbench]
put error: IO error: No space left on device: Zone allocation failure

Then I made a device with 1 MB zones, 4096-byte LBAs, and a 1 GB total size:

atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo ./db_bench --fs_uri=zenfs://dev:nvme1n1 --benchmarks=fillrandom --use_direct_io_for_flush_and_compaction  --compression_type=none 
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 6.20
Date:       Tue May 25 13:24:08 2021
CPU:        8 * Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
CPUCache:   16384 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: NoCompression
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [rocksdbtest/dbbench]
fillrandom   :       2.448 micros/op 408442 ops/sec;   45.2 MB/s

The zone report then shows:

atr@stosys-qemu-vm:/home/atr/src/rocksdb$ sudo nvme zns report-zones /dev/nvme1n1 
nr_zones: 1024
SLBA: 0x0        WP: 0x0        Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x100      WP: 0x142      Cap: 0x100      State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x200      WP: 0x200      Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x300      WP: 0x305      Cap: 0x100      State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x400      WP: 0x403      Cap: 0x100      State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x500      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x600      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x700      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x800      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x900      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0xa00      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0xb00      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0xc00      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0xd00      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0xe00      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0xf00      WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1000     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1100     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1200     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1300     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1400     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1500     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1600     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1700     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1800     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1900     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1a00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1b00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1c00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1d00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1e00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x1f00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2000     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2100     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2200     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2300     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2400     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2500     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2600     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2700     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2800     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2900     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2a00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2b00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2c00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2d00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2e00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x2f00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3000     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3100     WP: 0x318d     Cap: 0x100      State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3200     WP: 0x3254     Cap: 0x100      State: IMP_OPENED   Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3300     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3400     WP: 0x3400     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3500     WP: 0x3500     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3600     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3700     WP: 0x3700     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3800     WP: 0x3800     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3900     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3a00     WP: 0x3a00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3b00     WP: 0x3b00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3c00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3d00     WP: 0x3d00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3e00     WP: 0x3e00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x3f00     WP: 0x3f00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4000     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4100     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4200     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4300     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4400     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4500     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4600     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4700     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4800     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4900     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4a00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4b00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4c00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4d00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4e00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x4f00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5000     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5100     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5200     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5300     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5400     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5500     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5600     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5700     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5800     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5900     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5a00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5b00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5c00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5d00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5e00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x5f00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6000     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6100     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6200     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6300     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6400     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6500     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6600     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6700     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6800     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6900     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6a00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6b00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6c00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6d00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6e00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x6f00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7000     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7100     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7200     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7300     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7400     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7500     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7600     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7700     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7800     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7900     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7a00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7b00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7c00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7d00     WP: 0x7d00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7e00     WP: 0x7e00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x7f00     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8000     WP: 0x8000     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8100     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8200     WP: 0x8200     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8300     WP: 0x8300     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8400     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8500     WP: 0x8500     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8600     WP: 0x8600     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8700     WP: 0xffffffffffffffff Cap: 0x100      State: FULL         Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8800     WP: 0x8800     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8900     WP: 0x8900     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8a00     WP: 0x8a00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8b00     WP: 0x8b00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8c00     WP: 0x8c00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8d00     WP: 0x8d00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8e00     WP: 0x8e00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
SLBA: 0x8f00     WP: 0x8f00     Cap: 0x100      State: EMPTY        Type: SEQWRITE_REQ   Attrs: 0x0
... and so on, up to 1024 zones (not shown)
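Because the zone size here is a power of two (0x100 LBAs per zone in the report above), mapping an LBA to its zone is simple arithmetic. A minimal sketch, assuming zones start at LBA 0 and the 0x100-LBA zone size shown above:

```python
# Map an LBA to (zone index, zone start LBA, offset within the zone).
# Assumes a power-of-two zone size and zones starting at LBA 0.
ZONE_SIZE = 0x100  # logical blocks per zone, as in the report above

def lba_to_zone(lba: int, zone_size: int = ZONE_SIZE):
    zone = lba // zone_size          # zone index
    slba = zone * zone_size          # zone start LBA (SLBA)
    return zone, slba, lba - slba    # offset from the zone's SLBA

print(lba_to_zone(0x5a33))  # (90, 23040, 51) i.e. zone 0x5a, SLBA 0x5a00
```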

Multiple versions of libnvme

Aug 15th, 2021

Compiling xNVMe; to disable the SPDK backend, use ./configure --disable-be-spdk: https://xnvme.io/docs/latest/getting_started/

Which NUMA node a PCIe device is attached to

atr@node1:/home/atr/tmp/libnvme$ sudo cat /sys/class/nvme/nvme0/numa_node 
1
atr@node1:/home/atr/tmp/libnvme$ 
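The same sysfs lookup can be done for all controllers at once. A sketch (the `base` parameter is only there so the function can be pointed at a test directory; on a real machine the default path applies):

```python
# Collect the NUMA node of every NVMe controller from sysfs.
from pathlib import Path

def nvme_numa_nodes(base: str = "/sys/class/nvme") -> dict:
    nodes = {}
    for ctrl in sorted(Path(base).glob("nvme*")):
        numa_file = ctrl / "numa_node"
        if numa_file.exists():
            nodes[ctrl.name] = int(numa_file.read_text())
    return nodes
```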

MDTS shenanigans

On node3

$ sudo nvme show-regs -H /dev/nvme1n1

cap     : 2004010fff
        Controller Memory Buffer Supported (CMBS): The Controller Memory Buffer is Not Supported
        Persistent Memory Region Supported (PMRS): The Persistent Memory Region is Not Supported
        Memory Page Size Maximum         (MPSMAX): 4096 bytes
        Memory Page Size Minimum         (MPSMIN): 4096 bytes
        Boot Partition Support              (BPS): No
        Command Sets Supported              (CSS): NVM command set is Supported
                                                   One or more I/O Command Sets are Not Supported
        NVM Subsystem Reset Supported     (NSSRS): No
        Doorbell Stride                   (DSTRD): 4 bytes
        Timeout                              (TO): 2000 ms
        Arbitration Mechanism Supported     (AMS): Weighted Round Robin with Urgent Priority Class is not supported
        Contiguous Queues Required          (CQR): Yes
        Maximum Queue Entries Supported    (MQES): 4096
[...]
$ sudo nvme id-ctrl -H /dev/nvme1n1 
version : 10000
        NVMe specification 1.0

cc      : 460001
        I/O Completion Queue Entry Size (IOCQES): 16 bytes
        I/O Submission Queue Entry Size (IOSQES): 64 bytes
        Shutdown Notification              (SHN): No notification; no effect
        Arbitration Mechanism Selected     (AMS): Round Robin
        Memory Page Size                   (MPS): 4096 bytes
        I/O Command Set Selected           (CSS): NVM Command Set
        Enable                              (EN): Yes

NVME Identify Controller:
vid       : 0x8086
ssvid     : 0x8086
sn        : PHM203410051280AGN  
mn        : INTEL SSDPE21D280GA                     
fr        : E2010480
rab       : 0
ieee      : 5cd2e4
cmic      : 0
  [3:3] : 0     ANA not supported
  [2:2] : 0     PCI
  [1:1] : 0     Single Controller
  [0:0] : 0     Single Port

mdts      : 5
[...]

With MDTS = 5, the maximum transfer size is 4096 bytes (MPSMIN) * 2^5 (MDTS) = 128 KB. Also remember that the LBA size is 512 bytes on this Optane device.
Testing 128 KB:

atr@node3:~$ DD=256; sudo nvme read /dev/nvme1n1 -s 0x100 -b $(($DD -1)) -z $(($DD * 512)) -d 512KB
read: Success
atr@node3:~$ DD=256; sudo nvme write /dev/nvme1n1 -s 0x100 -b $(($DD -1)) -z $(($DD * 512)) -d 512KB
write: Success
atr@node3:~$ DD=257; sudo nvme write /dev/nvme1n1 -s 0x100 -b $(($DD -1)) -z $(($DD * 512)) -d 512KB
submit-io: Invalid argument
atr@node3:~$ DD=257; sudo nvme read /dev/nvme1n1 -s 0x100 -b $(($DD -1)) -z $(($DD * 512)) -d 512KB
submit-io: Invalid argument
atr@node3:~$ 

So 256 blocks * 512 bytes = 128 KB succeeds, and 257 blocks fails, confirming the MDTS limit.
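The arithmetic behind these probes can be sketched as follows; MPSMIN, MDTS, and the LBA size are taken from the register dumps above (a minimal sketch, not anything nvme-cli itself computes):

```python
# Maximum transfer size implied by MDTS: MPSMIN * 2^MDTS bytes,
# and the corresponding block count for a given LBA size.
def max_transfer_bytes(mpsmin: int, mdts: int) -> int:
    return mpsmin << mdts

def max_transfer_blocks(mpsmin: int, mdts: int, lba_size: int) -> int:
    return max_transfer_bytes(mpsmin, mdts) // lba_size

# Optane above: MPSMIN=4096, MDTS=5, LBA=512 -> 128 KB, 256 blocks
print(max_transfer_bytes(4096, 5))        # 131072
print(max_transfer_blocks(4096, 5, 512))  # 256
```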

Now inside the VM with ZNS

cap     : 4018200f0107ff
        Controller Memory Buffer Supported (CMBS): The Controller Memory Buffer is Not Supported
        Persistent Memory Region Supported (PMRS): The Persistent Memory Region is Not Supported
        Memory Page Size Maximum         (MPSMAX): 65536 bytes
        Memory Page Size Minimum         (MPSMIN): 4096 bytes
        Boot Partition Support              (BPS): No
        Command Sets Supported              (CSS): NVM command set is Supported
                                                   One or more I/O Command Sets are Supported
        NVM Subsystem Reset Supported     (NSSRS): No
        Doorbell Stride                   (DSTRD): 4 bytes
        Timeout                              (TO): 7500 ms
        Arbitration Mechanism Supported     (AMS): Weighted Round Robin with Urgent Priority Class is not supported
        Contiguous Queues Required          (CQR): Yes
        Maximum Queue Entries Supported    (MQES): 2048

version : 10400
        NVMe specification 1.4

cc      : 460061
        I/O Completion Queue Entry Size (IOCQES): 16 bytes
        I/O Submission Queue Entry Size (IOSQES): 64 bytes
        Shutdown Notification              (SHN): No notification; no effect
        Arbitration Mechanism Selected     (AMS): Round Robin
        Memory Page Size                   (MPS): 4096 bytes
        I/O Command Set Selected           (CSS): All supported I/O Command Sets
        Enable                              (EN): Yes
[...]

and the id-ctrl 

NVME Identify Controller:
vid       : 0x1b36
ssvid     : 0x1af4
sn        : zns-dev             
mn        : QEMU NVMe Ctrl                          
fr        : 1.0     
rab       : 6
ieee      : 525400
cmic      : 0
  [3:3] : 0     ANA not supported
  [2:2] : 0     PCI
  [1:1] : 0     Single Controller
  [0:0] : 0     Single Port

mdts      : 7
cntlid    : 0
ver       : 0x10400
rtd3r     : 0
rtd3e     : 0
oaes      : 0x100

The default QEMU MDTS is 7, so here the maximum I/O size is 4096 * 2^7 = 512 KB. The LBA size in use here is 4096 bytes.

This is a normal (non-zoned) NVMe device; note the random pattern of failures:

atr@stosys-qemu-vm:/home/atr/new/zns-resources/stosys-class/stosys-project-code$ DD=128; for((i=0;i<10;i++)); do sudo nvme write /dev/nvme1n1 -s 0x0 -b $(($DD -1)) -z $(($DD * 4096)) -d 512KB; done 
write: Success
submit-io: Invalid argument
submit-io: Invalid argument
submit-io: Invalid argument
submit-io: Invalid argument
write: Success
write: Success
write: Success
write: Success
write: Success
atr@stosys-qemu-vm:/home/atr/new/zns-resources/stosys-class/stosys-project-code$ DD=128; for((i=0;i<10;i++)); do sudo nvme write /dev/nvme1n1 -s 0x0 -b $(($DD -1)) -z $(($DD * 4096)) -d 512KB; done 
submit-io: Invalid argument
write: Success
write: Success
write: Success
write: Success
write: Success
write: Success
write: Success
submit-io: Invalid argument
submit-io: Invalid argument
atr@stosys-qemu-vm:/home/atr/new/zns-resources/stosys-class/stosys-project-code$ DD=128; for((i=0;i<10;i++)); do sudo nvme write /dev/nvme1n1 -s 0x0 -b $(($DD -1)) -z $(($DD * 4096)) -d 512KB; done 
write: Success
submit-io: Invalid argument
submit-io: Invalid argument
submit-io: Invalid argument
submit-io: Invalid argument
submit-io: Invalid argument
write: Success
write: Success
write: Success
write: Success
atr@stosys-qemu-vm:/home/atr/new/zns-resources/stosys-class/stosys-project-code$ 

A similar story holds on the zoned devices, but with a reset. On the actual Optane NVMe it always works properly.

RocksDB and M3 plugin

atr@atr-xps-13:~/vu/github/atr-zns-resources/stosys-class/stosys-project-code/src/m3$ make
g++ -std=c++11 -faligned-new -DHAVE_ALIGNED_NEW -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -DOS_LINUX -fno-builtin-memcmp -DROCKSDB_FALLOCATE_PRESENT -DGFLAGS=1 -DZLIB -DNUMA -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_PTHREAD_ADAPTIVE_MUTEX -DROCKSDB_BACKTRACE -DROCKSDB_RANGESYNC_PRESENT -DROCKSDB_SCHED_GETCPU_PRESENT -DROCKSDB_AUXV_GETAUXVAL_PRESENT -march=native -DHAVE_SSE42 -DHAVE_PCLMUL -DHAVE_AVX2 -DHAVE_BMI -DHAVE_LZCNT -DHAVE_UINT128_EXTENSION -DROCKSDB_SUPPORT_THREAD_LOCAL -isystem third-party/gtest-1.8.1/fused-src -isystem ./third-party/folly -I/usr/local/include -I/home/atr/vu/github/storage/rocksdb/ -L/home/atr/local/lib/ -L/home/atr/local/usr/local/lib/ -Wl,-rpath=/home/atr/local/lib  -o m3 src/m3_main.o -L/usr/local/lib -ldl -lrocksdb -lpthread -lrt -ldl -lgflags -lz -lnuma -lzstd -lbz2 -llz4 -lsnappy -u m3_leveldb_reg
/usr/bin/ld: cannot find -lzstd
/usr/bin/ld: cannot find -lbz2
/usr/bin/ld: cannot find -llz4
/usr/bin/ld: cannot find -lsnappy
collect2: error: ld returned 1 exit status
make: *** [Makefile:17: m3] Error 1
atr@atr-xps-13:~/vu/github/atr-zns-resources/stosys-class/stosys-project-code/src/m3$ 

RocksDB needs these libraries:

sudo apt install autoconf libgflags-dev libtool autoconf-archive
sudo apt-get install libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev

see this: https://github.com/facebook/rocksdb/blob/main/INSTALL.md#dependencies

In case of missing references like:

atr@atr-xps-13:~/vu/github/atr-zns-resources/stosys-class/stosys-project-code$ g++ -std=c++11 -frtti -faligned-new -DHAVE_ALIGNED_NEW -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -DOS_LINUX -fno-builtin-memcmp -DROCKSDB_FALLOCATE_PRESENT -DGFLAGS=1 -DZLIB -DNUMA -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_PTHREAD_ADAPTIVE_MUTEX -DROCKSDB_BACKTRACE -DROCKSDB_RANGESYNC_PRESENT -DROCKSDB_SCHED_GETCPU_PRESENT -DROCKSDB_AUXV_GETAUXVAL_PRESENT -march=native -DHAVE_SSE42 -DHAVE_PCLMUL -DHAVE_AVX2 -DHAVE_BMI -DHAVE_LZCNT -DHAVE_UINT128_EXTENSION -DROCKSDB_SUPPORT_THREAD_LOCAL -isystem third-party/gtest-1.8.1/fused-src -isystem ./third-party/folly -I/usr/local/include -I/home/atr/vu/github/storage/rocksdb/ -L/home/atr/local/lib/ -L/home/atr/local/usr/local/lib/ -Wl,-rpath=/home/atr/local/lib CMakeFiles/m3.dir/src/m3/src/m3_main.cpp.o CMakeFiles/m3.dir/src/m3/src/m3.cpp.o -o bin/m3 -L/usr/local/lib -ldl -lrocksdb -lpthread -lrt -ldl -lgflags -lz -lnuma -lzstd -lbz2 -llz4 -lsnappy -u m3_leveldb_reg
/usr/bin/ld: CMakeFiles/m3.dir/src/m3/src/m3.cpp.o:(.data.rel.ro._ZTIN7rocksdb2M3E[_ZTIN7rocksdb2M3E]+0x10): undefined reference to `typeinfo for rocksdb::FileSystem'
/usr/bin/ld: CMakeFiles/m3.dir/src/m3/src/m3.cpp.o:(.data.rel.ro._ZTIN7rocksdb17FileSystemWrapperE[_ZTIN7rocksdb17FileSystemWrapperE]+0x10): undefined reference to `typeinfo for rocksdb::FileSystem'
collect2: error: ld returned 1 exit status
atr@atr-xps-13:~/vu/github/atr-zns-resources/stosys-class/stosys-project-code$ 

compile RocksDB with the RTTI flag enabled:

DEBUG_LEVEL=0 USE_RTTI=1 DESTDIR=/home/atr/local/ make -j 4 db_bench install

NVMe commands

How to list all namespaces on a controller

atr@stosys-qemu-vm:~$ sudo nvme list-ns /dev/nvme1 -H 
list-ns: unrecognized option '-H'
Usage: nvme list-ns <device> [OPTIONS]

For the specified controller handle, show the namespace list in the
associated NVMe subsystem, optionally starting with a given nsid.

Options:
  [  --namespace-id=<NUM>, -n <NUM> ]   --- first nsid returned list should
                                            start from
  [  --csi=<NUM>, -y <NUM> ]            --- I/O command set identifier
  [  --all, -a ]                        --- show all namespaces in the
                                            subsystem, whether attached or
                                            inactive
atr@stosys-qemu-vm:~$ sudo nvme list-ns /dev/nvme1 -a
[   0]:0x1
atr@stosys-qemu-vm:~$ 

Does the device support NS management commands

atr@stosys-qemu-vm:~$ sudo nvme id-ctrl /dev/nvme1 -H | grep 'NS Management'
  [3:3] : 0x1	NS Management and Attachment Supported
atr@stosys-qemu-vm:~$ 

See Figure 251 (field OACS) in the NVMe 1.4 specification.
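The grep above is checking OACS bit 3, which per Figure 251 of the 1.4 spec indicates Namespace Management and Attachment support. A sketch of the decode:

```python
# OACS (Optional Admin Command Support), NVMe 1.4 Figure 251:
# bit 3 indicates Namespace Management and Attachment support.
def supports_ns_mgmt(oacs: int) -> bool:
    return bool(oacs & (1 << 3))

print(supports_ns_mgmt(0x8))  # True
print(supports_ns_mgmt(0x0))  # False
```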

Detach a namespace

atr@node1:/home/atr/zns-fw$ sudo nvme id-ctrl /dev/nvme1 | grep cntlid 
cntlid    : 0
atr@node1:/home/atr/zns-fw$ 

atr@stosys-qemu-vm:~$ sudo nvme detach-ns /dev/nvme1 -n 1 -c 0 
NVMe status: INVALID_FIELD: A reserved coded value or an unsupported value in a defined field(0x4002)
atr@stosys-qemu-vm:~$ 

Trying to detach the namespace first (the delete-ns help text says all controllers should be detached prior to deletion), but it fails here with INVALID_FIELD.

Delete a namespace

atr@stosys-qemu-vm:~$ sudo nvme delete-ns -h 
Usage: nvme delete-ns <device> [OPTIONS]

Delete the given namespace by sending a namespace management command to the
provided device. All controllers should be detached from the namespace prior
to namespace deletion. A namespace ID becomes inactive when that namespace
is detached or, if the namespace is not already inactive, once deleted.

Options:
  [  --namespace-id=<NUM>, -n <NUM> ]   --- namespace to delete
  [  --timeout=<NUM>, -t <NUM> ]        --- timeout value, in milliseconds
atr@stosys-qemu-vm:~$ sudo nvme delete-ns /dev/nvme1 -n 1 
NVMe status: INVALID_OPCODE: The associated command opcode field is not valid(0x4001)

https://www.ibm.com/docs/en/linux-on-systems?topic=drive-deleting-stray-nvme-namespaces-nvme

Create a namespace

An example run on node1 with a firmware update and namespace management:

atr@node1:/home/atr/zns-fw$ ll
total 7120
drwxrwxr-x 3 atr atr    4096 Nov  2 09:38 ./
drwxr-xr-x 8 atr atr    4096 Nov  2 09:38 ../
-rw-rw-r-- 1 atr atr 3645440 Oct 14 13:10 borabora_zns_GZ_R6Z10011.vpkg
-rw-rw-r-- 1 atr atr 3619845 Oct 27 08:27 borabora_zns_GZ_R6Z10011.zip
-rw-rw-r-- 1 atr atr     355 Apr  9  2021 create_zns.sh
-rw-rw-r-- 1 atr atr     372 Oct 27 08:08 load_fw.sh
drwxrwxr-x 2 atr atr    4096 Nov  2 09:38 Previous/
-rw-rw-r-- 1 atr atr     491 Oct 27 08:27 readme.txt
atr@node1:/home/atr/zns-fw$ cat readme.txt 
Disclaimer: Firmware binary is shared under NDA and confidential.

How to apply the firmware:

1. Copy files to server
2. Load firmware. All device namespaces will be deleted.
   sudo ./load_fw.sh /dev/nvmeX borabora_zns_GZ_R6ZXXXXX.vpdk
3. Cold reboot the system

Note that if namespaces are deleted, the drive will not be visible by "nvme list" until it has namespaces recreated.

To create a single large zoned namespace, the command ./create_zns.sh /dev/nvmeX may be used.

atr@node1:/home/atr/zns-fw$ cat load_fw.sh 
#!/bin/sh

#nvme delete-ns $1 -n 0xffffffff
#nvme format $1 -n 0xffffffff -l 2 -f

sleep 1

nvme fw-download $1 -f $2
sleep 1
nvme fw-activate $1 -s 1 -a 0
sleep 1
nvme fw-download $1 -f $2
sleep 1
nvme fw-activate $1 -s 2 -a 0
sleep 1
nvme fw-download $1 -f $2
sleep 1
nvme fw-activate $1 -s 3 -a 0
sleep 1
nvme fw-download $1 -f $2
sleep 1
nvme fw-activate $1 -s 4 -a 1
atr@node1:/home/atr/zns-fw$ sudo nvme detach-ns /dev/nvme1
nvme1    nvme1n1  nvme1n2  
atr@node1:/home/atr/zns-fw$ sudo nvme detach-ns /dev/nvme1 -n 1 
NVMe status: CONTROLLER_LIST_INVALID: The controller list provided is invalid(0x611c)
atr@node1:/home/atr/zns-fw$ sudo nvme id-ctrl /dev/nvme1 | grep cntlid 
cntlid    : 0
atr@node1:/home/atr/zns-fw$ #sudo nvme detach-ns /dev/nvme1 -n 1 -c 0 C
atr@node1:/home/atr/zns-fw$ sudo nvme list-ns /dev/nvme1 -a
[   0]:0x1
[   1]:0x2
atr@node1:/home/atr/zns-fw$ sudo nvme detach-ns /dev/nvme1 -n 1 -c 0 
detach-ns: Success, nsid:1
atr@node1:/home/atr/zns-fw$ sudo nvme detach-ns /dev/nvme1 -n 2 -c 0 
detach-ns: Success, nsid:2
atr@node1:/home/atr/zns-fw$ chmod +x load_fw.sh 
atr@node1:/home/atr/zns-fw$ sudo ./load_fw.sh /dev/nvme
nvme0    nvme0n1  nvme1    
atr@node1:/home/atr/zns-fw$ sudo ./load_fw.sh /dev/nvme^C
atr@node1:/home/atr/zns-fw$ sudo nvme list-ns /dev/nvme1 -a
[   0]:0x1
[   1]:0x2
atr@node1:/home/atr/zns-fw$ sudo nvme delete-ns /dev/nvme1 -n 1 
delete-ns: Success, deleted nsid:1
atr@node1:/home/atr/zns-fw$ sudo nvme delete-ns /dev/nvme1 -n 2 
delete-ns: Success, deleted nsid:2
atr@node1:/home/atr/zns-fw$ sudo nvme list-ns /dev/nvme1 -a
atr@node1:/home/atr/zns-fw$ sudo ./load_fw.sh /dev/nvme
nvme0    nvme0n1  nvme1    
atr@node1:/home/atr/zns-fw$ sudo ./load_fw.sh /dev/nvme1 ./borabora_zns_GZ_R6Z10011.vpkg 
Firmware download success
NVMe status: FIRMWARE_SLOT: The firmware slot indicated is invalid or read only. This error is indicated if the firmware slot exceeds the number supported(0x6106)
Firmware download success
Success committing firmware action:0 slot:2
Firmware download success
Success committing firmware action:0 slot:3
Firmware download success
Success committing firmware action:1 slot:4
atr@node1:/home/atr/zns-fw$ sudo sync 
atr@node1:/home/atr/zns-fw$ sudo sync 
atr@node1:/home/atr/zns-fw$ sudo reboot 

Decoding NVMe errors with ioctl and libnvme

Version 1.4 of the NVMe specification covers the completion queue entry status in Section 4.6; Section 4.6.1.2.1 then lists the generic error codes in Figure 128.

Examples

atr@stosys-qemu-vm:/home/atr/msc-stosys-framework$ sudo nvme write /dev/nvme0n1 -s 0 -c 0 -z 4096 -d ./4kb 
NVMe status: ZONE_INVALID_WRITE: The write to zone was not at the write pointer offset(0x41bc)
atr@stosys-qemu-vm:/home/atr/msc-stosys-framework$ 

Here the upper byte 0x41 carries the DNR bit plus the command-specific status code type, and 0xBC is the ZNS-specific status code (the invalid zone write).

Version 2.0 of the specification has these details in Section 3.3.3.2.1.
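The decoding can be sketched as a small helper; the field layout (SC in bits 7:0, SCT in bits 10:8, DNR at bit 14) is from Section 4.6 of the 1.4 spec:

```python
# Split a 15-bit NVMe status value into its fields.
def decode_status(status: int):
    sc = status & 0xFF           # Status Code
    sct = (status >> 8) & 0x7    # Status Code Type (1 = command specific)
    dnr = bool(status & 0x4000)  # Do Not Retry bit
    return sct, sc, dnr

# ZONE_INVALID_WRITE above: SCT=1 (command specific), SC=0xBC, DNR set
print(decode_status(0x41bc))  # (1, 188, True)
```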

How to change the LBA format

atr@node3:~$ sudo nvme format /dev/nvme1n2 -lbaf 0 
You are about to format nvme1n2, namespace 0x2.
Namespace nvme1n2 has parent controller(s):nvme1

WARNING: Format may irrevocably delete this device's data.
You have 10 seconds to press Ctrl-C to cancel this operation.

Use the force [--force|-f] option to suppress this warning.
Sending format operation ... 
NVMe status: INVALID_FORMAT: The LBA Format specified is not supported. This may be due to various conditions(0x610a)
atr@node3:~$ sudo nvme id-ctrl /dev/nvme1 |grep tnvmcap 
tnvmcap   : 7924214661120
atr@node3:~$ sudo nvme id-ctrl /dev/nvme1 |grep unvmcap 
unvmcap   : 0
atr@node3:~$ sudo nvme delete-ns /dev/nvme1 -n 1
delete-ns: Success, deleted nsid:1
atr@node3:~$ sudo nvme delete-ns /dev/nvme1 -n 2
delete-ns: Success, deleted nsid:2
atr@node3:~$ sudo nvme create-ns /dev/nvme1 -s 8388608 -c 8388608 -b 512 --csi=0
create-ns: Success, created nsid:1
atr@node3:~$ sudo nvme create-ns /dev/nvme1 -s 15468593152 -c 15468593152 -b 512 --csi=2
create-ns: Success, created nsid:2
atr@node3:~$ sudo nvme create-ns /dev/nvme1 -s 15468593152 -c 15468593152 -b 512 --csi=2^C
atr@node3:~$ sudo nvme attach-ns /dev/nvme1 -n 1 -c 0
attach-ns: Success, nsid:1
atr@node3:~$ sudo nvme attach-ns /dev/nvme1 -n 2 -c 0
attach-ns: Success, nsid:2
atr@node3:~$ sudo ./nvme attach-ns /dev/nvme1 -n 1 -c 0^C
atr@node3:~$ sudo nvme list 
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          PHM20341005S280AGN   INTEL SSDPE21D280GA                      1         280.07  GB / 280.07  GB    512   B +  0 B   E2010480
/dev/nvme1n1          21123U900167         WZS4C8T4TDSP303                          1           0.00   B /   4.29  GB    512   B +  0 B   R6Z0701D
/dev/nvme1n2          21123U900167         WZS4C8T4TDSP303                          2           0.00   B /   7.92  TB    512   B +  0 B   R6Z0701D
atr@node3:~$ lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0         7:0    0  55.4M  1 loop /snap/core18/2128
loop1         7:1    0  55.5M  1 loop /snap/core18/2246
loop2         7:2    0  61.9M  1 loop /snap/core20/1169
loop4         7:4    0  61.8M  1 loop /snap/core20/1081
loop5         7:5    0  32.5M  1 loop /snap/snapd/13640
loop6         7:6    0  67.2M  1 loop /snap/lxd/21835
loop7         7:7    0  32.4M  1 loop /snap/snapd/13270
loop8         7:8    0  67.2M  1 loop /snap/lxd/21803
sda           8:0    0 447.1G  0 disk 
├─sda1        8:1    0     1M  0 part 
├─sda2        8:2    0    50G  0 part /boot
├─sda3        8:3    0   200G  0 part /
└─sda4        8:4    0 197.1G  0 part 
nvme0n1     259:0    0 260.9G  0 disk 
├─nvme0n1p1 259:3    0   150G  0 part /mnt/xfs
└─nvme0n1p2 259:4    0 110.8G  0 part 
nvme1n1     259:1    0     4G  0 disk 
nvme1n2     259:2    0   7.2T  0 disk 
atr@node3:~$ 
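The -s/-c arguments to create-ns above are in logical blocks, so the byte capacities that nvme list reports follow directly from the 512 B LBA size; a quick sanity check (values copied from the session above):

```python
# create-ns -s/-c values are in logical blocks; multiply by the LBA size
# (512 B here) to get the byte capacity that `nvme list` reports.
LBA_SIZE = 512
conv_blocks = 8388608        # conventional namespace size (-s) from above
zoned_blocks = 15468593152   # zoned namespace size (-s) from above

print(conv_blocks * LBA_SIZE)   # 4294967296    -> 4.29 GB in nvme list
print(zoned_blocks * LBA_SIZE)  # 7919919693824 -> 7.92 TB in nvme list
```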

F2FS mounting

The packaged mkfs.f2fs is old, so it needs to be updated. The required changes were merged in April 2020: https://www.mail-archive.com/[email protected]/msg17381.html and https://www.mail-archive.com/[email protected]/msg17379.html

mkfs.f2fs utilities: https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/about/

git clone git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
cd f2fs-tools/
./autogen.sh 
./configure --prefix=/home/atr/local/
make -j 

Then, if the conventional device is not large enough:

atr@node1:/home/atr/src/f2fs-tools$ sudo /home/atr/local/sbin/mkfs.f2fs -l atrf2fs -f -m -c /dev/nvme1n2 /dev/nvme1n1 # conventional and then the zone device 

	F2FS-tools: mkfs.f2fs Ver: 1.14.0 (2021-09-28)

Info: Disable heap-based policy
Info: Debug level = 0
Info: Label = atrf2fs
Info: Trim is enabled
Info: Host-managed zoned block device:
      3688 zones, 0 randomly writeable zones
      524288 blocks per zone
Info: Segments per section = 1024
Info: Sections per zone = 1
Info: sector size = 512
Info: total sectors = 15476981760 (7557120 MB)
Info: zone aligned segment0 blkaddr: 524288
	Error: Conventional device /dev/nvme1n1 is too small, (14336 MiB needed).
	Error: Failed to prepare a super block!!!
	Error: Could not format the device!!!

If the sector sizes mismatch:

atr@node1:/home/atr/src/f2fs-tools$ sudo /home/atr/local/sbin/mkfs.f2fs -l atrf2fs -f -m -c /dev/nvme1n2 /dev/nvme0n1 

	F2FS-tools: mkfs.f2fs Ver: 1.14.0 (2021-09-28)

Info: Disable heap-based policy
Info: Debug level = 0
Info: Label = atrf2fs
Info: Trim is enabled
	Error: Different sector sizes!!!

Then I reformatted the namespace to 512-byte sectors; now it works:

atr@node1:/home/atr/src/f2fs-tools$ sudo /home/atr/local/sbin/mkfs.f2fs -l atrf2fs -f -m -c /dev/nvme1n2 /dev/nvme0n1 

	F2FS-tools: mkfs.f2fs Ver: 1.14.0 (2021-09-28)

Info: Disable heap-based policy
Info: Debug level = 0
Info: Label = atrf2fs
Info: Trim is enabled
Info: Host-managed zoned block device:
      3688 zones, 0 randomly writeable zones
      524288 blocks per zone
	/dev/nvme0n1 appears to contain an existing filesystem (ext4).
Info: Segments per section = 1024
Info: Sections per zone = 1
Info: sector size = 512
Info: total sectors = 16015595440 (7820114 MB)
Info: zone aligned segment0 blkaddr: 742134
Info: format version with
  "Linux version 5.12.0+ (atr@node3) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #6 SMP Tue Apr 27 14:01:35 UTC 2021"
Info: [/dev/nvme0n1] Discarding device
Info: This device doesn't support BLKSECDISCARD
Info: Discarded 267090 MB
Info: [/dev/nvme1n2] Discarding device
Info: Discarded 7553024 MB
Info: Overprovision ratio = 2.290%
Info: Overprovision segments = 184707 (GC reserved = 97624)
Info: format successful

Then mount it:

atr@node1:/home/atr/src/f2fs-tools$ sudo mount -t f2fs /dev/nvme0n1 /home/atr/mnt-f2fs/