How could write zeroes performance be worse at higher queue depths

I am grateful to Adam Manzanares and Klaus Jensen for their support.

A Performance Mystery

Fio recently gained the ability to submit write zeroes commands to NVMe devices via the io_uring_cmd ioengine. It now also has an option to set the write zeroes command's DEAC bit, which signals to the device that it should deallocate the specified blocks if the namespace is set up to return all zeroes when reading deallocated logical blocks.
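
As background, the sketch below shows roughly how a write zeroes passthrough command with the DEAC bit might be filled in. This is a simplified illustration rather than fio's actual code: the struct nvme_uring_cmd layout comes from the kernel's <linux/nvme_ioctl.h> uapi header (kernel 6.0 or later), the opcode (0x08) and the DEAC bit (bit 25 of command dword 12) come from the NVMe specification, and prep_write_zeroes() is a hypothetical helper.

#include <linux/nvme_ioctl.h>   /* struct nvme_uring_cmd */
#include <stdint.h>
#include <string.h>

#define NVME_CMD_WRITE_ZEROES   0x08        /* NVM command set opcode */
#define NVME_WZ_DEAC            (1u << 25)  /* CDW12 bit 25: deallocate */

/* Sketch only: fill in a Write Zeroes passthrough command. */
static void prep_write_zeroes(struct nvme_uring_cmd *cmd, uint32_t nsid,
                              uint64_t slba, uint32_t nlb, int deac)
{
        memset(cmd, 0, sizeof(*cmd));
        cmd->opcode = NVME_CMD_WRITE_ZEROES;
        cmd->nsid   = nsid;
        cmd->cdw10  = slba & 0xffffffff;   /* starting LBA, low 32 bits */
        cmd->cdw11  = slba >> 32;          /* starting LBA, high 32 bits */
        cmd->cdw12  = (nlb - 1) & 0xffff;  /* number of LBAs, zero-based */
        if (deac)
                cmd->cdw12 |= NVME_WZ_DEAC;
        /* No data buffer is transferred: addr and data_len stay zero. */
}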

Testing with the current tip of fio revealed that write zeroes performance with the DEAC bit set was actually worse at higher queue depths. A simple queue depth 1 job with a 16 MiB block size produced 30.5 GiB/s, but at a queue depth of 4 the device achieved only 19.8 GiB/s! How could this be?

Fio runs: 30.5 GiB/s vs 19.8 GiB/s
root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=1
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=29.5GiB/s][w=1885 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12200: Mon Oct 28 17:43:03 2024
  write: IOPS=1952, BW=30.5GiB/s (32.8GB/s)(305GiB/10001msec); 0 zone resets
    slat (usec): min=149, max=1478, avg=308.93, stdev=76.59
    clat (usec): min=171, max=944, avg=202.15, stdev=34.46
     lat (usec): min=322, max=1674, avg=511.09, stdev=78.63
    clat percentiles (usec):
     |  1.00th=[  176],  5.00th=[  180], 10.00th=[  182], 20.00th=[  184],
     | 30.00th=[  188], 40.00th=[  190], 50.00th=[  192], 60.00th=[  194],
     | 70.00th=[  200], 80.00th=[  208], 90.00th=[  233], 95.00th=[  269],
     | 99.00th=[  355], 99.50th=[  367], 99.90th=[  383], 99.95th=[  408],
     | 99.99th=[  635]
   bw (  MiB/s): min=30048, max=32993, per=100.00%, avg=31266.62, stdev=841.97, samples=20
   iops        : min= 1878, max= 2062, avg=1953.90, stdev=52.76, samples=20
  lat (usec)   : 250=93.37%, 500=6.61%, 750=0.02%, 1000=0.01%
  cpu          : usr=60.99%, sys=0.63%, ctx=19526, majf=0, minf=36
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,19524,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=30.5GiB/s (32.8GB/s), 30.5GiB/s-30.5GiB/s (32.8GB/s-32.8GB/s), io=305GiB (328GB), run=10001-10001msec

root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=19.8GiB/s][w=1270 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12211: Mon Oct 28 17:47:01 2024
  write: IOPS=1269, BW=19.8GiB/s (21.3GB/s)(199GiB/10002msec); 0 zone resets
    slat (usec): min=353, max=1963, avg=786.71, stdev=143.30
    clat (usec): min=201, max=3705, avg=2362.09, stdev=246.66
     lat (usec): min=1007, max=4482, avg=3148.80, stdev=280.86
    clat percentiles (usec):
     |  1.00th=[ 1713],  5.00th=[ 1975], 10.00th=[ 2057], 20.00th=[ 2147],
     | 30.00th=[ 2212], 40.00th=[ 2376], 50.00th=[ 2442], 60.00th=[ 2474],
     | 70.00th=[ 2507], 80.00th=[ 2540], 90.00th=[ 2573], 95.00th=[ 2606],
     | 99.00th=[ 2769], 99.50th=[ 3458], 99.90th=[ 3589], 99.95th=[ 3621],
     | 99.99th=[ 3654]
   bw (  MiB/s): min=20128, max=20736, per=100.00%, avg=20341.97, stdev=138.39, samples=20
   iops        : min= 1258, max= 1296, avg=1271.10, stdev= 8.60, samples=20
  lat (usec)   : 250=0.01%
  lat (msec)   : 2=5.98%, 4=94.03%
  cpu          : usr=99.84%, sys=0.16%, ctx=15, majf=0, minf=36
  IO depths    : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,12701,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=19.8GiB/s (21.3GB/s), 19.8GiB/s-19.8GiB/s (21.3GB/s-21.3GB/s), io=199GiB (213GB), run=10002-10002msec

Could this be a hardware issue? I decided to carry out a test using an emulated NVMe device to check. I created a RAM disk and used it as the backing store for an NVMe fabrics loopback device.

nvme-fabrics-loop.sh
#!/bin/bash
set -x

modprobe brd rd_nr=1 rd_size=33554432

modprobe nvmet
modprobe nvme-loop
modprobe nvme-fabrics
cd /sys/kernel/config/nvmet/subsystems/
mkdir nvme-test-target
cd nvme-test-target/
echo 1 > attr_allow_any_host
mkdir namespaces/1
cd namespaces/1
echo -n /dev/ram0 > device_path
cat device_path
echo 1 > enable
cat enable
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo loop > addr_trtype
cat addr_trtype
ln -s /sys/kernel/config/nvmet/subsystems/nvme-test-target/ /sys/kernel/config/nvmet/ports/1/subsystems/nvme-test-target
nvme discover -t loop
sleep 1
nvme connect -t loop -n nvme-test-target
sleep 1
nvme list
Console log
vincent@localhost $ sudo ./nvme-fabrics-loop.sh
+ modprobe brd rd_nr=1 rd_size=33554432
+ modprobe nvmet
+ modprobe nvme-loop
+ modprobe nvme-fabrics
+ cd /sys/kernel/config/nvmet/subsystems/
+ mkdir nvme-test-target
+ cd nvme-test-target/
+ echo 1
+ mkdir namespaces/1
+ cd namespaces/1
+ echo -n /dev/ram0
+ cat device_path
/dev/ram0
+ echo 1
+ cat enable
1
+ mkdir /sys/kernel/config/nvmet/ports/1
+ cd /sys/kernel/config/nvmet/ports/1
+ echo loop
+ cat addr_trtype
loop
+ ln -s /sys/kernel/config/nvmet/subsystems/nvme-test-target/ /sys/kernel/config/nvmet/ports/1/subsystems/nvme-test-target
+ nvme discover -t loop

Discovery Log Number of Records 2, Generation counter 2
=====Discovery Log Entry 0======
trtype:  loop
adrfam:  unrecognized
subtype: unrecognized
treq:    not specified, sq flow control disable supported
portid:  1
trsvcid:
subnqn:  nqn.2014-08.org.nvmexpress.discovery
traddr:
=====Discovery Log Entry 1======
trtype:  loop
adrfam:  unrecognized
subtype: nvme subsystem
treq:    not specified, sq flow control disable supported
portid:  1
trsvcid:
subnqn:  nvme-test-target
traddr:
+ sleep 1
+ nvme connect -t loop -n nvme-test-target
+ sleep 1
+ nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     461a6d8df1da98a9e46d Linux                                    1          34.36  GB /  34.36  GB    512   B +  0 B   6.6.0-rc

The bandwidth levels attained were much lower than those observed on the PCIe NVMe device, but it was still possible to elicit worse performance at higher queue depths. At a queue depth of 64, I measured write zeroes bandwidth of 8218 MiB/s, but at a queue depth of 128, bandwidth declined to 7757 MiB/s. So even with a software-emulated device, increasing the queue depth could make performance worse. This suggests that the hardware device is not responsible for the slowdown.

Fio runs: 8218 MiB/s vs 7757 MiB/s
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=64
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=64
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=8857MiB/s][w=553 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88098: Mon Oct 28 10:59:14 2024
  write: IOPS=507, BW=8218MiB/s (8617MB/s)(80.4GiB/10017msec); 0 zone resets
    slat (usec): min=492, max=21015, avg=1964.80, stdev=1632.94
    clat (msec): min=15, max=180, avg=123.40, stdev=15.82
     lat (msec): min=17, max=181, avg=125.37, stdev=15.76
    clat percentiles (msec):
     |  1.00th=[   93],  5.00th=[  102], 10.00th=[  106], 20.00th=[  111],
     | 30.00th=[  115], 40.00th=[  118], 50.00th=[  123], 60.00th=[  127],
     | 70.00th=[  132], 80.00th=[  138], 90.00th=[  144], 95.00th=[  150],
     | 99.00th=[  161], 99.50th=[  165], 99.90th=[  174], 99.95th=[  178],
     | 99.99th=[  180]
   bw (  MiB/s): min= 7040, max= 8977, per=98.97%, avg=8133.66, stdev=627.05, samples=20
   iops        : min=  440, max=  561, avg=508.20, stdev=39.30, samples=20
  lat (msec)   : 20=0.06%, 50=0.26%, 100=3.44%, 250=97.48%
  cpu          : usr=59.46%, sys=0.00%, ctx=1362, majf=0, minf=39
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,5081,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=8218MiB/s (8617MB/s), 8218MiB/s-8218MiB/s (8617MB/s-8617MB/s), io=80.4GiB (86.3GB), run=10017-10017msec
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=128
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=128
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=7664MiB/s][w=479 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88175: Mon Oct 28 10:59:35 2024
  write: IOPS=472, BW=7757MiB/s (8134MB/s)(75.9GiB/10022msec); 0 zone resets
    slat (usec): min=587, max=10235, avg=2109.36, stdev=1495.72
    clat (msec): min=20, max=319, avg=265.18, stdev=25.92
     lat (msec): min=21, max=321, avg=267.28, stdev=25.82
    clat percentiles (msec):
     |  1.00th=[  112],  5.00th=[  247], 10.00th=[  253], 20.00th=[  257],
     | 30.00th=[  262], 40.00th=[  264], 50.00th=[  268], 60.00th=[  271],
     | 70.00th=[  275], 80.00th=[  279], 90.00th=[  284], 95.00th=[  288],
     | 99.00th=[  300], 99.50th=[  305], 99.90th=[  313], 99.95th=[  313],
     | 99.99th=[  321]
   bw (  MiB/s): min= 7264, max= 7951, per=97.62%, avg=7572.61, stdev=192.38, samples=20
   iops        : min=  454, max=  496, avg=473.00, stdev=11.85, samples=20
  lat (msec)   : 50=0.38%, 100=0.55%, 250=6.38%, 500=95.37%
  cpu          : usr=63.46%, sys=0.00%, ctx=1384, majf=0, minf=37
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,4731,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=7757MiB/s (8134MB/s), 7757MiB/s-7757MiB/s (8134MB/s-8134MB/s), io=75.9GiB (81.5GB), run=10022-10022msec

Scrambled buffers

What could be going on? After some head scratching, gnashing of teeth, and pulling of hair, along with some research, investigation, and testing, fio's documentation and source code revealed the answer. For write operations, fio by default scrambles the data buffer before each command. With the large block sizes used in these workloads, the time fio spends scrambling the data buffer takes away from the time available to submit commands and reap completions. Since we are sending write zeroes commands, the data buffer contents are irrelevant to our workload; in fact, fio does not even send a data buffer address with these commands. So fio should not be spending any time randomizing the contents of the data buffer.

We can add the --scramble_buffers=0 option to our jobs to disable this behavior; fio will then leave the data buffer untouched for each command. The --zero_buffers option will also suffice: it directs fio to fill data buffers with zeroes and then leave them untouched for the rest of the run.

With the --zero_buffers option, our hardware NVMe device now produces 52.6 GiB/s at queue depth 1, and bandwidth scales with queue depth as expected, increasing to 58.9 GiB/s at queue depth 4.

Fio runs: 52.6 GiB/s vs 58.9 GiB/s
root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=1
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=52.6GiB/s][w=3369 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12279: Mon Oct 28 18:41:23 2024
  write: IOPS=3364, BW=52.6GiB/s (56.4GB/s)(526GiB/10001msec); 0 zone resets
    slat (nsec): min=1886, max=1247.0k, avg=3365.45, stdev=7045.02
    clat (usec): min=165, max=1650, avg=292.40, stdev=130.16
     lat (usec): min=187, max=1653, avg=295.77, stdev=130.17
    clat percentiles (usec):
     |  1.00th=[  192],  5.00th=[  200], 10.00th=[  206], 20.00th=[  210],
     | 30.00th=[  212], 40.00th=[  217], 50.00th=[  221], 60.00th=[  229],
     | 70.00th=[  241], 80.00th=[  502], 90.00th=[  519], 95.00th=[  529],
     | 99.00th=[  537], 99.50th=[  545], 99.90th=[  660], 99.95th=[  906],
     | 99.99th=[ 1074]
   bw (  MiB/s): min=53482, max=54092, per=100.00%, avg=53879.90, stdev=126.79, samples=20
   iops        : min= 3342, max= 3380, avg=3367.20, stdev= 7.93, samples=20
  lat (usec)   : 250=73.03%, 500=6.74%, 750=20.17%, 1000=0.04%
  lat (msec)   : 2=0.02%
  cpu          : usr=2.61%, sys=3.23%, ctx=33648, majf=0, minf=37
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,33648,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=52.6GiB/s (56.4GB/s), 52.6GiB/s-52.6GiB/s (56.4GB/s-56.4GB/s), io=526GiB (565GB), run=10001-10001msec
root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=57.2GiB/s][w=3659 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12285: Mon Oct 28 18:42:27 2024
  write: IOPS=3770, BW=58.9GiB/s (63.3GB/s)(589GiB/10002msec); 0 zone resets
    slat (nsec): min=1329, max=50623, avg=3205.78, stdev=1682.49
    clat (usec): min=188, max=2794, avg=1056.13, stdev=37.97
     lat (usec): min=208, max=2797, avg=1059.33, stdev=37.82
    clat percentiles (usec):
     |  1.00th=[ 1012],  5.00th=[ 1020], 10.00th=[ 1029], 20.00th=[ 1037],
     | 30.00th=[ 1045], 40.00th=[ 1045], 50.00th=[ 1057], 60.00th=[ 1057],
     | 70.00th=[ 1057], 80.00th=[ 1074], 90.00th=[ 1090], 95.00th=[ 1106],
     | 99.00th=[ 1156], 99.50th=[ 1221], 99.90th=[ 1418], 99.95th=[ 1483],
     | 99.99th=[ 2114]
   bw (  MiB/s): min=58464, max=61088, per=100.00%, avg=60381.51, stdev=839.68, samples=20
   iops        : min= 3654, max= 3818, avg=3773.65, stdev=52.43, samples=20
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.41%
  lat (msec)   : 2=99.56%, 4=0.01%
  cpu          : usr=2.84%, sys=3.58%, ctx=37684, majf=0, minf=37
  IO depths    : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,37714,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=58.9GiB/s (63.3GB/s), 58.9GiB/s-58.9GiB/s (63.3GB/s-63.3GB/s), io=589GiB (633GB), run=10002-10002msec

When we test our emulated NVMe device with the --zero_buffers option, bandwidth explodes to 47.0 GiB/s at queue depth 64 and increases as expected to 67.2 GiB/s at queue depth 128.

Fio runs: 47.0 GiB/s vs 67.2 GiB/s
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=64 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=64
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=47.6GiB/s][w=3044 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88732: Mon Oct 28 11:38:46 2024
  write: IOPS=3003, BW=47.0GiB/s (50.5GB/s)(473GiB/10053msec); 0 zone resets
    slat (nsec): min=593, max=9361.6k, avg=19932.84, stdev=101944.11
    clat (usec): min=193, max=335535, avg=21309.85, stdev=21609.98
     lat (msec): min=3, max=335, avg=21.33, stdev=21.60
    clat percentiles (msec):
     |  1.00th=[    5],  5.00th=[    7], 10.00th=[    8], 20.00th=[   10],
     | 30.00th=[   12], 40.00th=[   14], 50.00th=[   16], 60.00th=[   20],
     | 70.00th=[   23], 80.00th=[   28], 90.00th=[   38], 95.00th=[   51],
     | 99.00th=[  120], 99.50th=[  155], 99.90th=[  253], 99.95th=[  288],
     | 99.99th=[  313]
   bw (  MiB/s): min=44640, max=54188, per=100.00%, avg=48347.36, stdev=2747.91, samples=20
   iops        : min= 2790, max= 3386, avg=3021.40, stdev=171.71, samples=20
  lat (usec)   : 250=0.01%
  lat (msec)   : 4=0.35%, 10=20.26%, 20=42.77%, 50=31.77%, 100=3.60%
  lat (msec)   : 250=1.33%, 500=0.11%
  cpu          : usr=3.22%, sys=8.03%, ctx=21794, majf=0, minf=38
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,30196,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=47.0GiB/s (50.5GB/s), 47.0GiB/s-47.0GiB/s (50.5GB/s-50.5GB/s), io=473GiB (508GB), run=10053-10053msec
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=128 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=128
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=69.4GiB/s][w=4441 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88940: Mon Oct 28 11:39:12 2024
  write: IOPS=4289, BW=67.2GiB/s (72.2GB/s)(678GiB/10090msec); 0 zone resets
    slat (nsec): min=675, max=6459.4k, avg=24095.93, stdev=239147.26
    clat (usec): min=32, max=342220, avg=29828.23, stdev=40676.39
     lat (msec): min=3, max=342, avg=29.85, stdev=40.66
    clat percentiles (msec):
     |  1.00th=[    5],  5.00th=[    5], 10.00th=[    6], 20.00th=[    7],
     | 30.00th=[    8], 40.00th=[    9], 50.00th=[   11], 60.00th=[   15],
     | 70.00th=[   25], 80.00th=[   48], 90.00th=[   88], 95.00th=[  122],
     | 99.00th=[  188], 99.50th=[  209], 99.90th=[  268], 99.95th=[  284],
     | 99.99th=[  321]
   bw (  MiB/s): min=55360, max=71616, per=100.00%, avg=69232.21, stdev=4050.56, samples=20
   iops        : min= 3460, max= 4476, avg=4326.95, stdev=253.13, samples=20
  lat (usec)   : 50=0.01%, 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.05%, 4=0.56%, 10=47.80%, 20=18.29%, 50=14.44%
  lat (msec)   : 100=11.19%, 250=7.73%, 500=0.16%
  cpu          : usr=2.65%, sys=5.13%, ctx=23220, majf=0, minf=40
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,43279,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=67.2GiB/s (72.2GB/s), 67.2GiB/s-67.2GiB/s (72.2GB/s-72.2GB/s), io=678GiB (728GB), run=10090-10090msec

The table below summarizes the performance measurements collected. With fio taking the time to randomize the unused data buffers, performance can decrease at higher queue depths. If we add the zero_buffers option, performance increases with queue depth as expected.

| Queue depth | Default (scrambled buffers), PCIe NVMe | Default (scrambled buffers), Software NVMe | zero_buffers, PCIe NVMe | zero_buffers, Software NVMe |
|-------------|------------|------------|------------|------------|
| QD1   | 30.5 GiB/s |            | 52.6 GiB/s |            |
| QD4   | 19.8 GiB/s |            | 58.9 GiB/s |            |
| QD64  |            | 8218 MiB/s |            | 47.0 GiB/s |
| QD128 |            | 7757 MiB/s |            | 67.2 GiB/s |

Perf

We can gain more insight into what is happening by using the Linux profiler perf to collect data from jobs with and without the zero_buffers option. I ran fio jobs against the PCIe NVMe device under perf record and viewed the results with perf report. For the job where fio needlessly scrambled the buffers, a whopping 98.27 percent of the samples landed in fio's get_io_u(). With the zero_buffers option enabled, the function appearing most often was __vdso_clock_gettime() at 10.83 percent, and get_io_u() fell to eighth on the list, appearing only 1.60 percent of the time.

Samples: 61K of event 'cycles:P', Event count (approx.): 51751837222
Overhead  Command  Shared Object              Symbol
  98.27%  fio      fio                        [.] get_io_u
   0.25%  fio      [vdso]                     [.] __vdso_clock_gettime
   0.13%  fio      [kernel.kallsyms]          [k] clear_page_erms
   0.07%  fio      fio                        [.] __fill_random_buf
   0.06%  fio      [kernel.kallsyms]          [k] __raw_callee_save___pv_queued_spin_unlock
   0.05%  fio      fio                        [.] calibrate_cpu_clock
   0.04%  fio      ld-linux-x86-64.so.2       [.] do_lookup_x
   0.03%  fio      [kernel.kallsyms]          [k] nvme_poll_cq
   0.03%  fio      fio                        [.] thread_main
   0.03%  fio      fio                        [.] td_io_queue
...
Samples: 40K of event 'cycles:P', Event count (approx.): 1389337265
Overhead  Command  Shared Object          Symbol
  10.83%  fio      [vdso]                 [.] __vdso_clock_gettime
   4.77%  fio      [kernel.kallsyms]      [k] __raw_callee_save___pv_queued_spin_unlock
   4.65%  fio      [kernel.kallsyms]      [k] clear_page_erms
   2.12%  fio      [kernel.kallsyms]      [k] native_write_msr
   2.06%  fio      fio                    [.] calibrate_cpu_clock
   1.83%  fio      [kernel.kallsyms]      [k] clear_bhb_loop
   1.68%  fio      ld-linux-x86-64.so.2   [.] do_lookup_x
   1.60%  fio      fio                    [.] get_io_u
   1.42%  fio      [kernel.kallsyms]      [k] pvclock_clocksource_read_nowd
   1.39%  fio      [kernel.kallsyms]      [k] __schedule
...
perf runs
root@localhost:~/fio-dev/fio-canonical# perf record ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=17.9GiB/s][w=1147 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=15514: Tue Oct 29 14:08:31 2024
  write: IOPS=1146, BW=17.9GiB/s (19.2GB/s)(179GiB/10002msec); 0 zone resets
    slat (usec): min=407, max=2017, avg=870.97, stdev=164.97
    clat (usec): min=271, max=3905, avg=2616.63, stdev=282.72
     lat (usec): min=1214, max=4891, avg=3487.61, stdev=321.84
    clat percentiles (usec):
     |  1.00th=[ 1860],  5.00th=[ 2180], 10.00th=[ 2278], 20.00th=[ 2343],
     | 30.00th=[ 2442], 40.00th=[ 2638], 50.00th=[ 2704], 60.00th=[ 2737],
     | 70.00th=[ 2802], 80.00th=[ 2835], 90.00th=[ 2868], 95.00th=[ 2933],
     | 99.00th=[ 3228], 99.50th=[ 3687], 99.90th=[ 3851], 99.95th=[ 3884],
     | 99.99th=[ 3916]
   bw (  MiB/s): min=18176, max=18565, per=100.00%, avg=18367.67, stdev=131.55, samples=20
   iops        : min= 1136, max= 1160, avg=1147.80, stdev= 8.18, samples=20
  lat (usec)   : 500=0.01%
  lat (msec)   : 2=3.92%, 4=96.10%
  cpu          : usr=99.80%, sys=0.20%, ctx=13, majf=0, minf=36
  IO depths    : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,11466,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=17.9GiB/s (19.2GB/s), 17.9GiB/s-17.9GiB/s (19.2GB/s-19.2GB/s), io=179GiB (192GB), run=10002-10002msec
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.376 MB perf.data (61543 samples) ]
root@localhost:~/fio-dev/fio-canonical# perf report
root@localhost:~/fio-dev/fio-canonical# perf record ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=57.1GiB/s][w=3654 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=15527: Tue Oct 29 14:09:48 2024
  write: IOPS=3658, BW=57.2GiB/s (61.4GB/s)(572GiB/10002msec); 0 zone resets
    slat (nsec): min=1021, max=196845, avg=14056.28, stdev=20093.66
    clat (usec): min=216, max=2842, avg=1074.29, stdev=52.87
     lat (usec): min=245, max=2875, avg=1088.34, stdev=49.64
    clat percentiles (usec):
     |  1.00th=[  979],  5.00th=[ 1004], 10.00th=[ 1020], 20.00th=[ 1045],
     | 30.00th=[ 1057], 40.00th=[ 1074], 50.00th=[ 1074], 60.00th=[ 1090],
     | 70.00th=[ 1090], 80.00th=[ 1106], 90.00th=[ 1123], 95.00th=[ 1139],
     | 99.00th=[ 1205], 99.50th=[ 1270], 99.90th=[ 1450], 99.95th=[ 1598],
     | 99.99th=[ 2343]
   bw (  MiB/s): min=58420, max=58869, per=100.00%, avg=58603.65, stdev=129.91, samples=20
   iops        : min= 3651, max= 3679, avg=3662.55, stdev= 8.13, samples=20
  lat (usec)   : 250=0.01%, 500=0.02%, 750=0.06%, 1000=4.41%
  lat (msec)   : 2=95.49%, 4=0.02%
  cpu          : usr=4.69%, sys=16.11%, ctx=36432, majf=0, minf=37
  IO depths    : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,36588,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=57.2GiB/s (61.4GB/s), 57.2GiB/s-57.2GiB/s (61.4GB/s-61.4GB/s), io=572GiB (614GB), run=10002-10002msec
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 1.698 MB perf.data (40941 samples) ]
root@localhost:~/fio-dev/fio-canonical# perf report

Let us examine the get_io_u() code more closely. The relevant parts are here:

/*
 * Return an io_u to be processed. Gets a buflen and offset, sets direction,
 * etc. The returned io_u is fully ready to be prepped, populated and submitted.
 */
struct io_u *get_io_u(struct thread_data *td)
{
...
	int do_scramble = 0;

...

                if (io_u->ddir == DDIR_WRITE) {
                        if (td->flags & TD_F_REFILL_BUFFERS) {
                                io_u_fill_buffer(td, io_u,
                                        td->o.min_bs[DDIR_WRITE],
                                        io_u->buflen);
                        } else if ((td->flags & TD_F_SCRAMBLE_BUFFERS) &&
                                   !(td->flags & TD_F_COMPRESS) &&
                                   !(td->flags & TD_F_DO_VERIFY)) {
                                do_scramble = 1;
                        }
                }

...

                if (do_scramble)
                        small_content_scramble(io_u);

...

}

For write commands, fio uses various flags to determine whether it needs to scramble the data buffer. If so, it sets the do_scramble flag, which is later used to decide whether to call small_content_scramble() to randomize the contents of the data buffer. When zero_buffers is set, do_scramble remains zero and small_content_scramble() is never called.
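
To see why this matters with 16 MiB buffers, here is a rough sketch of the idea behind small_content_scramble(); it is not fio's exact code. The function walks the data buffer in 512-byte steps and dirties a couple of words in each step, so its cost grows linearly with the block size: a 16 MiB buffer means 32,768 chunks touched per write command.

#include <stdint.h>

/*
 * Rough sketch of the idea behind small_content_scramble(): dirty a couple
 * of 8-byte words in every 512-byte chunk of the buffer. Not fio's exact
 * code, but close enough to show where the CPU time goes.
 */
static void scramble_sketch(char *buf, unsigned long long buflen,
                            unsigned long long file_offset, uint64_t seed)
{
        char *p = buf;
        char *end = buf + buflen;

        /* For a 16 MiB buffer this loop runs 32768 times per command. */
        for (; p < end; p += 512, file_offset += 512) {
                uint64_t *head = (uint64_t *)p;
                uint64_t *tail = (uint64_t *)(p + 512 - sizeof(uint64_t));

                *head = seed ^ file_offset;    /* pseudo-random content */
                *tail = file_offset;           /* block's offset in the file */
        }
}

Even though only a handful of bytes are written per chunk, the loop still walks the entire buffer, which lines up with the default queue depth 4 run above spending 99.84 percent of its time in user space.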

Conclusion

Fio's default options are generally sane and work as expected in most situations. However, new features may not fit the conventions assumed by fio's defaults. Write zeroes is indeed a write command, but unlike most write commands it involves no data buffer, so in this case fio's defaults do not serve us well. Also worth noting is that these tests use an unusually large block size, which amplifies the time fio can spend randomizing the data buffer. For most other write workloads, any time fio spends scrambling the data buffer is dwarfed by the device response time.

All in all, this was an interesting puzzle with a satisfying resolution. Fio's defaults are rarely top of mind because they generally work well, but it pays to be familiar with them.
