How could write zeroes performance be worse at higher queue depths
I am grateful to Adam Manzanares and Klaus Jensen for their support.
Fio recently gained the ability to submit write zeroes commands to NVMe devices via the io_uring_cmd ioengine. It now also has an option to set the write zeroes command's DEAC bit, which signals to the device that it should deallocate the specified blocks if the namespace is set up to return all zeroes when reading deallocated logical blocks.
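Whether a namespace actually returns zeroes for deallocated blocks is advertised in the DLFEAT field of the Identify Namespace data structure. One way to check is with nvme-cli; this is just a sketch, assuming /dev/nvme0n1 is the block node corresponding to /dev/ng0n1 and that your nvme-cli version prints the field under this name:
# DLFEAT bits 2:0 = 001b mean reads of deallocated blocks return all zeroes;
# bit 3 set means the Write Zeroes DEAC bit is supported for this namespace.
nvme id-ns /dev/nvme0n1 --human-readable | grep -i dlfeat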
Testing with the current tip of fio revealed that write zeroes performance with the DEAC bit set was actually worse at higher queue depths. A simple queue depth 1 job with a 16 MiB block size produced 30.5 GiB/s, but at a queue depth of 4 the device achieved only 19.8 GiB/s! How could this be?
Fio runs: 30.5 GiB/s vs 19.8 GiB/s
root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=1
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=29.5GiB/s][w=1885 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12200: Mon Oct 28 17:43:03 2024
write: IOPS=1952, BW=30.5GiB/s (32.8GB/s)(305GiB/10001msec); 0 zone resets
slat (usec): min=149, max=1478, avg=308.93, stdev=76.59
clat (usec): min=171, max=944, avg=202.15, stdev=34.46
lat (usec): min=322, max=1674, avg=511.09, stdev=78.63
clat percentiles (usec):
| 1.00th=[ 176], 5.00th=[ 180], 10.00th=[ 182], 20.00th=[ 184],
| 30.00th=[ 188], 40.00th=[ 190], 50.00th=[ 192], 60.00th=[ 194],
| 70.00th=[ 200], 80.00th=[ 208], 90.00th=[ 233], 95.00th=[ 269],
| 99.00th=[ 355], 99.50th=[ 367], 99.90th=[ 383], 99.95th=[ 408],
| 99.99th=[ 635]
bw ( MiB/s): min=30048, max=32993, per=100.00%, avg=31266.62, stdev=841.97, samples=20
iops : min= 1878, max= 2062, avg=1953.90, stdev=52.76, samples=20
lat (usec) : 250=93.37%, 500=6.61%, 750=0.02%, 1000=0.01%
cpu : usr=60.99%, sys=0.63%, ctx=19526, majf=0, minf=36
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,19524,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=30.5GiB/s (32.8GB/s), 30.5GiB/s-30.5GiB/s (32.8GB/s-32.8GB/s), io=305GiB (328GB), run=10001-10001msec
root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=19.8GiB/s][w=1270 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12211: Mon Oct 28 17:47:01 2024
write: IOPS=1269, BW=19.8GiB/s (21.3GB/s)(199GiB/10002msec); 0 zone resets
slat (usec): min=353, max=1963, avg=786.71, stdev=143.30
clat (usec): min=201, max=3705, avg=2362.09, stdev=246.66
lat (usec): min=1007, max=4482, avg=3148.80, stdev=280.86
clat percentiles (usec):
| 1.00th=[ 1713], 5.00th=[ 1975], 10.00th=[ 2057], 20.00th=[ 2147],
| 30.00th=[ 2212], 40.00th=[ 2376], 50.00th=[ 2442], 60.00th=[ 2474],
| 70.00th=[ 2507], 80.00th=[ 2540], 90.00th=[ 2573], 95.00th=[ 2606],
| 99.00th=[ 2769], 99.50th=[ 3458], 99.90th=[ 3589], 99.95th=[ 3621],
| 99.99th=[ 3654]
bw ( MiB/s): min=20128, max=20736, per=100.00%, avg=20341.97, stdev=138.39, samples=20
iops : min= 1258, max= 1296, avg=1271.10, stdev= 8.60, samples=20
lat (usec) : 250=0.01%
lat (msec) : 2=5.98%, 4=94.03%
cpu : usr=99.84%, sys=0.16%, ctx=15, majf=0, minf=36
IO depths : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,12701,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
WRITE: bw=19.8GiB/s (21.3GB/s), 19.8GiB/s-19.8GiB/s (21.3GB/s-21.3GB/s), io=199GiB (213GB), run=10002-10002msec
Could this be a hardware issue? I decided to carry out a test using an emulated NVMe device to check. I created a RAM disk and used it as the backing store for an NVMe fabrics loopback device.
nvme-fabrics-loop.sh
#!/bin/bash
set -x
# Create a single 32 GiB RAM disk (rd_size is in KiB) to back the target
modprobe brd rd_nr=1 rd_size=33554432
# Load the NVMe target core, loopback transport, and fabrics host modules
modprobe nvmet
modprobe nvme-loop
modprobe nvme-fabrics
# Create the target subsystem and allow any host to connect
cd /sys/kernel/config/nvmet/subsystems/
mkdir nvme-test-target
cd nvme-test-target/
echo 1 > attr_allow_any_host
# Create namespace 1 backed by the RAM disk and enable it
mkdir namespaces/1
cd namespaces/1
echo -n /dev/ram0 > device_path
cat device_path
echo 1 > enable
cat enable
# Create a loopback port and link the subsystem to it
mkdir /sys/kernel/config/nvmet/ports/1
cd /sys/kernel/config/nvmet/ports/1
echo loop > addr_trtype
cat addr_trtype
ln -s /sys/kernel/config/nvmet/subsystems/nvme-test-target/ /sys/kernel/config/nvmet/ports/1/subsystems/nvme-test-target
# Connect to the target from the host side and list the resulting device
nvme discover -t loop
sleep 1
nvme connect -t loop -n nvme-test-target
sleep 1
nvme list
Console log
vincent@localhost $ sudo ./nvme-fabrics-loop.sh
+ modprobe brd rd_nr=1 rd_size=33554432
+ modprobe nvmet
+ modprobe nvme-loop
+ modprobe nvme-fabrics
+ cd /sys/kernel/config/nvmet/subsystems/
+ mkdir nvme-test-target
+ cd nvme-test-target/
+ echo 1
+ mkdir namespaces/1
+ cd namespaces/1
+ echo -n /dev/ram0
+ cat device_path
/dev/ram0
+ echo 1
+ cat enable
1
+ mkdir /sys/kernel/config/nvmet/ports/1
+ cd /sys/kernel/config/nvmet/ports/1
+ echo loop
+ cat addr_trtype
loop
+ ln -s /sys/kernel/config/nvmet/subsystems/nvme-test-target/ /sys/kernel/config/nvmet/ports/1/subsystems/nvme-test-target
+ nvme discover -t loop
Discovery Log Number of Records 2, Generation counter 2
=====Discovery Log Entry 0======
trtype: loop
adrfam: unrecognized
subtype: unrecognized
treq: not specified, sq flow control disable supported
portid: 1
trsvcid:
subnqn: nqn.2014-08.org.nvmexpress.discovery
traddr:
=====Discovery Log Entry 1======
trtype: loop
adrfam: unrecognized
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
portid: 1
trsvcid:
subnqn: nvme-test-target
traddr:
+ sleep 1
+ nvme connect -t loop -n nvme-test-target
+ sleep 1
+ nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 461a6d8df1da98a9e46d Linux 1 34.36 GB / 34.36 GB 512 B + 0 B 6.6.0-rc
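When the emulated device is no longer needed, the target can be torn down by reversing the steps above. The following is an untested sketch using the usual nvmet configfs conventions:
#!/bin/bash
set -x
# Disconnect the host from the test subsystem
nvme disconnect -n nvme-test-target
# Unlink the subsystem from the port and remove the port
rm /sys/kernel/config/nvmet/ports/1/subsystems/nvme-test-target
rmdir /sys/kernel/config/nvmet/ports/1
# Disable and remove the namespace, then the subsystem
echo 0 > /sys/kernel/config/nvmet/subsystems/nvme-test-target/namespaces/1/enable
rmdir /sys/kernel/config/nvmet/subsystems/nvme-test-target/namespaces/1
rmdir /sys/kernel/config/nvmet/subsystems/nvme-test-target
# Unload the modules and free the RAM disk
modprobe -r nvme-loop nvme-fabrics nvmet brd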
The bandwidth levels attained were much lower than those observed on the PCIe NVMe device, but it was still possible to elicit worse performance at higher queue depths: at a queue depth of 64 I measured write zeroes bandwidth of 8218 MiB/s, but at a queue depth of 128 bandwidth declined to 7757 MiB/s. Since even a software-emulated device shows the slowdown, the hardware device is unlikely to be responsible.
Fio runs: 8218 MiB/s vs 7757 MiB/s
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=64
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=64
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=8857MiB/s][w=553 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88098: Mon Oct 28 10:59:14 2024
write: IOPS=507, BW=8218MiB/s (8617MB/s)(80.4GiB/10017msec); 0 zone resets
slat (usec): min=492, max=21015, avg=1964.80, stdev=1632.94
clat (msec): min=15, max=180, avg=123.40, stdev=15.82
lat (msec): min=17, max=181, avg=125.37, stdev=15.76
clat percentiles (msec):
| 1.00th=[ 93], 5.00th=[ 102], 10.00th=[ 106], 20.00th=[ 111],
| 30.00th=[ 115], 40.00th=[ 118], 50.00th=[ 123], 60.00th=[ 127],
| 70.00th=[ 132], 80.00th=[ 138], 90.00th=[ 144], 95.00th=[ 150],
| 99.00th=[ 161], 99.50th=[ 165], 99.90th=[ 174], 99.95th=[ 178],
| 99.99th=[ 180]
bw ( MiB/s): min= 7040, max= 8977, per=98.97%, avg=8133.66, stdev=627.05, samples=20
iops : min= 440, max= 561, avg=508.20, stdev=39.30, samples=20
lat (msec) : 20=0.06%, 50=0.26%, 100=3.44%, 250=97.48%
cpu : usr=59.46%, sys=0.00%, ctx=1362, majf=0, minf=39
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,5081,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=8218MiB/s (8617MB/s), 8218MiB/s-8218MiB/s (8617MB/s-8617MB/s), io=80.4GiB (86.3GB), run=10017-10017msec
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=128
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=128
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=7664MiB/s][w=479 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88175: Mon Oct 28 10:59:35 2024
write: IOPS=472, BW=7757MiB/s (8134MB/s)(75.9GiB/10022msec); 0 zone resets
slat (usec): min=587, max=10235, avg=2109.36, stdev=1495.72
clat (msec): min=20, max=319, avg=265.18, stdev=25.92
lat (msec): min=21, max=321, avg=267.28, stdev=25.82
clat percentiles (msec):
| 1.00th=[ 112], 5.00th=[ 247], 10.00th=[ 253], 20.00th=[ 257],
| 30.00th=[ 262], 40.00th=[ 264], 50.00th=[ 268], 60.00th=[ 271],
| 70.00th=[ 275], 80.00th=[ 279], 90.00th=[ 284], 95.00th=[ 288],
| 99.00th=[ 300], 99.50th=[ 305], 99.90th=[ 313], 99.95th=[ 313],
| 99.99th=[ 321]
bw ( MiB/s): min= 7264, max= 7951, per=97.62%, avg=7572.61, stdev=192.38, samples=20
iops : min= 454, max= 496, avg=473.00, stdev=11.85, samples=20
lat (msec) : 50=0.38%, 100=0.55%, 250=6.38%, 500=95.37%
cpu : usr=63.46%, sys=0.00%, ctx=1384, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,4731,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
WRITE: bw=7757MiB/s (8134MB/s), 7757MiB/s-7757MiB/s (8134MB/s-8134MB/s), io=75.9GiB (81.5GB), run=10022-10022msec
What could be going on? After some head scratching, gnashing of teeth, and pulling of hair, along with some research, investigation, and testing, fio's documentation and source code revealed the answer. For write operations, fio by default scrambles the data buffer before submitting each command. With the large block sizes used in these workloads, the time fio spends scrambling the data buffer takes away from the time available to submit commands and reap completions. Since we are sending write zeroes commands, the data buffer contents are irrelevant to our workload. In fact, fio does not even send a data buffer address with these write zeroes commands, so it should not be spending time randomizing the buffer's contents.
We can add the --scramble_buffers=0
option to our jobs to disable this
behavior. Fio will simply leave the data buffer untouched for each command. The
--zero_buffers
option will also suffice and directs fio to fill data buffers
with all zeroes and then leave them untouched for the rest of the run.
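For example, the queue depth 4 job from earlier could be rerun with scrambling disabled explicitly (not rerun here; for this write zeroes workload it should behave the same as --zero_buffers because no data buffer is transferred):
./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4 --scramble_buffers=0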
With the --zero_buffers
option our hardware NVMe device now produces
bandwidth of 52.6 GiB/s at queue depth 1. This scales with queue depth and
increases to 58.9 GiB/s at queue depth 4.
Fio runs: 52.6 GiB/s vs 58.9 GiB/s
root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=1
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=52.6GiB/s][w=3369 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12279: Mon Oct 28 18:41:23 2024
write: IOPS=3364, BW=52.6GiB/s (56.4GB/s)(526GiB/10001msec); 0 zone resets
slat (nsec): min=1886, max=1247.0k, avg=3365.45, stdev=7045.02
clat (usec): min=165, max=1650, avg=292.40, stdev=130.16
lat (usec): min=187, max=1653, avg=295.77, stdev=130.17
clat percentiles (usec):
| 1.00th=[ 192], 5.00th=[ 200], 10.00th=[ 206], 20.00th=[ 210],
| 30.00th=[ 212], 40.00th=[ 217], 50.00th=[ 221], 60.00th=[ 229],
| 70.00th=[ 241], 80.00th=[ 502], 90.00th=[ 519], 95.00th=[ 529],
| 99.00th=[ 537], 99.50th=[ 545], 99.90th=[ 660], 99.95th=[ 906],
| 99.99th=[ 1074]
bw ( MiB/s): min=53482, max=54092, per=100.00%, avg=53879.90, stdev=126.79, samples=20
iops : min= 3342, max= 3380, avg=3367.20, stdev= 7.93, samples=20
lat (usec) : 250=73.03%, 500=6.74%, 750=20.17%, 1000=0.04%
lat (msec) : 2=0.02%
cpu : usr=2.61%, sys=3.23%, ctx=33648, majf=0, minf=37
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,33648,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=52.6GiB/s (56.4GB/s), 52.6GiB/s-52.6GiB/s (56.4GB/s-56.4GB/s), io=526GiB (565GB), run=10001-10001msec
root@localhost:~/fio-dev/fio-canonical# ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=57.2GiB/s][w=3659 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12285: Mon Oct 28 18:42:27 2024
write: IOPS=3770, BW=58.9GiB/s (63.3GB/s)(589GiB/10002msec); 0 zone resets
slat (nsec): min=1329, max=50623, avg=3205.78, stdev=1682.49
clat (usec): min=188, max=2794, avg=1056.13, stdev=37.97
lat (usec): min=208, max=2797, avg=1059.33, stdev=37.82
clat percentiles (usec):
| 1.00th=[ 1012], 5.00th=[ 1020], 10.00th=[ 1029], 20.00th=[ 1037],
| 30.00th=[ 1045], 40.00th=[ 1045], 50.00th=[ 1057], 60.00th=[ 1057],
| 70.00th=[ 1057], 80.00th=[ 1074], 90.00th=[ 1090], 95.00th=[ 1106],
| 99.00th=[ 1156], 99.50th=[ 1221], 99.90th=[ 1418], 99.95th=[ 1483],
| 99.99th=[ 2114]
bw ( MiB/s): min=58464, max=61088, per=100.00%, avg=60381.51, stdev=839.68, samples=20
iops : min= 3654, max= 3818, avg=3773.65, stdev=52.43, samples=20
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.41%
lat (msec) : 2=99.56%, 4=0.01%
cpu : usr=2.84%, sys=3.58%, ctx=37684, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,37714,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
WRITE: bw=58.9GiB/s (63.3GB/s), 58.9GiB/s-58.9GiB/s (63.3GB/s-63.3GB/s), io=589GiB (633GB), run=10002-10002msec
Testing our emulated NVMe device with the --zero_buffers
option, bandwidth
explodes to 47.0 GiB/s at queue depth 64 and increases as expected to 67.2
GiB/s at queue depth 128.
Fio runs: 47.0 GiB/s vs 67.2 GiB/s
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=64 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=64
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=47.6GiB/s][w=3044 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88732: Mon Oct 28 11:38:46 2024
write: IOPS=3003, BW=47.0GiB/s (50.5GB/s)(473GiB/10053msec); 0 zone resets
slat (nsec): min=593, max=9361.6k, avg=19932.84, stdev=101944.11
clat (usec): min=193, max=335535, avg=21309.85, stdev=21609.98
lat (msec): min=3, max=335, avg=21.33, stdev=21.60
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 7], 10.00th=[ 8], 20.00th=[ 10],
| 30.00th=[ 12], 40.00th=[ 14], 50.00th=[ 16], 60.00th=[ 20],
| 70.00th=[ 23], 80.00th=[ 28], 90.00th=[ 38], 95.00th=[ 51],
| 99.00th=[ 120], 99.50th=[ 155], 99.90th=[ 253], 99.95th=[ 288],
| 99.99th=[ 313]
bw ( MiB/s): min=44640, max=54188, per=100.00%, avg=48347.36, stdev=2747.91, samples=20
iops : min= 2790, max= 3386, avg=3021.40, stdev=171.71, samples=20
lat (usec) : 250=0.01%
lat (msec) : 4=0.35%, 10=20.26%, 20=42.77%, 50=31.77%, 100=3.60%
lat (msec) : 250=1.33%, 500=0.11%
cpu : usr=3.22%, sys=8.03%, ctx=21794, majf=0, minf=38
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,30196,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=47.0GiB/s (50.5GB/s), 47.0GiB/s-47.0GiB/s (50.5GB/s-50.5GB/s), io=473GiB (508GB), run=10053-10053msec
vincent@localhost:~/fio-dev/fio-canonical$ sudo ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=128 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=128
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=69.4GiB/s][w=4441 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=88940: Mon Oct 28 11:39:12 2024
write: IOPS=4289, BW=67.2GiB/s (72.2GB/s)(678GiB/10090msec); 0 zone resets
slat (nsec): min=675, max=6459.4k, avg=24095.93, stdev=239147.26
clat (usec): min=32, max=342220, avg=29828.23, stdev=40676.39
lat (msec): min=3, max=342, avg=29.85, stdev=40.66
clat percentiles (msec):
| 1.00th=[ 5], 5.00th=[ 5], 10.00th=[ 6], 20.00th=[ 7],
| 30.00th=[ 8], 40.00th=[ 9], 50.00th=[ 11], 60.00th=[ 15],
| 70.00th=[ 25], 80.00th=[ 48], 90.00th=[ 88], 95.00th=[ 122],
| 99.00th=[ 188], 99.50th=[ 209], 99.90th=[ 268], 99.95th=[ 284],
| 99.99th=[ 321]
bw ( MiB/s): min=55360, max=71616, per=100.00%, avg=69232.21, stdev=4050.56, samples=20
iops : min= 3460, max= 4476, avg=4326.95, stdev=253.13, samples=20
lat (usec) : 50=0.01%, 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%
lat (usec) : 1000=0.01%
lat (msec) : 2=0.05%, 4=0.56%, 10=47.80%, 20=18.29%, 50=14.44%
lat (msec) : 100=11.19%, 250=7.73%, 500=0.16%
cpu : usr=2.65%, sys=5.13%, ctx=23220, majf=0, minf=40
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,43279,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
WRITE: bw=67.2GiB/s (72.2GB/s), 67.2GiB/s-67.2GiB/s (72.2GB/s-72.2GB/s), io=678GiB (728GB), run=10090-10090msec
The table below summarizes the performance measurements collected. With fio
taking the time to randomize the unused data buffers, performance can decrease
at higher queue depths. If we add the zero_buffers
option, performance
increases with queue depth as expected.
| Queue depth | Default scrambled buffers: PCIe NVMe | Default scrambled buffers: Software NVMe | zero_buffers: PCIe NVMe | zero_buffers: Software NVMe |
| --- | --- | --- | --- | --- |
| QD1 | 30.5 GiB/s | | 52.6 GiB/s | |
| QD4 | 19.8 GiB/s | | 58.9 GiB/s | |
| QD64 | | 8218 MiB/s | | 47.0 GiB/s |
| QD128 | | 7757 MiB/s | | 67.2 GiB/s |
We can gain more insight into what is happening by using the Linux profiler
perf to collect data from jobs with and
without the zero_buffers
option. I ran fio jobs against the PCIe NVMe device
using perf record
and viewed the results with perf report
. The job with fio
needlessly scrambling the buffers showed that a whopping 98.27 percent of the
time, the CPU was running in fio's get_io_u()
. When the zero_buffers
option
was activated, the function that appeared the most was __vdso_clock_gettime()
at 10.83 percent. get_io_u()
fell to eighth on the list appearing only 1.60
percent of the time.
Samples: 61K of event 'cycles:P', Event count (approx.): 51751837222
Overhead Command Shared Object Symbol
98.27% fio fio [.] get_io_u
0.25% fio [vdso] [.] __vdso_clock_gettime
0.13% fio [kernel.kallsyms] [k] clear_page_erms
0.07% fio fio [.] __fill_random_buf
0.06% fio [kernel.kallsyms] [k] __raw_callee_save___pv_queued_spin_unlock
0.05% fio fio [.] calibrate_cpu_clock
0.04% fio ld-linux-x86-64.so.2 [.] do_lookup_x
0.03% fio [kernel.kallsyms] [k] nvme_poll_cq
0.03% fio fio [.] thread_main
0.03% fio fio [.] td_io_queue
...
Samples: 40K of event 'cycles:P', Event count (approx.): 1389337265
Overhead Command Shared Object Symbol
10.83% fio [vdso] [.] __vdso_clock_gettime
4.77% fio [kernel.kallsyms] [k] __raw_callee_save___pv_queued_spin_unlock
4.65% fio [kernel.kallsyms] [k] clear_page_erms
2.12% fio [kernel.kallsyms] [k] native_write_msr
2.06% fio fio [.] calibrate_cpu_clock
1.83% fio [kernel.kallsyms] [k] clear_bhb_loop
1.68% fio ld-linux-x86-64.so.2 [.] do_lookup_x
1.60% fio fio [.] get_io_u
1.42% fio [kernel.kallsyms] [k] pvclock_clocksource_read_nowd
1.39% fio [kernel.kallsyms] [k] __schedule
...
perf runs
root@localhost:~/fio-dev/fio-canonical# perf record ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=17.9GiB/s][w=1147 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=15514: Tue Oct 29 14:08:31 2024
write: IOPS=1146, BW=17.9GiB/s (19.2GB/s)(179GiB/10002msec); 0 zone resets
slat (usec): min=407, max=2017, avg=870.97, stdev=164.97
clat (usec): min=271, max=3905, avg=2616.63, stdev=282.72
lat (usec): min=1214, max=4891, avg=3487.61, stdev=321.84
clat percentiles (usec):
| 1.00th=[ 1860], 5.00th=[ 2180], 10.00th=[ 2278], 20.00th=[ 2343],
| 30.00th=[ 2442], 40.00th=[ 2638], 50.00th=[ 2704], 60.00th=[ 2737],
| 70.00th=[ 2802], 80.00th=[ 2835], 90.00th=[ 2868], 95.00th=[ 2933],
| 99.00th=[ 3228], 99.50th=[ 3687], 99.90th=[ 3851], 99.95th=[ 3884],
| 99.99th=[ 3916]
bw ( MiB/s): min=18176, max=18565, per=100.00%, avg=18367.67, stdev=131.55, samples=20
iops : min= 1136, max= 1160, avg=1147.80, stdev= 8.18, samples=20
lat (usec) : 500=0.01%
lat (msec) : 2=3.92%, 4=96.10%
cpu : usr=99.80%, sys=0.20%, ctx=13, majf=0, minf=36
IO depths : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,11466,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
WRITE: bw=17.9GiB/s (19.2GB/s), 17.9GiB/s-17.9GiB/s (19.2GB/s-19.2GB/s), io=179GiB (192GB), run=10002-10002msec
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.376 MB perf.data (61543 samples) ]
root@localhost:~/fio-dev/fio-canonical# perf report
root@localhost:~/fio-dev/fio-canonical# perf record ./fio --name=test --ioengine=io_uring_cmd --filename=/dev/ng0n1 --write_mode=zeroes --bs=16M --rw=write --time_based --runtime=10s --ramp_time=5s --deac=1 --iodepth=4 --zero_buffers
test: (g=0): rw=write, bs=(R) 16.0MiB-16.0MiB, (W) 16.0MiB-16.0MiB, (T) 16.0MiB-16.0MiB, ioengine=io_uring_cmd, iodepth=4
fio-3.38-7-ga1916
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=57.1GiB/s][w=3654 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=15527: Tue Oct 29 14:09:48 2024
write: IOPS=3658, BW=57.2GiB/s (61.4GB/s)(572GiB/10002msec); 0 zone resets
slat (nsec): min=1021, max=196845, avg=14056.28, stdev=20093.66
clat (usec): min=216, max=2842, avg=1074.29, stdev=52.87
lat (usec): min=245, max=2875, avg=1088.34, stdev=49.64
clat percentiles (usec):
| 1.00th=[ 979], 5.00th=[ 1004], 10.00th=[ 1020], 20.00th=[ 1045],
| 30.00th=[ 1057], 40.00th=[ 1074], 50.00th=[ 1074], 60.00th=[ 1090],
| 70.00th=[ 1090], 80.00th=[ 1106], 90.00th=[ 1123], 95.00th=[ 1139],
| 99.00th=[ 1205], 99.50th=[ 1270], 99.90th=[ 1450], 99.95th=[ 1598],
| 99.99th=[ 2343]
bw ( MiB/s): min=58420, max=58869, per=100.00%, avg=58603.65, stdev=129.91, samples=20
iops : min= 3651, max= 3679, avg=3662.55, stdev= 8.13, samples=20
lat (usec) : 250=0.01%, 500=0.02%, 750=0.06%, 1000=4.41%
lat (msec) : 2=95.49%, 4=0.02%
cpu : usr=4.69%, sys=16.11%, ctx=36432, majf=0, minf=37
IO depths : 1=0.0%, 2=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,36588,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=4
Run status group 0 (all jobs):
WRITE: bw=57.2GiB/s (61.4GB/s), 57.2GiB/s-57.2GiB/s (61.4GB/s-61.4GB/s), io=572GiB (614GB), run=10002-10002msec
[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 1.698 MB perf.data (40941 samples) ]
root@localhost:~/fio-dev/fio-canonical# perf report
Let us examine the get_io_u()
code more closely. The relevant parts are here:
/*
 * Return an io_u to be processed. Gets a buflen and offset, sets direction,
 * etc. The returned io_u is fully ready to be prepped, populated and submitted.
 */
struct io_u *get_io_u(struct thread_data *td)
{
    ...
    int do_scramble = 0;
    ...
    if (io_u->ddir == DDIR_WRITE) {
        if (td->flags & TD_F_REFILL_BUFFERS) {
            io_u_fill_buffer(td, io_u,
                    td->o.min_bs[DDIR_WRITE],
                    io_u->buflen);
        } else if ((td->flags & TD_F_SCRAMBLE_BUFFERS) &&
                !(td->flags & TD_F_COMPRESS) &&
                !(td->flags & TD_F_DO_VERIFY)) {
            do_scramble = 1;
        }
    }
    ...
    if (do_scramble)
        small_content_scramble(io_u);
    ...
}
For write commands, fio uses various flags to determine if it needs to scramble
the data buffer. If so, it sets the do_scramble
flag which later on is used
to determine whether or not to call small_content_scramble()
to randomize the
contents of the data buffer. When zero_buffers
is set, do_scramble
remains
false and small_content_scramble()
is not called.
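Note that small_content_scramble() does not appear as a separate symbol in the perf output above; it is presumably inlined into get_io_u(), which would explain why get_io_u() itself accounts for nearly all of the samples. One way to confirm this (not run for this post) would be to annotate the hot symbol from the recorded profile:
# Show per-instruction sample counts within get_io_u from the existing perf.data
perf annotate get_io_u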
Fio's default options are generally sane and work as expected in most situations. However, new features may not fit the conventions assumed by fio's defaults. Write zeroes is indeed a write command, but unlike most write commands it involves no data buffer, so in this case fio's defaults do not serve us well. Also worth noting is that these tests use an unusually large block size, which amplifies the time fio can spend randomizing the data buffer. For most other write workloads, any time fio spends scrambling the data buffer is dwarfed by the device response time.
All in all, this was an interesting puzzle with a satisfying resolution. Fio's defaults are rarely top of mind because they generally work well, but it can be helpful to be familiar with them.