xNVMe ioengine Part 2 - vincentkfu/fio-blog GitHub Wiki
This blog post is co-authored by Ankit Kumar and Vincent Fu. We are grateful to Kanchan Joshi, Adam Manzanares, and Krishna Kanth Reddy for their helpful comments, but of course all remaining errors and omissions are ours alone.
In Part 1 we covered the basics of building and using xNVMe, provided a few examples, and compared the performance of xNVMe backends with fio ioengines. In this entry we will cover some xNVMe ioengine options, provide more performance comparisons (including the new io_uring_cmd backend), and discuss xNVMe's SPDK backend.
The xNVMe ioengine provides options for polling and vectored I/O and allows selection of the submission path for NVMe administrative commands. Support for polling and vectored I/O options varies by backend. These options are:
Option | Description | Availability
---|---|---
`hipri` | If this option is set, fio will attempt to use polled I/O completions. Normal I/O completions generate interrupts to signal the completion of I/O, but polled completions do not. Hence they require active reaping by the application. The benefits are more efficient I/O for high-IOPS scenarios and lower latencies for low queue depth I/O. These benefits come at the cost of potentially higher CPU utilization. | Only supported by `xnvme_async={libaio, io_uring, io_uring_cmd}`
`sqthread_poll` | Normally fio will submit I/O by issuing a system call to notify the kernel of available items in the submission queue. If this option is set, the act of submitting I/O will be done by a kernel polling thread. This reduces context switches from user space to kernel space and also frees up cycles for fio, but comes at the cost of using more CPU in the system. | Only supported by `xnvme_async={io_uring, io_uring_cmd}`
`xnvme_iovec` | If set, xNVMe will use vectored read/write commands. For non-vectored I/O the submission queue entry `addr` field contains the source or target buffer address and the `len` field contains the byte count for the I/O transfer. For vectored I/O, the submission queue entry `len` field instead specifies the length of the vector of buffers at the `addr` field. Because fio can currently only send vectors of length 1, this option is admittedly not useful. | Supported by all the sync and async backends
To select the NVMe administrative command submission pathway, specify:
xnvme_admin={nvme, block}
The table below describes the two options:
Pathway | Description
---|---
`nvme` | This is the default option and uses the Linux/FreeBSD NVMe driver `ioctl()`.
`block` | If specified, xNVMe will instead use the Linux/FreeBSD block layer `ioctl()`s and sysfs entries for administrative commands. This option is useful when using xNVMe to access a device (such as null_blk) that is not an NVMe device and for which the NVMe driver ioctls will not be available. For more details see the discussion in the xNVMe documentation.
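As an illustrative sketch (the null_blk device node `/dev/nullb0` and the section name are our own assumptions, not taken from this post), a job exercising the block pathway might look like:

```ini
[nullblk]
ioengine=xnvme
; null_blk is not an NVMe device, so the NVMe driver ioctl()s are unavailable;
; send administrative commands through the block layer instead
xnvme_admin=block
filename=/dev/nullb0
rw=randread
bs=4k
size=1G
```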
Let us examine the impact of xNVMe's polling options with some performance data. The table below lists the key hardware and software components of our evaluation system. It comprises low-cost commodity hardware so that the environment can be reconstructed and the experiments reproduced with little effort and expense. We use a low-latency SSD as the target device.
Item | Description |
---|---|
CPU | AMD Ryzen 9 5900X 4.9GHz |
Memory | 16GB DDR4 2667 MHz |
Motherboard | MEG X570 GODLIKE (MS-7C34) |
Linux | Ubuntu 22.04 LTS |
fio | 3.32 |
xNVMe | 0.5.0 |
gcc | 11.2.0 |
kernel | 6.0.0-rc2 |
This is the same hardware platform that we used in Part 1 although we are now using a later build of fio, a later release of xNVMe, and a newer kernel.
Let us explore the two polling options for the xNVMe ioengine with the `xnvme_async=io_uring` backend. We will focus on CPU utilization and the number of context switches.
To effectively use completion polling (`hipri`) we need to set the NVMe driver's `poll_queues` option, which specifies the number of queues the driver uses for polled I/O. The default is zero. Instantiate one queue for use with polled I/O with the following commands:
cat /sys/block/nvme0n1/queue/io_poll
0
rmmod nvme
rmmod nvme-core
insmod drivers/nvme/host/nvme-core.ko
insmod drivers/nvme/host/nvme.ko poll_queues=1
cat /sys/block/nvme0n1/queue/io_poll
1
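Reloading the modules by hand does not persist across reboots. When the NVMe driver is built as a module, one common alternative (the file name below is our own choice; any `.conf` file under `/etc/modprobe.d/` works) is to set the option in a modprobe configuration file:

```
# /etc/modprobe.d/nvme-poll.conf
options nvme poll_queues=1
```

For a driver built into the kernel, the equivalent is adding `nvme.poll_queues=1` to the kernel command line.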
Below is the fio job file used for our tests:
xnvme-compare1.fio
[global]
thread=1
filename=/dev/nvme0n1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
cpus_allowed=1
norandommap
direct=1
[default]
ioengine=xnvme
xnvme_async=io_uring
[hipri]
ioengine=xnvme
xnvme_async=io_uring
hipri=1
[sqpoll]
ioengine=xnvme
xnvme_async=io_uring
sqthread_poll=1
Here is the output for the three different configurations we tested:
Default configuration
$ fio xnvme-compare1.fio --section=default
default: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=446MiB/s][r=114k IOPS][eta 00m:00s]
default: (groupid=0, jobs=1): err= 0: pid=519771: Thu Oct 13 01:07:36 2022
read: IOPS=114k, BW=445MiB/s (467MB/s)(26.1GiB/60001msec)
slat (nsec): min=800, max=30210, avg=873.33, stdev=59.77
clat (nsec): min=110, max=975201, avg=7694.55, stdev=836.92
lat (nsec): min=7370, max=976301, avg=8567.89, stdev=839.26
clat percentiles (nsec):
| 1.00th=[ 7520], 5.00th=[ 7584], 10.00th=[ 7584], 20.00th=[ 7584],
| 30.00th=[ 7648], 40.00th=[ 7648], 50.00th=[ 7648], 60.00th=[ 7648],
| 70.00th=[ 7648], 80.00th=[ 7648], 90.00th=[ 7712], 95.00th=[ 7840],
| 99.00th=[ 8512], 99.50th=[11968], 99.90th=[14272], 99.95th=[14400],
| 99.99th=[15936]
bw ( KiB/s): min=444448, max=458208, per=100.00%, avg=455874.96, stdev=1297.00, samples=119
iops : min=111112, max=114552, avg=113968.74, stdev=324.25, samples=119
lat (nsec) : 250=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=99.29%, 20=0.71%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
cpu : usr=7.48%, sys=21.07%, ctx=6836163, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6836299,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=445MiB/s (467MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s), io=26.1GiB (28.0GB), run=60001-60001msec
Polled completions with hipri
$ fio xnvme-compare1.fio --section=hipri
hipri: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=595MiB/s][r=152k IOPS][eta 00m:00s]
hipri: (groupid=0, jobs=1): err= 0: pid=519927: Thu Oct 13 01:11:25 2022
read: IOPS=152k, BW=595MiB/s (624MB/s)(34.9GiB/60001msec)
slat (nsec): min=750, max=104598, avg=816.08, stdev=67.95
clat (nsec): min=80, max=23389, avg=5606.79, stdev=470.59
lat (nsec): min=5960, max=110448, avg=6422.86, stdev=474.30
clat percentiles (nsec):
| 1.00th=[ 5472], 5.00th=[ 5472], 10.00th=[ 5536], 20.00th=[ 5536],
| 30.00th=[ 5536], 40.00th=[ 5536], 50.00th=[ 5536], 60.00th=[ 5600],
| 70.00th=[ 5600], 80.00th=[ 5600], 90.00th=[ 5600], 95.00th=[ 5664],
| 99.00th=[ 6560], 99.50th=[ 9024], 99.90th=[12224], 99.95th=[12224],
| 99.99th=[12608]
bw ( KiB/s): min=607336, max=611000, per=100.00%, avg=609484.77, stdev=1179.68, samples=119
iops : min=151834, max=152750, avg=152371.19, stdev=294.92, samples=119
lat (nsec) : 100=0.01%, 250=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=99.59%, 20=0.40%, 50=0.01%
cpu : usr=5.91%, sys=94.09%, ctx=243, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=9140795,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=595MiB/s (624MB/s), 595MiB/s-595MiB/s (624MB/s-624MB/s), io=34.9GiB (37.4GB), run=60001-60001msec
Submission queue polling with sqthread_poll
$ fio xnvme-compare1.fio --section=sqpoll
sqpoll: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=499MiB/s][r=128k IOPS][eta 00m:00s]
sqpoll: (groupid=0, jobs=1): err= 0: pid=519964: Thu Oct 13 01:14:31 2022
read: IOPS=128k, BW=498MiB/s (523MB/s)(29.2GiB/60000msec)
slat (nsec): min=120, max=2680, avg=241.33, stdev=42.61
clat (usec): min=6, max=23381, avg= 7.47, stdev= 8.59
lat (usec): min=7, max=23383, avg= 7.71, stdev= 8.59
clat percentiles (nsec):
| 1.00th=[ 7328], 5.00th=[ 7328], 10.00th=[ 7328], 20.00th=[ 7392],
| 30.00th=[ 7392], 40.00th=[ 7456], 50.00th=[ 7456], 60.00th=[ 7456],
| 70.00th=[ 7456], 80.00th=[ 7456], 90.00th=[ 7520], 95.00th=[ 7520],
| 99.00th=[ 8160], 99.50th=[ 9792], 99.90th=[14272], 99.95th=[14272],
| 99.99th=[15936]
bw ( KiB/s): min=482216, max=511152, per=100.00%, avg=510294.45, stdev=2935.98, samples=119
iops : min=120554, max=127788, avg=127573.61, stdev=733.99, samples=119
lat (usec) : 10=99.51%, 20=0.49%, 50=0.01%
lat (msec) : 10=0.01%, 50=0.01%
cpu : usr=99.98%, sys=0.00%, ctx=112, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=7653842,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=498MiB/s (523MB/s), 498MiB/s-498MiB/s (523MB/s-523MB/s), io=29.2GiB (31.3GB), run=60000-60000msec
The table below summarizes the results for the three different configurations, listing IOPS, latency, user-space CPU utilization, system CPU utilization, and the number of context switches experienced by fio. Note that fio's reported user-space and system (kernel) CPU usage and context switch counts are obtained from a call to `getrusage` with the `RUSAGE_SELF` flag. This includes only resource usage by the calling process (fio) and its children; CPU usage and context switches from other contexts are not included.
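On Linux, the context switch counters that `getrusage` aggregates can also be read directly from procfs for quick spot checks (per-thread values live under `/proc/<pid>/task/<tid>/status`). A minimal example:

```shell
# Print the voluntary and involuntary context switch counts
# for the process reading its own status file:
grep ctxt /proc/self/status
```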
 | default | hipri | sqthread_poll
---|---|---|---
Mean total latency (ns) | 8568 | 6423 | 7710
IOPS | 114K | 152K | 128K
CPU usr | 7.48% | 5.91% | 99.98%
CPU sys | 21.07% | 94.09% | 0%
Context switches | 6836163 | 243 | 112
Comparing the results yields the following observations:
- default: The most striking result is the large number of context switches, essentially one for each read command. These come from completion interrupts as well as the system calls fio makes to submit and reap completions. Compared to the other jobs, the default configuration has much lower CPU utilization, slightly lower IOPS, and slightly higher latency.
- `hipri`: This enables completion polling via the `IORING_SETUP_IOPOLL` flag. This configuration busy-waits for I/O completions, providing lower latency but much higher CPU utilization compared to the default. CPU consumption is mostly in the kernel context, as most of the CPU usage occurs while the kernel polls the device for completions. As there are no interrupts from the NVMe driver for I/O completions, the context switches measured by fio are few.
- `sqthread_poll`: This enables submission queue polling via the `IORING_SETUP_SQPOLL` flag. As this creates a kernel thread for submission queue polling, there is no context switch during I/O submission. Furthermore, because this kernel thread submits I/O commands from a different CPU than the one running the fio process, the fio process is not interrupted by completion interrupts. Moreover, when submission queue polling is enabled, xNVMe (and fio's io_uring ioengine as well) essentially polls the completion queue from user space. This explains the high user-space CPU consumption, as CPU cycles are consumed by request submission and polling of the completion queue. To inspect system CPU usage and observe the resource usage of the kernel polling thread, use either `htop` or the `sar` utility. For `sar`, the following will report CPU usage for each core at one-second intervals:

sar -P ALL 1

For more information see cpu-utilization. Also see the note at the end for a libaio polling example.
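The "essentially one context switch per read" observation for the default configuration can be sanity-checked with quick arithmetic from the fio output above:

```shell
# Figures from the default job's output:
ios=6836299   # issued reads
ctx=6836163   # context switches
# Context switches as an integer percentage of issued I/Os:
ratio=$(( ctx * 100 / ios ))
echo "${ratio}%"   # 99% -- about one context switch per I/O
```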
Now let us continue the comparison between the xNVMe ioengine's backends and fio ioengines from Part 1.
We show a few more workloads to illustrate the performance penalty introduced by xNVMe. The first series of experiments compares xNVMe's io_uring backend against fio's io_uring ioengine with polled completions. Then we will compare the xNVMe io_uring_cmd backend with fio's new io_uring_cmd ioengine. Later on we will also see how the xNVMe ioengine using the SPDK driver fares against the default SPDK fio ioengine.
The job file we used for the experiments is below.
xnvme-compare2.fio
[global]
thread=1
time_based
runtime=1m
numjobs=1
rw=randread
bs=4k
cpus_allowed=1
norandommap
[uring]
ioengine=io_uring
direct=1
hipri=1
nonvectored=1
[xnvme_uring]
ioengine=xnvme
xnvme_async=io_uring
direct=1
hipri=1
[ucmd]
ioengine=io_uring_cmd
nonvectored=1
[xnvme_ucmd]
ioengine=xnvme
xnvme_async=io_uring_cmd
[ucmd_sqthread]
ioengine=io_uring_cmd
nonvectored=1
sqthread_poll=1
[xnvme_ucmd_sqthread]
ioengine=xnvme
xnvme_async=io_uring_cmd
sqthread_poll=1
Here is the output for the six different configurations:
io_uring ioengine with polled completions
$ fio xnvme-compare2.fio --section=uring --filename=/dev/nvme0n1
uring: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.32-39-g0574e
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=597MiB/s][r=153k IOPS][eta 00m:00s]
uring: (groupid=0, jobs=1): err= 0: pid=520027: Thu Oct 13 01:21:39 2022
read: IOPS=153k, BW=597MiB/s (626MB/s)(35.0GiB/60001msec)
slat (nsec): min=700, max=18049, avg=762.65, stdev=59.67
clat (nsec): min=80, max=26059, avg=5632.44, stdev=471.87
lat (nsec): min=5920, max=29739, avg=6395.10, stdev=474.23
clat percentiles (nsec):
| 1.00th=[ 5472], 5.00th=[ 5536], 10.00th=[ 5536], 20.00th=[ 5536],
| 30.00th=[ 5536], 40.00th=[ 5600], 50.00th=[ 5600], 60.00th=[ 5600],
| 70.00th=[ 5600], 80.00th=[ 5600], 90.00th=[ 5664], 95.00th=[ 5792],
| 99.00th=[ 6560], 99.50th=[ 9152], 99.90th=[12224], 99.95th=[12224],
| 99.99th=[12864]
bw ( KiB/s): min=610264, max=611352, per=100.00%, avg=611085.98, stdev=280.43, samples=119
iops : min=152566, max=152838, avg=152771.50, stdev=70.11, samples=119
lat (nsec) : 100=0.01%, 250=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=99.59%, 20=0.41%, 50=0.01%
cpu : usr=4.99%, sys=95.00%, ctx=219, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=9163979,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=597MiB/s (626MB/s), 597MiB/s-597MiB/s (626MB/s-626MB/s), io=35.0GiB (37.5GB), run=60001-60001msec
Disk stats (read/write):
nvme0n1: ios=9148582/0, merge=0/0, ticks=53935/0, in_queue=53935, util=99.87%
io_uring backend (xNVMe ioengine) with polled completions
$ fio xnvme-compare2.fio --section=xnvme_uring --filename=/dev/nvme0n1
xnvme_uring: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=595MiB/s][r=152k IOPS][eta 00m:00s]
xnvme_uring: (groupid=0, jobs=1): err= 0: pid=520074: Thu Oct 13 01:23:07 2022
read: IOPS=152k, BW=595MiB/s (623MB/s)(34.8GiB/60000msec)
slat (nsec): min=770, max=29609, avg=830.05, stdev=59.40
clat (nsec): min=80, max=106778, avg=5595.02, stdev=471.78
lat (nsec): min=5940, max=107878, avg=6425.06, stdev=474.11
clat percentiles (nsec):
| 1.00th=[ 5472], 5.00th=[ 5472], 10.00th=[ 5472], 20.00th=[ 5536],
| 30.00th=[ 5536], 40.00th=[ 5536], 50.00th=[ 5536], 60.00th=[ 5536],
| 70.00th=[ 5536], 80.00th=[ 5600], 90.00th=[ 5600], 95.00th=[ 5664],
| 99.00th=[ 6496], 99.50th=[ 9024], 99.90th=[12224], 99.95th=[12224],
| 99.99th=[12608]
bw ( KiB/s): min=606960, max=610960, per=100.00%, avg=608883.36, stdev=1037.71, samples=119
iops : min=151740, max=152740, avg=152220.87, stdev=259.42, samples=119
lat (nsec) : 100=0.01%, 250=0.01%, 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=99.59%, 20=0.40%, 50=0.01%
lat (usec) : 250=0.01%
cpu : usr=5.91%, sys=94.09%, ctx=231, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=9131935,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=595MiB/s (623MB/s), 595MiB/s-595MiB/s (623MB/s-623MB/s), io=34.8GiB (37.4GB), run=60000-60000msec
io_uring_cmd ioengine
$ fio xnvme-compare2.fio --section=ucmd --filename=/dev/ng0n1
ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring_cmd, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=452MiB/s][r=116k IOPS][eta 00m:00s]
ucmd: (groupid=0, jobs=1): err= 0: pid=520104: Thu Oct 13 01:24:59 2022
read: IOPS=116k, BW=451MiB/s (473MB/s)(26.4GiB/60001msec)
slat (nsec): min=670, max=13370, avg=731.30, stdev=54.58
clat (nsec): min=90, max=2017.0k, avg=7712.97, stdev=1107.87
lat (usec): min=7, max=2017, avg= 8.44, stdev= 1.11
clat percentiles (nsec):
| 1.00th=[ 7520], 5.00th=[ 7584], 10.00th=[ 7584], 20.00th=[ 7648],
| 30.00th=[ 7648], 40.00th=[ 7648], 50.00th=[ 7648], 60.00th=[ 7712],
| 70.00th=[ 7712], 80.00th=[ 7712], 90.00th=[ 7712], 95.00th=[ 7776],
| 99.00th=[ 8512], 99.50th=[11584], 99.90th=[14272], 99.95th=[14400],
| 99.99th=[15552]
bw ( KiB/s): min=455120, max=464448, per=100.00%, avg=462195.13, stdev=1356.73, samples=119
iops : min=113780, max=116112, avg=115548.78, stdev=339.19, samples=119
lat (nsec) : 100=0.01%, 250=0.01%
lat (usec) : 4=0.01%, 10=99.32%, 20=0.68%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=7.06%, sys=21.48%, ctx=6930490, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6930622,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=451MiB/s (473MB/s), 451MiB/s-451MiB/s (473MB/s-473MB/s), io=26.4GiB (28.4GB), run=60001-60001msec
io_uring_cmd backend (xNVMe ioengine)
$ fio xnvme-compare2.fio --section=xnvme_ucmd --filename=/dev/ng0n1
xnvme_ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=449MiB/s][r=115k IOPS][eta 00m:00s]
xnvme_ucmd: (groupid=0, jobs=1): err= 0: pid=520153: Thu Oct 13 01:31:06 2022
read: IOPS=115k, BW=448MiB/s (470MB/s)(26.3GiB/60001msec)
slat (nsec): min=710, max=23210, avg=775.53, stdev=58.47
clat (nsec): min=100, max=137437, avg=7731.31, stdev=492.69
lat (nsec): min=7920, max=152877, avg=8506.83, stdev=496.66
clat percentiles (nsec):
| 1.00th=[ 7520], 5.00th=[ 7584], 10.00th=[ 7648], 20.00th=[ 7648],
| 30.00th=[ 7648], 40.00th=[ 7648], 50.00th=[ 7712], 60.00th=[ 7712],
| 70.00th=[ 7712], 80.00th=[ 7712], 90.00th=[ 7712], 95.00th=[ 7840],
| 99.00th=[ 8512], 99.50th=[11712], 99.90th=[14528], 99.95th=[14528],
| 99.99th=[14912]
bw ( KiB/s): min=446616, max=460656, per=100.00%, avg=459226.22, stdev=1388.44, samples=119
iops : min=111654, max=115164, avg=114806.55, stdev=347.11, samples=119
lat (nsec) : 250=0.01%
lat (usec) : 10=99.31%, 20=0.69%, 50=0.01%, 250=0.01%
cpu : usr=7.87%, sys=21.32%, ctx=6886793, majf=0, minf=1
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=6886881,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=448MiB/s (470MB/s), 448MiB/s-448MiB/s (470MB/s-470MB/s), io=26.3GiB (28.2GB), run=60001-60001msec
io_uring_cmd ioengine with sqthread_poll enabled
$ fio xnvme-compare2.fio --section=ucmd_sqthread --filename=/dev/ng0n1
ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring_cmd, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=546MiB/s][r=140k IOPS][eta 00m:00s]
ucmd: (groupid=0, jobs=1): err= 0: pid=520218: Thu Oct 13 01:34:59 2022
read: IOPS=139k, BW=545MiB/s (571MB/s)(31.9GiB/60001msec)
clat (msec): min=1990.5M, max=1990.6M, avg=1990535777.15, stdev=17308.19
lat (usec): min=6, max=23643, avg= 7.04, stdev= 8.31
clat percentiles (msec):
| 1.00th=[17113], 5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
| 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
| 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
| 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
| 99.99th=[17113]
bw ( KiB/s): min=526874, max=571016, per=100.00%, avg=557953.97, stdev=3962.98, samples=119
iops : min=131718, max=142754, avg=139488.49, stdev=990.78, samples=119
lat (msec) : >=2000=100.00%
cpu : usr=99.98%, sys=0.00%, ctx=265, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8366947,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=545MiB/s (571MB/s), 545MiB/s-545MiB/s (571MB/s-571MB/s), io=31.9GiB (34.3GB), run=60001-60001msec
io_uring_cmd backend (xNVMe ioengine) with sqthread_poll enabled
$ fio xnvme-compare2.fio --section=xnvme_ucmd_sqthread --filename=/dev/ng0n1
xnvme_ucmd: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=543MiB/s][r=139k IOPS][eta 00m:00s]
xnvme_ucmd: (groupid=0, jobs=1): err= 0: pid=520189: Thu Oct 13 01:33:00 2022
read: IOPS=138k, BW=540MiB/s (566MB/s)(31.7GiB/60001msec)
slat (nsec): min=60, max=7060, avg=91.35, stdev= 9.58
clat (usec): min=6, max=23244, avg= 7.01, stdev= 8.08
lat (usec): min=6, max=23245, avg= 7.10, stdev= 8.08
clat percentiles (nsec):
| 1.00th=[ 6816], 5.00th=[ 6880], 10.00th=[ 6880], 20.00th=[ 6944],
| 30.00th=[ 6944], 40.00th=[ 6944], 50.00th=[ 6944], 60.00th=[ 6944],
| 70.00th=[ 7008], 80.00th=[ 7008], 90.00th=[ 7008], 95.00th=[ 7264],
| 99.00th=[ 7776], 99.50th=[ 9408], 99.90th=[13632], 99.95th=[13760],
| 99.99th=[16192]
bw ( KiB/s): min=487264, max=556344, per=100.00%, avg=553311.80, stdev=9104.91, samples=119
iops : min=121816, max=139086, avg=138327.95, stdev=2276.23, samples=119
lat (usec) : 10=99.55%, 20=0.45%, 50=0.01%
lat (msec) : 50=0.01%
cpu : usr=99.98%, sys=0.00%, ctx=268, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=8298319,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=540MiB/s (566MB/s), 540MiB/s-540MiB/s (566MB/s-566MB/s), io=31.7GiB (34.0GB), run=60001-60001msec
The table below summarizes the results of the six jobs.
 | xNVMe io_uring backend w/hipri | io_uring ioengine w/hipri | xNVMe io_uring_cmd backend | io_uring_cmd ioengine | xNVMe io_uring_cmd backend w/sqthread_poll | io_uring_cmd ioengine w/sqthread_poll
---|---|---|---|---|---|---
Mean total latency (ns) | 6425 | 6395 | 8507 | 8440 | 7100 | 7040
IOPS | 152K | 153K | 115K | 116K | 138K | 139K
CPU usr | 5.91% | 4.99% | 7.87% | 7.06% | 99.98% | 99.98%
CPU sys | 94.09% | 95.00% | 21.32% | 21.48% | 0.00% | 0.00%
Context switches | 231 | 219 | 6886793 | 6930490 | 268 | 265
xNVMe's overhead is 30 ns (6425 ns vs 6395 ns total latency) for io_uring with polled completions. For io_uring_cmd with default options, xNVMe's overhead is 67 ns (8507 ns vs 8440 ns total latency). Finally, for io_uring_cmd with submission queue polling, xNVMe's overhead is 60 ns (7100 ns vs 7040 ns total latency). These differences are all in the same ballpark as what we observed in Part 1.
These results also provide another opportunity to see the effect of submission queue polling, this time for the io_uring_cmd ioengine and backend. With `sqthread_poll=1`, latency falls by approximately 1400 ns and IOPS increase by slightly more than 22K. However, CPU utilization increases as well. Without polling, fio's user and system CPU usage totaled about 30 percent, but with polling enabled user CPU usage was nearly 100 percent. Note also the difference in the number of context switches: without polling there were over six million context switches, but with polling enabled the number of context switches is negligible.
From kernel 6.1 onwards Linux will have polled completion (`hipri`) support for uring passthrough commands (io_uring_cmd). This was not available in the kernel we used for testing.
xNVMe bundles the SPDK userspace NVMe driver with the xNVMe library by default. So if you have built and installed xNVMe with the default options then the SPDK driver should also be available. The xNVMe user-space section provides an in-depth guide for using SPDK with xNVMe. We will cover some of the highlights here.
With the kernel and hardware configured appropriately, users can run the `xnvme-driver` script to unbind devices from the kernel NVMe driver and bind them to the `vfio-pci` driver for use by SPDK.
root@ubuntu:~# xnvme-driver
0000:24:00.0 (8086 4140): nvme -> vfio-pci
Current user memlock limit: 1989 MB
This is the maximum amount of memory you will be
able to use with DPDK and VFIO if run as current user.
To change this, please adjust limits.conf memlock limit for current user.
The above command unbinds the device from the kernel NVMe driver and binds it to the `vfio-pci` driver for use by SPDK.
By default the `xnvme-driver` script will allocate 2048 MB of memory using hugepages. To reserve 4096 MB instead, run:
HUGEMEM=4096 xnvme-driver
To unbind devices from `vfio-pci` and return them to the kernel NVMe driver, run:
root@ubuntu:~# xnvme-driver reset
0000:24:00.0 (8086 4140): vfio-pci -> nvme
Selecting NVMe devices for xNVMe's SPDK backend requires specifying the options `filename` (the PCIe address) and `xnvme_dev_nsid` (the namespace ID). Because fio splits filenames on ":", the colons in the PCIe address must be escaped. On the command line this is done with two backslash characters; inside a job file a single backslash is enough to escape each colon.
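As a sketch, the two escaping forms look like this (the PCIe address is the one used in this post's examples):

```ini
; inside a job file, one backslash escapes each colon:
[spdk]
ioengine=xnvme
filename=0000\:24\:00.0
xnvme_dev_nsid=1
; on the command line, the same address needs two backslashes:
;   --filename=0000\\:24\\:00.0
```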
Let us compare the xNVMe ioengine using the SPDK backend against the SPDK external fio ioengine from the SPDK repository. The job file `xnvme-spdk.fio` below will be used.
[test]
thread=1
time_based
size=500G
runtime=1m
numjobs=1
rw=randread
bs=4k
iodepth=1
cpus_allowed=1
norandommap
To use the SPDK external ioengine follow the detailed guide at spdk_fio_plugin. However, note that building the external SPDK ioengine with a build of fio that includes the xNVMe ioengine results in the circular dependency error message below.
fio: IFUNC symbol 'spdk_u64log2' referenced in '../spdk/build/fio/spdk_nvme' is defined in the executable and creates an unsatisfiable circular dependency.
To avoid this, build SPDK and its accompanying external fio ioengine against a fio build that does not include the xNVMe ioengine by adding the `--disable-xnvme` option when running fio's configure script:
./configure --disable-xnvme
Here is the output with the SPDK external ioengine and the xnvme ioengine using the SPDK backend:
SPDK external ioengine
$ LD_PRELOAD=../spdk/build/fio/spdk_nvme fio xnvme-spdk.fio --ioengine=spdk --filename="trtype=PCIe traddr=0000.24.00.0 ns=1"
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=spdk, iodepth=1
fio-3.32
Starting 1 thread
TELEMETRY: No legacy callbacks, legacy socket not created
Jobs: 1 (f=0): [f(1)][100.0%][r=661MiB/s][r=169k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=374623: Thu Oct 6 05:10:52 2022
read: IOPS=170k, BW=665MiB/s (697MB/s)(39.0GiB/60000msec)
slat (nsec): min=69, max=4140, avg=76.87, stdev= 6.57
clat (nsec): min=5090, max=23010, avg=5521.69, stdev=177.06
lat (nsec): min=5170, max=23090, avg=5598.56, stdev=177.11
clat percentiles (nsec):
| 1.00th=[ 5472], 5.00th=[ 5472], 10.00th=[ 5472], 20.00th=[ 5472],
| 30.00th=[ 5472], 40.00th=[ 5472], 50.00th=[ 5536], 60.00th=[ 5536],
| 70.00th=[ 5536], 80.00th=[ 5536], 90.00th=[ 5536], 95.00th=[ 5600],
| 99.00th=[ 5792], 99.50th=[ 6048], 99.90th=[ 8640], 99.95th=[ 9664],
| 99.99th=[10944]
bw ( KiB/s): min=677232, max=682512, per=100.00%, avg=680885.71, stdev=2277.69, samples=119
iops : min=169308, max=170628, avg=170221.43, stdev=569.42, samples=119
lat (usec) : 10=99.97%, 20=0.03%, 50=0.01%
cpu : usr=100.00%, sys=0.00%, ctx=237, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=10212059,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=665MiB/s (697MB/s), 665MiB/s-665MiB/s (697MB/s-697MB/s), io=39.0GiB (41.8GB), run=60000-60000msec
SPDK backend (xNVMe ioengine)
$ fio xnvme-spdk.fio --ioengine=xnvme --filename=0000\\:24\\:00.0 --xnvme_dev_nsid=1
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=xnvme, iodepth=1
fio-3.32
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=676MiB/s][r=173k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=359864: Thu Oct 6 05:03:12 2022
read: IOPS=173k, BW=674MiB/s (707MB/s)(39.5GiB/60000msec)
slat (nsec): min=59, max=6170, avg=74.41, stdev= 9.60
clat (nsec): min=230, max=23570, avg=5427.06, stdev=193.68
lat (nsec): min=5040, max=23640, avg=5501.47, stdev=193.65
clat percentiles (nsec):
| 1.00th=[ 5280], 5.00th=[ 5344], 10.00th=[ 5344], 20.00th=[ 5344],
| 30.00th=[ 5344], 40.00th=[ 5344], 50.00th=[ 5408], 60.00th=[ 5408],
| 70.00th=[ 5408], 80.00th=[ 5408], 90.00th=[ 5664], 95.00th=[ 5664],
| 99.00th=[ 5728], 99.50th=[ 5984], 99.90th=[ 7840], 99.95th=[ 9280],
| 99.99th=[10560]
bw ( KiB/s): min=685056, max=692544, per=100.00%, avg=690275.50, stdev=2826.15, samples=119
iops : min=171264, max=173136, avg=172568.87, stdev=706.54, samples=119
lat (nsec) : 250=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=99.98%, 20=0.02%, 50=0.01%
cpu : usr=100.00%, sys=0.00%, ctx=220, majf=0, minf=0
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=10353900,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=674MiB/s (707MB/s), 674MiB/s-674MiB/s (707MB/s-707MB/s), io=39.5GiB (42.4GB), run=60000-60000msec
 | SPDK | xNVMe with SPDK backend
---|---|---
Mean total latency (ns) | 5599 | 5501
IOPS | 170K | 173K
CPU usr | 100% | 100%
CPU sys | 0% | 0%
Context switches | 237 | 220
The average latencies are similar at 5599 ns for the SPDK external ioengine and 5501 ns for the xNVMe ioengine with the SPDK backend. For this workload xNVMe with the SPDK backend has marginally higher IOPS than the SPDK external ioengine (173K vs 170K). The CPU utilization and context switch statistics for both cases are similar.
This concludes Part 2 of our fio xNVMe ioengine blog post series. In Part 3 we plan to cover topics such as NVMeoF and Windows support.
- Fio's libaio ioengine and xNVMe's libaio backend use a system call to wait for a minimum number of I/Os to complete, with the upper bound on the wait time specified by a `timespec`. The libaio ioengine and xNVMe's libaio backend respectively provide the `userspace_reap` and `hipri` flags to actively poll for completions instead. This provides higher IOPS and lower latencies at the cost of higher CPU utilization. Interested readers can find a comparison of the libaio ioengine with `userspace_reap` and xNVMe's libaio backend with `hipri` here.
- For the xNVMe ioengine with the io_uring backend we also measured performance with both the `hipri` and `sqthread_poll` options enabled. This combination produced unexpectedly poor performance, which is under investigation.
- For more background information on topics discussed here see these references:
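The libaio comparison mentioned in the first note could be sketched with a job file along the following lines (the section names and target device are illustrative):

```ini
[global]
thread=1
filename=/dev/nvme0n1
direct=1
rw=randread
bs=4k
time_based
runtime=1m
; fio's libaio ioengine reaping completions from user space
[libaio_reap]
ioengine=libaio
userspace_reap=1
; xNVMe's libaio backend polling for completions
[xnvme_libaio]
ioengine=xnvme
xnvme_async=libaio
hipri=1
```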