# Tuning ZFS on Proxmox
Here are details of my ZFS tuning on Proxmox VE.

ZFS has many tuning parameters, and some of them unfortunately have insane defaults.
## zfs_arc_max
Tuning: ARC is the 1st-level read cache - any read from disk will end up in the ARC cache. Unfortunately, the ZFS default size is 50% of RAM (!).

As stated at the top of the ZFS page, the ARC cache is not accounted as kernel cache but as "used kernel memory". This has a significant impact on swapping: a "real" page cache can be dropped at any moment, so the kernel happily uses big caches, knowing that under memory pressure they can be reclaimed quickly. The ARC cache cannot be dropped so easily, so it stresses the kernel when it is too big.

Fortunately, new Proxmox installations set the default ARC cache size to 10% of RAM. With 8 GB of RAM, Proxmox will offer a default ARC size of around 800 MB.

However, an existing Proxmox installation will not touch this parameter even when upgraded to a newer version, so I strongly recommend evaluating this parameter and updating it if needed.
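To evaluate the current state before changing anything, you can compare the runtime module parameter with the persistent configuration and the actual ARC usage. A minimal check (a value of `0` means no explicit limit is set, i.e. the 50% default mentioned above applies):

```shell
# runtime limit in bytes (0 = no explicit limit, i.e. the ~50% of RAM default)
cat /sys/module/zfs/parameters/zfs_arc_max
# persistent setting, written by newer Proxmox installers
grep -rs zfs_arc_max /etc/modprobe.d/
# how much RAM the ARC really uses right now
awk '/^size/ {printf "%.1f MiB ARC in use\n", $3/1048576}' /proc/spl/kstat/zfs/arcstats
```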
Here is an example of how to change the maximum ARC cache size to 1 GB:

- compute the cache size in bytes - the example is for 1 GB:

  ```shell
  $ echo $(( 1 * 1024 * 1024 * 1024 ))
  1073741824
  ```

- now create or update the file `/etc/modprobe.d/zfs.conf` with:

  ```
  options zfs zfs_arc_max=1073741824
  ```

- now update the initial ramdisk so the change is really applied on reboot:

  ```shell
  update-initramfs -u
  ```
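Before rebooting, the new limit can also be applied at runtime by writing to the module parameter - useful for a quick test, although shrinking a large ARC may take a while. This is only a sketch; the persistent `zfs.conf` entry above is still required for the change to survive a reboot:

```shell
# apply the 1 GiB limit immediately (as root); the ARC shrinks gradually
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
```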
Reboot the system and verify that your change was applied. First, look at the runtime module parameter:

```shell
$ cat /sys/module/zfs/parameters/zfs_arc_max
1073741824
```
Verify the current ARC cache size with:

```shell
$ arc_summary -s arc | sed '/Anonymous/,$d'

------------------------------------------------------------------------
ZFS Subsystem Report                            Sat Sep 14 09:03:35 2024
Linux 6.8.12-1-pve                                             2.2.4-pve1
Machine: pve-zfs (x86_64)                                      2.2.4-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    63.7 %  652.6 MiB
        Target size (adaptive):                       100.0 %    1.0 GiB
        Min size (hard limit):                         24.2 %  248.2 MiB
        Max size (high water):                            4:1    1.0 GiB
```

The `ARC size (current)` and `Max size` values are the most important ones.
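To watch ARC behaviour over time (current size vs. the configured maximum, hit rate), the `arcstat` tool shipped with the ZFS utilities can be used - assuming it is installed on your Proxmox host:

```shell
# print ARC statistics every 5 seconds (Ctrl-C to stop)
arcstat 5
```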
Discussion:
## zvol_threads
Tuning: a ZVOL is a block device on top of a ZFS pool, similar to a loopback device on ext2/3/4 filesystems. ZVOLs are used by default for VM disks on Proxmox. Here is an example of how to list zvols:

```shell
$ zfs list -t volume

NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool/data/vm-100-disk-0  4.09G   159G  4.09G  -
rpool/data/vm-100-disk-1   134M   159G   134M  -
...
```
A ZVOL creates virtual disk (and partition) devices under `/dev/zd*`. A simple example of how to find which `/dev/zd*` devices are opened by a process:

```shell
$ lsof -Q /dev/zd*

COMMAND  PID USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
kvm     2051 root  32u  BLK  230,48      0t0  924 /dev/zd48
```
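To map a `/dev/zdN` device back to its zvol name, the `/dev/zvol/` symlink tree maintained by udev can be used. A small sketch (the `rpool/data` path matches the `zfs list` output above):

```shell
# each zvol gets a symlink /dev/zvol/<pool>/<dataset> pointing to its /dev/zdN node
ls -l /dev/zvol/rpool/data/
```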
When you have a write-intensive workload, you can see up to 32 kernel threads (shown as `[zvol_tq-0]`, `[zvol_tq-1]`, ... depending on the number of task queues) competing for the CPU.

Please note that the default `zvol_threads` parameter will not tell you much:

```shell
$ cat /sys/module/zfs/parameters/zvol_threads
0
```
We can count the threads with a simple grep:

```shell
$ ps ax | fgrep -c '[zvol_tq'
32
```
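If you want to see how busy those threads actually are during a write test, per-thread CPU usage can be inspected, for example (a rough sketch; any thread-aware tool such as `top -H` works as well):

```shell
# show the busiest zvol taskq threads by accumulated CPU time
ps -eo pid,comm,pcpu,time --sort=-time | grep zvol_tq | head
```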
Now let's try to reduce them to 4 (I have 2 cores and decided that `cores * 2` could be a reasonable starting point):

```shell
$ echo 4 > /sys/module/zfs/parameters/zvol_threads
-bash: /sys/module/zfs/parameters/zvol_threads: Permission denied
```
Err! So the only way to try it is to update `/etc/modprobe.d/zfs.conf` like this:

```shell
$ diff -u root/zfs.conf etc/modprobe.d/zfs.conf
--- root/zfs.conf           2024-09-14 10:17:22.710041912 +0200
+++ etc/modprobe.d/zfs.conf 2024-09-14 10:36:46.870408525 +0200
@@ -1 +1 @@
-options zfs zfs_arc_max=1073741824
+options zfs zfs_arc_max=1073741824 zvol_threads=4
```
Remember to update the initramfs before rebooting with `update-initramfs -u` and then issue `reboot`.
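If you want to derive the value from the machine you are on instead of hard-coding 4, a tiny helper like the one below can print a suggestion. This is a hypothetical sketch using the `cores * 2` heuristic from above - it only prints the value, it does not edit `zfs.conf` for you:

```shell
#!/bin/bash
# suggest zvol_threads = number of CPUs * 2 (heuristic, not an official recommendation)
set -eu
threads=$(( $(nproc) * 2 ))
echo "suggested: options zfs zvol_threads=$threads"
```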
After reboot, let's check the number of `zvol_tq` threads:

```shell
$ ps ax | fgrep -c '[zvol_tq'
5
```
The number is OK (the count also includes the fgrep command itself, which can be verified by running fgrep without `-c`).

Now try to start a VM - the load average should be much better (not something like 20, but rather below 2).
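One crude way to reproduce the comparison is to generate synchronous writes inside the VM while watching the host. This is only a rough sketch, not a proper benchmark - `fio` or a real workload gives more meaningful numbers:

```shell
# inside the VM: sequential synchronous writes, similar to the workload discussed below
dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=1024 oflag=direct conv=fsync

# on the Proxmox host, in parallel: watch load average and zvol_tq CPU usage
watch -n 2 "uptime; ps -eo comm,pcpu --sort=-pcpu | grep zvol_tq | head -5"
```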
There are several interesting lines in the source of `upstream/module/os/linux/zfs/zvol_os.c` from the git URL `git://git.proxmox.com/git/zfsonlinux`:
```c
int
zvol_init(void)
{
    /*
     * zvol_threads is the module param the user passes in.
     *
     * zvol_actual_threads is what we use internally, since the user can
     * pass zvol_thread = 0 to mean "use all the CPUs" (the default).
     */
    static unsigned int zvol_actual_threads;

    if (zvol_threads == 0) {
        /*
         * See dde9380a1 for why 32 was chosen here. This should
         * probably be refined to be some multiple of the number
         * of CPUs.
         */
        zvol_actual_threads = MAX(num_online_cpus(), 32);
    } else {
        zvol_actual_threads = MIN(MAX(zvol_threads, 1), 1024);
    }

    /*
     * Use atleast 32 zvol_threads but for many core system,
     * prefer 6 threads per taskq, but no more taskqs
     * than threads in them on large systems.
     *
     *                 taskq   total
     * cpus    taskqs  threads threads
     * ------- ------- ------- -------
     * 1       1       32       32
     * 2       1       32       32
     * 4       1       32       32
     * 8       2       16       32
     * 16      3       11       33
     * 32      5       7        35
     * 64      8       8        64
     * 128     11      12       132
     * 256     16      16       256
     */

    for (uint_t i = 0; i < num_tqs; i++) {
        char name[32];
        (void) snprintf(name, sizeof (name), "%s_tq-%u",
            ZVOL_DRIVER, i);
        // ...
    }
    // ...
    return (0);
}
```
NOTE: the commit `dde9380a1` is referenced there, but it is not present in this tree (the Proxmox fork). However, we can find it in the Debian tree: https://salsa.debian.org/zfsonlinux-team/zfs/-/commit/dde9380a1bf9084d0c8a3e073cdd65bb81db1a23

Here is a full copy of that patch:
```diff
From dde9380a1bf9084d0c8a3e073cdd65bb81db1a23 Mon Sep 17 00:00:00 2001
From: Etienne Dechamps <e-t172@akegroup.org>
Date: Wed, 8 Feb 2012 22:41:41 +0100
Subject: [PATCH] Use 32 as the default number of zvol threads.

Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.

This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:

> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.

We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.

On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.

We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #392
---
 module/zfs/zvol.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index 1636581d5..19888ea96 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -47,7 +47,7 @@
 #include <linux/blkdev_compat.h>
 
 unsigned int zvol_major = ZVOL_MAJOR;
-unsigned int zvol_threads = 0;
+unsigned int zvol_threads = 32;
 
 static taskq_t *zvol_taskq;
 static kmutex_t zvol_state_lock;
@@ -1343,9 +1343,6 @@ zvol_init(void)
 {
     int error;
 
-    if (!zvol_threads)
-        zvol_threads = num_online_cpus();
-
     zvol_taskq = taskq_create(ZVOL_DRIVER, zvol_threads, maxclsyspri,
         zvol_threads, INT_MAX, TASKQ_PREPOPULATE);
     if (zvol_taskq == NULL) {
--
GitLab
```
NOTE: it is possible to bypass these task queues completely with the parameter `zvol_request_sync=1`, as we can see in the same source:
```c
static void
zvol_request_impl(zvol_state_t *zv, struct bio *bio, struct request *rq,
    boolean_t force_sync)
{
    // ...
    if (rw == WRITE) {
        // ...
        /*
         * We don't want this thread to be blocked waiting for i/o to
         * complete, so we instead wait from a taskq callback. The
         * i/o may be a ZIL write (via zil_commit()), or a read of an
         * indirect block, or a read of a data block (if this is a
         * partial-block write). We will indicate that the i/o is
         * complete by calling END_IO() from the taskq callback.
         *
         * This design allows the calling thread to continue and
         * initiate more concurrent operations by calling
         * zvol_request() again. There are typically only a small
         * number of threads available to call zvol_request() (e.g.
         * one per iSCSI target), so keeping the latency of
         * zvol_request() low is important for performance.
         *
         * The zvol_request_sync module parameter allows this
         * behavior to be altered, for performance evaluation
         * purposes. If the callback blocks, setting
         * zvol_request_sync=1 will result in much worse performance.
         *
         * We can have up to zvol_threads concurrent i/o's being
         * processed for all zvols on the system. This is typically
         * a vast improvement over the zvol_request_sync=1 behavior
         * of one i/o at a time per zvol. However, an even better
         * design would be for zvol_request() to initiate the zio
         * directly, and then be notified by the zio_done callback,
         * which would call END_IO(). Unfortunately, the DMU/ZIL
         * interfaces lack this functionality (they block waiting for
         * the i/o to complete).
         */
        if (io_is_discard(bio, rq) || io_is_secure_erase(bio, rq)) {
            if (force_sync) {
                zvol_discard(&zvr);
            } else {
                task = zv_request_task_create(zvr);
                taskq_dispatch_ent(ztqs->tqs_taskq[tq_idx],
                    zvol_discard_task, task, 0, &task->ent);
            }
        } else {
            if (force_sync) {
                zvol_write(&zvr);
            } else {
                task = zv_request_task_create(zvr);
                taskq_dispatch_ent(ztqs->tqs_taskq[tq_idx],
                    zvol_write_task, task, 0, &task->ent);
            }
        }
    } else {
        /*
         * The SCST driver, and possibly others, may issue READ I/Os
         * with a length of zero bytes. These empty I/Os contain no
         * data and require no additional handling.
         */
        if (size == 0) {
            END_IO(zv, bio, rq, 0);
            goto out;
        }

        rw_enter(&zv->zv_suspend_lock, RW_READER);

        /* See comment in WRITE case above. */
        if (force_sync) {
            zvol_read(&zvr);
        } else {
            task = zv_request_task_create(zvr);
            taskq_dispatch_ent(ztqs->tqs_taskq[tq_idx],
                zvol_read_task, task, 0, &task->ent);
        }
    }
}
```
Normally this parameter should be used only for problem evaluation.
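If `zvol_request_sync` is writable at runtime on your kernel (it usually is, unlike `zvol_threads`), it can be toggled for a quick A/B comparison. A hedged sketch; as the source comment above warns, keep it at `0` for normal operation:

```shell
# bypass the zvol taskqs (evaluation only)
echo 1 > /sys/module/zfs/parameters/zvol_request_sync
# ... run your write test, then restore the default ...
echo 0 > /sys/module/zfs/parameters/zvol_request_sync
```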
The original upstream commit (ZoL - the ZFS on Linux project) is here: https://github.com/openzfs/zfs/commit/99741bde59d1d1df0963009bb624ddc105f7d8dc

There is also an interesting long discussion, even with a chart (!), on:

Please note that the FreeBSD version of the ZFS ZVOL write path is vastly different from the Linux one, as can be seen at https://github.com/openzfs/zfs/blob/master/module/os/freebsd/zfs/zvol_os.c#L869 - it is therefore incorrect to expect that a zvol will behave the same on FreeBSD as on Linux.
Resources:
## General ZFS notes
ZFS on Proxmox is a two-level fork of "ZFS on Linux" (ZoL):

- Proxmox uses its own fork of the Debian version: https://git.proxmox.com/?p=zfsonlinux.git;a=summary
- Debian is itself a fork of the official ZoL; the Debian tree is at: https://salsa.debian.org/zfsonlinux-team/zfs
- and the final upstream, ZoL, is at: https://github.com/openzfs/zfs

This means that in case of problems one should contact maintainers in that order (Proxmox, Debian, ZoL team).