ZFS Proxmox - hpaluch/hpaluch.github.io GitHub Wiki

Tuning ZFS on Proxmox

This page describes how I tune ZFS on Proxmox VE.

ZFS has many tuning parameters, and some of them unfortunately have insane defaults.

Tuning: zfs_arc_max

ARC (Adaptive Replacement Cache) is the first-level read cache - any read from disk will end up in the ARC. Unfortunately the ZFS default limit is 50% of RAM (!).

As stated at the top of the ZFS page, the ARC is not accounted as kernel cache but as "used kernel memory". This has a significant impact on swapping. The "real cache" can be dropped in a moment, so the kernel happily keeps big caches, because it knows that under memory pressure they can quickly be reused for allocations. The ARC can't be dropped so easily, so it stresses the kernel when it is too big.

Fortunately, new Proxmox installations set the default ARC size to 10% of RAM. For example, with 8GB of RAM, Proxmox will offer a default ARC size of around 800MB.

However, upgrading an existing Proxmox installation to a higher version will not touch this parameter. So I strongly recommend evaluating this parameter and updating it if needed.

Here is an example of how to limit the ARC cache size to 1GB:

  1. compute the cache size in bytes - this example is for 1GB:
    $ echo $(( 1 * 1024 * 1024 * 1024 ))

    1073741824
  2. now create or update the file /etc/modprobe.d/zfs.conf with:
    options zfs zfs_arc_max=1073741824
    
  3. now update the initial ramdisk so that the change is really applied on reboot:
    update-initramfs -u
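
The first two steps can also be scripted. The following is only a minimal sketch: the helper name arc_max_option is hypothetical, and it just prints the line intended for /etc/modprobe.d/zfs.conf instead of writing the file.

```shell
# Hypothetical helper: print the modprobe option line for an ARC limit
# given in GiB (1 GiB = 1024 * 1024 * 1024 bytes).
arc_max_option() {
    gib=$1
    echo "options zfs zfs_arc_max=$(( gib * 1024 * 1024 * 1024 ))"
}

arc_max_option 1
```

For 1 GiB this prints options zfs zfs_arc_max=1073741824, the same value as computed by hand above.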
    

Reboot the system and verify that your change was applied. First look at the runtime module parameter:

$ cat /sys/module/zfs/parameters/zfs_arc_max 

1073741824

Then verify the current ARC cache size with:

$ arc_summary -s arc | sed '/Anonymous/,$d'

------------------------------------------------------------------------
ZFS Subsystem Report                            Sat Sep 14 09:03:35 2024
Linux 6.8.12-1-pve                                            2.2.4-pve1
Machine: pve-zfs (x86_64)                                     2.2.4-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    63.7 %  652.6 MiB
        Target size (adaptive):                       100.0 %    1.0 GiB
        Min size (hard limit):                         24.2 %  248.2 MiB
        Max size (high water):                            4:1    1.0 GiB

The ARC size (current) and Max size (high water) lines are the most important ones.
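
The same two values can also be read without arc_summary, from the kernel statistics file /proc/spl/kstat/zfs/arcstats, whose rows have the form "name type data". Below is a minimal sketch, assuming the size and c_max rows hold the current and maximum ARC size in bytes. The function name arc_usage is made up for this example.

```shell
# Hypothetical helper: summarize ARC usage from arcstats-formatted input
# (rows of "name type data") read from standard input.
arc_usage() {
    awk '
        $1 == "size"  { size = $3 }   # current ARC size in bytes
        $1 == "c_max" { max = $3 }    # configured maximum in bytes
        END {
            if (max > 0)
                printf "%d MiB of %d MiB (%.1f%%)\n",
                    size / 1048576, max / 1048576, 100 * size / max
        }
    '
}

# On a ZFS host:
#   arc_usage < /proc/spl/kstat/zfs/arcstats
```

Matching the first column exactly matters here, because arcstats also contains rows such as mru_size or hdr_size.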

Discussion:

Tuning: zvol_threads

A ZVOL is a block device backed by a ZFS pool. It is similar to a loopback device on top of an ext2/3/4 filesystem. Proxmox uses ZVOLs by default for VM disks. Here is an example of how to list zvols:

$ zfs list -t volume

NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool/data/vm-100-disk-0  4.09G   159G  4.09G  -
rpool/data/vm-100-disk-1   134M   159G   134M  -
...

Each ZVOL creates a virtual disk device, plus devices for its partitions, under /dev/zd*

Here is a simple example of how to find which /dev/zd* devices are opened by a process:

$ lsof -Q /dev/zd*

COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
kvm     2051 root   32u   BLK 230,48      0t0  924 /dev/zd48
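
To map a device such as /dev/zd48 back to its zvol name, one option is to use the symlinks that OpenZFS on Linux normally creates under /dev/zvol/<pool>/<dataset>. Below is a minimal sketch under that assumption. The function name zvol_name_for_dev is hypothetical, and readlink -f is assumed to be available (it is part of GNU coreutils).

```shell
# Hypothetical helper: print the zvol name(s) whose /dev/zvol symlink
# resolves to the given device node.
#   $1 = device node, e.g. /dev/zd48
#   $2 = symlink root (optional, default /dev/zvol)
zvol_name_for_dev() (
    dev=$(readlink -f "$1")
    root=${2:-/dev/zvol}
    find "$root" -type l | while IFS= read -r link; do
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            # Strip the root prefix so only pool/dataset remains.
            printf '%s\n' "${link#"$root"/}"
        fi
    done
)

# On a ZFS host:
#   zvol_name_for_dev /dev/zd48
```

The function body runs in a subshell, so its variables do not leak into the calling shell.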

With a write-intensive workload you can see up to 32 kernel threads competing for CPU, with names like:

[zvol_tq-0]

Please note that the default value of the zvol_threads parameter does not tell much, because 0 means "choose automatically":

$ cat /sys/module/zfs/parameters/zvol_threads 

0

We can count the threads with a simple grep:

$ ps ax | fgrep -c '[zvol_tq'

32
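
Note that this fgrep pattern also matches the fgrep command line itself. Below is a sketch of a count that avoids this by looking only at the command-name column of ps. It assumes the kernel threads are reported with names starting with zvol_tq, and the function name count_zvol_tq is hypothetical.

```shell
# Hypothetical helper: count names starting with "zvol_tq" in a list of
# command names (one per line) read from standard input.
count_zvol_tq() {
    awk '/^zvol_tq/ { n++ } END { print n + 0 }'
}

# On a ZFS host:
#   ps -e -o comm= | count_zvol_tq
```

Using awk instead of grep -c keeps the exit status at zero even when no thread is found, which is convenient in scripts.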

Now let's try to reduce them to 4 (I have 2 cores and I decided that cores * 2 could be a reasonable start):

$ echo 4 > /sys/module/zfs/parameters/zvol_threads 
-bash: /sys/module/zfs/parameters/zvol_threads: Permission denied

Err! The parameter is read-only at runtime, so the only way to try it is to update /etc/modprobe.d/zfs.conf like this:

$ diff -u root/zfs.conf etc/modprobe.d/zfs.conf 
--- root/zfs.conf	2024-09-14 10:17:22.710041912 +0200
+++ etc/modprobe.d/zfs.conf	2024-09-14 10:36:46.870408525 +0200
@@ -1 +1 @@
-options zfs zfs_arc_max=1073741824
+options zfs zfs_arc_max=1073741824 zvol_threads=4

Remember to update the initramfs with update-initramfs -u before rebooting, and then reboot.
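
The edit of zfs.conf can be made repeatable with a small script. This is only a sketch: set_zfs_option is a hypothetical helper, and it assumes the file contains at most one "options zfs ..." line, as in the example above.

```shell
# Hypothetical helper: set NAME=VALUE on the "options zfs" line of a
# modprobe configuration file, adding the line if it does not exist.
#   $1 = file, $2 = parameter name, $3 = value
set_zfs_option() (
    conf=$1; name=$2; value=$3
    tmp=$(mktemp) || exit 1
    [ -f "$conf" ] || : > "$conf"
    awk -v name="$name" -v value="$value" '
        $1 == "options" && $2 == "zfs" && !done {
            found = 0
            for (i = 3; i <= NF; i++) {
                split($i, kv, "=")
                # Replace the value if the parameter is already present.
                if (kv[1] == name) { $i = name "=" value; found = 1 }
            }
            # Otherwise append it to the existing line.
            if (!found) $0 = $0 " " name "=" value
            done = 1
        }
        { print }
        END { if (!done) print "options zfs " name "=" value }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rm -f "$tmp"
)

# On a Proxmox host (as root), followed by update-initramfs -u:
#   set_zfs_option /etc/modprobe.d/zfs.conf zvol_threads 4
```

Running it twice with the same arguments leaves the file unchanged, so it is safe to keep in a provisioning script.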

After the reboot let's check the number of zvol_tq threads:

$ ps ax | fgrep -c '[zvol_tq'

5

The number is OK: it is 4 threads plus the fgrep command itself, which also matches the pattern (this can be verified by running fgrep without -c).

Now try to start a VM - the load average should be much better (not something like 20 but rather below 2).

There are several interesting lines in the source file upstream/module/os/linux/zfs/zvol_os.c from the Git repository git://git.proxmox.com/git/zfsonlinux:

int
zvol_init(void)
{
	/*
	 * zvol_threads is the module param the user passes in.
	 *
	 * zvol_actual_threads is what we use internally, since the user can
	 * pass zvol_thread = 0 to mean "use all the CPUs" (the default).
	 */
	static unsigned int zvol_actual_threads;

	if (zvol_threads == 0) {
		/*
		 * See dde9380a1 for why 32 was chosen here.  This should
		 * probably be refined to be some multiple of the number
		 * of CPUs.
		 */
		zvol_actual_threads = MAX(num_online_cpus(), 32);
	} else {
		zvol_actual_threads = MIN(MAX(zvol_threads, 1), 1024);
	}

	/*
	 * Use atleast 32 zvol_threads but for many core system,
	 * prefer 6 threads per taskq, but no more taskqs
	 * than threads in them on large systems.
	 *
	 *                 taskq   total
	 * cpus    taskqs  threads threads
	 * ------- ------- ------- -------
	 * 1       1       32       32
	 * 2       1       32       32
	 * 4       1       32       32
	 * 8       2       16       32
	 * 16      3       11       33
	 * 32      5       7        35
	 * 64      8       8        64
	 * 128     11      12       132
	 * 256     16      16       256
	 */

	for (uint_t i = 0; i < num_tqs; i++) {
		char name[32];
		(void) snprintf(name, sizeof (name), "%s_tq-%u",
		    ZVOL_DRIVER, i);
        // ...	
    }
    // ...
	return (0);
}
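
The table in the comment can be reproduced with a few lines of shell arithmetic. This is a reconstruction, not a copy of the elided code: the sizing rule (one taskq per 6 CPUs plus one, reduced until the number of taskqs squared no longer exceeds the thread count, then threads per taskq rounded up) was chosen because it matches every row of the table, and the real zvol_init() may have additional tunables that are ignored here. The function name zvol_layout is hypothetical.

```shell
# Hypothetical helper: print "taskqs threads_per_taskq total_threads"
#   $1 = number of online CPUs
#   $2 = zvol_threads module parameter (optional, default 0 = automatic)
zvol_layout() (
    cpus=$1
    threads=${2:-0}

    if [ "$threads" -eq 0 ]; then
        # zvol_actual_threads = MAX(num_online_cpus(), 32)
        if [ "$cpus" -gt 32 ]; then actual=$cpus; else actual=32; fi
    else
        # zvol_actual_threads = MIN(MAX(zvol_threads, 1), 1024)
        actual=$threads
        if [ "$actual" -gt 1024 ]; then actual=1024; fi
    fi

    # Assumed rule: no more taskqs than threads in each of them.
    tqs=$(( 1 + cpus / 6 ))
    while [ $(( tqs * tqs )) -gt "$actual" ]; do
        tqs=$(( tqs - 1 ))
    done

    # Threads per taskq, rounded up so the total is at least $actual.
    per_tq=$(( (actual + tqs - 1) / tqs ))

    echo "$tqs $per_tq $(( tqs * per_tq ))"
)

# Print the whole table:
#   for c in 1 2 4 8 16 32 64 128 256; do echo "$c $(zvol_layout "$c")"; done
```

With 2 CPUs and zvol_threads=4 the sketch yields one taskq with 4 threads, which agrees with the thread count observed after the reboot above.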

NOTE: The comment mentions commit dde9380a1, but it is not in this tree (the Proxmox fork). However, we can find it in the Debian tree: https://salsa.debian.org/zfsonlinux-team/zfs/-/commit/dde9380a1bf9084d0c8a3e073cdd65bb81db1a23

Here is a full copy of that patch:

From dde9380a1bf9084d0c8a3e073cdd65bb81db1a23 Mon Sep 17 00:00:00 2001
From: Etienne Dechamps <e-t172@akegroup.org>
Date: Wed, 8 Feb 2012 22:41:41 +0100
Subject: [PATCH] Use 32 as the default number of zvol threads.

Currently, the `zvol_threads` variable, which controls the number of worker
threads which process items from the ZVOL queues, is set to the number of
available CPUs.

This choice seems to be based on the assumption that ZVOL threads are
CPU-bound. This is not necessarily true, especially for synchronous writes.
Consider the situation described in the comments for `zil_commit()`, which is
called inside `zvol_write()` for synchronous writes:

> itxs are committed in batches. In a heavily stressed zil there will be a
> commit writer thread who is writing out a bunch of itxs to the log for a
> set of committing threads (cthreads) in the same batch as the writer.
> Those cthreads are all waiting on the same cv for that batch.
>
> There will also be a different and growing batch of threads that are
> waiting to commit (qthreads). When the committing batch completes a
> transition occurs such that the cthreads exit and the qthreads become
> cthreads. One of the new cthreads becomes he writer thread for the batch.
> Any new threads arriving become new qthreads.

We can easily deduce that, in the case of ZVOLs, there can be a maximum of
`zvol_threads` cthreads and qthreads. The default value for `zvol_threads` is
typically between 1 and 8, which is way too low in this case. This means
there will be a lot of small commits to the ZIL, which is very inefficient
compared to a few big commits, especially since we have to wait for the data
to be on stable storage. Increasing the number of threads will increase the
amount of data waiting to be commited and thus the size of the individual
commits.

On my system, in the context of VM disk image storage (lots of small
synchronous writes), increasing `zvol_threads` from 8 to 32 results in a 50%
increase in sequential synchronous write performance.

We should choose a more sensible default for `zvol_threads`. Unfortunately
the optimal value is difficult to determine automatically, since it depends
on the synchronous write latency of the underlying storage devices. In any
case, a hardcoded value of 32 would probably be better than the current
situation. Having a lot of ZVOL threads doesn't seem to have any real
downside anyway.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #392
---
 module/zfs/zvol.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index 1636581d5..19888ea96 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -47,7 +47,7 @@
 #include <linux/blkdev_compat.h>
 
 unsigned int zvol_major = ZVOL_MAJOR;
-unsigned int zvol_threads = 0;
+unsigned int zvol_threads = 32;
 
 static taskq_t *zvol_taskq;
 static kmutex_t zvol_state_lock;
@@ -1343,9 +1343,6 @@ zvol_init(void)
 {
 	int error;
 
-	if (!zvol_threads)
-		zvol_threads = num_online_cpus();
-
 	zvol_taskq = taskq_create(ZVOL_DRIVER, zvol_threads, maxclsyspri,
 		                  zvol_threads, INT_MAX, TASKQ_PREPOPULATE);
 	if (zvol_taskq == NULL) {
-- 
GitLab

NOTE: it is possible to bypass these task queues completely with the parameter zvol_request_sync=1, as we can find in the same source file:

static void
zvol_request_impl(zvol_state_t *zv, struct bio *bio, struct request *rq,
    boolean_t force_sync)
{
    // ...
	if (rw == WRITE) {
        // ...
		/*
		 * We don't want this thread to be blocked waiting for i/o to
		 * complete, so we instead wait from a taskq callback. The
		 * i/o may be a ZIL write (via zil_commit()), or a read of an
		 * indirect block, or a read of a data block (if this is a
		 * partial-block write).  We will indicate that the i/o is
		 * complete by calling END_IO() from the taskq callback.
		 *
		 * This design allows the calling thread to continue and
		 * initiate more concurrent operations by calling
		 * zvol_request() again. There are typically only a small
		 * number of threads available to call zvol_request() (e.g.
		 * one per iSCSI target), so keeping the latency of
		 * zvol_request() low is important for performance.
		 *
		 * The zvol_request_sync module parameter allows this
		 * behavior to be altered, for performance evaluation
		 * purposes.  If the callback blocks, setting
		 * zvol_request_sync=1 will result in much worse performance.
		 *
		 * We can have up to zvol_threads concurrent i/o's being
		 * processed for all zvols on the system.  This is typically
		 * a vast improvement over the zvol_request_sync=1 behavior
		 * of one i/o at a time per zvol.  However, an even better
		 * design would be for zvol_request() to initiate the zio
		 * directly, and then be notified by the zio_done callback,
		 * which would call END_IO().  Unfortunately, the DMU/ZIL
		 * interfaces lack this functionality (they block waiting for
		 * the i/o to complete).
		 */
		if (io_is_discard(bio, rq) || io_is_secure_erase(bio, rq)) {
			if (force_sync) {
				zvol_discard(&zvr);
			} else {
				task = zv_request_task_create(zvr);
				taskq_dispatch_ent(ztqs->tqs_taskq[tq_idx],
				    zvol_discard_task, task, 0, &task->ent);
			}
		} else {
			if (force_sync) {
				zvol_write(&zvr);
			} else {
				task = zv_request_task_create(zvr);
				taskq_dispatch_ent(ztqs->tqs_taskq[tq_idx],
				    zvol_write_task, task, 0, &task->ent);
			}
		}
	} else {
		/*
		 * The SCST driver, and possibly others, may issue READ I/Os
		 * with a length of zero bytes.  These empty I/Os contain no
		 * data and require no additional handling.
		 */
		if (size == 0) {
			END_IO(zv, bio, rq, 0);
			goto out;
		}

		rw_enter(&zv->zv_suspend_lock, RW_READER);

		/* See comment in WRITE case above. */
		if (force_sync) {
			zvol_read(&zvr);
		} else {
			task = zv_request_task_create(zvr);
			taskq_dispatch_ent(ztqs->tqs_taskq[tq_idx],
			    zvol_read_task, task, 0, &task->ent);
		}
	}
}

Normally this parameter should be used only for problem evaluation.
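
Before experimenting with it, it helps to record the current value. Below is a minimal sketch, assuming zvol_request_sync is exposed under /sys/module/zfs/parameters like the parameters shown earlier. The function name zfs_param and the ZFS_PARAM_DIR override are hypothetical names used only in this example.

```shell
# Hypothetical helper: print the current value of a ZFS module parameter,
# or "unavailable" if it is not exposed on this system.
#   $1 = parameter name
# ZFS_PARAM_DIR can be set to read from a different directory.
zfs_param() {
    dir=${ZFS_PARAM_DIR:-/sys/module/zfs/parameters}
    if [ -r "$dir/$1" ]; then
        cat "$dir/$1"
    else
        echo "unavailable"
    fi
}

# On a ZFS host:
#   zfs_param zvol_request_sync
#   zfs_param zvol_threads
```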

The original upstream commit (ZoL - the ZFS on Linux project) is here: https://github.com/openzfs/zfs/commit/99741bde59d1d1df0963009bb624ddc105f7d8dc

There is also an interesting long discussion, even with a chart (!), on:

Please note that the FreeBSD version of the ZFS ZVOL write path is vastly different from the Linux one, as can be seen at: https://github.com/openzfs/zfs/blob/master/module/os/freebsd/zfs/zvol_os.c#L869 It is therefore incorrect to expect that a zvol will behave the same on FreeBSD as on Linux.

Resources:

General ZFS notes

ZFS on Proxmox is a two-level fork of "ZFS on Linux" (ZoL): the Proxmox tree is based on the Debian tree, which in turn is based on ZoL.

It means that in case of a problem one should contact the maintainers in the same order (Proxmox, Debian, ZoL team).