IOMMU, SMMU, DMA and performance

Performance Tip

Set cma=0, iommu.passthrough=1, and iommu.strict=0 when testing device performance, unless the device cannot work with these settings or security is a concern.

  • Disable CMA: set cma=0 on the kernel command line. Before 5.10 (see the per-numa-cma patch), there is only one global CMA area reserved. On a NUMA system this can hurt performance if DMA buffers are mapped from CMA on a remote node.

  • Bypass the SMMU for DMA translation (set iommu.passthrough=1) when performance is more important than security. Note that iommu.passthrough=1 only affects DMA mappings made by kernel drivers. For user-space drivers based on VFIO, DMA mapping still goes through the SMMU even if iommu.passthrough=1 (explained below). VFIO is used by DPDK, SPDK, and KVM device assignment.

  • Enable IOMMU lazy mode by setting iommu.strict=0 if security is not a major concern. This can improve IO performance by reducing SMMU TLB invalidation (TLBI) overhead.

To summarize:

  • When testing kernel driver performance, set cma=0 and iommu.passthrough=1
  • When testing the performance of a user-space driver based on VFIO (DPDK, SPDK, SR-IOV), set iommu.strict=0
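
For example, on a grubby-based distribution (Fedora/RHEL style; a sketch, on Debian-style systems edit GRUB_CMDLINE_LINUX in /etc/default/grub and run update-grub instead), the options can be appended to every installed kernel and verified after reboot:

# grubby --update-kernel=ALL --args="cma=0 iommu.passthrough=1 iommu.strict=0"
# reboot
# cat /proc/cmdline   <--- verify the new options are present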

Details

VFIO still uses the SMMU for DMA mapping, even with iommu.passthrough=1

  1. Boot the kernel with the option iommu.passthrough=1

  2. Bind an NVMe disk to vfio-pci:

# modprobe vfio-pci
# echo "0003:04:00.0" > /sys/bus/pci/devices/0003\:04\:00.0/driver/unbind
# echo 144d a808 > /sys/bus/pci/drivers/vfio-pci/new_id
# ls /dev/vfio/
47  vfio
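As an aside, the number that shows up under /dev/vfio (47 above) is the device's IOMMU group, which can be cross-checked through sysfs:

# basename $(readlink /sys/bus/pci/devices/0003:04:00.0/iommu_group)
47
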
  3. Perform IO on the NVMe disk using the QEMU user-space NVMe driver (refer to: https://events.static.linuxfound.org/sites/events/files/slides/Userspace%20NVMe%20driver%20in%20QEMU%20-%20Fam%20Zheng_0.pdf)
# qemu-io -c 'read 0 1G' nvme://0003:04:00.0/1
WARNING: Image format was not specified for 'nvme://0003:04:00.0/1' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:06.81 (150.337 MiB/sec and 0.1468 ops/sec)

  4. While the IO in step 3 is running, use an eBPF tool to check kernel stacks. We can see the QEMU NVMe driver's IO ends up calling arm_smmu_* functions:
# yum install bcc
# /usr/share/bcc/tools/stackcount 'arm_smmu_*'
  b'arm_smmu_tlb_inv_page_nosync'
  b'__arm_lpae_unmap'
  b'__arm_lpae_unmap'
  b'__arm_lpae_unmap'
  b'arm_lpae_unmap'
  b'arm_smmu_unmap'
  b'__iommu_unmap'
  b'iommu_unmap_fast'
  b'vfio_unmap_unpin'
  b'vfio_remove_dma'
  b'vfio_iommu_unmap_unpin_all'
  b'vfio_iommu_type1_detach_group'
  b'__vfio_group_unset_container'
  b'vfio_group_try_dissolve_container'
  b'vfio_group_fops_release'
  b'__fput'
  b'____fput'
  b'task_work_run'
  b'do_notify_resume'
  b'work_pending'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
  b'[unknown]'
    262276
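If full stacks are not needed, funccount from the same bcc package gives a quicker per-function hit count (an alternative sketch using another standard bcc tool):

# /usr/share/bcc/tools/funccount -d 10 'arm_smmu_*'   <--- count arm_smmu_* calls for 10 seconds
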
  5. Use SMMU perf events to verify the SMMU is doing translations:
# perf stat -e smmuv3_pmcg_27ffe0202/transaction/ -a sleep 1
Performance counter stats for 'system wide':

            622224      smmuv3_pmcg_27ffe0202/transaction/

       1.002288245 seconds time elapsed

# perf stat -e smmuv3_pmcg_27ffe0202/tlb_miss/ -a sleep 1
Performance counter stats for 'system wide':

             38648      smmuv3_pmcg_27ffe0202/tlb_miss/

       1.002244245 seconds time elapsed
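
The PMCG instance name (smmuv3_pmcg_27ffe0202 above) is platform specific. The instances present on a given system, and the events they support, can be listed with:

# ls /sys/bus/event_source/devices/ | grep smmuv3_pmcg
# perf list | grep smmuv3_pmcg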

Setting iommu.passthrough=1 and iommu.strict=0 together

  1. Boot the kernel with options iommu.passthrough=1 and iommu.strict=0
  2. Repeat the test above; we see IO performance improve:
# qemu-io -c 'read 0 1G' nvme://0003:04:00.0/1
WARNING: Image format was not specified for 'nvme://0003:04:00.0/1' and probing guessed raw.
         Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
         Specify the 'raw' format explicitly to remove the restrictions.
read 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 00.39 sec (2.539 GiB/sec and 2.5392 ops/sec)
  3. There are more SMMU transactions per second when iommu.strict=0:
# perf stat -e smmuv3_pmcg_27ffe0202/transaction/ -a sleep 1
Performance counter stats for 'system wide':

           4233791      smmuv3_pmcg_27ffe0202/transaction/ <--- compared with '622224' if iommu.strict=1

       1.002524565 seconds time elapsed

What exactly is iommu.passthrough=1?

Passthrough and bypass are confusing terms when talking about the SMMU (IOMMU). Quoting from the original patch: https://lists.linuxfoundation.org/pipermail/iommu/2017-March/020818.html

The IOMMU core currently initialises the default domain for each group
to IOMMU_DOMAIN_DMA, under the assumption that devices will use
IOMMU-backed DMA ops by default. However, in some cases it is desirable
for the DMA ops to bypass the IOMMU for performance reasons

We can see that the iommu.passthrough option sets the default domain type to either IOMMU_DOMAIN_IDENTITY or IOMMU_DOMAIN_DMA:

+static int __init iommu_set_def_domain_type(char *str)
+{
+	bool pt;
+
+	if (!str || strtobool(str, &pt))
+		return -EINVAL;
+
+	iommu_def_domain_type = pt ? IOMMU_DOMAIN_IDENTITY : IOMMU_DOMAIN_DMA;
+	return 0;
+}
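
When the option is absent from the command line, iommu_def_domain_type falls back to the compile-time default (see the kernel-parameters excerpt in the Reference section). As a quick check against the running kernel, assuming the distribution installs the config file under /boot:

# grep IOMMU_DEFAULT /boot/config-$(uname -r)   <--- shows CONFIG_IOMMU_DEFAULT_* settings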

The default domain is used by the kernel DMA API:

https://elixir.bootlin.com/linux/v5.12.8/source/drivers/iommu/dma-iommu.c#L1291

void iommu_setup_dma_ops(struct device *dev, u64 dma_base, u64 size)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

	if (!domain)
		goto out_err;

	/*
	 * The IOMMU core code allocates the default DMA domain, which the
	 * underlying IOMMU driver needs to support via the dma-iommu layer.
	 */
	if (domain->type == IOMMU_DOMAIN_DMA) {
		/*
		 * If domain->type != IOMMU_DOMAIN_DMA, dev->dma_ops stays NULL
		 * and the DMA API falls back to dma_map_direct().
		 */
		if (iommu_dma_init_domain(domain, dma_base, size, dev))
			goto out_err;
		dev->dma_ops = &iommu_dma_ops;
	}

	return;
out_err:
	 pr_warn("Failed to set up IOMMU for device %s; retaining platform DMA ops\n",
		 dev_name(dev));
}

VFIO, however, allocates its own domain with type IOMMU_DOMAIN_UNMANAGED, independent of the default domain:

struct iommu_domain *iommu_domain_alloc(struct bus_type *bus)
{
	return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED);
}

To summarize: the default domain is used by the kernel DMA API, while VFIO allocates its own domain of type IOMMU_DOMAIN_UNMANAGED. This is why iommu.passthrough=1 does not bypass the SMMU for VFIO mappings.

To check the resulting default domain type, with iommu.passthrough=1:

# find /sys/kernel/iommu_groups -name type -type f -exec cat {} \;
identity
identity
identity
... ...

With iommu.passthrough=0:

# find /sys/kernel/iommu_groups -name type -type f -exec cat {} \;
DMA
DMA
... ...
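
On recent kernels this type file is also writable, so the default domain type of a single group can be switched at runtime, provided no device in the group is bound to a driver (a sketch; whether a given LTS kernel supports this write interface needs checking):

# echo 0003:04:00.0 > /sys/bus/pci/devices/0003:04:00.0/driver/unbind
# echo identity > /sys/kernel/iommu_groups/47/type
# cat /sys/kernel/iommu_groups/47/type
identity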

SMMUv3 is configured in STE bypass mode when iommu.passthrough=1

https://patchwork.kernel.org/project/linux-arm-kernel/patch/[email protected]/

An identity domain is created by placing the corresponding stream table
entries into "bypass" mode, which allows transactions to flow through
the SMMU without any translation.

https://github.com/torvalds/linux/blob/v5.16-rc4/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#L2167

	if (domain->type == IOMMU_DOMAIN_IDENTITY) {
		smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS;
		return 0;
	}
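
A related arm-smmu-v3 knob is the disable_bypass module parameter, which controls whether transactions from streams that have no attached domain are aborted (the default) or allowed to bypass the SMMU; worth checking if a device stops working while experimenting with these modes:

# cat /sys/module/arm_smmu_v3/parameters/disable_bypass
Y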

Reference

https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

        iommu.passthrough=
                        [ARM64, X86] Configure DMA to bypass the IOMMU by default.
                        Format: { "0" | "1" }
                        0 - Use IOMMU translation for DMA.
                        1 - Bypass the IOMMU for DMA.
                        unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
        iommu.strict=   [ARM64] Configure TLB invalidation behaviour
                        Format: { "0" | "1" }
                        0 - Lazy mode.
                          Request that DMA unmap operations use deferred
                          invalidation of hardware TLBs, for increased
                          throughput at the cost of reduced device isolation.
                          Will fall back to strict mode if not supported by
                          the relevant IOMMU driver.
                        1 - Strict mode (default).
                          DMA unmap operations invalidate IOMMU hardware TLBs
                          synchronously.
        cma=nn[MG]@[start[MG][-end[MG]]]
                        [KNL,CMA]
                        Sets the size of kernel global memory area for
                        contiguous memory allocations and optionally the
                        placement constraint by the physical address range of
                        memory allocations. A value of 0 disables CMA
                        altogether. For more information, see
                        kernel/dma/contiguous.c
        cma_pernuma=nn[MG]
                        [ARM64,KNL,CMA]
                        Sets the size of kernel per-numa memory area for
                        contiguous memory allocations. A value of 0 disables
                        per-numa CMA altogether. If this option is not
                        specified, the default value is 0.
                        With per-numa CMA enabled, DMA users on node nid will
                        first try to allocate buffer from the pernuma area
                        which is located in node nid, if the allocation fails,
                        they will fallback to the global default memory area.
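
To confirm the CMA options took effect after boot, the reserved area is visible in /proc/meminfo; with cma=0 (and no per-numa CMA) both counters should read 0 kB:

# grep -i cma /proc/meminfo
CmaTotal:              0 kB
CmaFree:               0 kB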