MLNX_OFED step by step guide
This is a step-by-step guide for the ConnectX-6 Dx. For other devices, the supported firmware version or maximum limits may vary; please check the respective sections. For DPU users, minor changes to the command examples are described with the individual command(s).
Update the firmware to a version that supports scalable functions. The minimum firmware version needed is 16.31.0378. It can be downloaded from firmware downloads.
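To confirm the currently running firmware version before and after the update, a quick check like the one below can be used (ens3f0np0 is only an example netdev name; use your device's uplink netdev):
$ ethtool -i ens3f0np0 | grep firmware-version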
Once the firmware is updated, enable scalable function support in the device. Scalable function support must be enabled on the PF where SFs will be used.
$ mlxconfig -d 0000:03:00.0 s PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=10 SRIOV_EN=0
Note: In the above example SR-IOV is disabled. However, it is NOT mandatory to disable SR-IOV; SFs and VFs can coexist.
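To verify that the settings were stored, query the device configuration with mlxconfig; the grep filter below is just one convenient way to narrow the output:
$ mlxconfig -d 0000:03:00.0 q | grep -E 'PF_TOTAL_SF|PER_PF_NUM_SF|PF_SF_BAR_SIZE|PF_BAR2_ENABLE'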
When SFs are to be used on the external controller of the DPU, the user must enable SFs on the external host PF.
(a) Disable global symmetrical MSI-X configuration in external host PF.
$ mlxconfig -d 0000:03:00.0 s NUM_PF_MSIX_VALID=0
(b) Enable per PF MSI-X configuration in external host PF.
$ mlxconfig -d 0000:03:00.0 s PF_NUM_PF_MSIX_VALID=1
(c) Set up the MSI-X vectors per PF. This should be four times the number of SFs configured. For example, when PF_TOTAL_SF=250, configure the MSI-X vectors to be 1000.
$ mlxconfig -d 0000:03:00.0 s PF_TOTAL_SF=250 PF_NUM_PF_MSIX=1000 PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_SF_BAR_SIZE=10 SRIOV_EN=0
Note: A high number of MSI-X vectors and SFs requires a larger BAR2 size. Some older BIOSes/systems may not be capable of supporting a large BAR size. Hence, the user should enable per-PF MSI-X vectors or a high number of SFs carefully.
Perform a cold system reboot for the configuration to take effect. If scalable functions are desired on the external host controller, configure them there as well using the mlxconfig tool before the cold reboot.
Scalable functions use a 4-step process from creation to use, as shown below.

$ devlink dev eswitch set pci/0000:03:00.0 mode switchdev
$ devlink dev eswitch show pci/0000:03:00.0
$ devlink port show
pci/0000:03:00.0/65535: type eth netdev ens3f0np0 flavour physical port 0 splittable false
Scalable functions are managed using mlxdevm tool supplied with iproute2 package. It is located at /opt/mellanox/iproute2/sbin/mlxdevm
After addition, an SF is still not usable by the end-user application. It becomes usable only after configuration and activation.
$ mlxdevm port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:03:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 function: hw_addr 00:00:00:00:00:00 state inactive opstate detached
When an SF is added for the external controller, such as on a DPU/SmartNIC, the user needs to supply the controller number. In a single-host DPU case, there is only one external controller, starting with controller number 1.
Example of adding SF for the PF 0 of the external controller 1:
$ mlxdevm port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 88 controller 1
pci/0000:03:00.0/32768: type eth netdev eth6 flavour pcisf controller 1 pfnum 0 sfnum 88 splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached
Notice the difference in controller number, 0 vs 1. SFs on the DPU root complex are created with an implicit controller = 0, while SFs for the external host controller are created by the DPU with controller = 1.
Show the SF by port index or by its representor device
$ mlxdevm port show en3f0pf0sf88
Or
$ mlxdevm port show pci/0000:03:00.0/32768
pci/0000:03:00.0/32768: type eth netdev en3f0pf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 function: hw_addr 00:00:00:00:00:00 state inactive opstate detached
$ mlxdevm port function set pci/0000:03:00.0/32768 hw_addr 00:00:00:00:88:88
The default maximum number of channels for an SF is 8, but it can be changed. The "max_io_eqs" attribute sets the maximum number of channels.
$ mlxdevm port function set pci/0000:03:00.0/32768 max_io_eqs 16
- If max_io_eqs is 0 then the netdev won't be created.
- If max_io_eqs is not set, the default is 8.
- For all other max_io_eqs values, the max combined queues will depend on the number of cores and SF completion EQs.
In this example, when the number of channels is set to 16, the activated SF's netdev will show 16 channels:
$ ethtool -l <sf_netdev>
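Illustrative ethtool -l output for an SF netdev configured with max_io_eqs 16 (the netdev name enp3s0f0s88 and the exact values depend on your system and the number of CPU cores):
$ ethtool -l enp3s0f0s88
Channel parameters for enp3s0f0s88:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       16
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       16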
$ systemctl start openvswitch
$ ovs-vsctl add-br network1
$ ovs-vsctl add-port network1 en3f0pf0sf88
$ ip link set dev en3f0pf0sf88 up
$ ovs-vsctl add-port network1 ens3f0np0
$ ip link set dev ens3f0np0 up
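The resulting bridge and port configuration can be verified with:
$ ovs-vsctl show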
Activating the SF creates an auxiliary device and initiates the driver load sequence for the netdevice, RDMA, and vDPA devices.
Once the operational state is marked as attached, the driver is attached to this SF and device loading starts.
An application interested in using the SF's netdevice and RDMA device needs to monitor them either through a udev monitor or by polling the sysfs hierarchy of the SF's auxiliary device (see the udevadm example after the activation command below).
In the future, an explicit option will be added to deterministically add the netdev and RDMA device of the SF.
$ mlxdevm port function set pci/0000:03:00.0/32768 state active
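One simple way to observe the SF's netdevice appearing after activation is a udev monitor on the net subsystem; this is only a sketch, and orchestration software may instead poll sysfs as noted above:
$ udevadm monitor --kernel --subsystem-match=net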
On the DPU (BlueField), additional steps are needed to unbind the SF from the SF configuration driver and bind it to the mlx5_core.sf driver:
$ echo mlx5_core.sf.5 > /sys/bus/auxiliary/drivers/mlx5_core.sf_cfg/unbind
$ echo mlx5_core.sf.5 > /sys/bus/auxiliary/drivers/mlx5_core.sf/bind
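To confirm that the rebind succeeded, list the devices bound to each auxiliary driver (mlx5_core.sf.5 is the example auxiliary device used above):
$ ls /sys/bus/auxiliary/drivers/mlx5_core.sf_cfg/
$ ls /sys/bus/auxiliary/drivers/mlx5_core.sf/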
$ mlxdevm port show en3f0pf0sf88 -jp
{
    "port": {
        "pci/0000:03:00.0/32768": {
            "type": "eth",
            "netdev": "en3f0pf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}
$ tree -l -L 3 -P "mlx5_core.sf.*" /sys/bus/auxiliary/devices/
$ devlink dev show
$ devlink dev show auxiliary/mlx5_core.sf.4
$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev enp3s0f0s88 flavour physical port 0 splittable false
Netdevice and RDMA device can be seen using iproute2 tools.
$ ip link show
$ rdma link show
At this stage orchestration software or user should assign IP address and use it for application.
Once SF usage is complete, deactivate the SF. This triggers the driver unload in the host system. Once the SF is deactivated, its operational state changes to "detached". An orchestration system should poll for the operational state to change to "detached" before deleting the SF; this ensures a graceful hot unplug (a polling sketch follows the command below).
$ mlxdevm port function set pci/0000:03:00.0/32768 state inactive
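A minimal polling sketch for orchestration software, assuming jq is installed and using the example port index from above (the JSON path mirrors the -jp output shown earlier):
$ while true; do
    opstate=$(mlxdevm port show pci/0000:03:00.0/32768 -j | jq -r '.port["pci/0000:03:00.0/32768"].function.opstate')
    [ "$opstate" = "detached" ] && break
    sleep 1
  done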
Finally, once the state is "inactive" and the operational state is "detached", the user can safely delete the SF. For faster provisioning, a user can reconfigure and activate the SF again without deleting it.
$ mlxdevm port del pci/0000:03:00.0/32768
SFs share IRQs either with peer SFs or with the parent PF. To get the best performance, it is desirable to set the SF's CPU affinity. Setting the SF's CPU affinity ensures that the SF consumes resources and handles packets only on the specified CPUs.
A typical example would be a 200-core system where each SF is attached to one CPU. In another example, a 64-core system running 256 containers gives each SF an affinity of at most 8 CPUs, such as SF-0 -> affinity = 0-7, SF-1 -> affinity = 8-15, SF-2 -> affinity = 16-23, [...], SF-8 -> affinity = 0-7.
With the above scheme, SF-0 and SF-8 will use CPU cores 0 to 7. By default, SFs do not have any CPU affinity set up.
$ mlxdevm dev param set auxiliary/mlx5_core.sf.2 name cpu_affinity value 0-2,5 cmode driverinit
After setting the CPU affinity, the user must reload the SF's devlink instance for the affinity to take effect.
$ devlink dev reload auxiliary/mlx5_core.sf.2
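A sketch for assigning affinity across several SFs, assuming hypothetical auxiliary device indices 2-4 and 8 CPUs per SF (adjust the indices and CPU ranges to your system):
$ for i in 2 3 4; do
    mlxdevm dev param set auxiliary/mlx5_core.sf.$i name cpu_affinity \
        value "$(( (i-2)*8 ))-$(( (i-2)*8+7 ))" cmode driverinit
    devlink dev reload auxiliary/mlx5_core.sf.$i
  done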
If SFs are used for containers, then once the SF is reloaded, its netdevice and RDMA device should be assigned to the container's network namespace.
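A minimal sketch of moving an SF into a container's network namespace; the namespace name, SF netdev name, and RDMA device name below are hypothetical, and moving the RDMA device requires the RDMA subsystem to be in exclusive netns mode (rdma system set netns exclusive):
$ ip netns add container1
$ ip link set dev enp3s0f0s88 netns container1
$ rdma dev set mlx5_2 netns container1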
When the physical port (uplink) netdevice representor is used with the bonding driver, SF traffic flows through this bond interface. When the user prefers such a bonding configuration, the sequence below must be followed (a condensed sketch follows after the list and the teardown note).
- Move both the PCI PFs to the switchdev mode
- Setup bonding between physical port (uplink) representors
- Create SFs
The destroy sequence must be a mirror of this: destroy all SFs first, before destroying the bond configuration.
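A condensed sketch of the create sequence, assuming a second PF at 0000:03:00.1 with uplink representor ens3f1np1 and an active-backup bond (the names and bond mode are examples only):
$ devlink dev eswitch set pci/0000:03:00.0 mode switchdev
$ devlink dev eswitch set pci/0000:03:00.1 mode switchdev
$ ip link add bond0 type bond mode active-backup
$ ip link set dev ens3f0np0 down
$ ip link set dev ens3f1np1 down
$ ip link set dev ens3f0np0 master bond0
$ ip link set dev ens3f1np1 master bond0
$ ip link set dev bond0 up
$ mlxdevm port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 88
To tear down, delete the SFs first with mlxdevm port del and only then remove the bond.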