MLNX_OFED step by step guide

1. Device configuration

This is a step-by-step guide for ConnectX-6 Dx. For other devices, the required firmware version and maximum limits may vary; please check the respective sections. For DPU users, minor changes to the command examples are described with the individual command(s).

1.1 Update firmware

Update to a firmware version that supports scalable functions. The minimum firmware version needed is 16.31.0378. It can be downloaded from the firmware downloads page.

1.2 Enable support

Once the firmware is updated, enable scalable function support in the device. Scalable function support must be enabled on each PF where SFs will be used.

$ mlxconfig -d 0000:03:00.0 s PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_TOTAL_SF=252 PF_SF_BAR_SIZE=10 SRIOV_EN=0

Note: In the above example SR-IOV is disabled. However, it is NOT mandatory to disable SR-IOV; SFs and VFs can coexist.
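
To verify the settings before rebooting, query the current configuration (a simple check using grep; output formatting may vary across MFT versions):

$ mlxconfig -d 0000:03:00.0 q | grep -E 'PF_TOTAL_SF|PER_PF_NUM_SF|PF_SF_BAR_SIZE'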

When SFs are to be used on the external controller of the DPU, the user must enable SFs on the external host PF.

(a) Disable the global symmetrical MSI-X configuration on the external host PF.

$ mlxconfig -d 0000:03:00.0 s NUM_PF_MSIX_VALID=0

(b) Enable the per-PF MSI-X configuration on the external host PF.

$ mlxconfig -d 0000:03:00.0 s PF_NUM_PF_MSIX_VALID=1

(c) Set up the MSI-X vectors per PF. The value should be four times the number of SFs configured. For example, when PF_TOTAL_SF=250, configure 1000 MSI-X vectors.

$ mlxconfig -d 0000:03:00.0 s PF_TOTAL_SF=250 PF_NUM_PF_MSIX=1000 PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_SF_BAR_SIZE=10 SRIOV_EN=0

Note: MSI-X vectors and a high number of SFs require a larger BAR2 size. Some older BIOSes/systems may not be capable of supporting a large BAR size. Hence, the user should enable per-PF MSI-X vectors or a high number of SFs carefully.

1.3 Cold reboot

Perform a cold system reboot for the configuration to take effect. If scalable functions are desired on the external host controller, configure them there as well using the mlxconfig tool before the cold reboot.

2. Software control and commands

Scalable functions use a four-step process, from creation to use, as shown below.

![Create, configure, deploy, use flow](pictures/create-config-deploy-use.png)

2.1 Move PCI PF to switchdev mode

$ devlink dev eswitch set pci/0000:03:00.0 mode switchdev
$ devlink dev eswitch show pci/0000:03:00.0
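
The show command should confirm switchdev mode; the output resembles the following (exact fields vary by kernel and iproute2 version):

pci/0000:03:00.0: mode switchdev inline-mode none encap-mode basic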

2.2 Show the physical (aka uplink) port of the PF

$ devlink port show
pci/0000:03:00.0/65535: type eth netdev ens3f0np0 flavour physical port 0 splittable false

2.3 Add one SF

Scalable functions are managed using the mlxdevm tool supplied with the iproute2 package. It is located at /opt/mellanox/iproute2/sbin/mlxdevm.

After addition, an SF is not yet usable by the end-user application. It becomes usable only after configuration and activation.

$ mlxdevm port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:03:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

When an SF is added for the external controller, such as on a DPU/SmartNIC, the user needs to supply the controller number. In a single-host DPU case, there is only one external controller, starting with controller number = 1.

Example of adding an SF for PF 0 of external controller 1:

$ mlxdevm port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 88 controller 1
pci/0000:03:00.0/32768: type eth netdev eth6 flavour pcisf controller 1 pfnum 0 sfnum 88 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

Notice the difference between controller numbers 0 and 1. SFs on the DPU root complex are created with the implicit controller = 0, while SFs for the external host controller are created by the DPU using controller = 1.

2.4 Show the newly added devlink port

Show the SF by its port index or by its representor netdevice.

$ mlxdevm port show en3f0pf0sf88

Or

$ mlxdevm port show pci/0000:03:00.0/32768
pci/0000:03:00.0/32768: type eth netdev en3f0pf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

2.5 Set the MAC address of the SF

$ mlxdevm port function set pci/0000:03:00.0/32768 hw_addr 00:00:00:00:88:88

2.6 Set the number of netdev channels (optional)

The default maximum number of channels for an SF is 8, but this can be changed. The "max_io_eqs" attribute sets the maximum number of channels.

$ mlxdevm port function set pci/0000:03:00.0/32768 max_io_eqs 16
  • If max_io_eqs is 0, the netdev won't be created.
  • If max_io_eqs is not set, the default is 8.
  • For all other max_io_eqs values, the maximum number of combined queues depends on the number of cores and SF completion EQs.

In this example, where the number of channels is set to 16, the activated SF's netdev will show 16 channels:

$ ethtool -l <sf_netdev>
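
The runtime channel count can then be tuned within that maximum using ethtool (illustrative; <sf_netdev> is a placeholder for the SF's netdev name):

$ ethtool -L <sf_netdev> combined 16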

2.7 Configure OVS (Open vSwitch)

$ systemctl start openvswitch
$ ovs-vsctl add-br network1
$ ovs-vsctl add-port network1 en3f0pf0sf88
$ ip link set dev en3f0pf0sf88 up
$ ovs-vsctl add-port network1 ens3f0np0
$ ip link set dev ens3f0np0 up
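
To verify that the bridge and both ports were added (standard OVS command):

$ ovs-vsctl show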

2.8 Now activate the SF

Activating the SF creates an auxiliary device and initiates the driver load sequence for the netdevice, RDMA, and vDPA devices.

Once the operational state is reported as attached, the driver is bound to this SF and device loading starts.

An application interested in using the SF's netdevice and RDMA device needs to monitor them either through a udev monitor or by polling the sysfs hierarchy of the SF's auxiliary device.
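
For example, device events can be watched with udevadm while the SF is activated (a minimal sketch that filters by subsystem only):

$ udevadm monitor --kernel --subsystem-match=net --subsystem-match=infiniband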

In the future, an explicit option will be added to deterministically add the netdev and RDMA device of the SF.

$ mlxdevm port function set pci/0000:03:00.0/32768 state active

2.9 Bind the SF driver (DPU only)

On a DPU (BlueField), additional steps are needed to unbind the SF from the SF configuration driver and bind it to the mlx5_core.sf driver:

$ echo mlx5_core.sf.5 > /sys/bus/auxiliary/drivers/mlx5_core.sf_cfg/unbind
$ echo mlx5_core.sf.5 > /sys/bus/auxiliary/drivers/mlx5_core.sf/bind
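
To confirm the bind took effect (a simple sysfs check; the device entry mlx5_core.sf.5 should be listed):

$ ls /sys/bus/auxiliary/drivers/mlx5_core.sf/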

2.10 View the new state of the SF

$ mlxdevm port show en3f0pf0sf88 -jp
{
    "port": {
        "pci/0000:03:00.0/32768": {
            "type": "eth",
            "netdev": "en3f0pf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

2.11 View the auxiliary device of the SF

$ tree -l -L 3 -P "mlx5_core.sf.*" /sys/bus/auxiliary/devices/

$ devlink dev show
$ devlink dev show auxiliary/mlx5_core.sf.4

2.12 View the port and netdevice associated with the SF

$ devlink port show auxiliary/mlx5_core.sf.4/1
auxiliary/mlx5_core.sf.4/1: type eth netdev enp3s0f0s88 flavour virtual port 0 splittable false

The netdevice and RDMA device can be seen using the iproute2 tools.

$ ip link show
$ rdma link show

At this stage, the orchestration software or user should assign an IP address to the SF netdevice and use it for the application.
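
For example (a hypothetical address; the SF netdev name follows the output above):

$ ip addr add 192.168.100.2/24 dev enp3s0f0s88
$ ip link set dev enp3s0f0s88 up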

2.13 Deactivate SF

Once SF usage is complete, deactivate the SF. This triggers a driver unload in the host system. Once the SF is deactivated, its operational state changes to "detached". An orchestration system should poll for the operational state to change to "detached" before deleting the SF; this ensures a graceful hot unplug.

$ mlxdevm port function set pci/0000:03:00.0/32768 state inactive
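
A minimal sketch of such polling, assuming the jq tool is available and the JSON layout shown in section 2.10:

$ until [ "$(mlxdevm port show pci/0000:03:00.0/32768 -j | jq -r '.port[].function.opstate')" = "detached" ]; do sleep 1; done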

2.14 Delete SF

Finally, once the state is "inactive" and the operational state is "detached", the user can safely delete the SF. For faster provisioning, a user can reconfigure and activate the SF again without deletion.

$ mlxdevm port del pci/0000:03:00.0/32768

2.15 Set affinity of SF (Optional)

SFs share IRQs either with peer SFs or with the parent PF. To get the best performance, it is desirable to set the SF's CPU affinity. Setting the SF's CPU affinity ensures that the SF consumes resources and handles packets only on the CPUs specified.

A typical example would be a 200-core system where each SF is attached to one CPU. In another example, a 64-core system runs 256 containers, and each SF is given a CPU affinity of at most 8 CPUs, such as SF-0 -> affinity = 0-7, SF-1 -> affinity = 8-15, SF-2 -> affinity = 16-23, [...], SF-8 -> affinity = 0-7.

With the above scheme, SF-0 and SF-8 will use CPU cores 0 to 7. By default, SFs do not have a CPU affinity set up.

$ mlxdevm dev param set auxiliary/mlx5_core.sf.2 name cpu_affinity value 0-2,5 cmode driverinit

After setting the CPU affinity, the user must reload the SF instance for the affinity to take effect.

$ devlink dev reload auxiliary/mlx5_core.sf.2

If SFs are used for containers, then once the SF is reloaded, its netdevice and RDMA device should be assigned to the container's network namespace.
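
For example (hypothetical names: network namespace ns1, SF netdev enp3s0f0s88, and RDMA device mlx5_5; the RDMA subsystem must be in exclusive netns mode before an RDMA device can be moved):

$ rdma system set netns exclusive
$ ip netns add ns1
$ ip link set dev enp3s0f0s88 netns ns1
$ rdma dev set mlx5_5 netns ns1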

2.16 SFs with bonding (Optional)

When the physical port (aka uplink) representor netdevice is used with the bonding driver, SF traffic flows through the bond interface. When the user prefers such a bonding configuration, the following sequence must be followed (see the sketch after this list).

  1. Move both the PCI PFs to the switchdev mode
  2. Setup bonding between physical port (uplink) representors
  3. Create SFs

The destroy sequence must be a mirror of it, that is, destroy all SFs first, before destroying the bond configuration.
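
A minimal sketch of the create sequence, assuming a second PF at 0000:03:00.1 with uplink representor ens3f1np1 (all names are illustrative):

$ devlink dev eswitch set pci/0000:03:00.0 mode switchdev
$ devlink dev eswitch set pci/0000:03:00.1 mode switchdev
$ ip link add bond0 type bond mode 802.3ad
$ ip link set dev ens3f0np0 down
$ ip link set dev ens3f0np0 master bond0
$ ip link set dev ens3f1np1 down
$ ip link set dev ens3f1np1 master bond0
$ ip link set dev bond0 up
$ mlxdevm port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 88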
