Static Routing - mellanox/mlxsw GitHub Wiki
Table of Contents
- Features by Version
- Basic Settings
- Router Interfaces
- Nexthop Routes
- Neighbours
- ECMP Routes
- IPv6 Source-specific Routing
- Disabling ICMP Redirects
- Route Offload Notifications
- Recommended Sysctl Configuration
- Further Resources
Features by Version
Kernel Version | Features |
---|---|
4.8 | IPv4 routing |
4.11 | net.ipv4.conf.default.ignore_routes_with_linkdown support |
4.14 | IPv6 routing, nexthop offload indication |
5.2 | neighbour offload indication |
5.11 | Nexthop objects support |
5.12 | Route offload notifications |
5.13 | Resilient next-hop groups |
5.14 | Inner layer 3 and custom multi-path hash policies, remove abort mechanism |
5.16 | Multiple router interface MAC prefixes support |
6.11 | Multipath seed configuration |
Basic Settings
In order for routing to work on a Linux system, forwarding must be enabled. To check if forwarding is enabled, run:
$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
$ sysctl net.ipv6.conf.all.forwarding
net.ipv6.conf.all.forwarding = 0
In this case, IPv4/IPv6 forwarding is disabled. To enable it, run:
$ sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
$ sysctl -w net.ipv6.conf.all.forwarding=1
net.ipv6.conf.all.forwarding = 1
To enable it permanently across reboots, run:
$ echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/forward.conf
$ echo "net.ipv6.conf.all.forwarding = 1" >> /etc/sysctl.d/forward.conf
In a similar way, the other IP sysctls can be adjusted. For example, the maximum number of IPv4 neighbour entries can be adjusted via net.ipv4.neigh.default.gc_thresh3.
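The persistent forwarding configuration above can also be written in one step. The following is a minimal sketch, not part of any tool: the function name write_forwarding_conf is made up for illustration, and the target directory is a parameter so that the function can be tried against a scratch directory first.

```shell
# Write both forwarding settings to forward.conf in directory $1
# (defaults to /etc/sysctl.d). The function name is hypothetical.
write_forwarding_conf() {
    dir=${1:-/etc/sysctl.d}
    printf '%s\n' \
        'net.ipv4.ip_forward = 1' \
        'net.ipv6.conf.all.forwarding = 1' > "$dir/forward.conf"
}
```

After writing the file as root, the settings can be applied without a reboot by running sysctl --system, which reloads all sysctl configuration files.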
Router Interfaces
Whenever an IP address is assigned to a port netdevice or one of its uppers
(e.g. bridge, team, VLAN), a router interface is automatically created in the
hardware. In the following example, two router interfaces are created, one for sw1p1 and one for sw1p2:
$ ip addr add 192.168.0.1/24 dev sw1p1
$ ip link set dev sw1p1 up
$ ip addr add 192.168.1.1/24 dev sw1p2
$ ip link set dev sw1p2 up
For each address and its broadcast and network addresses, traps are inserted into the hardware which cause the appropriate packets to be delivered to the kernel.
Router Interface MAC Addresses
Spectrum ASICs require that all router interfaces use only a limited number of unique MAC address prefixes (so-called MAC profiles). The following table shows how many MAC profiles each ASIC supports and how long the MAC prefix is:
ASIC | Profiles | Profile prefix length |
---|---|---|
Spectrum-1 | 1 | 38 bits |
Spectrum-2 and later | 4 | 36 bits |
Thus, e.g., on Spectrum-2, when the 36 most significant bits are taken from the MAC address of each router interface, the result must be at most four distinct values.
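The prefix condition can be checked before MAC addresses are assigned. The following is a rough sketch in POSIX shell, not something mlxsw provides: the function name count_mac_profiles is invented for illustration, and it assumes colon-separated MAC addresses and a shell with 64-bit arithmetic.

```shell
# Print how many distinct MAC prefixes (profiles) the given addresses use.
# $1 is the prefix length in bits, the remaining arguments are MAC addresses.
# Hypothetical helper for illustration only.
count_mac_profiles() {
    bits=$1
    shift
    for mac in "$@"; do
        hex=$(printf '%s' "$mac" | tr -d ':')
        # Keep only the $bits most significant bits of the 48-bit address.
        printf '%x\n' "$(( 0x$hex >> (48 - $bits) ))"
    done | sort -u | wc -l | tr -d ' '
}
```

For example, count_mac_profiles 36 e4:1d:2d:45:a1:f1 e4:1d:2d:45:a9:f1 prints 1, because the two addresses only differ below the 36-bit prefix, while the same addresses with a prefix length of 38 yield 2. The addresses of existing netdevices can be read from /sys/class/net/<netdev>/address.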
Note: Prior to kernel 5.16, mlxsw only supported 1 MAC profile regardless of the ASIC generation.
The maximum number of profiles and current occupancy of the router interface MAC profiles can be queried from the kernel using devlink resource:
$ devlink resource show pci/0000:06:00.0
pci/0000:06:00.0:
...
name rif_mac_profiles size 4 occ 0 unit entry dpipe_tables none
IPv6 Addresses and Interface Down
By default, the kernel flushes all the IPv6 addresses upon interface down:
$ ip -6 address show dev sw1p1
28: sw1p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2001:db8::1/32 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e61d:2dff:fe45:a9f1/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev sw1p1 down
$ ip -6 address show dev sw1p1
To be consistent with IPv4 and keep static global addresses with no expiration time upon interface down, run:
$ sysctl -w net.ipv6.conf.sw1p1.keep_addr_on_down=1
$ ip -6 address show dev sw1p1
inet6 2001:db8::1/32 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e61d:2dff:fe45:a9f1/64 scope link
valid_lft forever preferred_lft forever
$ ip link set dev sw1p1 down
$ ip -6 address show dev sw1p1
28: sw1p1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
inet6 2001:db8::1/32 scope global tentative
valid_lft forever preferred_lft forever
Note that the global address 2001:db8::1/32 is still configured on the interface, while the link-local address fe80::e61d:2dff:fe45:a9f1/64 was flushed.
To make this option the default for all the netdevices on the system upon boot, set net.ipv6.conf.all.keep_addr_on_down=1 in the sysctl configuration files.
Bridge Layer 3 Configuration
As previously stated, it is possible to create router interfaces on top of bridge netdevs by assigning them an IP address. In the case of the VLAN-aware bridge, a router interface can be created for each of its upper VLAN devices.
To create a router interface for the bridge netdev itself, run:
$ ip link add name br0 type bridge vlan_filtering 1
...
$ ip addr add 192.168.0.1/24 dev br0
And for one of its upper VLAN devices, run:
$ ip link add link br0 name br0.10 type vlan id 10
$ bridge vlan add dev br0 vid 10 self
$ ip addr add 192.168.1.1/24 dev br0.10
Nexthop Routes
Once the router interface is created, it is possible to add routes:
$ ip route add 192.168.2.0/24 via 192.168.0.2 dev sw1p1
$ ip route add 192.168.3.0/24 via 192.168.1.2 dev sw1p2
To list the routes, run:
$ ip route
192.168.0.0/24 dev sw1p1 proto kernel scope link src 192.168.0.1 offload
192.168.1.0/24 dev sw1p2 proto kernel scope link src 192.168.1.1 offload
192.168.2.0/24 via 192.168.0.2 dev sw1p1 offload
192.168.3.0/24 via 192.168.1.2 dev sw1p2 offload
The offload flag indicates that the route is offloaded to hardware.
Neighbours
A neighbour entry is created for each nexthop. To list the neighbour entries, run:
$ ip neigh
192.168.0.2 dev sw1p1 INCOMPLETE
192.168.1.2 dev sw1p2 INCOMPLETE
After neighbour discovery takes place, the output changes:
$ ip neigh
192.168.0.2 dev sw1p1 52:54:00:aa:bb:01 REACHABLE
192.168.1.2 dev sw1p2 52:54:00:aa:bb:02 REACHABLE
ECMP Routes
To add routes with multiple nexthops, run:
$ ip route add 192.168.5.0/24 nexthop via 192.168.0.2 dev sw1p1 weight 1 nexthop via 192.168.1.2 dev sw1p2 weight 1
IPv6 ECMP Routes
Unlike with IPv4, for IPv6 the kernel allows individual nexthops to be added and removed without deleting the entire ECMP route and re-adding it with a modified nexthop configuration.
To add an ECMP route, run:
$ ip -6 route add 2001:db81::/32 \
nexthop via fe80::e61d:2dff:fea5:f341 dev sw1p1 \
nexthop via fe80::e61d:2dff:fea5:f365 dev sw1p2
$ ip -6 route show 2001:db81::/32
2001:db81::/32 metric 1024
nexthop via fe80::e61d:2dff:fea5:f341 dev sw1p1 weight 1 offload
nexthop via fe80::e61d:2dff:fea5:f365 dev sw1p2 weight 1 offload
To delete the first nexthop, run:
$ ip -6 route del 2001:db81::/32 nexthop via fe80::e61d:2dff:fea5:f341 dev sw1p1
$ ip -6 route show 2001:db81::/32
2001:db81::/32 via fe80::e61d:2dff:fea5:f365 dev sw1p2 metric 1024 offload pref medium
Alternatively, another nexthop can be added to the route.
Linkdown Nexthops
By default, when the carrier of a netdevice goes down, the routing subsystem does not invalidate the nexthops using it and therefore continues to try and forward packets through them. Such nexthops are marked using the linkdown flag. For example:
$ ip route show 192.168.100.0/24
192.168.100.0/24
nexthop via 192.168.0.1 dev sw1p17 weight 1 offload linkdown
nexthop via 192.168.1.1 dev sw1p18 weight 1 offload
It is possible to make the kernel exclude such nexthops from its ECMP groups by setting the following sysctl:
$ sysctl -w net.ipv4.conf.sw1p17.ignore_routes_with_linkdown=1
With this sysctl set, when the carrier of sw1p17 goes down, the kernel starts forwarding packets via the nexthop using sw1p18 as its nexthop device. It also marks the nexthop as dead for various listeners in user space:
$ ip route show 192.168.100.0/24
192.168.100.0/24
nexthop via 192.168.0.1 dev sw1p17 weight 1 dead linkdown
nexthop via 192.168.1.1 dev sw1p18 weight 1 offload
To make this option the default for all the netdevices on the system upon boot, set net.ipv4.conf.default.ignore_routes_with_linkdown=1 in the sysctl configuration files.
Note: The mlxsw driver currently does not support this functionality in IPv6.
Multi-path Hash Policy
When forwarding packets, the device performs the multi-path hash in accordance with the kernel's policy.
The packet fields used for the multi-path hash are controlled by the net.ipv{4,6}.fib_multipath_hash_policy sysctls. By default, they are set to 0, which means only the source and destination IP addresses are used. In IPv6, the flow label is also used.
If the sysctl is set to 1, a 5-tuple is used: the source and destination IP addresses, the source and destination ports, and the IP protocol.
If the sysctl is set to 2, the kernel will compute the multi-path hash based on the packet's layer 3 fields or inner layer 3 fields, if available. The Spectrum ASIC cannot fall back to outer fields if inner fields are not available and will therefore use both outer and inner fields for the hash computation. This should not result in a discrepancy from the software data path because if several flows have matching inner fields, they will tend to have matching outer fields as well.
If the sysctl is set to 3, the fields used for the hash computation are determined by the net.ipv{4,6}.fib_multipath_hash_fields sysctls. These sysctls take a bitmask where each bit corresponds to a different field according to the following table:
Value | Field |
---|---|
0x0001 | Source IP address |
0x0002 | Destination IP address |
0x0004 | IP protocol |
0x0008 | Flow Label (unused in IPv4) |
0x0010 | Source port |
0x0020 | Destination port |
0x0040 | Inner source IP address |
0x0080 | Inner destination IP address |
0x0100 | Inner IP protocol |
0x0200 | Inner Flow Label |
0x0400 | Inner source port |
0x0800 | Inner destination port |
For example, to hash IPv6 traffic based on standard 5-tuple and flow label, run:
$ sysctl -wq net.ipv6.fib_multipath_hash_fields=0x3F
$ sysctl -wq net.ipv6.fib_multipath_hash_policy=3
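The bitmask in the example above is simply the bitwise OR of the values from the table. A small sketch of that arithmetic follows. The function name hash_fields_mask and the field keywords are invented for this illustration and are not understood by sysctl itself; only the numeric values come from the table.

```shell
# Print the fib_multipath_hash_fields bitmask for the named fields.
# Field keywords are hypothetical labels for the table rows above.
hash_fields_mask() {
    mask=0
    for field in "$@"; do
        case $field in
            src_ip)           bit=0x0001 ;;
            dst_ip)           bit=0x0002 ;;
            ip_proto)         bit=0x0004 ;;
            flow_label)       bit=0x0008 ;;
            src_port)         bit=0x0010 ;;
            dst_port)         bit=0x0020 ;;
            inner_src_ip)     bit=0x0040 ;;
            inner_dst_ip)     bit=0x0080 ;;
            inner_ip_proto)   bit=0x0100 ;;
            inner_flow_label) bit=0x0200 ;;
            inner_src_port)   bit=0x0400 ;;
            inner_dst_port)   bit=0x0800 ;;
            *) echo "unknown field: $field" >&2; return 1 ;;
        esac
        mask=$(( $mask | $bit ))
    done
    printf '0x%04x\n' "$mask"
}
```

Calling hash_fields_mask src_ip dst_ip ip_proto flow_label src_port dst_port prints 0x003f, the same value as the 0x3F used above.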
Note: When the sysctl is set to 0, the kernel performs the multi-path hash for ICMP error packets according to the inner IP addresses. Currently, this is not supported by the device.
Note: Layer 4 fields are not considered for fragmented packets.
Note: When inner fields are enabled, they are enabled in the ASIC for both IPv4 and IPv6 packets, whereas in the kernel they are enabled based on the outer protocol. It is therefore important to ensure that the hash policies are configured the same for IPv4 and IPv6 packets.
Note: The ASIC calculates the multi-path hash for a packet before ingress ACL. Therefore, editing packet headers on ingress with the pedit action does not affect the multi-path hash.
Multi-path Hash Seed
Besides the header fields indicated by the multipath hash policy, the resulting hash used for next-hop selection depends on an initial value called the seed.
By default, the SW datapath uses a random seed value, and for the HW datapath, mlxsw derives the seed from the device MAC address. In order to either avoid undesirable polarization in next-hop selection, or to cause two machines to forward traffic consistently, it may be necessary to adjust the seed by hand. Do that through the net.ipv4.fib_multipath_hash_seed sysctl:
$ sysctl -wq net.ipv4.fib_multipath_hash_seed=0x12345678
The configured seed value influences both IPv4 and IPv6 forwarding, in SW as well as HW datapaths. However, the in-ASIC hashing algorithm differs from the one used by the Linux kernel, so the same traffic will in general hit different next hops in the SW datapath than in the HW datapath, even for the same seed.
Setting the seed value to 0 reverts the seed used for multipath hashes to its default initial value, in SW as well as HW datapaths.
Nexthop Objects
Traditionally, when routes were programmed to the kernel they contained both the prefix information and the nexthop information. As seen here, this results in sub-optimal insertion rates and in some cases, increased memory consumption.
Since kernel 5.3, the kernel supports nexthops as separate objects. This allows user space to program nexthops to the kernel and assign them a numeric identifier. Later, user space can program routes and associate them with nexthop objects by specifying their numeric identifier.
In the following example, a nexthop group is added with two nexthops:
# ip link set dev swp3 up
# ip link set dev swp7 up
# ip address add 192.0.2.2/28 dev swp3
# ip address add 192.0.2.18/28 dev swp7
# ip nexthop replace id 1 via 192.0.2.1 dev swp3
# ip nexthop replace id 2 via 192.0.2.17 dev swp7
# ip nexthop replace id 10 group 1/2
# ip nexthop
id 1 via 192.0.2.1 dev swp3 scope link offload
id 2 via 192.0.2.17 dev swp7 scope link offload
id 10 group 1/2 offload
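When there are many nexthops, the commands above can be generated rather than typed. The following sketch only prints the ip nexthop commands and does not run them. The function name nexthop_group_cmds and the "gateway@device" argument format are conventions made up for this example.

```shell
# Print (do not run) the commands that create one nexthop per
# "gateway@device" argument, numbered from 1, plus a group with ID $1.
# Hypothetical helper for illustration only.
nexthop_group_cmds() {
    group_id=$1
    shift
    id=0
    members=
    for spec in "$@"; do
        id=$(( $id + 1 ))
        gw=${spec%@*}      # text before the "@"
        dev=${spec#*@}     # text after the "@"
        echo "ip nexthop replace id $id via $gw dev $dev"
        members=${members:+$members/}$id
    done
    echo "ip nexthop replace id $group_id group $members"
}
```

Running nexthop_group_cmds 10 192.0.2.1@swp3 192.0.2.17@swp7 prints the same three ip nexthop replace commands as in the example above. The output can be reviewed and then piped to sh as root.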
A route using the nexthop group is then added:
# ip route replace 198.51.100.0/24 nhid 10
# sysctl -qw net.ipv4.nexthop_compat_mode=0
# ip route show 198.51.100.0/24
198.51.100.0/24 nhid 10 rt_offload
# sysctl -qw net.ipv4.nexthop_compat_mode=1
# ip route show 198.51.100.0/24
198.51.100.0/24 nhid 10 rt_offload
nexthop via 192.0.2.1 dev swp3 weight 1
nexthop via 192.0.2.17 dev swp7 weight 1
In the above example, the output of the ip route command changes based on the value of the net.ipv4.nexthop_compat_mode sysctl option. It is described as follows in the kernel documentation:
nexthop_compat_mode - BOOLEAN
New nexthop API provides a means for managing nexthops independent of
prefixes. Backwards compatibility with old route format is enabled by
default which means route dumps and notifications contain the new
nexthop attribute but also the full, expanded nexthop definition.
Further, updates or deletes of a nexthop configuration generate route
notifications for each fib entry using the nexthop. Once a system
understands the new API, this sysctl can be disabled to achieve full
performance benefits of the new API by disabling the nexthop expansion
and extraneous notifications.
Default: true (backward compat mode)
Nexthop objects also support the following unique features which are not supported by the legacy API:
Blackhole Nexthops
By changing a nexthop to a blackhole nexthop, all the routes using the nexthop will start to silently drop packets. Example:
# ip nexthop replace id 3 blackhole
# ip nexthop show id 3
id 3 blackhole offload
The offload flag indicates that packets are dropped by the ASIC. To get visibility into packets dropped by such nexthops, enable the blackhole_nexthop trap:
# devlink trap set pci/0000:01:00.0 trap blackhole_nexthop action trap
Dropped packets can then be read using utilities such as DropWatch as explained here.
Native Support for RFC 5549
IPv4 routes can point to IPv6 nexthops. For example:
# ip nexthop replace id 4 via fe80::1 dev swp3
# ip nexthop replace id 5 via fe80::2 dev swp7
# ip nexthop replace id 20 group 4/5
# ip nexthop
id 1 via 192.0.2.1 dev swp3 scope link offload
id 2 via 192.0.2.17 dev swp7 scope link offload
id 3 blackhole offload
id 4 via fe80::1 dev swp3 scope link offload
id 5 via fe80::2 dev swp7 scope link offload
id 10 group 1/2 offload
id 20 group 4/5 offload
# ip route replace 198.51.100.0/24 nhid 20
# ip route show 198.51.100.0/24
198.51.100.0/24 nhid 20 rt_offload
nexthop via inet6 fe80::1 dev swp3 weight 1
nexthop via inet6 fe80::2 dev swp7 weight 1
IPv6 overlay multi-path
With the legacy API, it was not possible to perform multi-pathing between several device-only IPv6 nexthops. This is made possible with the nexthop API, which enables multi-pathing between several GRE tunnels. See this example from the kernel source tree for example usage.
Note: iproute2 version 5.11 is required in order to display nexthop hardware flags (e.g., offload, trap).
Note: By default, since version 7.3, FRR will try to install routes using nexthop objects.
Resilient Next-Hop Groups
Resilient groups are a type of next-hop group that is aimed at minimizing disruption in flow routing across changes to the group composition and weights of constituent next hops. Please see the kernel documentation for details on configuration and implementation.
mlxsw will offload resilient next-hop groups if their number of buckets is as indicated in the following table:
ASIC | Permitted sizes |
---|---|
Spectrum-1 | 1..64, 512; optimal: 32 and 512 |
Spectrum-2 and later | 1..128, 256, 512, 1024, 2048, 4096 |
Note: On Spectrum-1, the indicated sizes are all supported, but only 32 and 512 are stored optimally. Further power-of-two sizes, up to 4096, are supported by the router block in the Spectrum-1 ASIC. However, these will not be permitted due to the mlxsw KVDL allocation strategy.
For offloaded groups, mlxsw will report activity of individual buckets at regular intervals, and will bounce proposals for bucket migration if the bucket became active since the last full activity report.
An example of creating a resilient next-hop group:
# ip nexthop add id 1 via 192.0.2.2 dev swp1
# ip nexthop add id 2 via 192.0.2.3 dev swp1
# ip nexthop add id 10 group 1/2 type resilient \
buckets 64 idle_timer 60 unbalanced_timer 300
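Whether a given bucket count can be offloaded follows directly from the table of permitted sizes. The check can be sketched as below. The function name resilient_buckets_ok and the ASIC labels spectrum1 and spectrum2plus are made up for this example; the driver itself performs the authoritative check.

```shell
# Succeed if a resilient group with $2 buckets has a size permitted for
# ASIC family $1. Labels "spectrum1" and "spectrum2plus" are hypothetical.
resilient_buckets_ok() {
    asic=$1
    buckets=$2
    case $asic in
        spectrum1)
            if [ "$buckets" -ge 1 ] && [ "$buckets" -le 64 ]; then
                return 0
            fi
            if [ "$buckets" -eq 512 ]; then
                return 0
            fi
            return 1
            ;;
        spectrum2plus)
            if [ "$buckets" -ge 1 ] && [ "$buckets" -le 128 ]; then
                return 0
            fi
            case $buckets in
                256|512|1024|2048|4096) return 0 ;;
            esac
            return 1
            ;;
        *)
            echo "unknown ASIC family: $asic" >&2
            return 2
            ;;
    esac
}
```

With this sketch, the 64 buckets used in the example above pass for both families, while 128 buckets pass only for spectrum2plus.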
IPv6 Source-specific Routing
The kernel supports IPv6 source-specific routing, which allows packets to be forwarded according to the destination and source addresses. If a packet matches two routes matching the destination address without one being more specific than the other, then it is possible to use the route with the most specific source prefix to route the packet.
However, without resorting to ACLs, the ASIC performs routing solely based on the destination address. Therefore, offloading source-specific routes will fail, and the routes will be marked with the "rt_offload_failed" flag.
Disabling ICMP Redirects
Usually, when a packet is routed through the same interface from which it ingressed the router, an ICMP redirect packet is generated and sent to the transmitting host.
In some cases (e.g., one-armed router) it is desirable to disable ICMP redirects, as routing a packet through the same interface is not indicative of sub-optimal routing tables. Disabling ICMP redirects can be either done in hosts (preventing them from processing such packets) or in routers (preventing them from generating such packets).
In Hosts
Hosts can be prevented from processing ICMP redirects by clearing the following sysctls. To disable ICMP redirects processing on sw1p1:
$ sysctl -qw net.ipv4.conf.all.accept_redirects=0
$ sysctl -qw net.ipv4.conf.sw1p1.accept_redirects=0
$ sysctl -qw net.ipv6.conf.all.accept_redirects=0
$ sysctl -qw net.ipv6.conf.sw1p1.accept_redirects=0
In Routers
IPv4 routers can be prevented from generating ICMP redirects by clearing the following sysctl. To disable ICMP redirects generation on sw1p1:
$ sysctl -qw net.ipv4.conf.all.send_redirects=0
$ sysctl -qw net.ipv4.conf.sw1p1.send_redirects=0
There is no corresponding sysctl for IPv6, but the following ip6tables rule can be used:
$ ip6tables -A OUTPUT -p icmpv6 --icmpv6-type redirect -j DROP
Route Offload Notifications
When routes are programmed to hardware, they are marked with one of the following flags:
- RTM_F_OFFLOAD for routes that offload traffic from the kernel
- RTM_F_TRAP for routes that trap packets to the kernel for processing (e.g., host routes)
- RTM_F_OFFLOAD_FAILED for routes that failed to be installed to hardware
By default, user space does not receive RTM_NEWROUTE notifications when these flags change. This behaviour is controlled by the net.ipv4.fib_notify_on_flag_change and net.ipv6.fib_notify_on_flag_change sysctls. Possible values:
- 0 - Do not emit notifications
- 1 - Emit notifications
- 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change
The default value is 0.
To monitor the notifications, use:
$ ip monitor route
The notifications can be used, for example, by a routing daemon to delay advertisement of a route until it was successfully installed in hardware.
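A script that consumes these notifications needs to map each route line to its hardware state. The filter below is a rough sketch of that step, not how any particular routing daemon does it. It assumes the flags are printed by iproute2 as rt_offload, rt_trap and rt_offload_failed, and the function name route_hw_state and the state keywords it prints are invented for illustration.

```shell
# Read `ip route` or `ip monitor route` lines on stdin and print
# "<destination> <state>" for each route, where state is one of
# offload, trap, failed or none. Hypothetical helper.
route_hw_state() {
    awk '
        NF == 0 { next }
        $1 == "Deleted" { next }    # ignore route deletions
        $1 == "nexthop" { next }    # ignore multipath continuation lines
        {
            failed = 0; trap = 0; offload = 0
            for (i = 2; i <= NF; i++) {
                if ($i == "rt_offload_failed") failed = 1
                else if ($i == "rt_trap") trap = 1
                else if ($i == "rt_offload") offload = 1
            }
            state = "none"
            if (offload) state = "offload"
            if (trap) state = "trap"
            if (failed) state = "failed"
            print $1, state
        }
    '
}
```

For example, ip monitor route | route_hw_state prints one line per notification, and a wrapper could hold back advertisement of a prefix until its state becomes offload.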
Recommended Sysctl Configuration
This section lists relevant sysctls and their recommended values for a system used as a router, since the Linux defaults might not be adequate for this use case. To keep the list complete, it overlaps with other sections that discuss various sysctls in more detail.
# Enable IPv4 and IPv6 forwarding.
net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
net.ipv6.conf.default.forwarding=1
# Keep IPv6 addresses on an interface when it goes down. This is
# consistent with IPv4.
net.ipv6.conf.all.keep_addr_on_down=1
net.ipv6.conf.default.keep_addr_on_down=1
# Prevent the kernel from routing packets via an interface whose link is
# down. This is not strictly necessary when a routing daemon is used as
# it will most likely evict such routes. In addition, when offloaded,
# such routes will not be considered anyway since the associated neighbour
# entries will be flushed upon the carrier going down, preventing the
# device from determining the destination MAC it should use.
net.ipv4.conf.all.ignore_routes_with_linkdown=1
net.ipv6.conf.all.ignore_routes_with_linkdown=1
net.ipv4.conf.default.ignore_routes_with_linkdown=1
net.ipv6.conf.default.ignore_routes_with_linkdown=1
# Use a standard 5-tuple to compute the multi-path hash.
net.ipv4.fib_multipath_hash_policy=1
net.ipv6.fib_multipath_hash_policy=1
# Generate an unsolicited neighbour advertisement when an interface goes
# down or its hardware address changes.
net.ipv6.conf.all.ndisc_notify=1
net.ipv6.conf.default.ndisc_notify=1
# Do not perform source validation when routing IPv4 packets. This is
# consistent with the hardware data path behavior. No configuration
# is necessary for IPv6.
net.ipv4.conf.all.rp_filter=0
net.ipv4.conf.default.rp_filter=0
# Do not update the SKB priority from "TOS" field in IP header after
# the packet is forwarded. This applies to both IPv4 and IPv6 packets
# which are forwarded by the device.
net.ipv4.ip_forward_update_priority=0
# Prevent the kernel from generating a netlink event for each deleted
# IPv6 route when an interface goes down. This is consistent with IPv4.
net.ipv6.route.skip_notify_on_dev_down=1
# Use neighbour information when choosing a nexthop in a multi-path
# route. Will prevent the kernel from routing the packets via a
# failed nexthop. This is consistent with the hardware behavior.
net.ipv4.fib_multipath_use_neigh=1
# Increase the maximum number of cached IPv6 routes. No configuration is
# necessary for IPv4.
net.ipv6.route.max_size=16384
# In case the number of non-permanent neighbours in the system exceeds
# this value for over 5 seconds, the garbage collector will kick in.
# Default is 512, but if the system has a larger number of interfaces or
# is expected to communicate with a larger number of directly-connected
# neighbours, then it is recommended to increase this value.
net.ipv4.neigh.default.gc_thresh2=8192
net.ipv6.neigh.default.gc_thresh2=8192
# In case the number of non-permanent neighbours in the system exceeds
# this value, the garbage collector will kick in. Default is 1024, but
# if the system has a larger number of interfaces or is expected to
# communicate with a larger number of directly-connected neighbours,
# then it is recommended to increase this value.
net.ipv4.neigh.default.gc_thresh3=16384
net.ipv6.neigh.default.gc_thresh3=16384
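To see which of the recommended values differ from the running system, the list above can be compared against the current sysctl values. The following is a minimal sketch under a few assumptions: the recommendations are saved in a file of "key=value" lines, the current values are saved in "key = value" form as printed by sysctl -a, and the function name sysctl_mismatches is made up for this example.

```shell
# Compare recommended settings (file $1, "key=value" lines, "#" comments)
# with current values (file $2, "key = value" lines) and print every key
# whose current value differs from the recommendation. Hypothetical helper.
sysctl_mismatches() {
    awk '
        {
            line = $0
            sub(/#.*/, "", line)       # drop comments
            gsub(/[ \t]/, "", line)    # drop whitespace around the equals sign
            if (line == "") next
            eq = index(line, "=")
            if (eq == 0) next
            key = substr(line, 1, eq - 1)
            val = substr(line, eq + 1)
        }
        FILENAME == ARGV[1] { want[key] = val; next }
        (key in want) && want[key] != val {
            print key ": want " want[key] ", have " val
        }
    ' "$1" "$2"
}
```

A possible usage is to run sysctl -a > current.txt and then sysctl_mismatches recommended.conf current.txt, where both file names are placeholders. No output means every recommended key that exists on the system already has the recommended value.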
Further Resources
Please refer to Routing intro in order to get essential information about routing setup in Linux.
- man ip
- man sysctl.d
- man ip-nexthop
- "Improving Route Scalability: Nexthops as Separate Objects" by David Ahern