.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
.. _switchdev:

===============================================
Ethernet switch device driver model (switchdev)
===============================================

Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>

Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>


The Ethernet switch device driver model (switchdev) is an in-kernel driver
model for switch devices which offload the forwarding (data) plane from the
kernel.

Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip. Other setups
with SR-IOV or soft switches, such as OVS, are possible.

::


                             User-space tools

       user space                   |
      +-------------------------------------------------------------------+
       kernel                       | Netlink
                                    |
                     +--------------+-------------------------------+
                     |         Network stack                        |
                     |           (Linux)                            |
                     |                                              |
                     +----------------------------------------------+

                           sw1p2     sw1p4     sw1p6
                      sw1p1  +  sw1p3  +  sw1p5  +          eth1
                        +    |    +    |    +    |            +
                        |    |    |    |    |    |            |
                     +--+----+----+----+----+----+---+  +-----+-----+
                     |         Switch driver         |  |    mgmt   |
                     |        (this document)        |  |   driver  |
                     |                               |  |           |
                     +--------------+----------------+  +-----------+
                                    |
       kernel                       | HW bus (eg PCI)
      +-------------------------------------------------------------------+
       hardware                     |
                     +--------------+----------------+
                     |         Switch device (sw1)   |
                     |  +----+                       +--------+
                     |  |    v offloaded data path   | mgmt port
                     |  |    |                       |
                     +--|----|----+----+----+----+---+
                        |    |    |    |    |    |
                        +    +    +    +    +    +
                       p1   p2   p3   p4   p5   p6

                             front-panel ports


                                    Fig 1.


Include Files
-------------

::

    #include <linux/netdevice.h>
    #include <net/switchdev.h>


Configuration
-------------

Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev
model support is built for the driver.


Switch Ports
------------

On switchdev driver initialization, the driver will allocate and register a
struct net_device (using register_netdev()) for each enumerated physical switch
port, called the port netdev. A port netdev is the software representation of
the physical port and provides a conduit for control traffic to/from the
controller (the kernel) and the network, as well as an anchor point for higher
level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using
standard netdev tools (iproute2, ethtool, etc.), the port netdev can also
give the user access to the physical properties of the switch port such
as PHY link state and I/O statistics.

There is (currently) no higher-level kernel object for the switch beyond the
port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops.

A switch management port is outside the scope of the switchdev driver model.
Typically, the management port does not participate in the offloaded data
plane and is loaded with a different driver, such as a NIC driver, on the
management port device.

Switch ID
^^^^^^^^^

The switchdev driver must implement the net_device operation
ndo_get_port_parent_id for each port netdev, returning the same physical ID for
each port of a switch. The ID must be unique between switches on the same
system. The ID does not need to be unique between switches on different
systems.

The switch ID is used to locate ports on a switch and to know if aggregated
ports belong to the same switch.

Port Netdev Naming
^^^^^^^^^^^^^^^^^^

Udev rules should be used for port netdev naming, using some unique attribute
of the port as a key, for example the port MAC address or the port PHYS name.
Hard-coding of kernel netdev names within the driver is discouraged; let the
kernel pick the default netdev name, and let udev set the final name based on a
port attribute.

Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration.
For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
name for each port using port PHYS name. The udev rule would be::

    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
        ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"

Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
would be sub-port 0 on port 1 on switch 1.

Port Features
^^^^^^^^^^^^^

NETIF_F_NETNS_LOCAL

If the switchdev driver (and device) only supports offloading of the default
network namespace (netns), the driver should set this feature flag to prevent
the port netdev from being moved out of the default netns. A netns-aware
driver/device would not set this flag and would be responsible for
partitioning hardware to preserve netns containment. This means hardware
cannot forward traffic from a port in one namespace to another port in another
namespace.

Port Topology
^^^^^^^^^^^^^

The port netdevs representing the physical switch ports can be organized into
higher-level switching constructs. The default construct is a standalone
router port, used to offload L3 forwarding.
Two or more ports can be bonded
together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge
L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3
tunnels can be built on ports. These constructs are built using standard Linux
tools such as the bridge driver, the bonding/team drivers, and netlink-based
tools such as iproute2.

The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
bond will see its upper master change. If that bond is moved into a bridge,
the bond's upper master will change. And so on. The driver will track such
movements to know a port's position in the overall topology by
registering for netdevice events and acting on NETDEV_CHANGEUPPER.

L2 Forwarding Offload
---------------------

The idea is to offload the L2 data forwarding (switching) path from the kernel
to the switchdev device by mirroring bridge FDB entries down to the device. An
FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
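
As a rough illustration of the {port, MAC, VLAN} tuple, the sketch below is a
small user-space model of an FDB table, not kernel code. All names in it
(``struct fdb_entry``, ``fdb_add``, ``fdb_del``, ``fdb_lookup``, ``FDB_MAX``)
are hypothetical and chosen only for this example. It assumes the table is
zero-initialised before use and treats a failed lookup as "unknown
destination", which a real device would typically flood.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FDB_MAX 64	/* hypothetical table size */

/* Hypothetical model of an FDB entry: (MAC, VLAN) selects an egress port. */
struct fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	int port;
	int used;
};

struct fdb_table {
	struct fdb_entry e[FDB_MAX];	/* must start out zeroed */
};

/* Return the entry matching (mac, vid), or NULL if there is none. */
struct fdb_entry *fdb_find(struct fdb_table *t, const uint8_t *mac, uint16_t vid)
{
	for (int i = 0; i < FDB_MAX; i++) {
		struct fdb_entry *fe = &t->e[i];

		if (fe->used && fe->vid == vid && !memcmp(fe->mac, mac, 6))
			return fe;
	}
	return NULL;
}

/* Install or update an entry. Returns 0 on success, -1 if the table is full. */
int fdb_add(struct fdb_table *t, const uint8_t *mac, uint16_t vid, int port)
{
	struct fdb_entry *fe = fdb_find(t, mac, vid);

	/* No existing entry: take the first free slot, if any. */
	for (int i = 0; !fe && i < FDB_MAX; i++)
		if (!t->e[i].used)
			fe = &t->e[i];
	if (!fe)
		return -1;

	memcpy(fe->mac, mac, 6);
	fe->vid = vid;
	fe->port = port;
	fe->used = 1;
	return 0;
}

/* Remove an entry. Returns 0 on success, -1 if it was not present. */
int fdb_del(struct fdb_table *t, const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *fe = fdb_find(t, mac, vid);

	if (!fe)
		return -1;
	fe->used = 0;
	return 0;
}

/* Return the egress port for (mac, vid), or -1 for an unknown destination. */
int fdb_lookup(struct fdb_table *t, const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *fe = fdb_find(t, mac, vid);

	return fe ? fe->port : -1;
}
```

Keying on (MAC, VLAN) rather than MAC alone means the same address can point
at different ports in different VLANs, which is why the VLAN is part of the
tuple.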

To offload L2 bridging, the switchdev driver/device should support:

    - Static FDB entries installed on a bridge port
    - Notification of learned/forgotten source MAC/VLANs from the device
    - STP state changes on the port
    - VLAN flooding of multicast/broadcast and unknown unicast packets

Static FDB Entries
^^^^^^^^^^^^^^^^^^

A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
``ndo_fdb_dump`` operations is able to support the command below, which adds a
static bridge FDB entry::

    bridge fdb add dev DEV ADDRESS [vlan VID] [self] static

(the "static" keyword is non-optional: if not specified, the entry defaults to
being "local", which means that it should not be forwarded)

The "self" keyword (optional because it is implicit) has the role of
instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
will bypass the bridge and therefore leave the software database out of sync
with the hardware one.

To avoid this, the "master" keyword can be used::

    bridge fdb add dev DEV ADDRESS [vlan VID] master static

The above command instructs the kernel to search for a master interface of
``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that.
This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification
which the port driver can handle and use to program its hardware table. This
way, the software and the hardware database will both contain this static FDB
entry.

Note: for new switchdev drivers that offload the Linux bridge, implementing the
``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
discouraged: all static FDB entries should be added on a bridge port using the
"master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to
visualize the hardware tables, if the device does not have an interrupt for
notifying the operating system of newly learned/forgotten dynamic FDB
addresses. In that case, the hardware FDB might end up having entries that the
software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see
them.

Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic.
To enable VLAN support, turn on VLAN filtering::

    echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering

Notification of Learned/Forgotten Source MAC/VLANs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the MAC/VLAN/port tuples. The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call::

    err = call_switchdev_notifiers(val, dev, info, extack);

Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED.
The iproute2 bridge
command will label these entries "offload"::

    $ bridge fdb
    52:54:00:12:35:01 dev sw1p1 master br0 permanent
    00:02:00:00:02:00 dev sw1p1 master br0 offload
    00:02:00:00:02:00 dev sw1p1 self
    52:54:00:12:35:02 dev sw1p2 master br0 permanent
    00:02:00:00:03:00 dev sw1p2 master br0 offload
    00:02:00:00:03:00 dev sw1p2 self
    33:33:00:00:00:01 dev eth0 self permanent
    01:00:5e:00:00:01 dev eth0 self permanent
    33:33:ff:00:00:00 dev eth0 self permanent
    01:80:c2:00:00:0e dev eth0 self permanent
    33:33:00:00:00:01 dev br0 self permanent
    01:00:5e:00:00:01 dev br0 self permanent
    33:33:ff:12:35:01 dev br0 self permanent

Learning on the port should be disabled on the bridge using the bridge command::

    bridge link set dev DEV learning off

Learning on the device port should be enabled, as well as learning_sync::

    bridge link set dev DEV learning on self
    bridge link set dev DEV learning_sync on self

The learning_sync attribute enables syncing of the learned/forgotten FDB entry
to the bridge's FDB. It's possible, but not optimal, to enable learning on the
device port and on the bridge port, and disable learning_sync.

To support learning, the driver implements switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS.

FDB Ageing
^^^^^^^^^^

The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
the responsibility of the port driver/device to age out these entries. If the
port device supports ageing, when the FDB entry expires, it will notify the
driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the
device does not support ageing, the driver can simulate ageing using a
garbage collection timer to monitor FDB entries. Expired entries will be
notified to the bridge using SWITCHDEV_FDB_DEL. See the rocker driver for
an example of a driver running an ageing timer.

To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The
notification will reset the FDB entry's last-used time to now. The driver
should rate limit refresh notifications, for example, no more than once a
second. (The last-used time is visible using the bridge -s fdb option.)

STP State Change on Port
^^^^^^^^^^^^^^^^^^^^^^^^

Internally or with a third-party STP protocol implementation (e.g.
mstpd), the
bridge driver maintains the STP state for ports, and will notify the switch
driver of STP state change on a port using the switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE.

State is one of BR_STATE_*. The switch driver can use STP state updates to
update the ingress packet filter list for the port. For example, if the port
is DISABLED, no packets should pass, but if the port moves to BLOCKED, then
STP BPDUs and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can
pass.

Note that STP BPDUs are untagged and STP state applies to all VLANs on the port
so packet filters should be applied consistently across untagged and tagged
VLANs on the port.

Flooding L2 domain
^^^^^^^^^^^^^^^^^^

For a given L2 VLAN domain, the switch device should flood multicast/broadcast
and unknown unicast packets to all ports in the domain, if allowed by each
port's current STP state. The switch driver, knowing which ports are within
which VLAN L2 domain, can program the switch device for flooding. The packet
may be sent to the port netdev for processing by the bridge driver. The
bridge should not reflood the packet to the same ports the device flooded,
otherwise there will be duplicate packets on the wire.
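
The flooding rule above can be sketched as a small user-space model. This is
only an illustration under simplifying assumptions: each port belongs to a
single VLAN, at most 32 ports exist, and only ports in the forwarding state
send or receive data frames. The names ``enum stp_state``, ``struct port`` and
``flood_mask`` are hypothetical; the enum merely stands in for the kernel's
BR_STATE_* values.

```c
#include <stdint.h>

/* Local stand-in for the BR_STATE_* values; names are hypothetical. */
enum stp_state {
	STP_DISABLED,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
	STP_BLOCKING,
};

/* Simplified port: one STP state and membership in exactly one VLAN. */
struct port {
	enum stp_state state;
	uint16_t vid;
};

/*
 * Return a bitmask of ports (bit i = port i) that should receive a flooded
 * frame which entered on port 'ingress' in VLAN 'vid'. Assumes nports <= 32.
 */
uint32_t flood_mask(const struct port *ports, int nports, int ingress,
		    uint16_t vid)
{
	uint32_t mask = 0;

	/* A data frame is only accepted if the ingress port is forwarding. */
	if (ports[ingress].state != STP_FORWARDING)
		return 0;

	for (int i = 0; i < nports; i++) {
		if (i == ingress)
			continue;	/* never send back out the ingress port */
		if (ports[i].vid != vid)
			continue;	/* outside the L2 VLAN domain */
		if (ports[i].state != STP_FORWARDING)
			continue;	/* not allowed by STP state */
		mask |= 1u << i;
	}
	return mask;
}
```

A driver would recompute and reprogram such a mask whenever VLAN membership or
the STP state of a port changes.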

To avoid duplicate packets, the switch driver should mark a packet as already
forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
the skb using the ingress bridge port's mark and prevent it from being forwarded
through any bridge port with the same mark.

It is possible for the switch device to not handle flooding and push the
packets up to the bridge driver for flooding. This is not ideal as the number
of ports in the L2 domain scales, because the device is much more efficient at
flooding packets than software.

If supported by the device, flood control can be offloaded to it, preventing
certain netdevs from flooding unicast traffic for which there is no FDB entry.

IGMP Snooping
^^^^^^^^^^^^^

In order to support IGMP snooping, the port netdevs should trap to the bridge
driver all IGMP join and leave messages.
The bridge multicast module will notify port netdevs on every multicast group
change, whether it is statically configured or dynamically joined/left.
The hardware implementation should forward all registered multicast
traffic groups only to the configured ports.
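
The effect of those group notifications on the hardware can be pictured with a
small user-space model of a multicast database. This is a sketch only: the
names ``struct mdb``, ``mdb_join``, ``mdb_leave`` and ``mdb_egress`` are
hypothetical, groups are IPv4 addresses in host byte order, and at most 32
ports are assumed. How traffic for unregistered groups is handled is left out,
since the text above does not specify it.

```c
#include <stddef.h>
#include <stdint.h>

#define MDB_MAX 32	/* hypothetical table size */

struct mdb_entry {
	uint32_t group;	/* IPv4 group address, host byte order */
	uint16_t vid;
	uint32_t ports;	/* bit i set: port i is a member */
};

struct mdb {
	struct mdb_entry e[MDB_MAX];
	int n;		/* number of entries in use */
};

static struct mdb_entry *mdb_find(struct mdb *m, uint32_t group, uint16_t vid)
{
	for (int i = 0; i < m->n; i++)
		if (m->e[i].group == group && m->e[i].vid == vid)
			return &m->e[i];
	return NULL;
}

/* Record that 'port' joined (group, vid). Returns -1 if the table is full. */
int mdb_join(struct mdb *m, uint32_t group, uint16_t vid, int port)
{
	struct mdb_entry *me = mdb_find(m, group, vid);

	if (!me) {
		if (m->n == MDB_MAX)
			return -1;
		me = &m->e[m->n++];
		me->group = group;
		me->vid = vid;
		me->ports = 0;
	}
	me->ports |= 1u << port;
	return 0;
}

/* Record that 'port' left (group, vid); drop the group once it is empty. */
void mdb_leave(struct mdb *m, uint32_t group, uint16_t vid, int port)
{
	struct mdb_entry *me = mdb_find(m, group, vid);

	if (!me)
		return;
	me->ports &= ~(1u << port);
	if (!me->ports)
		*me = m->e[--m->n];	/* overwrite with the last entry */
}

/* Ports that should receive traffic for a registered group, minus ingress. */
uint32_t mdb_egress(struct mdb *m, uint32_t group, uint16_t vid, int ingress)
{
	struct mdb_entry *me = mdb_find(m, group, vid);

	return me ? me->ports & ~(1u << ingress) : 0;
}
```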

L3 Routing Offload
------------------

Offloading L3 routing requires that the device be programmed with FIB entries
from the kernel, with the device doing the FIB lookup and forwarding. The
device does a longest prefix match (LPM) on FIB entries matching route prefix
and forwards the packet to the matching FIB entry's nexthop(s) egress ports.

To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:

===================  ===================================================
FIB_EVENT_ENTRY_ADD  used for both adding a new FIB entry to the device,
                     or modifying an existing entry on the device.
FIB_EVENT_ENTRY_DEL  used for removing a FIB entry
FIB_EVENT_RULE_ADD,
FIB_EVENT_RULE_DEL   used to propagate FIB rule changes
===================  ===================================================

FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::

    struct fib_entry_notifier_info {
            struct fib_notifier_info info; /* must be first */
            u32 dst;
            int dst_len;
            struct fib_info *fi;
            u8 tos;
            u8 type;
            u32 tb_id;
            u32 nlflags;
    };

to add/modify/delete IPv4 dst/dst_len prefix on table tb_id. The ``*fi``
structure holds details on the route and route's nexthops. ``*dev`` is one
of the port netdevs mentioned in the route's next hop list.
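
The longest prefix match that the device performs on the programmed entries
can be illustrated in user space as follows. This is a sketch, not the
kernel's or any device's implementation: ``struct fib_route`` and ``fib_lpm``
are hypothetical names, addresses are IPv4 in host byte order, and a linear
scan stands in for whatever lookup structure real hardware uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified route: prefix dst/dst_len and one egress port. */
struct fib_route {
	uint32_t dst;		/* IPv4 prefix, host byte order */
	int dst_len;		/* prefix length, 0..32 */
	int nh_port;		/* egress port of the nexthop */
};

/* Network mask for a prefix length; a length of 0 matches everything. */
static uint32_t prefix_mask(int len)
{
	return len == 0 ? 0 : (uint32_t)0xFFFFFFFFu << (32 - len);
}

/*
 * Return the route with the longest prefix that contains 'addr',
 * or NULL if no route matches.
 */
const struct fib_route *fib_lpm(const struct fib_route *tbl, size_t n,
				uint32_t addr)
{
	const struct fib_route *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].dst_len);

		if ((addr & mask) != (tbl[i].dst & mask))
			continue;
		if (!best || tbl[i].dst_len > best->dst_len)
			best = &tbl[i];
	}
	return best;
}
```

With a default route, two /30 routes and a /32 host route installed, an
address inside one of the /30 prefixes selects that route rather than the
default, because its prefix is longer.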

Routes offloaded to the device are labeled with "offload" in the ip route
listing::

    $ ip route show
    default via 192.168.0.2 dev eth0
    11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload
    11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
    11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload
    11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
    12.0.0.2 proto zebra metric 30 offload
        nexthop via 11.0.0.1 dev sw1p1 weight 1
        nexthop via 11.0.0.9 dev sw1p2 weight 1
    12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
    12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
    192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15

The "offload" flag is set in case at least one device offloads the FIB entry.

XXX: add/mod/del IPv6 FIB API

Nexthop Resolution
^^^^^^^^^^^^^^^^^^

The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
the switch device to forward the packet with the correct destination MAC
address, the nexthop gateways must be resolved to the neighbor's MAC address.
Neighbor MAC address discovery comes via the ARP (or ND) process and is
available via the arp_tbl neighbor table.
To resolve the route's nexthop gateways, the driver
should trigger the kernel's neighbor resolution process. See the rocker
driver's rocker_port_ipv4_resolve() for an example.

The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.

Device driver expected behavior
-------------------------------

Below is a set of defined behavior that switchdev enabled network devices must
adhere to.

Configuration-less state
^^^^^^^^^^^^^^^^^^^^^^^^

Upon driver bring up, the network devices must be fully operational, and the
backing driver must configure the network device such that it is possible to
send and receive traffic to this network device and it is properly separated
from other network devices/ports (e.g., as is frequent with a switch ASIC). How
this is achieved is heavily hardware dependent, but a simple solution can be to
use per-port VLAN identifiers unless a better mechanism is available
(proprietary metadata for each network port for instance).

The network device must be capable of running a full IP protocol stack
including multicast, DHCP, IPv4/6, etc. If necessary, it should program the
appropriate filters for VLAN, multicast, unicast etc. The underlying device
driver must effectively be configured in a similar fashion to what it would do
when IGMP snooping is enabled for IP multicast over these switchdev network
devices and unsolicited multicast must be filtered as early as possible in
the hardware.

When configuring VLANs on top of the network device, all VLANs must be working,
irrespective of the state of other network devices (e.g., other ports being part
of a VLAN-aware bridge doing ingress VID checking). See below for details.

If the device implements, e.g., VLAN filtering, putting the interface in
promiscuous mode should allow the reception of all VLAN tags (including those
not present in the filter(s)).

Bridged switch ports
^^^^^^^^^^^^^^^^^^^^

When a switchdev enabled network device is added as a bridge member, it should
not disrupt any functionality of non-bridged network devices and they
should continue to behave as normal network devices. Depending on the bridge
configuration knobs below, the expected behavior is documented.

Bridge VLAN filtering
^^^^^^^^^^^^^^^^^^^^^

The Linux bridge allows the configuration of a VLAN filtering mode (statically,
at device creation time, and dynamically, during run time) which must be
observed by the underlying switchdev network device/hardware:

- with VLAN filtering turned off: the bridge is strictly VLAN unaware and its
  data path will process all Ethernet frames as if they are VLAN-untagged.
  The bridge VLAN database can still be modified, but the modifications should
  have no effect while VLAN filtering is turned off. Frames ingressing the
  device with a VID that is not programmed into the bridge/switch's VLAN table
  must be forwarded and may be processed using a VLAN device (see below).

- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
  the device with a VID that is not programmed into the bridge/switch's VLAN
  table must be dropped (strict VID checking).

When there is a VLAN device (e.g., sw0p1.100) configured on top of a switchdev
network device which is a bridge port member, the behavior of the software
network stack must be preserved, or the configuration must be refused if that
is not possible.

- with VLAN filtering turned off, the bridge will process all ingress traffic
  for the port, except for the traffic tagged with a VLAN ID destined for a
  VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even
  be added to a second bridge, which includes other switch ports or software
  interfaces. Some approaches to ensure that the forwarding domain for traffic
  belonging to the VLAN upper interfaces is managed properly:

    * If forwarding destinations can be managed per VLAN, the hardware could be
      configured to map all traffic, except the packets tagged with a VID
      belonging to a VLAN upper interface, to an internal VID corresponding to
      untagged packets. This internal VID spans all ports of the VLAN-unaware
      bridge. The VID corresponding to the VLAN upper interface spans the
      physical port of that VLAN interface, as well as the other ports that
      might be bridged with it.
    * Treat bridge ports with VLAN upper interfaces as standalone, and let
      forwarding be handled in the software data path.

- with VLAN filtering turned on, these VLAN devices can be created as long as
  the bridge does not have an existing VLAN entry with the same VID on any
  bridge port. These VLAN devices cannot be enslaved into the bridge since they
  duplicate functionality/use case with the bridge's VLAN data path processing.
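
The rules above for VLAN-tagged ingress traffic and for VLAN upper interfaces
can be summarized as a small decision model. The Python sketch below is
illustrative only: the function names are hypothetical and do not correspond
to any kernel or driver API, the VLAN tables are modeled as plain sets of
VIDs, and untagged traffic and PVID handling are deliberately left out.

```python
def classify_vlan_unaware(vid, upper_vids):
    """VLAN filtering off: the bridge processes every frame on the port,
    except frames whose VID is claimed by a VLAN upper interface."""
    return "vlan-upper" if vid in upper_vids else "bridge"


def classify_vlan_aware(vid, bridge_vids):
    """VLAN filtering on: strict VID checking against the bridge VLAN table."""
    return "bridge" if vid in bridge_vids else "drop"


def may_create_vlan_upper(vid, vlan_filtering, vids_on_bridge_ports):
    """On a VLAN-aware bridge, a VLAN upper may only be created if no bridge
    port already has a VLAN entry with the same VID."""
    if not vlan_filtering:
        return True
    return vid not in vids_on_bridge_ports
```

For example, with a VLAN-aware bridge whose ports carry VIDs 10 and 20,
``may_create_vlan_upper(100, True, {10, 20})`` is true, while the same call
with VID 10 is false, in which case a driver would refuse the configuration.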

Non-bridged network ports of the same switch fabric must not be disturbed in any
way by the enabling of VLAN filtering on the bridge device(s). If the VLAN
filtering setting is global to the entire chip, then the standalone ports
should indicate to the network stack that VLAN filtering is required by setting
'rx-vlan-filter: on [fixed]' in the ethtool features.

Because VLAN filtering can be turned on/off at runtime, the switchdev driver
must be able to reconfigure the underlying hardware on the fly to honor the
toggling of that option and behave appropriately. If that is not possible, the
switchdev driver can also refuse to support dynamic toggling of the VLAN
filtering knob at runtime and require a destruction of the bridge device(s) and
creation of new bridge device(s) with a different VLAN filtering value to
ensure VLAN awareness is pushed down to the hardware.

Even when VLAN filtering in the bridge is turned off, the underlying switch
hardware and driver may still configure itself in a VLAN-aware mode provided
that the behavior described above is observed.

The VLAN protocol of the bridge plays a role in deciding whether a packet is
treated as tagged or not: a bridge using the 802.1ad protocol must treat both
VLAN-untagged packets, as well as packets tagged with 802.1Q headers, as
untagged.
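
The effect of the bridge VLAN protocol on classification can be sketched as
follows. This is a simplified model, not driver code: the helper name
``is_tagged_for_bridge`` is hypothetical, and only the outermost tag of the
frame is considered. The text above only states the 802.1ad case; treating
802.1ad-tagged frames as untagged on an 802.1Q bridge is an assumption made
here for symmetry.

```python
ETH_P_8021Q = 0x8100   # TPID of an 802.1Q (C-VLAN) tag
ETH_P_8021AD = 0x88A8  # TPID of an 802.1ad (S-VLAN) tag


def is_tagged_for_bridge(frame_tpid, bridge_vlan_proto):
    """Return True if the bridge should treat the frame as VLAN-tagged.

    frame_tpid is the TPID of the outermost VLAN tag, or None when the
    frame carries no VLAN tag at all.
    """
    return frame_tpid is not None and frame_tpid == bridge_vlan_proto
```

With this model, an 802.1ad bridge sees both ``is_tagged_for_bridge(None,
ETH_P_8021AD)`` and ``is_tagged_for_bridge(ETH_P_8021Q, ETH_P_8021AD)`` as
false, so both kinds of frames follow the untagged path.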

The 802.1p (VID 0) tagged packets must be treated in the same way by the device
as untagged packets, since the bridge device does not allow the manipulation of
VID 0 in its database.

When the bridge has VLAN filtering enabled and a PVID is not configured on the
ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge
has VLAN filtering enabled and a PVID exists on the ingress port, untagged and
priority-tagged packets must be accepted and forwarded according to the
bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering
disabled, the presence/lack of a PVID should not influence the packet
forwarding decision.

Bridge IGMP snooping
^^^^^^^^^^^^^^^^^^^^

The Linux bridge allows the configuration of IGMP snooping (statically, at
interface creation time, or dynamically, during runtime) which must be observed
by the underlying switchdev network device/hardware in the following way:

- when IGMP snooping is turned off, multicast traffic must be flooded to all
  ports within the same bridge that have mcast_flood=true. The CPU/management
  port should ideally not be flooded (unless the ingress interface has
  IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through
  the network stack notifications. If the hardware is not capable of doing that
  then the CPU/management port must also be flooded and multicast filtering
  happens in software.

- when IGMP snooping is turned on, multicast traffic must selectively flow
  to the appropriate network ports (including CPU/management port). Flooding of
  unknown multicast should be only towards the ports connected to a multicast
  router (the local device may also act as a multicast router).

The switch must adhere to RFC 4541 and flood multicast traffic accordingly
since that is what the Linux bridge implementation does.

Because IGMP snooping can be turned on/off at runtime, the switchdev driver
must be able to reconfigure the underlying hardware on the fly to honor the
toggling of that option and behave appropriately.

A switchdev driver can also refuse to support dynamic toggling of the multicast
snooping knob at runtime and require the destruction of the bridge device(s)
and creation of new bridge device(s) with a different multicast snooping
value.
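
The flooding rules for the two IGMP snooping modes can be illustrated with a
minimal sketch. The function ``multicast_egress_ports`` and its parameters are
hypothetical and exist only for this example. The model ignores the
CPU/management port and link-local groups, and it assumes, in line with
RFC 4541, that traffic for a known group is also sent to multicast router
ports.

```python
def multicast_egress_ports(snooping, ingress, mcast_flood,
                           members=None, router_ports=frozenset()):
    """Return the set of bridge ports a multicast frame is sent to.

    mcast_flood  -- dict mapping port name to its mcast_flood setting
    members      -- ports that joined the group, or None if the group
                    is unknown to the snooping logic
    router_ports -- ports connected to a multicast router
    """
    if not snooping:
        # Snooping off: flood to every port that has mcast_flood enabled.
        egress = {port for port, flood in mcast_flood.items() if flood}
    elif members is None:
        # Snooping on, unknown group: only towards multicast routers.
        egress = set(router_ports)
    else:
        # Snooping on, known group: group members plus multicast routers.
        egress = set(members) | set(router_ports)
    egress.discard(ingress)  # never send a frame back out its ingress port
    return egress
```

For a bridge with ports ``sw1p1`` to ``sw1p4`` where only ``sw1p3`` has
mcast_flood disabled, a frame received on ``sw1p1`` with snooping turned off
is flooded to ``sw1p2`` and ``sw1p4``. With snooping turned on and ``sw1p4``
as the only router port, an unknown group received on ``sw1p1`` reaches only
``sw1p4``.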