.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
.. _switchdev:

===============================================
Ethernet switch device driver model (switchdev)
===============================================

Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>

Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>


The Ethernet switch device driver model (switchdev) is an in-kernel driver
model for switch devices which offload the forwarding (data) plane from the
kernel.

Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip.  Other setups
with SR-IOV or soft switches, such as OVS, are possible.

::


                             User-space tools

       user space                   |
      +-------------------------------------------------------------------+
       kernel                       | Netlink
                                    |
                     +--------------+-------------------------------+
                     |         Network stack                        |
                     |           (Linux)                            |
                     |                                              |
                     +----------------------------------------------+

                           sw1p2     sw1p4     sw1p6
                      sw1p1  +  sw1p3  +  sw1p5  +          eth1
                        +    |    +    |    +    |            +
                        |    |    |    |    |    |            |
                     +--+----+----+----+----+----+---+  +-----+-----+
                     |         Switch driver         |  |    mgmt   |
                     |        (this document)        |  |   driver  |
                     |                               |  |           |
                     +--------------+----------------+  +-----------+
                                    |
       kernel                       | HW bus (eg PCI)
      +-------------------------------------------------------------------+
       hardware                     |
                     +--------------+----------------+
                     |         Switch device (sw1)   |
                     |  +----+                       +--------+
                     |  |    v offloaded data path   | mgmt port
                     |  |    |                       |
                     +--|----|----+----+----+----+---+
                        |    |    |    |    |    |
                        +    +    +    +    +    +
                       p1   p2   p3   p4   p5   p6

                             front-panel ports


                                    Fig 1.


Include Files
-------------

::

    #include <linux/netdevice.h>
    #include <net/switchdev.h>


Configuration
-------------

Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure that switchdev
model support is built for the driver.


Switch Ports
------------

On switchdev driver initialization, the driver will allocate and register a
struct net_device (using register_netdev()) for each enumerated physical switch
port, called the port netdev.  A port netdev is the software representation of
the physical port and provides a conduit for control traffic to/from the
controller (the kernel) and the network, as well as an anchor point for
higher-level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers.
Using standard netdev tools (iproute2, ethtool, etc.), the port netdev can also
give the user access to the physical properties of the switch port such as PHY
link state and I/O statistics.

There is (currently) no higher-level kernel object for the switch beyond the
port netdevs.  All of the switchdev driver ops are netdev ops or switchdev ops.

A switch management port is outside the scope of the switchdev driver model.
Typically, the management port does not participate in the offloaded data plane
and is driven by a different driver, such as a NIC driver, bound to the
management port device.

Switch ID
^^^^^^^^^

The switchdev driver must implement the net_device operation
ndo_get_port_parent_id for each port netdev, returning the same physical ID for
each port of a switch. The ID must be unique between switches on the same
system. The ID does not need to be unique between switches on different
systems.

The switch ID is used to locate ports on a switch and to know if aggregated
ports belong to the same switch.
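
As a rough illustration of how a shared switch ID lets software decide whether
two ports sit on the same switch, the userspace sketch below models the
comparison.  The names ``phys_item_id`` and ``same_switch`` and the 32-byte
limit are assumptions made for this example; they are not the kernel's
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ID_MAX_LEN 32	/* assumed upper bound for this sketch */

/* Hypothetical model of an opaque switch ID: raw bytes plus a length. */
struct phys_item_id {
	unsigned char id[ID_MAX_LEN];
	unsigned char id_len;
};

/* Two ports belong to the same switch when their IDs match byte for byte. */
static bool same_switch(const struct phys_item_id *a,
			const struct phys_item_id *b)
{
	assert(a->id_len <= ID_MAX_LEN && b->id_len <= ID_MAX_LEN);
	return a->id_len == b->id_len &&
	       memcmp(a->id, b->id, a->id_len) == 0;
}
```

Code that aggregates ports, for example when forming a LAG, could use such a
check to tell whether all members can be offloaded to a single device.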

Port Netdev Naming
^^^^^^^^^^^^^^^^^^

Udev rules should be used for port netdev naming, using some unique attribute
of the port as a key, for example the port MAC address or the port PHYS name.
Hard-coding of kernel netdev names within the driver is discouraged; let the
kernel pick the default netdev name, and let udev set the final name based on a
port attribute.

Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration.  For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
name for each port using port PHYS name.  The udev rule would be::

    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
            ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"

Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID.  For example, sw1p1s0
would be sub-port 0 on port 1 on switch 1.
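
The "swXpYsZ" convention can be made concrete with a small userspace sketch.
The helper ``format_port_name`` is hypothetical, and treating a negative
sub-port as "port is not split" is an assumption of this example.  On a real
system the final name is composed by udev, not by code like this.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a port name following the suggested "swXpYsZ" convention.
 * A negative subport omits the "sZ" suffix (assumption of this sketch).
 * Returns the snprintf() result, i.e. the length the full name needs.
 */
static int format_port_name(char *buf, size_t len, unsigned int sw,
			    unsigned int port, int subport)
{
	assert(buf && len > 0);
	if (subport < 0)
		return snprintf(buf, len, "sw%up%u", sw, port);
	return snprintf(buf, len, "sw%up%us%d", sw, port, subport);
}
```

With switch 1, port 1 and sub-port 0 this yields the name used in the example
above.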

Port Features
^^^^^^^^^^^^^

NETIF_F_NETNS_LOCAL

If the switchdev driver (and device) only supports offloading of the default
network namespace (netns), the driver should set this feature flag to prevent
the port netdev from being moved out of the default netns.  A netns-aware
driver/device would not set this flag and would be responsible for partitioning
hardware to preserve netns containment.  This means hardware cannot forward
traffic from a port in one namespace to another port in another namespace.

Port Topology
^^^^^^^^^^^^^

The port netdevs representing the physical switch ports can be organized into
higher-level switching constructs.  The default construct is a standalone
router port, used to offload L3 forwarding.  Two or more ports can be bonded
together to form a LAG.  Two or more ports (or LAGs) can be bridged to bridge
L2 networks.  VLANs can be applied to sub-divide L2 networks.  L2-over-L3
tunnels can be built on ports.  These constructs are built using standard Linux
tools such as the bridge driver, the bonding/team drivers, and netlink-based
tools such as iproute2.

The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications.  For example, a port moved into a
bond will see its upper master change.  If that bond is moved into a bridge,
the bond's upper master will change.  And so on.  The driver tracks such
movements, and so knows each port's position in the overall topology, by
registering for netdevice events and acting on NETDEV_CHANGEUPPER.

L2 Forwarding Offload
---------------------

The idea is to offload the L2 data forwarding (switching) path from the kernel
to the switchdev device by mirroring bridge FDB entries down to the device.  An
FDB entry is a {port, MAC, VLAN} tuple that describes a forwarding destination.
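
To make the {port, MAC, VLAN} tuple concrete, here is a minimal userspace
model of an FDB: entries are keyed by (MAC, VLAN) and resolve to an egress
port.  The fixed-size table and every name in it (``fdb_entry``, ``fdb_add``,
``fdb_lookup``) are assumptions for illustration only and do not mirror the
bridge's or any driver's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FDB_SIZE 8	/* tiny fixed table, enough for a sketch */

struct fdb_entry {
	bool     used;
	uint8_t  mac[6];
	uint16_t vid;
	int      port;
};

static struct fdb_entry *fdb_find(struct fdb_entry *tbl,
				  const uint8_t mac[6], uint16_t vid)
{
	for (int i = 0; i < FDB_SIZE; i++)
		if (tbl[i].used && tbl[i].vid == vid &&
		    !memcmp(tbl[i].mac, mac, 6))
			return &tbl[i];
	return NULL;
}

/* Install an entry, or move an existing one; false when the table is full. */
static bool fdb_add(struct fdb_entry *tbl, const uint8_t mac[6],
		    uint16_t vid, int port)
{
	struct fdb_entry *e = fdb_find(tbl, mac, vid);

	assert(port >= 0);
	for (int i = 0; !e && i < FDB_SIZE; i++)
		if (!tbl[i].used)
			e = &tbl[i];
	if (!e)
		return false;
	e->used = true;
	memcpy(e->mac, mac, 6);
	e->vid = vid;
	e->port = port;
	return true;
}

/* Egress port for a destination, or -1 when the destination is unknown. */
static int fdb_lookup(struct fdb_entry *tbl, const uint8_t mac[6],
		      uint16_t vid)
{
	struct fdb_entry *e = fdb_find(tbl, mac, vid);

	return e ? e->port : -1;
}
```

A lookup that returns -1 corresponds to unknown unicast, which is flooded
within the VLAN rather than forwarded to a single port.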

To offload L2 bridging, the switchdev driver/device should support:

        - Static FDB entries installed on a bridge port
        - Notification of learned/forgotten src mac/vlans from device
        - STP state changes on the port
        - VLAN flooding of multicast/broadcast and unknown unicast packets

Static FDB Entries
^^^^^^^^^^^^^^^^^^

A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
``ndo_fdb_dump`` operations is able to support the command below, which adds a
static bridge FDB entry::

        bridge fdb add dev DEV ADDRESS [vlan VID] [self] static

(The "static" keyword is mandatory: if it is not specified, the entry defaults
to being "local", which means that it should not be forwarded.)

The "self" keyword (optional because it is implicit) has the role of
instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
will bypass the bridge and therefore leave the software database out of sync
with the hardware one.

To avoid this, the "master" keyword can be used::

        bridge fdb add dev DEV ADDRESS [vlan VID] master static

The above command instructs the kernel to search for a master interface of
``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that.
This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification
which the port driver can handle and use to program its hardware table. This
way, the software and the hardware database will both contain this static FDB
entry.

Note: for new switchdev drivers that offload the Linux bridge, implementing the
``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
discouraged: all static FDB entries should be added on a bridge port using the
"master" flag. The ``ndo_fdb_dump`` operation is an exception and can be
implemented to visualize the hardware tables, if the device does not have an
interrupt for notifying the operating system of newly learned/forgotten dynamic
FDB addresses. In that case, the hardware FDB might end up having entries that
the software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to
see them.

Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic.  To enable VLAN support, turn on VLAN filtering::

        echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering

Notification of Learned/Forgotten Source MAC/VLANs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples.  The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call::

        err = call_switchdev_notifiers(val, dev, info, extack);

Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info.  On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED.  The iproute2 bridge
command will label these entries "offload"::

        $ bridge fdb
        52:54:00:12:35:01 dev sw1p1 master br0 permanent
        00:02:00:00:02:00 dev sw1p1 master br0 offload
        00:02:00:00:02:00 dev sw1p1 self
        52:54:00:12:35:02 dev sw1p2 master br0 permanent
        00:02:00:00:03:00 dev sw1p2 master br0 offload
        00:02:00:00:03:00 dev sw1p2 self
        33:33:00:00:00:01 dev eth0 self permanent
        01:00:5e:00:00:01 dev eth0 self permanent
        33:33:ff:00:00:00 dev eth0 self permanent
        01:80:c2:00:00:0e dev eth0 self permanent
        33:33:00:00:00:01 dev br0 self permanent
        01:00:5e:00:00:01 dev br0 self permanent
        33:33:ff:12:35:01 dev br0 self permanent

Learning on the port should be disabled on the bridge using the bridge command::

        bridge link set dev DEV learning off

Learning on the device port should be enabled, as well as learning_sync::

        bridge link set dev DEV learning on self
        bridge link set dev DEV learning_sync on self

The learning_sync attribute enables syncing of the learned/forgotten FDB entry
to the bridge's FDB.  It's possible, but not optimal, to enable learning on the
device port and on the bridge port, and disable learning_sync.

To support learning, the driver implements switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_PRE_BRIDGE_FLAGS and
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS.

FDB Ageing
^^^^^^^^^^

The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
the responsibility of the port driver/device to age out these entries.  If the
port device supports ageing, when the FDB entry expires, it will notify the
driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL.  If the
device does not support ageing, the driver can simulate ageing using a
garbage collection timer to monitor FDB entries.  Expired entries will be
notified to the bridge using SWITCHDEV_FDB_DEL.  See the rocker driver for an
example of a driver running an ageing timer.

To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
notification will reset the FDB entry's last-used time to now.  The driver
should rate limit refresh notifications, for example, no more than once a
second.  (The last-used time is visible using the bridge -s fdb option).
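
The rate limiting suggested above could look roughly like the following
userspace sketch.  The per-entry state ``fdb_refresh_state``, the helper
``fdb_refresh_allowed`` and the millisecond clock are assumptions of this
example; a real driver would use the kernel's own time base and then issue
the SWITCHDEV_FDB_ADD notification when the helper permits it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REFRESH_INTERVAL_MS 1000	/* "no more than once a second" */

/* Hypothetical per-entry state kept by the driver for this sketch. */
struct fdb_refresh_state {
	bool     notified;	/* has a refresh ever been sent? */
	uint64_t last_ms;	/* time of the last refresh notification */
};

/*
 * Decide whether a refresh notification may be sent at time now_ms
 * (assumed monotonic).  On success the state is updated, so further
 * refreshes inside the interval are suppressed.
 */
static bool fdb_refresh_allowed(struct fdb_refresh_state *st, uint64_t now_ms)
{
	assert(st);
	if (st->notified && now_ms - st->last_ms < REFRESH_INTERVAL_MS)
		return false;
	st->notified = true;
	st->last_ms = now_ms;
	return true;
}
```

Keeping the state per FDB entry means a busy address cannot starve refreshes
for other entries.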

STP State Change on Port
^^^^^^^^^^^^^^^^^^^^^^^^

Internally or with a third-party STP protocol implementation (e.g. mstpd), the
bridge driver maintains the STP state for ports, and will notify the switch
driver of STP state change on a port using the switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_STP_STATE.

State is one of BR_STATE_*.  The switch driver can use STP state updates to
update the ingress packet filter list for the port.  For example, if the port
is DISABLED, no packets should pass, but if the port moves to BLOCKING, then
STP BPDUs and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can
pass.

Note that STP BPDUs are untagged and STP state applies to all VLANs on the port
so packet filters should be applied consistently across untagged and tagged
VLANs on the port.
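
The filtering rule described in this section can be sketched as a pure
function in userspace C.  The enum below is a stand-in for this example and
does not reproduce the kernel's BR_STATE_* values.  Treating the listening
and learning states like the blocking state for ingress filtering is an
assumption; the text above only spells out the disabled and blocking cases.
The function takes no VLAN argument because the state applies to all VLANs on
the port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Port states for this sketch only. */
enum stp_state {
	STP_DISABLED,
	STP_BLOCKING,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
};

/* True for the IEEE 01:80:c2:xx:xx:xx destination range named above. */
static bool is_ieee_reserved(const uint8_t mac[6])
{
	return mac[0] == 0x01 && mac[1] == 0x80 && mac[2] == 0xc2;
}

/* Should a frame with destination dst be accepted on ingress? */
static bool stp_ingress_allowed(enum stp_state state, const uint8_t dst[6])
{
	assert(dst);
	switch (state) {
	case STP_DISABLED:
		return false;		/* nothing passes */
	case STP_FORWARDING:
		return true;		/* everything passes */
	default:
		/* blocking; listening and learning by assumption */
		return is_ieee_reserved(dst);
	}
}
```

A driver would translate the same decision table into hardware filter entries
instead of evaluating it per packet in software.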

Flooding L2 domain
^^^^^^^^^^^^^^^^^^

For a given L2 VLAN domain, the switch device should flood multicast/broadcast
and unknown unicast packets to all ports in the domain, if allowed by the
port's current STP state.  The switch driver, knowing which ports are within
which VLAN L2 domain, can program the switch device for flooding.  The packet
may be sent to the port netdev for processing by the bridge driver.  The
bridge should not reflood the packet to the same ports the device flooded,
otherwise there will be duplicate packets on the wire.

To avoid duplicate packets, the switch driver should mark a packet as already
forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
the skb using the ingress bridge port's mark and prevent it from being forwarded
through any bridge port with the same mark.
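
The duplicate-suppression rule can be modelled in a few lines of userspace C.
The structures ``pkt`` and ``bridge_port`` and the helper
``sw_egress_allowed`` are hypothetical stand-ins for this sketch, not the
bridge driver's real types: ``hw_forwarded`` plays the role of the
skb->offload_fwd_mark bit, and the integer marks model the per-port mark
shared by ports of the same hardware forwarding domain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the packet and bridge-port state. */
struct pkt {
	bool hw_forwarded;	/* models the skb->offload_fwd_mark bit */
	int  ingress_mark;	/* mark taken from the ingress bridge port */
};

struct bridge_port {
	int mark;		/* ports of one hardware domain share a mark */
};

/*
 * Software may forward to 'out' unless hardware already forwarded the
 * packet and 'out' carries the same mark as the ingress port.
 */
static bool sw_egress_allowed(const struct pkt *p,
			      const struct bridge_port *out)
{
	assert(p && out);
	return !p->hw_forwarded || p->ingress_mark != out->mark;
}
```

A packet flooded in hardware is therefore still delivered in software to
bridge ports outside the switch, such as a plain NIC enslaved to the same
bridge.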

It is possible for the switch device to not handle flooding and push the
packets up to the bridge driver for flooding.  This does not scale well as the
number of ports in the L2 domain grows, because the device is much more
efficient at flooding packets than software.

If supported by the device, flood control can be offloaded to it, preventing
certain netdevs from flooding unicast traffic for which there is no FDB entry.

IGMP Snooping
^^^^^^^^^^^^^

In order to support IGMP snooping, the port netdevs should trap to the bridge
driver all IGMP join and leave messages.
The bridge multicast module will notify port netdevs of every multicast group
change, whether the group is statically configured or dynamically joined/left.
The hardware implementation should forward all registered multicast traffic
groups only to the configured ports.

L3 Routing Offload
------------------

Offloading L3 routing requires that the device be programmed with FIB entries
from the kernel, with the device doing the FIB lookup and forwarding.  The
device does a longest prefix match (LPM) on FIB entries matching route prefix
and forwards the packet to the matching FIB entry's nexthop(s) egress ports.
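
For readers unfamiliar with longest prefix match, the following userspace
sketch shows the selection rule with a linear scan over a small table.  The
``fib_entry`` layout here is a simplified assumption for this example (host
byte order, a single integer nexthop) and is unrelated to the kernel's FIB
structures; real devices typically use TCAMs or tries rather than a scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical FIB entry: IPv4 prefix in host byte order plus a nexthop. */
struct fib_entry {
	uint32_t dst;
	int      dst_len;	/* prefix length, 0..32 */
	int      nexthop;
};

static uint32_t prefix_mask(int len)
{
	assert(len >= 0 && len <= 32);
	/* a shift by 32 would be undefined, so handle /0 separately */
	return len == 0 ? 0 : 0xffffffffu << (32 - len);
}

/* Returns the nexthop of the longest matching prefix, or -1 if none. */
static int lpm_lookup(const struct fib_entry *fib, size_t n, uint32_t addr)
{
	int best_len = -1, nexthop = -1;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(fib[i].dst_len);

		if ((addr & mask) == (fib[i].dst & mask) &&
		    fib[i].dst_len > best_len) {
			best_len = fib[i].dst_len;
			nexthop = fib[i].nexthop;
		}
	}
	return nexthop;
}
```

An address covered by both a /8 and a /30 route resolves to the /30 entry,
and an address matching nothing else falls back to the /0 default route if
one is present.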

To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:

===================  ===================================================
FIB_EVENT_ENTRY_ADD  used for both adding a new FIB entry to the device,
                     or modifying an existing entry on the device.
FIB_EVENT_ENTRY_DEL  used for removing a FIB entry
FIB_EVENT_RULE_ADD,
FIB_EVENT_RULE_DEL   used to propagate FIB rule changes
===================  ===================================================

FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::

        struct fib_entry_notifier_info {
                struct fib_notifier_info info; /* must be first */
                u32 dst;
                int dst_len;
                struct fib_info *fi;
                u8 tos;
                u8 type;
                u32 tb_id;
                u32 nlflags;
        };

to add/modify/delete IPv4 dst/dst_len prefix on table tb_id.  The ``*fi``
structure holds details on the route and route's nexthops.  ``*dev`` is one
of the port netdevs mentioned in the route's next hop list.

Routes offloaded to the device are labeled with "offload" in the ip route
listing::

        $ ip route show
        default via 192.168.0.2 dev eth0
        11.0.0.0/30 dev sw1p1  proto kernel  scope link  src 11.0.0.2 offload
        11.0.0.4/30 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
        11.0.0.8/30 dev sw1p2  proto kernel  scope link  src 11.0.0.10 offload
        11.0.0.12/30 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
        12.0.0.2  proto zebra  metric 30 offload
                nexthop via 11.0.0.1  dev sw1p1 weight 1
                nexthop via 11.0.0.9  dev sw1p2 weight 1
        12.0.0.3 via 11.0.0.1 dev sw1p1  proto zebra  metric 20 offload
        12.0.0.4 via 11.0.0.9 dev sw1p2  proto zebra  metric 20 offload
        192.168.0.0/24 dev eth0  proto kernel  scope link  src 192.168.0.15

The "offload" flag is set if at least one device offloads the FIB entry.

XXX: add/mod/del IPv6 FIB API

Nexthop Resolution
^^^^^^^^^^^^^^^^^^

The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
the switch device to forward the packet with the correct dst mac address, the
nexthop gateways must be resolved to the neighbor's mac address.  Neighbor mac
address discovery comes via the ARP (or ND) process and is available via the
arp_tbl neighbor table.  To resolve the route's nexthop gateways, the driver
should trigger the kernel's neighbor resolution process.  See the rocker
driver's rocker_port_ipv4_resolve() for an example.

The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE.  The device can be programmed with resolved nexthops
for the routes as arp_tbl is updated.  The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.

Device driver expected behavior
41462306a36Sopenharmony_ci-------------------------------
41562306a36Sopenharmony_ci
41662306a36Sopenharmony_ciBelow is a set of defined behavior that switchdev enabled network devices must
41762306a36Sopenharmony_ciadhere to.
41862306a36Sopenharmony_ci
41962306a36Sopenharmony_ciConfiguration-less state
42062306a36Sopenharmony_ci^^^^^^^^^^^^^^^^^^^^^^^^
42162306a36Sopenharmony_ci
42262306a36Sopenharmony_ciUpon driver bring up, the network devices must be fully operational, and the
42362306a36Sopenharmony_cibacking driver must configure the network device such that it is possible to
42462306a36Sopenharmony_cisend and receive traffic to this network device and it is properly separated
42562306a36Sopenharmony_cifrom other network devices/ports (e.g.: as is frequent with a switch ASIC). How
42662306a36Sopenharmony_cithis is achieved is heavily hardware dependent, but a simple solution can be to
42762306a36Sopenharmony_ciuse per-port VLAN identifiers unless a better mechanism is available
42862306a36Sopenharmony_ci(proprietary metadata for each network port for instance).
42962306a36Sopenharmony_ci
43062306a36Sopenharmony_ciThe network device must be capable of running a full IP protocol stack
43162306a36Sopenharmony_ciincluding multicast, DHCP, IPv4/6, etc. If necessary, it should program the
43262306a36Sopenharmony_ciappropriate filters for VLAN, multicast, unicast etc. The underlying device
43362306a36Sopenharmony_cidriver must effectively be configured in a similar fashion to what it would do
43462306a36Sopenharmony_ciwhen IGMP snooping is enabled for IP multicast over these switchdev network
43562306a36Sopenharmony_cidevices and unsolicited multicast must be filtered as early as possible in
43662306a36Sopenharmony_cithe hardware.
43762306a36Sopenharmony_ci
43862306a36Sopenharmony_ciWhen configuring VLANs on top of the network device, all VLANs must be working,
43962306a36Sopenharmony_ciirrespective of the state of other network devices (e.g.: other ports being part
44062306a36Sopenharmony_ciof a VLAN-aware bridge doing ingress VID checking). See below for details.
44162306a36Sopenharmony_ci
44262306a36Sopenharmony_ciIf the device implements e.g.: VLAN filtering, putting the interface in
44362306a36Sopenharmony_cipromiscuous mode should allow the reception of all VLAN tags (including those
44462306a36Sopenharmony_cinot present in the filter(s)).
44562306a36Sopenharmony_ci
44662306a36Sopenharmony_ciBridged switch ports
44762306a36Sopenharmony_ci^^^^^^^^^^^^^^^^^^^^
44862306a36Sopenharmony_ci
44962306a36Sopenharmony_ciWhen a switchdev enabled network device is added as a bridge member, it should
45062306a36Sopenharmony_cinot disrupt any functionality of non-bridged network devices and they
45162306a36Sopenharmony_cishould continue to behave as normal network devices. Depending on the bridge
45262306a36Sopenharmony_ciconfiguration knobs below, the expected behavior is documented.
45362306a36Sopenharmony_ci
45462306a36Sopenharmony_ciBridge VLAN filtering
45562306a36Sopenharmony_ci^^^^^^^^^^^^^^^^^^^^^
45662306a36Sopenharmony_ci
45762306a36Sopenharmony_ciThe Linux bridge allows the configuration of a VLAN filtering mode (statically,
45862306a36Sopenharmony_ciat device creation time, and dynamically, during run time) which must be
45962306a36Sopenharmony_ciobserved by the underlying switchdev network device/hardware:

- with VLAN filtering turned off: the bridge is strictly VLAN-unaware and its
  data path will process all Ethernet frames as if they are VLAN-untagged.
  The bridge VLAN database can still be modified, but the modifications should
  have no effect while VLAN filtering is turned off. Frames ingressing the
  device with a VID that is not programmed into the bridge/switch's VLAN table
  must be forwarded and may be processed using a VLAN device (see below).

- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
  the device with a VID that is not programmed into the bridge/switch's VLAN
  table must be dropped (strict VID checking).
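
The two modes above can be modelled as a single ingress check. The following
is a minimal sketch, not driver code: ``struct vlan_table`` and the helper
names are hypothetical, and real hardware would implement the check in its own
VLAN table rather than in software.

```c
#include <stdbool.h>
#include <stdint.h>

#define VLAN_N_VID 4096	/* number of possible 12-bit VIDs */

/* Hypothetical model of the bridge/switch VLAN table: one bit per VID. */
struct vlan_table {
	uint8_t member[VLAN_N_VID / 8];
};

/* Program @vid into the table; out-of-range VIDs are ignored. */
static void vlan_table_add(struct vlan_table *t, uint16_t vid)
{
	if (vid < VLAN_N_VID)
		t->member[vid / 8] |= (uint8_t)(1u << (vid % 8));
}

/* Return true if @vid has been programmed into the table. */
static bool vlan_table_has(const struct vlan_table *t, uint16_t vid)
{
	return vid < VLAN_N_VID &&
	       (t->member[vid / 8] & (1u << (vid % 8))) != 0;
}

/* Return true if a frame classified to @vid may be forwarded by the bridge. */
static bool bridge_ingress_allowed(bool vlan_filtering,
				   const struct vlan_table *t, uint16_t vid)
{
	if (!vlan_filtering)
		return true;	/* VLAN-unaware: the table has no effect */

	return vlan_table_has(t, vid);	/* strict VID checking */
}
```

With only VID 100 programmed, a frame tagged with VID 200 is dropped when
filtering is on and forwarded when filtering is off.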

When there is a VLAN device (e.g. sw0p1.100) configured on top of a switchdev
network device which is a bridge port member, the behavior of the software
network stack must be preserved, or the configuration must be refused if that
is not possible.

- with VLAN filtering turned off, the bridge will process all ingress traffic
  for the port, except for the traffic tagged with a VLAN ID destined for a
  VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even
  be added to a second bridge, which includes other switch ports or software
  interfaces. Some approaches to ensure that the forwarding domain for traffic
  belonging to the VLAN upper interfaces is managed properly:

    * If forwarding destinations can be managed per VLAN, the hardware could be
      configured to map all traffic, except the packets tagged with a VID
      belonging to a VLAN upper interface, to an internal VID corresponding to
      untagged packets. This internal VID spans all ports of the VLAN-unaware
      bridge. The VID corresponding to the VLAN upper interface spans the
      physical port of that VLAN interface, as well as the other ports that
      might be bridged with it.
    * Treat bridge ports with VLAN upper interfaces as standalone, and let
      forwarding be handled in the software data path.

- with VLAN filtering turned on, these VLAN devices can be created as long as
  the bridge does not have an existing VLAN entry with the same VID on any
  bridge port. These VLAN devices cannot be enslaved into the bridge since they
  duplicate functionality/use case with the bridge's VLAN data path processing.

Non-bridged network ports of the same switch fabric must not be disturbed in any
way by the enabling of VLAN filtering on the bridge device(s). If the VLAN
filtering setting is global to the entire chip, then the standalone ports
should indicate to the network stack that VLAN filtering is required by setting
'rx-vlan-filter: on [fixed]' in the ethtool features.

Because VLAN filtering can be turned on/off at runtime, the switchdev driver
must be able to reconfigure the underlying hardware on the fly to honor the
toggling of that option and behave appropriately. If that is not possible, the
switchdev driver can also refuse to support dynamic toggling of the VLAN
filtering knob at runtime and require a destruction of the bridge device(s) and
creation of new bridge device(s) with a different VLAN filtering value to
ensure VLAN awareness is pushed down to the hardware.

Even when VLAN filtering in the bridge is turned off, the underlying switch
hardware and driver may still configure themselves in a VLAN-aware mode provided
that the behavior described above is observed.

The VLAN protocol of the bridge plays a role in deciding whether a packet is
treated as tagged or not: a bridge using the 802.1ad protocol must treat both
VLAN-untagged packets and packets tagged with 802.1Q headers as untagged.
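
As an illustration of the rule above, a frame only counts as tagged when its
outer TPID matches the VLAN protocol of the bridge. This is a sketch: the
function name is hypothetical, the TPID values are the standard ones, and
treating the reverse case (an 802.1ad tag seen by an 802.1Q bridge) as
untagged is an assumption of this example rather than something stated here.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q	0x8100	/* 802.1Q (C-VLAN) TPID */
#define ETH_P_8021AD	0x88A8	/* 802.1ad (S-VLAN) TPID */

/*
 * @bridge_proto: VLAN protocol of the bridge (ETH_P_8021Q or ETH_P_8021AD)
 * @outer_tpid:   EtherType/TPID found right after the source MAC address
 *
 * Return true if the bridge must treat the frame as VLAN-tagged.
 */
static bool frame_is_tagged_for_bridge(uint16_t bridge_proto,
				       uint16_t outer_tpid)
{
	/* Any other EtherType, including a foreign VLAN TPID, is "untagged". */
	return outer_tpid == bridge_proto;
}
```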

The 802.1p (VID 0) tagged packets must be treated in the same way by the device
as untagged packets, since the bridge device does not allow the manipulation of
VID 0 in its database.

When the bridge has VLAN filtering enabled and a PVID is not configured on the
ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge
has VLAN filtering enabled and a PVID exists on the ingress port, untagged and
802.1p tagged packets must be accepted and forwarded according to the
bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering
disabled, the presence/lack of a PVID should not influence the packet
forwarding decision.
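
The PVID rules above can be summarised as a classification step that runs
before the VLAN table lookup. The sketch below is only a model: ``VID_DROP``,
``PVID_NONE`` and ``classify_vid()`` are hypothetical names, and returning 0
for the VLAN-unaware case is a convention of this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define VID_DROP	(-1)	/* hypothetical result: drop the frame */
#define PVID_NONE	0	/* hypothetical marker: no PVID on the port */

/*
 * @vlan_filtering: VLAN filtering setting of the bridge
 * @pvid:           PVID of the ingress port, or PVID_NONE
 * @tagged:         frame carries a tag of the bridge's VLAN protocol
 * @vid:            VID from that tag (ignored when !tagged)
 *
 * Return the VID used for forwarding, or VID_DROP.
 */
static int classify_vid(bool vlan_filtering, uint16_t pvid,
			bool tagged, uint16_t vid)
{
	if (!vlan_filtering)
		return 0;	/* VLAN-unaware: the PVID plays no role */

	/* Untagged and 802.1p (VID 0) tagged frames are handled alike. */
	if (!tagged || vid == 0)
		return pvid != PVID_NONE ? pvid : VID_DROP;

	return vid;
}
```

The returned VID would then still be subject to the strict VID check of the
VLAN-aware bridge.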
53362306a36Sopenharmony_ci
53462306a36Sopenharmony_ciBridge IGMP snooping
53562306a36Sopenharmony_ci^^^^^^^^^^^^^^^^^^^^

The Linux bridge allows the configuration of IGMP snooping (statically, at
interface creation time, or dynamically, during runtime) which must be observed
by the underlying switchdev network device/hardware in the following way:

- when IGMP snooping is turned off, multicast traffic must be flooded to all
  ports within the same bridge that have mcast_flood=true. The CPU/management
  port should ideally not be flooded (unless the ingress interface has
  IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through
  the network stack notifications. If the hardware is not capable of doing that
  then the CPU/management port must also be flooded and multicast filtering
  happens in software.

- when IGMP snooping is turned on, multicast traffic must selectively flow
  to the appropriate network ports (including CPU/management port). Flooding of
  unknown multicast should be only towards the ports connected to a multicast
  router (the local device may also act as a multicast router).
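
The egress port selection described in the list above can be sketched with
port bitmasks. This is an illustrative model under stated assumptions:
``struct mcast_state`` and ``mcast_egress_ports()`` are hypothetical names, the
CPU/management port is not modelled, and forwarding known groups to multicast
router ports as well as to group members follows RFC 4541 rather than anything
spelled out in the list.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-bridge multicast state; each bit is one switch port. */
struct mcast_state {
	bool snooping;		/* bridge multicast snooping knob */
	uint32_t bridge_ports;	/* all ports that are members of the bridge */
	uint32_t mcast_flood;	/* ports with mcast_flood=true */
	uint32_t mrouter_ports;	/* ports connected to a multicast router */
};

/*
 * @group_known: a forwarding entry exists for the destination group
 * @group_ports: member ports of that group (ignored when !group_known)
 * @ingress:     bit of the ingress port, never used as egress
 *
 * Return the bitmask of ports the frame must be sent to.
 */
static uint32_t mcast_egress_ports(const struct mcast_state *s,
				   bool group_known, uint32_t group_ports,
				   uint32_t ingress)
{
	uint32_t ports;

	if (!s->snooping)
		ports = s->mcast_flood;		/* flood per mcast_flood */
	else if (group_known)
		ports = group_ports | s->mrouter_ports;
	else
		ports = s->mrouter_ports;	/* unknown multicast */

	return ports & s->bridge_ports & ~ingress;
}
```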

The switch must adhere to RFC 4541 and flood multicast traffic accordingly
since that is what the Linux bridge implementation does.

Because IGMP snooping can be turned on/off at runtime, the switchdev driver
must be able to reconfigure the underlying hardware on the fly to honor the
toggling of that option and behave appropriately.

A switchdev driver can also refuse to support dynamic toggling of the multicast
snooping knob at runtime and require the destruction of the bridge device(s)
and creation of new bridge device(s) with a different multicast snooping
value.