.. SPDX-License-Identifier: GPL-2.0

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs.  For the closely-related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support.  This led to a desire to offload
software-defined networks (such as Open vSwitch) to these NICs to specify the
network connectivity of each function.  The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices.  Just as each physical port of a
Linux-controlled switch has a separate netdev, so does each virtual port of a
virtual switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors.  The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents.  So for example in
the case of a VF representor, the representee is the corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc.  For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch.  Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice.  (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)
   This allows software switch implementations (such as Open vSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF.  Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

 - VFs belonging to the switchdev function.
 - Other PFs on the local PCIe controller, and any VFs belonging to them.
 - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
   System-on-Chip within the SmartNIC).
 - PFs and VFs with other personalities, including network block devices (such
   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
   if) their network access is implemented through a virtual switch port. [#]_
   Note that such functions can require a representor despite the representee
   not having a netdev.
 - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
   their own port on the switch (as opposed to using their parent PF's port).
 - Any accelerators or plugins on the device whose interface to the network is
   through a virtual switch port, even if they do not have a corresponding PCIe
   PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs.  While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch.  The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN.  However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function.  For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.)  See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.

It is expected that userland will use this information (e.g. through udev rules)
to construct an appropriately informative name or alias for the netdevice.  For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.

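The renaming described above could be sketched as a small shell helper, for
example one called from a udev rule.  This is only an illustration: the
function name ``rep_alias``, the ``rep`` suffix and the choice to drop the
leading ``p<N>`` component are assumptions taken from the ``eth4`` example, not
an established convention.

```shell
#!/bin/sh
# Sketch only: build a representor alias from the switchdev function's netdev
# name and the representor's phys_port_name.  rep_alias is a hypothetical
# helper; the naming scheme mirrors the eth4 / p0pf1vf2 example and nothing more.
rep_alias() {
    parent=$1       # netdev name of the switchdev function, e.g. eth4
    port_name=$2    # representor's phys_port_name, e.g. p0pf1vf2
    # Drop a leading physical-port component ("p" followed by digits), if any.
    suffix=$(printf '%s\n' "$port_name" | sed -E 's/^p[0-9]+//')
    printf '%s%srep\n' "$parent" "$suffix"
}
```

It might be used as ``rep_alias eth4 "$(cat /sys/class/net/$REP_DEV/phys_port_name)"``,
where ``REP_DEV`` is the representor's current netdevice name.
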
There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice.  Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
        action mirred egress redirect dev $PORT_DEV
    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
        action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``.  (Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.

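The filters above use ``parent ffff:``, which refers to the ingress qdisc, so
that qdisc has to exist on each device before the rules can be added.  Once a
rule is installed, ``tc filter show`` reports whether the driver accepted it
for offload by printing ``in_hw`` or ``not_in_hw``.  A minimal sketch, in which
the helper names ``add_ingress`` and ``filter_in_hw`` are hypothetical:

```shell
#!/bin/sh
# Sketch only; add_ingress and filter_in_hw are illustrative helper names.

# Attach the ingress qdisc that "parent ffff:" refers to.
# (tc reports an error if the qdisc is already present.)
add_ingress() {
    tc qdisc add dev "$1" ingress
}

# Read a "tc filter show" dump on stdin and succeed if at least one filter
# is marked in_hw.  -w keeps "not_in_hw" and "in_hw_count" from matching.
filter_in_hw() {
    grep -qw 'in_hw'
}

# Usage (needs root and a real representor):
#   add_ingress "$REP_DEV"
#   tc filter show dev "$REP_DEV" ingress | filter_in_hw && echo offloaded
```

A rule that is reported as ``not_in_hw`` still takes effect, but only on the
slow path through the representor.
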
Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g. switchdev
function uplink netdev or port representor).  TC rules such as::

    tc filter add dev $REP_DEV parent ffff: flower \
        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
                              dst_port 4789 \
        action mirred egress redirect dev vxlan0
    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
        action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
port 4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``,
decapsulated and forwarded to the VF.

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor.  Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)

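In practice this means the representee's link is driven with ordinary
``ip link`` commands issued against the representor.  A minimal sketch; the
helper name ``rep_link_config`` and the example device name are hypothetical:

```shell
#!/bin/sh
# Sketch only; rep_link_config is an illustrative helper name.
# Setting the MTU and admin state on the representor should be reflected as
# the reported MTU and carrier state at the representee.
rep_link_config() {
    rep=$1    # representor netdevice (hypothetical example: eth4pf1vf2rep)
    mtu=$2    # MTU to be reported to the representee
    ip link set dev "$rep" mtu "$mtu"
    ip link set dev "$rep" up
}

# Usage (needs root and a real representor):
#   rep_link_config "$REP_DEV" 1500
```

Taking the representor down again with ``ip link set dev $REP_DEV down`` should
correspondingly cause carrier OFF at the representee.
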
Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

 - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
 - devlink port function (see **devlink-port(8)** and
   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
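
The two methods listed above might be wrapped as follows.  This is a sketch
only: the helper names are hypothetical, and the devlink port handle and MAC
address in the usage comments are placeholders which have to be replaced with
values from the actual system (e.g. from ``devlink port show``).

```shell
#!/bin/sh
# Sketch only; both helper names are illustrative.

# Legacy SR-IOV: the VF is addressed by its index on the PF netdevice.
set_vf_mac_legacy() {
    pf_dev=$1; vf_num=$2; mac=$3
    ip link set "$pf_dev" vf "$vf_num" mac "$mac"
}

# devlink port function: the representee is addressed through the devlink
# port handle (DEV/PORT_INDEX) associated with its representor.
set_vf_mac_devlink() {
    port=$1; mac=$2
    devlink port function set "$port" hw_addr "$mac"
}

# Usage (needs root; handle and address are placeholders):
#   set_vf_mac_legacy eth4 2 00:11:22:33:44:55
#   set_vf_mac_devlink pci/0000:03:00.0/1 00:11:22:33:44:55
```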