.. SPDX-License-Identifier: GPL-2.0

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs. For the closely-related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support. This led to a desire to offload
software-defined networks (such as Open vSwitch) to these NICs to specify the
network connectivity of each function. The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices. Just as each physical port of a
Linux-controlled switch has a separate netdev, so does each virtual port of a
virtual switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors.
The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents. So for example in
the case of a VF representor, the representee is the corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc. For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch. Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice. (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)
   This allows software switch implementations (such as Open vSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not. E.g. a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF. Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

 - VFs belonging to the switchdev function.
 - Other PFs on the local PCIe controller, and any VFs belonging to them.
 - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
   System-on-Chip within the SmartNIC).
 - PFs and VFs with other personalities, including network block devices (such
   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
   if) their network access is implemented through a virtual switch port. [#]_
   Note that such functions can require a representor despite the representee
   not having a netdev.
 - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
   their own port on the switch (as opposed to using their parent PF's port).
 - Any accelerators or plugins on the device whose interface to the network is
   through a virtual switch port, even if they do not have a corresponding PCIe
   PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs. While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch. The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function. For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.

It is expected that userland will use this information (e.g.
through udev rules)
to construct an appropriately informative name or alias for the netdevice. For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.

There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice. Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.
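
The filters in this section attach to ``parent ffff:``, which only exists once
an ingress qdisc has been added to the device, and many drivers only offload TC
rules while the ``hw-tc-offload`` netdev feature is enabled. The following is a
minimal preparation sketch, assuming iproute2 and ethtool are installed. The
``run`` and ``prepare_dev`` helpers and the default device names are
hypothetical, and whether the feature flag is needed is driver-dependent.

```shell
#!/bin/sh
# Sketch only: prints the commands by default. Set DRY_RUN=0 to execute them,
# which needs root and a device whose driver supports TC offload.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Add the ingress qdisc (handle ffff:) that the filters attach to, then ask the
# driver to offload TC rules. The feature flag may not be needed on all drivers.
prepare_dev() {
    run tc qdisc add dev "$1" ingress
    run ethtool -K "$1" hw-tc-offload on
}

# Hypothetical device names; override REP_DEV and PORT_DEV for a real system.
REP_DEV="${REP_DEV:-eth4pf1vf2rep}"
PORT_DEV="${PORT_DEV:-eth4}"

prepare_dev "$REP_DEV"
prepare_dev "$PORT_DEV"
```

In the default dry-run mode the script only echoes the four commands, so it can
be inspected before anything is applied to a live system.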

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
        action mirred egress redirect dev $PORT_DEV
    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
        action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.

Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g.
switchdev
function uplink netdev or port representor). TC rules such as::

    tc filter add dev $REP_DEV parent ffff: flower \
        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
                              dst_port 4789 \
        action mirred egress redirect dev vxlan0
    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
        action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
port 4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``,
decapsulated and forwarded to the VF.
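
Since the driver may decline to offload a rule, it can be useful to check which
filters actually reached the hardware. ``tc filter show`` marks each filter with
``in_hw`` or ``not_in_hw``; the sketch below counts both flags in that output.
The ``count_offload`` helper is a hypothetical name, and the exact output layout
may differ between iproute2 versions.

```shell
#!/bin/sh
# Read `tc filter show` output on stdin and count the offload flags.
# Matching whole fields keeps "in_hw_count" and "not_in_hw" from being
# mistaken for "in_hw".
count_offload() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "in_hw") hw++
            else if ($i == "not_in_hw") sw++
        }
    }
    END { printf "in_hw=%d not_in_hw=%d\n", hw, sw }'
}
```

A typical invocation would be
``tc filter show dev $REP_DEV parent ffff: | count_offload``, repeated for
``vxlan0`` or ``$PORT_DEV`` as needed.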

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor. Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)

Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

 - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
 - devlink port function (see **devlink-port(8)** and
   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
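
As an illustration of the two methods listed above, the following sketch prints
(or, with ``DRY_RUN=0``, runs) the corresponding commands. The PF netdev name,
VF index, devlink port handle and MAC address are all placeholder values; the
real port handle can be found with ``devlink port show``.

```shell
#!/bin/sh
# Sketch only: prints the commands by default. Set DRY_RUN=0 to execute them,
# which needs root and suitable hardware.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Placeholder values; none of these come from a real system.
PF_DEV="${PF_DEV:-eth4}"
VF_NUM="${VF_NUM:-2}"
DL_PORT="${DL_PORT:-pci/0000:06:00.0/2}"
MAC="${MAC:-00:11:22:33:44:55}"

# Legacy SR-IOV: set the VF's MAC address through the PF netdevice.
set_mac_legacy() {
    run ip link set "$PF_DEV" vf "$VF_NUM" mac "$MAC"
}

# devlink port function: set the hardware address of the function behind a port.
set_mac_devlink() {
    run devlink port function set "$DL_PORT" hw_addr "$MAC"
}

set_mac_legacy
set_mac_devlink
```

Only one of the two methods would normally be used for a given representee;
both are shown here purely for comparison.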