.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.  If there
is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
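
The receive path described above can be sketched in a few lines of
Python.  This is only an illustrative model, not the kernel
implementation: the ``Datapath`` class, its ``extract_key`` callback,
and the ``upcalls`` list are hypothetical names invented for this
sketch, and a real flow key is a sequence of Netlink attributes rather
than a Python tuple.

```python
class Datapath:
    """Toy model of one datapath (hypothetical, for illustration only)."""

    def __init__(self, extract_key):
        self.extract_key = extract_key  # callable: (in_port, packet) -> hashable key
        self.flow_table = {}            # flow key -> list of action callables
        self.upcalls = []               # (key, packet) pairs queued for userspace

    def receive(self, in_port, packet):
        """Process one packet; return True if it was handled in the datapath."""
        key = self.extract_key(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # No matching flow: hand the packet and its parsed key to userspace.
            self.upcalls.append((key, packet))
            return False
        for action in actions:
            action(packet)
        return True
```

In this model, userspace would take an entry from ``upcalls``, decide
what to do, and store a list of actions under the same key so that
later packets with that key never leave the datapath.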


Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.  Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.  (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)
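
The three cases above can be summarized as a small decision function.
The sketch below is a simplification under stated assumptions: flow
keys are modeled as Python dicts keyed by attribute name, only the set
of attribute names is compared, and ``plan_for_miss`` and the returned
strings are hypothetical names, not part of any Open vSwitch API.

```python
def plan_for_miss(kernel_key, user_key, extra_fields_matter=True):
    """Decide how userspace might handle a packet that missed the flow table.

    kernel_key: attributes the kernel parsed (dict, name -> value).
    user_key:   attributes userspace parsed from the same packet.
    """
    kernel_attrs = set(kernel_key)
    user_attrs = set(user_key)

    if user_attrs <= kernel_attrs:
        # Cases 1 and 2: the kernel parsed at least as much as userspace.
        # Install a flow, always using the kernel-provided key.
        return "install_flow_with_kernel_key"

    if not extra_fields_matter:
        # Case 3, parenthetical: the extra fields do not affect forwarding.
        return "install_flow_with_kernel_key"

    # Case 3: userspace sees fields the kernel cannot match on, so it
    # forwards each packet itself. Slow, but the behavior is correct.
    return "forward_in_userspace"
```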

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.  For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
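
For readers who want to produce this informal notation from their own
data, a minimal sketch follows.  The in-memory representation (a list
of ``(name, value)`` pairs, where a dict holds named arguments and a
list holds nested attributes) and the ``format_key`` helper are
assumptions of this sketch; they are not how the kernel or
<linux/openvswitch.h> represent flow keys.

```python
def format_key(attrs):
    """Render a list of (name, value) pairs in the informal notation."""
    parts = []
    for name, value in attrs:
        if isinstance(value, list):
            # Nested attributes: recurse and wrap in the parent's parentheses.
            inner = format_key(value)
        elif isinstance(value, dict):
            # Named arguments such as src=..., dst=...
            inner = ", ".join(f"{k}={v}" for k, v in value.items())
        else:
            inner = str(value)
        parts.append(f"{name}({inner})")
    return ", ".join(parts)
```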


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that the userspace program must process.
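
The bitwise rule above can be sketched as follows.  The sketch treats
a flow key as a single integer, which is an assumption made purely for
illustration (real keys and masks are Netlink attribute sequences), and
``masked_match``, ``exact_flows_covered`` and ``ports`` are hypothetical
helpers.

```python
def masked_match(packet_bits, key_bits, mask_bits):
    """True if the packet agrees with the key on every bit set in the mask."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


def exact_flows_covered(mask_bits, width):
    """Number of distinct exact-match keys that one wildcarded flow stands for."""
    dont_care_bits = width - bin(mask_bits).count("1")
    return 2 ** dont_care_bits


def ports(src, dst):
    """Hypothetical 32-bit key: TCP source port high, destination port low."""
    return (src << 16) | dst


KEY = ports(0, 80)
MASK = 0x0000FFFF  # match the destination port, wildcard the source port
```

With this mask, a single flow matches destination port 80 from any
source port, so it stands in for 2**16 exact match flows that userspace
would otherwise have to set up one miss at a time.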

Support for the mask Netlink attribute is optional for both the kernel and the
userspace program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits to fewer than
what was specified by the userspace program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with userspace programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.
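
The relationship between the mask that userspace requests and the mask
that the kernel reports back can be modeled with integer masks.  This
is a sketch under assumptions, not a description of any particular
kernel version: ``always_exact_bits`` stands for whatever bits a given
kernel cannot wildcard, and both function names are hypothetical.

```python
def reported_mask(requested_mask, always_exact_bits):
    """Mask a kernel would report if it must match always_exact_bits exactly."""
    # Turning a don't-care bit (0) into an exact-match bit (1) only narrows
    # the flow, so the worst outcome is additional flow setups.
    return requested_mask | always_exact_bits


def wildcards_were_reduced(requested_mask, mask_from_kernel):
    """True if some requested don't-care bits came back as exact-match bits."""
    return bool(mask_from_kernel & ~requested_mask)
```

A userspace program following the second approach above would compare
the mask in the kernel's response with the one it requested, while
continuing to identify the flow by the key exactly as it installed it.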

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the userspace program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.
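
Because the kernel's detection is only best-effort, a userspace program
may want its own check before installing a flow.  The sketch below is
one way to test two flows for overlap when keys and masks are modeled
as integers (an assumption of this sketch).  It is not the kernel's
detection logic, and ``flows_overlap`` is a hypothetical name.

```python
def flows_overlap(key_a, mask_a, key_b, mask_b):
    """True if at least one packet could match both wildcarded flows."""
    # Only bits that BOTH flows match exactly can rule out a common packet.
    # Everywhere else, at least one flow does not care, so a packet can be
    # built that satisfies both.
    shared = mask_a & mask_b
    return ((key_a ^ key_b) & shared) == 0
```

For example, a flow matching only destination port 80 and a flow
matching only source port 49163 overlap, because a packet from port
49163 to port 80 matches both.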


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and the userspace program.

Userspace programs that support UFIDs are expected to provide one during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.
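
A table indexed only by UFID might look like the following sketch.  The
``FlowTable`` class and ``new_ufid`` helper are hypothetical, and using
a random 16-byte value as the identifier is an assumption of this
sketch rather than a requirement stated in this document.

```python
import uuid


def new_ufid():
    """Generate an identifier for a new flow (random 16 bytes, by assumption)."""
    return uuid.uuid4().bytes


class FlowTable:
    """Toy flow table that indexes flows by UFID instead of by flow key."""

    def __init__(self):
        self._flows = {}  # ufid (bytes) -> (key, mask, actions)

    def add(self, ufid, key, mask, actions):
        if ufid in self._flows:
            raise ValueError("UFID already in use")
        self._flows[ufid] = (key, mask, actions)

    def get(self, ufid):
        return self._flows[ufid]

    def delete(self, ufid):
        return self._flows.pop(ufid)
```

Note that ``get`` and ``delete`` never look at the flow key, which
mirrors the statement above that the kernel need not index flows by the
original key once a UFID has been supplied.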


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences, so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype and then stopped parsing the
packet.  The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
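
The difference between the naive and the nested layout can be made
concrete with a sketch of an old application that skips attributes it
does not know.  The list-of-pairs representation, the ``KNOWN`` set and
the ``understood`` helper are all hypothetical and exist only for this
illustration.

```python
# Attributes that a not-yet-updated application knows how to interpret.
KNOWN = {"eth", "eth_type", "ip", "tcp"}


def understood(attrs):
    """Top-level attributes the old application can interpret.

    Unknown attributes are skipped whole, including anything nested in them.
    """
    return {name: value for name, value in attrs if name in KNOWN}


# Naive layout: "vlan" is added next to the existing attributes.
naive = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
         ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Nested layout: the encapsulated headers live inside "encap".
nested = [("eth", "..."), ("eth_type", 0x8100), ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")])]
```

With the naive layout the old application sees ``eth_type`` 0x0800 plus
``ip`` and ``tcp`` and wrongly concludes that the flow is untagged TCP.
With the nested layout it sees only ``eth`` and ``eth_type`` 0x8100,
exactly what it saw before VLAN support was added.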

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always arrange for all packets with those values to be sent to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
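
A parser that follows this rule might be sketched as below.  The
``tcp_key`` function is hypothetical, and treating anything shorter
than the 20-byte minimum TCP header as "missing" is a simplification
chosen for this sketch; the real kernel parser applies its own checks.

```python
import struct

TCP_MIN_HEADER_LEN = 20  # size of a TCP header without options


def tcp_key(l4_payload):
    """Build the tcp attribute from the bytes that follow the IP header."""
    if len(l4_payload) < TCP_MIN_HEADER_LEN:
        # Truncated or missing header: do not drop the packet, report an
        # "empty" attribute instead.
        return {"src": 0, "dst": 0}
    # Source and destination ports are the first two big-endian 16-bit fields.
    src, dst = struct.unpack_from("!HH", l4_payload)
    return {"src": src, "dst": dst}
```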

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.  The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
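
The role of the CFI bit can be shown with a short sketch.  The bit
positions follow the standard 16-bit 802.1Q TCI layout (3 bits of PCP,
1 CFI bit, 12 bits of VID); the ``vlan_attr`` and ``describe`` helpers
are hypothetical names used only here.

```python
VLAN_CFI_BIT = 0x1000   # CFI bit of the TCI, used as the "tag present" marker
VLAN_VID_MASK = 0x0FFF
VLAN_PCP_SHIFT = 13


def vlan_attr(tci_bytes):
    """Value of the vlan attribute for the bytes following Ethertype 0x8100."""
    if len(tci_bytes) < 2:
        return 0  # TCI missing or truncated: all-zero bits, CFI clear
    tci = int.from_bytes(tci_bytes[:2], "big")
    return tci | VLAN_CFI_BIT  # a real tag always has the CFI bit set


def describe(vlan_value):
    """Interpret a vlan attribute value the way userspace might."""
    if not vlan_value & VLAN_CFI_BIT:
        return "missing or malformed TCI"
    vid = vlan_value & VLAN_VID_MASK
    pcp = vlan_value >> VLAN_PCP_SHIFT
    return f"vid={vid}, pcp={pcp}"
```

A genuine tag whose TCI is all zero bits therefore yields 0x1000, which
cannot be confused with the value 0 reported for a truncated packet.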

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
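
Two of these rules lend themselves to a short sketch: checking for
duplicates at each nesting level, and using the raw bytes of a
kernel-provided key as an opaque lookup handle.  Both helpers and the
``flow_cache`` dict are hypothetical; keys are modeled as lists of
``(name, value)`` pairs with nested lists for nested attributes.

```python
def has_duplicates(attrs):
    """True if any attribute name repeats within a single nesting level."""
    seen = set()
    for name, value in attrs:
        if name in seen:
            return True
        seen.add(name)
        # The same name may reappear at a deeper level (e.g. eth_type inside
        # encap), so each nested list is checked on its own.
        if isinstance(value, list) and has_duplicates(value):
            return True
    return False


# Because the kernel always composes a given key the same way, the raw
# attribute bytes can serve as a dictionary key without being parsed.
flow_cache = {}


def remember(raw_key, decision):
    """Record a forwarding decision under the unparsed key bytes."""
    flow_cache[bytes(raw_key)] = decision
```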