.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices. It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge). Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions. The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table. If there
is a matching flow, it executes the associated actions. If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
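
The lookup-then-miss behavior described above can be modeled in a few
lines. This is only an illustrative sketch, not the kernel module's
implementation: the ``Datapath`` class, the ``extract_key`` helper, the
choice of key fields, and the ``"output:2"`` action string are all
hypothetical names invented for the example.

```python
from collections import deque


def extract_key(in_port, packet):
    # Hypothetical flow key: the input vport plus two header fields.
    return (in_port, packet["eth_src"], packet["eth_dst"])


class Datapath:
    """Toy model of one datapath: a flow table plus a queue of misses."""

    def __init__(self):
        self.flow_table = {}    # flow key -> list of actions
        self.upcalls = deque()  # (key, packet) pairs waiting for userspace

    def receive(self, in_port, packet):
        key = extract_key(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # No matching flow: queue the packet and its key for userspace.
            self.upcalls.append((key, packet))
            return None
        # Matching flow: the associated actions would be executed here.
        return actions


dp = Datapath()
pkt = {"eth_src": "e0:91:f5:21:d0:b2", "eth_dst": "00:02:e3:0f:80:a4"}

first = dp.receive(1, pkt)          # miss: queued for userspace
key, _ = dp.upcalls.popleft()
dp.flow_table[key] = ["output:2"]   # userspace sets up a flow
second = dp.receive(1, pkt)         # hit: handled without userspace
```

After the flow is installed, later packets with the same key no longer
reach the miss queue, which is the point of letting userspace set up
flows.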


Flow key compatibility
----------------------

Network protocols evolve over time. New protocols become important
and existing protocols lose their prominence. For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key. It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete. Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet. Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.
      Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel. This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct. (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes. Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received. Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.
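
Viewing a flow key as a collection of named attributes makes the three
cases under "Flow key compatibility" easy to sketch. The code below is
a simplification under stated assumptions: keys are modeled as Python
dicts rather than Netlink attributes, only attribute names are
compared, and the ``classify`` function and its return strings are
hypothetical. A key where each side has attributes the other lacks is
not covered by the text; the sketch conservatively treats it like the
third case.

```python
def classify(kernel_key, user_key):
    """Compare the kernel-provided key with userspace's own parse."""
    kernel_attrs = set(kernel_key)
    user_attrs = set(user_key)
    if kernel_attrs == user_attrs:
        # Same notion of the flow key: nothing special is necessary.
        return "same"
    if user_attrs < kernel_attrs:
        # Kernel parsed further: install a flow with the kernel's key.
        return "kernel-has-more"
    # Userspace parsed further (assumption: also used for mixed cases):
    # forward manually, or install only if the extras cannot matter.
    return "userspace-has-more"


kernel_key = {"in_port": 1, "eth": "...", "eth_type": 0x86DD, "ipv6": "..."}
old_user_key = {"in_port": 1, "eth": "...", "eth_type": 0x86DD}

case = classify(kernel_key, old_user_key)
reverse_case = classify(old_user_key, kernel_key)
```

Here ``case`` corresponds to the second bullet (the kernel decoded IPv6
but userspace stopped at the Ethernet type) and ``reverse_case`` to the
third.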

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that need to be processed by the user space
program.
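
The mask semantics reduce to a bitwise comparison. The following
sketch assumes a single field held as an integer; the function names
``matches`` and ``covered_exact_flows`` and the choice of a 16-bit TCP
destination port are hypothetical, chosen only to illustrate the rule.

```python
def matches(packet_bits, key_bits, mask_bits):
    """True if the packet agrees with the key on every '1' bit of the mask."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)


def covered_exact_flows(mask_bits, width):
    """Number of exact match values one wildcarded field stands for."""
    dont_care_bits = width - bin(mask_bits).count("1")
    return 1 << dont_care_bits


key = 80          # e.g. a TCP destination port in the flow key
mask = 0xFFF0     # low four bits are don't care

hit = matches(81, key, mask)    # differs from the key only in masked-out bits
miss = matches(96, key, mask)   # differs in a bit the mask requires to match
group_size = covered_exact_flows(mask, 16)
```

With four don't care bits, one wildcarded flow stands in for sixteen
exact match values of this field, which is where the reduction in flow
setups comes from.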

Support for the mask Netlink attribute is optional for both the kernel and user
space program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to less than
what was specified by the user space program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined.
It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes. It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets. This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes. A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute. Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them. (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc. This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always have all packets with those values sent to userspace and
handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type. The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way. This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
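
The last rule is what lets userspace treat a flow key as an opaque
blob. As a minimal sketch, the example below counts upcalls per flow by
using the raw key bytes directly as a dictionary key, without parsing
them. The byte strings are stand-ins invented for the example, not real
Netlink attribute encodings, and ``upcalls_per_flow`` is a hypothetical
name.

```python
from collections import Counter

# Stand-in byte strings. Real keys would be the raw attribute bytes
# received from the kernel, which this sketch never interprets.
key_a = bytes.fromhex("0800060001")
key_b = bytes.fromhex("86dd3a0002")

upcalls_per_flow = Counter()
for raw_key in (key_a, key_b, key_a):
    # Hashing and equality work on the whole blob. This is only sound
    # because the kernel always composes a given key the same way.
    upcalls_per_flow[raw_key] += 1
```

If the kernel were free to reorder attributes between messages, the two
occurrences of the same flow could hash differently and this shortcut
would no longer work.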