.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.  If there
is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
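
The receive path described above can be sketched in a few lines of Python.
This is only an illustrative model, not the kernel implementation: the
``Datapath`` class, the ``extract_key`` helper, and the action strings are
hypothetical names, and the toy key covers only the input port and the
Ethernet header.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified flow key: a hashable tuple of (attribute, value) pairs.
FlowKey = Tuple[Tuple[str, object], ...]


def extract_key(in_port: int, frame: bytes) -> FlowKey:
    """Toy key extraction: metadata plus the Ethernet header only."""
    dst, src = frame[0:6].hex(), frame[6:12].hex()
    eth_type = int.from_bytes(frame[12:14], "big")
    return (("in_port", in_port), ("eth", (src, dst)), ("eth_type", eth_type))


class Datapath:
    def __init__(self) -> None:
        self.flow_table: Dict[FlowKey, List[str]] = {}      # populated by userspace
        self.executed: List[Tuple[List[str], bytes]] = []   # stand-in for running actions
        self.upcalls: List[Tuple[FlowKey, bytes]] = []      # packets queued to userspace

    def receive(self, in_port: int, frame: bytes) -> None:
        key = extract_key(in_port, frame)
        actions = self.flow_table.get(key)
        if actions is not None:
            # Matching flow: execute the associated actions in the "kernel".
            self.executed.append((actions, frame))
        else:
            # No match: queue the packet, together with its key, to userspace.
            self.upcalls.append((key, frame))
```

In this model, userspace would typically react to an entry in ``upcalls`` by
storing actions under the reported key, so that later packets with the same
key take the first branch.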


Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.  Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.  (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)
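
The three cases above amount to a small decision procedure.  The following
Python sketch is one possible way to express it, assuming flow keys are
modelled as dictionaries from attribute name to value; ``plan_for_miss`` and
its return strings are hypothetical names.  A userspace key that contains any
attribute the kernel key lacks is conservatively treated as the third case.

```python
def plan_for_miss(kernel_key: dict, user_key: dict,
                  extra_fields_matter: bool = True) -> str:
    """Decide how userspace handles a packet that missed the kernel flow table."""
    kernel_attrs, user_attrs = set(kernel_key), set(user_key)
    if user_attrs <= kernel_attrs:
        # Cases 1 and 2: the kernel parsed at least as much as userspace.
        # Install a flow, always using the kernel-provided key.
        return "install-with-kernel-key"
    if not extra_fields_matter:
        # Userspace parsed more, but the extra fields do not affect forwarding.
        return "install-with-kernel-key"
    # Case 3: userspace parsed more and the extra fields matter.
    return "forward-manually"
```

For example, with a kernel key that stops at ``eth_type`` and a userspace key
that also contains ``ipv6``, the function returns ``"forward-manually"``
unless the caller states that the IPv6 fields are irrelevant to forwarding.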

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.  For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
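
To make the informal notation concrete, here is a small Python sketch that
renders a nested attribute list in this comma-separated form.  The in-memory
representation (a list of name/value pairs, where a value is a dict of
arguments, a nested list, or a scalar) and the ``format_key`` function are
assumptions made for illustration; they are not part of the Netlink interface.

```python
def format_key(attrs) -> str:
    """Render a list of (name, value) attributes in the informal notation."""
    parts = []
    for name, value in attrs:
        if isinstance(value, dict):
            # Named arguments, e.g. tcp(src=49163, dst=80)
            inner = ", ".join(f"{k}={v}" for k, v in value.items())
        elif isinstance(value, list):
            # Nested attributes, e.g. encap(eth_type(0x0800))
            inner = format_key(value)
        else:
            # Single unnamed argument, e.g. in_port(1)
            inner = str(value)
        parts.append(f"{name}({inner})")
    return ", ".join(parts)


example = [
    ("in_port", 1),
    ("eth_type", "0x0800"),
    ("tcp", {"src": 49163, "dst": 80}),
]
print(format_key(example))
```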


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and
an optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that the user space program needs to
process.
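
The bitwise rule above can be written down directly.  The following Python
sketch treats a single key field as an integer; the ``matches`` function and
the choice of an IPv4 destination address as the example field are
illustrative assumptions.

```python
import ipaddress


def matches(packet_bits: int, key: int, mask: int) -> bool:
    """True if the packet agrees with the key on every bit set in the mask."""
    return (packet_bits & mask) == (key & mask)


def ip(addr: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(addr))


key = ip("172.16.0.20")
mask = 0xFFFFFF00   # exact match on the top 24 bits, don't care on the low 8
```

With this mask, every address in 172.16.0.0/24 matches the single wildcarded
flow, where an exact match flow table would need one flow per address.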

Support for the mask Netlink attribute is optional for both the kernel and user
space program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to less than
what was specified by the user space program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.
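
A user space program can check for overlap itself before installing a flow.
The Python sketch below is one way to do so, assuming each flow is reduced to
an integer key and an integer mask of equal width; it is not the kernel's
best-effort detection, and ``overlaps`` is a hypothetical name.  Two flows
overlap exactly when their keys agree on every bit that both masks require,
because all remaining bits of a packet can then be chosen to satisfy both.

```python
def overlaps(key_a: int, mask_a: int, key_b: int, mask_b: int) -> bool:
    """True if some packet could match both wildcarded flows."""
    common = mask_a & mask_b            # bits that both flows match exactly
    return ((key_a ^ key_b) & common) == 0
```

An exact match flow is the special case where the mask has every bit set, so
the same check covers the "wildcarded or not" requirement.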


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.
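
As a rough model of the two identification schemes, the Python sketch below
keeps one index for flows installed with a UFID and another for flows
identified by their original key.  The ``FlowTable`` class and its methods
are hypothetical, and generating the identifier with ``uuid.uuid4()`` is
only an assumption chosen for the example.

```python
import uuid


class FlowTable:
    """Toy model of flow identification (not of packet matching)."""

    def __init__(self) -> None:
        self.by_ufid = {}   # UFID -> (key, actions)
        self.by_key = {}    # original key -> actions

    def put(self, key, actions, ufid=None) -> None:
        if ufid is not None:
            # With a UFID, the flow need not be indexed by its original key.
            self.by_ufid[ufid] = (key, actions)
        else:
            self.by_key[key] = actions

    def delete(self, key=None, ufid=None) -> bool:
        """Remove a flow by UFID if given, otherwise by original key."""
        if ufid is not None:
            return self.by_ufid.pop(ufid, None) is not None
        return self.by_key.pop(key, None) is not None


table = FlowTable()
flow_id = uuid.uuid4().bytes
table.put("eth(...), eth_type(0x0800)", ["output:2"], ufid=flow_id)
```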


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences, so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype and then stopped parsing the
packet.  The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
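
The difference between the two encodings can be demonstrated with a toy
model of an application that predates VLAN support.  In the Python sketch
below, ``KNOWN_TO_OLD_APP`` and ``old_app_view`` are hypothetical names, and
flow keys are modelled as lists of name/value pairs; the application simply
skips any top-level attribute it does not recognize, as the compatibility
rules allow.

```python
# Attributes understood by a hypothetical application written before VLAN support.
KNOWN_TO_OLD_APP = {"eth", "eth_type", "ip", "tcp"}


def old_app_view(attrs) -> dict:
    """Return the top-level attributes the old application would act on."""
    return {name: value for name, value in attrs if name in KNOWN_TO_OLD_APP}


# Naive encoding: inner headers sit at the top level next to "vlan".
naive = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
         ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

# Nested encoding: inner headers are hidden inside "encap".
nested = [("eth", "..."), ("eth_type", 0x8100), ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")])]
```

Applied to ``naive``, the old application sees ``eth_type`` 0x0800 and an
``ip`` attribute and wrongly concludes that the flow carries plain IP.
Applied to ``nested``, it sees only ``eth`` and ``eth_type`` 0x8100, which is
exactly what it saw before VLAN support existed.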

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always arrange for all packets with those values to be sent to
userspace and handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.  The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
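
The VLAN case can be sketched as follows.  This Python model is illustrative
only: ``vlan_key`` is a hypothetical helper, the constant ``VLAN_CFI`` is
simply the CFI bit (0x1000) of the 802.1Q TCI, and the handling of a frame
that carries a TCI but no inner Ethernet type is an assumption made to keep
the example short.

```python
VLAN_CFI = 0x1000   # CFI bit of the 16-bit 802.1Q TCI


def vlan_key(frame: bytes):
    """Build the eth_type/vlan/encap part of a toy flow key for one frame."""
    eth_type = int.from_bytes(frame[12:14], "big")
    if eth_type != 0x8100:
        return [("eth_type", eth_type)]
    if len(frame) < 16:
        # Truncated right after the TPID: all-zero vlan and empty encap.
        return [("eth_type", 0x8100), ("vlan", 0), ("encap", [])]
    # A TCI is present: set the CFI bit so that even an all-zero TCI
    # differs from the "missing TCI" value above.
    tci = int.from_bytes(frame[14:16], "big") | VLAN_CFI
    encap = []
    if len(frame) >= 18:
        encap.append(("eth_type", int.from_bytes(frame[16:18], "big")))
    return [("eth_type", 0x8100), ("vlan", tci), ("encap", encap)]
```

A frame whose TCI is genuinely all zeros therefore yields a vlan value of
0x1000, while a frame truncated before the TCI yields 0, so the two cases
never share a flow key.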

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
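
Two of these rules lend themselves to a short illustration.  In the Python
sketch below, ``has_duplicates`` checks the first rule on a toy key modelled
as a list of name/value pairs, and ``key_digest`` shows how userspace could
hash the raw bytes of a kernel-provided key without interpreting them.  Both
function names are hypothetical, and SHA-256 is only an example choice of
hash.

```python
import hashlib


def has_duplicates(attrs) -> bool:
    """True if any nesting level contains the same attribute name twice."""
    names = [name for name, _ in attrs]
    if len(names) != len(set(names)):
        return True
    # The same name may reappear at a deeper level, e.g. eth_type inside encap.
    return any(has_duplicates(value) for _, value in attrs
               if isinstance(value, list))


def key_digest(raw_key: bytes) -> str:
    """Hash an opaque flow key exactly as received from the kernel."""
    return hashlib.sha256(raw_key).hexdigest()
```

Hashing the raw bytes only works because the kernel always composes a given
key the same way; userspace never needs to canonicalize the attribute order
itself.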