.. SPDX-License-Identifier: GPL-2.0

====================================
Netfilter's flowtable infrastructure
====================================

This documentation describes the Netfilter flowtable infrastructure, which
allows you to define a fastpath through the flowtable datapath. This
infrastructure also provides hardware offload support. The flowtable supports
the layer 3 IPv4 and IPv6 protocols and the layer 4 TCP and UDP protocols.

Overview
--------

Once the first packet of the flow successfully goes through the IP forwarding
path, from the second packet on, you might decide to offload the flow to the
flowtable through your ruleset. The flowtable infrastructure provides a rule
action that allows you to specify when to add a flow to the flowtable.

A packet that finds a matching entry in the flowtable (i.e. flowtable hit) is
transmitted to the output netdevice via neigh_xmit(), hence, packets bypass the
classic IP forwarding path (the visible effect is that you do not see these
packets from any of the Netfilter hooks coming after ingress). If there is no
matching entry in the flowtable (i.e. flowtable miss), the packet follows the
classic IP forwarding path.

The flowtable uses a resizable hashtable. Lookups are based on the following
n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
source and destination, layer 4 source and destination ports and the input
interface (useful in case there are several conntrack zones in place).

The 'flow add' action allows you to populate the flowtable; the user
selectively specifies which flows are placed into the flowtable. Hence, packets
follow the classic IP forwarding path unless the user explicitly instructs
flows to use this new alternative forwarding path via policy.

The flowtable datapath is represented in Fig.1, which describes the classic IP
forwarding path including the Netfilter hooks and the flowtable fastpath bypass.

::

                                         userspace process
                                         ^              |
                                         |              |
                                    _____|____      ____\/___
                                   /          \    /         \
                                   |  input   |    | output  |
                                   \__________/    \_________/
                                         ^              |
                                         |              |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                'flow offload' rule                        |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions

The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that is specified from the classic IP
forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
traffic is passed up to follow the classic IP forwarding path given that the
transport header is missing; in this case, flowtable lookups are not possible.
TCP RST and FIN packets are also passed up to the classic IP forwarding path to
release the flow gracefully. Packets that exceed the MTU are also passed up to
the classic forwarding path to report packet-too-big ICMP errors to the sender.

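
The per-packet decision described in this section can be summarised with a
small Python model. This is only an illustrative sketch, not the kernel
implementation: the ``Packet`` and ``FlowEntry`` classes, their field names and
the ``datapath()`` helper are hypothetical, and the lookup key is a simplified
stand-in for the n-tuple selectors listed above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Packet:
    # Hypothetical, simplified packet view; not kernel identifiers.
    l3proto: str                      # "ipv4" or "ipv6"
    l4proto: str                      # "tcp" or "udp"
    src: str
    dst: str
    sport: int
    dport: int
    iif: str                          # input interface
    length: int
    ttl: int
    encap_id: Optional[int] = None    # VLAN ID or PPPoE session ID, if any
    is_fragment: bool = False
    tcp_fin: bool = False
    tcp_rst: bool = False


@dataclass
class FlowEntry:
    out_dev: str                      # egress netdevice cached in the entry
    mtu: int


def flow_key(pkt: Packet) -> tuple:
    """Build the lookup key from the n-tuple selectors."""
    return (pkt.l3proto, pkt.l4proto, pkt.src, pkt.dst,
            pkt.sport, pkt.dport, pkt.iif, pkt.encap_id)


def datapath(pkt: Packet, flowtable: Dict[tuple, FlowEntry]) -> str:
    """Return "fastpath" or "classic" for a single packet."""
    if pkt.is_fragment:
        return "classic"              # no transport header, no lookup possible
    entry = flowtable.get(flow_key(pkt))
    if entry is None:
        return "classic"              # flowtable miss
    if pkt.l4proto == "tcp" and (pkt.tcp_fin or pkt.tcp_rst):
        return "classic"              # let the flow be released gracefully
    if pkt.length > entry.mtu:
        return "classic"              # classic path reports packet-too-big
    pkt.ttl -= 1                      # TTL is decremented before transmission
    return "fastpath"
```

In this model, a packet only takes the fastpath when an entry already exists
for its key and none of the pass-up conditions apply; everything else falls
back to the classic IP forwarding path.
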

Example configuration
---------------------

Enabling the flowtable bypass is relatively easy: you only need to create a
flowtable and add one rule to your forward chain::

	table inet x {
		flowtable f {
			hook ingress priority 0; devices = { eth0, eth1 };
		}
		chain y {
			type filter hook forward priority 0; policy accept;
			ip protocol tcp flow add @f
			counter packets 0 bytes 0
		}
	}

This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in which
hooks are run in the pipeline. This is convenient in case you already have an
nftables ingress chain: make sure the flowtable priority is smaller than the
priority of the nftables ingress chain, so that the flowtable runs first in the
pipeline.

The 'flow add' action from the forward chain 'y' adds an entry to the
flowtable for the TCP syn-ack packet coming in the reply direction. Once the
flow is offloaded, you will observe that the counter rule in the example above
does not get updated for the packets that are being forwarded through the
forwarding bypass.

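
The priority rule above can be illustrated with a few lines of Python. This is
a toy model only: the ``run_order()`` helper is hypothetical, and the priority
value 10 for the ingress chain is an assumption chosen for the example.

```python
def run_order(hooks):
    """Order (name, priority) pairs the way hooks registered on the same
    hook point are run: lower priority values first."""
    return [name for name, _prio in sorted(hooks, key=lambda h: h[1])]


# Hypothetical setup: flowtable 'f' at priority 0 and an nftables ingress
# chain at priority 10, both attached to the ingress hook.
hooks = [("nftables ingress chain", 10), ("flowtable f", 0)]
print(run_order(hooks))
```

With these values the flowtable is listed first, which is the ordering the
paragraph above asks for.
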

You can identify offloaded flows through the [OFFLOAD] tag when listing your
connection tracking table.

::

	# conntrack -L
	tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2


Layer 2 encapsulation
---------------------

Since Linux kernel 5.13, the flowtable infrastructure discovers the real
netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
flowtable datapath also deals with layer 2 decapsulation.

You do not need to add the PPPoE and the VLAN devices to your flowtable;
the real device is sufficient for the flowtable to track your flows.

Bridge and IP forwarding
------------------------

Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
flowtable infrastructure discovers the topology behind the bridge device. This
allows the flowtable to define a fastpath bypass between the bridge ports
(represented as eth1 and eth2 in the example figure below) and the gateway
device (represented as eth0) in your switch/router.

::

                    fastpath bypass
              .-------------------------.
             /                           \
             |           IP forwarding   |
             |          /             \ \/
             |       br0               eth0 ..... eth0
             .       / \                          *host B*
              -> eth1  eth2
             .           *switch/router*
             .
             .
           eth0
         *host A*

The flowtable infrastructure also supports bridge VLAN filtering actions
such as PVID and untagged. You can also stack a classic VLAN device on top of
your bridge port.

If you would like your flowtable to define a fastpath between your bridge
ports and your IP forwarding path, you have to add your bridge ports (as
represented by the real netdevice) to your flowtable definition.

Counters
--------

The flowtable can synchronize packet and byte counters with the existing
connection tracking entry by specifying the counter statement in your flowtable
definition, e.g.

::

	table inet x {
		flowtable f {
			hook ingress priority 0; devices = { eth0, eth1 };
			counter
		}
	}

Counter support is available since Linux kernel 5.7.

Hardware offload
----------------

If your network device provides hardware offload support, you can turn it on by
means of the 'offload' flag in your flowtable definition, e.g.

::

	table inet x {
		flowtable f {
			hook ingress priority 0; devices = { eth0, eth1 };
			flags offload;
		}
	}

There is a workqueue that adds the flows to the hardware. Note that a few
packets might still run over the flowtable software path until the workqueue has
a chance to offload the flow to the network device.

You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
listing your connection tracking table. Note that the two tags are distinct:
[OFFLOAD] refers to the software flowtable fastpath, while [HW_OFFLOAD] refers
to the hardware offload datapath being used by the flow.

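
The distinction between the two tags can be checked mechanically when
post-processing ``conntrack -L`` output. The following Python sketch is
illustrative only: the ``offload_mode()`` helper is hypothetical and assumes
the tags appear literally as [OFFLOAD] and [HW_OFFLOAD] in each line, as in
the listing shown earlier in this document.

```python
import re

# Match either tag as a whole bracketed token.
_TAG_RE = re.compile(r"\[(HW_OFFLOAD|OFFLOAD)\]")


def offload_mode(line: str) -> str:
    """Classify one conntrack line as "hardware", "software" or "none"."""
    match = _TAG_RE.search(line)
    if match is None:
        return "none"                 # flow uses the classic forwarding path
    if match.group(1) == "HW_OFFLOAD":
        return "hardware"             # hardware offload datapath
    return "software"                 # software flowtable fastpath


sample = ("tcp 6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 "
          "src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 "
          "[OFFLOAD] mark=0 use=2")
print(offload_mode(sample))
```
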

The flowtable hardware offload infrastructure also supports DSA
(Distributed Switch Architecture).

Limitations
-----------

The flowtable behaves like a cache. The flowtable entries might get stale if
either the destination MAC address or the egress netdevice that is used for
transmission changes.

This might be a problem if:

- You run the flowtable in software mode and you combine bridge and IP
  forwarding in your setup.
- Hardware offload is enabled.

More reading
------------

This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also wrote a comprehensive summary called "A state of network acceleration"
that describes how things were before this infrastructure was mainlined [3]_,
and it also gives a rough summary of this work [4]_.

.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html