.. SPDX-License-Identifier: GPL-2.0

====================================
Netfilter's flowtable infrastructure
====================================

This documentation describes the Netfilter flowtable infrastructure, which allows
you to define a fastpath through the flowtable datapath. This infrastructure
also provides hardware offload support. The flowtable supports the layer 3
IPv4 and IPv6 protocols and the layer 4 TCP and UDP protocols.

Overview
--------

Once the first packet of a flow has successfully gone through the IP forwarding
path, you can offload the flow to the flowtable through your ruleset from the
second packet on. The flowtable infrastructure provides a rule action that
allows you to specify when to add a flow to the flowtable.

A packet that finds a matching entry in the flowtable (i.e. a flowtable hit) is
transmitted to the output netdevice via neigh_xmit(). Hence, such packets bypass
the classic IP forwarding path (the visible effect is that you do not see these
packets from any of the Netfilter hooks coming after ingress). If there is no
matching entry in the flowtable (i.e. a flowtable miss), the packet follows the
classic IP forwarding path.

The flowtable uses a resizable hashtable. Lookups are based on the following
n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
source and destination addresses, layer 4 source and destination ports, and the
input interface (useful in case there are several conntrack zones in place).

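The hit/miss behaviour and the lookup key described above can be modelled with
a minimal sketch. This is only an illustration, not the kernel implementation:
the names ``FlowKey``, ``Flowtable`` and ``forward`` are hypothetical, and a
plain dict stands in for the resizable hashtable.

```python
from typing import NamedTuple, Optional


class FlowKey(NamedTuple):
    """Hypothetical lookup key mirroring the n-tuple selectors."""
    l2_encap: tuple   # e.g. (("vlan", 100),) or () when there is no encapsulation
    l3_src: str
    l3_dst: str
    l4_proto: str     # "tcp" or "udp"
    l4_sport: int
    l4_dport: int
    iif: str          # input interface


class Flowtable:
    """Toy flowtable: a dict stands in for the resizable hashtable."""

    def __init__(self) -> None:
        self._entries = {}

    def add(self, key: FlowKey, out_dev: str) -> None:
        self._entries[key] = out_dev

    def lookup(self, key: FlowKey) -> Optional[str]:
        # Returns the output device on a hit, None on a miss.
        return self._entries.get(key)


def forward(table: Flowtable, key: FlowKey) -> str:
    out_dev = table.lookup(key)
    if out_dev is None:
        # Flowtable miss: the packet follows the classic forwarding path.
        return "classic forwarding path"
    # Flowtable hit: the packet bypasses the classic forwarding path.
    return f"fastpath via {out_dev}"
```

Because the input interface is part of the key, the same addresses and ports
arriving on a different interface result in a miss in this model.
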
The 'flow add' action allows you to populate the flowtable: the user selectively
specifies which flows are placed into the flowtable. Hence, packets follow the
classic IP forwarding path unless the user explicitly instructs flows to use
this alternative forwarding path via policy.

The flowtable datapath is represented in Fig.1, which describes the classic IP
forwarding path including the Netfilter hooks and the flowtable fastpath bypass.

::

					 userspace process
					  ^              |
					  |              |
				     _____|____     ____\/___
				    /          \   /         \
				    |   input   |  |  output  |
				    \__________/   \_________/
					 ^               |
					 |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                 'flow offload' rule                       |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

	       Fig.1 Netfilter hooks and flowtable interactions

The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that is specified from the classic IP
forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
traffic is passed up to follow the classic IP forwarding path: the transport
header is missing, so flowtable lookups are not possible. TCP RST and FIN
packets are also passed up to the classic IP forwarding path to release the flow
gracefully. Packets that exceed the MTU are also passed up to the classic
forwarding path to report packet-too-big ICMP errors to the sender.

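The cases in which a packet is passed up to the classic forwarding path can be
summarised in a short sketch. It is a simplified model under stated assumptions,
not kernel code: the ``Packet`` type and the ``must_pass_up`` helper are
hypothetical, and the MTU check ignores any special handling the kernel may
apply.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical, simplified view of a forwarded packet."""
    length: int                         # total packet length in bytes
    is_fragment: bool = False
    l4_proto: str = "tcp"
    tcp_flags: frozenset = frozenset()  # e.g. frozenset({"FIN", "ACK"})


def must_pass_up(pkt: Packet, egress_mtu: int) -> bool:
    """Return True if the packet should take the classic forwarding path."""
    if pkt.is_fragment:
        # No transport header is available, so no flowtable lookup is possible.
        return True
    if pkt.l4_proto == "tcp" and pkt.tcp_flags & {"RST", "FIN"}:
        # Let the classic path release the flow gracefully.
        return True
    if pkt.length > egress_mtu:
        # The classic path reports the packet-too-big ICMP error.
        return True
    return False
```

For instance, a 1600-byte packet towards a device with an MTU of 1500 is passed
up in this model, while a plain 1500-byte TCP ACK stays on the fastpath.
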
Example configuration
---------------------

Enabling the flowtable bypass is relatively easy: you only need to create a
flowtable and add one rule to your forward chain::

	table inet x {
		flowtable f {
			hook ingress priority 0; devices = { eth0, eth1 };
		}
		chain y {
			type filter hook forward priority 0; policy accept;
			ip protocol tcp flow add @f
			counter packets 0 bytes 0
		}
	}

This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in which
hooks are run in the pipeline. This is convenient in case you already have an
nftables ingress chain: make sure the flowtable priority is smaller than that of
the nftables ingress chain so that the flowtable runs first in the pipeline.

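The effect of the priority on hook ordering can be illustrated with a tiny
sketch. The registrations below are hypothetical (the priority 10 for the
ingress chain is an assumed value); the only point is that the smaller priority
runs first.

```python
# Hypothetical ingress hook registrations as (priority, name) pairs.
hooks = [
    (10, "nftables ingress chain"),
    (0, "flowtable f"),
]


def run_order(registrations):
    """Return hook names in execution order: smaller priority runs first."""
    ordered = sorted(registrations, key=lambda reg: reg[0])
    return [name for _priority, name in ordered]


print(run_order(hooks))
```
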
The 'flow add' action from the forward chain 'y' (labelled 'flow offload' rule
in Fig.1) adds an entry to the flowtable for the TCP SYN-ACK packet coming in
the reply direction. Once the flow is offloaded, you will observe that the
counter rule in the example above does not get updated for the packets that are
being forwarded through the forwarding bypass.

You can identify offloaded flows through the [OFFLOAD] tag when listing your
connection tracking table.

::

	# conntrack -L
	tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2


Layer 2 encapsulation
---------------------

Since Linux kernel 5.13, the flowtable infrastructure discovers the real
netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
VLAN ID / PPPoE session ID, which are used for the flowtable lookups. The
flowtable datapath also deals with layer 2 decapsulation.

You do not need to add the PPPoE and the VLAN devices to your flowtable; the
real device is sufficient for the flowtable to track your flows.

Bridge and IP forwarding
------------------------

Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
flowtable infrastructure discovers the topology behind the bridge device. This
allows the flowtable to define a fastpath bypass between the bridge ports
(represented as eth1 and eth2 in the example figure below) and the gateway
device (represented as eth0) in your switch/router.

::

                      fastpath bypass
               .-------------------------.
              /                           \
              |           IP forwarding   |
              |          /             \ \/
              |       br0               eth0 ..... eth0
              .       / \                          *host B*
               -> eth1  eth2
                   .           *switch/router*
                   .
                   .
                 eth0
               *host A*

The flowtable infrastructure also supports bridge VLAN filtering actions
such as PVID and untagged. You can also stack a classic VLAN device on top of
your bridge port.

If you want your flowtable to define a fastpath between your bridge ports and
your IP forwarding path, you have to add your bridge ports (as represented by
the real netdevice) to your flowtable definition.

Counters
--------

The flowtable can synchronize packet and byte counters with the existing
connection tracking entry by specifying the counter statement in your flowtable
definition, e.g.

::

	table inet x {
		flowtable f {
			hook ingress priority 0; devices = { eth0, eth1 };
			counter
		}
	}

Counter support is available since Linux kernel 5.7.

Hardware offload
----------------

If your network device provides hardware offload support, you can turn it on by
means of the 'offload' flag in your flowtable definition, e.g.

::

	table inet x {
		flowtable f {
			hook ingress priority 0; devices = { eth0, eth1 };
			flags offload;
		}
	}

A workqueue adds the flows to the hardware. Note that a few packets might still
run over the flowtable software path until the workqueue has a chance to offload
the flow to the network device.

You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
listing your connection tracking table. Note the distinction between the two
tags: [OFFLOAD] refers to the software flowtable fastpath, whereas [HW_OFFLOAD]
refers to the hardware offload datapath being used by the flow.

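As a rough illustration of how the two tags can be told apart when
post-processing the output of ``conntrack -L``, consider the sketch below. The
helper ``offload_mode`` is hypothetical, and the [HW_OFFLOAD] sample line is
made up for the example by swapping the tag in the [OFFLOAD] line shown in this
document.

```python
def offload_mode(conntrack_line: str) -> str:
    """Classify one line of `conntrack -L` output by its offload tag."""
    if "[HW_OFFLOAD]" in conntrack_line:
        return "hardware"
    if "[OFFLOAD]" in conntrack_line:
        return "software"
    return "none"


# Sample line taken from this document (software fastpath).
sw_line = (
    "tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 "
    "src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 "
    "[OFFLOAD] mark=0 use=2"
)
# Hypothetical line: same flow with the hardware offload tag instead.
hw_line = sw_line.replace("[OFFLOAD]", "[HW_OFFLOAD]")

for line in (sw_line, hw_line):
    print(offload_mode(line))
```
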
The flowtable hardware offload infrastructure also supports DSA
(Distributed Switch Architecture).

Limitations
-----------

The flowtable behaves like a cache. The flowtable entries might get stale if
either the destination MAC address or the egress netdevice that is used for
transmission changes.

This might be a problem if:

- You run the flowtable in software mode and you combine bridge and IP
  forwarding in your setup.
- Hardware offload is enabled.

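The cache-like behaviour can be pictured with a small model. Everything here is
hypothetical (the ``NextHop`` type, the flow identifier and the MAC address);
it only shows that an entry cached at offload time no longer matches once the
egress netdevice changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NextHop:
    dst_mac: str
    egress_dev: str


# Hypothetical state: what was cached when the flow was offloaded, and what
# the classic forwarding path would resolve right now.
cached = {"flow-1": NextHop("aa:bb:cc:00:00:01", "eth1")}
current = {"flow-1": NextHop("aa:bb:cc:00:00:01", "eth1")}


def is_stale(flow_id: str) -> bool:
    """A cached entry is stale once it differs from the current next hop."""
    return cached[flow_id] != current[flow_id]
```

If, for example, the destination host moves from eth1 to eth2, updating
``current["flow-1"]`` accordingly makes ``is_stale("flow-1")`` return True in
this model, while the cached entry still points at eth1.
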
More reading
------------

This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also wrote a comprehensive summary called "A state of network acceleration"
that describes how things were before this infrastructure was mainlined [3]_,
and a rough summary of this work [4]_.

.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html