.. SPDX-License-Identifier: GPL-2.0

=========================
Resilient Next-hop Groups
=========================

Resilient groups are a type of next-hop group that is aimed at minimizing
disruption in flow routing across changes to the group composition and
weights of constituent next hops.

The idea behind resilient hashing groups is best explained in contrast to
the legacy multipath next-hop group, which uses the hash-threshold
algorithm, described in RFC 2992.

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus::

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.
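
The effect can be reproduced with a small model. The following is only a
sketch of the hash-threshold idea, assuming equal weights and a toy hash
space of 100 values; ``HASH_SPACE`` and both function names are hypothetical
and do not correspond to kernel symbols.

```python
HASH_SPACE = 100  # toy hash space (assumption); real hashes are much wider


def hash_threshold_select(next_hops, skb_hash):
    """Return the next hop whose contiguous hash range contains skb_hash."""
    region = HASH_SPACE / len(next_hops)  # equal-sized range per next hop
    index = min(int(skb_hash / region), len(next_hops) - 1)
    return next_hops[index]


def moved_hashes(before, after):
    """Hashes that change next hop although their old next hop still exists."""
    moved = []
    for h in range(HASH_SPACE):
        old = hash_threshold_select(before, h)
        new = hash_threshold_select(after, h)
        if old in after and old != new:
            moved.append(h)
    return moved


before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]  # next hop 3 deleted
moved = moved_hashes(before, after)
```

In this model ``moved`` holds the hashes that next hop 2 gave up to next hop
1 and that next hop 4 gave up to next hop 5, even though neither 2 nor 4 was
removed.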

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment causes packets from a single flow to suddenly
arrive at a server that does not expect them. This can result in TCP
connections being reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
degraded application performance.

To mitigate the above-mentioned flow redirection, resilient next-hop groups
insert another layer of indirection between the hash space and its
constituent next hops: a hash table. The selection algorithm uses the SKB
hash to choose a hash table bucket, then reads the next hop that this bucket
contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, the mapping between the hash table buckets
and the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next hops::

            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                             v v v v
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

            Before and after deletion of next hop 3
            under the resilient hashing algorithm.
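
The bucket indirection can be modeled in a few lines. This is a minimal
sketch and not the kernel implementation: mapping the hash to a bucket with a
modulo and handing out replacement next hops round-robin are both
assumptions, and the function names are hypothetical.

```python
from itertools import cycle


def select_next_hop(table, skb_hash):
    """Pick a bucket from the hash, then read the next hop stored in it."""
    return table[skb_hash % len(table)]  # modulo mapping is an assumption


def delete_next_hop(table, removed, remaining):
    """Return a new table where only the buckets of `removed` are reassigned."""
    replacements = cycle(remaining)  # round-robin choice is an assumption
    return [next(replacements) if nh == removed else nh for nh in table]


before = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
after = delete_next_hop(before, removed=3, remaining=[1, 2, 4, 5])
changed = [i for i in range(len(before)) if before[i] != after[i]]
```

Only the four buckets that held next hop 3 appear in ``changed``; every flow
that hashed to any other bucket keeps its next hop.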

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
continue to be forwarded to the same endpoints through the same paths as
before the next-hop group change.

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
certain number of buckets, according to its weight and the number of
buckets in the hash table. In accordance with the source code, we will call
this number the "wants count" of a next hop. In case of an event that might
cause bucket allocation change, the wants counts for individual next hops
are updated.

Next hops that have fewer buckets than their wants count are called
"underweight". Those that have more are "overweight". If there are no
overweight (and therefore no underweight) next hops in the group, it is
said to be "balanced".
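
As an illustration, wants counts can be derived from the weights so that they
always add up to the table size. The cumulative rounding used below is one
plausible scheme and is an assumption of this sketch, as are the function
names and the "satisfied" label for a next hop that holds exactly its wants
count.

```python
def wants_counts(weights, num_buckets):
    """Split num_buckets among next hops in proportion to their weights."""
    total = sum(weights.values())
    counts = {}
    cumulative = 0
    prev_upper = 0
    for nh, weight in weights.items():
        cumulative += weight
        # Round the cumulative share to the nearest bucket boundary, so the
        # per-hop counts sum to num_buckets exactly.
        upper = (num_buckets * cumulative + total // 2) // total
        counts[nh] = upper - prev_upper
        prev_upper = upper
    return counts


def classify(wants, has):
    """Label each next hop by comparing held buckets to its wants count."""
    labels = {}
    for nh, want in wants.items():
        held = has.get(nh, 0)
        if held < want:
            labels[nh] = "underweight"
        elif held > want:
            labels[nh] = "overweight"
        else:
            labels[nh] = "satisfied"
    return labels


wants = wants_counts({1: 3, 2: 1}, num_buckets=8)
labels = classify(wants, has={1: 4, 2: 4})
```

With weights 3 and 1 over 8 buckets, next hop 1 wants 6 buckets and next hop
2 wants 2. A table that still holds 4 buckets for each therefore has one
underweight and one overweight next hop.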

Each bucket maintains a last-used timer. Every time a packet is forwarded
through a bucket, this timer is updated to the current jiffies value. One
attribute of a resilient group is then the "idle timer", which is the
amount of time that a bucket must not be hit by traffic in order for it to
be considered "idle". Buckets that are not idle are busy.

After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

1) that have no assigned next hop, or
2) whose next hop has been removed, or
3) that are idle and whose next hop is overweight,

upkeep changes the next hop that the bucket references to one of the
underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled for a
future time.
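
A single upkeep pass over conditions 1 to 3 might look as follows. This is a
simplified model under stated assumptions, not the kernel code: time is a
plain number of seconds, the first underweight next hop is always chosen, and
a migrated bucket has its last-used time reset. ``Bucket`` and ``upkeep`` are
hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bucket:
    nh: Optional[int]  # next hop ID, or None when unassigned
    last_used: float   # time at which traffic last hit the bucket


def upkeep(buckets, wants, now, idle_timer):
    """Run one pass; return True if underweight next hops remain."""
    has = {nh: 0 for nh in wants}
    for b in buckets:
        if b.nh in has:
            has[b.nh] += 1
    for b in buckets:
        underweight = [nh for nh in wants if has[nh] < wants[nh]]
        if not underweight:
            break
        unassigned = b.nh is None                          # condition 1
        removed = b.nh is not None and b.nh not in wants   # condition 2
        idle = now - b.last_used >= idle_timer
        overweight = b.nh in has and has[b.nh] > wants[b.nh]
        if unassigned or removed or (idle and overweight):  # condition 3
            if b.nh in has:
                has[b.nh] -= 1
            b.nh = underweight[0]
            has[b.nh] += 1
            b.last_used = now  # assumption: a migrated bucket starts fresh
    return any(has[nh] < wants[nh] for nh in wants)


# Next hop 2 holds four buckets but wants two; only buckets 4 and 5 are idle.
buckets = [Bucket(1, 90.0) for _ in range(4)]
buckets += [Bucket(2, 0.0), Bucket(2, 0.0), Bucket(2, 90.0), Bucket(2, 90.0)]
needs_rerun = upkeep(buckets, wants={1: 6, 2: 2}, now=100.0, idle_timer=60.0)
```

Here the two idle buckets move to next hop 1 and the pass reports that no
further run is needed. Had all four buckets of next hop 2 been busy, nothing
would move and the function would return ``True``, which corresponds to
scheduling another upkeep run.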

There may not be enough "idle" buckets to satisfy the updated wants counts
of all next hops. Another attribute of a resilient group is the "unbalanced
timer". This timer can be set to 0, in which case the table will stay out
of balance until idle buckets do appear, possibly never. If set to a
non-zero value, the value represents the period of time that the table is
permitted to stay out of balance.

With this in mind, we update the above list of conditions with one more
item. Thus buckets:

4) whose next hop is overweight, and the amount of time that the table has
   been out of balance exceeds the unbalanced timer, if that is non-zero,

\... are migrated as well.
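
Taken together, the four conditions amount to a per-bucket predicate. The
sketch below restates them as a pure function; the parameter names are
hypothetical, and times are plain numbers in whatever unit the caller uses.

```python
def should_migrate(assigned, removed, idle, overweight,
                   unbalanced_since, now, unbalanced_timer):
    """Decide whether upkeep should move a bucket to an underweight next hop."""
    if not assigned or removed:
        return True   # conditions 1 and 2: the bucket has no usable next hop
    if not overweight:
        return False  # a next hop at or below its wants count keeps its buckets
    if idle:
        return True   # condition 3: idle bucket of an overweight next hop
    # Condition 4: a busy bucket moves only once the table has been out of
    # balance for longer than a non-zero unbalanced timer.
    if unbalanced_timer == 0 or unbalanced_since is None:
        return False
    return now - unbalanced_since > unbalanced_timer


# A busy bucket of an overweight next hop, unbalanced timer of 300.
early = should_migrate(True, False, False, True, 0, 299, 300)
late = should_migrate(True, False, False, True, 0, 301, 300)
never = should_migrate(True, False, False, True, 0, 10**6, 0)
```

With a timer of 300 the busy bucket stays put at 299 and is migrated at 301,
while a timer of 0 never forces the migration.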

Offloading & Driver Feedback
----------------------------

When offloading resilient groups, the algorithm that distributes buckets
among next hops is still the one in SW. Drivers are notified of updates to
next hop groups in the following three ways:

- Full group notification with the type
  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
  created and buckets populated for the first time.

- Single-bucket notifications of the type
  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which are used for notifications of
  individual migrations within an already-established group.

- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
  is sent before the group is replaced, and is a way for the driver to veto
  the group before committing anything to the HW.

Some single-bucket notifications are forced, as indicated by the "force"
flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

Non-forced notifications can be overridden by the driver by returning an
error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.

A second way for the HW to report that a bucket is busy is through the
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
as busy are treated as if traffic hit them.
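
The interplay between forced and non-forced notifications can be modeled as
below. This is a conceptual sketch only: the callback signature, the use of
``-EBUSY`` as the error code and the helper names are assumptions made for
illustration, not the kernel notifier API.

```python
import errno


def migrate_bucket(table, index, new_nh, driver_notify, force=False):
    """Try to move one bucket; return True if the table was changed."""
    err = driver_notify(index, table[index], new_nh, force)
    if err and not force:
        return False  # the driver vetoed a non-forced migration
    table[index] = new_nh
    return True


def make_driver(busy_in_hw):
    """Hypothetical driver callback that vetoes buckets the HW saw traffic on."""
    def notify(index, old_nh, new_nh, force):
        if index in busy_in_hw and not force:
            return -errno.EBUSY  # assumed error code
        return 0
    return notify


table = [1, 1, 2, 2]
notify = make_driver(busy_in_hw={3})
vetoed = migrate_bucket(table, 3, 1, notify)              # non-forced, busy in HW
moved = migrate_bucket(table, 2, 1, notify)               # non-forced, idle in HW
forced = migrate_bucket(table, 3, 1, notify, force=True)  # must be migrated
```

The first attempt leaves bucket 3 untouched because the model driver reports
it as busy; the forced attempt migrates it regardless.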

Offloaded buckets should be flagged as either "offload" or "trap". This is
done through the ``nexthop_bucket_set_hw_flags()`` API.

Netlink UAPI
------------

Resilient Group Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
same manner as other multipath groups. The following changes apply to the
attributes passed in the netlink message:

  =================== =========================================================
  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for a resilient group.
  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                      groups.
  =================== =========================================================

``NHA_RES_GROUP`` payload:

  =================================== =========================================
  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
  =================================== =========================================
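
Because the timers are expressed in clock_t, a user-space tool has to convert
from seconds before filling in the nest. The sketch below assumes 100 clock
ticks per second, which is the common value but should be queried at run time
(for example with ``os.sysconf("SC_CLK_TCK")``). The helper names are
hypothetical, and the dictionary only stands in for the netlink nest.

```python
USER_HZ = 100  # assumption: typical clock_t ticks per second


def seconds_to_clock_t(seconds, user_hz=USER_HZ):
    """Convert a duration in seconds to clock_t ticks."""
    return int(round(seconds * user_hz))


def res_group_payload(buckets, idle_seconds, unbalanced_seconds):
    """Build a plain-dict stand-in for the NHA_RES_GROUP nest."""
    return {
        "NHA_RES_GROUP_BUCKETS": buckets,
        "NHA_RES_GROUP_IDLE_TIMER": seconds_to_clock_t(idle_seconds),
        "NHA_RES_GROUP_UNBALANCED_TIMER": seconds_to_clock_t(unbalanced_seconds),
    }


payload = res_group_payload(buckets=8, idle_seconds=60, unbalanced_seconds=300)
```

Under the 100 ticks per second assumption, an idle timer of 60 seconds is
encoded as 6000 and an unbalanced timer of 300 seconds as 30000.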

Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
that the ``NHA_RES_GROUP`` payload also includes the following attribute:

  =================================== =========================================
  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long the resilient group has been out
                                      of balance, in units of clock_t.
  =================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized in get requests
are:

  =================== =========================================================
  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to the bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
  ======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized in dump
requests are:

  =================== =========================================================
  ``NHA_ID``          If specified, limits the dump to just the next-hop group
                      with this ID.
  ``NHA_OIF``         If specified, limits the dump to buckets that contain
                      next hops that use the device with this ifindex.
  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
                      next hops that use a device in the VRF with this ifindex.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to the bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                           that contain the next hop with this ID.
  ======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::

        # ip nexthop add id 1 via 192.0.2.2 dev eth0
        # ip nexthop add id 2 via 192.0.2.3 dev eth0
        # ip nexthop add id 10 group 1/2 type resilient \
                buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets
(which is an unusually low number, and used here for demonstration purposes
only), each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

Changing next-hop weights leads to a change in bucket allocation::

        # ip nexthop replace id 10 group 1,3/2 type resilient

This can be confirmed by looking at individual buckets::

        # ip nexthop bucket show id 10
        id 10 index 0 idle_time 5.59 nhid 1
        id 10 index 1 idle_time 5.59 nhid 1
        id 10 index 2 idle_time 8.74 nhid 2
        id 10 index 3 idle_time 8.74 nhid 2
        id 10 index 4 idle_time 8.74 nhid 1
        id 10 index 5 idle_time 8.74 nhid 1
        id 10 index 6 idle_time 8.74 nhid 1
        id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the next-hop replace command to satisfy the new demand
that next hop 1 be given 6 buckets instead of 4.
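
The same check can be scripted. The sketch below parses output in exactly the
key/value form shown above, tallies buckets per next hop, and reports the
buckets with the shortest idle time. It assumes every line consists only of
the four key/value pairs in the listing; extra tokens in real output would
need additional handling. The function name is hypothetical.

```python
from collections import Counter

SAMPLE = """\
id 10 index 0 idle_time 5.59 nhid 1
id 10 index 1 idle_time 5.59 nhid 1
id 10 index 2 idle_time 8.74 nhid 2
id 10 index 3 idle_time 8.74 nhid 2
id 10 index 4 idle_time 8.74 nhid 1
id 10 index 5 idle_time 8.74 nhid 1
id 10 index 6 idle_time 8.74 nhid 1
id 10 index 7 idle_time 8.74 nhid 1
"""


def parse_buckets(text):
    """Turn 'key value key value ...' lines into a list of bucket records."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        pairs = dict(zip(fields[0::2], fields[1::2]))
        rows.append({
            "index": int(pairs["index"]),
            "idle_time": float(pairs["idle_time"]),
            "nhid": int(pairs["nhid"]),
        })
    return rows


rows = parse_buckets(SAMPLE)
per_next_hop = Counter(row["nhid"] for row in rows)
shortest = min(row["idle_time"] for row in rows)
recently_migrated = [r["index"] for r in rows if r["idle_time"] == shortest]
```

For the listing above, the tally gives 6 buckets to next hop 1 and 2 buckets
to next hop 2, and buckets 0 and 1 are the ones with the shortest idle time.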

Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
active::

        # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

In addition, another debugfs interface can be used to make the next attempt
to migrate a bucket fail::

        # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace

Besides serving as an example, the interfaces that netdevsim exposes are
useful in automated testing, and
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
them to test the algorithm.