.. SPDX-License-Identifier: GPL-2.0

=========================
Resilient Next-hop Groups
=========================

Resilient groups are a type of next-hop group that is aimed at minimizing
disruption in flow routing across changes to the group composition and
weights of constituent next hops.

The idea behind resilient hashing groups is best explained in contrast to
the legacy multipath next-hop group, which uses the hash-threshold
algorithm, described in RFC 2992.

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus::

             +-------+-------+-------+-------+-------+
             |   1   |   2   |   3   |   4   |   5   |
             +-------+-+-----+---+---+-----+-+-------+
             |    1    |    2    |    4    |    5    |
             +---------+---------+---------+---------+

              Before and after deletion of next hop 3
              under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5.
While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.

If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow can
suddenly end up arriving at a server that does not expect them. This can
result in TCP connections being reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
degraded application performance.

To mitigate the above-mentioned flow redirection, resilient next-hop groups
insert another layer of indirection between the hash space and its
constituent next hops: a hash table. The selection algorithm uses the SKB
hash to choose a hash table bucket, then reads the next hop that this bucket
contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
contiguous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary.
Therefore when a next hop is deleted
the buckets that held it are simply reassigned to other next hops::

            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                             v v v v
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             Before and after deletion of next hop 3
             under the resilient hashing algorithm.

When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
keep being forwarded to the same endpoints through the same paths as before
the next-hop group change.

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
certain number of buckets, according to its weight and the number of
buckets in the hash table. In accordance with the source code, we will call
this number a "wants count" of a next hop. When an event occurs that might
change the bucket allocation, the wants counts for individual next hops
are updated.
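
The relationship between weights, table size and wants counts can be
illustrated with a short calculation. The following Python sketch is not
taken from the kernel source. It assumes that the boundary after each next
hop is placed by rounding its cumulative weight share to the nearest bucket,
which keeps the counts summing to the table size. The function name
``wants_counts`` is hypothetical.

```python
def wants_counts(weights, num_buckets):
    """Return how many buckets each next hop should own.

    Assumption: boundaries are found by rounding the cumulative weight
    share to the nearest bucket, so the counts always add up to
    num_buckets. Weights must be positive integers.
    """
    total = sum(weights)
    counts = []
    cumulative = 0
    prev_upper = 0
    for weight in weights:
        cumulative += weight
        # Integer round-to-nearest of num_buckets * cumulative / total.
        upper = (num_buckets * cumulative + total // 2) // total
        counts.append(upper - prev_upper)
        prev_upper = upper
    return counts


if __name__ == "__main__":
    print(wants_counts([1, 1], 8))  # equal weights
    print(wants_counts([3, 1], 8))  # weights 3 and 1
```

Under this assumption, an 8-bucket table with two equally weighted next hops
gives each of them 4 buckets, and changing the weights to 3 and 1 gives them
6 and 2 buckets respectively.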

Next hops that have fewer buckets than their wants count are called
"underweight". Those that have more are "overweight". If there are no
overweight (and therefore no underweight) next hops in the group, it is
said to be "balanced".

Each bucket maintains a last-used timer. Every time a packet is forwarded
through a bucket, this timer is updated to the current jiffies value. One
attribute of a resilient group is then the "idle timer", which is the
amount of time that a bucket must not be hit by traffic in order for it to
be considered "idle". Buckets that are not idle are busy.

After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

1) that have no assigned next hop, or
2) whose next hop has been removed, or
3) that are idle and their next hop is overweight,

upkeep changes the next hop that the bucket references to one of the
underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled for
a future time.

There may not be enough "idle" buckets to satisfy the updated wants counts
of all next hops. Another attribute of a resilient group is the "unbalanced
timer".
This timer can be set to 0, in which case the table will stay out
of balance until idle buckets do appear, possibly never. If set to a
non-zero value, the value represents the period of time that the table is
permitted to stay out of balance.

With this in mind, we update the above list of conditions with one more
item. Thus buckets:

4) whose next hop is overweight, and the amount of time that the table has
   been out of balance exceeds the unbalanced timer, if that is non-zero,

\... are migrated as well.

Offloading & Driver Feedback
----------------------------

When offloading resilient groups, the algorithm that distributes buckets
among next hops is still the one in SW. Drivers are notified of updates to
next hop groups in the following three ways:

- Full group notification with the type
  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
  created and buckets populated for the first time.

- Single-bucket notifications of the type
  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which are used for notifications of
  individual migrations within an already-established group.

- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``.
  This is sent before the group is replaced, and is a way for the driver to
  veto the group before committing anything to the HW.

Some single-bucket notifications are forced, as indicated by the "force"
flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

Non-forced notifications can be overridden by the driver by returning an
error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.

A second way for the HW to report that a bucket is busy is through the
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
as busy are treated as if traffic hit them.

Offloaded buckets should be flagged as either "offload" or "trap". This is
done through the ``nexthop_bucket_set_hw_flags()`` API.

Netlink UAPI
------------

Resilient Group Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
same manner as other multipath groups.
The following changes apply to the
attributes passed in the netlink message:

  =================== =========================================================
  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                      groups.
  =================== =========================================================

``NHA_RES_GROUP`` payload:

  =================================== =========================================
  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
  =================================== =========================================

Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
``NHA_RES_GROUP`` payload will include the following attribute:

  =================================== =========================================
  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long has the resilient group been out
                                      of balance, in units of clock_t.
  =================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized at get requests
are:

  =================== =========================================================
  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
  ======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized at dump
requests are:

  =================== =========================================================
  ``NHA_ID``          If specified, limits the dump to just the next-hop group
                      with this ID.
  ``NHA_OIF``         If specified, limits the dump to buckets that contain
                      next hops that use the device with this ifindex.
  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
                      next hops that use a device in the VRF with this ifindex.
  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
  =================== =========================================================

``NHA_RES_BUCKET`` payload:

  ======================== ====================================================
  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                           that contain the next hop with this ID.
  ======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::

  # ip nexthop add id 1 via 192.0.2.2 dev eth0
  # ip nexthop add id 2 via 192.0.2.3 dev eth0
  # ip nexthop add id 10 group 1/2 type resilient \
        buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group.
It will have 8 buckets
(which is an unusually low number, and used here for demonstration purposes
only), each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

Changing next-hop weights leads to a change in bucket allocation::

  # ip nexthop replace id 10 group 1,3/2 type resilient

This can be confirmed by looking at individual buckets::

  # ip nexthop bucket show id 10
  id 10 index 0 idle_time 5.59 nhid 1
  id 10 index 1 idle_time 5.59 nhid 1
  id 10 index 2 idle_time 8.74 nhid 2
  id 10 index 3 idle_time 8.74 nhid 2
  id 10 index 4 idle_time 8.74 nhid 1
  id 10 index 5 idle_time 8.74 nhid 1
  id 10 index 6 idle_time 8.74 nhid 1
  id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the next-hop replace command to satisfy the new demand
that next hop 1 be given 6 buckets instead of 4.

Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
active::

  # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

In addition, another debugfs interface can be used to configure that the
next attempt to migrate a bucket should fail::

  # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace

Besides serving as an example, the interfaces that netdevsim exposes are
useful in automated testing, and
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
them to test the algorithm.
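
For experimenting with the algorithm outside the kernel, the upkeep pass from
the Algorithm section can also be modelled in a few lines. The following
Python sketch is a simplified model and not the kernel implementation. The
names ``Bucket`` and ``upkeep`` and all parameters are hypothetical, time is
a plain number rather than jiffies, underweight next hops are filled in
dictionary order, and no timestamps are changed on migration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Bucket:
    nhid: Optional[int]  # next hop the bucket references, None if unassigned
    last_used: float     # time the bucket last forwarded a packet


def upkeep(buckets: List[Bucket], wants: Dict[int, int], now: float,
           idle_timer: float, unbalanced_timer: float = 0.0,
           unbalanced_since: Optional[float] = None) -> List[int]:
    """Run one upkeep pass and return the indices of migrated buckets.

    `wants` maps each next hop currently in the group to its wants count.
    """
    # Count the buckets that each current group member holds.
    have = {nhid: 0 for nhid in wants}
    for bucket in buckets:
        if bucket.nhid in have:
            have[bucket.nhid] += 1

    # Condition 4: the table has been out of balance for longer than a
    # non-zero unbalanced timer allows.
    force = (unbalanced_timer > 0 and unbalanced_since is not None
             and now - unbalanced_since > unbalanced_timer)

    migrated = []
    for index, bucket in enumerate(buckets):
        underweight = [n for n in wants if have[n] < wants[n]]
        if not underweight:
            break  # the group is balanced, nothing left to do
        # Conditions 1 and 2: no next hop, or one no longer in the group.
        orphaned = bucket.nhid not in wants
        overweight = not orphaned and have[bucket.nhid] > wants[bucket.nhid]
        idle = now - bucket.last_used >= idle_timer
        # Condition 3 (idle and overweight) or 4 (overweight and forced).
        if orphaned or (overweight and (idle or force)):
            if not orphaned:
                have[bucket.nhid] -= 1
            bucket.nhid = underweight[0]
            have[bucket.nhid] += 1
            migrated.append(index)
    return migrated
```

In this model, a busy bucket of an overweight next hop is left alone until
either it becomes idle or the unbalanced timer expires, which is the behavior
that the two netdevsim debugfs files above are meant to exercise in the real
implementation.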