.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers, for example a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by taking the low-order seven bits of the computed hash for the
packet (usually a Toeplitz hash), using this number as an index into the
indirection table, and reading the corresponding value.
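
The lookup described above can be modelled in a few lines. This is only an
illustrative sketch: real NICs perform the lookup in hardware, and the
function names ``default_indirection_table`` and ``rss_queue`` are
hypothetical, not kernel or driver symbols.

```python
# Illustrative model only: real NICs do this in hardware, and the table
# size and default layout vary by device.
TABLE_SIZE = 128  # 128 entries, so the 7 low-order hash bits index the table


def default_indirection_table(num_queues, size=TABLE_SIZE):
    """Spread queue numbers evenly (round-robin) over the table."""
    return [i % num_queues for i in range(size)]


def rss_queue(flow_hash, table):
    """Return the receive queue for a packet with the given flow hash."""
    # Keep only the low-order bits; assumes the table size is a power of two.
    index = flow_hash & (len(table) - 1)
    return table[index]


table = default_indirection_table(num_queues=4)
print(rss_queue(0x12345678, table))  # low 7 bits are 120, entry 120 holds 0
```

Because the queue depends only on the hash, all packets of one flow land on
the same queue, which is what keeps per-flow ordering intact.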

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, TCP port 80 packets bound for a web
server can be directed to their own receive queue. Such “n-tuple” filters
can be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.
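
To make the idea of relative weights concrete, the sketch below builds an
indirection table in which each queue owns a share of the entries
proportional to its weight. The helper name ``weighted_indirection_table``
and the contiguous-block layout are assumptions made for illustration; the
layout an actual driver or ethtool produces may differ.

```python
def weighted_indirection_table(weights, size=128):
    """Build a table where queue q owns about weights[q]/sum(weights) entries."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    table = []
    cumulative = 0
    for queue, weight in enumerate(weights):
        cumulative += weight
        # Entries up to this boundary belong to queues 0..queue.
        boundary = cumulative * size // total
        table.extend([queue] * (boundary - len(table)))
    return table


table = weighted_indirection_table([3, 1])
print(table.count(0), table.count(1))  # prints "96 32"
```

With weights 3 and 1, queue 0 receives roughly three quarters of the hash
space and therefore, for evenly hashed traffic, about three quarters of the
flows.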


RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see
Documentation/core-api/irq/irq-affinity.rst. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-CPU load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler,
when a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the CPU, and thereby the backlog queue, that should process a
packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.
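
The CPU selection step can be sketched as follows. The sketch follows the
description above (hash modulo list size); the kernel's actual arithmetic
for mapping a hash onto the list may differ, and ``cpus_from_mask`` and
``rps_target_cpu`` are hypothetical helper names.

```python
def cpus_from_mask(mask):
    """Expand an rps_cpus-style bitmap into a sorted list of CPU numbers."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def rps_target_cpu(flow_hash, rps_cpus_mask, interrupting_cpu):
    """Pick the CPU whose backlog queue should receive the packet."""
    cpus = cpus_from_mask(rps_cpus_mask)
    if not cpus:
        # An empty map means RPS is disabled: stay on the interrupting CPU.
        return interrupting_cpu
    return cpus[flow_hash % len(cpus)]


# CPUs 1, 2 and 3 are allowed; hash 10 selects index 10 % 3 == 1, i.e. CPU 2.
print(rps_target_cpu(10, 0b1110, interrupting_cpu=0))  # prints 2
```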


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.
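
As a small aid for computing the bitmap, the sketch below turns a list of
CPU numbers into a hexadecimal mask string. It assumes the usual kernel
bitmap text format of comma-separated 32-bit hexadecimal groups, most
significant group first; the helper names are hypothetical.

```python
def cpu_list_to_mask(cpus):
    """Set one bit per CPU number, e.g. [0, 1] -> 0b11."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return mask


def format_cpu_mask(mask):
    """Format a mask as comma-separated 32-bit hex groups, highest first."""
    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    head, *rest = reversed(groups)
    # Only the leading group is written without zero padding.
    return ",".join([format(head, "x")] + [format(g, "08x") for g in rest])


print(format_cpu_mask(cpu_list_to_mask([0, 1, 2, 3])))  # prints "f"
```

The resulting string is what would be written to the rps_cpus file of the
chosen receive queue, which normally requires root privileges.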


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or a
spoofed-source denial-of-service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.
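
The accounting described above can be modelled with a fixed-size history
ring and a table of per-bucket counters. This is a simplified sketch, not
the kernel implementation: the class ``FlowLimit`` and its method names are
hypothetical, while the constants (256-packet history, half-queue threshold,
half-history ratio, 4096 buckets) are taken from this document.

```python
HISTORY_LEN = 256  # number of recent packets tracked per CPU


class FlowLimit:
    """Per-CPU model of the flow limit accounting."""

    def __init__(self, num_buckets=4096, max_backlog=1000):
        self.num_buckets = num_buckets  # assumed to be a power of two
        self.max_backlog = max_backlog
        self.history = [0] * HISTORY_LEN  # bucket of each recent packet
        self.head = 0
        self.buckets = [0] * num_buckets  # packets per bucket in the history

    def should_drop(self, flow_hash, queue_len):
        # Below half of the maximum backlog nothing is counted or dropped.
        if queue_len < self.max_backlog // 2:
            return False
        new_flow = flow_hash & (self.num_buckets - 1)
        old_flow = self.history[self.head]
        # Overwrite the oldest history slot with the new packet's bucket.
        self.history[self.head] = new_flow
        self.head = (self.head + 1) % HISTORY_LEN
        if self.buckets[old_flow] > 0:
            self.buckets[old_flow] -= 1
        self.buckets[new_flow] += 1
        # Drop once one bucket holds more than half of the history.
        return self.buckets[new_flow] > HISTORY_LEN // 2


limit = FlowLimit()
drops = [limit.should_drop(flow_hash=7, queue_len=600) for _ in range(200)]
print(drops.index(True))  # prints 128: the 129th packet is the first dropped
```

A packet from a different, small flow arriving at the same moment is not
dropped by this check, which is how small flows are favoured under load.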


Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same one that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase data cache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
and tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.
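
The decision rule above can be expressed as a small state machine. The
sketch below is a simplified model under stated assumptions: ``DevFlow``,
``Backlog`` and ``rfs_select_cpu`` are hypothetical names, an unset CPU is
represented by -1 or any index beyond the backlog list, and the desired CPU
is assumed to be valid.

```python
from dataclasses import dataclass

NO_CPU = -1  # hypothetical marker for an unset entry


@dataclass
class DevFlow:
    """One rps_dev_flow-style entry: current CPU and recorded tail counter."""
    cpu: int = NO_CPU
    last_qtail: int = 0


@dataclass
class Backlog:
    """Per-CPU backlog with a head counter that advances on dequeue."""
    head: int = 0
    qlen: int = 0
    online: bool = True

    @property
    def tail(self):
        return self.head + self.qlen

    def dequeue(self):
        self.head += 1
        self.qlen -= 1


def rfs_select_cpu(desired_cpu, flow, backlogs):
    """Return the CPU a new packet of this flow is enqueued on."""
    current = flow.cpu
    if desired_cpu != current:
        unset = current < 0 or current >= len(backlogs)
        # Move only when nothing from this flow can still be queued there.
        if (unset or not backlogs[current].online
                or backlogs[current].head >= flow.last_qtail):
            flow.cpu = current = desired_cpu
    backlog = backlogs[current]
    backlog.qlen += 1
    flow.last_qtail = backlog.tail  # remember where this packet sits
    return current


backlogs = [Backlog(), Backlog()]
flow = DevFlow()
print(rfs_select_cpu(0, flow, backlogs))  # prints 0: entry was unset
print(rfs_select_cpu(1, flow, backlogs))  # prints 0: a packet is still queued
backlogs[0].dequeue()
backlogs[0].dequeue()
print(rfs_select_cpu(1, flow, backlogs))  # prints 1: old backlog has drained
```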


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both rps_sock_flow_entries and rps_flow_cnt need to be set before RFS is
enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.
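
The sizing arithmetic above is easy to reproduce. The helper names in this
sketch are hypothetical; it only mirrors the rounding and division rules
stated in this section.

```python
def round_up_pow2(n):
    """Round n up to the nearest power of two (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def per_queue_flow_cnt(sock_flow_entries, num_queues):
    """Suggested rps_flow_cnt: global table size divided by queue count."""
    total = round_up_pow2(sock_flow_entries)
    return round_up_pow2(max(1, total // num_queues))


print(per_queue_flow_cnt(32768, 16))  # prints 2048
print(round_up_pow2(3000))            # prints 4096
```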


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.
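
The idea of a reverse map can be illustrated with a toy model. The sketch
below is not the cpu_rmap library: the function name, the inputs and the
three-level distance metric (same CPU, same cache domain, anything else) are
assumptions chosen only to show how each CPU ends up paired with the
nearest queue.

```python
def build_cpu_to_queue_map(queue_irq_cpu, cpu_domain):
    """Map every CPU to the queue whose IRQ CPU is closest to it.

    queue_irq_cpu[q] is the CPU handling the IRQ of receive queue q.
    cpu_domain[c] is an identifier of the cache domain CPU c belongs to.
    """
    def distance(cpu, irq_cpu):
        if cpu == irq_cpu:
            return 0  # the queue is serviced on this very CPU
        if cpu_domain[cpu] == cpu_domain[irq_cpu]:
            return 1  # the servicing CPU shares a cache domain
        return 2      # the servicing CPU is further away

    queues = range(len(queue_irq_cpu))
    return [
        min(queues, key=lambda q: distance(cpu, queue_irq_cpu[q]))
        for cpu in range(len(cpu_domain))
    ]


# Four CPUs in two cache domains; queue 0 interrupts CPU 0, queue 1 CPU 2.
print(build_cpu_to_queue_map([0, 2], [0, 0, 1, 1]))  # prints [0, 0, 1, 1]
```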


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy-polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.
4358c2ecf20Sopenharmony_ci
4368c2ecf20Sopenharmony_ciXPS is configured per transmit queue by setting a bitmap of
4378c2ecf20Sopenharmony_ciCPUs/receive-queues that may use that queue to transmit. The reverse
4388c2ecf20Sopenharmony_cimapping, from CPUs to transmit queues or from receive-queues to transmit
4398c2ecf20Sopenharmony_ciqueues, is computed and maintained for each network device. When
4408c2ecf20Sopenharmony_citransmitting the first packet in a flow, the function get_xps_queue() is
4418c2ecf20Sopenharmony_cicalled to select a queue. This function uses the ID of the receive queue
4428c2ecf20Sopenharmony_cifor the socket connection for a match in the receive queue-to-transmit queue
4438c2ecf20Sopenharmony_cilookup table. Alternatively, this function can also use the ID of the
4448c2ecf20Sopenharmony_cirunning CPU as a key into the CPU-to-queue lookup table. If the
4458c2ecf20Sopenharmony_ciID matches a single queue, that is used for transmission. If multiple
4468c2ecf20Sopenharmony_ciqueues match, one is selected by using the flow hash to compute an index
4478c2ecf20Sopenharmony_ciinto the set. When selecting the transmit queue based on receive queue(s)
4488c2ecf20Sopenharmony_cimap, the transmit device is not validated against the receive device as it
4498c2ecf20Sopenharmony_cirequires expensive lookup operation in the datapath.
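
The selection steps described above can be modelled in a few lines of
Python. This is an illustrative sketch, not the kernel implementation:
the function name select_xps_queue, the dictionary-based lookup tables,
and the use of a plain modulo to turn the flow hash into an index are
all assumptions made for the example. The receive-queue map is consulted
first and the CPU map is used as the fallback.

```python
from typing import Mapping, Optional, Sequence


def select_xps_queue(
    rxq_map: Mapping[int, Sequence[int]],
    cpu_map: Mapping[int, Sequence[int]],
    rx_queue: Optional[int],
    cpu: int,
    flow_hash: int,
) -> Optional[int]:
    """Return a transmit queue index, or None if no XPS map applies."""
    candidates: Sequence[int] = ()
    # Prefer the receive-queue map when the socket has a cached rx queue.
    if rx_queue is not None:
        candidates = rxq_map.get(rx_queue, ())
    # Otherwise (or if that map has no entry) key on the running CPU.
    if not candidates:
        candidates = cpu_map.get(cpu, ())
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Several queues match: the flow hash picks one of them.
    # (Modulo is an assumption of this sketch.)
    return candidates[flow_hash % len(candidates)]
```

A return value of None stands for the case where XPS does not apply and
the stack has to choose the queue by other means.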

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out-of-order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out-of-order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.
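
The caching rule can be illustrated with a small Python model. It is a
sketch under stated assumptions: the Socket class, the pick_tx_queue
function and the select_queue callback are hypothetical stand-ins for
the per-flow socket state and the queue selection step, and do not
mirror kernel data structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Socket:
    """Stand-in for per-flow socket state (hypothetical)."""
    tx_queue: Optional[int] = None  # cached transmit queue, None if unset


def pick_tx_queue(
    sk: Socket,
    ooo_okay: bool,
    select_queue: Callable[[], int],
) -> int:
    """Reuse the cached queue unless none is cached or ooo_okay is set."""
    if sk.tx_queue is None or ooo_okay:
        # Either the first packet of the flow, or no packets are
        # outstanding, so switching queues cannot reorder the flow.
        sk.tx_queue = select_queue()
    return sk.tx_queue
```

The selection callback runs only for the first packet and for packets
that carry ooo_okay, which is how the cost of queue selection is spread
over the whole flow.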

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs
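
The value written to either file is a bitmap. The sketch below shows one
way to build such a value from a list of CPU or receive-queue indices.
It assumes the usual kernel convention for sysfs bitmaps, hexadecimal
digits in comma-separated 32-bit groups, which should be checked against
a live system. The helper name cpu_bitmap is made up for this example.

```python
from typing import Iterable


def cpu_bitmap(indices: Iterable[int]) -> str:
    """Format CPU (or receive-queue) indices as a hexadecimal bitmap."""
    mask = 0
    for index in indices:
        if index < 0:
            raise ValueError("index must be non-negative")
        mask |= 1 << index

    # Split the mask into 32-bit groups, least significant first.
    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break

    # Emit the most significant group first; pad the others to 8 digits.
    groups.reverse()
    return ",".join([f"{groups[0]:x}"] + [f"{g:08x}" for g in groups[1:]])
```

For example, cpu_bitmap([0, 1]) yields 3, the value that would be
written to tx-<n>/xps_cpus to let CPUs 0 and 1 transmit on queue n.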


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).
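
As a rough illustration of this guidance, the following Python sketch
derives a CPU-to-queue plan from two inputs that the administrator would
have to supply: the sets of CPUs that share a cache, and the CPU that
handles transmit completions for each queue. Both inputs, and the name
plan_xps_cpus, are assumptions of the example. It is a planning aid
only, not something the kernel does.

```python
from typing import Dict, List, Mapping, Sequence


def plan_xps_cpus(
    cache_groups: Sequence[Sequence[int]],
    completion_cpu: Mapping[int, int],
) -> Dict[int, List[int]]:
    """Map each CPU onto at most one queue, keeping cache sharers together.

    cache_groups:   lists of CPUs that share a cache (hypothetical input).
    completion_cpu: queue index -> CPU handling its transmit completions.
    """
    group_of = {cpu: gi for gi, group in enumerate(cache_groups) for cpu in group}

    # Collect the queues whose completion CPU lives in each cache group.
    queues_in_group: Dict[int, List[int]] = {}
    for queue, cpu in sorted(completion_cpu.items()):
        queues_in_group.setdefault(group_of[cpu], []).append(queue)

    plan: Dict[int, List[int]] = {queue: [] for queue in completion_cpu}
    for gi, group in enumerate(cache_groups):
        queues = queues_in_group.get(gi)
        if not queues:
            continue  # no queue completes in this group; its CPUs stay unmapped
        # Spread the group's CPUs over the queues that complete nearby.
        for i, cpu in enumerate(sorted(group)):
            plan[queues[i % len(queues)]].append(cpu)
    return plan
```

With as many queues as CPUs the plan degenerates to exclusive pairings.
With fewer queues, CPUs that share a cache end up on the same queue.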

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured with a mapping from receive queue(s) to transmit
queue(s). If the user configuration for the receive-queue map does not
apply, then the transmit queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by HW. Currently a
max-rate attribute is supported, set by writing a value in Mbps to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.
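
A minimal Python sketch of setting this attribute follows. The helper
name set_tx_maxrate and its sysfs_root parameter (which exists only so
the function can be exercised against a scratch directory) are
assumptions of the example. Writing to the real file requires
appropriate privileges and a driver that supports the attribute.

```python
from pathlib import Path


def set_tx_maxrate(dev: str, queue: int, mbps: int,
                   sysfs_root: str = "/sys") -> Path:
    """Write a max rate in Mbps for one transmit queue; 0 disables it."""
    if mbps < 0:
        raise ValueError("rate must be zero (disabled) or positive")
    path = (Path(sysfs_root) / "class" / "net" / dev
            / "queues" / f"tx-{queue}" / "tx_maxrate")
    path.write_text(f"{mbps}\n")
    return path
```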


Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)