.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================


Introduction
============

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering


RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS).
The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers, for example a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking the computed hash for the packet (usually a Toeplitz hash)
down to its low-order seven bits, taking this number as a key into the
indirection table and reading the corresponding value.

Some advanced NICs allow steering packets to queues based on
programmable filters. For example, web server bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).


RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues.
A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.


RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs.
To manually adjust the IRQ
affinity of each interrupt see Documentation/core-api/irq/irq-affinity.rst. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-cpu load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.
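
As a rough illustration of the indirection-table lookup described in this
section, the sketch below masks a hash to seven bits and reads a queue
number from a 128-entry table. It is written in Python for readability and
is not driver code: the function names, the round-robin default and the
weighting scheme are assumptions made for this example.

```python
# Sketch of an RSS indirection-table lookup. The 128-entry size and the
# seven-bit mask come from the text; all names here are hypothetical.

TABLE_SIZE = 128            # entries in the indirection table
HASH_MASK = TABLE_SIZE - 1  # 0x7f: keeps the low-order seven bits


def default_table(num_queues):
    """Spread the queues evenly over the table (round-robin, assumed)."""
    return [i % num_queues for i in range(TABLE_SIZE)]


def weighted_table(weights):
    """Give queue q a share of entries proportional to weights[q]."""
    total = sum(weights)
    table = []
    for queue, weight in enumerate(weights):
        table.extend([queue] * (TABLE_SIZE * weight // total))
    # Hand any entries lost to integer rounding to the last queue.
    table.extend([len(weights) - 1] * (TABLE_SIZE - len(table)))
    return table


def select_queue(table, rx_hash):
    """Mask the hash and read the queue number from the table."""
    return table[rx_hash & HASH_MASK]
```

With four queues the default table holds 32 entries per queue, while
``weighted_table([3, 1])`` gives queue 0 three quarters of the entries.
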


RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler, when
a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the queue that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet.
The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.


RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured.
The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to
the bitmap.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain as the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.
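
The rps_cpus bitmap and the hash-based CPU choice described in this
section can be modelled in a few lines of Python. This is only a sketch:
the helper names are hypothetical, the selection follows the "hash modulo
list size" wording of the text rather than any particular kernel version,
and the comma handling assumes the usual grouping of long CPU bitmaps into
32-bit words.

```python
# Sketch of the rps_cpus bitmap and of RPS target selection.
# All function names are hypothetical and exist only for illustration.

def cpus_to_mask(cpus):
    """Encode CPU numbers as the hex bitmap written to rps_cpus."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def mask_to_cpus(mask_text):
    """Decode a hex bitmap; commas between 32-bit groups are ignored."""
    mask = int(mask_text.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def rps_target_cpu(flow_hash, cpu_list, interrupting_cpu):
    """Pick the CPU for a packet; an empty list means RPS is disabled."""
    if not cpu_list:
        return interrupting_cpu
    return cpu_list[flow_hash % len(cpu_list)]
```

For example, ``cpus_to_mask([1, 2, 3])`` yields ``e``, a map that excludes
an interrupting CPU 0 as suggested for high interrupt rates.
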


RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.
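
The accounting described above can be pictured with the following Python
model. It is a simplified sketch, not the kernel implementation: the class
and method names are invented, the history length of 256 and the one-half
ratio are taken from the text, and the 4096-bucket table is an assumed
size chosen for the example.

```python
# Simplified model of flow-limit accounting for a single CPU.
# Names and structure are hypothetical; constants follow the text.

class FlowLimit:
    HISTORY = 256  # number of recent packets that are remembered

    def __init__(self, max_backlog=1000, table_len=4096):
        self.max_backlog = max_backlog
        self.table_len = table_len          # assumed bucket count
        self.buckets = [0] * table_len      # per-bucket packet counts
        self.history = [0] * self.HISTORY   # ring of recent bucket ids
        self.head = 0

    def should_drop(self, flow_hash, queue_len):
        """Return True if this packet's flow exceeds its share."""
        # Inactive while the input queue is below half the maximum.
        if queue_len < self.max_backlog // 2:
            return False
        new = flow_hash % self.table_len
        # Forget the oldest remembered packet, remember the new one.
        old = self.history[self.head]
        self.history[self.head] = new
        self.head = (self.head + 1) % self.HISTORY
        if self.buckets[old] > 0:
            self.buckets[old] -= 1
        self.buckets[new] += 1
        # Drop once the flow holds more than half of the history.
        return self.buckets[new] > self.HISTORY // 2
```

In this model a flow that keeps arriving while the queue is above the
threshold starts to lose packets once it occupies more than half of the
remembered history, while packets that hash to other buckets still pass.
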


Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same one that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.


RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase the data cache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
tcp_splice_read()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

  - The current CPU's queue head counter >= the recorded tail counter
    value in rps_dev_flow[i]
  - The current CPU is unset (>= nr_cpu_ids)
  - The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU.
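
The comparison of the two tables and the three conditions listed above can
be sketched as follows. This Python model is an illustration only: the
``DevFlow`` class, the use of ``None`` for an unset CPU and the dictionary
of head counters are assumptions, and counter wrap-around and the fallback
to plain RPS are left out.

```python
# Sketch of the RFS decision for one rps_dev_flow entry.
# All names are hypothetical; None stands in for an unset CPU.

from dataclasses import dataclass

NO_CPU = None


@dataclass
class DevFlow:
    cpu: object = NO_CPU   # current CPU for this table entry
    last_qtail: int = 0    # tail counter recorded at the last enqueue


def rfs_select_cpu(desired_cpu, flow, queue_head, online):
    """Return the CPU to enqueue on, moving the flow only when safe.

    queue_head maps each CPU to its backlog head counter and online
    is the set of CPUs that are currently online.
    """
    current = flow.cpu
    if desired_cpu != current:
        if (current is NO_CPU                       # unset
                or current not in online            # offline
                or queue_head[current] >= flow.last_qtail):  # drained
            flow.cpu = desired_cpu
    return flow.cpu
```
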
These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.


RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.
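
The power-of-two rounding, together with an even split of the global table
size across receive queues, can be worked through with a small Python
helper. Both function names are hypothetical, and the even split is one
possible sizing policy rather than a requirement.

```python
# Arithmetic sketch for sizing the RFS tables (hypothetical helpers).

def roundup_pow_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def per_queue_flow_cnt(sock_flow_entries, num_rx_queues):
    """Split the global table size evenly across receive queues."""
    share = max(1, sock_flow_entries // num_rx_queues)
    return roundup_pow_of_two(share)
```

A requested value of 30000 would therefore end up as 32768, and 32768
entries shared by 16 receive queues gives 2048 entries per queue.
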

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.


Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated.
The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.


Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.
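
The idea of a reverse map from CPUs to receive queues can be illustrated
with the Python sketch below. It does not use or reproduce the cpu_rmap
library: the function name, the input dictionaries and the three-level
distance measure (same CPU, same cache domain, anything else) are all
assumptions made for the example.

```python
# Sketch of a CPU-to-queue reverse map built from IRQ affinities.
# Names, inputs and the distance measure are hypothetical.

def build_cpu_to_queue_map(queue_irq_cpu, cache_domain):
    """Map every CPU to the receive queue whose IRQ CPU is closest.

    queue_irq_cpu: {queue: CPU that handles the queue's IRQ}
    cache_domain:  {cpu: label of an assumed shared-cache domain}
    """
    def distance(cpu, irq_cpu):
        if cpu == irq_cpu:
            return 0
        if cache_domain[cpu] == cache_domain[irq_cpu]:
            return 1
        return 2

    rmap = {}
    for cpu in sorted(cache_domain):
        # Ties are resolved in favour of the lowest queue number.
        rmap[cpu] = min(
            sorted(queue_irq_cpu),
            key=lambda q: distance(cpu, queue_irq_cpu[q]),
        )
    return rmap
```

With two queues whose interrupts are handled on CPUs 0 and 2, and CPUs 0-1
and 2-3 sharing a cache, CPUs 0 and 1 map to queue 0 and CPUs 2 and 3 map
to queue 1.
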


XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping.
This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table.
Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that queue is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on the receive queue(s)
map, the transmit device is not validated against the receive device, because
that would require an expensive lookup operation in the datapath.

The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.
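
The selection and caching steps described above can be sketched roughly as
follows. This is an illustrative Python model, not the kernel code: the names
FlowSocket, pick_from_map, select_tx_queue and tx_queue_for_packet are
hypothetical, the lookup tables are plain dictionaries, and scaling the 32-bit
flow hash into the set with a multiply-and-shift is an assumption about how
the index is computed.

```python
from typing import Dict, List, Optional

QueueMap = Dict[int, List[int]]  # key (rx queue ID or CPU ID) -> tx queues


def scale_hash(flow_hash: int, n: int) -> int:
    """Map a 32-bit hash into [0, n) (assumed multiply-and-shift scaling)."""
    return ((flow_hash & 0xFFFFFFFF) * n) >> 32


def pick_from_map(xps_map: QueueMap, key: int, flow_hash: int) -> Optional[int]:
    """Return the tx queue for key, or None if the map has no entry for it."""
    queues = xps_map.get(key)
    if not queues:
        return None
    if len(queues) == 1:
        return queues[0]  # single match: use it directly
    return queues[scale_hash(flow_hash, len(queues))]  # several: hash picks one


def select_tx_queue(rxq_map: QueueMap, cpu_map: QueueMap,
                    rx_queue: Optional[int], cpu: int,
                    flow_hash: int) -> Optional[int]:
    """Try the receive-queue map first, then fall back to the CPU map."""
    if rx_queue is not None:
        queue = pick_from_map(rxq_map, rx_queue, flow_hash)
        if queue is not None:
            return queue
    return pick_from_map(cpu_map, cpu, flow_hash)


class FlowSocket:
    """Hypothetical stand-in for the per-flow socket state."""

    def __init__(self, rx_queue: Optional[int] = None) -> None:
        self.rx_queue = rx_queue  # receive queue cached for the connection
        self.tx_queue: Optional[int] = None  # queue saved for the flow


def tx_queue_for_packet(sock: FlowSocket, ooo_okay: bool, rxq_map: QueueMap,
                        cpu_map: QueueMap, cpu: int,
                        flow_hash: int) -> Optional[int]:
    """Reuse the saved queue unless none is saved or ooo_okay allows a change."""
    if sock.tx_queue is None or ooo_okay:
        sock.tx_queue = select_tx_queue(rxq_map, cpu_map, sock.rx_queue,
                                        cpu, flow_hash)
    return sock.tx_queue
```

In this sketch a return value of None only means that neither map has an entry
for the packet. A flow keeps its saved queue even when the sending thread
moves to another CPU, and the queue is re-evaluated only when ooo_okay is set.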

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

    /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

    /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs


Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).
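
As a rough illustration of the exclusive pairing described above, the sketch
below builds the CPU bitmap for each transmit queue and writes it to the
xps_cpus file. The helper names (cpu_mask, xps_cpus_path, write_xps_cpus) and
the device name in the usage note are hypothetical. The sketch assumes that
the file accepts a hexadecimal bitmap written as comma-separated 32-bit
groups, most significant group first, and that the caller has permission to
write to sysfs.

```python
from pathlib import Path
from typing import Dict, Iterable


def cpu_mask(cpus: Iterable[int]) -> str:
    """Encode CPU IDs as a hex bitmap in comma-separated 32-bit groups."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # bit N set means CPU N may use the queue
    words = []
    while True:
        words.append(format(mask & 0xFFFFFFFF, "08x"))
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(words))  # most significant group first


def xps_cpus_path(dev: str, queue: int, root: str = "/sys/class/net") -> Path:
    """Build the sysfs path of the CPUs map for one transmit queue."""
    return Path(root) / dev / "queues" / f"tx-{queue}" / "xps_cpus"


def write_xps_cpus(dev: str, plan: Dict[int, Iterable[int]],
                   root: str = "/sys/class/net") -> None:
    """Write one bitmap per transmit queue; plan maps queue -> CPU IDs."""
    for queue, cpus in plan.items():
        xps_cpus_path(dev, queue, root).write_text(cpu_mask(cpus))
```

For a hypothetical device eth0 with four transmit queues on a four-CPU
system, write_xps_cpus("eth0", {0: [0], 1: [1], 2: [2], 3: [3]}) would give
each CPU its own queue. With fewer queues than CPUs, the plan would instead
group CPUs that share a cache, which this sketch leaves to the administrator.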

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured by mapping receive-queue(s) to transmit queue(s). If the
user configuration for the receive-queue map does not apply, then the transmit
queue is selected based on the CPUs map.


Per TX Queue rate limitation
============================

This is a rate-limitation mechanism implemented in hardware. Currently
a max-rate attribute is supported, which is set by writing a value in Mbps
to::

    /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.


Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors:

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)