.. SPDX-License-Identifier: GPL-2.0

======
AF_XDP
======

Overview
========

AF_XDP is an address family that is optimized for high performance
packet processing.

This document assumes that the reader is familiar with BPF and XDP. If
not, the Cilium project has an excellent reference guide at
http://cilium.readthedocs.io/en/latest/bpf/.

Using the XDP_REDIRECT action from an XDP program, the program can
redirect ingress frames to other XDP enabled netdevs, using the
bpf_redirect_map() function. AF_XDP sockets enable the possibility for
XDP programs to redirect frames to a memory buffer in a user-space
application.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two rings: the RX ring and the
TX ring. A socket can receive packets on the RX ring and it can send
packets on the TX ring. These rings are registered and sized with the
setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
to have at least one of these rings for each socket. An RX or TX
descriptor ring points to a data buffer in a memory area called a
UMEM. RX and TX can share the same UMEM so that a packet does not have
to be copied between RX and TX.
Moreover, if a packet needs to be kept
for a while due to a possible retransmit, the descriptor that points
to that packet can be changed to point to another and reused right
away. This again avoids copying data.

The UMEM consists of a number of equally sized chunks. A descriptor in
one of the rings references a frame by referencing its addr. The addr
is simply an offset within the entire UMEM region. User space
allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc.). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
FILL ring is used by the application to send down addrs for the kernel
to fill in with RX packet data. References to these frames will then
appear in the RX ring once each packet has been received. The
COMPLETION ring, on the other hand, contains frame addrs that the
kernel has transmitted completely and can now be used again by user
space, for either TX or RX. Thus, the frame addrs appearing in the
COMPLETION ring are addrs that were previously transmitted using the
TX ring. In summary, the RX and FILL rings are used for the RX path
and the TX and COMPLETION rings are used for the TX path.
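
As a rough illustration of the statement that an addr is simply an
offset within the UMEM region, the sketch below turns an addr into a
pointer to the frame data. The helper name umem_addr_to_ptr and its
parameters are hypothetical; they are not part of any kernel or libbpf
API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel or libbpf API): translate a UMEM
 * addr, i.e. a byte offset into the UMEM region, into a pointer.
 * Returns NULL if the offset lies outside the region.
 */
static void *umem_addr_to_ptr(void *umem_area, uint64_t umem_size,
                              uint64_t addr)
{
    if (addr >= umem_size)
        return NULL;

    return (uint8_t *)umem_area + addr;
}
```

The bounds check is an assumption about what a careful application
might want; the kernel validates addrs on its side independently.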

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow.

The UMEM can be shared between processes, if desired. If a process
wants to do this, it simply skips the registration of the UMEM and its
corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
call and submits the XSK of the process it would like to share UMEM
with as well as its own newly created XSK socket. The new process will
then receive frame addr references in its own RX ring that point to
this shared UMEM. Note that since the ring structures are
single-consumer / single-producer (for performance reasons), the new
process has to create its own socket with associated RX and TX rings,
since it cannot share these with the other process. This is also the
reason that there is only one set of FILL and COMPLETION rings per
UMEM. It is the responsibility of a single process to handle the UMEM.

How are packets then distributed from an XDP program to the XSKs? There
is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
user-space application can place an XSK at an arbitrary place in this
map. The XDP program can then redirect a packet to a specific index in
this map and at this point XDP validates that the XSK in that map was
indeed bound to that device and ring number.
If not, the packet is
dropped. If the map is empty at that index, the packet is also
dropped. This also means that it is currently mandatory to have an XDP
program loaded (and one XSK in the XSKMAP) to be able to get any
traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed, which uses SKBs
together with the generic XDP support and copies out the data to user
space. This is a fallback mode that works for any network device. On
the other hand, if the driver has support for XDP, it will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space.

Concepts
========

In order to use an AF_XDP socket, a number of associated objects need
to be set up. These objects and their options are explained in the
following sections.

For an overview on how AF_XDP works, you can also take a look at the
Linux Plumbers paper from 2018 on the subject:
http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do
NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt
at AF_XDP. Nearly everything has changed since then.
Jonathan Corbet has
also written an excellent article on LWN, "Accelerating networking
with AF_XDP". It can be found at https://lwn.net/Articles/750845/.

UMEM
----

UMEM is a region of virtually contiguous memory, divided into
equal-sized frames. A UMEM is associated with a netdev and a specific
queue id of that netdev. It is created and configured (chunk size,
headroom, start address and size) by using the XDP_UMEM_REG setsockopt
system call. A UMEM is bound to a netdev and queue id, via the bind()
system call.

An AF_XDP socket is linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share a UMEM created via one socket A,
the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.

The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.

Rings
-----

There are four different kinds of rings: FILL, COMPLETION, RX and
TX.
All rings are single-producer/single-consumer, so the user-space
application needs explicit synchronization if multiple
processes/threads are reading/writing to them.

The UMEM uses two rings: FILL and COMPLETION. Each socket associated
with the UMEM must have an RX queue, TX queue or both. Say that there
is a setup with four sockets (all doing TX and RX). Then there will be
one FILL ring, one COMPLETION ring, four TX rings and four RX rings.

The rings are head(producer)/tail(consumer) based rings. A producer
writes the data ring at the index pointed out by struct xdp_ring
producer member, and then increases the producer index. A consumer
reads the data ring at the index pointed out by struct xdp_ring
consumer member, and then increases the consumer index.

The rings are configured and created via the _RING setsockopt system
calls and mmapped to user-space using the appropriate offset to mmap()
(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
XDP_UMEM_PGOFF_COMPLETION_RING).

The size of each ring needs to be a power of two.

UMEM Fill Ring
~~~~~~~~~~~~~~

The FILL ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring.
As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.

Frames passed to the kernel are used for the ingress path (RX rings).

The user application produces UMEM addrs to this ring. Note that, if
running the application with aligned chunk mode, the kernel will mask
the incoming addr. E.g. for a chunk size of 2k, the log2(2048) = 11
least significant bits of the addr will be masked off, meaning that
2048, 2050 and 3000 refer to the same chunk. If the user application
is run in the unaligned chunks mode, then the incoming addr will be
left untouched.


UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~

The COMPLETION ring is used to transfer ownership of UMEM frames from
kernel-space to user-space. Just like the FILL ring, UMEM addrs are
used.

Frames passed from the kernel to user-space are frames that have been
sent (TX ring) and can be used by user-space again.

The user application consumes UMEM addrs from this ring.


RX Ring
~~~~~~~

The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains the UMEM
offset (addr) and the length of the data (len).
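
The aligned-mode masking described for the FILL ring can also be
applied to the addr carried in a descriptor to find the chunk it
belongs to. The sketch below assumes aligned chunk mode and a
power-of-two chunk size; the helper names are hypothetical and only
mirror the masking the text describes, they are not kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, assuming aligned chunk mode and a
 * power-of-two chunk size.
 */

/* Start addr of the chunk containing addr: the low log2(chunk_size)
 * bits are masked off. */
static uint64_t chunk_base(uint64_t addr, uint64_t chunk_size)
{
    return addr & ~(chunk_size - 1);
}

/* Zero-based index of that chunk within the UMEM. */
static uint64_t chunk_index(uint64_t addr, uint64_t chunk_size)
{
    return addr / chunk_size;
}
```

With a 2k chunk size, chunk_base() maps 2048, 2050 and 3000 to 2048,
which matches the example given for the FILL ring.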

If no frames have been passed to the kernel via the FILL ring, no
descriptors will (or can) appear on the RX ring.

The user application consumes struct xdp_desc descriptors from this
ring.

TX Ring
~~~~~~~

The TX ring is used to send frames. The struct xdp_desc descriptor is
filled (index, length and offset) and passed into the ring.

To start the transfer a sendmsg() system call is required. This might
be relaxed in the future.

The user application produces struct xdp_desc descriptors to this
ring.

Libbpf
======

Libbpf is a helper library for eBPF and XDP that makes using these
technologies a lot simpler. It also contains specific helper functions
in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
contains two types of functions: those that can be used to make the
setup of an AF_XDP socket easier and ones that can be used in the data
plane to access the rings safely and quickly. To see an example on how
to use this API, please take a look at the sample application in
samples/bpf/xdpsock_user.c which uses libbpf for both setup and data
plane operations.

We recommend that you use this library unless you have become a power
user. It will make your program a lot simpler.

XSKMAP / BPF_MAP_TYPE_XSKMAP
============================

On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP)
that is used in conjunction with bpf_redirect_map() to pass the ingress
frame to a socket.

The user application inserts the socket into the map, via the bpf()
system call.

Note that if an XDP program tries to redirect to a socket that does
not match the queue configuration and netdev, the frame will be
dropped. For example, if an AF_XDP socket is bound to netdev eth0 and
queue 17, only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application in samples/bpf/ for an example.

Configuration Flags and Socket Options
======================================

These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.

XDP_COPY and XDP_ZEROCOPY bind flags
------------------------------------

When you bind to a socket, the kernel will first try to use zero-copy
mode.
If zero-copy is not supported, it will fall back on using copy
mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
socket into zero-copy mode or fail.

XDP_SHARED_UMEM bind flag
-------------------------

This flag enables you to bind multiple sockets to the same UMEM. It
works on the same queue id, between queue ids and between
netdevs/devices. In this mode, each socket has its own RX and TX
rings as usual, but you are going to have one or more FILL and
COMPLETION ring pairs. You have to create one of these pairs per
unique netdev and queue id tuple that you bind to.

We start with the case where we would like to share a UMEM between
sockets bound to the same netdev and queue id. The UMEM (tied to the
first socket created) will only have a single FILL ring and a single
COMPLETION ring as there is only one unique netdev,queue_id tuple that
we have bound to. To use this mode, create the first socket and bind
it in the normal way.
Create a second socket and create an RX and a TX
ring, or at least one of them, but no FILL or COMPLETION rings as the
ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.

Which socket will a packet then arrive on? This is decided by the XDP
program. Put all the sockets in the XSKMAP and just indicate which
index in the array you would like to send each packet to. A simple
round-robin example of distributing packets is shown below:

.. code-block:: c

    #include <linux/bpf.h>
    #include "bpf_helpers.h"

    /* Must be a power of two for the index mask below to work. */
    #define MAX_SOCKS 16

    struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, MAX_SOCKS);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } xsks_map SEC(".maps");

    /* Index of the socket that received the previous packet. */
    static unsigned int rr;

    SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
    {
        /* Advance to the next socket, wrapping at MAX_SOCKS. */
        rr = (rr + 1) & (MAX_SOCKS - 1);

        /* Drop the packet if no socket is present at index rr. */
        return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
    }

Note that since there is only a single set
of FILL and COMPLETION
rings, and they are single-producer, single-consumer rings, you need
to make sure that multiple processes or threads do not use these rings
concurrently. There are no synchronization primitives in the
libbpf code that protect multiple users at this point in time.

Libbpf uses this mode if you create more than one socket tied to the
same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built-in one in libbpf that will route the traffic for you.

The second case is when you share a UMEM between sockets that are
bound to different queue ids and/or netdevs. In this case you have to
create one FILL ring and one COMPLETION ring for each unique
netdev,queue_id pair. Let us say you want to create two sockets bound
to two different queue ids on the same netdev. Create the first socket
and bind it in the normal way. Create a second socket and create an RX
and a TX ring, or at least one of them, and then one FILL and
COMPLETION ring for this socket. Then in the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.
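
The bind step for the second socket can be sketched as follows. This
is a minimal illustration that only prepares the address structure; it
assumes the uapi header linux/if_xdp.h is available, and the helper
name fill_shared_bind_addr is hypothetical. The fallback definition of
AF_XDP is an assumption for older libc headers.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44   /* assumed fallback for older libc headers */
#endif

/*
 * Hypothetical helper: prepare the bind() address for a socket that
 * shares the UMEM registered on the socket referred to by umem_fd.
 */
static void fill_shared_bind_addr(struct sockaddr_xdp *sxdp,
                                  unsigned int ifindex,
                                  unsigned int queue_id, int umem_fd)
{
    memset(sxdp, 0, sizeof(*sxdp));
    sxdp->sxdp_family = AF_XDP;
    sxdp->sxdp_ifindex = ifindex;
    sxdp->sxdp_queue_id = queue_id;
    /* Reuse the UMEM that was registered on the first socket. */
    sxdp->sxdp_flags = XDP_SHARED_UMEM;
    sxdp->sxdp_shared_umem_fd = umem_fd;
}
```

The address would then be passed to bind() on the second socket, once
that socket's own rings have been created.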

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.

XDP_USE_NEED_WAKEUP bind flag
-----------------------------

This option adds support for a new flag called need_wakeup that is
present in the FILL ring and the TX ring, the rings for which user
space is a producer. When this option is set in the bind call, the
need_wakeup flag will be set if the kernel needs to be explicitly
woken up by a syscall to continue processing packets.
If the flag is
zero, no syscall is needed.

If the flag is set on the FILL ring, the application needs to call
poll() to be able to continue to receive packets on the RX ring. This
can happen, for example, when the kernel has detected that there are no
more buffers on the FILL ring and no buffers left on the RX HW ring of
the NIC. In this case, interrupts are turned off as the NIC cannot
receive any packets (as there are no buffers to put them in), and the
need_wakeup flag is set so that user space can put buffers on the
FILL ring and then call poll() so that the kernel driver can put these
buffers on the HW ring and start to receive packets.

If the flag is set for the TX ring, it means that the application
needs to explicitly notify the kernel to send any packets put on the
TX ring. This can be accomplished either by a poll() call, as in the
RX path, or by calling sendto().

An example of how to use this flag can be found in
samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
would look like this for the TX path:

.. code-block:: c

    if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
        sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);

I.e., only use the syscall if the flag is set.

We recommend that you always enable this mode as it usually leads to
better performance especially if you run the application and the
driver on the same core, but also if you use different cores for the
application and the kernel driver, as it reduces the number of
syscalls needed for the TX path.

XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts
------------------------------------------------------

These setsockopts set the number of descriptors that the RX, TX,
FILL, and COMPLETION rings respectively should have. It is mandatory
to set the size of at least one of the RX and TX rings. If you set
both, you will be able to both receive and send traffic from your
application, but if you only want to do one of them, you can save
resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.
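
Sizing a ring without libbpf could look roughly like the sketch below.
It assumes the uapi header linux/if_xdp.h is available; the helper
name xsk_set_ring_size is hypothetical, and the fallback definition of
SOL_XDP is an assumption for older libc headers. The power-of-two
check reflects the ring size requirement stated in the Rings section.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* assumed fallback for older libc headers */
#endif

/*
 * Hypothetical helper: size one ring of an AF_XDP socket. optname is
 * one of XDP_RX_RING, XDP_TX_RING, XDP_UMEM_FILL_RING or
 * XDP_UMEM_COMPLETION_RING. Returns 0 on success, -1 with errno set
 * on failure.
 */
static int xsk_set_ring_size(int fd, int optname, int entries)
{
    /* Ring sizes must be a power of two; reject others up front. */
    if (entries <= 0 || (entries & (entries - 1)) != 0) {
        errno = EINVAL;
        return -1;
    }

    return setsockopt(fd, SOL_XDP, optname, &entries, sizeof(entries));
}
```

An Rx-only socket would call this for XDP_RX_RING only, plus the FILL
and COMPLETION rings if it owns the UMEM.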

In libbpf, you can create an Rx-only socket by supplying NULL as the
tx argument, and a Tx-only socket by supplying NULL as the rx
argument, to the xsk_socket__create function.

If you create a Tx-only socket, we recommend that you do not put any
packets on the FILL ring. If you do this, drivers might think you are
going to receive something when you in fact will not, and this can
negatively impact performance.

XDP_UMEM_REG setsockopt
-----------------------

This setsockopt registers a UMEM to a socket. This is the area that
contains all the buffers that packets can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has a parameter called chunk_size that is the size that the UMEM
is divided into. It can only be 2K or 4K at the moment. If you have a
UMEM area that is 128K and a chunk size of 2K, this means that you
will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
area and that your largest packet size can be 2K.

There is also an option to set the headroom of each single buffer in
the UMEM. If you set this to N bytes, it means that the packet will
start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.
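
The parameters described above map onto struct xdp_umem_reg. The
sketch below fills in that structure and computes how many chunks the
area holds. It assumes the uapi header linux/if_xdp.h is available;
the helper name umem_reg_init and its sanity checks are hypothetical
and are not what the kernel itself enforces.

```c
#include <stdint.h>
#include <string.h>
#include <linux/if_xdp.h>

/*
 * Hypothetical helper: fill in a struct xdp_umem_reg for an area of
 * len bytes split into chunk_size-byte chunks. Returns the number of
 * chunks, or -1 if the parameters do not fit together.
 */
static long umem_reg_init(struct xdp_umem_reg *reg, void *area,
                          uint64_t len, uint32_t chunk_size,
                          uint32_t headroom)
{
    if (chunk_size == 0 || len % chunk_size != 0 ||
        headroom >= chunk_size)
        return -1;

    /* Zero everything so that fields not set here, such as flags,
     * start out as 0. */
    memset(reg, 0, sizeof(*reg));
    reg->addr = (uint64_t)(uintptr_t)area;
    reg->len = len;
    reg->chunk_size = chunk_size;
    reg->headroom = headroom;

    return (long)(len / chunk_size);
}
```

For a 128K area and 2K chunks this returns 64, matching the example in
the text. The filled structure would then be handed to the
XDP_UMEM_REG setsockopt.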

SO_BINDTODEVICE setsockopt
--------------------------

This is a generic SOL_SOCKET option that can be used to tie an AF_XDP
socket to a particular network interface. It is useful when a socket
is created by a privileged process and passed to a non-privileged one.
Once the option is set, the kernel will refuse attempts to bind that
socket to a different interface. Updating the value requires
CAP_NET_RAW.

XDP_STATISTICS getsockopt
-------------------------

Gets drop statistics of a socket that can be useful for debug
purposes. The supported statistics are shown below:

.. code-block:: c

    struct xdp_statistics {
        __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
        __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
        __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
    };

XDP_OPTIONS getsockopt
----------------------

Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
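
Querying the mode of a socket could be sketched as below. It assumes
the uapi header linux/if_xdp.h provides struct xdp_options and
XDP_OPTIONS_ZEROCOPY; the helper name xsk_is_zerocopy is hypothetical,
and the fallback definition of SOL_XDP is an assumption for older libc
headers.

```c
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* assumed fallback for older libc headers */
#endif

/*
 * Hypothetical helper: returns 1 if the AF_XDP socket fd runs in
 * zero-copy mode, 0 if it runs in copy mode and -1 on error.
 */
static int xsk_is_zerocopy(int fd)
{
    struct xdp_options opts = { 0 };
    socklen_t optlen = sizeof(opts);

    if (getsockopt(fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen) < 0)
        return -1;

    return (opts.flags & XDP_OPTIONS_ZEROCOPY) ? 1 : 0;
}
```

The same getsockopt() pattern applies to XDP_STATISTICS with a struct
xdp_statistics as the buffer.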

Multi-Buffer Support
====================

With multi-buffer support, programs using AF_XDP sockets can receive
and transmit packets consisting of multiple buffers both in copy and
zero-copy mode. For example, a packet can consist of two
frames/buffers, one with the header and the other one with the data,
or a 9K Ethernet jumbo frame can be constructed by chaining together
three 4K frames.

Some definitions:

* A packet consists of one or more frames

* A descriptor in one of the AF_XDP rings always refers to a single
  frame. In the case the packet consists of a single frame, the
  descriptor refers to the whole packet.

To enable multi-buffer support for an AF_XDP socket, use the new bind
flag XDP_USE_SG. If this is not provided, all multi-buffer packets
will be dropped just as before. Note that the XDP program loaded also
needs to be in multi-buffer mode. This can be accomplished by using
"xdp.frags" as the section name of the XDP program used.

To represent a packet consisting of multiple frames, a new flag called
XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
descriptors.
If it is true (1) the packet continues with the next
descriptor and if it is false (0) it means this is the last descriptor
of the packet. Why the reverse logic of the end-of-packet (eop) flag
found in many NICs? This preserves compatibility with non-multi-buffer
applications that have this bit set to false for all packets on Rx,
and the apps set the options field to zero for Tx, as anything else
will be treated as an invalid descriptor.

These are the semantics for producing packets onto the AF_XDP Tx ring
consisting of multiple frames:

* When an invalid descriptor is found, all the other
  descriptors/frames of this packet are marked as invalid and not
  completed. The next descriptor is treated as the start of a new
  packet, even if this was not the intent (because we cannot guess
  the intent). As before, if your program is producing invalid
  descriptors you have a bug that must be fixed.

* Zero length descriptors are treated as invalid descriptors.

* For copy mode, the maximum supported number of frames in a packet is
  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
  descriptors accumulated so far are dropped and treated as
  invalid. To produce an application that will work on any system
  regardless of this config setting, limit the number of frags to 18,
  as the minimum value of the config is 17.

* For zero-copy mode, the limit is up to what the NIC HW
  supports. This is usually at least five on the NICs we have checked.
  We consciously chose to not enforce a rigid limit (such as
  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
  resulted in copy actions under the hood to fit into what limit the
  NIC supports. That would defeat the purpose of zero-copy mode. How
  to probe for this limit is explained in the "Probing for
  Multi-Buffer Support" section.

On the Rx path in copy-mode, the xsk core copies the XDP data into
multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
detailed before. Zero-copy mode works the same, though the data is not
copied. When the application gets a descriptor with the XDP_PKT_CONTD
flag set to one, it means that the packet consists of multiple buffers
and it continues with the next buffer in the following
descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
means that this is the last buffer of the packet. AF_XDP guarantees
that only a complete packet (all frames in the packet) is sent to the
application. If there is not enough space in the AF_XDP Rx ring, all
frames of the packet will be dropped.

If the application reads a batch of descriptors, using for example the
libxdp interfaces, it is not guaranteed that the batch will end with a
full packet.
It might end in the middle of a packet and the rest of the
buffers of that packet will arrive at the beginning of the next batch,
since the libxdp interface does not read the whole ring (unless you
have an enormous batch size or a very small ring size).

An example program each for Rx and Tx multi-buffer support can be found
later in this document.

Usage
-----

In order to use AF_XDP sockets, two parts are needed: the
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following:

.. code-block:: c

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       int index = ctx->rx_queue_index;

       // A set entry here means that the corresponding queue_id
       // has an active AF_XDP socket bound to it.
       if (bpf_map_lookup_elem(&xsks_map, &index))
           return bpf_redirect_map(&xsks_map, index, 0);

       return XDP_PASS;
   }

A simple but not very performant ring dequeue and enqueue could look
like this:

.. code-block:: c

    // struct xdp_rxtx_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     struct xdp_desc *desc;
    // };

    // struct xdp_umem_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     __u64 *desc;
    // };

    // typedef struct xdp_rxtx_ring RING;
    // typedef struct xdp_umem_ring RING;

    // typedef struct xdp_desc RING_TYPE;
    // typedef __u64 RING_TYPE;

    int dequeue_one(RING *ring, RING_TYPE *item)
    {
        __u32 entries = *ring->producer - *ring->consumer;

        if (entries == 0)
            return -1;

        // read-barrier! (read the producer index before the entry)

        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
        (*ring->consumer)++;
        return 0;
    }

    int enqueue_one(RING *ring, const RING_TYPE *item)
    {
        __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

        if (free_entries == 0)
            return -1;

        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

        // write-barrier! (write the entry before publishing the index)

        (*ring->producer)++;
        return 0;
    }

But please use the libbpf functions as they are optimized and ready to
use. This will make your life easier.

Usage Multi-Buffer Rx
---------------------

Here is a simple Rx path pseudo-code example (using libxdp interfaces
for simplicity). Error paths have been excluded to keep it short:

.. code-block:: c

    void rx_packets(struct xsk_socket_info *xsk)
    {
        static bool new_packet = true;
        u32 idx_rx = 0, idx_fq = 0;
        static char *pkt;

        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);

        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);

        for (int i = 0; i < rcvd; i++) {
            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
            bool eop = !(desc->options & XDP_PKT_CONTD);

            if (new_packet)
                pkt = frag;
            else
                add_frag_to_pkt(pkt, frag);

            if (eop)
                process_pkt(pkt);

            new_packet = eop;

            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
        }

        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
        xsk_ring_cons__release(&xsk->rx, rcvd);
    }

Usage Multi-Buffer Tx
---------------------

Here is an example Tx path pseudo-code (using libxdp interfaces for
simplicity) ignoring that the umem is finite in size, and that we
eventually will run out of packets to send. It also assumes that
pkts.addr points to a valid location in the umem.

.. code-block:: c

    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
                    int batch_size)
    {
        u32 idx, i, pkt_nb = 0;

        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);

        for (i = 0; i < batch_size;) {
            u64 addr = pkts[pkt_nb].addr;
            u32 len = pkts[pkt_nb].size;

            do {
                struct xdp_desc *tx_desc;

                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
                tx_desc->addr = addr;

                if (len > xsk_frame_size) {
                    tx_desc->len = xsk_frame_size;
                    tx_desc->options = XDP_PKT_CONTD;
                } else {
                    tx_desc->len = len;
                    tx_desc->options = 0;
                    pkt_nb++;
                }
                len -= tx_desc->len;
                addr += xsk_frame_size;

                if (i == batch_size) {
                    /* Remember len, addr, pkt_nb for next iteration.
                     * Skipped for simplicity.
                     */
                    break;
                }
            } while (len);
        }

        xsk_ring_prod__submit(&xsk->tx, i);
    }

Probing for Multi-Buffer Support
--------------------------------

To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
querying for XDP multi-buffer support. If XDP supports multi-buffer in
a driver, then AF_XDP will also support that in SKB and DRV mode.

To discover if a driver supports multi-buffer AF_XDP in zero-copy
mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
flag. If it is set, it means that at least zero-copy is supported and
you should go and check the netlink attribute
NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
value will be returned stating the max number of frags that are
supported by this device in zero-copy mode. These are the possible
return values:

1: Multi-buffer for zero-copy is not supported by this device, as max
   one fragment supported means that multi-buffer is not possible.

>=2: Multi-buffer is supported in zero-copy mode for this device. The
     returned number signifies the max number of frags supported.

For an example on how these are used through libbpf, please take a
look at tools/testing/selftests/bpf/xskxceiver.c.

Multi-Buffer Support for Zero-Copy Drivers
------------------------------------------

Zero-copy drivers usually use the batched APIs for Rx and Tx
processing. Note that the Tx batch API guarantees that it will provide
a batch of Tx descriptors that ends with a complete packet. This
facilitates extending a zero-copy driver with multi-buffer support.

Sample application
==================

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with private UMEMs. Say that
you would like your UDP traffic from port 4242 to end up in queue 16,
on which we will enable AF_XDP.
Here, we use ethtool for this::

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using::

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

This sample application uses libbpf to make the setup and usage of
AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
really used to make something more advanced, take a look at the libbpf
code in tools/lib/bpf/xsk.[ch].

FAQ
===

Q: I am not seeing any traffic on the socket. What am I doing wrong?

A: When a netdev of a physical NIC is initialized, Linux usually
   allocates one RX and TX queue pair per core. So on an 8 core system,
   queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
   bind call or the xsk_socket__create libbpf function call, you
   specify a specific queue id to bind to and it is only the traffic
   towards that queue you are going to get on your socket. So in the
   example above, if you bind to queue 0, you are NOT going to get any
   traffic that is distributed to queues 1 through 7. If you are
   lucky, you will see the traffic, but usually it will end up on one
   of the queues you have not bound to.

   There are a number of ways to solve the problem of getting the
   traffic you want to the queue id you bound to. If you want to see
   all the traffic, you can force the netdev to only have 1 queue, queue
   id 0, and then bind to queue 0. You can use ethtool to do this::

     sudo ethtool -L <interface> combined 1

   If you want to only see part of the traffic, you can program the
   NIC through ethtool to filter out your traffic to a single queue id
   that you can bind your XDP socket to. Here is one example in which
   UDP traffic to and from port 4242 is sent to queue 2::

     sudo ethtool -N <interface> rx-flow-hash udp4 fn
     sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
     4242 action 2

   A number of other ways are possible, all depending on the
   capabilities of the NIC you have.

Q: Can I use the XSKMAP to implement a switch between different umems
   in copy mode?

A: The short answer is no, that is not supported at the moment. The
   XSKMAP can only be used to switch traffic coming in on queue id X
   to sockets bound to the same queue id X. The XSKMAP can contain
   sockets bound to different queue ids, for example X and Y, but only
   traffic coming in from queue id Y can be directed to sockets bound
   to the same queue id Y. In zero-copy mode, you should use the
   switch, or other distribution mechanism, in your NIC to direct
   traffic to the correct queue id and socket.

Q: My packets are sometimes corrupted. What is wrong?

A: Care has to be taken not to feed the same buffer in the UMEM into
   more than one ring at the same time. If you for example feed the
   same buffer into the FILL ring and the TX ring at the same time, the
   NIC might receive data into the buffer at the same time it is
   sending it. This will cause some packets to become corrupted. The
   same thing goes for feeding the same buffer into the FILL rings
   belonging to different queue ids or netdevs bound with the
   XDP_SHARED_UMEM flag.
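
One way to catch the double-submission bug described in the last answer
during development is to track which side owns each UMEM frame. The
following is only a sketch under stated assumptions: the frame count,
frame size, and the frame_give/frame_take helpers are hypothetical and
not part of AF_XDP or libbpf. An application would call frame_give()
before placing an addr on the FILL or TX ring and frame_take() when the
addr comes back on the RX or COMPLETION ring.

```c
#include <stdint.h>

#define NUM_FRAMES 64u   /* assumed UMEM size: 64 frames */
#define FRAME_SIZE 2048u /* assumed chunk size: 2K */

enum frame_owner { OWNER_APP = 0, OWNER_FILL_RING, OWNER_TX_RING };

/* Zero-initialized, so every frame starts out owned by the application. */
static enum frame_owner owners[NUM_FRAMES];

/* Record that the frame at addr is handed to the FILL or TX ring.
 * Returns 0 on success, or -1 if addr is out of range, ring is not a
 * ring owner, or the frame is already owned by a ring.
 */
static int frame_give(uint64_t addr, enum frame_owner ring)
{
    uint64_t idx = addr / FRAME_SIZE;

    if (idx >= NUM_FRAMES || ring == OWNER_APP)
        return -1;
    if (owners[idx] != OWNER_APP)
        return -1;

    owners[idx] = ring;
    return 0;
}

/* Record that the frame at addr came back to the application.
 * Returns 0 on success, or -1 if addr is out of range or the frame
 * was not owned by a ring.
 */
static int frame_take(uint64_t addr)
{
    uint64_t idx = addr / FRAME_SIZE;

    if (idx >= NUM_FRAMES || owners[idx] == OWNER_APP)
        return -1;

    owners[idx] = OWNER_APP;
    return 0;
}
```

A failing frame_give() then points directly at the code path that
tried to put a buffer on a second ring while the first still held it.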

Credits
=======

- Björn Töpel (AF_XDP core)
- Magnus Karlsson (AF_XDP core)
- Alexander Duyck
- Alexei Starovoitov
- Daniel Borkmann
- Jesper Dangaard Brouer
- John Fastabend
- Jonathan Corbet (LWN coverage)
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn