.. SPDX-License-Identifier: GPL-2.0

======
AF_XDP
======

Overview
========

AF_XDP is an address family that is optimized for high performance
packet processing.

This document assumes that the reader is familiar with BPF and XDP. If
not, the Cilium project has an excellent reference guide at
http://cilium.readthedocs.io/en/latest/bpf/.

Using the XDP_REDIRECT action from an XDP program, the program can
redirect ingress frames to other XDP enabled netdevs, using the
bpf_redirect_map() function. AF_XDP sockets enable the possibility for
XDP programs to redirect frames to a memory buffer in a user-space
application.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two rings: the RX ring and the
TX ring. A socket can receive packets on the RX ring and it can send
packets on the TX ring. These rings are registered and sized with the
setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
to have at least one of these rings for each socket. An RX or TX
descriptor ring points to a data buffer in a memory area called a
UMEM. RX and TX can share the same UMEM so that a packet does not have
to be copied between RX and TX. Moreover, if a packet needs to be kept
for a while due to a possible retransmit, the descriptor that points
to that packet can be changed to point to another and reused right
away. This again avoids copying data.
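
As a minimal sketch of the socket creation and ring sizing described
above: the helper name xsk_open_with_rings() is an assumption made for
this illustration and is not part of any kernel or libbpf API. The
fallback definitions of AF_XDP and SOL_XDP are only there for C
libraries whose headers lack them. Creating the socket normally
requires sufficient privileges.

.. code-block:: c

   #include <errno.h>
   #include <sys/socket.h>
   #include <unistd.h>
   #include <linux/if_xdp.h>

   #ifndef AF_XDP
   #define AF_XDP 44
   #endif
   #ifndef SOL_XDP
   #define SOL_XDP 283
   #endif

   /* Hypothetical helper. Sizes are descriptor counts; 0 means "no ring". */
   static int xsk_open_with_rings(unsigned int rx_entries,
                                  unsigned int tx_entries)
   {
       int fd, saved_errno;

       /* At least one of the RX and TX rings is mandatory. */
       if (!rx_entries && !tx_entries) {
           errno = EINVAL;
           return -1;
       }
       /* Ring sizes must be a power of two. */
       if ((rx_entries & (rx_entries - 1)) || (tx_entries & (tx_entries - 1))) {
           errno = EINVAL;
           return -1;
       }

       fd = socket(AF_XDP, SOCK_RAW, 0);
       if (fd < 0)
           return -1;

       if (rx_entries && setsockopt(fd, SOL_XDP, XDP_RX_RING,
                                    &rx_entries, sizeof(rx_entries)) < 0)
           goto err;
       if (tx_entries && setsockopt(fd, SOL_XDP, XDP_TX_RING,
                                    &tx_entries, sizeof(tx_entries)) < 0)
           goto err;
       return fd;

   err:
       saved_errno = errno;
       close(fd);
       errno = saved_errno;
       return -1;
   }

A call such as xsk_open_with_rings(2048, 0) would give an RX-only
socket. The UMEM still has to be registered and the socket bound before
any traffic flows.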

The UMEM consists of a number of equally sized chunks. A descriptor in
one of the rings references a frame by referencing its addr. The addr
is simply an offset within the entire UMEM region. The user space
allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc.). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
FILL ring is used by the application to send down addr for the kernel
to fill in with RX packet data. References to these frames will then
appear in the RX ring once each packet has been received. The
COMPLETION ring, on the other hand, contains frame addr that the
kernel has transmitted completely and can now be used again by user
space, for either TX or RX. Thus, the frame addrs appearing in the
COMPLETION ring are addrs that were previously transmitted using the
TX ring. In summary, the RX and FILL rings are used for the RX path
and the TX and COMPLETION rings are used for the TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow.
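
The bind step could look roughly like the sketch below. The helper
names xsk_fill_sockaddr() and xsk_bind_queue() are hypothetical and
exist only for this example. struct sockaddr_xdp and its members come
from linux/if_xdp.h.

.. code-block:: c

   #include <string.h>
   #include <net/if.h>
   #include <sys/socket.h>
   #include <linux/if_xdp.h>

   #ifndef AF_XDP
   #define AF_XDP 44
   #endif

   /* Hypothetical helper: describe the device/queue pair to bind to. */
   static void xsk_fill_sockaddr(struct sockaddr_xdp *sxdp,
                                 unsigned int ifindex,
                                 unsigned int queue_id,
                                 unsigned short bind_flags)
   {
       memset(sxdp, 0, sizeof(*sxdp));
       sxdp->sxdp_family = AF_XDP;
       sxdp->sxdp_ifindex = ifindex;
       sxdp->sxdp_queue_id = queue_id;
       sxdp->sxdp_flags = bind_flags;
   }

   /* Hypothetical helper: bind XSK fd to one queue of the named netdev. */
   static int xsk_bind_queue(int fd, const char *ifname,
                             unsigned int queue_id)
   {
       struct sockaddr_xdp sxdp;
       unsigned int ifindex = if_nametoindex(ifname);

       if (!ifindex)
           return -1; /* no such interface */

       xsk_fill_sockaddr(&sxdp, ifindex, queue_id, 0);
       return bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
   }

For instance, xsk_bind_queue(fd, "eth0", 17) would attach the socket to
queue 17 of eth0, assuming the UMEM and rings were set up beforehand.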

The UMEM can be shared between processes, if desired. If a process
wants to do this, it simply skips the registration of the UMEM and its
corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
call and submits the XSK of the process it would like to share UMEM
with as well as its own newly created XSK socket. The new process will
then receive frame addr references in its own RX ring that point to
this shared UMEM. Note that since the ring structures are
single-consumer / single-producer (for performance reasons), the new
process has to create its own socket with associated RX and TX rings,
since it cannot share these with the other process. This is also the
reason that there is only one set of FILL and COMPLETION rings per
UMEM. It is the responsibility of a single process to handle the UMEM.

How are packets then distributed from an XDP program to the XSKs? There
is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
user-space application can place an XSK at an arbitrary place in this
map. The XDP program can then redirect a packet to a specific index in
this map and at this point XDP validates that the XSK in that map was
indeed bound to that device and ring number. If not, the packet is
dropped. If the map is empty at that index, the packet is also
dropped. This also means that it is currently mandatory to have an XDP
program loaded (and one XSK in the XSKMAP) to be able to get any
traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed that uses SKBs
together with the generic XDP support and copies out the data to user
space. This is a fallback mode that works for any network device. On
the other hand, if the driver has support for XDP, it will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space.

Concepts
========

In order to use an AF_XDP socket, a number of associated objects need
to be set up. These objects and their options are explained in the
following sections.

For an overview on how AF_XDP works, you can also take a look at the
Linux Plumbers paper from 2018 on the subject:
http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do
NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt
at AF_XDP. Nearly everything has changed since then. Jonathan Corbet has
also written an excellent article on LWN, "Accelerating networking
with AF_XDP". It can be found at https://lwn.net/Articles/750845/.

UMEM
----

UMEM is a region of virtually contiguous memory, divided into
equal-sized frames. A UMEM is associated with a netdev and a specific
queue id of that netdev. It is created and configured (chunk size,
headroom, start address and size) by using the XDP_UMEM_REG setsockopt
system call. A UMEM is bound to a netdev and queue id via the bind()
system call.

An AF_XDP socket is linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share a UMEM created via one socket A,
the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
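
A sketch of how socket B might describe this in its bind address is
shown below. The helper name xsk_fill_shared_sockaddr() and the
parameter umem_owner_fd (the file descriptor of socket A) are
assumptions made for this example.

.. code-block:: c

   #include <string.h>
   #include <sys/socket.h>
   #include <linux/if_xdp.h>

   #ifndef AF_XDP
   #define AF_XDP 44
   #endif

   /* Hypothetical helper: bind address for a socket that reuses A's UMEM. */
   static void xsk_fill_shared_sockaddr(struct sockaddr_xdp *sxdp,
                                        unsigned int ifindex,
                                        unsigned int queue_id,
                                        int umem_owner_fd)
   {
       memset(sxdp, 0, sizeof(*sxdp));
       sxdp->sxdp_family = AF_XDP;
       sxdp->sxdp_ifindex = ifindex;
       sxdp->sxdp_queue_id = queue_id;
       /* Ask the kernel to reuse the UMEM registered on socket A. */
       sxdp->sxdp_flags = XDP_SHARED_UMEM;
       sxdp->sxdp_shared_umem_fd = (unsigned int)umem_owner_fd;
   }

Socket B would then pass the filled structure to bind(), cast to struct
sockaddr \*, with sizeof(struct sockaddr_xdp) as the address length.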

The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.

Rings
-----

There are four different kinds of rings: FILL, COMPLETION, RX and
TX. All rings are single-producer/single-consumer, so the user-space
application needs explicit synchronization if multiple
processes/threads are reading/writing to them.

The UMEM uses two rings: FILL and COMPLETION. Each socket associated
with the UMEM must have an RX queue, TX queue or both. Say that there
is a setup with four sockets (all doing TX and RX). Then there will be
one FILL ring, one COMPLETION ring, four TX rings and four RX rings.

The rings are head(producer)/tail(consumer) based rings. A producer
writes the data ring at the index pointed out by the struct xdp_ring
producer member, and then increases the producer index. A consumer
reads the data ring at the index pointed out by the struct xdp_ring
consumer member, and then increases the consumer index.
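
The producer/consumer scheme can be modelled in a few lines of plain
C. The sketch below is an illustrative, single-threaded model only:
struct demo_ring and its functions are hypothetical, it does not
reproduce the kernel's struct xdp_ring layout, and it omits the memory
barriers that a ring shared with the kernel needs.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   #define RING_ENTRIES 8u

   static_assert((RING_ENTRIES & (RING_ENTRIES - 1)) == 0,
                 "ring size must be a power of two");

   /* Illustrative model of a ring carrying UMEM addrs. */
   struct demo_ring {
       uint32_t producer;            /* free-running head index */
       uint32_t consumer;            /* free-running tail index */
       uint64_t addrs[RING_ENTRIES]; /* the data ring */
   };

   /* Producer side: write at the producer index, then advance it. */
   static bool demo_ring_produce(struct demo_ring *r, uint64_t addr)
   {
       if (r->producer - r->consumer == RING_ENTRIES)
           return false; /* ring is full */
       r->addrs[r->producer & (RING_ENTRIES - 1)] = addr;
       r->producer++;
       return true;
   }

   /* Consumer side: read at the consumer index, then advance it. */
   static bool demo_ring_consume(struct demo_ring *r, uint64_t *addr)
   {
       if (r->consumer == r->producer)
           return false; /* ring is empty */
       *addr = r->addrs[r->consumer & (RING_ENTRIES - 1)];
       r->consumer++;
       return true;
   }

Because the indices run freely and are only masked when the data ring
is accessed, the difference producer - consumer is always the number of
occupied entries, even after the 32-bit counters wrap.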

The rings are configured and created via the _RING setsockopt system
calls and mmapped to user-space using the appropriate offset to mmap()
(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
XDP_UMEM_PGOFF_COMPLETION_RING).
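
A sketch of mapping the RX ring is shown below. It assumes that the
ring layout is queried with the XDP_MMAP_OFFSETS getsockopt from
linux/if_xdp.h and that the mapping must cover the descriptor array
that starts at the reported desc offset. The helper names
xsk_rx_map_len() and xsk_map_rx_ring() are hypothetical.

.. code-block:: c

   #include <stddef.h>
   #include <sys/mman.h>
   #include <sys/socket.h>
   #include <linux/if_xdp.h>

   #ifndef SOL_XDP
   #define SOL_XDP 283
   #endif

   /* Hypothetical helper: bytes to map for rx_entries RX descriptors. */
   static size_t xsk_rx_map_len(const struct xdp_mmap_offsets *off,
                                unsigned int rx_entries)
   {
       return off->rx.desc + (size_t)rx_entries * sizeof(struct xdp_desc);
   }

   /* Hypothetical helper: map the RX ring; MAP_FAILED on error. */
   static void *xsk_map_rx_ring(int fd, unsigned int rx_entries,
                                struct xdp_mmap_offsets *off)
   {
       socklen_t optlen = sizeof(*off);

       /* Ask the kernel where producer, consumer and descriptors live. */
       if (getsockopt(fd, SOL_XDP, XDP_MMAP_OFFSETS, off, &optlen) < 0)
           return MAP_FAILED;

       return mmap(NULL, xsk_rx_map_len(off, rx_entries),
                   PROT_READ | PROT_WRITE, MAP_SHARED,
                   fd, XDP_PGOFF_RX_RING);
   }

The other three rings would be mapped the same way, using their own
offsets in struct xdp_mmap_offsets, their own page offset constant and,
for FILL and COMPLETION, entries of 64-bit addrs rather than struct
xdp_desc.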

The size of each ring must be a power of two.

UMEM Fill Ring
~~~~~~~~~~~~~~

The FILL ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring. As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.

Frames passed to the kernel are used for the ingress path (RX rings).

The user application produces UMEM addrs to this ring. Note that, if
running the application with aligned chunk mode, the kernel will mask
the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
the addr will be masked off, meaning that 2048, 2050 and 3000 refer
to the same chunk. If the user application is run in the unaligned
chunks mode, then the incoming addr will be left untouched.
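
The aligned-mode masking can be reproduced in user space as follows.
The function name umem_chunk_base() is made up for this illustration;
it only mirrors the arithmetic described above and is not a kernel
interface.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>

   /* Hypothetical helper: start of the chunk that addr falls into. */
   static uint64_t umem_chunk_base(uint64_t addr, uint32_t chunk_size)
   {
       /* Masking only works for power-of-two chunk sizes. */
       assert(chunk_size && (chunk_size & (chunk_size - 1)) == 0);

       /* Clear the log2(chunk_size) least significant bits. */
       return addr & ~((uint64_t)chunk_size - 1);
   }

With a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to
2048, while 4096 starts the next chunk.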


UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~

The COMPLETION ring is used to transfer ownership of UMEM frames from
kernel-space to user-space. Just like the FILL ring, UMEM addrs are
used.

Frames passed from the kernel to user-space are frames that have been
sent (TX ring) and can be used by user-space again.

The user application consumes UMEM addrs from this ring.


RX Ring
~~~~~~~

The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains the UMEM
offset (addr) and the length of the data (len).

If no frames have been passed to the kernel via the FILL ring, no
descriptors will (or can) appear on the RX ring.

The user application consumes struct xdp_desc descriptors from this
ring.
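
Since addr is an offset into the UMEM, turning a consumed descriptor
into a pointer is a single addition. The sketch below assumes aligned
chunk mode, where addr is a plain offset; the helper name
xsk_desc_to_ptr() and the bounds check are additions for this example.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>
   #include <linux/if_xdp.h>

   /*
    * Hypothetical helper: locate the packet data of an RX descriptor.
    * Returns NULL if the descriptor does not fit inside the UMEM.
    */
   static void *xsk_desc_to_ptr(void *umem_area, uint64_t umem_size,
                                const struct xdp_desc *desc)
   {
       if (desc->addr >= umem_size || desc->len > umem_size - desc->addr)
           return NULL;

       /* addr is an offset from the start of the UMEM region. */
       return (uint8_t *)umem_area + desc->addr;
   }

The application would read desc->len bytes starting at the returned
pointer and later hand the frame back to the kernel by producing its
addr to the FILL ring.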

TX Ring
~~~~~~~

The TX ring is used to send frames. The struct xdp_desc descriptor is
filled (index, length and offset) and passed into the ring.

To start the transfer, a sendmsg() system call is required. This might
be relaxed in the future.

The user application produces struct xdp_desc descriptors to this
ring.

Libbpf
======

Libbpf is a helper library for eBPF and XDP that makes using these
technologies a lot simpler. It also contains specific helper functions
in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
contains two types of functions: those that can be used to make the
setup of an AF_XDP socket easier and ones that can be used in the data
plane to access the rings safely and quickly. To see an example of how
to use this API, please take a look at the sample application in
samples/bpf/xdpsock_user.c which uses libbpf for both setup and data
plane operations.

We recommend that you use this library unless you have become a power
user. It will make your program a lot simpler.

XSKMAP / BPF_MAP_TYPE_XSKMAP
============================

On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
is used in conjunction with bpf_redirect_map() to pass the ingress
frame to a socket.

The user application inserts the socket into the map, via the bpf()
system call.
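
Without libbpf, the insertion could be done with the raw bpf() system
call, as sketched below. The helper names xskmap_fill_update() and
xskmap_insert() are hypothetical, and map_fd is assumed to be the file
descriptor of an already created XSKMAP. BPF_MAP_UPDATE_ELEM, BPF_ANY
and union bpf_attr come from linux/bpf.h.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>
   #include <unistd.h>
   #include <sys/syscall.h>
   #include <linux/bpf.h>

   /* Hypothetical helper: describe "map[*key] = *xsk_fd" for bpf(). */
   static void xskmap_fill_update(union bpf_attr *attr, int map_fd,
                                  const int *key, const int *xsk_fd)
   {
       memset(attr, 0, sizeof(*attr));
       attr->map_fd = map_fd;
       attr->key = (uint64_t)(uintptr_t)key;
       attr->value = (uint64_t)(uintptr_t)xsk_fd;
       attr->flags = BPF_ANY; /* create or overwrite the entry */
   }

   /* Hypothetical helper: place socket xsk_fd at the given map index. */
   static int xskmap_insert(int map_fd, int index, int xsk_fd)
   {
       union bpf_attr attr;

       xskmap_fill_update(&attr, map_fd, &index, &xsk_fd);
       return (int)syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM,
                           &attr, sizeof(attr));
   }

The XDP program can then pass the same index to bpf_redirect_map() to
reach this socket.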

Note that if an XDP program tries to redirect to a socket that does
not match the queue configuration and netdev, the frame will be
dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
queue 17. Only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application (samples/bpf/) for an example.

Configuration Flags and Socket Options
======================================

These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.

XDP_COPY and XDP_ZEROCOPY bind flags
------------------------------------

When you bind a socket, the kernel will first try to use zero-copy
mode. If zero-copy is not supported, it will fall back on using copy
mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
socket into zero-copy mode or fail.

XDP_SHARED_UMEM bind flag
-------------------------

This flag enables you to bind multiple sockets to the same UMEM. It
works on the same queue id, between queue ids and between
netdevs/devices. In this mode, each socket has its own RX and TX
rings as usual, but you are going to have one or more FILL and
COMPLETION ring pairs. You have to create one of these pairs per
unique netdev and queue id tuple that you bind to.

We start with the case where we would like to share a UMEM between
sockets bound to the same netdev and queue id. The UMEM (tied to the
first socket created) will only have a single FILL ring and a single
COMPLETION ring as there is only one unique netdev,queue_id tuple that
we have bound to. To use this mode, create the first socket and bind
it in the normal way. Create a second socket and create an RX and a TX
ring, or at least one of them, but no FILL or COMPLETION rings as the
ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.

Which socket will a packet then arrive on? This is decided by the XDP
program. Put all the sockets in the XSKMAP and just indicate which
index in the array you would like to send each packet to. A simple
round-robin example of distributing packets is shown below:

.. code-block:: c

   #include <linux/bpf.h>
   #include "bpf_helpers.h"

   /* Must be a power of two: the index below is wrapped with a mask. */
   #define MAX_SOCKS 16

   /* XSKMAP: index -> AF_XDP socket, populated from user space. */
   struct {
       __uint(type, BPF_MAP_TYPE_XSKMAP);
       __uint(max_entries, MAX_SOCKS);
       __uint(key_size, sizeof(int));
       __uint(value_size, sizeof(int));
   } xsks_map SEC(".maps");

   /* Index of the socket that was chosen for the most recent packet. */
   static unsigned int rr;

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       /* Advance to the next socket, wrapping around at MAX_SOCKS. */
       rr = (rr + 1) & (MAX_SOCKS - 1);

       /* Redirect to the socket at index rr; drop if the slot is empty. */
       return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
   }

Note that since there is only a single set of FILL and COMPLETION
rings, and they are single producer, single consumer rings, you need
to make sure that multiple processes or threads do not use these rings
concurrently. There are no synchronization primitives in the
libbpf code that protect multiple users at this point in time.

Libbpf uses this mode if you create more than one socket tied to the
same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built in one in libbpf that will route the traffic for you.

The second case is when you share a UMEM between sockets that are
bound to different queue ids and/or netdevs. In this case you have to
create one FILL ring and one COMPLETION ring for each unique
netdev,queue_id pair. Let us say you want to create two sockets bound
to two different queue ids on the same netdev. Create the first socket
and bind it in the normal way. Create a second socket and create an RX
and a TX ring, or at least one of them, and then one FILL and
COMPLETION ring for this socket. Then in the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.

There is no need to supply an XDP program like the one in the previous
case where sockets were bound to the same queue id and
device. Instead, use the NIC's packet steering capabilities to steer
the packets to the right queue. In the previous example, there is only
one queue shared among sockets, so the NIC cannot do this steering. It
can only steer between queues.

In libbpf, you need to use the xsk_socket__create_shared() API as it
takes a reference to a FILL ring and a COMPLETION ring that will be
created for you and bound to the shared UMEM. You can use this
function for all the sockets you create, or you can use it for the
second and following ones and use xsk_socket__create() for the first
one. Both methods yield the same result.

Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
devices at the same time.

XDP_USE_NEED_WAKEUP bind flag
-----------------------------

This option adds support for a new flag called need_wakeup that is
present in the FILL ring and the TX ring, the rings for which user
space is a producer. When this option is set in the bind call, the
need_wakeup flag will be set if the kernel needs to be explicitly
woken up by a syscall to continue processing packets. If the flag is
zero, no syscall is needed.

If the flag is set on the FILL ring, the application needs to call
poll() to be able to continue to receive packets on the RX ring. This
can happen, for example, when the kernel has detected that there are no
more buffers on the FILL ring and no buffers left on the RX HW ring of
the NIC. In this case, interrupts are turned off as the NIC cannot
receive any packets (as there are no buffers to put them in), and the
need_wakeup flag is set so that user space can put buffers on the
FILL ring and then call poll() so that the kernel driver can put these
buffers on the HW ring and start to receive packets.

If the flag is set for the TX ring, it means that the application
needs to explicitly notify the kernel to send any packets put on the
TX ring. This can be accomplished either by a poll() call, as in the
RX path, or by calling sendto().

An example of how to use this flag can be found in
samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
would look like this for the TX path:

.. code-block:: c

   if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
       sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);

I.e., only use the syscall if the flag is set.

We recommend that you always enable this mode as it usually leads to
better performance especially if you run the application and the
driver on the same core, but also if you use different cores for the
application and the kernel driver, as it reduces the number of
syscalls needed for the TX path.
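
For the FILL ring side, a libbpf-free sketch is given below. It assumes
that the application has already located the flags word of the mmapped
FILL ring (for example via the flags offset reported by the
XDP_MMAP_OFFSETS getsockopt). The helper names ring_needs_wakeup() and
fill_ring_kick() are hypothetical.

.. code-block:: c

   #include <poll.h>
   #include <stdbool.h>
   #include <stdint.h>
   #include <linux/if_xdp.h>

   /* ring_flags points at the flags word of a mmapped FILL or TX ring. */
   static bool ring_needs_wakeup(const volatile uint32_t *ring_flags)
   {
       return (*ring_flags & XDP_RING_NEED_WAKEUP) != 0;
   }

   /*
    * Hypothetical helper: call after putting buffers on the FILL ring.
    * Returns 0 if no syscall was needed, otherwise the result of poll().
    */
   static int fill_ring_kick(int xsk_fd, const volatile uint32_t *fill_flags)
   {
       struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };

       if (!ring_needs_wakeup(fill_flags))
           return 0;

       /* Zero timeout: wake the kernel without blocking. */
       return poll(&pfd, 1, 0);
   }

As with the TX example, the syscall is only issued when the kernel has
set the flag.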

XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts
------------------------------------------------------

These setsockopts set the number of descriptors that the RX, TX,
FILL, and COMPLETION rings respectively should have. It is mandatory
to set the size of at least one of the RX and TX rings. If you set
both, you will be able to both receive and send traffic from your
application, but if you only want to do one of them, you can save
resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.

In libbpf, you can create Rx-only and Tx-only sockets by supplying
NULL to the rx and tx arguments, respectively, to the
xsk_socket__create function.

If you create a Tx-only socket, we recommend that you do not put any
packets on the FILL ring. If you do this, drivers might think you are
going to receive something when you in fact will not, and this can
negatively impact performance.

XDP_UMEM_REG setsockopt
-----------------------

This setsockopt registers a UMEM to a socket. This is the area that
contains all the buffers that packets can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has a parameter called chunk_size that is the size that the UMEM is
divided into. It can only be 2K or 4K at the moment. If you have a
UMEM area that is 128K and a chunk size of 2K, this means that you
will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
area and that your largest packet size can be 2K.

There is also an option to set the headroom of each single buffer in
the UMEM. If you set this to N bytes, it means that the packet will
start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.
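
The registration parameters map onto struct xdp_umem_reg from
linux/if_xdp.h. The sketch below fills that structure and returns the
number of chunks. The helper names umem_fill_reg() and umem_register()
are hypothetical, and the assert only encodes the 2K/4K restriction
stated above.

.. code-block:: c

   #include <assert.h>
   #include <stdint.h>
   #include <string.h>
   #include <sys/socket.h>
   #include <linux/if_xdp.h>

   #ifndef SOL_XDP
   #define SOL_XDP 283
   #endif

   /* Hypothetical helper: fill reg and return the number of chunks. */
   static uint64_t umem_fill_reg(struct xdp_umem_reg *reg, void *area,
                                 uint64_t size, uint32_t chunk_size,
                                 uint32_t headroom)
   {
       assert(chunk_size == 2048 || chunk_size == 4096);

       memset(reg, 0, sizeof(*reg));
       reg->addr = (uint64_t)(uintptr_t)area; /* start of the UMEM area */
       reg->len = size;                       /* total size in bytes */
       reg->chunk_size = chunk_size;
       reg->headroom = headroom;              /* bytes reserved per chunk */

       return size / chunk_size;
   }

   /* Hypothetical helper: register the described UMEM on XSK fd. */
   static int umem_register(int fd, const struct xdp_umem_reg *reg)
   {
       return setsockopt(fd, SOL_XDP, XDP_UMEM_REG, reg, sizeof(*reg));
   }

For a 128K area and a 2K chunk size, umem_fill_reg() returns 64, which
matches the calculation above.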
43562306a36Sopenharmony_ci
43662306a36Sopenharmony_ciSO_BINDTODEVICE setsockopt
43762306a36Sopenharmony_ci--------------------------
43862306a36Sopenharmony_ci
43962306a36Sopenharmony_ciThis is a generic SOL_SOCKET option that can be used to tie AF_XDP
44062306a36Sopenharmony_cisocket to a particular network interface.  It is useful when a socket
44162306a36Sopenharmony_ciis created by a privileged process and passed to a non-privileged one.
44262306a36Sopenharmony_ciOnce the option is set, kernel will refuse attempts to bind that socket
44362306a36Sopenharmony_cito a different interface.  Updating the value requires CAP_NET_RAW.
44462306a36Sopenharmony_ci
XDP_STATISTICS getsockopt
-------------------------

Gets drop statistics of a socket that can be useful for debug
purposes. The supported statistics are shown below:

.. code-block:: c

   struct xdp_statistics {
       __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
       __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
       __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
   };
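
Reading the statistics could look roughly like this. The helper name
xsk_get_stats() is hypothetical, and the SOL_XDP fallback value is an
assumption for C libraries whose headers do not define it:

```c
#include <linux/if_xdp.h>
#include <sys/socket.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* assumed fallback when the C library lacks it */
#endif

/* Fill *stats with the drop counters of the AF_XDP socket fd.
 * Returns 0 on success and -1 with errno set on failure.
 */
static int xsk_get_stats(int fd, struct xdp_statistics *stats)
{
    socklen_t optlen = sizeof(*stats);

    return getsockopt(fd, SOL_XDP, XDP_STATISTICS, stats, &optlen);
}
```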
45862306a36Sopenharmony_ci
XDP_OPTIONS getsockopt
----------------------

Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
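
As a sketch, checking for zero-copy on a bound socket might look like
this. The helper name xsk_is_zerocopy() is hypothetical, and the
SOL_XDP fallback value is an assumption for older C library headers:

```c
#include <linux/if_xdp.h>
#include <sys/socket.h>

#ifndef SOL_XDP
#define SOL_XDP 283 /* assumed fallback when the C library lacks it */
#endif

/* Returns 1 if the socket runs in zero-copy mode, 0 if it runs in
 * copy mode and -1 (with errno set) if the query failed.
 */
static int xsk_is_zerocopy(int fd)
{
    struct xdp_options opts = { 0 };
    socklen_t optlen = sizeof(opts);

    if (getsockopt(fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen) < 0)
        return -1;

    return (opts.flags & XDP_OPTIONS_ZEROCOPY) ? 1 : 0;
}
```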
46462306a36Sopenharmony_ci
Multi-Buffer Support
====================

With multi-buffer support, programs using AF_XDP sockets can receive
and transmit packets consisting of multiple buffers both in copy and
zero-copy mode. For example, a packet can consist of two
frames/buffers, one with the header and the other one with the data,
or a 9K Ethernet jumbo frame can be constructed by chaining together
three 4K frames.

Some definitions:

* A packet consists of one or more frames

* A descriptor in one of the AF_XDP rings always refers to a single
  frame. In the case the packet consists of a single frame, the
  descriptor refers to the whole packet.

To enable multi-buffer support for an AF_XDP socket, use the new bind
flag XDP_USE_SG. If this is not provided, all multi-buffer packets
will be dropped just as before. Note that the XDP program loaded also
needs to be in multi-buffer mode. This can be accomplished by using
"xdp.frags" as the section name of the XDP program used.
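
A sketch of preparing the bind address with the flag set is shown
below. The helper name xsk_fill_sg_addr() is hypothetical, and the
fallback definitions are assumptions for systems with older headers:

```c
#include <linux/if_xdp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44 /* assumed fallback for older C library headers */
#endif

#ifndef XDP_USE_SG
#define XDP_USE_SG (1 << 4) /* assumed fallback for older uapi headers */
#endif

/* Prepare the address passed to bind() so that the socket accepts
 * multi-buffer packets on the given interface and queue.
 */
static void xsk_fill_sg_addr(struct sockaddr_xdp *sxdp,
                             unsigned int ifindex, unsigned int queue_id)
{
    memset(sxdp, 0, sizeof(*sxdp));
    sxdp->sxdp_family = AF_XDP;
    sxdp->sxdp_ifindex = ifindex;
    sxdp->sxdp_queue_id = queue_id;
    sxdp->sxdp_flags = XDP_USE_SG;
}
```

The filled-in structure would then be handed to bind() together with
the AF_XDP socket, cast to a struct sockaddr pointer. Other bind flags
can be OR-ed into sxdp_flags as needed.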
48862306a36Sopenharmony_ci
To represent a packet consisting of multiple frames, a new flag called
XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
descriptors. If it is true (1) the packet continues with the next
descriptor and if it is false (0) it means this is the last descriptor
of the packet. Why use the reverse logic of the end-of-packet (eop)
flag found in many NICs? To preserve compatibility with
non-multi-buffer applications: they have this bit set to false for all
packets on Rx, and they set the options field to zero for Tx, as
anything else will be treated as an invalid descriptor.

These are the semantics for producing packets consisting of multiple
frames onto the AF_XDP Tx ring:

* When an invalid descriptor is found, all the other
  descriptors/frames of this packet are marked as invalid and not
  completed. The next descriptor is treated as the start of a new
  packet, even if this was not the intent (because we cannot guess
  the intent). As before, if your program is producing invalid
  descriptors you have a bug that must be fixed.

* Zero length descriptors are treated as invalid descriptors.

* For copy mode, the maximum supported number of frames in a packet is
  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
  descriptors accumulated so far are dropped and treated as
  invalid. To produce an application that will work on any system
  regardless of this config setting, limit the number of frags to 18,
  as the minimum value of the config is 17.

* For zero-copy mode, the limit is up to what the NIC HW
  supports. This is usually at least five on the NICs we have
  checked. We consciously chose not to enforce a rigid limit (such as
  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
  resulted in copy actions under the hood to fit within the limit the
  NIC supports, which defeats the purpose of zero-copy mode. How to
  probe for this limit is explained in the "Probing for Multi-Buffer
  Support" section.
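
The copy-mode limit can be checked before a packet is queued for
transmission. The following is only a sketch; the names
PORTABLE_MAX_FRAGS, frames_needed() and fits_copy_mode() are made up
for this example:

```c
#include <stdint.h>

/* Portable limit suggested above: minimum CONFIG_MAX_SKB_FRAGS + 1. */
#define PORTABLE_MAX_FRAGS 18

/* Number of descriptors needed for a packet of len bytes when each
 * frame carries at most frame_size bytes.
 */
static uint32_t frames_needed(uint32_t len, uint32_t frame_size)
{
    return (len + frame_size - 1) / frame_size;
}

/* Returns 1 if the packet can be sent in copy mode on any kernel
 * configuration, 0 otherwise. A zero length packet is rejected since
 * zero length descriptors are invalid.
 */
static int fits_copy_mode(uint32_t len, uint32_t frame_size)
{
    uint32_t n = frames_needed(len, frame_size);

    return n >= 1 && n <= PORTABLE_MAX_FRAGS;
}
```

For zero-copy mode, the constant would have to be replaced by the
limit probed from the device.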
52662306a36Sopenharmony_ci
On the Rx path in copy-mode, the xsk core copies the XDP data into
multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
detailed before. Zero-copy mode works the same, though the data is not
copied. When the application gets a descriptor with the XDP_PKT_CONTD
flag set to one, it means that the packet consists of multiple buffers
and it continues with the next buffer in the following
descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
means that this is the last buffer of the packet. AF_XDP guarantees
that only a complete packet (all frames in the packet) is sent to the
application. If there is not enough space in the AF_XDP Rx ring, all
frames of the packet will be dropped.

If an application reads a batch of descriptors, using for example the
libxdp interfaces, it is not guaranteed that the batch will end with a
full packet. It might end in the middle of a packet and the rest of the
buffers of that packet will arrive at the beginning of the next batch,
since the libxdp interface does not read the whole ring (unless you
have an enormous batch size or a very small ring size).
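
Because a batch can end in the middle of a packet, the application
needs to keep its reassembly state between batches. A minimal sketch
of such state is shown below. The struct rx_state and the rx_feed()
helper are hypothetical, and the fallback definition of XDP_PKT_CONTD
is an assumption for builds without the uapi header:

```c
#include <stdbool.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0) /* assumed to match linux/if_xdp.h */
#endif

/* Reassembly state that must survive from one batch to the next. */
struct rx_state {
    uint32_t frags; /* frames seen so far of the packet in progress */
    uint32_t bytes; /* bytes seen so far of the packet in progress */
};

/* Account for one Rx descriptor with the given len and options.
 * Returns true and stores the total length in *pkt_len when the
 * descriptor completes a packet, false if more frames will follow.
 */
static bool rx_feed(struct rx_state *st, uint32_t len, uint32_t options,
                    uint32_t *pkt_len)
{
    st->frags++;
    st->bytes += len;

    if (options & XDP_PKT_CONTD)
        return false;

    *pkt_len = st->bytes;
    st->frags = 0;
    st->bytes = 0;
    return true;
}
```

The state is kept outside of the per-batch loop, so a packet whose
first frames arrive at the end of one batch is completed by the frames
at the start of the next one.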
54562306a36Sopenharmony_ci
An example program each for Rx and Tx multi-buffer support can be found
later in this document.

Usage
-----

In order to use AF_XDP sockets, two parts are needed: the user-space
application and the XDP program. For a complete setup and usage
example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side is part of libbpf.

The XDP code sample included in tools/lib/bpf/xsk.c is the following:

.. code-block:: c

   SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
   {
       int index = ctx->rx_queue_index;

       // A set entry here means that the corresponding queue_id
       // has an active AF_XDP socket bound to it.
       if (bpf_map_lookup_elem(&xsks_map, &index))
           return bpf_redirect_map(&xsks_map, index, 0);

       return XDP_PASS;
   }
57262306a36Sopenharmony_ci
A simple but not so performant ring dequeue and enqueue could look
like this:

.. code-block:: c

    // struct xdp_rxtx_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     struct xdp_desc *desc;
    // };

    // struct xdp_umem_ring {
    //     __u32 *producer;
    //     __u32 *consumer;
    //     __u64 *desc;
    // };

    // typedef struct xdp_rxtx_ring RING;
    // typedef struct xdp_umem_ring RING;

    // typedef struct xdp_desc RING_TYPE;
    // typedef __u64 RING_TYPE;

    int dequeue_one(RING *ring, RING_TYPE *item)
    {
        __u32 entries = *ring->producer - *ring->consumer;

        if (entries == 0)
            return -1;

        // read-barrier!

        *item = ring->desc[*ring->consumer & (RING_SIZE - 1)];
        (*ring->consumer)++;
        return 0;
    }

    int enqueue_one(RING *ring, const RING_TYPE *item)
    {
        __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer);

        if (free_entries == 0)
            return -1;

        ring->desc[*ring->producer & (RING_SIZE - 1)] = *item;

        // write-barrier!

        (*ring->producer)++;
        return 0;
    }

But please use the libbpf functions as they are optimized and ready to
use. They will make your life easier.
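
For illustration only, the barriers marked in the pseudo-code can be
expressed with C11 acquire/release atomics. The sketch below is a
self-contained single-producer/single-consumer ring that lives in
ordinary process memory; the names spsc_ring, ring_enqueue() and
ring_dequeue() are made up for this example. It is not how the real
rings, which are shared with the kernel through mmap(), are accessed:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* must be a power of two */

struct spsc_ring {
    _Atomic uint32_t producer;
    _Atomic uint32_t consumer;
    uint64_t desc[RING_SIZE];
};

/* Returns 0 on success and -1 if the ring is full. */
static int ring_enqueue(struct spsc_ring *r, uint64_t item)
{
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

    if (prod - cons == RING_SIZE)
        return -1;

    r->desc[prod & (RING_SIZE - 1)] = item;

    /* Release: the entry must be visible before the new index is. */
    atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
    return 0;
}

/* Returns 0 on success and -1 if the ring is empty. */
static int ring_dequeue(struct spsc_ring *r, uint64_t *item)
{
    uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
    /* Acquire: read the entry only after seeing the producer index. */
    uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

    if (prod == cons)
        return -1;

    *item = r->desc[cons & (RING_SIZE - 1)];

    /* Release: the slot may be reused once the new index is visible. */
    atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
    return 0;
}
```

The unsigned subtraction of the two indices keeps working when the
32-bit counters wrap around, which is why the indices are only masked
when the descriptor array is accessed.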
62762306a36Sopenharmony_ci
Usage Multi-Buffer Rx
---------------------

Here is a simple Rx path pseudo-code example (using libxdp interfaces
for simplicity). Error paths have been excluded to keep it short:

.. code-block:: c

    void rx_packets(struct xsk_socket_info *xsk)
    {
        static bool new_packet = true;
        u32 idx_rx = 0, idx_fq = 0;
        static char *pkt;

        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);

        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);

        for (int i = 0; i < rcvd; i++) {
            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
            bool eop = !(desc->options & XDP_PKT_CONTD);

            if (new_packet)
                pkt = frag;
            else
                add_frag_to_pkt(pkt, frag);

            if (eop)
                process_pkt(pkt);

            new_packet = eop;

            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
        }

        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
        xsk_ring_cons__release(&xsk->rx, rcvd);
    }
66762306a36Sopenharmony_ci
Usage Multi-Buffer Tx
---------------------

Here is a Tx path pseudo-code example (using libxdp interfaces for
simplicity). It ignores that the umem is finite in size and that we
will eventually run out of packets to send. It also assumes that
pkts.addr points to a valid location in the umem.

.. code-block:: c

    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
                    int batch_size)
    {
        u32 idx, i, pkt_nb = 0;

        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);

        for (i = 0; i < batch_size;) {
            u64 addr = pkts[pkt_nb].addr;
            u32 len = pkts[pkt_nb].size;

            do {
                struct xdp_desc *tx_desc;

                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
                tx_desc->addr = addr;

                if (len > xsk_frame_size) {
                    tx_desc->len = xsk_frame_size;
                    tx_desc->options = XDP_PKT_CONTD;
                } else {
                    tx_desc->len = len;
                    tx_desc->options = 0;
                    pkt_nb++;
                }
                len -= tx_desc->len;
                addr += xsk_frame_size;

                if (i == batch_size) {
                    /* Remember len, addr, pkt_nb for next iteration.
                     * Skipped for simplicity.
                     */
                    break;
                }
            } while (len);
        }

        xsk_ring_prod__submit(&xsk->tx, i);
    }
71762306a36Sopenharmony_ci
Probing for Multi-Buffer Support
--------------------------------

To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
querying for XDP multi-buffer support. If XDP supports multi-buffer in
a driver, then AF_XDP will also support that in SKB and DRV mode.

To discover if a driver supports multi-buffer AF_XDP in zero-copy
mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
flag. If it is set, it means that at least zero-copy is supported and
you should then check the netlink attribute
NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
value will be returned stating the max number of frags that are
supported by this device in zero-copy mode. These are the possible
return values:

1: Multi-buffer for zero-copy is not supported by this device, as max
   one fragment supported means that multi-buffer is not possible.

>=2: Multi-buffer is supported in zero-copy mode for this device. The
     returned number signifies the max number of frags supported.

For an example on how these are used through libbpf, please take a
look at tools/testing/selftests/bpf/xskxceiver.c.
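
Once the feature bits and the max segments attribute have been
retrieved over netlink, interpreting them could be sketched as
follows. The helper names are hypothetical, and the two ACT_* constants
are local copies that are assumed to mirror NETDEV_XDP_ACT_XSK_ZEROCOPY
and NETDEV_XDP_ACT_RX_SG in linux/netdev.h:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the bit values in enum netdev_xdp_act. */
#define ACT_XSK_ZEROCOPY 8u
#define ACT_RX_SG        32u

/* Multi-buffer in SKB and DRV mode follows XDP multi-buffer support. */
static bool copy_mode_multibuf(uint64_t xdp_features)
{
    return (xdp_features & ACT_RX_SG) != 0;
}

/* Returns the max number of frags per packet in zero-copy mode, or 0
 * if multi-buffer is not available in zero-copy mode on this device.
 */
static uint32_t zc_multibuf_max_frags(uint64_t xdp_features,
                                      uint32_t zc_max_segs)
{
    if (!(xdp_features & ACT_XSK_ZEROCOPY))
        return 0;

    return zc_max_segs >= 2 ? zc_max_segs : 0;
}
```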
74462306a36Sopenharmony_ci
Multi-Buffer Support for Zero-Copy Drivers
------------------------------------------

Zero-copy drivers usually use the batched APIs for Rx and Tx
processing. Note that the Tx batch API guarantees that it will provide
a batch of Tx descriptors that ends with a full packet. This is to
facilitate extending a zero-copy driver with multi-buffer support.
75262306a36Sopenharmony_ci
Sample application
==================

There is an xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with private UMEMs. Say that
you would like your UDP traffic from port 4242 to end up in queue 16,
on which we will enable AF_XDP. Here, we use ethtool for this::

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using::

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

This sample application uses libbpf to make the setup and usage of
AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
really used to make something more advanced, take a look at the libbpf
code in tools/lib/bpf/xsk.[ch].
77762306a36Sopenharmony_ci
FAQ
===

Q: I am not seeing any traffic on the socket. What am I doing wrong?

A: When a netdev of a physical NIC is initialized, Linux usually
   allocates one RX and TX queue pair per core. So on an 8 core system,
   queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
   bind call or the xsk_socket__create libbpf function call, you
   specify a specific queue id to bind to and it is only the traffic
   towards that queue you are going to get on your socket. So in the
   example above, if you bind to queue 0, you are NOT going to get any
   traffic that is distributed to queues 1 through 7. If you are
   lucky, you will see the traffic, but usually it will end up on one
   of the queues you have not bound to.

   There are a number of ways to solve the problem of getting the
   traffic you want to the queue id you bound to. If you want to see
   all the traffic, you can force the netdev to only have 1 queue, queue
   id 0, and then bind to queue 0. You can use ethtool to do this::

     sudo ethtool -L <interface> combined 1

   If you want to only see part of the traffic, you can program the
   NIC through ethtool to filter out your traffic to a single queue id
   that you can bind your XDP socket to. Here is one example in which
   UDP traffic to and from port 4242 is sent to queue 2::

     sudo ethtool -N <interface> rx-flow-hash udp4 fn
     sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
     4242 action 2

   A number of other ways are possible, all depending on the
   capabilities of the NIC you have.
81262306a36Sopenharmony_ci
Q: Can I use the XSKMAP to implement a switch between different umems
   in copy mode?

A: The short answer is no, that is not supported at the moment. The
   XSKMAP can only be used to switch traffic coming in on queue id X
   to sockets bound to the same queue id X. The XSKMAP can contain
   sockets bound to different queue ids, for example X and Y, but only
   traffic coming in from queue id Y can be directed to sockets bound
   to the same queue id Y. In zero-copy mode, you should use the
   switch, or other distribution mechanism, in your NIC to direct
   traffic to the correct queue id and socket.
82462306a36Sopenharmony_ci
Q: My packets are sometimes corrupted. What is wrong?

A: Care has to be taken not to feed the same buffer in the UMEM into
   more than one ring at the same time. If you for example feed the
   same buffer into the FILL ring and the TX ring at the same time, the
   NIC might receive data into the buffer at the same time it is
   sending it. This will cause some packets to become corrupted. Same
   thing goes for feeding the same buffer into the FILL rings
   belonging to different queue ids or netdevs bound with the
   XDP_SHARED_UMEM flag.
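
One way to avoid handing the same buffer to two rings is to let a
single allocator in the application own every free buffer address. The
following is only a sketch of that idea; the frame_pool structure and
its helpers are hypothetical and not part of libbpf or libxdp:

```c
#include <stdint.h>

#define NUM_FRAMES    4u
#define FRAME_SIZE    2048u
#define INVALID_FRAME UINT64_MAX

/* Stack of UMEM addresses that are currently not in any ring. */
struct frame_pool {
    uint64_t free_addrs[NUM_FRAMES];
    uint32_t nfree;
};

static void pool_init(struct frame_pool *p)
{
    for (uint32_t i = 0; i < NUM_FRAMES; i++)
        p->free_addrs[i] = (uint64_t)i * FRAME_SIZE;
    p->nfree = NUM_FRAMES;
}

/* Take a buffer before putting it on the FILL or TX ring. */
static uint64_t pool_get(struct frame_pool *p)
{
    if (p->nfree == 0)
        return INVALID_FRAME;
    return p->free_addrs[--p->nfree];
}

/* Give a buffer back once it has come out of the RX or COMPLETION
 * ring and the application is done with it. Returns -1 if the pool
 * is already full, which would indicate a double free.
 */
static int pool_put(struct frame_pool *p, uint64_t addr)
{
    if (p->nfree == NUM_FRAMES)
        return -1;
    p->free_addrs[p->nfree++] = addr;
    return 0;
}
```

As long as every address placed on a ring comes from pool_get(), and
is only returned with pool_put() after the kernel has handed it back,
a buffer is never in two rings at once.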
83562306a36Sopenharmony_ci
Credits
=======

- Björn Töpel (AF_XDP core)
- Magnus Karlsson (AF_XDP core)
- Alexander Duyck
- Alexei Starovoitov
- Daniel Borkmann
- Jesper Dangaard Brouer
- John Fastabend
- Jonathan Corbet (LWN coverage)
- Michael S. Tsirkin
- Qi Z Zhang
- Willem de Bruijn