.. SPDX-License-Identifier: GPL-2.0

===========
Packet MMAP
===========

Abstract
========

This file documents the mmap() facility available with the PACKET
socket interface. This type of socket is used to

i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or for any other purpose that needs raw
    access to a network interface.

A howto can be found at:

    https://sites.google.com/site/packetmmap/

Please send your comments to
    - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
    - Johann Baudy

Why use PACKET_MMAP
===================

A capture process that does not use PACKET_MMAP (plain AF_PACKET) is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet; it requires two if you want to get the packet's timestamp
(like libpcap always does).

PACKET_MMAP, on the other hand, is very efficient. It provides a
size-configurable circular buffer mapped in user space that can be used to
either send or receive packets. This way, reading packets just requires waiting
for them; most of the time there is no need to issue a single system call.
Concerning transmission, multiple packets can be sent through one system call
to get the highest bandwidth. Using a shared buffer between the kernel and the
user also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
at high speeds (this is relative to the CPU speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI; also make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
supported by the devices of your network. CPU IRQ pinning of your network
interface card can also be an advantage.

How to use mmap() to improve capture process
============================================

From the user's standpoint, you should use the higher-level libpcap library,
which is a de facto standard, portable across nearly all operating systems
including Win32.

Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
TPACKET_V3 support was added in version 1.5.0.

How to use mmap() directly to improve capture process
=====================================================

From the system call standpoint, the use of PACKET_MMAP involves
the following process::


    [setup]     socket() -------> creation of the capture socket
                setsockopt() ---> allocation of the circular buffer (ring)
                                  option: PACKET_RX_RING
                mmap() ---------> mapping of the allocated buffer to the
                                  user process

    [capture]   poll() ---------> to wait for incoming packets

    [shutdown]  close() --------> destruction of the capture socket and
                                  deallocation of all associated
                                  resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP::

 int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link-level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link-level information capture is not
supported and a link-level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

As without PACKET_MMAP, it is possible to use one socket
for capture and transmission. This can be done by mapping the
allocated RX and TX buffer rings with a single mmap() call.
See "Mapping and use of the circular buffer (ring)".

Next I will describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.

How to use mmap() directly to improve transmission process
==========================================================

The transmission process is similar to capture, as shown below::

    [setup]         socket() -------> creation of the transmission socket
                    setsockopt() ---> allocation of the circular buffer (ring)
                                      option: PACKET_TX_RING
                    bind() ---------> bind transmission socket with a network interface
                    mmap() ---------> mapping of the allocated buffer to the
                                      user process

    [transmission]  poll() ---------> wait for free packets (optional)
                    send() ---------> send all packets that are set as ready in
                                      the ring
                                      The flag MSG_DONTWAIT can be used to return
                                      before end of transfer.

    [shutdown]      close() --------> destruction of the transmission socket and
                                      deallocation of all associated resources.

Socket creation and destruction is also straightforward, and is done
the same way as for capture, described in the previous section::

 int fd = socket(PF_PACKET, mode, 0);

The protocol can optionally be 0 in case we only want to transmit
via this socket, which avoids an expensive call to packet_rcv().
In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.

Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.

As in capture, each frame contains two parts::

    --------------------
    | struct tpacket_hdr | Header. It contains the status of
    |                    | this frame
    |--------------------|
    | data buffer        |
    .                    .  Data that will be sent over the network interface.
    .                    .
    --------------------

 bind() associates the socket with your network interface thanks to
 the sll_ifindex parameter of struct sockaddr_ll.

 Initialization example::

    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...

    /* copy the interface name, keeping it NUL-terminated */
    memset(&s_ifr, 0, sizeof(s_ifr));
    strncpy(s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name) - 1);

    /* get interface index of eth0 */
    ioctl(this->socket, SIOCGIFINDEX, &s_ifr);

    /* fill sockaddr_ll struct to prepare binding */
    memset(&my_addr, 0, sizeof(my_addr));
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex =  s_ifr.ifr_ifindex;

    /* bind socket to eth0 */
    bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

 A complete tutorial is available at: https://sites.google.com/site/packetmmap/
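
The address preparation above can be factored into a small helper. The
following is only a sketch: ``make_packet_addr()`` and its ``tx_only``
parameter are hypothetical names introduced here for illustration, not part
of any kernel or libc API. It builds the ``struct sockaddr_ll`` for bind(),
using ``sll_protocol = 0`` for a transmit-only socket and
``htons(ETH_P_ALL)`` otherwise.

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <string.h>           /* memset */
#include <sys/socket.h>       /* AF_PACKET */

/*
 * Hypothetical helper: build the address passed to bind() for a packet
 * socket. ifindex is the interface index obtained with SIOCGIFINDEX.
 * tx_only != 0 selects sll_protocol = 0 (transmit-only socket).
 */
static struct sockaddr_ll make_packet_addr(int ifindex, int tx_only)
{
    struct sockaddr_ll addr;

    memset(&addr, 0, sizeof(addr));          /* clear unused fields */
    addr.sll_family = AF_PACKET;
    addr.sll_protocol = tx_only ? 0 : htons(ETH_P_ALL);
    addr.sll_ifindex = ifindex;
    return addr;
}
```

The result would then be passed to bind() with
``sizeof(struct sockaddr_ll)`` as the address length.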

By default, the user should put data at::

 frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)

So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at::

 frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

If you wish to put user data at a custom offset from the beginning of
the frame (for payload alignment with SOCK_RAW mode, for instance), you
can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). For this
to work, it must be enabled beforehand with setsockopt()
and the PACKET_TX_HAS_OFF option.
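
The two expressions for the default data position are equal, because
TPACKET_HDRLEN is defined in if_packet.h as the aligned header size plus
``sizeof(struct sockaddr_ll)``. A minimal sketch, assuming a TPACKET_V1 ring
and no PACKET_TX_HAS_OFF; the helper names are hypothetical:

```c
#include <linux/if_packet.h>  /* TPACKET_HDRLEN, struct tpacket_hdr */
#include <stddef.h>           /* size_t */

/* Default offset of the user data from the frame base (TPACKET_V1). */
static size_t tx_default_data_offset(void)
{
    return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Pointer to the place where the user should write the packet data. */
static void *tx_data_ptr(void *frame_base)
{
    return (char *)frame_base + tx_default_data_offset();
}
```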

PACKET_MMAP settings
====================

Setting up PACKET_MMAP from user-level code is done with a call like

 - Capture process::

     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

 - Transmission process::

     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter,
which must have the following structure::

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.

Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::

    frames_per_block = tp_block_size/tp_frame_size

Indeed, packet_set_ring checks that the following condition is true::

    frames_per_block * tp_block_nr == tp_frame_nr

Let's see an example, with the following values::

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure::

            block #1                 block #2
    +---------+---------+    +---------+---------+
    | frame 1 | frame 2 |    | frame 3 | frame 4 |
    +---------+---------+    +---------+---------+

            block #3                 block #4
    +---------+---------+    +---------+---------+
    | frame 5 | frame 6 |    | frame 7 | frame 8 |
    +---------+---------+    +---------+---------+

A frame can be of any size, with the only condition that it fits in a block.
A block can only hold an integer number of frames, or in other words, a frame
cannot span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
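
Since tp_frame_nr is redundant, it can be derived rather than typed by hand.
The sketch below fills a ``struct tpacket_req`` from the three independent
values; ``fill_ring_req()`` is a hypothetical helper name used only for this
illustration.

```c
#include <linux/if_packet.h>  /* struct tpacket_req */

/*
 * Hypothetical helper: fill req so that
 * tp_frame_nr == frames_per_block * tp_block_nr.
 * Returns 0 on success, -1 if not even one frame fits in a block.
 */
static int fill_ring_req(struct tpacket_req *req, unsigned int block_size,
                         unsigned int frame_size, unsigned int block_nr)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    frames_per_block = block_size / frame_size;  /* integer division */
    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = frames_per_block * block_nr;
    return 0;
}
```

With the values of the example above (4096, 2048, 4), this yields
tp_frame_nr = 8.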

PACKET_MMAP setting constraints
===============================

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture.

Block size limit
----------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order", a power-of-two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely, the limit can be calculated as::

   PAGE_SIZE << MAX_ORDER

   In an i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.

Block number limit
------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated::

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo.

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is::

     131072/4 = 32768 blocks

PACKET_MMAP buffer size calculator
==================================

Definitions:

==============  ================================================================
<size-max>      is the maximum size allocable with kmalloc
                (see /proc/slabinfo)
<pointer size>  depends on the architecture -- ``sizeof(void *)``
<page size>     depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>     is the value defined with MAX_ORDER
<frame size>    it's an upper bound of frame's capture size (more on this later)
==============  ================================================================

From these definitions we derive::

        <block number> = <size-max>/<pointer size>
        <block size> = <page size> << <max-order>

so the max buffer size is::

        <block number> * <block size>

and the number of frames is::

        <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for a 2.6 kernel and an
i386 architecture::

        <size-max> = 131072 bytes
        <pointer size> = 4 bytes
        <page size> = 4096 bytes
        <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield::

        <block number> = 131072/4 = 32768 blocks
        <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames.

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386, the kernel's memory size is limited to 1GiB.

Memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, which basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally limits can be reached.
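
The calculator formulas can be sketched in a few lines of C. The struct and
function names below are hypothetical and exist only for this illustration;
64-bit arithmetic is used because the theoretical buffer size does not fit in
32 bits.

```c
#include <stdint.h>

/* Results of the buffer size calculator (hypothetical type). */
struct ring_limits {
    uint64_t block_number;  /* <size-max> / <pointer size> */
    uint64_t block_size;    /* <page size> << <max-order>, in bytes */
    uint64_t buffer_bytes;  /* <block number> * <block size> */
    uint64_t frames;        /* buffer_bytes / <frame size> */
};

/* Apply the formulas of this section; all sizes are in bytes. */
static struct ring_limits ring_limits_calc(uint64_t size_max,
                                           uint64_t pointer_size,
                                           uint64_t page_size,
                                           unsigned int max_order,
                                           uint64_t frame_size)
{
    struct ring_limits r;

    r.block_number = size_max / pointer_size;
    r.block_size = page_size << max_order;
    r.buffer_bytes = r.block_number * r.block_size;
    r.frames = r.buffer_bytes / frame_size;
    return r;
}
```

Calling ``ring_limits_calc(131072, 4, 4096, 11, 2048)`` reproduces the
figures above: 32768 blocks of 8 MiB each, 262144 MiB in total, and
134217728 frames.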

Other constraints
-----------------

If you check the source code you will see that what I draw here as a frame
is not only the link-level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link-level
frame's meta-information, like the timestamp. So what we draw here as a frame
is really the following (from include/linux/if_packet.h)::

 /*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */

The following conditions are checked in packet_set_ring:

   - tp_block_size must be a multiple of PAGE_SIZE (1)
   - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   - tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.
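
The four conditions listed above can be checked in user space before calling
setsockopt(), which may make a later EINVAL easier to diagnose. This is a
sketch only: ``ring_req_is_valid()`` is a hypothetical helper that mirrors
the list in this document, not the kernel's packet_set_ring code, and it
assumes TPACKET_V1 headers.

```c
#include <linux/if_packet.h>  /* struct tpacket_req, TPACKET_HDRLEN */

/*
 * Hypothetical pre-check of the conditions listed above.
 * page_size is the value returned by getpagesize().
 * Returns 1 if req satisfies all of them, 0 otherwise.
 */
static int ring_req_is_valid(const struct tpacket_req *req,
                             unsigned int page_size)
{
    unsigned int frames_per_block;

    if (page_size == 0 || req->tp_block_size == 0)
        return 0;
    if (req->tp_block_size % page_size != 0)
        return 0;                       /* not a multiple of PAGE_SIZE */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return 0;                       /* no room for the headers */
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return 0;                       /* frame size not aligned */

    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return 0;                       /* frame larger than a block */

    return req->tp_frame_nr == frames_per_block * req->tp_block_nr;
}
```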

Mapping and use of the circular buffer (ring)
---------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several physically
discontiguous blocks of memory, they are contiguous in user space, hence
just one call to mmap is needed::

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, after every
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot span two
blocks.
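
Because of that possible gap, computing a frame's address as
``index * tp_frame_size`` is only correct when the frame size divides the
block size. A sketch that handles both cases, with ``frame_offset()`` being a
hypothetical helper name:

```c
#include <linux/if_packet.h>  /* struct tpacket_req */
#include <stddef.h>           /* size_t */

/*
 * Byte offset of frame number `index` (counted from 0) from the start
 * of the mapped ring, skipping the unused tail of each block.
 */
static size_t frame_offset(const struct tpacket_req *req, unsigned int index)
{
    unsigned int frames_per_block = req->tp_block_size / req->tp_frame_size;
    size_t block = index / frames_per_block;  /* which block */
    size_t slot = index % frames_per_block;   /* position inside it */

    return block * req->tp_block_size + slot * req->tp_frame_size;
}
```

For example, with tp_block_size = 4096 and tp_frame_size = 1536, each block
holds two frames, so frame 2 starts at offset 4096 rather than 3072.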

To use one socket for capture and transmission, the mapping of both the
RX and TX buffer rings has to be done with one call to mmap::

    ...
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
    setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
    ...
    rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    tx_ring = rx_ring + size;

RX must come first, as the kernel maps the TX ring memory right
after the RX one.
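
The snippet above uses ``size * 2``, which assumes both rings have the same
size. When they differ, the mapping length and the TX offset can be derived
from the two requests. A minimal sketch, assuming each ring occupies
tp_block_size * tp_block_nr bytes; ``struct ring_layout`` and
``rx_tx_layout()`` are hypothetical names:

```c
#include <linux/if_packet.h>  /* struct tpacket_req */
#include <stddef.h>           /* size_t */

/* Layout of a combined RX + TX mapping (hypothetical type). */
struct ring_layout {
    size_t map_len;    /* length to pass to the single mmap() call */
    size_t tx_offset;  /* byte offset of the TX ring in the mapping */
};

/* The RX ring comes first; the TX ring follows immediately after it. */
static struct ring_layout rx_tx_layout(const struct tpacket_req *rx,
                                       const struct tpacket_req *tx)
{
    struct ring_layout l;
    size_t rx_len = (size_t)rx->tp_block_size * rx->tp_block_nr;
    size_t tx_len = (size_t)tx->tp_block_size * tx->tp_block_nr;

    l.tx_offset = rx_len;
    l.map_len = rx_len + tx_len;
    return l;
}
```

The TX ring pointer would then be ``(char *)rx_ring + l.tx_offset``; the
cast avoids relying on arithmetic on ``void *``, which is a compiler
extension.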

At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

Capture process
^^^^^^^^^^^^^^^

From include/linux/if_packet.h::

     #define TP_STATUS_COPY          (1 << 1)
     #define TP_STATUS_LOSING        (1 << 2)
     #define TP_STATUS_CSUMNOTREADY  (1 << 3)
     #define TP_STATUS_CSUM_VALID    (1 << 7)

======================  =======================================================
TP_STATUS_COPY          This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        For this to work, it must be
                        enabled beforehand with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING        Indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY  Currently used for outgoing IP packets whose
                        checksum will be computed in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

TP_STATUS_CSUM_VALID    This flag indicates that at least the transport
                        header checksum of the packet has already been
                        validated on the kernel side. If the flag is not set
                        then we are free to check the checksum ourselves,
                        provided that TP_STATUS_CSUMNOTREADY is also not set.
======================  =======================================================

For convenience there are also the following defines::

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet, it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read, the user must zero the status field so the kernel
can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring::

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not incur
a race condition.
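
The ownership handshake on the status field can be sketched with three small
helpers. Their names are hypothetical, a TPACKET_V1 header is assumed, and
``__sync_synchronize()`` is a GCC/Clang builtin used here as a conservative
full memory barrier; it is an assumption of this sketch, not something this
document prescribes.

```c
#include <linux/if_packet.h>  /* struct tpacket_hdr, TP_STATUS_* */

/* Returns 1 if the frame holds a packet for user space, 0 otherwise. */
static int rx_frame_ready(const volatile struct tpacket_hdr *hdr)
{
    return (hdr->tp_status & TP_STATUS_USER) != 0;
}

/* Returns 1 if the packet was truncated to fit the frame. */
static int rx_frame_truncated(const volatile struct tpacket_hdr *hdr)
{
    return (hdr->tp_status & TP_STATUS_COPY) != 0;
}

/* Hand the frame back to the kernel once the packet has been read. */
static void rx_frame_release(volatile struct tpacket_hdr *hdr)
{
    __sync_synchronize();  /* finish all reads of the frame first */
    hdr->tp_status = TP_STATUS_KERNEL;
}
```

A capture loop would call poll() only when ``rx_frame_ready()`` returns 0 for
the current frame, process the packet otherwise, and then release the frame
and advance to the next one.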
 
Transmission process
^^^^^^^^^^^^^^^^^^^^

The following status defines are used for transmission::

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
packet, the user fills the data buffer of an available frame, sets tp_len to
the current data buffer size, and sets its status field to
TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user is
ready to transmit, it calls send(). All buffers with status equal to
TP_STATUS_SEND_REQUEST are then forwarded to the network device. The kernel
sets the status of each frame being sent to TP_STATUS_SENDING until the end
of the transfer.

At the end of each transfer, the buffer status returns to TP_STATUS_AVAILABLE.

::

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(fd, NULL, 0, 0);

The user can also use poll() to wait until a buffer becomes available again,
for example while a frame is still being sent (status == TP_STATUS_SENDING)::

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);
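 
The steps for queuing a frame can be sketched as follows for TPACKET_V1. The
helper name tx_queue_frame() is hypothetical, the barrier builtin is an
assumption, and the sketch assumes the default data offset of
TPACKET_HDRLEN - sizeof(struct sockaddr_ll) from the frame base (that is,
PACKET_TX_HAS_OFF is not set).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/if_packet.h>

/*
 * Queue one packet in a TPACKET_V1 TX frame (illustrative helper).
 * frame_size is the tp_frame_size used when the ring was created.
 * Returns 0 on success, -1 if the frame is not available or the packet
 * does not fit.
 */
static int tx_queue_frame(struct tpacket_hdr *hdr, size_t frame_size,
			  const void *pkt, size_t len)
{
	/* Assumed default payload offset inside the frame. */
	const size_t off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

	assert(hdr != NULL && pkt != NULL);

	if (hdr->tp_status != TP_STATUS_AVAILABLE)
		return -1;
	if (frame_size < off || len > frame_size - off)
		return -1;

	memcpy((uint8_t *)hdr + off, pkt, len);
	hdr->tp_len = len;

	/* Assumed barrier: publish the payload before the status change. */
	__sync_synchronize();
	hdr->tp_status = TP_STATUS_SEND_REQUEST;
	return 0;
}
```

After one or more frames have been queued this way, a single
send(fd, NULL, 0, 0) call asks the kernel to transmit all of them.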
 
What TPACKET versions are available and when to use them?
=========================================================

::

    int val = tpacket_version;
    socklen_t len = sizeof(val);
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
    getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);

where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2 or TPACKET_V3.
 
TPACKET_V1:
	- Default if not otherwise specified by setsockopt(2)
	- RX_RING, TX_RING available

TPACKET_V1 --> TPACKET_V2:
	- Made 64-bit clean (TPACKET_V1 structures use unsigned long), so
	  this also works on a 64-bit kernel with 32-bit userspace and
	  the like
	- Timestamp resolution in nanoseconds instead of microseconds
	- RX_RING, TX_RING available
	- VLAN metadata information available for packets
	  (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
	  in the tpacket2_hdr structure:

		- The TP_STATUS_VLAN_VALID bit being set in the tp_status field
		  indicates that the tp_vlan_tci field has a valid VLAN TCI value
		- The TP_STATUS_VLAN_TPID_VALID bit being set in the tp_status field
		  indicates that the tp_vlan_tpid field has a valid VLAN TPID value

	- How to switch to TPACKET_V2:

		1. Replace struct tpacket_hdr by struct tpacket2_hdr
		2. Query the header length and save it
		3. Set protocol version to 2, set up ring as usual
		4. For getting the sockaddr_ll,
		   use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
		   ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``

TPACKET_V2 --> TPACKET_V3:
	- Flexible buffer implementation for RX_RING:
		1. Blocks can be configured with non-static frame-size
		2. Read/poll is at a block-level (as opposed to packet-level)
		3. Added poll timeout to avoid indefinite user-space wait
		   on idle links
		4. Added user-configurable knobs:

			4.1 block::timeout
			4.2 tpkt_hdr::sk_rxhash

	- RX Hash data available in user space
	- TX_RING semantics are conceptually similar to TPACKET_V2;
	  use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
	  instead of TPACKET2_HDRLEN. In the current implementation,
	  the tp_next_offset field in the tpacket3_hdr MUST be set to
	  zero, indicating that the ring does not hold variable sized frames.
	  Packets with non-zero values of tp_next_offset will be dropped.
 
AF_PACKET fanout mode
=====================

In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.

Currently implemented fanout policies are:

  - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
  - PACKET_FANOUT_LB: schedule to socket by round-robin
  - PACKET_FANOUT_CPU: schedule to socket by the CPU the packet arrives on
  - PACKET_FANOUT_RND: schedule to socket by random selection
  - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
  - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping

Minimal example code by David S. Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.)::
 
    #include <stddef.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>

    #include <unistd.h>

    #include <arpa/inet.h>		/* htons() */

    #include <linux/if_ether.h>
    #include <linux/if_packet.h>

    #include <net/if.h>

    static const char *device_name;
    static int fanout_type;
    static int fanout_id;

    #ifndef PACKET_FANOUT
    # define PACKET_FANOUT			18
    # define PACKET_FANOUT_HASH		0
    # define PACKET_FANOUT_LB		1
    #endif

    /* Returns the bound socket on success, -1 on failure. */
    static int setup_socket(void)
    {
	    int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
	    struct sockaddr_ll ll;
	    struct ifreq ifr;
	    int fanout_arg;

	    if (fd < 0) {
		    perror("socket");
		    return -1;
	    }

	    /* ifr is zeroed, so the copied name stays NUL-terminated. */
	    memset(&ifr, 0, sizeof(ifr));
	    strncpy(ifr.ifr_name, device_name, sizeof(ifr.ifr_name) - 1);
	    err = ioctl(fd, SIOCGIFINDEX, &ifr);
	    if (err < 0) {
		    perror("SIOCGIFINDEX");
		    goto fail;
	    }

	    memset(&ll, 0, sizeof(ll));
	    ll.sll_family = AF_PACKET;
	    ll.sll_ifindex = ifr.ifr_ifindex;
	    err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	    if (err < 0) {
		    perror("bind");
		    goto fail;
	    }

	    /* Group id in the low 16 bits, fanout type in the high 16 bits. */
	    fanout_arg = (fanout_id | (fanout_type << 16));
	    err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
			    &fanout_arg, sizeof(fanout_arg));
	    if (err) {
		    perror("setsockopt");
		    goto fail;
	    }

	    return fd;

    fail:
	    close(fd);
	    return -1;
    }

    static void fanout_thread(void)
    {
	    int fd = setup_socket();
	    int limit = 10000;

	    if (fd < 0)
		    exit(EXIT_FAILURE);

	    while (limit-- > 0) {
		    char buf[1600];
		    int err;

		    err = read(fd, buf, sizeof(buf));
		    if (err < 0) {
			    perror("read");
			    exit(EXIT_FAILURE);
		    }
		    if ((limit % 10) == 0)
			    fprintf(stdout, "(%d) \n", getpid());
	    }

	    fprintf(stdout, "%d: Received 10000 packets\n", getpid());

	    close(fd);
	    exit(0);
    }

    int main(int argc, char **argp)
    {
	    int i;

	    if (argc != 3) {
		    fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
		    return EXIT_FAILURE;
	    }

	    if (!strcmp(argp[2], "hash"))
		    fanout_type = PACKET_FANOUT_HASH;
	    else if (!strcmp(argp[2], "lb"))
		    fanout_type = PACKET_FANOUT_LB;
	    else {
		    fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
		    exit(EXIT_FAILURE);
	    }

	    device_name = argp[1];
	    fanout_id = getpid() & 0xffff;

	    for (i = 0; i < 4; i++) {
		    pid_t pid = fork();

		    switch (pid) {
		    case 0:
			    /* Child: fanout_thread() never returns. */
			    fanout_thread();
			    break;

		    case -1:
			    perror("fork");
			    exit(EXIT_FAILURE);
		    }
	    }

	    for (i = 0; i < 4; i++) {
		    int status;

		    wait(&status);
	    }

	    return 0;
    }
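 
The example above packs the fanout group id and the fanout type into one
integer. A minimal sketch of that encoding, extended with the
PACKET_FANOUT_FLAG_* bits from linux/if_packet.h, could look like this. The
helper name fanout_arg() is hypothetical, and unsigned arithmetic is used so
that flags reaching the top bit do not overflow a signed shift.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/*
 * Build the PACKET_FANOUT option value (illustrative helper):
 * group id in the low 16 bits, fanout mode ORed with optional
 * PACKET_FANOUT_FLAG_* bits in the high 16 bits.
 */
static uint32_t fanout_arg(uint16_t group_id, uint16_t mode, uint16_t flags)
{
	/* Mode values and flag bits are expected not to overlap. */
	assert((mode & flags) == 0);
	return (uint32_t)group_id | ((uint32_t)(mode | flags) << 16);
}
```

The returned value would be passed by pointer to
setsockopt(fd, SOL_PACKET, PACKET_FANOUT, ...) on every socket that should
join the same group.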
 
AF_PACKET TPACKET_V3 example
============================

AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks, where polling
works on a per-block basis instead of per-ring as in TPACKET_V2 and its
predecessor.

TPACKET_V3 is reported to bring the following benefits:

 * ~15% - 20% reduction in CPU-usage
 * ~20% increase in packet capture rate
 * ~2x increase in packet density
 * Port aggregation analysis
 * Non static frame size to capture entire packet payload

This makes it a good candidate to be used with packet fanout.

Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::
 
    /* Written from scratch, but kernel-to-user space API usage
     * dissected from lolpcap:
     *  Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
     *  License: GPL, version 2.0
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <poll.h>
    #include <unistd.h>
    #include <signal.h>
    #include <inttypes.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>

    #ifndef likely
    # define likely(x)		__builtin_expect(!!(x), 1)
    #endif
    #ifndef unlikely
    # define unlikely(x)		__builtin_expect(!!(x), 0)
    #endif

    struct block_desc {
	    uint32_t version;
	    uint32_t offset_to_priv;
	    struct tpacket_hdr_v1 h1;
    };

    struct ring {
	    struct iovec *rd;
	    uint8_t *map;
	    struct tpacket_req3 req;
    };

    static unsigned long packets_total = 0, bytes_total = 0;
    static volatile sig_atomic_t sigint = 0;

    static void sighandler(int num)
    {
	    (void) num;
	    sigint = 1;
    }

    static int setup_socket(struct ring *ring, char *netdev)
    {
	    int err, fd, v = TPACKET_V3;
	    unsigned int i;
	    struct sockaddr_ll ll;
	    unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
	    unsigned int blocknum = 64;

	    fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	    if (fd < 0) {
		    perror("socket");
		    exit(1);
	    }

	    err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
	    if (err < 0) {
		    perror("setsockopt");
		    exit(1);
	    }

	    memset(&ring->req, 0, sizeof(ring->req));
	    ring->req.tp_block_size = blocksiz;
	    ring->req.tp_frame_size = framesiz;
	    ring->req.tp_block_nr = blocknum;
	    ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
	    ring->req.tp_retire_blk_tov = 60;
	    ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;

	    err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
			    sizeof(ring->req));
	    if (err < 0) {
		    perror("setsockopt");
		    exit(1);
	    }

	    ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
			    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
	    if (ring->map == MAP_FAILED) {
		    perror("mmap");
		    exit(1);
	    }

	    /* One iovec per block, pointing into the mapped ring. */
	    ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
	    assert(ring->rd);
	    for (i = 0; i < ring->req.tp_block_nr; ++i) {
		    ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
		    ring->rd[i].iov_len = ring->req.tp_block_size;
	    }

	    memset(&ll, 0, sizeof(ll));
	    ll.sll_family = PF_PACKET;
	    ll.sll_protocol = htons(ETH_P_ALL);
	    ll.sll_ifindex = if_nametoindex(netdev);
	    ll.sll_hatype = 0;
	    ll.sll_pkttype = 0;
	    ll.sll_halen = 0;

	    /* if_nametoindex() returns 0 if the interface does not exist. */
	    if (ll.sll_ifindex == 0) {
		    perror("if_nametoindex");
		    exit(1);
	    }

	    err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	    if (err < 0) {
		    perror("bind");
		    exit(1);
	    }

	    return fd;
    }

    static void display(struct tpacket3_hdr *ppd)
    {
	    struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
	    struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);

	    if (eth->h_proto == htons(ETH_P_IP)) {
		    struct sockaddr_in ss, sd;
		    char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];

		    memset(&ss, 0, sizeof(ss));
		    ss.sin_family = PF_INET;
		    ss.sin_addr.s_addr = ip->saddr;
		    getnameinfo((struct sockaddr *) &ss, sizeof(ss),
				sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);

		    memset(&sd, 0, sizeof(sd));
		    sd.sin_family = PF_INET;
		    sd.sin_addr.s_addr = ip->daddr;
		    getnameinfo((struct sockaddr *) &sd, sizeof(sd),
				dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);

		    printf("%s -> %s, ", sbuff, dbuff);
	    }

	    printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
    }

    static void walk_block(struct block_desc *pbd, const int block_num)
    {
	    int num_pkts = pbd->h1.num_pkts, i;
	    unsigned long bytes = 0;
	    struct tpacket3_hdr *ppd;

	    ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
					pbd->h1.offset_to_first_pkt);
	    for (i = 0; i < num_pkts; ++i) {
		    bytes += ppd->tp_snaplen;
		    display(ppd);

		    /* Frames are variable sized: follow tp_next_offset. */
		    ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
						ppd->tp_next_offset);
	    }

	    packets_total += num_pkts;
	    bytes_total += bytes;
    }

    /* Hand the whole block back to the kernel. */
    static void flush_block(struct block_desc *pbd)
    {
	    pbd->h1.block_status = TP_STATUS_KERNEL;
    }

    static void teardown_socket(struct ring *ring, int fd)
    {
	    munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
	    free(ring->rd);
	    close(fd);
    }

    int main(int argc, char **argp)
    {
	    int fd, err;
	    socklen_t len;
	    struct ring ring;
	    struct pollfd pfd;
	    unsigned int block_num = 0, blocks = 64;
	    struct block_desc *pbd;
	    struct tpacket_stats_v3 stats;

	    if (argc != 2) {
		    fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
		    return EXIT_FAILURE;
	    }

	    signal(SIGINT, sighandler);

	    memset(&ring, 0, sizeof(ring));
	    fd = setup_socket(&ring, argp[argc - 1]);
	    assert(fd > 0);

	    memset(&pfd, 0, sizeof(pfd));
	    pfd.fd = fd;
	    pfd.events = POLLIN | POLLERR;
	    pfd.revents = 0;

	    while (likely(!sigint)) {
		    pbd = (struct block_desc *) ring.rd[block_num].iov_base;

		    /* Block still owned by the kernel: wait for it to retire. */
		    if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
			    poll(&pfd, 1, -1);
			    continue;
		    }

		    walk_block(pbd, block_num);
		    flush_block(pbd);
		    block_num = (block_num + 1) % blocks;
	    }

	    len = sizeof(stats);
	    err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
	    if (err < 0) {
		    perror("getsockopt");
		    exit(1);
	    }

	    fflush(stdout);
	    printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
		stats.tp_packets, bytes_total, stats.tp_drops,
		stats.tp_freeze_q_cnt);

	    teardown_socket(&ring, fd);
	    return 0;
    }
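 
The ring geometry used in the example (4 MiB blocks, 2 KiB frames, 64 blocks)
can be computed and sanity-checked before calling setsockopt(). The following
is a rough sketch: ring_fill_req3() is a hypothetical helper, and the checks
it performs reflect the constraints described for the ring settings (block
size a multiple of the page size, frame size a multiple of TPACKET_ALIGNMENT
and larger than the header, tp_frame_nr equal to frames per block times
tp_block_nr). The kernel remains the authority on what it accepts.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <unistd.h>
#include <linux/if_packet.h>

/*
 * Fill a tpacket_req3 for an RX ring (illustrative helper).
 * retire_ms is the block retire timeout in milliseconds.
 * Returns 0 on success, -1 if the geometry looks unusable.
 */
static int ring_fill_req3(struct tpacket_req3 *req, unsigned int block_size,
			  unsigned int frame_size, unsigned int block_nr,
			  unsigned int retire_ms)
{
	long page = sysconf(_SC_PAGESIZE);

	assert(req != NULL);

	if (page <= 0 || block_size == 0 || block_nr == 0)
		return -1;
	/* Blocks are page aligned, frames TPACKET_ALIGNMENT aligned. */
	if (block_size % (unsigned long)page || frame_size % TPACKET_ALIGNMENT)
		return -1;
	if (frame_size <= TPACKET3_HDRLEN || frame_size > block_size)
		return -1;
	/* Keep the total ring size representable in an unsigned int. */
	if (block_nr > UINT_MAX / block_size)
		return -1;

	memset(req, 0, sizeof(*req));
	req->tp_block_size = block_size;
	req->tp_frame_size = frame_size;
	req->tp_block_nr = block_nr;
	req->tp_frame_nr = (block_size / frame_size) * block_nr;
	req->tp_retire_blk_tov = retire_ms;
	req->tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
	return 0;
}
```

With the values from the example, this yields 2048 frames per block and a
tp_frame_nr of 131072; the mmap() length is tp_block_size * tp_block_nr.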
 
PACKET_QDISC_BYPASS
===================

If there is a requirement to load the network with many packets in a similar
fashion as pktgen does, you might set the following option after socket
creation::

    int one = 1;
    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));

This has the side effect that packets sent through PF_PACKET bypass the
kernel's qdisc layer and are forcibly pushed to the driver directly. This
means that packets are not buffered, tc disciplines are ignored, increased
loss can occur, and such packets are no longer visible to other PF_PACKET
sockets. So, you have been warned; generally, this can be useful for stress
testing various components of a system.

By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
on PF_PACKET sockets.
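 
Because the option changes how packets reach the driver, an application may
want to check that enabling it actually succeeded. A small sketch, with the
hypothetical helper name set_qdisc_bypass(), that reports failures as a
negative errno value:

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Enable (enable != 0) or disable PACKET_QDISC_BYPASS on a packet socket.
 * Returns 0 on success or a negative errno value on failure.
 */
static int set_qdisc_bypass(int fd, int enable)
{
	int val = !!enable;

	if (setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
		       &val, sizeof(val)) < 0)
		return -errno;
	return 0;
}
```

A caller could treat a failure as non-fatal and simply continue sending
through the regular qdisc path.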
 
PACKET_TIMESTAMP
================

The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information for mmap(2)ed RX_RING and TX_RINGs.  If your
NIC is capable of timestamping packets in hardware, you can request those
hardware timestamps to be used. Note: you may need to enable the generation
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.rst).

PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::

    int req = SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

For the mmap(2)ed ring buffers, such timestamps are stored in the
``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
To determine what kind of timestamp has been reported, the tp_status field
is binary OR'ed with the following possible bits ...

::

    TP_STATUS_TS_RAW_HARDWARE
    TP_STATUS_TS_SOFTWARE

... that are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the
RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
software fallback was invoked *within* PF_PACKET's processing code (less
precise).

Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto(), e.g. in blocking mode, iii) wait for the status of the
relevant frames to be updated, i.e. for the frames to be handed back to the
application, iv) walk through the frames to pick up the individual hw/sw
timestamps.

Only (!) if transmit timestamping is enabled are these bits combined with
TP_STATUS_AVAILABLE using a binary OR, so you must check for that in your
application: in a first step, test
``!(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))`` to see if
the frame belongs to the application; in a second step, extract the type of
timestamp from tp_status.

If you don't care about timestamps and thus have them disabled, checking for
TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the
TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and
``tp_{n,u}sec`` members do not contain a valid value. For TX_RINGs, no
timestamp is generated by default!
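 
The two-step check on a TX_RING frame can be sketched as a small classifier.
The enum and the function name tx_frame_ts_kind() are hypothetical; only the
TP_STATUS_* constants come from linux/if_packet.h.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

enum tx_ts_kind {
	TX_TS_NOT_OURS,		/* frame is still queued or being sent */
	TX_TS_NONE,		/* frame is ours, no valid timestamp */
	TX_TS_SOFTWARE,		/* frame is ours, software timestamp */
	TX_TS_RAW_HARDWARE,	/* frame is ours, raw hardware timestamp */
};

/* Classify the tp_status word of a TX_RING frame (illustrative helper). */
static enum tx_ts_kind tx_frame_ts_kind(uint32_t tp_status)
{
	/* Step 1: does the frame belong to the application again? */
	if (tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
		return TX_TS_NOT_OURS;

	/* Step 2: which kind of timestamp, if any, was reported? */
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TX_TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TX_TS_SOFTWARE;
	return TX_TS_NONE;
}
```

Only for the last two results would the tp_sec and ``tp_{n,u}sec`` members of
the frame header be worth reading.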
 
See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
for more information on hardware timestamps.

Miscellaneous bits
==================

- Packet sockets work well together with Linux socket filters, so you might
  also want to have a look at Documentation/networking/filter.rst
 
THANKS
======

   Jesse Brandeburg, for fixing my grammatical/spelling errors