============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than about 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is not always as
trivial as just passing the flag, then.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these.
To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

        int one = 1;

        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
                error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

        ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer.
It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

        pfd.fd = fd;
        pfd.events = 0;
        /* Parenthesize the mask test: == binds tighter than & in C. */
        if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
                error(1, errno, "poll");

        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
        if (ret == -1)
                error(1, errno, "recvmsg");

        read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.
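
Because a message reports an inclusive range of send counter values
rather than a single value, the receiving process has to release every
buffer in that range, and the range may straddle the point where the
32-bit counter wraps. The following is a minimal sketch of that
bookkeeping, not code from the kernel or the selftest. The names
zc_range_len(), zc_complete_range(), struct zc_ring and ZC_RING_SIZE are
hypothetical, and it assumes the application tracks in-flight buffers in
a ring indexed by the counter value of the send call that used them.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel or libc API: number of send calls
 * covered by the inclusive notification range [lo, hi]. Unsigned
 * arithmetic keeps the result correct when the 32-bit per-socket
 * counter wraps between lo and hi.
 */
uint32_t zc_range_len(uint32_t lo, uint32_t hi)
{
        return (uint32_t)(hi - lo + 1u);
}

/* Assumed ring size; a power of two, so it divides 2^32 evenly. */
#define ZC_RING_SIZE 64

/* Hypothetical application state: one in-flight marker per ring slot. */
struct zc_ring {
        unsigned char in_flight[ZC_RING_SIZE];
};

/*
 * Mark every buffer whose send counter lies in [lo, hi] as reusable.
 * Returns the number of slots that were actually in flight.
 */
uint32_t zc_complete_range(struct zc_ring *ring, uint32_t lo, uint32_t hi)
{
        uint32_t n = zc_range_len(lo, hi);
        uint32_t released = 0;
        uint32_t i;

        for (i = 0; i < n; i++) {
                /* lo + i wraps modulo 2^32, matching the kernel counter */
                uint32_t slot = (uint32_t)(lo + i) % ZC_RING_SIZE;

                if (ring->in_flight[slot]) {
                        ring->in_flight[slot] = 0;
                        released++;
                }
        }
        return released;
}
```

With this scheme a notification for the range [4294967294, 1] releases
four buffers: the two sent before the counter wrapped and the two sent
after it.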

When a new notification is about to be queued, the kernel checks whether
the new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the upper bound of the range in the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific, IP_RECVERR or IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive.
Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

        struct sock_extended_err *serr;
        struct cmsghdr *cm;

        /* msg is the struct msghdr pointer filled in by recvmsg() */
        cm = CMSG_FIRSTHDR(msg);
        /* IPv4 socket; for IPv6 compare against SOL_IPV6 and IPV6_RECVERR */
        if (!cm || cm->cmsg_level != SOL_IP ||
            cm->cmsg_type != IP_RECVERR)
                error(1, 0, "cmsg");

        serr = (void *) CMSG_DATA(cm);
        if (serr->ee_errno != 0 ||
            serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                error(1, 0, "serr");

        printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages.
That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.