============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendfile and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is not always as
trivial as just passing the flag, then.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these.
To avoid enabling copy avoidance mode for legacy processes that
accidentally already pass this flag, a process must first signal
intent by setting a socket option:

::

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket exceeds its optmem limit or the user exceeds their ulimit on
locked pages.


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer.
It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	/* parenthesize the mask test: == binds tighter than & */
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.
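
Because each notification carries an inclusive range of the wrapping
32-bit counter, an application can tally completed send calls with plain
unsigned arithmetic. The sketch below is illustrative only: the helper
names zc_range_count and zc_range_extends are hypothetical, and the
second function is a simplified user-space model of the "does this value
extend the range" test, not the kernel's code.

```c
#include <stdint.h>

/*
 * Number of send calls covered by the inclusive range [lo, hi].
 * Unsigned 32-bit arithmetic keeps the result correct when the
 * counter wraps between lo and hi.
 */
uint32_t zc_range_count(uint32_t lo, uint32_t hi)
{
	return (uint32_t)(hi - lo + 1u);
}

/*
 * Simplified model (an assumption, for illustration): a new value
 * extends a range ending at hi if it is the next counter value,
 * taking wrap-around into account.
 */
int zc_range_extends(uint32_t hi, uint32_t next)
{
	return next == (uint32_t)(hi + 1u);
}
```

For example, a notification with ee_info 5 and ee_data 9 covers five
send calls, and a range that starts at UINT32_MAX and ends at 1 covers
three.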

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific, IP_RECVERR or IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive.
Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* msg is a pointer to the msghdr filled in by recvmsg */
	cm = CMSG_FIRSTHDR(msg);
	/* IPv4 shown; an IPv6 socket reports SOL_IPV6 and IPV6_RECVERR */
	if (!cm || cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages.
That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.
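
For quick experiments before turning to the selftest, the send-side
decisions described under Transmission can be condensed into a small
sketch. Everything named here is an assumption made for illustration:
zc_send_flags, zc_send and ZC_MIN_LEN are hypothetical, the 10 KB
threshold simply restates the rule of thumb given earlier, and retrying
with a plain copy on ENOBUFS is one possible application policy rather
than something the kernel prescribes. The socket is assumed to have
SO_ZEROCOPY set already.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* value from linux/socket.h, for older libc headers */
#endif

/* Hypothetical cut-off below which copying is assumed to be cheaper. */
#define ZC_MIN_LEN (10 * 1024)

/* Pick send flags: request zerocopy only for large buffers. */
int zc_send_flags(size_t len, int zerocopy_enabled)
{
	return (zerocopy_enabled && len >= ZC_MIN_LEN) ? MSG_ZEROCOPY : 0;
}

/*
 * Send one buffer. If a zerocopy send fails with ENOBUFS, fall back
 * to an ordinary copying send (an application policy, not a kernel rule).
 */
ssize_t zc_send(int fd, const void *buf, size_t len, int zerocopy_enabled)
{
	int flags = zc_send_flags(len, zerocopy_enabled);
	ssize_t ret = send(fd, buf, len, flags);

	if (ret == -1 && errno == ENOBUFS && (flags & MSG_ZEROCOPY))
		ret = send(fd, buf, len, 0);
	return ret;
}
```

A caller still has to read completion notifications for every send
that went out with MSG_ZEROCOPY before reusing the buffer.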

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.