============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.

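A caller may want to degrade gracefully when the kernel refuses the
zerocopy path. The sketch below shows one possible pattern and is not
part of the kernel interface: the helper name send_zc_or_copy() and its
used_zc output parameter are hypothetical, and the fallback definition
of MSG_ZEROCOPY is only there for older libc headers. It assumes that
SO_ZEROCOPY was already set on the socket.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: try a zerocopy send; if the kernel refuses with
 * ENOBUFS, retry the same buffer as a regular copying send.
 *
 * *used_zc is set to 1 only if the zerocopy send accepted data. In that
 * case the caller must not modify buf until the matching completion
 * notification arrives. After the copying path, buf can be reused
 * immediately.
 */
static ssize_t send_zc_or_copy(int fd, const void *buf, size_t len,
			       int *used_zc)
{
	ssize_t ret;

	*used_zc = 0;

	ret = send(fd, buf, len, MSG_ZEROCOPY);
	if (ret >= 0) {
		/* A zero-length send does not generate a notification. */
		*used_zc = ret > 0;
		return ret;
	}
	if (errno != ENOBUFS)
		return -1;	/* unrelated error: leave errno for the caller */

	/* Zerocopy was refused: fall back to a normal copying send. */
	return send(fd, buf, len, 0);
}
```

Reporting which path was taken matters because only the zerocopy path
leaves the buffer shared with the network stack.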

Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.

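Because the flag is chosen per call, an application can decide by
buffer size. The following is a minimal sketch of that choice. The
helper name zc_send_flags() and the exact cutoff are assumptions: this
document only says that zerocopy tends to pay off above roughly 10 KB,
so the threshold should be tuned for the actual workload.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/* Assumed cutoff, derived from the "around 10 KB" guidance above. */
#define ZC_THRESHOLD_BYTES (10 * 1024)

/*
 * Hypothetical helper: return the send flags to use for a buffer of
 * the given length. Small buffers are copied, large ones use zerocopy.
 */
static int zc_send_flags(size_t len)
{
	return len >= ZC_THRESHOLD_BYTES ? MSG_ZEROCOPY : 0;
}
```

A caller would then write send(fd, buf, len, zc_send_flags(len)) and
only wait for a completion notification when the flag was passed.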

Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.

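To match a notification to a buffer, a process can mirror this counter
in user space. The sketch below is illustrative only: struct zc_counter
and zc_counter_note_send() are hypothetical names, and it assumes that
the first counted call on a socket is numbered zero, which should be
verified against the running kernel.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical user space mirror of the per-socket kernel counter. */
struct zc_counter {
	uint32_t next_id;	/* ID the next counted send will receive */
};

/*
 * Call after every send that passed MSG_ZEROCOPY, with ret set to the
 * return value of that send. Returns 1 and stores the ID of the call
 * in *id if the call was counted, or 0 if it failed or sent nothing.
 * The uint32_t increment wraps in the same way as the kernel counter.
 */
static int zc_counter_note_send(struct zc_counter *c, ssize_t ret,
				uint32_t *id)
{
	if (ret <= 0)
		return 0;

	*id = c->next_id++;
	return 1;
}
```

The returned ID can be stored next to the buffer, so that the buffer is
released once a notification range covers that ID.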

Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.

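The non-blocking approach can be sketched as a loop that empties the
error queue and stops as soon as recvmsg reports EAGAIN. This is an
illustration under assumptions, not the reference implementation: the
names zc_msg_handler and zc_drain_errqueue() are hypothetical, and the
128-byte control buffer is an assumed size that is ample for a single
sock_extended_err.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical callback that parses one message from the error queue. */
typedef void (*zc_msg_handler)(struct msghdr *msg);

/*
 * Hypothetical helper: read all queued messages from the socket error
 * queue without blocking. Returns the number of messages read, or -1
 * with errno set on an error other than an empty queue.
 */
static int zc_drain_errqueue(int fd, zc_msg_handler handle)
{
	/* The union gives the buffer the alignment of struct cmsghdr. */
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	struct msghdr msg;
	int count = 0;

	for (;;) {
		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN)
				return count;	/* queue is empty */
			return -1;
		}

		if (handle)
			handle(&msg);
		count++;
	}
}
```

Calling such a helper every few send calls keeps the queue short
without ever blocking on poll. The error queue can also carry other
errors, so the handler must check the origin of each message.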

Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, it checks whether the
new value extends the range of the notification at the tail of the
queue. If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.

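Since a notification carries an inclusive range of 32-bit counter
values and the counter wraps, range arithmetic is best done in
unsigned 32-bit math. The helpers below are a small sketch of that
arithmetic. Their names are hypothetical, and they assume a range never
spans the full 2^32 values.

```c
#include <stdint.h>

/*
 * Number of send calls covered by the inclusive range [lo, hi].
 * Unsigned subtraction keeps the result correct when the counter
 * wrapped between lo and hi.
 */
static uint32_t zc_range_count(uint32_t lo, uint32_t hi)
{
	return hi - lo + 1;
}

/* Return 1 if id falls inside the inclusive range [lo, hi], else 0. */
static int zc_id_in_range(uint32_t id, uint32_t lo, uint32_t hi)
{
	return (uint32_t)(id - lo) <= (uint32_t)(hi - lo);
}
```

For example, a range of [UINT32_MAX - 1, 1] covers four calls: the two
before the wrap and the two after it.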

Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific, IP_RECVERR or IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	cm = CMSG_FIRSTHDR(msg);
	if (!cm ||
	    cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is not a transmit completion notification, therefore.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.