============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP and UDP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendfile and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per-byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective for writes larger than roughly 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after the system call
returns without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can corrupt its own data stream.

The kernel returns a notification when it is safe to modify the data.
Converting an existing application to MSG_ZEROCOPY is therefore not
always as trivial as just passing the flag.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.

Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket exceeds its optmem limit or the user exceeds their ulimit on
locked pages.

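One way an application might react to ENOBUFS is to retry the same
write as an ordinary copying send. The sketch below assumes that policy
and assumes SO_ZEROCOPY was already set on the socket. The helper name
send_zc_or_copy is hypothetical, and the fallback define only covers
older libc headers that lack the flag.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: try a zerocopy send and, if the kernel refuses
 * with ENOBUFS, retry the same write as a regular copying send.
 * *used_zerocopy is set to 1 if the result comes from the
 * MSG_ZEROCOPY attempt and to 0 if it comes from the copying retry,
 * so the caller knows whether to expect a completion notification.
 */
ssize_t send_zc_or_copy(int fd, const void *buf, size_t len,
			int *used_zerocopy)
{
	ssize_t ret;

	*used_zerocopy = 1;
	ret = send(fd, buf, len, MSG_ZEROCOPY);
	if (ret == -1 && errno == ENOBUFS) {
		*used_zerocopy = 0;
		ret = send(fd, buf, len, 0);
	}
	return ret;
}
```

Any other error is returned to the caller unchanged, so existing error
handling around send keeps working.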

Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls that pass the
flag with calls that do not.

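A per-call decision could be sketched as follows. The cutoff ZC_MIN_LEN
and the helper send_flags_for_len are assumptions made for illustration.
The value is taken from the rough 10 KB figure mentioned earlier and
should be measured for the workload at hand.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/* Assumed cutoff below which a plain copy is expected to be cheaper. */
#define ZC_MIN_LEN (10 * 1024)

/* Hypothetical helper: choose the send flags based on the write size. */
int send_flags_for_len(size_t len)
{
	return len >= ZC_MIN_LEN ? MSG_ZEROCOPY : 0;
}
```

A caller would then write ``send(fd, buf, len, send_flags_for_len(len))``
and only expect a completion notification for the large writes.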

Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter. The
counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.

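Because the counter wraps, an application that numbers its own sends
the same way may want to compare IDs with wraparound in mind. A minimal
sketch, assuming far fewer than 2^31 sends are ever outstanding at
once. The helper zc_id_before is hypothetical and not a kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if ID a was issued before ID b, modulo 2^32. The unsigned
 * difference is reinterpreted as signed, which relies on the usual
 * two's complement conversion done by GCC and Clang.
 */
bool zc_id_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}
```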

Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.

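The non-blocking pattern could be sketched as a drain loop that the
application calls every few sends. This is an illustration under
assumptions: reap_notifications and its handle callback are
hypothetical names, and the 128-byte control buffer is an assumed size
that comfortably holds one notification. The callback would do the
control message parsing described under Notification Parsing.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: read every notification currently queued on the
 * error queue of fd. Reads from the error queue never block, so the
 * loop ends with EAGAIN once the queue is empty.
 * handle() is a caller-supplied parser and may be NULL.
 * Returns the number of messages read, or -1 on an unexpected error.
 */
int reap_notifications(int fd, void (*handle)(struct msghdr *msg))
{
	union {
		char buf[128];		/* assumed size, see lead-in */
		struct cmsghdr align;	/* forces cmsghdr alignment */
	} control;
	struct msghdr msg;
	int count = 0;

	for (;;) {
		memset(&msg, 0, sizeof(msg));
		msg.msg_control = control.buf;
		msg.msg_controllen = sizeof(control.buf);

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
			if (errno == EAGAIN)
				return count;	/* queue is empty */
			return -1;
		}
		if (handle)
			handle(&msg);
		count++;
	}
}
```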

Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks
whether the new value extends the range of the notification at the tail
of the queue. If so, it drops the new notification packet and instead
increases the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.

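On the application side, ranges and out-of-order arrival can be handled
by tracking each send ID individually. The sketch below is one possible
bookkeeping scheme, not part of the kernel interface. The names
zc_tracker, zc_track_send and zc_complete_range are hypothetical, the
cap of 64 sends in flight is arbitrary, and the code assumes the first
successful MSG_ZEROCOPY send on a socket is numbered 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed cap on sends in flight. Must be a power of two so that the
 * slot index stays consistent when the 32-bit ID wraps. */
#define ZC_SLOTS 64

struct zc_tracker {
	uint32_t next_id;		/* ID the next successful send gets */
	bool in_flight[ZC_SLOTS];	/* indexed by ID modulo ZC_SLOTS */
};

/* Call after each successful MSG_ZEROCOPY send; returns its ID. */
uint32_t zc_track_send(struct zc_tracker *t)
{
	uint32_t id = t->next_id++;

	t->in_flight[id % ZC_SLOTS] = true;
	return id;
}

/*
 * Release every buffer in the inclusive range [lo, hi]. The arithmetic
 * is unsigned, so a range that straddles the 32-bit wrap still works.
 * Each ID is handled on its own, so ranges may arrive in any order.
 * Returns the number of buffers released, or -1 if the range is larger
 * than the tracker can hold.
 */
int zc_complete_range(struct zc_tracker *t, uint32_t lo, uint32_t hi)
{
	uint32_t n = hi - lo + 1;	/* modulo 2^32 */
	int released = 0;

	if (n == 0 || n > ZC_SLOTS)
		return -1;
	for (uint32_t i = 0; i < n; i++) {
		uint32_t slot = (lo + i) % ZC_SLOTS;

		if (t->in_flight[slot]) {
			t->in_flight[slot] = false;	/* buffer reusable */
			released++;
		}
	}
	return released;
}
```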

Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The below snippet demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific: SOL_IP with IP_RECVERR, or SOL_IPV6 with IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	/* msg points to the struct msghdr filled in by recvmsg() */
	cm = CMSG_FIRSTHDR(msg);
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible. Devices that do not support
scatter-gather I/O cannot send packets made up of kernel-generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is therefore not a transmit completion notification.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.