.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
        |                 |               |                |
        +-----------+     |               |     +----------+
                    |     |               |     |
                +----------------------------------+
                |           Multiplexor            |
                +----------------------------------+
                    |   |           |           |  |
        +---------+   |           |           |  ------------+
        |             |           |           |              |
    +----------+  +----------+  +----------+  +----------+ +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
    +----------+  +----------+  +----------+  +----------+ +----------+
        |              |           |            |             |
    +----------+  +----------+  +----------+  +----------+ +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
    +----------+  +----------+  +----------+  +----------+ +----------+

KCM sockets
===========

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations on different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket. This structure holds the state for constructing
messages on receive as well as other connection-specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages on the KCM socket.

Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.
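
The length calculation that a parser program performs can be illustrated in
plain userspace C. The following is a minimal sketch, assuming a hypothetical
protocol whose header is a 4-byte big-endian payload length. The function name
frame_len and the convention of returning 0 for an incomplete header are
assumptions made for this illustration, not part of the KCM interface::

  #include <stddef.h>
  #include <stdint.h>

  /*
   * Hypothetical framing: a 4-byte big-endian payload length followed
   * by the payload. Returns the total message length (header plus
   * payload), or 0 if fewer than 4 header bytes are available.
   */
  int64_t frame_len(const uint8_t *buf, size_t avail)
  {
          uint32_t payload;

          if (avail < 4)
                  return 0;

          payload = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                    ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];

          return (int64_t)payload + 4;
  }

For a header of 00 00 01 00 the sketch reports a message length of 260 bytes:
256 bytes of payload plus the 4-byte header.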

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket. When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.
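
Both limits are ordinary socket options on the TCP socket. A minimal sketch,
assuming the options are applied before the socket is attached; the helper
name tcp_set_rx_limits and that ordering are assumptions for illustration
rather than documented requirements::

  #include <sys/socket.h>
  #include <sys/time.h>

  /*
   * Set the receive buffer size (which bounds the message size) and the
   * receive timeout (which bounds message assembly time) on a TCP socket.
   * Returns 0 on success, -1 with errno set on failure.
   */
  int tcp_set_rx_limits(int tcpfd, int rcvbuf_bytes, int timeout_secs)
  {
          struct timeval tv = { .tv_sec = timeout_secs, .tv_usec = 0 };

          if (setsockopt(tcpfd, SOL_SOCKET, SO_RCVBUF,
                         &rcvbuf_bytes, sizeof(rcvbuf_bytes)) < 0)
                  return -1;

          return setsockopt(tcpfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
  }

Linux may adjust the value passed to SO_RCVBUF, so getsockopt can be used to
read back the effective buffer size.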

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call::

  socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED
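
A minimal sketch of the call with its result returned to the caller for error
checking. The helper name kcm_open is hypothetical, and the fallback definition
of AF_KCM is only an assumption for C libraries whose headers do not yet
provide it::

  #include <sys/socket.h>
  #include <linux/kcm.h>        /* KCMPROTO_CONNECTED */

  #ifndef AF_KCM
  #define AF_KCM 41             /* value used by the kernel's linux/socket.h */
  #endif

  /*
   * Create a multiplexor and its first KCM socket. type is SOCK_DGRAM or
   * SOCK_SEQPACKET. Returns the file descriptor, or -1 with errno set
   * (for example EAFNOSUPPORT when the address family is not supported).
   */
  int kcm_open(int type)
  {
          return socket(AF_KCM, type, KCMPROTO_CONNECTED);
  }
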

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket::

  /* From linux/kcm.h */
  struct kcm_clone {
        int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
    newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor, e.g.::

  /* From linux/kcm.h */
  struct kcm_attach {
        int fd;
        int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  - fd: file descriptor for the TCP socket being attached
  - bpf_fd: file descriptor for the compiled BPF program that has been loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument::

  /* From linux/kcm.h */
  struct kcm_unattach {
        int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;      /* TCP socket to unattach */

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is::

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
       return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols. Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an NxM configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket:

  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.
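
The first method can be sketched as follows. This is an illustration under
stated assumptions: the helper name send_batch and the MAX_BATCH limit are
hypothetical, and each buffer is assumed to hold exactly one complete
application message::

  #define _GNU_SOURCE           /* for sendmmsg and struct mmsghdr */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  #define MAX_BATCH 16

  /*
   * Send up to MAX_BATCH messages with a single sendmmsg call.
   * Returns the number of messages sent, or -1 with errno set.
   */
  int send_batch(int fd, void *const bufs[], const size_t lens[],
                 unsigned int count)
  {
          struct mmsghdr msgs[MAX_BATCH];
          struct iovec iov[MAX_BATCH];
          unsigned int i;

          if (count > MAX_BATCH)
                  count = MAX_BATCH;

          memset(msgs, 0, sizeof(msgs));
          for (i = 0; i < count; i++) {
                  /* One iovec per message keeps message boundaries intact. */
                  iov[i].iov_base = bufs[i];
                  iov[i].iov_len = lens[i];
                  msgs[i].msg_hdr.msg_iov = &iov[i];
                  msgs[i].msg_hdr.msg_iovlen = 1;
          }

          return sendmmsg(fd, msgs, count, 0);
  }

A caller would pass the KCM socket as fd. If fewer messages than requested are
reported as sent, the remainder can be retried in a later call.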

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).
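
The monitoring described above might be structured as in the following sketch.
The helper names (watch_tcp_errors, drop_tcp_socket, handle_tcp_error) are
hypothetical, and the choice to handle one event per call is an assumption
made to keep the example short::

  #include <sys/epoll.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <linux/sockios.h>    /* SIOCPROTOPRIVATE */
  #include <linux/kcm.h>        /* struct kcm_unattach, SIOCKCMUNATTACH */

  /* Add an attached TCP socket to the epoll set. EPOLLERR is always
   * reported by epoll; it is named here to make the intent explicit. */
  int watch_tcp_errors(int epfd, int tcpfd)
  {
          struct epoll_event ev = { .events = EPOLLERR, .data.fd = tcpfd };

          return epoll_ctl(epfd, EPOLL_CTL_ADD, tcpfd, &ev);
  }

  /* Unattach a failed TCP socket from the multiplexor, then close it. */
  void drop_tcp_socket(int epfd, int kcmfd, int tcpfd)
  {
          struct kcm_unattach info = { .fd = tcpfd };

          ioctl(kcmfd, SIOCKCMUNATTACH, &info);
          epoll_ctl(epfd, EPOLL_CTL_DEL, tcpfd, NULL);
          close(tcpfd);
  }

  /* Wait for one event and drop the affected socket. Returns the value
   * read from SO_ERROR, or -1 if waiting failed. */
  int handle_tcp_error(int epfd, int kcmfd)
  {
          struct epoll_event ev;
          int soerr = 0;
          socklen_t len = sizeof(soerr);

          if (epoll_wait(epfd, &ev, 1, -1) != 1)
                  return -1;

          getsockopt(ev.data.fd, SOL_SOCKET, SO_ERROR, &soerr, &len);
          drop_tcp_socket(epfd, kcmfd, ev.data.fd);
          return soerr;
  }

A monitoring thread would call watch_tcp_errors once for each attached socket
and then call handle_tcp_error in a loop, creating a replacement connection
where needed.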

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).
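
One way to read such statistics is the TCP_INFO socket option. The following
is a minimal sketch; the helper name tcp_total_retrans is hypothetical, and
the choice of the tcpi_total_retrans counter as the health indicator is an
assumption for illustration::

  #include <netinet/in.h>
  #include <netinet/tcp.h>      /* struct tcp_info, TCP_INFO */
  #include <string.h>
  #include <sys/socket.h>

  /*
   * Read the total retransmission count of a TCP socket.
   * Returns 0 and stores the count in *retrans, or -1 with errno set.
   */
  int tcp_total_retrans(int tcpfd, unsigned int *retrans)
  {
          struct tcp_info info;
          socklen_t len = sizeof(info);

          memset(&info, 0, sizeof(info));
          if (getsockopt(tcpfd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
                  return -1;

          *retrans = info.tcpi_total_retrans;
          return 0;
  }

Sampling this counter periodically for each attached TCP socket lets an
application spot a connection whose retransmissions are growing faster than
those of its peers.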