.. SPDX-License-Identifier: GPL-2.0

===
RDS
===

Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports.  The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are:

 *	Addressing

	RDS uses IPv4 addresses and 16-bit port numbers to identify
	the end point of a connection. All socket operations that involve
	passing addresses between kernel and user space generally
	use a struct sockaddr_in.

	The fact that IPv4 addresses are used does not mean the underlying
	transport has to be IP-based. In fact, RDS over IB uses a
	reliable IB connection; the IP address is used exclusively to
	locate the remote node's GID (by ARPing for the given IP).

	The port space is entirely independent of UDP, TCP or any other
	protocol.

 *	Socket interface

	RDS sockets work *mostly* as you would expect from a BSD
	socket. The next section will cover the details. At any rate,
	all I/O is performed through the standard BSD socket API.
	Some additions like zerocopy support are implemented through
	control messages, while other extensions use the getsockopt/
	setsockopt calls.

	Sockets must be bound before you can send or receive data.
	This is needed because binding also selects a transport and
	attaches it to the socket. Once bound, the transport assignment
	does not change. RDS will tolerate IPs moving around (e.g. in
	an active-active HA scenario), but only as long as the address
	doesn't move to a different transport.

 *	sysctls

	RDS supports a number of sysctls in /proc/sys/net/rds.


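As a minimal sketch of the addressing rules above, the helper below fills a
struct sockaddr_in that names an RDS endpoint. The helper name
ex_fill_endpoint() is made up for this example and is not part of any RDS
API; only standard socket headers are assumed.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Fill *sin with an IPv4 address and a 16-bit RDS port.
 * ex_fill_endpoint() is a hypothetical helper for illustration only.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad address.
 */
static int ex_fill_endpoint(struct sockaddr_in *sin, const char *ip,
			    unsigned short port)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;	/* endpoints are named by IPv4 address */
	sin->sin_port = htons(port);	/* RDS port, in network byte order */
	return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}
```

The resulting structure can be passed to bind(2) as the local endpoint or
placed in msg_name of a struct msghdr as the destination. The port number
lives in the RDS port space, so it does not collide with a TCP or UDP port
of the same value.
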
Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
	AF_RDS and PF_RDS are the domain type to be used with socket(2)
	to create RDS sockets. SOL_RDS is the socket-level to be used
	with setsockopt(2) and getsockopt(2) for RDS specific socket
	options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
	This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
	RDS honors the send and receive buffer size socket options.
	You are not allowed to queue more than SO_SNDBUF bytes to
	a socket. A message is queued when sendmsg is called, and
	it leaves the queue when the remote system acknowledges
	its arrival.

	The SO_RCVBUF option controls the maximum receive queue length.
	This is a soft limit rather than a hard limit - RDS will
	continue to accept and queue incoming messages, even if that
	takes the queue length over the limit. However, it will also
	mark the port as "congested" and send a congestion update to
	the source node. The source node is supposed to throttle any
	processes sending to this congested port.

  bind(fd, &sockaddr_in, ...)
	This binds the socket to a local IP address and port, and a
	transport, if one has not already been selected via the
	SO_RDS_TRANSPORT socket option.

  sendmsg(fd, ...)
	Sends a message to the indicated recipient. The kernel will
	transparently establish the underlying reliable connection
	if it isn't up yet.

	An attempt to send a message that exceeds SO_SNDBUF will
	return EMSGSIZE.

	An attempt to send a message that would take the total number
	of queued bytes over the SO_SNDBUF threshold will return
	EAGAIN.

	An attempt to send a message to a destination that is marked
	as "congested" will return ENOBUFS.

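The three sendmsg error cases call for different reactions from the
application. The sketch below maps them to an action; the enum and the
function ex_classify_send() are hypothetical names invented for this
example, and the suggested reactions are one reasonable policy, not
something RDS mandates.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical application-level reactions to a sendmsg() result. */
enum ex_send_action {
	EX_SEND_OK,		/* message was queued */
	EX_SEND_TOO_BIG,	/* message alone exceeds the send buffer size */
	EX_SEND_WAIT_QUEUE,	/* send queue full: wait for POLLOUT, retry */
	EX_SEND_WAIT_CONG,	/* destination congested: wait for a
				 * congestion notification, then retry */
	EX_SEND_FAIL,		/* any other error */
};

/* ret is the return value of sendmsg(), err is errno captured right after. */
static enum ex_send_action ex_classify_send(ssize_t ret, int err)
{
	if (ret >= 0)
		return EX_SEND_OK;

	switch (err) {
	case EMSGSIZE:
		return EX_SEND_TOO_BIG;
	case EAGAIN:
		return EX_SEND_WAIT_QUEUE;
	case ENOBUFS:
		return EX_SEND_WAIT_CONG;
	default:
		return EX_SEND_FAIL;
	}
}
```

Keeping the ENOBUFS case separate matters because a congested destination
is not reflected in the socket's own send queue state.
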
  recvmsg(fd, ...)
	Receives a message that was queued to this socket. The socket's
	recv queue accounting is adjusted, and if the queue length
	drops below SO_RCVBUF, the port is marked uncongested, and
	a congestion update is sent to all peers.

	Applications can ask the RDS kernel module to deliver
	notifications via control messages (for instance, there is a
	notification when a congestion update arrived, or when an RDMA
	operation completes). These notifications are received through
	the msg.msg_control buffer of struct msghdr. The format of the
	messages is described in the manpages.

  poll(fd)
	RDS supports the poll interface to allow the application
	to implement async I/O.

	POLLIN handling is pretty straightforward. When there's an
	incoming message queued to the socket, or a pending notification,
	we signal POLLIN.

	POLLOUT is a little harder. Since you can essentially send
	to any destination, RDS will always signal POLLOUT as long as
	there's room on the send queue (i.e. the number of bytes queued
	is less than the sendbuf size).

	However, the kernel will refuse to accept messages to
	a destination marked congested - in this case you will loop
	forever if you rely on poll to tell you what to do.
	This isn't a trivial problem, but applications can deal with
	this - by using congestion notifications, and by checking for
	ENOBUFS errors returned by sendmsg.

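A small sketch of waiting for send queue room with poll(2) follows. The
function ex_wait_writable() is a hypothetical helper; it only tells the
caller that the local send queue has room, so a later sendmsg can still
fail with ENOBUFS if the destination is congested.

```c
#include <poll.h>

/*
 * Wait until fd reports POLLOUT or the timeout expires.
 * Returns 1 if writable, 0 on timeout, -1 on error.
 * Hypothetical helper: POLLOUT says nothing about congestion at the
 * destination, so the caller must still handle ENOBUFS from sendmsg().
 */
static int ex_wait_writable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;	/* 0: timeout, -1: poll() failed */
	return (pfd.revents & POLLOUT) ? 1 : -1;
}
```

Using a bounded timeout rather than waiting indefinitely gives the
application a chance to re-check congestion notifications instead of
spinning on a socket that is always writable.
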
  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
	This allows the application to discard all messages queued to a
	specific destination on this particular socket.

	It lets the application cancel outstanding messages if
	it detects a timeout. For instance, if it tried to send a message,
	and the remote host is unreachable, RDS will keep trying forever.
	The application may decide it's not worth it, and cancel the
	operation. In this case, it would use RDS_CANCEL_SENT_TO to
	nuke any pending messages.

  ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
	Set or read an integer defining the underlying
	encapsulating transport to be used for RDS packets on the
	socket. When setting the option, the integer argument may be
	one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
	value, RDS_TRANS_NONE will be returned on an unbound socket.
	This socket option may only be set exactly once on the socket,
	prior to binding it via the bind(2) system call. Attempts to
	set SO_RDS_TRANSPORT on a socket for which the transport has
	been previously attached explicitly (by SO_RDS_TRANSPORT) or
	implicitly (via bind(2)) will return an error of EOPNOTSUPP.
	An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
	always return EINVAL.

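The ordering constraint (create, select transport, then bind) can be
sketched as follows. The function ex_rds_open() is a hypothetical wrapper
written for this example; the constants come from <linux/rds.h>, and the
sketch assumes a kernel with RDS support, otherwise socket(2) simply fails.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/rds.h>

/*
 * Create an RDS socket, pin it to a transport, then bind it.
 * transport is RDS_TRANS_TCP or RDS_TRANS_IB.
 * Returns the bound fd, or -1 with errno set.
 * ex_rds_open() is a hypothetical helper, not an RDS API.
 */
static int ex_rds_open(int transport, const struct sockaddr_in *local)
{
	int saved;
	int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);

	if (fd < 0)
		return -1;

	/* Must be done before bind(2), and only once per socket. */
	if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
		       &transport, sizeof(transport)) < 0)
		goto fail;

	if (bind(fd, (const struct sockaddr *)local, sizeof(*local)) < 0)
		goto fail;

	return fd;

fail:
	saved = errno;		/* close() may overwrite errno */
	close(fd);
	errno = saved;
	return -1;
}
```

Calling setsockopt with SO_RDS_TRANSPORT after bind(2) would instead fail
with EOPNOTSUPP, as described above.
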
RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):

    Fields:

      h_sequence:
	  per-packet sequence number
      h_ack:
	  piggybacked acknowledgment of last packet received
      h_len:
	  length of data, not including header
      h_sport:
	  source port
      h_dport:
	  destination port
      h_flags:
	  Can be:

	  =============  ==================================
	  CONG_BITMAP    this is a congestion update bitmap
	  ACK_REQUIRED   receiver must ack this packet
	  RETRANSMITTED  packet has previously been sent
	  =============  ==================================

      h_credit:
	  indicate to other end of connection that
	  it has more credits available (i.e. there is
	  more send room)
      h_padding[4]:
	  unused, for future use
      h_csum:
	  header checksum
      h_exthdr:
	  optional data can be passed here. This is currently used for
	  passing RDMA-related information.

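To make the field list concrete, here is a userspace mirror of the header
layout. It is only a sketch: the field order follows the list above, but
the field widths, the size of the extension area and the flag bit values
are assumptions made for this example. The authoritative definition is
'struct rds_header' in rds.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_EXTHDR_SPACE	16	/* assumed size of the extension area */

/* Assumed bit values for h_flags; check rds.h for the real ones. */
enum {
	EX_FLAG_CONG_BITMAP	= 0x01,
	EX_FLAG_ACK_REQUIRED	= 0x02,
	EX_FLAG_RETRANSMITTED	= 0x04,
};

/* Illustrative mirror of the header; multi-byte fields are big-endian. */
struct ex_rds_header {
	uint64_t h_sequence;	/* per-packet sequence number */
	uint64_t h_ack;		/* piggybacked ack of last packet received */
	uint32_t h_len;		/* payload length, header excluded */
	uint16_t h_sport;	/* source port */
	uint16_t h_dport;	/* destination port */
	uint8_t  h_flags;	/* EX_FLAG_* bits */
	uint8_t  h_credit;	/* newly available send credits */
	uint8_t  h_padding[4];	/* unused */
	uint16_t h_csum;	/* header checksum */
	uint8_t  h_exthdr[EX_EXTHDR_SPACE];	/* optional extension data */
};

/* True if the receiver is asked to acknowledge this packet. */
static bool ex_needs_ack(const struct ex_rds_header *h)
{
	return (h->h_flags & EX_FLAG_ACK_REQUIRED) != 0;
}
```

With these assumed widths the fields pack without compiler-inserted
padding, which is why h_padding[4] sits where it does.
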
  ACK and retransmit handling

      One might think that with reliable IB connections you wouldn't need
      to ack messages that have been received.  The problem is that IB
      hardware generates an ack message before it has DMAed the message
      into memory.  This creates a potential message loss if the HCA is
      disabled for any reason between when it sends the ack and before
      the message is DMAed and processed.  This is only a potential issue
      if another HCA is available for fail-over.

      Sending an ack immediately would allow the sender to free the sent
      message from its send queue quickly, but could cause excessive
      traffic to be used for acks. RDS piggybacks acks on sent data
      packets.  Ack-only packets are reduced by only allowing one to be
      in flight at a time, and by the sender only asking for acks when
      its send buffers start to fill up. All retransmissions are also
      acked.

  Flow Control

      RDS's IB transport uses a credit-based mechanism to verify that
      there is space in the peer's receive buffers for more data. This
      eliminates the need for hardware retries on the connection.

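The idea behind credit-based flow control can be shown with a toy model.
Everything here (struct ex_credits and both functions) is hypothetical
and much simpler than the IB transport's real accounting; it only
illustrates that a sender spends one credit per buffer it consumes at the
peer and regains credits from the h_credit field of incoming headers.

```c
#include <stdbool.h>

/* Toy sender-side view of the peer's free receive buffers. */
struct ex_credits {
	unsigned int send_credits;	/* buffers the peer has advertised */
};

/*
 * Spend one credit before posting a send.
 * Returns false when the peer has no receive buffer for us, in which
 * case the caller must hold the message back instead of sending it.
 */
static bool ex_credit_take(struct ex_credits *c)
{
	if (c->send_credits == 0)
		return false;
	c->send_credits--;
	return true;
}

/* Account for credits the peer advertised in an incoming h_credit. */
static void ex_credit_grant(struct ex_credits *c, unsigned char h_credit)
{
	c->send_credits += h_credit;
}
```

Because the sender never posts more data than the peer has buffers for,
the hardware should not have to retry a send for lack of receive space.
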
  Congestion

      Messages waiting in the receive queue on the receiving socket
      are accounted against the socket's SO_RCVBUF option value.  Only
      the payload bytes in the message are accounted for.  If the
      number of bytes queued equals or exceeds rcvbuf then the socket
      is congested.  All sends attempted to this socket's address
      should block or return -EWOULDBLOCK.

      Applications are expected to be reasonably tuned such that this
      situation very rarely occurs.  An application encountering this
      "back-pressure" is considered a bug.

      This is implemented by having each node maintain bitmaps which
      indicate which ports on bound addresses are congested.  As the
      bitmap changes it is sent through all the connections which
      terminate in the local address of the bitmap which changed.

      The bitmaps are allocated as connections are brought up.  This
      avoids allocation in the interrupt handling path which queues
      messages on sockets.  The dense bitmaps let transports send the
      entire bitmap on any bitmap change reasonably efficiently.  This
      is much easier to implement than some finer-grained
      communication of per-port congestion.  The sender does a very
      inexpensive bit test to test if the port it's about to send to
      is congested or not.


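A dense per-port bitmap and the sender's bit test can be sketched as
below. This is an illustration only: struct ex_cong_map and its helpers
are hypothetical, and the kernel's struct rds_cong_map stores the bitmap
differently (and with additional bookkeeping).

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_PORTS		65536	/* one bit per 16-bit port */
#define EX_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical dense congestion bitmap for one address (8 KiB). */
struct ex_cong_map {
	unsigned long bits[EX_PORTS / EX_BITS_PER_WORD];
};

/* Mark a port congested (receive queue at or over its limit). */
static void ex_cong_set(struct ex_cong_map *m, uint16_t port)
{
	m->bits[port / EX_BITS_PER_WORD] |= 1UL << (port % EX_BITS_PER_WORD);
}

/* Mark a port uncongested again. */
static void ex_cong_clear(struct ex_cong_map *m, uint16_t port)
{
	m->bits[port / EX_BITS_PER_WORD] &= ~(1UL << (port % EX_BITS_PER_WORD));
}

/* The sender-side check: a single word load and a mask. */
static bool ex_cong_test(const struct ex_cong_map *m, uint16_t port)
{
	return (m->bits[port / EX_BITS_PER_WORD] >>
		(port % EX_BITS_PER_WORD)) & 1UL;
}
```

At one bit per port the whole map is 8 KiB, which is small enough to
resend in full whenever any bit changes.
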
RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

  struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.

  struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.

  struct rds_socket
    per-socket information

  struct rds_connection
    per-connection information

  struct rds_transport
    pointers to transport-specific functions

  struct rds_statistics
    non-transport-specific statistics

  struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.


The send path
=============

  rds_sendmsg()
    - struct rds_message built from incoming data
    - CMSGs parsed (e.g. RDMA ops)
    - transport connection alloced and connected if not already
    - rds_message placed on send queue
    - send worker awoken

  rds_send_worker()
    - calls rds_send_xmit() until queue is empty

  rds_send_xmit()
    - transmits congestion map if one is pending
    - may set ACK_REQUIRED
    - calls transport to send either non-RDMA or RDMA message
      (RDMA ops never retransmitted)

  rds_ib_xmit()
    - allocs work requests from send ring
    - adds any new send credits available to peer (h_credit)
    - maps the rds_message's sg list
    - piggybacks ack
    - populates work requests
    - posts send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
    - looks at write completions
    - unmaps recv buffer from device
    - if no errors, call rds_ib_process_recv()
    - refill recv ring

  rds_ib_process_recv()
    - validate header checksum
    - copy header to rds_ib_incoming struct if start of a new datagram
    - add to ibinc's fraglist
    - if completed datagram:
	 - update cong map if datagram was cong update
	 - call rds_recv_incoming() otherwise
	 - note if ack is required

  rds_recv_incoming()
    - drop duplicate packets
    - respond to pings
    - find the sock associated with this datagram
    - add to sock queue
    - wake up sock
    - do some congestion calculations

  rds_recvmsg()
    - copy data into user iovec
    - handle CMSGs
    - return to application

Multipath RDS (mprds)
=====================

  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
  (though the concept can be extended to other transports). The classical
  implementation of RDS-over-TCP works by demultiplexing multiple
  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
  port]) over a single TCP socket between the 2 IP addresses involved. This
  has the limitation that it ends up funneling multiple RDS flows over a
  single TCP flow, thus it
  (a) is upper-bounded to the single-flow bandwidth, and
  (b) suffers from head-of-line blocking for all the RDS sockets.

  Better throughput (for a fixed small packet size, MTU) can be achieved
  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
  connection. RDS sockets will be attached to a path based on some hash
  (e.g., of local address and RDS port number) and packets for that RDS
  socket will be sent over the attached path using TCP to segment/reassemble
  RDS datagrams on that path.

  Multipathed RDS is implemented by splitting the struct rds_connection into
  a common (to all paths) part, and a per-path struct rds_conn_path. All
  I/O workqs and reconnect threads are driven from the rds_conn_path.
  Transports such as TCP that are multipath capable may then set up a
  TCP socket per rds_conn_path, and this is managed by the transport via
  the transport private cp_transport_data pointer.

  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths. The outgoing hash is computed based on the
  local address and port that the PF_RDS socket is bound to.

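The mapping from a bound socket to a path can be pictured as a hash of
the local address and port reduced modulo the number of paths. The sketch
below is hypothetical: ex_pick_path() and its mixing function are
invented for illustration and are not the hash used by rds_sendmsg().

```c
#include <stdint.h>

/*
 * Pick a path index in [0, npaths) for a socket bound to
 * (local_addr, local_port). Hypothetical hash for illustration only.
 * The same socket always maps to the same path, so its datagrams stay
 * ordered on one TCP flow.
 */
static unsigned int ex_pick_path(uint32_t local_addr, uint16_t local_port,
				 unsigned int npaths)
{
	/* Multiplicative mixing so nearby ports spread across paths. */
	uint32_t h = local_addr ^ ((uint32_t)local_port * 2654435761u);

	h ^= h >> 16;
	return npaths > 1 ? h % npaths : 0;
}
```

Hashing on the bound address and port, rather than per message, keeps
each socket on a single path; different sockets can still land on
different paths and so avoid blocking one another.
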
  Additionally, even if the transport is MP capable, we may be
  peering with some node that does not support mprds, or supports
  a different number of paths. As a result, the peering nodes need
  to agree on the number of paths to be used for the connection.
  This is done by sending out a control packet exchange before the
  first data packet. The control packet exchange must have completed
  prior to outgoing hash completion in rds_sendmsg() when the transport
  is multipath capable.

  The control packet is an RDS ping packet (i.e., packet to rds dest
  port 0) with the ping packet having a rds extension header option of
  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
  number of paths supported by the sender. The "probe" ping packet will
  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
  be able to compute the min(sender_paths, rcvr_paths). The pong
  sent in response to a probe-ping should contain the rcvr's npaths
  when the rcvr is mprds-capable.

  If the rcvr is not mprds-capable, the exthdr in the ping will be
  ignored.  In this case the pong will not have any exthdrs, so the sender
  of the probe-ping can default to single-path mprds.

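The outcome of the probe exchange reduces to a small rule, sketched here.
The function ex_negotiate_npaths() and its parameters are hypothetical;
the sketch only restates the min() rule and the single-path fallback
described above.

```c
#include <stdbool.h>

/*
 * Number of paths to use for a connection after the probe exchange.
 * local_npaths:   paths this node supports
 * peer_has_exthdr: whether the peer's ping/pong carried an
 *                  RDS_EXTHDR_NPATHS extension header
 * peer_npaths:    value from that header (ignored if absent)
 * Hypothetical helper for illustration only.
 */
static unsigned int ex_negotiate_npaths(unsigned int local_npaths,
					bool peer_has_exthdr,
					unsigned int peer_npaths)
{
	/* Peer is not mprds-capable: fall back to a single path. */
	if (!peer_has_exthdr || peer_npaths == 0 || local_npaths == 0)
		return 1;

	return local_npaths < peer_npaths ? local_npaths : peer_npaths;
}
```

Both sides apply the same rule to the same two numbers, so they arrive at
the same path count without a further round trip.
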