.. SPDX-License-Identifier: GPL-2.0

===
RDS
===

Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are

 * Addressing

   RDS uses IPv4 addresses and 16-bit port numbers to identify
   the end point of a connection.
   All socket operations that involve
   passing addresses between kernel and user space generally
   use a struct sockaddr_in.

   The fact that IPv4 addresses are used does not mean the underlying
   transport has to be IP-based. In fact, RDS over IB uses a
   reliable IB connection; the IP address is used exclusively to
   locate the remote node's GID (by ARPing for the given IP).

   The port space is entirely independent of UDP, TCP or any other
   protocol.

 * Socket interface

   RDS sockets work *mostly* as you would expect from a BSD
   socket. The next section will cover the details. At any rate,
   all I/O is performed through the standard BSD socket API.
   Some additions like zerocopy support are implemented through
   control messages, while other extensions use the getsockopt/
   setsockopt calls.

   Sockets must be bound before you can send or receive data.
   This is needed because binding also selects a transport and
   attaches it to the socket. Once bound, the transport assignment
   does not change. RDS will tolerate IPs moving around (e.g. in
   an active-active HA scenario), but only as long as the address
   doesn't move to a different transport.

 * sysctls

   RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        AF_RDS and PF_RDS are the domain type to be used with socket(2)
        to create RDS sockets. SOL_RDS is the socket-level to be used
        with setsockopt(2) and getsockopt(2) for RDS specific socket
        options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDBUF bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVBUF option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.

  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport, if one has not already been selected via the
        SO_RDS_TRANSPORT socket option.

  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDBUF will
        return with -EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDBUF threshold will return
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return ENOBUFS.

  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        recv queue accounting is adjusted, and if the queue length
        drops below SO_RCVBUF, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrived, or when an RDMA
        operation completes).
        These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in manpages.

  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        this - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.

  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        This allows the application to cancel outstanding messages if
        it detects a timeout.
        For instance, if it tried to send a message,
        and the remote host is unreachable, RDS will keep trying forever.
        The application may decide it's not worth it, and cancel the
        operation. In this case, it would use RDS_CANCEL_SENT_TO to
        nuke any pending messages.

  ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
        Set or read an integer defining the underlying
        encapsulating transport to be used for RDS packets on the
        socket. When setting the option, the integer argument may be
        one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
        value, RDS_TRANS_NONE will be returned on an unbound socket.
        This socket option may only be set exactly once on the socket,
        prior to binding it via the bind(2) system call. Attempts to
        set SO_RDS_TRANSPORT on a socket for which the transport has
        been previously attached explicitly (by SO_RDS_TRANSPORT) or
        implicitly (via bind(2)) will return an error of EOPNOTSUPP.
        An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
        always return EINVAL.
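
The calls described in this section can be combined roughly as follows. This is
a minimal sketch, not a reference implementation: the helper name
``rds_open_bound`` and its convention of passing RDS_TRANS_NONE to mean "skip
SO_RDS_TRANSPORT and let bind(2) choose" are assumptions made for illustration,
and it assumes the ``<linux/rds.h>`` UAPI header is installed.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/rds.h>	/* SOL_RDS, SO_RDS_TRANSPORT, RDS_TRANS_* */

/*
 * Hypothetical helper (not part of RDS): create an RDS socket, optionally
 * select its transport, and bind it. Returns the fd or a negative errno.
 */
static int rds_open_bound(const char *ip, uint16_t port, int transport)
{
	struct sockaddr_in sin;
	int fd, err;

	assert(ip != NULL);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	/* Parse the address first so bad input never creates a socket. */
	if (inet_pton(AF_INET, ip, &sin.sin_addr) != 1)
		return -EINVAL;

	fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
	if (fd < 0)
		return -errno;

	/* The transport may only be set once, and only before bind(2). */
	if (transport != RDS_TRANS_NONE &&
	    setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
		       &transport, sizeof(transport)) < 0)
		goto fail;

	/* Binding attaches a transport if none was selected above. */
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		goto fail;

	return fd;

fail:
	err = errno;
	close(fd);
	return -err;
}
```

A caller would typically check for a negative return and, on success, use the
fd with sendmsg(2) and recvmsg(2). Whether socket(2) succeeds depends on the
rds module and the requested transport being available on the host.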

RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):

    Fields:

      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          Can be:

          =============  ==================================
          CONG_BITMAP    this is a congestion update bitmap
          ACK_REQUIRED   receiver must ack this packet
          RETRANSMITTED  packet has previously been sent
          =============  ==================================

      h_credit:
          indicate to other end of connection that
          it has more credits available (i.e.
          there is
          more send room)
      h_padding[4]:
          unused, for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.

  ACK and retransmit handling

      One might think that with reliable IB connections you wouldn't need
      to ack messages that have been received. The problem is that IB
      hardware generates an ack message before it has DMAed the message
      into memory. This creates a potential message loss if the HCA is
      disabled for any reason between when it sends the ack and before
      the message is DMAed and processed. This is only a potential issue
      if another HCA is available for fail-over.

      Sending an ack immediately would allow the sender to free the sent
      message from their send queue quickly, but could cause excessive
      traffic to be used for acks. RDS piggybacks acks on sent data
      packets. Ack-only packets are reduced by only allowing one to be
      in flight at a time, and by the sender only asking for acks when
      its send buffers start to fill up. All retransmissions are also
      acked.
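
The sender-side rule described above (ask for an ack only when the send buffer
starts to fill, and always on a retransmission) can be sketched as a small
predicate. The ``struct send_state`` type, the ``want_ack`` name and the
"half full" threshold are all assumptions made for this illustration; the text
above does not state the actual threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of one socket's send-side accounting. */
struct send_state {
	size_t queued_bytes;	/* bytes queued and not yet acked */
	size_t sndbuf;		/* send buffer limit (SO_SNDBUF) */
};

/*
 * Decide whether an outgoing packet should carry ACK_REQUIRED.
 * The "half full" threshold is an assumption made for this sketch.
 */
static bool want_ack(const struct send_state *s, bool retransmit)
{
	assert(s != NULL);

	if (retransmit)
		return true;	/* all retransmissions are acked */

	/* Otherwise only ask once the send buffer starts to fill up. */
	return s->queued_bytes >= s->sndbuf / 2;
}
```

Keeping the decision in one predicate makes the trade-off visible: a lower
threshold frees send-queue memory sooner at the cost of more ack traffic.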

  Flow Control

      RDS's IB transport uses a credit-based mechanism to verify that
      there is space in the peer's receive buffers for more data. This
      eliminates the need for hardware retries on the connection.

  Congestion

      Messages waiting in the receive queue on the receiving socket
      are accounted against the socket's SO_RCVBUF option value. Only
      the payload bytes in the message are accounted for. If the
      number of bytes queued equals or exceeds rcvbuf then the socket
      is congested. All sends attempted to this socket's address
      should block or return -EWOULDBLOCK.

      Applications are expected to be reasonably tuned such that this
      situation very rarely occurs. An application encountering this
      "back-pressure" is considered a bug.

      This is implemented by having each node maintain bitmaps which
      indicate which ports on bound addresses are congested. As the
      bitmap changes it is sent through all the connections which
      terminate in the local address of the bitmap which changed.

      The bitmaps are allocated as connections are brought up. This
      avoids allocation in the interrupt handling path which queues
      messages on sockets. The dense bitmaps let transports send the
      entire bitmap on any bitmap change reasonably efficiently.
      This
      is much easier to implement than some finer-grained
      communication of per-port congestion. The sender does a very
      inexpensive bit test to test if the port it's about to send to
      is congested or not.


RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other Infiniband details.


RDS Kernel Structures
=====================

  struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.

  struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.

  struct rds_socket
    per-socket information

  struct rds_connection
    per-connection information

  struct rds_transport
    pointers to transport-specific functions

  struct rds_statistics
    non-transport-specific statistics

  struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.


The send path
=============

  rds_sendmsg()
    - struct rds_message built from incoming data
    - CMSGs parsed (e.g. RDMA ops)
    - transport connection alloced and connected if not already
    - rds_message placed on send queue
    - send worker awoken

  rds_send_worker()
    - calls rds_send_xmit() until queue is empty

  rds_send_xmit()
    - transmits congestion map if one is pending
    - may set ACK_REQUIRED
    - calls transport to send either non-RDMA or RDMA message
      (RDMA ops never retransmitted)

  rds_ib_xmit()
    - allocs work requests from send ring
    - adds any new send credits available to peer (h_credit)
    - maps the rds_message's sg list
    - piggybacks ack
    - populates work requests
    - post send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
    - looks at write completions
    - unmaps recv
      buffer from device
    - no errors, call rds_ib_process_recv()
    - refill recv ring

  rds_ib_process_recv()
    - validate header checksum
    - copy header to rds_ib_incoming struct if start of a new datagram
    - add to ibinc's fraglist
    - if completed datagram:

         - update cong map if datagram was cong update
         - call rds_recv_incoming() otherwise

    - note if ack is required

  rds_recv_incoming()
    - drop duplicate packets
    - respond to pings
    - find the sock associated with this datagram
    - add to sock queue
    - wake up sock
    - do some congestion calculations

  rds_recvmsg()
    - copy data into user iovec
    - handle CMSGs
    - return to application

Multipath RDS (mprds)
=====================

  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
  (though the concept can be extended to other transports). The classical
  implementation of RDS-over-TCP works by demultiplexing multiple
  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
  port]) over a single TCP socket between the 2 IP addresses involved.
  This
  has the limitation that it ends up funneling multiple RDS flows over a
  single TCP flow, thus it
  (a) is upper-bounded to the single-flow bandwidth, and
  (b) suffers from head-of-line blocking for all the RDS sockets.

  Better throughput (for a fixed small packet size, MTU) can be achieved
  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
  RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
  connection. RDS sockets will be attached to a path based on some hash
  (e.g., of local address and RDS port number) and packets for that RDS
  socket will be sent over the attached path using TCP to segment/reassemble
  RDS datagrams on that path.

  Multipathed RDS is implemented by splitting the struct rds_connection into
  a common (to all paths) part, and a per-path struct rds_conn_path. All
  I/O workqs and reconnect threads are driven from the rds_conn_path.
  Transports such as TCP that are multipath capable may then set up a
  TCP socket per rds_conn_path, and this is managed by the transport via
  the transport private cp_transport_data pointer.

  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths.
  The outgoing hash is computed based on the
  local address and port that the PF_RDS socket is bound to.

  Additionally, even if the transport is MP capable, we may be
  peering with some node that does not support mprds, or supports
  a different number of paths. As a result, the peering nodes need
  to agree on the number of paths to be used for the connection.
  This is done by sending out a control packet exchange before the
  first data packet. The control packet exchange must have completed
  prior to outgoing hash completion in rds_sendmsg() when the transport
  is multipath capable.

  The control packet is an RDS ping packet (i.e., packet to rds dest
  port 0) with the ping packet having an RDS extension header option of
  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
  number of paths supported by the sender. The "probe" ping packet will
  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
  be able to compute the min(sender_paths, rcvr_paths). The pong
  sent in response to a probe-ping should contain the rcvr's npaths
  when the rcvr is mprds-capable.

  If the rcvr is not mprds-capable, the exthdr in the ping will be
  ignored. In this case the pong will not have any exthdrs, so the sender
  of the probe-ping can default to single-path mprds.
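
The path-count agreement and the per-socket path selection described in this
section can be sketched in a few lines. Both function names are hypothetical,
a peer value of 0 is used here to stand for "the pong carried no
RDS_EXTHDR_NPATHS option", and the mixing function is a stand-in chosen for the
example; it is not claimed to be the hash the kernel uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of paths both peers agree on: min(sender_paths, rcvr_paths).
 * peer == 0 models a pong without an npaths extension header.
 */
static unsigned int mprds_npaths(unsigned int local, unsigned int peer)
{
	assert(local > 0);

	if (peer == 0)
		return 1;	/* peer is not mprds-capable: single path */
	return local < peer ? local : peer;
}

/*
 * Map a bound socket (local address and port) to a path index.
 * The mixing function is a hypothetical stand-in for illustration.
 */
static unsigned int mprds_path(uint32_t addr, uint16_t port,
			       unsigned int npaths)
{
	uint32_t h = addr ^ ((uint32_t)port * 2654435761u);

	assert(npaths > 0);

	h ^= h >> 16;
	return h % npaths;
}
```

Because the index depends only on the bound address and port, every datagram
from a given socket stays on one path, so TCP preserves ordering per socket
while different sockets can spread across paths.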