****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high speed transport library
which provides support to establish an optimal number of connections
between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server.
Those are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are cpus on
the client.

When processing an incoming write or read request, rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to carry the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.

The module parameter always_invalidate was introduced to address the security
problem discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server
side we invalidate each rdma buffer before we hand it over to RNBD server and
then pass it to the block layer. A new rkey is generated and registered for the
buffer after it returns back from the block layer and RNBD server.
The new rkey is sent back to the client along with the IO result.
The procedure is the default behaviour of the driver.
This invalidation and registration on each IO causes a performance drop of up
to 20%. A user of the driver may choose to load the modules with this
mechanism switched off (always_invalidate=N), if they understand and can take
the risk of a malicious client being able to corrupt memory of a server it is
connected to. This might be a reasonable option in a scenario where all the
clients and all the servers are located within a secure datacenter.


Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include uuid of the session and uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, total number of connections per session
(as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve the situations where
client is trying to reconnect a path, while server is still destroying the old
one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CON_RSP messages to the rdma_accept.
Apart from magic and
protocol version, the messages include error code, queue depth supported by
the server (number of memory chunks which are going to be allocated for that
session) and the maximum size of one io. The RTRS_MSG_NEW_RKEY_F flag is set
when always_invalidate=Y.

3. After all connections of a path are established client sends to server the
RTRS_MSG_INFO_REQ message, containing the name of the session. This message
requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field) which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IO are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.
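
The server-side handling of step 1 can be sketched as a small model. This is
only an illustration under stated assumptions, not the driver's code: the
names ConReq, Path and Server, the MAGIC and VERSION values, and the policy of
rejecting a request whose reconnect counter differs from the persisting path
are all hypothetical.

```python
from dataclasses import dataclass, field

# Placeholder constants: the real magic and version values are not given in
# this document, so these numbers are assumptions for illustration only.
MAGIC = 0xABCD
VERSION = 1


@dataclass(frozen=True)
class ConReq:
    """Hypothetical model of the fields listed for RTRS_MSG_CON_REQ."""
    magic: int
    version: int
    sess_uuid: str
    path_uuid: str
    con_num: int     # total number of connections (one per client cpu)
    con_id: int      # id of the connection being established
    recon_cnt: int   # reconnect counter


@dataclass
class Path:
    con_num: int
    recon_cnt: int
    connected: set = field(default_factory=set)

    def all_connected(self):
        return len(self.connected) == self.con_num


class Server:
    def __init__(self):
        self.sessions = {}  # sess_uuid -> {path_uuid: Path}

    def accept(self, req):
        """Return True to accept the connection, False to reject it."""
        if (req.magic, req.version) != (MAGIC, VERSION):
            return False
        if not 0 <= req.con_id < req.con_num:
            return False
        paths = self.sessions.setdefault(req.sess_uuid, {})
        path = paths.get(req.path_uuid)
        if path is None:
            # No persisting path for this uuid: create a new one.
            path = paths[req.path_uuid] = Path(req.con_num, req.recon_cnt)
        elif path.recon_cnt != req.recon_cnt or path.con_num != req.con_num:
            # Assumed policy: the old path is still being torn down, so the
            # client has to retry later.
            return False
        path.connected.add(req.con_id)
        return True
```

In this model a path counts as established once every connection id from 0 to
con_num - 1 has been accepted, which is the point where step 3 would start.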

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                  <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                  <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                  -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                  -------------------> [RTRS_HB_MSG_ACK]

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code.
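
How a request id and an error code can share one 32 bit immediate value can be
sketched as follows. The bit widths (4 type bits, 19 id bits, 9 errno bits),
the type code IO_RSP_IMM and the function names are assumptions chosen for
illustration; the text above only states that the field is 32 bits wide and
carries both values.

```python
# Assumed layout, for illustration only: 4 type bits followed by a 28 bit
# payload that holds a 19 bit request id and a 9 bit errno.
TYPE_BITS = 4
PAYLOAD_BITS = 32 - TYPE_BITS
ERRNO_BITS = 9
ID_BITS = PAYLOAD_BITS - ERRNO_BITS

IO_RSP_IMM = 1  # hypothetical type code for an IO response


def pack_io_rsp(msg_id, errno, imm_type=IO_RSP_IMM):
    """Pack a request id and a (negative or zero) errno into 32 bits."""
    if not 0 <= msg_id < (1 << ID_BITS):
        raise ValueError("msg_id does not fit into the id bits")
    if not 0 <= imm_type < (1 << TYPE_BITS):
        raise ValueError("imm_type does not fit into the type bits")
    payload = (msg_id << ERRNO_BITS) | (abs(errno) & ((1 << ERRNO_BITS) - 1))
    return (imm_type << PAYLOAD_BITS) | payload


def unpack_io_rsp(imm):
    """Split a 32 bit immediate back into (type, msg_id, errno)."""
    imm_type = imm >> PAYLOAD_BITS
    payload = imm & ((1 << PAYLOAD_BITS) - 1)
    msg_id = payload >> ERRNO_BITS
    errno = -(payload & ((1 << ERRNO_BITS) - 1))
    return imm_type, msg_id, errno
```

The errno is stored as an absolute value and negated again on unpacking, so a
successful IO round-trips as 0 and a failure as a negative error code.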

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

* Write (always_invalidate=Y) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with the
memory chunk and, once the invalidation finishes, passes the IO to the RNBD
server module.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.
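
The rkey life cycle with always_invalidate=Y can be illustrated with a toy
model. It is a sketch under assumptions, not the driver's implementation: the
classes ServerChunk and Client, their methods, and the counter used as a
source of new rkeys are all hypothetical.

```python
import itertools


class ServerChunk:
    """Toy model of one server-side memory chunk and its rkey."""
    _next_rkey = itertools.count(100)  # arbitrary deterministic key source

    def __init__(self):
        self.rkey = next(self._next_rkey)

    def remote_write_allowed(self, rkey):
        # Remote access only works with the currently registered rkey.
        return self.rkey is not None and rkey == self.rkey

    def begin_io(self):
        # Invalidate the rkey before the buffer is handed to the upper layer.
        self.rkey = None

    def finish_io(self):
        # Register a new rkey once the buffer comes back; it is returned so
        # that it can be sent to the client along with the IO result.
        self.rkey = next(self._next_rkey)
        return self.rkey


class Client:
    def __init__(self, rkeys):
        self.rkeys = dict(rkeys)      # buf_id -> rkey known to the client
        self.finished = []            # (msg_id, errno) of completed IOs
        self.recv_buffers_posted = 0

    def on_rkey_rsp(self, buf_id, new_rkey, msg_id, errno):
        # Validate the message, update the rkey, finish the IO, then post the
        # recv buffer back, in the order the text describes.
        if buf_id not in self.rkeys:
            return False
        self.rkeys[buf_id] = new_rkey
        self.finished.append((msg_id, errno))
        self.recv_buffers_posted += 1
        return True
```

While an IO is in flight the old rkey is unusable in this model, which is the
property that prevents a client from touching a buffer the server is working
on.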

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


* Read (always_invalidate=N) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

* Read (always_invalidate=Y) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk and,
once the invalidation finishes, passes the IO to the RNBD server module.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>