****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high speed transport library
which provides support to establish an optimal number of connections
between client and server machines using RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an RTRS
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server.
The paths are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are cpus on
the client.

When processing an incoming write or read request, the rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to carry the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.

The module parameter always_invalidate was introduced to address the security
problem discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server
side we invalidate each rdma buffer before we hand it over to the RNBD server
and then pass it to the block layer. A new rkey is generated and registered
for the buffer after it returns from the block layer and the RNBD server.
The new rkey is sent back to the client along with the IO result.
The procedure is the default behaviour of the driver.
This invalidation and
registration on each IO causes a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N), if they understand and can accept the risk of a
malicious client being able to corrupt memory of a server it is connected to.
This might be a reasonable option in a scenario where all the clients and all
the servers are located within a secure datacenter.


Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include the uuid of the session and the uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per
session (as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve the situations where the
client is trying to reconnect a path while the server is still destroying the
old one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CONN_RSP messages to the rdma_accept.
Apart from magic and
protocol version, the messages include the error code, the queue depth
supported by the server (number of memory chunks which are going to be
allocated for that session) and the maximum size of one io. The
RTRS_MSG_NEW_RKEY_F flag is set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message to the server, containing the name of the session.
This message requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field), which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IO are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.
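
Steps 1 and 2 above can be pictured with a small user-space model. The sketch
below is written in Python purely for illustration and is not the kernel code:
the class and field names, the MAGIC and VERSION values and the error codes are
all hypothetical, and the handling of the reconnect counter (a mismatch simply
replaces the old path) is only one plausible policy, assumed here for brevity.

```python
import errno
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConReq:
    """Illustrative stand-in for the fields carried by a connection request."""
    sess_uuid: uuid.UUID
    path_uuid: uuid.UUID
    version: int
    magic: int
    con_num: int    # total number of connections (one per client cpu)
    con_id: int     # id of the connection being established
    recon_cnt: int  # reconnect counter


@dataclass
class Path:
    recon_cnt: int
    con_num: int
    cons: set = field(default_factory=set)


class Server:
    MAGIC = 0x1234  # hypothetical placeholder, not the real protocol magic
    VERSION = 1     # hypothetical placeholder

    def __init__(self):
        self.sessions = {}  # sess_uuid -> {path_uuid -> Path}

    def accept(self, req):
        """Return 0 on success or a negative errno-style code."""
        if req.magic != self.MAGIC or req.version != self.VERSION:
            return -errno.EINVAL
        if not 0 <= req.con_id < req.con_num:
            return -errno.EINVAL
        # Find a persisting session/path by uuid or create a new one.
        paths = self.sessions.setdefault(req.sess_uuid, {})
        path = paths.get(req.path_uuid)
        if path is not None and path.recon_cnt != req.recon_cnt:
            path = None  # assumed policy: a stale path is replaced
        if path is None:
            path = Path(req.recon_cnt, req.con_num)
            paths[req.path_uuid] = path
        if req.con_id in path.cons:
            return -errno.EEXIST
        path.cons.add(req.con_id)
        return 0

    def path_connected(self, sess_uuid, path_uuid):
        """A path is complete once every one of its connections is accepted."""
        path = self.sessions[sess_uuid][path_uuid]
        return len(path.cons) == path.con_num
```

In this model a client with two cpus would send two ConReq messages with
con_id 0 and 1, and path_connected() would turn true only after the second one
has been accepted.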

    CLT                                     SRV
    *for each connection belonging to a path and for each path:
    RTRS_MSG_CON_REQ  ------------------->
                      <------------------- RTRS_MSG_CON_RSP
    ...
    *after all connections are established:
    RTRS_MSG_INFO_REQ ------------------->
                      <------------------- RTRS_MSG_INFO_RSP
    *heartbeat is started from both sides:
                      -------------------> [RTRS_HB_MSG_IMM]
    [RTRS_HB_MSG_ACK] <-------------------
    [RTRS_HB_MSG_IMM] <-------------------
                      -------------------> [RTRS_HB_MSG_ACK]

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code.
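
The two uses of the 32 bit immediate field described in the steps above can be
sketched as plain bit packing. This is a Python illustration only, not the
driver's code: the split into a 4 bit type tag and a 28 bit payload, the 9 bits
assumed for errno, the tag values and the chunk_bits parameter (how many bits
are needed for an offset inside one chunk) are assumptions made for the
example.

```python
IMM_TYPE_BITS = 4    # assumed width of the message type tag
IMM_PAYL_BITS = 28   # assumed width of the payload
ERRNO_BITS = 9       # assumed number of payload bits used for errno
ID_BITS = IMM_PAYL_BITS - ERRNO_BITS

IO_REQ_IMM = 0       # illustrative type tags
IO_RSP_IMM = 1


def to_imm(imm_type, payload):
    """Combine a type tag and a payload into one 32 bit immediate value."""
    if not 0 <= imm_type < (1 << IMM_TYPE_BITS):
        raise ValueError("type does not fit")
    if not 0 <= payload < (1 << IMM_PAYL_BITS):
        raise ValueError("payload does not fit")
    return (imm_type << IMM_PAYL_BITS) | payload


def from_imm(imm):
    """Split an immediate value back into (type, payload)."""
    return imm >> IMM_PAYL_BITS, imm & ((1 << IMM_PAYL_BITS) - 1)


def io_req_imm(chunk_id, offset, chunk_bits):
    """Client side: name the chunk and the offset of the message in it."""
    if not 0 <= offset < (1 << chunk_bits):
        raise ValueError("offset outside of chunk")
    return to_imm(IO_REQ_IMM, (chunk_id << chunk_bits) | offset)


def parse_io_req_imm(imm, chunk_bits):
    """Server side: recover (chunk_id, offset) from a request immediate."""
    imm_type, payload = from_imm(imm)
    if imm_type != IO_REQ_IMM:
        raise ValueError("not an IO request")
    return payload >> chunk_bits, payload & ((1 << chunk_bits) - 1)


def io_rsp_imm(msg_id, err):
    """Server side: acknowledge request msg_id with a (negative) errno."""
    if not 0 <= msg_id < (1 << ID_BITS) or abs(err) >= (1 << ERRNO_BITS):
        raise ValueError("id or errno does not fit")
    return to_imm(IO_RSP_IMM, (abs(err) << ID_BITS) | msg_id)


def parse_io_rsp_imm(imm):
    """Client side: recover (msg_id, errno) from a response immediate."""
    imm_type, payload = from_imm(imm)
    if imm_type != IO_RSP_IMM:
        raise ValueError("not an IO response")
    return payload & ((1 << ID_BITS) - 1), -(payload >> ID_BITS)
```

For example, with 128 KiB chunks (chunk_bits=17), parse_io_req_imm() applied to
io_req_imm(5, 4096, 17) yields the chunk id 5 and the offset 4096 again.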

    CLT                                                          SRV
    usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
    [RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

* Write (always_invalidate=Y) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with the
memory chunk; when the invalidation finishes, it passes the IO to the RNBD
server module.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
back the recv buffer for later use.
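
The effect of the invalidate and re-register cycle described above can be
modelled in a few lines. The following Python sketch is a toy model under
stated assumptions, not the driver's implementation: the class and method names
are hypothetical, and deriving the new key by bumping only the low 8 bits of
the old one is an assumption about how a fresh rkey might be produced.

```python
def inc_rkey(rkey):
    """Bump the low 8 bits of a 32 bit rkey, keeping the upper 24 bits."""
    return (rkey & 0xFFFFFF00) | ((rkey + 1) & 0xFF)


class ServerChunk:
    """One server-side memory chunk and the rkey that currently protects it."""

    def __init__(self, rkey):
        self.rkey = rkey
        self.valid = True

    def invalidate(self):
        # Done before the buffer is handed over to the upper layer.
        self.valid = False

    def reregister(self):
        # Done after the buffer comes back; returns the new rkey.
        self.rkey = inc_rkey(self.rkey)
        self.valid = True
        return self.rkey

    def remote_write_allowed(self, rkey):
        return self.valid and rkey == self.rkey


class ClientView:
    """The client's table of rkeys, one per remote buffer."""

    def __init__(self, rkeys):
        self.rkeys = list(rkeys)

    def on_rkey_rsp(self, buf_id, new_rkey):
        # Validate the message before updating the stored key.
        if not 0 <= buf_id < len(self.rkeys):
            raise ValueError("unknown buffer id")
        self.rkeys[buf_id] = new_rkey
```

The point of the model is that a key captured by the client before an IO stops
working once the server has invalidated the chunk, and only the key delivered
in the rkey response is accepted afterwards.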

    CLT                                                          SRV
    usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
    [RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
    [RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


* Read (always_invalidate=N) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code.

    CLT                                           SRV
    usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
    [RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
    or in case client requested invalidation:
    [RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

* Read (always_invalidate=Y) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk; when
the invalidation finishes, it passes the IO to the RNBD server module.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
back the recv buffer for later use.
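
The contents of the read request message listed in step 1 (type, size of the
user header, flags and a list of addresses with keys) can be illustrated with a
small serializer. This Python sketch is an assumption-laden example and not the
wire format used by the driver: the little-endian layout, the field widths, the
length field in each descriptor, the MSG_READ tag and the F_NEED_INVAL flag are
all hypothetical.

```python
import struct

MSG_READ = 3            # hypothetical type tag
F_NEED_INVAL = 1 << 0   # hypothetical "invalidation requested" flag

_HDR = struct.Struct("<HHHH")  # type, usr_len, flags, number of descriptors
_DESC = struct.Struct("<QII")  # address, key, length


def pack_read_msg(usr_len, flags, descs):
    """Serialize a read request; descs is a list of (addr, key, length)."""
    out = bytearray(_HDR.pack(MSG_READ, usr_len, flags, len(descs)))
    for addr, key, length in descs:
        out += _DESC.pack(addr, key, length)
    return bytes(out)


def unpack_read_msg(buf):
    """Parse a read request and return (usr_len, flags, descs)."""
    if len(buf) < _HDR.size:
        raise ValueError("truncated header")
    msg_type, usr_len, flags, cnt = _HDR.unpack_from(buf, 0)
    if msg_type != MSG_READ:
        raise ValueError("not a read message")
    if len(buf) < _HDR.size + cnt * _DESC.size:
        raise ValueError("truncated descriptor list")
    descs = [_DESC.unpack_from(buf, _HDR.size + i * _DESC.size)
             for i in range(cnt)]
    return usr_len, flags, descs
```

A receiver built this way checks the declared descriptor count against the
actual message length before touching any descriptor, which is the kind of
validation a server would want before acting on addresses and keys supplied by
a remote peer.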

    CLT                                           SRV
    usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
    [RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
    [RTRS_MSG_RKEY_RSP]          <----------------- (RTRS_MSG_RKEY_RSP)
    or in case client requested invalidation:
    [RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>