****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high-speed transport library
which provides support for establishing an optimal number of connections
between client and server machines using an RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an RTRS
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for RDMA transfers. A session
consists of multiple paths, each representing a separate physical link
between client and server. The paths are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are CPUs on
the client.
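
The hierarchy described above (a session holding several paths, each path
holding one connection per client CPU) can be pictured with a small user-space
C model. This is only a sketch: all ex_* names, the fixed path limit and the
round-robin choice of a path are assumptions made for illustration and are not
taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_PATHS 4 /* hypothetical limit, chosen for the example */

struct ex_path {
    bool connected;       /* path is usable for IO */
    unsigned int nr_cons; /* one connection (QP) per client CPU */
};

struct ex_session {
    struct ex_path paths[EX_MAX_PATHS];
    unsigned int nr_paths;
    unsigned int next;    /* round-robin cursor */
};

/* Return the index of the next connected path, or -1 if none is usable. */
static int ex_pick_path(struct ex_session *s)
{
    for (unsigned int i = 0; i < s->nr_paths; i++) {
        unsigned int idx = (s->next + i) % s->nr_paths;

        if (s->paths[idx].connected) {
            s->next = (idx + 1) % s->nr_paths;
            return (int)idx;
        }
    }
    return -1;
}

/* Map the submitting CPU to a connection; the caller ensures nr_cons > 0. */
static unsigned int ex_pick_con(const struct ex_path *p, unsigned int cpu)
{
    return cpu % p->nr_cons;
}
```

Skipping disconnected paths in ex_pick_path() is what gives the model its
failover behaviour: as long as one path is connected, a request still finds a
route.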

When processing an incoming write or read request, the RTRS client uses the
memory chunks reserved for it on the server side. Their number, size and
addresses need to be exchanged between client and server during the connection
establishment phase. Apart from the memory-related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to report the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.
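
To make the acknowledgment direction concrete, the following sketch packs a
request id and an errno into one 32-bit immediate value and unpacks them again.
The 16/16 bit split and the ex_* names are assumptions chosen for readability;
the text above only states that both values travel in the immediate field, not
how the bits are divided.

```c
#include <stdint.h>

#define EX_ID_BITS 16u                          /* assumed width of the id */
#define EX_ID_MASK ((1u << EX_ID_BITS) - 1u)

/* Pack a request id and an errno (stored as a positive value) into 32 bits. */
static uint32_t ex_rsp_imm_pack(uint16_t id, int err)
{
    uint32_t e = (uint32_t)(err < 0 ? -err : err) & 0xffffu;

    return (e << EX_ID_BITS) | id;
}

/* Recover the id of the request being acknowledged. */
static uint16_t ex_rsp_imm_id(uint32_t imm)
{
    return (uint16_t)(imm & EX_ID_MASK);
}

/* Recover the errno as a negative value (0 means success). */
static int ex_rsp_imm_errno(uint32_t imm)
{
    return -(int)(imm >> EX_ID_BITS);
}
```

For example, ex_rsp_imm_pack(7, -5) yields a value from which
ex_rsp_imm_id() returns 7 and ex_rsp_imm_errno() returns -5.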

The module parameter always_invalidate was introduced to address the security
problem discussed at LPC RDMA MC 2019. When always_invalidate=Y, the server
invalidates each RDMA buffer before handing it over to the RNBD server, which
then passes it to the block layer. A new rkey is generated and registered for
the buffer after it returns from the block layer and the RNBD server.
The new rkey is sent back to the client along with the IO result.
This procedure is the default behaviour of the driver. The invalidation and
registration on each IO cause a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N) if they understand and accept the risk of a malicious
client being able to corrupt the memory of a server it is connected to. This
might be a reasonable option in a scenario where all the clients and all the
servers are located within a secure datacenter.
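
The effect of always_invalidate=Y on a single server buffer can be modelled as
a two-step key rotation. The sketch below is a plain user-space model, not
driver code: the ex_* names are hypothetical, and deriving the new key by
adding one is an assumption that merely guarantees the key changes.

```c
#include <stdbool.h>
#include <stdint.h>

struct ex_srv_buf {
    uint32_t rkey;   /* key a client must present for remote access */
    bool registered; /* false while the upper layers own the buffer */
};

/* Remote access succeeds only with the current key on a registered buffer. */
static bool ex_remote_access_ok(const struct ex_srv_buf *b, uint32_t rkey)
{
    return b->registered && b->rkey == rkey;
}

/* Step 1: invalidate before the IO is handed to the upper layers. */
static void ex_invalidate(struct ex_srv_buf *b)
{
    b->registered = false;
}

/*
 * Step 2: once the IO has completed, register the buffer under a fresh key
 * and return that key so it can be sent to the client with the IO result.
 */
static uint32_t ex_reregister(struct ex_srv_buf *b)
{
    b->rkey += 1; /* assumption: any key that differs from the old one */
    b->registered = true;
    return b->rkey;
}
```

In this model a client holding the old key cannot touch the buffer while the
IO is in flight, and the old key stays useless after re-registration.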


Connection establishment
------------------------

1. Client starts establishing the connections belonging to a path of a session
one by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
These messages include the uuid of the session and the uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per
session (as many as there are CPUs on the client), the id of the current
connection and the reconnect counter, which is used to resolve situations
where the client is trying to reconnect a path while the server is still
destroying the old one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from the magic and
protocol version, the messages include an error code, the queue depth supported
by the server (the number of memory chunks which are going to be allocated for
that session) and the maximum size of one IO. The RTRS_MSG_NEW_RKEY_F flag is
set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message, containing the name of the session, to the server.
This message requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 have finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty RDMA
messages with an immediate field), which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA-related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IOs are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                   <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                   <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                   -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                   -------------------> [RTRS_HB_MSG_ACK]
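
The heartbeat timeout mentioned in steps 6 and 7 amounts to checking how long
the peer has been silent. A minimal sketch of such a check follows; the ex_*
names, the millisecond clock and the "interval times allowed misses" rule are
assumptions for illustration, since the text does not specify how the timeout
is computed.

```c
#include <stdbool.h>
#include <stdint.h>

struct ex_hb {
    uint64_t last_rx_ms;  /* time a heartbeat or ack was last received */
    uint32_t interval_ms; /* assumed heartbeat period */
    uint32_t max_missed;  /* assumed number of periods tolerated */
};

/* Record that the peer has been heard from. */
static void ex_hb_on_rx(struct ex_hb *hb, uint64_t now_ms)
{
    hb->last_rx_ms = now_ms;
}

/*
 * True once the peer has been silent for longer than the allowed window.
 * The caller guarantees now_ms >= hb->last_rx_ms.
 */
static bool ex_hb_expired(const struct ex_hb *hb, uint64_t now_ms)
{
    uint64_t limit = (uint64_t)hb->interval_ms * hb->max_missed;

    return now_ms - hb->last_rx_ms > limit;
}
```

A caller would run ex_hb_expired() from a periodic timer and, when it returns
true, disconnect the path and start the failover and reconnect handling
described in step 7.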

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and RDMA-writes the user data, the user header and the
RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message
only contains the size of the user header. The client uses the IMM field to
tell the server which chunk has been accessed and at what offset the
RTRS_MSG_RDMA_WRITE message can be found.

2. When confirming a write request, the server sends an "empty" RDMA message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
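
For the request direction, the immediate value has to carry a chunk index and
an offset within that chunk. One way to do this, sketched below, is to reserve
the low bits for the offset. The ex_* names and the power-of-two chunk size
expressed as chunk_bits are assumptions; the text does not define the actual
encoding.

```c
#include <stdint.h>

/*
 * Pack a chunk index and a message offset into one 32-bit value.
 * chunk_bits is log2 of the chunk size and must be between 1 and 31.
 */
static uint32_t ex_req_imm_pack(uint32_t chunk_id, uint32_t msg_off,
                                unsigned int chunk_bits)
{
    uint32_t off_mask = (1u << chunk_bits) - 1u;

    return (chunk_id << chunk_bits) | (msg_off & off_mask);
}

/* Server side: which chunk was written to. */
static uint32_t ex_req_imm_chunk(uint32_t imm, unsigned int chunk_bits)
{
    return imm >> chunk_bits;
}

/* Server side: where in that chunk the message starts. */
static uint32_t ex_req_imm_off(uint32_t imm, unsigned int chunk_bits)
{
    return imm & ((1u << chunk_bits) - 1u);
}
```

With 128 KiB chunks (chunk_bits = 17), chunk 3 and offset 4096 encode to
397312, and the two accessor functions return 3 and 4096 again.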

* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and RDMA-writes the user data, the user header and the
RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message
only contains the size of the user header. The client uses the IMM field to
tell the server which chunk has been accessed and at what offset the
RTRS_MSG_RDMA_WRITE message can be found. The server first invalidates the
rkey associated with the memory chunks and, once the invalidation has
finished, passes the IO to the RNBD server module.

2. When confirming a write request, the server sends an "empty" RDMA message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, finishes the IO after updating the rkey for the rbuffer, and then
posts the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
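
On the client, handling the new rkey message comes down to validating it and
updating the key cached for the affected remote buffer. The sketch below shows
that step in isolation. The structure layouts and ex_* names are hypothetical,
and the only validation performed is a range check on the buffer id; the
actual checks are not described in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Client-side record of one remote buffer (hypothetical layout). */
struct ex_rbuf {
    uint64_t addr;
    uint32_t rkey;
};

/* Payload of the rkey response (hypothetical layout). */
struct ex_rkey_rsp {
    uint32_t buf_id; /* which remote buffer the key belongs to */
    uint32_t rkey;   /* key to use for the next access */
};

/*
 * Validate the message and store the new key.
 * Returns 0 on success, -1 if the message names an unknown buffer.
 */
static int ex_handle_rkey_rsp(struct ex_rbuf *bufs, size_t nr_bufs,
                              const struct ex_rkey_rsp *msg)
{
    if (msg->buf_id >= nr_bufs)
        return -1;

    bufs[msg->buf_id].rkey = msg->rkey;
    return 0;
}
```

Only after the update succeeds would the client complete the IO and post the
receive buffer again, so that the next request to this chunk already uses the
new key.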


* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying whether memory invalidation is
necessary) and the list of addresses along with keys for the data to be read
into.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" RDMA message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
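
The read message carries a list of client-side addresses and keys that the
server writes the requested data into. The sketch below works out how many
bytes of a reply land in each list entry. The ex_desc layout, including the
per-entry length field, and the function name are assumptions for
illustration, not the message format used on the wire.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of the address/key list (hypothetical layout). */
struct ex_desc {
    uint64_t addr; /* client-side address to write to */
    uint32_t key;  /* key granting access to that memory */
    uint32_t len;  /* capacity of this entry in bytes */
};

/*
 * Distribute data_len bytes over the descriptors in order.
 * out_len[i] receives the byte count for descs[i].
 * Returns the number of descriptors used, or -1 if the list is too small.
 */
static int ex_split_read(const struct ex_desc *descs, size_t nr,
                         uint32_t data_len, uint32_t *out_len)
{
    size_t used = 0;

    for (size_t i = 0; i < nr; i++)
        out_len[i] = 0;

    for (size_t i = 0; i < nr && data_len > 0; i++) {
        uint32_t n = data_len < descs[i].len ? data_len : descs[i].len;

        out_len[i] = n;
        data_len -= n;
        used = i + 1;
    }
    return data_len == 0 ? (int)used : -1;
}
```

With two 4096-byte entries, a 6000-byte reply uses both entries (4096 and 1904
bytes), while a 9000-byte reply does not fit and is rejected.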

* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying whether memory invalidation is
necessary) and the list of addresses along with keys for the data to be read
into. The server first invalidates the rkey associated with the memory chunks
and, once the invalidation has finished, passes the IO to the RNBD server
module.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" RDMA message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code. The new rkey is sent
back using a SEND_WITH_IMM WR. When the client receives the new rkey message,
it validates the message, finishes the IO after updating the rkey for the
rbuffer, and then posts the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>