****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high-speed transport library
which provides support to establish an optimal number of connections
between client and server machines using RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an RTRS
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for RDMA transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server. The paths are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are CPUs on
the client.

When processing an incoming write or read request, the RTRS client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory-related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to carry the errno. The client uses the immediate
field to tell the server which of the memory chunks has been accessed and at
which offset the message can be found.
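The two uses of the immediate field can be illustrated with a small model.
This is only a sketch: the bit widths, the type codes and all function names
below are assumptions chosen for illustration and are not taken from this
document. It shows how a chunk id and an offset (client to server), or a
request id and an errno (server to client), could share one 32-bit value.

```python
# Illustrative model of a 32-bit immediate field. All widths and type codes
# are assumptions made for this sketch.
TYPE_BITS = 4                       # assumed: upper bits carry a message type
PAYLOAD_BITS = 32 - TYPE_BITS       # remaining bits carry the payload
ERRNO_BITS = 9                      # assumed width of the errno part
MSG_ID_BITS = PAYLOAD_BITS - ERRNO_BITS
CHUNK_BITS = 17                     # assumed: 128 KiB chunks

IO_REQ_IMM = 0                      # hypothetical type codes
IO_RSP_IMM = 1


def _mask(bits):
    return (1 << bits) - 1


def to_imm(imm_type, payload):
    """Combine a type and a payload into one 32-bit immediate value."""
    return ((imm_type & _mask(TYPE_BITS)) << PAYLOAD_BITS) | (
        payload & _mask(PAYLOAD_BITS))


def from_imm(imm):
    """Split an immediate value into (type, payload)."""
    return imm >> PAYLOAD_BITS, imm & _mask(PAYLOAD_BITS)


def io_req_imm(chunk_id, offset):
    """Client -> server: which chunk was written and where the message is."""
    return to_imm(IO_REQ_IMM,
                  (chunk_id << CHUNK_BITS) | (offset & _mask(CHUNK_BITS)))


def parse_io_req(payload):
    """Server side: recover (chunk_id, offset) from a request payload."""
    return payload >> CHUNK_BITS, payload & _mask(CHUNK_BITS)


def io_rsp_imm(msg_id, err):
    """Server -> client: which request is acknowledged, plus the errno."""
    payload = ((abs(err) & _mask(ERRNO_BITS)) << MSG_ID_BITS) | (
        msg_id & _mask(MSG_ID_BITS))
    return to_imm(IO_RSP_IMM, payload)


def parse_io_rsp(payload):
    """Client side: recover (msg_id, errno) from a response payload."""
    return payload & _mask(MSG_ID_BITS), -(payload >> MSG_ID_BITS)
```

In this model the receiver first calls from_imm() to learn the message type
and then applies the matching parse function to the payload.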

The module parameter always_invalidate was introduced to address the security
problem discussed at LPC RDMA MC 2019. When always_invalidate=Y, the server
invalidates each RDMA buffer before handing it over to the RNBD server, which
then passes it to the block layer. A new rkey is generated and registered for
the buffer after it returns from the block layer and the RNBD server.
The new rkey is sent back to the client along with the IO result.
This procedure is the default behaviour of the driver. The invalidation and
registration on each IO causes a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N) if they understand and accept the risk of a malicious
client being able to corrupt the memory of a server it is connected to. This
might be a reasonable option in a scenario where all the clients and all the
servers are located within a secure datacenter.
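The effect of always_invalidate can be pictured with a toy model. The sketch
below is not driver code: the class, the function names and the use of a
simple counter in place of device-issued keys are all assumptions. It only
shows why a client holding an old rkey cannot modify a buffer while the IO is
being processed, or after a new key has been registered.

```python
import itertools

_rkeys = itertools.count(1)         # stand-in for keys issued by the RDMA device


class ServerBuffer:
    """Toy model of one server-side chunk protected by an rkey."""

    def __init__(self):
        self.rkey = next(_rkeys)
        self.valid = True
        self.data = b""

    def remote_write(self, rkey, data):
        """Model an RDMA write by the client; fails with a stale rkey."""
        if not self.valid or rkey != self.rkey:
            raise PermissionError("rkey is invalid")
        self.data = data


def handle_io(buf, always_invalidate, submit):
    """Process one IO; return (result, new_rkey or None)."""
    if always_invalidate:
        buf.valid = False           # client can no longer touch the buffer
    result = submit(buf)            # hand the buffer to the upper layer
    if not always_invalidate:
        return result, None
    buf.rkey = next(_rkeys)         # register the buffer under a fresh key
    buf.valid = True
    return result, buf.rkey
```

With always_invalidate set, a write attempted from inside submit() or with the
previous key afterwards raises PermissionError; with it unset, the original
key stays usable for the lifetime of the buffer, which is the risk described
above.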

Connection establishment
------------------------

1. The client starts establishing the connections belonging to a path of a
session one by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect
requests. These include the uuid of the session and the uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per
session (as many as CPUs on the client), the id of the current connection and
the reconnect counter, which is used to resolve situations where the
client is trying to reconnect a path while the server is still destroying the
old one.

2. The server accepts the connection requests one by one and attaches
RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
protocol version, the messages include the error code, the queue depth
supported by the server (number of memory chunks which are going to be
allocated for that session) and the maximum size of one IO. The
RTRS_MSG_NEW_RKEY_F flag is set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message, containing the name of the session, to the server.
This message requests the address information from the server.

4. The server replies to the session info request message with
RTRS_MSG_INFO_RSP, which contains the addresses and keys of the RDMA buffers
allocated for that session.

5. The session becomes connected after all paths to be established are
connected (i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty RDMA
messages with an immediate field), which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA-related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all inflight IOs are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                   <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                   <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                   -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                   -------------------> [RTRS_HB_MSG_ACK]
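
The server-side bookkeeping during connection establishment can be sketched
as follows. This is a simplified model under stated assumptions: the class
and field names, the placeholder magic and version values, the generic error
code and the way a mismatching reconnect counter is rejected are all
hypothetical. It shows the lookup of a session and path by uuid for each
connection request, and that the buffer information is handed out only after
every connection of the path has been accepted.

```python
from dataclasses import dataclass, field

MAGIC, PROTO_VERSION = 0xABCD, 1    # placeholder values for this sketch
ERR = -1                            # generic error code placeholder


@dataclass
class ConReq:
    sess_uuid: str
    path_uuid: str
    cid: int                        # id of the current connection
    cid_num: int                    # total number of connections
    recon_cnt: int = 0              # reconnect counter
    magic: int = MAGIC
    version: int = PROTO_VERSION


@dataclass
class Path:
    cid_num: int
    recon_cnt: int
    cons: set = field(default_factory=set)

    @property
    def established(self):
        return len(self.cons) == self.cid_num


class Server:
    def __init__(self, queue_depth=4, max_io_size=128 * 1024):
        self.queue_depth = queue_depth
        self.max_io_size = max_io_size
        self.sessions = {}          # sess_uuid -> {path_uuid: Path}

    def on_con_req(self, req):
        """Return (errno, queue_depth, max_io_size), as a CON_RSP would."""
        if (req.magic, req.version) != (MAGIC, PROTO_VERSION):
            return ERR, 0, 0
        paths = self.sessions.setdefault(req.sess_uuid, {})
        path = paths.setdefault(req.path_uuid,
                                Path(req.cid_num, req.recon_cnt))
        if path.recon_cnt != req.recon_cnt or req.cid >= path.cid_num:
            return ERR, 0, 0        # assumed handling of a stale or bad request
        path.cons.add(req.cid)
        return 0, self.queue_depth, self.max_io_size

    def on_info_req(self, sess_uuid, path_uuid):
        """Return (addr, rkey) pairs once every connection of the path is up."""
        path = self.sessions[sess_uuid][path_uuid]
        if not path.established:
            raise RuntimeError("path is not fully connected yet")
        # Made-up addresses and keys, one pair per memory chunk.
        return [(0x1000 * i, 100 + i) for i in range(self.queue_depth)]
```

A client in this model would call on_con_req() once per connection id and,
after all calls returned errno 0, call on_info_req() to learn the chunks it
may write to.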

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and RDMA-writes the user data, the user header and the
RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message
only contains the size of the user header. The client tells the server which
chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found
by using the IMM field.

2. When confirming a write request, the server sends an "empty" RDMA message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and RDMA-writes the user data, the user header and the
RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message
only contains the size of the user header. The client tells the server which
chunk has been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found
by using the IMM field. The server first invalidates the rkey associated with
the memory chunks and, once the invalidation has finished, passes the IO to the
RNBD server module.

2. When confirming a write request, the server sends an "empty" RDMA message
with an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying if memory invalidation is necessary)
and the list of addresses along with keys for the data to be read into.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" RDMA message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
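
The contents of a read request message can be made concrete with a small
serialization sketch. The field widths, the little-endian byte order, the
type code, the flag bit and the per-entry length field are assumptions made
for this example; the text above only states that the message carries a type,
the user header size, flags and a list of addresses with keys.

```python
import struct

MSG_READ = 1                        # hypothetical type code
F_NEED_INVAL = 1                    # hypothetical "invalidation needed" flag
HDR = struct.Struct("<HHHH")        # type, usr_len, flags, number of entries
DESC = struct.Struct("<QII")        # address, key, length (length is assumed)


def pack_read_msg(usr_len, flags, descs):
    """Serialize a read request; descs is a list of (addr, key, length)."""
    out = HDR.pack(MSG_READ, usr_len, flags, len(descs))
    for addr, key, length in descs:
        out += DESC.pack(addr, key, length)
    return out


def unpack_read_msg(buf):
    """Parse a read request into (type, usr_len, flags, descs)."""
    msg_type, usr_len, flags, cnt = HDR.unpack_from(buf, 0)
    descs = [DESC.unpack_from(buf, HDR.size + i * DESC.size)
             for i in range(cnt)]
    return msg_type, usr_len, flags, descs
```

The server would use the parsed address/key list as the destinations for the
requested data, and the flag to decide whether to attach the invalidation
shown in the second variant of the diagram.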

* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying if memory invalidation is necessary)
and the list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunks and,
once the invalidation has finished, passes the IO to the RNBD server module.

2. When confirming a read request, the server transfers the requested data
first, attaches an invalidation message if requested and finally sends an
"empty" RDMA message with an immediate field. The 32-bit field is used to
specify the outstanding inflight IO and the error code. The new rkey is sent
back using a SEND_WITH_IMM WR. When the client receives the new rkey message,
it validates the message, updates the rkey for the rbuffer, finishes the IO and
then posts the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)


=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>