==========
MD Cluster
==========

Cluster MD is a shared-device RAID for a cluster. It supports
two levels: raid1 and raid10 (limited support).


1. On-disk format
=================

Separate write-intent-bitmaps are used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is::

  0                    4k                     8k                    12k
  -------------------------------------------------------------------
  | idle                | md super            | bm super [0] + bits |
  | bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
  | bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
  | bm bits [3, contd]  |                     |                     |

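The layout above can be turned into a small offset calculation. The
following is a minimal sketch, not the md implementation: it assumes the
4k blocks drawn in the diagram and a per-node bitmap area of two blocks
(8k), which is only what the diagram shows. The function name
``bitmap_super_offset`` is hypothetical.

```python
KIB = 1024


def bitmap_super_offset(slot, per_node_bytes=8 * KIB):
    """Byte offset of the bitmap superblock for a zero-based bitmap slot.

    Assumes the layout in the diagram: one 4 KiB idle block, one 4 KiB md
    superblock, then one bitmap area per node. The 8 KiB per-node size is
    an assumption of this sketch, taken from the drawing only.
    """
    if slot < 0:
        raise ValueError("slot must be non-negative")
    first_bitmap = 8 * KIB  # skip the idle block and the md superblock
    return first_bitmap + slot * per_node_bytes


for slot in range(4):
    print("bm super[%d] at %dk" % (slot, bitmap_super_offset(slot) // KIB))
```

With these assumptions, slot 0 starts at 8k and slot 3 at 32k, which
matches the position of ``bm super[3]`` in the third row of the diagram.
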
During "normal" functioning we assume the filesystem ensures that only
one node writes to any given block at a time, so a write request will

 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to ensure
one node doesn't read from a location where another node (or the same
node) is writing.


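The three write steps listed above can be modelled in a few lines. This
is a toy in-memory sketch under stated assumptions: the class name
``WriteIntentBitmap``, the "chunk" granularity and the ``clear_delay``
value are all hypothetical, and mirrors are plain dictionaries.

```python
class WriteIntentBitmap:
    """Toy per-node write-intent bitmap (hypothetical model)."""

    def __init__(self, clear_delay=5.0):
        self.clear_delay = clear_delay
        self.dirty = {}  # chunk -> time at which the bit may be cleared

    def write(self, chunk, data, mirrors, now):
        # 1. set the appropriate bit (harmless if it is already set)
        self.dirty[chunk] = None
        # 2. commit the write to all mirrors
        for mirror in mirrors:
            mirror[chunk] = data
        # 3. schedule the bit to be cleared after a timeout
        self.dirty[chunk] = now + self.clear_delay

    def expire(self, now):
        """Clear every bit whose timeout has passed."""
        done = [c for c, t in self.dirty.items() if t is not None and t <= now]
        for chunk in done:
            del self.dirty[chunk]


mirrors = [{}, {}]
bitmap = WriteIntentBitmap(clear_delay=5.0)
bitmap.write(chunk=7, data=b"x", mirrors=mirrors, now=0.0)
bitmap.expire(now=1.0)            # too early, bit 7 stays set
still_set = 7 in bitmap.dirty
bitmap.expire(now=5.0)            # timeout reached, bit 7 is cleared
```

A bit that is still set when a node dies marks a region whose mirrors may
disagree, which is what the recovery described later relies on.
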
2. DLM Locks for management
===========================

There are three groups of locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)
-------------------------------------

 The bm_lockres protects individual node bitmaps. They are named in
 the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
 node joins the cluster, it acquires the lock in PW mode and holds it
 for as long as the node is part of the cluster. The lock
 resource number is based on the slot number returned by the DLM
 subsystem. Since DLM starts node count from one and bitmap slots
 start from zero, one is subtracted from the DLM slot number to arrive
 at the bitmap slot number.

 The LVB of the bitmap lock for a particular node records the range
 of sectors that are being re-synced by that node.  No other
 node may write to those sectors.  This is used when a new node
 joins the cluster.

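The slot-to-name mapping described above is a one-line calculation. The
sketch below assumes the three-digit zero padding seen in the names
quoted in this section (bitmap000, bitmap001); the function name
``bitmap_lock_name`` is hypothetical.

```python
def bitmap_lock_name(dlm_slot):
    """Map a one-based DLM slot number to the name of its bitmap lock."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    bitmap_slot = dlm_slot - 1  # bitmap slots start from zero
    # Three-digit padding follows the names quoted in this document;
    # treat the exact width as an assumption of this sketch.
    return "bitmap%03d" % bitmap_slot


print(bitmap_lock_name(1))
print(bitmap_lock_name(2))
```
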
2.2 Message passing locks
-------------------------

 Each node has to communicate with other nodes when starting or ending
 resync, and for metadata superblock updates.  This communication is
 managed through three locks: "token", "message", and "ack", together
 with the Lock Value Block (LVB) of the "message" lock.

2.3 new-device management
-------------------------

 A single lock: "no-new-dev" is used to coordinate the addition of
 new devices - this must be synchronized across the array.
 Normally all nodes hold a concurrent-read lock on this lock.

3. Communication
================

 Messages can be broadcast to all nodes, and the sender waits for all
 other nodes to acknowledge the message before proceeding.  Only one
 message can be processed at a time.

3.1 Message Types
-----------------

 There are six types of messages which are passed:

3.1.1 METADATA_UPDATED
^^^^^^^^^^^^^^^^^^^^^^

   informs other nodes that the metadata has
   been updated, and the node must re-read the md superblock. This is
   performed synchronously. It is primarily used to signal device
   failure.

3.1.2 RESYNCING
^^^^^^^^^^^^^^^

   informs other nodes that a resync is initiated or
   ended so that each node may suspend or resume the region.  Each
   RESYNCING message identifies a range of the devices that the
   sending node is about to resync. This overrides any previous
   notification from that node: only one range can be resynced at a
   time per-node.

3.1.3 NEWDISK
^^^^^^^^^^^^^

   informs other nodes that a device is being added to
   the array. Message contains an identifier for that device.  See
   below for further details.

3.1.4 REMOVE
^^^^^^^^^^^^

   A failed or spare device is being removed from the
   array. The slot-number of the device is included in the message.

3.1.5 RE_ADD
^^^^^^^^^^^^

   A failed device is being re-activated - the assumption
   is that it has been determined to be working again.

3.1.6 BITMAP_NEEDS_SYNC
^^^^^^^^^^^^^^^^^^^^^^^

   If a node is stopped locally but the bitmap
   isn't clean, then another node is informed to take the ownership of
   resync.

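For reference, the six message types can be written down as an
enumeration with a small record for the payload. This is an illustrative
sketch only: the numeric values produced by ``auto()`` and every field
name in ``ClusterMsg`` are assumptions, not the layout used in the LVB.

```python
from dataclasses import dataclass
from enum import Enum, auto


class MsgType(Enum):
    METADATA_UPDATED = auto()
    RESYNCING = auto()
    NEWDISK = auto()
    REMOVE = auto()
    RE_ADD = auto()
    BITMAP_NEEDS_SYNC = auto()


@dataclass
class ClusterMsg:
    """Hypothetical message record; field names are illustrative."""
    type: MsgType
    sender_slot: int
    lo: int = 0            # RESYNCING: start of the range
    hi: int = 0            # RESYNCING: end of the range
    device_slot: int = -1  # REMOVE / NEWDISK: slot of the device
    device_id: str = ""    # NEWDISK: identifier of the new device


msg = ClusterMsg(MsgType.RESYNCING, sender_slot=0, lo=0, hi=2048)
print(msg.type.name, msg.lo, msg.hi)
```
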
3.2 Communication mechanism
---------------------------

 The DLM LVB is used to communicate between nodes of the cluster. There
 are three resources used for the purpose:

3.2.1 token
^^^^^^^^^^^

   The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

3.2.2 message
^^^^^^^^^^^^^

   The lock resource which carries the data to communicate.

3.2.3 ack
^^^^^^^^^

   The resource which, once acquired, means that the message has been
   acknowledged by all nodes in the cluster. The BAST of the resource
   is used to inform the receiving node that a node wants to
   communicate.

The algorithm is:

 1. receive status - all nodes have concurrent-reader lock on "ack"::

        sender                         receiver                 receiver
        "ack":CR                       "ack":CR                 "ack":CR

 2. sender gets EX on "token",
    sender gets EX on "message"::

        sender                        receiver                 receiver
        "token":EX                    "ack":CR                 "ack":CR
        "message":EX
        "ack":CR

    Sender checks that it still needs to send a message. Messages
    received or other events that happened while waiting for the
    "token" may have made this message inappropriate or redundant.

 3. sender writes LVB

    sender down-converts "message" from EX to CW

    sender tries to get EX on "ack"

    ::

      [ wait until all receivers have *processed* the "message" ]

                                       [ triggered by bast of "ack" ]
                                       receiver gets CR on "message"
                                       receiver reads LVB
                                       receiver processes the message
                                       [ wait finish ]
                                       receiver releases "ack"
                                       receiver tries to get PR on "message"

     sender                         receiver                  receiver
     "token":EX                     "message":CR              "message":CR
     "message":CW
     "ack":EX

 4. triggered by grant of EX on "ack" (indicating all receivers
    have processed message)

    sender down-converts "ack" from EX to CR

    sender releases "message"

    sender releases "token"

    ::

                                 receiver upconverts to PR on "message"
                                 receiver gets CR on "ack"
                                 receiver releases "message"

     sender                      receiver                   receiver
     "ack":CR                    "ack":CR                   "ack":CR


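The waits in steps 3 and 4 follow from the usual DLM lock-mode
compatibility rules. The sketch below encodes that compatibility table
and checks the four decisive points of the algorithm. It models lock
modes only (no LVB, no BAST delivery), and ``can_grant`` is a
hypothetical helper name.

```python
# COMPAT[m] is the set of modes other holders may keep while m is granted.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_by_others):
    """True if `requested` is compatible with every mode already held."""
    return all(mode in COMPAT[requested] for mode in held_by_others)


# Step 3: the sender asks for EX on "ack" while two receivers hold CR.
ack_blocked = not can_grant("EX", ["CR", "CR"])
# Receivers can take CR on "message" while the sender holds it in CW ...
receiver_can_read = can_grant("CR", ["CW"])
# ... but their PR request waits until the sender releases "message".
pr_blocked = not can_grant("PR", ["CW"])
# Step 4: once every receiver has released "ack", EX is granted.
ack_granted = can_grant("EX", [])

print(ack_blocked, receiver_can_read, pr_blocked, ack_granted)
```

The blocked PR request is what keeps a receiver from re-taking "ack"
before the sender has finished: it is granted only after the sender
releases "message" in step 4.
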
4. Handling Failures
====================

4.1 Node Failure
----------------

 When a node fails, the DLM informs the cluster with the slot
 number. A surviving node then starts a cluster recovery thread. The
 cluster recovery thread:

        - acquires the bitmap<number> lock of the failed node
        - opens the bitmap
        - reads the bitmap of the failed node
        - copies the set bitmap to local node
        - cleans the bitmap of the failed node
        - releases bitmap<number> lock of the failed node
        - initiates resync of the bitmap on the current node.
          md_check_recovery is invoked within recover_bitmaps and
          calls metadata_update_start/finish, which locks the
          communication channel through lock_comm. This means that
          while one node is resyncing, it blocks all other nodes
          from writing anywhere on the array.

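The bitmap hand-over performed by the recovery thread can be sketched
with sets of dirty chunk numbers. This is a simplified model under
stated assumptions: bitmaps are in-memory sets, the bitmap<number>
locking is omitted, and the function name ``recover_failed_node`` is
hypothetical.

```python
def recover_failed_node(bitmaps, failed_slot, local_slot):
    """Fold the failed node's dirty bits into the local bitmap.

    `bitmaps` maps a bitmap slot to a set of dirty chunk numbers.
    Returns the sorted chunks the local node now has to resync.
    """
    failed = bitmaps[failed_slot]
    bitmaps[local_slot] |= failed  # copy the set bits to the local node
    failed.clear()                 # clean the bitmap of the failed node
    return sorted(bitmaps[local_slot])


bitmaps = {0: {1, 2}, 1: {2, 9}}
to_resync = recover_failed_node(bitmaps, failed_slot=1, local_slot=0)
print(to_resync)
```
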
 The resync process is the regular md resync. However, in a clustered
 environment when a resync is performed, it needs to tell other nodes
 of the areas which are suspended. Before a resync starts, the node
 sends out RESYNCING with the (lo,hi) range of the area which needs to
 be suspended. Each node maintains a suspend_list, which contains the
 list of ranges which are currently suspended. On receiving RESYNCING,
 the node adds the range to the suspend_list. Similarly, when the node
 performing resync finishes, it sends RESYNCING with an empty range to
 other nodes and other nodes remove the corresponding entry from the
 suspend_list.

 A helper function, ->area_resyncing() can be used to check if a
 particular I/O range should be suspended or not.

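The suspend_list bookkeeping described above can be sketched as a map
from sender to range. This is a minimal model, not the kernel data
structure: the class name ``SuspendList`` is hypothetical, ranges are
treated as half-open, and an empty range is taken to mean ``hi <= lo``,
which is an assumption of this sketch.

```python
class SuspendList:
    """Toy suspend_list: at most one suspended range per sending node."""

    def __init__(self):
        self.ranges = {}  # sender slot -> (lo, hi)

    def on_resyncing(self, slot, lo, hi):
        # A new RESYNCING overrides the previous one from the same node.
        if hi <= lo:
            self.ranges.pop(slot, None)  # empty range: resync finished
        else:
            self.ranges[slot] = (lo, hi)

    def area_resyncing(self, lo, hi):
        """True if [lo, hi) overlaps any currently suspended range."""
        return any(lo < s_hi and hi > s_lo
                   for s_lo, s_hi in self.ranges.values())


sl = SuspendList()
sl.on_resyncing(slot=1, lo=100, hi=200)
inside = sl.area_resyncing(150, 160)    # overlaps the suspended range
outside = sl.area_resyncing(200, 300)   # starts where the range ends
sl.on_resyncing(slot=1, lo=0, hi=0)     # node 1 reports it has finished
after_finish = sl.area_resyncing(150, 160)
```
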
4.2 Device Failure
------------------

 Device failures are handled and communicated with the metadata update
 routine.  When a node detects a device failure it does not allow
 any further writes to that device until the failure has been
 acknowledged by all other nodes.

5. Adding a new Device
======================

 For adding a new device, it is necessary that all nodes "see" the new
 device to be added. For this, the following algorithm is used:

   1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
   2.  Node 1 sends a NEWDISK message with uuid and slot number
   3.  Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4 and 5 could be a udev rule)
   4.  In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
   5.  Other nodes issue either of the following depending on whether
       the disk was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
       disc.number set to slot number), or
       ioctl(CLUSTERED_DISK_NACK)
   6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
   7.  Node 1 attempts EX lock on "no-new-dev"
   8.  If node 1 gets the lock, it sends METADATA_UPDATED after
       unmarking the disk as SpareLocal
   9.  If node 1 does not get the "no-new-dev" lock, it fails the
       operation and sends METADATA_UPDATED.
   10. Other nodes get the information whether a disk is added or not
       by the following METADATA_UPDATED.

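The outcome of steps 6 to 9 depends only on which nodes still hold their
CR lock on "no-new-dev". The sketch below models that decision; the
function name ``add_new_disk`` and the node names are hypothetical, and
the ioctl and uevent traffic is not modelled.

```python
def add_new_disk(found_by_node):
    """Decide whether node 1 can add the disk.

    `found_by_node` maps each other node to whether its userspace found
    the device. A node that found it drops its CR lock on "no-new-dev"
    (step 6); a node that did not find it keeps the lock.
    """
    remaining_cr = [node for node, found in found_by_node.items() if not found]
    # EX is incompatible with CR, so node 1 obtains the lock only when
    # no other node still holds it (steps 7 to 9).
    added = not remaining_cr
    return added, remaining_cr


print(add_new_disk({"node2": True, "node3": True}))
print(add_new_disk({"node2": True, "node3": False}))
```

In both cases node 1 sends METADATA_UPDATED afterwards, and that message
is how the other nodes learn the result.
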
6. Module interface
===================

 There are 17 call-backs which the md core can make to the cluster
 module.  Understanding these can give a good overview of the whole
 process.

6.1 join(nodes) and leave()
---------------------------

 These are called when an array is started with a clustered bitmap,
 and when the array is stopped.  join() ensures the cluster is
 available and initializes the various resources.
 Only the first 'nodes' nodes in the cluster can use the array.

6.2 slot_number()
-----------------

 Reports the slot number advised by the cluster infrastructure.
 Range is from 0 to nodes-1.

6.3 resync_info_update()
------------------------

 This updates the resync range that is stored in the bitmap lock.
 The starting point is updated as the resync progresses.  The
 end point is always the end of the array.
 It does *not* send a RESYNCING message.

6.4 resync_start(), resync_finish()
-----------------------------------

 These are called when resync/recovery/reshape starts or stops.
 They update the resyncing range in the bitmap lock and also
 send a RESYNCING message.  resync_start() reports the whole
 array as resyncing; resync_finish() reports none of it.

 resync_finish() also sends a BITMAP_NEEDS_SYNC message which
 allows some other node to take over.

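The division of work between resync_info_update(), resync_start() and
resync_finish() can be summarised in a small model. This is a sketch
under stated assumptions: the class ``ResyncReporter`` is hypothetical,
messages are recorded as tuples in a list, and the empty range is
written as (0, 0).

```python
class ResyncReporter:
    """Toy model of the resync call-backs on one node."""

    def __init__(self, array_sectors):
        self.array_sectors = array_sectors
        self.lvb_range = (0, 0)  # range stored with the bitmap lock
        self.sent = []           # messages broadcast to other nodes

    def resync_start(self):
        # Report the whole array as resyncing.
        self.lvb_range = (0, self.array_sectors)
        self.sent.append(("RESYNCING", 0, self.array_sectors))

    def resync_info_update(self, lo):
        # Only the stored range moves; no RESYNCING message is sent.
        self.lvb_range = (lo, self.array_sectors)

    def resync_finish(self):
        # Report that nothing is resyncing, then let another node take over.
        self.lvb_range = (0, 0)
        self.sent.append(("RESYNCING", 0, 0))
        self.sent.append(("BITMAP_NEEDS_SYNC",))


node = ResyncReporter(array_sectors=4096)
node.resync_start()
node.resync_info_update(lo=1024)
progress = node.lvb_range
node.resync_finish()
print(node.sent)
```
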
6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel()
-------------------------------------------------------------------------------

 metadata_update_start() is used to get exclusive access to
 the metadata.  If a change is still needed once that access is
 gained, metadata_update_finish() will send a METADATA_UPDATED
 message to all other nodes, otherwise metadata_update_cancel()
 can be used to release the lock.

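The start/finish/cancel pattern can be sketched as follows. The class
``MetadataUpdate`` and the helper ``update_metadata`` are hypothetical
names for this illustration; exclusive access is reduced to a boolean
flag and message delivery to a list.

```python
class MetadataUpdate:
    """Toy model of the metadata update call-backs."""

    def __init__(self):
        self.locked = False
        self.sent = []

    def start(self):
        self.locked = True  # exclusive access to the metadata

    def finish(self):
        if not self.locked:
            raise RuntimeError("finish() without start()")
        self.sent.append("METADATA_UPDATED")
        self.locked = False

    def cancel(self):
        self.locked = False  # release the lock, send nothing


def update_metadata(cluster, still_needed):
    """Take the lock, re-check, then either finish or cancel."""
    cluster.start()
    if still_needed():
        cluster.finish()
    else:
        cluster.cancel()


cluster = MetadataUpdate()
update_metadata(cluster, still_needed=lambda: True)   # change goes out
update_metadata(cluster, still_needed=lambda: False)  # became redundant
print(cluster.sent)
```

The re-check after start() matters because another node's update may
have arrived while this node was waiting for exclusive access.
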
6.6 area_resyncing()
--------------------

 This combines two elements of functionality.

 Firstly, it will check if any node is currently resyncing
 anything in a given range of sectors.  If any resync is found,
 then the caller will avoid writing or read-balancing in that
 range.

 Secondly, while node recovery is happening it reports that
 all areas are resyncing for READ requests.  This avoids races
 between the cluster-filesystem and the cluster-RAID handling
 a node failure.

6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack()
---------------------------------------------------------------

 These are used to manage the new-disk protocol described above.
 When a new device is added, add_new_disk_start() is called before
 it is bound to the array and, if that succeeds, add_new_disk_finish()
 is called once the device is fully added.

 When a device is added in acknowledgement to a previous
 request, or when the device is declared "unavailable",
 new_disk_ack() is called.

6.8 remove_disk()
-----------------

 This is called when a spare or failed device is removed from
 the array.  It causes a REMOVE message to be sent to other nodes.

6.9 gather_bitmaps()
--------------------

 This sends a RE_ADD message to all other nodes and then
 gathers bitmap information from all bitmaps.  This combined
 bitmap is then used to recover the re-added device.

6.10 lock_all_bitmaps() and unlock_all_bitmaps()
------------------------------------------------

 These are called when the bitmap is changed to none. If a node plans
 to clear the cluster raid's bitmap, it needs to make sure no other
 nodes are using the raid. This is achieved by locking all bitmap
 locks within the cluster; those locks are unlocked again
 accordingly.

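The check behind lock_all_bitmaps() can be sketched as an all-or-nothing
acquisition. This is a simplified model: ``in_use`` stands in for the
state of the bitmap locks held by other nodes, the Python function only
borrows the call-back's name, and rollback is reduced to discarding the
list of acquired slots.

```python
def lock_all_bitmaps(in_use, my_slot):
    """Try to take every other node's bitmap lock.

    `in_use` maps a bitmap slot to True if some node holds that lock.
    Returns (ok, acquired_slots); on failure nothing stays locked.
    """
    acquired = []
    for slot in sorted(in_use):
        if slot == my_slot:
            continue  # our own bitmap lock is already held
        if in_use[slot]:
            # Another node is using the array: undo and give up,
            # which corresponds to unlock_all_bitmaps().
            return False, []
        acquired.append(slot)
    return True, acquired


print(lock_all_bitmaps({0: True, 1: False, 2: False}, my_slot=0))
print(lock_all_bitmaps({0: True, 1: True, 2: False}, my_slot=0))
```
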
7. Unsupported features
=======================

There are some things which are not supported by cluster MD yet.

- change array_sectors.