18c2ecf20Sopenharmony_ci.. SPDX-License-Identifier: GPL-2.0
28c2ecf20Sopenharmony_ci
38c2ecf20Sopenharmony_ci==========================
48c2ecf20Sopenharmony_ciXFS Delayed Logging Design
58c2ecf20Sopenharmony_ci==========================
68c2ecf20Sopenharmony_ci
78c2ecf20Sopenharmony_ciIntroduction to Re-logging in XFS
88c2ecf20Sopenharmony_ci=================================
98c2ecf20Sopenharmony_ci
108c2ecf20Sopenharmony_ciXFS logging is a combination of logical and physical logging. Some objects,
118c2ecf20Sopenharmony_cisuch as inodes and dquots, are logged in logical format where the details
128c2ecf20Sopenharmony_cilogged are made up of the changes to in-core structures rather than on-disk
138c2ecf20Sopenharmony_cistructures. Other objects - typically buffers - have their physical changes
148c2ecf20Sopenharmony_cilogged. The reason for these differences is to reduce the amount of log space
158c2ecf20Sopenharmony_cirequired for objects that are frequently logged. Some parts of inodes are more
168c2ecf20Sopenharmony_cifrequently logged than others, and inodes are typically more frequently logged
178c2ecf20Sopenharmony_cithan any other object (except maybe the superblock buffer) so keeping the
188c2ecf20Sopenharmony_ciamount of metadata logged low is of prime importance.
198c2ecf20Sopenharmony_ci
208c2ecf20Sopenharmony_ciThe reason that this is such a concern is that XFS allows multiple separate
218c2ecf20Sopenharmony_cimodifications to a single object to be carried in the log at any given time.
228c2ecf20Sopenharmony_ciThis allows the log to avoid needing to flush each change to disk before
238c2ecf20Sopenharmony_cirecording a new change to the object. XFS does this via a method called
248c2ecf20Sopenharmony_ci"re-logging". Conceptually, this is quite simple - all it requires is that any
258c2ecf20Sopenharmony_cinew change to the object is recorded with a *new copy* of all the existing
268c2ecf20Sopenharmony_cichanges in the new transaction that is written to the log.
278c2ecf20Sopenharmony_ci
288c2ecf20Sopenharmony_ciThat is, if we have a sequence of changes A through to F, and the object was
298c2ecf20Sopenharmony_ciwritten to disk after change D, we would see in the log the following series
308c2ecf20Sopenharmony_ciof transactions, their contents and the log sequence number (LSN) of the
318c2ecf20Sopenharmony_citransaction::
328c2ecf20Sopenharmony_ci
338c2ecf20Sopenharmony_ci	Transaction		Contents	LSN
348c2ecf20Sopenharmony_ci	   A			   A		   X
358c2ecf20Sopenharmony_ci	   B			  A+B		  X+n
368c2ecf20Sopenharmony_ci	   C			 A+B+C		 X+n+m
378c2ecf20Sopenharmony_ci	   D			A+B+C+D		X+n+m+o
388c2ecf20Sopenharmony_ci	    <object written to disk>
398c2ecf20Sopenharmony_ci	   E			   E		   Y (> X+n+m+o)
408c2ecf20Sopenharmony_ci	   F			  E+F		  Y+p
418c2ecf20Sopenharmony_ci
428c2ecf20Sopenharmony_ciIn other words, each time an object is relogged, the new transaction contains
438c2ecf20Sopenharmony_cithe aggregation of all the previous changes currently held only in the log.
448c2ecf20Sopenharmony_ci
458c2ecf20Sopenharmony_ciThis relogging technique also allows objects to be moved forward in the log so
468c2ecf20Sopenharmony_cithat an object being relogged does not prevent the tail of the log from ever
478c2ecf20Sopenharmony_cimoving forward.  This can be seen in the table above by the changing
488c2ecf20Sopenharmony_ci(increasing) LSN of each subsequent transaction - the LSN is effectively a
498c2ecf20Sopenharmony_cidirect encoding of the location in the log of the transaction.
508c2ecf20Sopenharmony_ci
518c2ecf20Sopenharmony_ciThis relogging is also used to implement long-running, multiple-commit
528c2ecf20Sopenharmony_citransactions.  These transaction are known as rolling transactions, and require
538c2ecf20Sopenharmony_cia special log reservation known as a permanent transaction reservation. A
548c2ecf20Sopenharmony_citypical example of a rolling transaction is the removal of extents from an
558c2ecf20Sopenharmony_ciinode which can only be done at a rate of two extents per transaction because
568c2ecf20Sopenharmony_ciof reservation size limitations. Hence a rolling extent removal transaction
578c2ecf20Sopenharmony_cikeeps relogging the inode and btree buffers as they get modified in each
588c2ecf20Sopenharmony_ciremoval operation. This keeps them moving forward in the log as the operation
598c2ecf20Sopenharmony_ciprogresses, ensuring that current operation never gets blocked by itself if the
608c2ecf20Sopenharmony_cilog wraps around.
618c2ecf20Sopenharmony_ci
628c2ecf20Sopenharmony_ciHence it can be seen that the relogging operation is fundamental to the correct
638c2ecf20Sopenharmony_ciworking of the XFS journalling subsystem. From the above description, most
648c2ecf20Sopenharmony_cipeople should be able to see why the XFS metadata operations writes so much to
658c2ecf20Sopenharmony_cithe log - repeated operations to the same objects write the same changes to
668c2ecf20Sopenharmony_cithe log over and over again. Worse is the fact that objects tend to get
678c2ecf20Sopenharmony_cidirtier as they get relogged, so each subsequent transaction is writing more
688c2ecf20Sopenharmony_cimetadata into the log.
698c2ecf20Sopenharmony_ci
708c2ecf20Sopenharmony_ciAnother feature of the XFS transaction subsystem is that most transactions are
718c2ecf20Sopenharmony_ciasynchronous. That is, they don't commit to disk until either a log buffer is
728c2ecf20Sopenharmony_cifilled (a log buffer can hold multiple transactions) or a synchronous operation
738c2ecf20Sopenharmony_ciforces the log buffers holding the transactions to disk. This means that XFS is
748c2ecf20Sopenharmony_cidoing aggregation of transactions in memory - batching them, if you like - to
758c2ecf20Sopenharmony_ciminimise the impact of the log IO on transaction throughput.
768c2ecf20Sopenharmony_ci
778c2ecf20Sopenharmony_ciThe limitation on asynchronous transaction throughput is the number and size of
788c2ecf20Sopenharmony_cilog buffers made available by the log manager. By default there are 8 log
798c2ecf20Sopenharmony_cibuffers available and the size of each is 32kB - the size can be increased up
808c2ecf20Sopenharmony_cito 256kB by use of a mount option.
818c2ecf20Sopenharmony_ci
828c2ecf20Sopenharmony_ciEffectively, this gives us the maximum bound of outstanding metadata changes
838c2ecf20Sopenharmony_cithat can be made to the filesystem at any point in time - if all the log
848c2ecf20Sopenharmony_cibuffers are full and under IO, then no more transactions can be committed until
858c2ecf20Sopenharmony_cithe current batch completes. It is now common for a single current CPU core to
868c2ecf20Sopenharmony_cibe to able to issue enough transactions to keep the log buffers full and under
878c2ecf20Sopenharmony_ciIO permanently. Hence the XFS journalling subsystem can be considered to be IO
888c2ecf20Sopenharmony_cibound.
898c2ecf20Sopenharmony_ci
908c2ecf20Sopenharmony_ciDelayed Logging: Concepts
918c2ecf20Sopenharmony_ci=========================
928c2ecf20Sopenharmony_ci
938c2ecf20Sopenharmony_ciThe key thing to note about the asynchronous logging combined with the
948c2ecf20Sopenharmony_cirelogging technique XFS uses is that we can be relogging changed objects
958c2ecf20Sopenharmony_cimultiple times before they are committed to disk in the log buffers. If we
968c2ecf20Sopenharmony_cireturn to the previous relogging example, it is entirely possible that
978c2ecf20Sopenharmony_citransactions A through D are committed to disk in the same log buffer.
988c2ecf20Sopenharmony_ci
998c2ecf20Sopenharmony_ciThat is, a single log buffer may contain multiple copies of the same object,
1008c2ecf20Sopenharmony_cibut only one of those copies needs to be there - the last one "D", as it
1018c2ecf20Sopenharmony_cicontains all the changes from the previous changes. In other words, we have one
1028c2ecf20Sopenharmony_cinecessary copy in the log buffer, and three stale copies that are simply
1038c2ecf20Sopenharmony_ciwasting space. When we are doing repeated operations on the same set of
1048c2ecf20Sopenharmony_ciobjects, these "stale objects" can be over 90% of the space used in the log
1058c2ecf20Sopenharmony_cibuffers. It is clear that reducing the number of stale objects written to the
1068c2ecf20Sopenharmony_cilog would greatly reduce the amount of metadata we write to the log, and this
1078c2ecf20Sopenharmony_ciis the fundamental goal of delayed logging.
1088c2ecf20Sopenharmony_ci
1098c2ecf20Sopenharmony_ciFrom a conceptual point of view, XFS is already doing relogging in memory (where
1108c2ecf20Sopenharmony_cimemory == log buffer), only it is doing it extremely inefficiently. It is using
1118c2ecf20Sopenharmony_cilogical to physical formatting to do the relogging because there is no
1128c2ecf20Sopenharmony_ciinfrastructure to keep track of logical changes in memory prior to physically
1138c2ecf20Sopenharmony_ciformatting the changes in a transaction to the log buffer. Hence we cannot avoid
1148c2ecf20Sopenharmony_ciaccumulating stale objects in the log buffers.
1158c2ecf20Sopenharmony_ci
1168c2ecf20Sopenharmony_ciDelayed logging is the name we've given to keeping and tracking transactional
1178c2ecf20Sopenharmony_cichanges to objects in memory outside the log buffer infrastructure. Because of
1188c2ecf20Sopenharmony_cithe relogging concept fundamental to the XFS journalling subsystem, this is
1198c2ecf20Sopenharmony_ciactually relatively easy to do - all the changes to logged items are already
1208c2ecf20Sopenharmony_citracked in the current infrastructure. The big problem is how to accumulate
1218c2ecf20Sopenharmony_cithem and get them to the log in a consistent, recoverable manner.
1228c2ecf20Sopenharmony_ciDescribing the problems and how they have been solved is the focus of this
1238c2ecf20Sopenharmony_cidocument.
1248c2ecf20Sopenharmony_ci
1258c2ecf20Sopenharmony_ciOne of the key changes that delayed logging makes to the operation of the
1268c2ecf20Sopenharmony_cijournalling subsystem is that it disassociates the amount of outstanding
1278c2ecf20Sopenharmony_cimetadata changes from the size and number of log buffers available. In other
1288c2ecf20Sopenharmony_ciwords, instead of there only being a maximum of 2MB of transaction changes not
1298c2ecf20Sopenharmony_ciwritten to the log at any point in time, there may be a much greater amount
1308c2ecf20Sopenharmony_cibeing accumulated in memory. Hence the potential for loss of metadata on a
1318c2ecf20Sopenharmony_cicrash is much greater than for the existing logging mechanism.
1328c2ecf20Sopenharmony_ci
1338c2ecf20Sopenharmony_ciIt should be noted that this does not change the guarantee that log recovery
1348c2ecf20Sopenharmony_ciwill result in a consistent filesystem. What it does mean is that as far as the
1358c2ecf20Sopenharmony_cirecovered filesystem is concerned, there may be many thousands of transactions
1368c2ecf20Sopenharmony_cithat simply did not occur as a result of the crash. This makes it even more
1378c2ecf20Sopenharmony_ciimportant that applications that care about their data use fsync() where they
1388c2ecf20Sopenharmony_cineed to ensure application level data integrity is maintained.
1398c2ecf20Sopenharmony_ci
1408c2ecf20Sopenharmony_ciIt should be noted that delayed logging is not an innovative new concept that
1418c2ecf20Sopenharmony_ciwarrants rigorous proofs to determine whether it is correct or not. The method
1428c2ecf20Sopenharmony_ciof accumulating changes in memory for some period before writing them to the
1438c2ecf20Sopenharmony_cilog is used effectively in many filesystems including ext3 and ext4. Hence
1448c2ecf20Sopenharmony_cino time is spent in this document trying to convince the reader that the
1458c2ecf20Sopenharmony_ciconcept is sound. Instead it is simply considered a "solved problem" and as
1468c2ecf20Sopenharmony_cisuch implementing it in XFS is purely an exercise in software engineering.
1478c2ecf20Sopenharmony_ci
1488c2ecf20Sopenharmony_ciThe fundamental requirements for delayed logging in XFS are simple:
1498c2ecf20Sopenharmony_ci
1508c2ecf20Sopenharmony_ci	1. Reduce the amount of metadata written to the log by at least
1518c2ecf20Sopenharmony_ci	   an order of magnitude.
1528c2ecf20Sopenharmony_ci	2. Supply sufficient statistics to validate Requirement #1.
1538c2ecf20Sopenharmony_ci	3. Supply sufficient new tracing infrastructure to be able to debug
1548c2ecf20Sopenharmony_ci	   problems with the new code.
1558c2ecf20Sopenharmony_ci	4. No on-disk format change (metadata or log format).
1568c2ecf20Sopenharmony_ci	5. Enable and disable with a mount option.
1578c2ecf20Sopenharmony_ci	6. No performance regressions for synchronous transaction workloads.
1588c2ecf20Sopenharmony_ci
1598c2ecf20Sopenharmony_ciDelayed Logging: Design
1608c2ecf20Sopenharmony_ci=======================
1618c2ecf20Sopenharmony_ci
1628c2ecf20Sopenharmony_ciStoring Changes
1638c2ecf20Sopenharmony_ci---------------
1648c2ecf20Sopenharmony_ci
1658c2ecf20Sopenharmony_ciThe problem with accumulating changes at a logical level (i.e. just using the
1668c2ecf20Sopenharmony_ciexisting log item dirty region tracking) is that when it comes to writing the
1678c2ecf20Sopenharmony_cichanges to the log buffers, we need to ensure that the object we are formatting
1688c2ecf20Sopenharmony_ciis not changing while we do this. This requires locking the object to prevent
1698c2ecf20Sopenharmony_ciconcurrent modification. Hence flushing the logical changes to the log would
1708c2ecf20Sopenharmony_cirequire us to lock every object, format them, and then unlock them again.
1718c2ecf20Sopenharmony_ci
1728c2ecf20Sopenharmony_ciThis introduces lots of scope for deadlocks with transactions that are already
1738c2ecf20Sopenharmony_cirunning. For example, a transaction has object A locked and modified, but needs
1748c2ecf20Sopenharmony_cithe delayed logging tracking lock to commit the transaction. However, the
1758c2ecf20Sopenharmony_ciflushing thread has the delayed logging tracking lock already held, and is
1768c2ecf20Sopenharmony_citrying to get the lock on object A to flush it to the log buffer. This appears
1778c2ecf20Sopenharmony_cito be an unsolvable deadlock condition, and it was solving this problem that
1788c2ecf20Sopenharmony_ciwas the barrier to implementing delayed logging for so long.
1798c2ecf20Sopenharmony_ci
1808c2ecf20Sopenharmony_ciThe solution is relatively simple - it just took a long time to recognise it.
1818c2ecf20Sopenharmony_ciPut simply, the current logging code formats the changes to each item into an
1828c2ecf20Sopenharmony_civector array that points to the changed regions in the item. The log write code
1838c2ecf20Sopenharmony_cisimply copies the memory these vectors point to into the log buffer during
1848c2ecf20Sopenharmony_citransaction commit while the item is locked in the transaction. Instead of
1858c2ecf20Sopenharmony_ciusing the log buffer as the destination of the formatting code, we can use an
1868c2ecf20Sopenharmony_ciallocated memory buffer big enough to fit the formatted vector.
1878c2ecf20Sopenharmony_ci
1888c2ecf20Sopenharmony_ciIf we then copy the vector into the memory buffer and rewrite the vector to
1898c2ecf20Sopenharmony_cipoint to the memory buffer rather than the object itself, we now have a copy of
1908c2ecf20Sopenharmony_cithe changes in a format that is compatible with the log buffer writing code.
1918c2ecf20Sopenharmony_cithat does not require us to lock the item to access. This formatting and
1928c2ecf20Sopenharmony_cirewriting can all be done while the object is locked during transaction commit,
1938c2ecf20Sopenharmony_ciresulting in a vector that is transactionally consistent and can be accessed
1948c2ecf20Sopenharmony_ciwithout needing to lock the owning item.
1958c2ecf20Sopenharmony_ci
1968c2ecf20Sopenharmony_ciHence we avoid the need to lock items when we need to flush outstanding
1978c2ecf20Sopenharmony_ciasynchronous transactions to the log. The differences between the existing
1988c2ecf20Sopenharmony_ciformatting method and the delayed logging formatting can be seen in the
1998c2ecf20Sopenharmony_cidiagram below.
2008c2ecf20Sopenharmony_ci
2018c2ecf20Sopenharmony_ciCurrent format log vector::
2028c2ecf20Sopenharmony_ci
2038c2ecf20Sopenharmony_ci    Object    +---------------------------------------------+
2048c2ecf20Sopenharmony_ci    Vector 1      +----+
2058c2ecf20Sopenharmony_ci    Vector 2                    +----+
2068c2ecf20Sopenharmony_ci    Vector 3                                   +----------+
2078c2ecf20Sopenharmony_ci
2088c2ecf20Sopenharmony_ciAfter formatting::
2098c2ecf20Sopenharmony_ci
2108c2ecf20Sopenharmony_ci    Log Buffer    +-V1-+-V2-+----V3----+
2118c2ecf20Sopenharmony_ci
2128c2ecf20Sopenharmony_ciDelayed logging vector::
2138c2ecf20Sopenharmony_ci
2148c2ecf20Sopenharmony_ci    Object    +---------------------------------------------+
2158c2ecf20Sopenharmony_ci    Vector 1      +----+
2168c2ecf20Sopenharmony_ci    Vector 2                    +----+
2178c2ecf20Sopenharmony_ci    Vector 3                                   +----------+
2188c2ecf20Sopenharmony_ci
2198c2ecf20Sopenharmony_ciAfter formatting::
2208c2ecf20Sopenharmony_ci
2218c2ecf20Sopenharmony_ci    Memory Buffer +-V1-+-V2-+----V3----+
2228c2ecf20Sopenharmony_ci    Vector 1      +----+
2238c2ecf20Sopenharmony_ci    Vector 2           +----+
2248c2ecf20Sopenharmony_ci    Vector 3                +----------+
2258c2ecf20Sopenharmony_ci
2268c2ecf20Sopenharmony_ciThe memory buffer and associated vector need to be passed as a single object,
2278c2ecf20Sopenharmony_cibut still need to be associated with the parent object so if the object is
2288c2ecf20Sopenharmony_cirelogged we can replace the current memory buffer with a new memory buffer that
2298c2ecf20Sopenharmony_cicontains the latest changes.
2308c2ecf20Sopenharmony_ci
2318c2ecf20Sopenharmony_ciThe reason for keeping the vector around after we've formatted the memory
2328c2ecf20Sopenharmony_cibuffer is to support splitting vectors across log buffer boundaries correctly.
2338c2ecf20Sopenharmony_ciIf we don't keep the vector around, we do not know where the region boundaries
2348c2ecf20Sopenharmony_ciare in the item, so we'd need a new encapsulation method for regions in the log
2358c2ecf20Sopenharmony_cibuffer writing (i.e. double encapsulation). This would be an on-disk format
2368c2ecf20Sopenharmony_cichange and as such is not desirable.  It also means we'd have to write the log
2378c2ecf20Sopenharmony_ciregion headers in the formatting stage, which is problematic as there is per
2388c2ecf20Sopenharmony_ciregion state that needs to be placed into the headers during the log write.
2398c2ecf20Sopenharmony_ci
2408c2ecf20Sopenharmony_ciHence we need to keep the vector, but by attaching the memory buffer to it and
2418c2ecf20Sopenharmony_cirewriting the vector addresses to point at the memory buffer we end up with a
2428c2ecf20Sopenharmony_ciself-describing object that can be passed to the log buffer write code to be
2438c2ecf20Sopenharmony_cihandled in exactly the same manner as the existing log vectors are handled.
2448c2ecf20Sopenharmony_ciHence we avoid needing a new on-disk format to handle items that have been
2458c2ecf20Sopenharmony_cirelogged in memory.
2468c2ecf20Sopenharmony_ci
2478c2ecf20Sopenharmony_ci
2488c2ecf20Sopenharmony_ciTracking Changes
2498c2ecf20Sopenharmony_ci----------------
2508c2ecf20Sopenharmony_ci
2518c2ecf20Sopenharmony_ciNow that we can record transactional changes in memory in a form that allows
2528c2ecf20Sopenharmony_cithem to be used without limitations, we need to be able to track and accumulate
2538c2ecf20Sopenharmony_cithem so that they can be written to the log at some later point in time.  The
2548c2ecf20Sopenharmony_cilog item is the natural place to store this vector and buffer, and also makes sense
2558c2ecf20Sopenharmony_cito be the object that is used to track committed objects as it will always
2568c2ecf20Sopenharmony_ciexist once the object has been included in a transaction.
2578c2ecf20Sopenharmony_ci
2588c2ecf20Sopenharmony_ciThe log item is already used to track the log items that have been written to
2598c2ecf20Sopenharmony_cithe log but not yet written to disk. Such log items are considered "active"
2608c2ecf20Sopenharmony_ciand as such are stored in the Active Item List (AIL) which is a LSN-ordered
2618c2ecf20Sopenharmony_cidouble linked list. Items are inserted into this list during log buffer IO
2628c2ecf20Sopenharmony_cicompletion, after which they are unpinned and can be written to disk. An object
2638c2ecf20Sopenharmony_cithat is in the AIL can be relogged, which causes the object to be pinned again
2648c2ecf20Sopenharmony_ciand then moved forward in the AIL when the log buffer IO completes for that
2658c2ecf20Sopenharmony_citransaction.
2668c2ecf20Sopenharmony_ci
2678c2ecf20Sopenharmony_ciEssentially, this shows that an item that is in the AIL can still be modified
2688c2ecf20Sopenharmony_ciand relogged, so any tracking must be separate to the AIL infrastructure. As
2698c2ecf20Sopenharmony_cisuch, we cannot reuse the AIL list pointers for tracking committed items, nor
2708c2ecf20Sopenharmony_cican we store state in any field that is protected by the AIL lock. Hence the
2718c2ecf20Sopenharmony_cicommitted item tracking needs it's own locks, lists and state fields in the log
2728c2ecf20Sopenharmony_ciitem.
2738c2ecf20Sopenharmony_ci
2748c2ecf20Sopenharmony_ciSimilar to the AIL, tracking of committed items is done through a new list
2758c2ecf20Sopenharmony_cicalled the Committed Item List (CIL).  The list tracks log items that have been
2768c2ecf20Sopenharmony_cicommitted and have formatted memory buffers attached to them. It tracks objects
2778c2ecf20Sopenharmony_ciin transaction commit order, so when an object is relogged it is removed from
2788c2ecf20Sopenharmony_ciit's place in the list and re-inserted at the tail. This is entirely arbitrary
2798c2ecf20Sopenharmony_ciand done to make it easy for debugging - the last items in the list are the
2808c2ecf20Sopenharmony_ciones that are most recently modified. Ordering of the CIL is not necessary for
2818c2ecf20Sopenharmony_citransactional integrity (as discussed in the next section) so the ordering is
2828c2ecf20Sopenharmony_cidone for convenience/sanity of the developers.
2838c2ecf20Sopenharmony_ci
2848c2ecf20Sopenharmony_ci
2858c2ecf20Sopenharmony_ciDelayed Logging: Checkpoints
2868c2ecf20Sopenharmony_ci----------------------------
2878c2ecf20Sopenharmony_ci
2888c2ecf20Sopenharmony_ciWhen we have a log synchronisation event, commonly known as a "log force",
2898c2ecf20Sopenharmony_ciall the items in the CIL must be written into the log via the log buffers.
2908c2ecf20Sopenharmony_ciWe need to write these items in the order that they exist in the CIL, and they
2918c2ecf20Sopenharmony_cineed to be written as an atomic transaction. The need for all the objects to be
2928c2ecf20Sopenharmony_ciwritten as an atomic transaction comes from the requirements of relogging and
2938c2ecf20Sopenharmony_cilog replay - all the changes in all the objects in a given transaction must
2948c2ecf20Sopenharmony_cieither be completely replayed during log recovery, or not replayed at all. If
2958c2ecf20Sopenharmony_cia transaction is not replayed because it is not complete in the log, then
2968c2ecf20Sopenharmony_cino later transactions should be replayed, either.
2978c2ecf20Sopenharmony_ci
2988c2ecf20Sopenharmony_ciTo fulfill this requirement, we need to write the entire CIL in a single log
2998c2ecf20Sopenharmony_citransaction. Fortunately, the XFS log code has no fixed limit on the size of a
3008c2ecf20Sopenharmony_citransaction, nor does the log replay code. The only fundamental limit is that
3018c2ecf20Sopenharmony_cithe transaction cannot be larger than just under half the size of the log.  The
3028c2ecf20Sopenharmony_cireason for this limit is that to find the head and tail of the log, there must
3038c2ecf20Sopenharmony_cibe at least one complete transaction in the log at any given time. If a
3048c2ecf20Sopenharmony_citransaction is larger than half the log, then there is the possibility that a
3058c2ecf20Sopenharmony_cicrash during the write of a such a transaction could partially overwrite the
3068c2ecf20Sopenharmony_cionly complete previous transaction in the log. This will result in a recovery
3078c2ecf20Sopenharmony_cifailure and an inconsistent filesystem and hence we must enforce the maximum
3088c2ecf20Sopenharmony_cisize of a checkpoint to be slightly less than a half the log.
3098c2ecf20Sopenharmony_ci
3108c2ecf20Sopenharmony_ciApart from this size requirement, a checkpoint transaction looks no different
3118c2ecf20Sopenharmony_cito any other transaction - it contains a transaction header, a series of
3128c2ecf20Sopenharmony_ciformatted log items and a commit record at the tail. From a recovery
3138c2ecf20Sopenharmony_ciperspective, the checkpoint transaction is also no different - just a lot
3148c2ecf20Sopenharmony_cibigger with a lot more items in it. The worst case effect of this is that we
3158c2ecf20Sopenharmony_cimight need to tune the recovery transaction object hash size.
3168c2ecf20Sopenharmony_ci
3178c2ecf20Sopenharmony_ciBecause the checkpoint is just another transaction and all the changes to log
3188c2ecf20Sopenharmony_ciitems are stored as log vectors, we can use the existing log buffer writing
3198c2ecf20Sopenharmony_cicode to write the changes into the log. To do this efficiently, we need to
3208c2ecf20Sopenharmony_ciminimise the time we hold the CIL locked while writing the checkpoint
3218c2ecf20Sopenharmony_citransaction. The current log write code enables us to do this easily with the
3228c2ecf20Sopenharmony_ciway it separates the writing of the transaction contents (the log vectors) from
3238c2ecf20Sopenharmony_cithe transaction commit record, but tracking this requires us to have a
3248c2ecf20Sopenharmony_ciper-checkpoint context that travels through the log write process through to
3258c2ecf20Sopenharmony_cicheckpoint completion.
3268c2ecf20Sopenharmony_ci
3278c2ecf20Sopenharmony_ciHence a checkpoint has a context that tracks the state of the current
3288c2ecf20Sopenharmony_cicheckpoint from initiation to checkpoint completion. A new context is initiated
3298c2ecf20Sopenharmony_ciat the same time a checkpoint transaction is started. That is, when we remove
3308c2ecf20Sopenharmony_ciall the current items from the CIL during a checkpoint operation, we move all
3318c2ecf20Sopenharmony_cithose changes into the current checkpoint context. We then initialise a new
3328c2ecf20Sopenharmony_cicontext and attach that to the CIL for aggregation of new transactions.
3338c2ecf20Sopenharmony_ci
3348c2ecf20Sopenharmony_ciThis allows us to unlock the CIL immediately after transfer of all the
3358c2ecf20Sopenharmony_cicommitted items and effectively allow new transactions to be issued while we
3368c2ecf20Sopenharmony_ciare formatting the checkpoint into the log. It also allows concurrent
3378c2ecf20Sopenharmony_cicheckpoints to be written into the log buffers in the case of log force heavy
3388c2ecf20Sopenharmony_ciworkloads, just like the existing transaction commit code does. This, however,
3398c2ecf20Sopenharmony_cirequires that we strictly order the commit records in the log so that
3408c2ecf20Sopenharmony_cicheckpoint sequence order is maintained during log replay.
3418c2ecf20Sopenharmony_ci
3428c2ecf20Sopenharmony_ciTo ensure that we can be writing an item into a checkpoint transaction at
3438c2ecf20Sopenharmony_cithe same time another transaction modifies the item and inserts the log item
3448c2ecf20Sopenharmony_ciinto the new CIL, then checkpoint transaction commit code cannot use log items
3458c2ecf20Sopenharmony_cito store the list of log vectors that need to be written into the transaction.
3468c2ecf20Sopenharmony_ciHence log vectors need to be able to be chained together to allow them to be
3478c2ecf20Sopenharmony_cidetached from the log items. That is, when the CIL is flushed the memory
3488c2ecf20Sopenharmony_cibuffer and log vector attached to each log item needs to be attached to the
3498c2ecf20Sopenharmony_cicheckpoint context so that the log item can be released. In diagrammatic form,
3508c2ecf20Sopenharmony_cithe CIL would look like this before the flush::
3518c2ecf20Sopenharmony_ci
3528c2ecf20Sopenharmony_ci	CIL Head
3538c2ecf20Sopenharmony_ci	   |
3548c2ecf20Sopenharmony_ci	   V
3558c2ecf20Sopenharmony_ci	Log Item <-> log vector 1	-> memory buffer
3568c2ecf20Sopenharmony_ci	   |				-> vector array
3578c2ecf20Sopenharmony_ci	   V
3588c2ecf20Sopenharmony_ci	Log Item <-> log vector 2	-> memory buffer
3598c2ecf20Sopenharmony_ci	   |				-> vector array
3608c2ecf20Sopenharmony_ci	   V
3618c2ecf20Sopenharmony_ci	......
3628c2ecf20Sopenharmony_ci	   |
3638c2ecf20Sopenharmony_ci	   V
3648c2ecf20Sopenharmony_ci	Log Item <-> log vector N-1	-> memory buffer
3658c2ecf20Sopenharmony_ci	   |				-> vector array
3668c2ecf20Sopenharmony_ci	   V
3678c2ecf20Sopenharmony_ci	Log Item <-> log vector N	-> memory buffer
3688c2ecf20Sopenharmony_ci					-> vector array
3698c2ecf20Sopenharmony_ci
3708c2ecf20Sopenharmony_ciAnd after the flush the CIL head is empty, and the checkpoint context log
3718c2ecf20Sopenharmony_civector list would look like::
3728c2ecf20Sopenharmony_ci
3738c2ecf20Sopenharmony_ci	Checkpoint Context
3748c2ecf20Sopenharmony_ci	   |
3758c2ecf20Sopenharmony_ci	   V
3768c2ecf20Sopenharmony_ci	log vector 1	-> memory buffer
3778c2ecf20Sopenharmony_ci	   |		-> vector array
3788c2ecf20Sopenharmony_ci	   |		-> Log Item
3798c2ecf20Sopenharmony_ci	   V
3808c2ecf20Sopenharmony_ci	log vector 2	-> memory buffer
3818c2ecf20Sopenharmony_ci	   |		-> vector array
3828c2ecf20Sopenharmony_ci	   |		-> Log Item
3838c2ecf20Sopenharmony_ci	   V
3848c2ecf20Sopenharmony_ci	......
3858c2ecf20Sopenharmony_ci	   |
3868c2ecf20Sopenharmony_ci	   V
3878c2ecf20Sopenharmony_ci	log vector N-1	-> memory buffer
3888c2ecf20Sopenharmony_ci	   |		-> vector array
3898c2ecf20Sopenharmony_ci	   |		-> Log Item
3908c2ecf20Sopenharmony_ci	   V
3918c2ecf20Sopenharmony_ci	log vector N	-> memory buffer
3928c2ecf20Sopenharmony_ci			-> vector array
3938c2ecf20Sopenharmony_ci			-> Log Item
3948c2ecf20Sopenharmony_ci
3958c2ecf20Sopenharmony_ciOnce this transfer is done, the CIL can be unlocked and new transactions can
3968c2ecf20Sopenharmony_cistart, while the checkpoint flush code works over the log vector chain to
3978c2ecf20Sopenharmony_cicommit the checkpoint.
3988c2ecf20Sopenharmony_ci
3998c2ecf20Sopenharmony_ciOnce the checkpoint is written into the log buffers, the checkpoint context is
4008c2ecf20Sopenharmony_ciattached to the log buffer that the commit record was written to along with a
4018c2ecf20Sopenharmony_cicompletion callback. Log IO completion will call that callback, which can then
4028c2ecf20Sopenharmony_cirun transaction committed processing for the log items (i.e. insert into AIL
4038c2ecf20Sopenharmony_ciand unpin) in the log vector chain and then free the log vector chain and
4048c2ecf20Sopenharmony_cicheckpoint context.
4058c2ecf20Sopenharmony_ci
4068c2ecf20Sopenharmony_ciDiscussion Point: I am uncertain as to whether the log item is the most
4078c2ecf20Sopenharmony_ciefficient way to track vectors, even though it seems like the natural way to do
4088c2ecf20Sopenharmony_ciit. The fact that we walk the log items (in the CIL) just to chain the log
4098c2ecf20Sopenharmony_civectors and break the link between the log item and the log vector means that
4108c2ecf20Sopenharmony_ciwe take a cache line hit for the log item list modification, then another for
4118c2ecf20Sopenharmony_cithe log vector chaining. If we track by the log vectors, then we only need to
4128c2ecf20Sopenharmony_cibreak the link between the log item and the log vector, which means we should
4138c2ecf20Sopenharmony_cidirty only the log item cachelines. Normally I wouldn't be concerned about one
4148c2ecf20Sopenharmony_civs two dirty cachelines except for the fact I've seen upwards of 80,000 log
4158c2ecf20Sopenharmony_civectors in one checkpoint transaction. I'd guess this is a "measure and
4168c2ecf20Sopenharmony_cicompare" situation that can be done after a working and reviewed implementation
4178c2ecf20Sopenharmony_ciis in the dev tree....
4188c2ecf20Sopenharmony_ci
4198c2ecf20Sopenharmony_ciDelayed Logging: Checkpoint Sequencing
4208c2ecf20Sopenharmony_ci--------------------------------------
4218c2ecf20Sopenharmony_ci
4228c2ecf20Sopenharmony_ciOne of the key aspects of the XFS transaction subsystem is that it tags
4238c2ecf20Sopenharmony_cicommitted transactions with the log sequence number of the transaction commit.
4248c2ecf20Sopenharmony_ciThis allows transactions to be issued asynchronously even though there may be
4258c2ecf20Sopenharmony_cifuture operations that cannot be completed until that transaction is fully
4268c2ecf20Sopenharmony_cicommitted to the log. In the rare case that a dependent operation occurs (e.g.
4278c2ecf20Sopenharmony_cire-using a freed metadata extent for a data extent), a special, optimised log
4288c2ecf20Sopenharmony_ciforce can be issued to force the dependent transaction to disk immediately.
4298c2ecf20Sopenharmony_ci
4308c2ecf20Sopenharmony_ciTo do this, transactions need to record the LSN of the commit record of the
4318c2ecf20Sopenharmony_citransaction. This LSN comes directly from the log buffer the transaction is
4328c2ecf20Sopenharmony_ciwritten into. While this works just fine for the existing transaction
4338c2ecf20Sopenharmony_cimechanism, it does not work for delayed logging because transactions are not
4348c2ecf20Sopenharmony_ciwritten directly into the log buffers. Hence some other method of sequencing
4358c2ecf20Sopenharmony_citransactions is required.
4368c2ecf20Sopenharmony_ci
4378c2ecf20Sopenharmony_ciAs discussed in the checkpoint section, delayed logging uses per-checkpoint
4388c2ecf20Sopenharmony_cicontexts, and as such it is simple to assign a sequence number to each
4398c2ecf20Sopenharmony_cicheckpoint. Because the switching of checkpoint contexts must be done
4408c2ecf20Sopenharmony_ciatomically, it is simple to ensure that each new context has a monotonically
4418c2ecf20Sopenharmony_ciincreasing sequence number assigned to it without the need for an external
4428c2ecf20Sopenharmony_ciatomic counter - we can just take the current context sequence number and add
4438c2ecf20Sopenharmony_cione to it for the new context.
4448c2ecf20Sopenharmony_ci
4458c2ecf20Sopenharmony_ciThen, instead of assigning a log buffer LSN to the transaction commit LSN
4468c2ecf20Sopenharmony_ciduring the commit, we can assign the current checkpoint sequence. This allows
4478c2ecf20Sopenharmony_cioperations that track transactions that have not yet completed know what
4488c2ecf20Sopenharmony_cicheckpoint sequence needs to be committed before they can continue. As a
4498c2ecf20Sopenharmony_ciresult, the code that forces the log to a specific LSN now needs to ensure that
4508c2ecf20Sopenharmony_cithe log forces to a specific checkpoint.
4518c2ecf20Sopenharmony_ci
4528c2ecf20Sopenharmony_ciTo ensure that we can do this, we need to track all the checkpoint contexts
4538c2ecf20Sopenharmony_cithat are currently committing to the log. When we flush a checkpoint, the
4548c2ecf20Sopenharmony_cicontext gets added to a "committing" list which can be searched. When a
4558c2ecf20Sopenharmony_cicheckpoint commit completes, it is removed from the committing list. Because
4568c2ecf20Sopenharmony_cithe checkpoint context records the LSN of the commit record for the checkpoint,
4578c2ecf20Sopenharmony_ciwe can also wait on the log buffer that contains the commit record, thereby
4588c2ecf20Sopenharmony_ciusing the existing log force mechanisms to execute synchronous forces.
4598c2ecf20Sopenharmony_ci
4608c2ecf20Sopenharmony_ciIt should be noted that the synchronous forces may need to be extended with
4618c2ecf20Sopenharmony_cimitigation algorithms similar to the current log buffer code to allow
4628c2ecf20Sopenharmony_ciaggregation of multiple synchronous transactions if there are already
4638c2ecf20Sopenharmony_cisynchronous transactions being flushed. Investigation of the performance of the
4648c2ecf20Sopenharmony_cicurrent design is needed before making any decisions here.
4658c2ecf20Sopenharmony_ci
4668c2ecf20Sopenharmony_ciThe main concern with log forces is to ensure that all the previous checkpoints
4678c2ecf20Sopenharmony_ciare also committed to disk before the one we need to wait for. Therefore we
4688c2ecf20Sopenharmony_cineed to check that all the prior contexts in the committing list are also
4698c2ecf20Sopenharmony_cicomplete before waiting on the one we need to complete. We do this
4708c2ecf20Sopenharmony_cisynchronisation in the log force code so that we don't need to wait anywhere
4718c2ecf20Sopenharmony_cielse for such serialisation - it only matters when we do a log force.
4728c2ecf20Sopenharmony_ci
4738c2ecf20Sopenharmony_ciThe only remaining complexity is that a log force now also has to handle the
4748c2ecf20Sopenharmony_cicase where the forcing sequence number is the same as the current context. That
4758c2ecf20Sopenharmony_ciis, we need to flush the CIL and potentially wait for it to complete. This is a
4768c2ecf20Sopenharmony_cisimple addition to the existing log forcing code to check the sequence numbers
4778c2ecf20Sopenharmony_ciand push if required. Indeed, placing the current sequence checkpoint flush in
4788c2ecf20Sopenharmony_cithe log force code enables the current mechanism for issuing synchronous
4798c2ecf20Sopenharmony_citransactions to remain untouched (i.e. commit an asynchronous transaction, then
4808c2ecf20Sopenharmony_ciforce the log at the LSN of that transaction) and so the higher level code
4818c2ecf20Sopenharmony_cibehaves the same regardless of whether delayed logging is being used or not.
4828c2ecf20Sopenharmony_ci
4838c2ecf20Sopenharmony_ciDelayed Logging: Checkpoint Log Space Accounting
4848c2ecf20Sopenharmony_ci------------------------------------------------
4858c2ecf20Sopenharmony_ci
4868c2ecf20Sopenharmony_ciThe big issue for a checkpoint transaction is the log space reservation for the
4878c2ecf20Sopenharmony_citransaction. We don't know how big a checkpoint transaction is going to be
4888c2ecf20Sopenharmony_ciahead of time, nor how many log buffers it will take to write out, nor the
4898c2ecf20Sopenharmony_cinumber of split log vector regions are going to be used. We can track the
4908c2ecf20Sopenharmony_ciamount of log space required as we add items to the commit item list, but we
4918c2ecf20Sopenharmony_cistill need to reserve the space in the log for the checkpoint.
4928c2ecf20Sopenharmony_ci
4938c2ecf20Sopenharmony_ciA typical transaction reserves enough space in the log for the worst case space
4948c2ecf20Sopenharmony_ciusage of the transaction. The reservation accounts for log record headers,
4958c2ecf20Sopenharmony_citransaction and region headers, headers for split regions, buffer tail padding,
4968c2ecf20Sopenharmony_cietc. as well as the actual space for all the changed metadata in the
4978c2ecf20Sopenharmony_citransaction. While some of this is fixed overhead, much of it is dependent on
4988c2ecf20Sopenharmony_cithe size of the transaction and the number of regions being logged (the number
4998c2ecf20Sopenharmony_ciof log vectors in the transaction).
5008c2ecf20Sopenharmony_ci
5018c2ecf20Sopenharmony_ciAn example of the differences would be logging directory changes versus logging
5028c2ecf20Sopenharmony_ciinode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
5038c2ecf20Sopenharmony_cithere are lots of transactions that only contain an inode core and an inode log
5048c2ecf20Sopenharmony_ciformat structure. That is, two vectors totaling roughly 150 bytes. If we modify
5058c2ecf20Sopenharmony_ci10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
5068c2ecf20Sopenharmony_civector is 12 bytes, so the total to be logged is approximately 1.75MB. In
5078c2ecf20Sopenharmony_cicomparison, if we are logging full directory buffers, they are typically 4KB
5088c2ecf20Sopenharmony_cieach, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
5098c2ecf20Sopenharmony_cibuffer format structure for each buffer - roughly 800 vectors or 1.51MB total
5108c2ecf20Sopenharmony_cispace.  From this, it should be obvious that a static log space reservation is
5118c2ecf20Sopenharmony_cinot particularly flexible and is difficult to select the "optimal value" for
5128c2ecf20Sopenharmony_ciall workloads.
5138c2ecf20Sopenharmony_ci
5148c2ecf20Sopenharmony_ciFurther, if we are going to use a static reservation, which bit of the entire
5158c2ecf20Sopenharmony_cireservation does it cover? We account for space used by the transaction
5168c2ecf20Sopenharmony_cireservation by tracking the space currently used by the object in the CIL and
5178c2ecf20Sopenharmony_cithen calculating the increase or decrease in space used as the object is
5188c2ecf20Sopenharmony_cirelogged. This allows for a checkpoint reservation to only have to account for
5198c2ecf20Sopenharmony_cilog buffer metadata used such as log header records.
5208c2ecf20Sopenharmony_ci
5218c2ecf20Sopenharmony_ciHowever, even using a static reservation for just the log metadata is
5228c2ecf20Sopenharmony_ciproblematic. Typically log record headers use at least 16KB of log space per
5238c2ecf20Sopenharmony_ci1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
5248c2ecf20Sopenharmony_cilarge enough to handle arbitrary sized checkpoint transactions. This
5258c2ecf20Sopenharmony_cireservation needs to be made before the checkpoint is started, and we need to
5268c2ecf20Sopenharmony_cibe able to reserve the space without sleeping.  For a 8MB checkpoint, we need a
5278c2ecf20Sopenharmony_cireservation of around 150KB, which is a non-trivial amount of space.
5288c2ecf20Sopenharmony_ci
5298c2ecf20Sopenharmony_ciA static reservation needs to manipulate the log grant counters - we can take a
5308c2ecf20Sopenharmony_cipermanent reservation on the space, but we still need to make sure we refresh
5318c2ecf20Sopenharmony_cithe write reservation (the actual space available to the transaction) after
5328c2ecf20Sopenharmony_cievery checkpoint transaction completion. Unfortunately, if this space is not
5338c2ecf20Sopenharmony_ciavailable when required, then the regrant code will sleep waiting for it.
5348c2ecf20Sopenharmony_ci
5358c2ecf20Sopenharmony_ciThe problem with this is that it can lead to deadlocks as we may need to commit
5368c2ecf20Sopenharmony_cicheckpoints to be able to free up log space (refer back to the description of
5378c2ecf20Sopenharmony_cirolling transactions for an example of this).  Hence we *must* always have
5388c2ecf20Sopenharmony_cispace available in the log if we are to use static reservations, and that is
5398c2ecf20Sopenharmony_civery difficult and complex to arrange. It is possible to do, but there is a
5408c2ecf20Sopenharmony_cisimpler way.
5418c2ecf20Sopenharmony_ci
5428c2ecf20Sopenharmony_ciThe simpler way of doing this is tracking the entire log space used by the
5438c2ecf20Sopenharmony_ciitems in the CIL and using this to dynamically calculate the amount of log
5448c2ecf20Sopenharmony_cispace required by the log metadata. If this log metadata space changes as a
5458c2ecf20Sopenharmony_ciresult of a transaction commit inserting a new memory buffer into the CIL, then
5468c2ecf20Sopenharmony_cithe difference in space required is removed from the transaction that causes
5478c2ecf20Sopenharmony_cithe change. Transactions at this level will *always* have enough space
5488c2ecf20Sopenharmony_ciavailable in their reservation for this as they have already reserved the
5498c2ecf20Sopenharmony_cimaximal amount of log metadata space they require, and such a delta reservation
5508c2ecf20Sopenharmony_ciwill always be less than or equal to the maximal amount in the reservation.
5518c2ecf20Sopenharmony_ci
5528c2ecf20Sopenharmony_ciHence we can grow the checkpoint transaction reservation dynamically as items
5538c2ecf20Sopenharmony_ciare added to the CIL and avoid the need for reserving and regranting log space
5548c2ecf20Sopenharmony_ciup front. This avoids deadlocks and removes a blocking point from the
5558c2ecf20Sopenharmony_cicheckpoint flush code.
5568c2ecf20Sopenharmony_ci
5578c2ecf20Sopenharmony_ciAs mentioned early, transactions can't grow to more than half the size of the
5588c2ecf20Sopenharmony_cilog. Hence as part of the reservation growing, we need to also check the size
5598c2ecf20Sopenharmony_ciof the reservation against the maximum allowed transaction size. If we reach
5608c2ecf20Sopenharmony_cithe maximum threshold, we need to push the CIL to the log. This is effectively
5618c2ecf20Sopenharmony_cia "background flush" and is done on demand. This is identical to
5628c2ecf20Sopenharmony_cia CIL push triggered by a log force, only that there is no waiting for the
5638c2ecf20Sopenharmony_cicheckpoint commit to complete. This background push is checked and executed by
5648c2ecf20Sopenharmony_citransaction commit code.
5658c2ecf20Sopenharmony_ci
5668c2ecf20Sopenharmony_ciIf the transaction subsystem goes idle while we still have items in the CIL,
5678c2ecf20Sopenharmony_cithey will be flushed by the periodic log force issued by the xfssyncd. This log
5688c2ecf20Sopenharmony_ciforce will push the CIL to disk, and if the transaction subsystem stays idle,
5698c2ecf20Sopenharmony_ciallow the idle log to be covered (effectively marked clean) in exactly the same
5708c2ecf20Sopenharmony_cimanner that is done for the existing logging method. A discussion point is
5718c2ecf20Sopenharmony_ciwhether this log force needs to be done more frequently than the current rate
5728c2ecf20Sopenharmony_ciwhich is once every 30s.
5738c2ecf20Sopenharmony_ci
5748c2ecf20Sopenharmony_ci
5758c2ecf20Sopenharmony_ciDelayed Logging: Log Item Pinning
5768c2ecf20Sopenharmony_ci---------------------------------
5778c2ecf20Sopenharmony_ci
5788c2ecf20Sopenharmony_ciCurrently log items are pinned during transaction commit while the items are
5798c2ecf20Sopenharmony_cistill locked. This happens just after the items are formatted, though it could
5808c2ecf20Sopenharmony_cibe done any time before the items are unlocked. The result of this mechanism is
5818c2ecf20Sopenharmony_cithat items get pinned once for every transaction that is committed to the log
5828c2ecf20Sopenharmony_cibuffers. Hence items that are relogged in the log buffers will have a pin count
5838c2ecf20Sopenharmony_cifor every outstanding transaction they were dirtied in. When each of these
5848c2ecf20Sopenharmony_citransactions is completed, they will unpin the item once. As a result, the item
5858c2ecf20Sopenharmony_cionly becomes unpinned when all the transactions complete and there are no
5868c2ecf20Sopenharmony_cipending transactions. Thus the pinning and unpinning of a log item is symmetric
5878c2ecf20Sopenharmony_cias there is a 1:1 relationship with transaction commit and log item completion.
5888c2ecf20Sopenharmony_ci
5898c2ecf20Sopenharmony_ciFor delayed logging, however, we have an asymmetric transaction commit to
5908c2ecf20Sopenharmony_cicompletion relationship. Every time an object is relogged in the CIL it goes
5918c2ecf20Sopenharmony_cithrough the commit process without a corresponding completion being registered.
5928c2ecf20Sopenharmony_ciThat is, we now have a many-to-one relationship between transaction commit and
5938c2ecf20Sopenharmony_cilog item completion. The result of this is that pinning and unpinning of the
5948c2ecf20Sopenharmony_cilog items becomes unbalanced if we retain the "pin on transaction commit, unpin
5958c2ecf20Sopenharmony_cion transaction completion" model.
5968c2ecf20Sopenharmony_ci
5978c2ecf20Sopenharmony_ciTo keep pin/unpin symmetry, the algorithm needs to change to a "pin on
5988c2ecf20Sopenharmony_ciinsertion into the CIL, unpin on checkpoint completion". In other words, the
5998c2ecf20Sopenharmony_cipinning and unpinning becomes symmetric around a checkpoint context. We have to
6008c2ecf20Sopenharmony_cipin the object the first time it is inserted into the CIL - if it is already in
6018c2ecf20Sopenharmony_cithe CIL during a transaction commit, then we do not pin it again. Because there
6028c2ecf20Sopenharmony_cican be multiple outstanding checkpoint contexts, we can still see elevated pin
6038c2ecf20Sopenharmony_cicounts, but as each checkpoint completes the pin count will retain the correct
6048c2ecf20Sopenharmony_civalue according to it's context.
6058c2ecf20Sopenharmony_ci
6068c2ecf20Sopenharmony_ciJust to make matters more slightly more complex, this checkpoint level context
6078c2ecf20Sopenharmony_cifor the pin count means that the pinning of an item must take place under the
6088c2ecf20Sopenharmony_ciCIL commit/flush lock. If we pin the object outside this lock, we cannot
6098c2ecf20Sopenharmony_ciguarantee which context the pin count is associated with. This is because of
6108c2ecf20Sopenharmony_cithe fact pinning the item is dependent on whether the item is present in the
6118c2ecf20Sopenharmony_cicurrent CIL or not. If we don't pin the CIL first before we check and pin the
6128c2ecf20Sopenharmony_ciobject, we have a race with CIL being flushed between the check and the pin
6138c2ecf20Sopenharmony_ci(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
6148c2ecf20Sopenharmony_cilock to guarantee that we pin the items correctly.
6158c2ecf20Sopenharmony_ci
6168c2ecf20Sopenharmony_ciDelayed Logging: Concurrent Scalability
6178c2ecf20Sopenharmony_ci---------------------------------------
6188c2ecf20Sopenharmony_ci
6198c2ecf20Sopenharmony_ciA fundamental requirement for the CIL is that accesses through transaction
6208c2ecf20Sopenharmony_cicommits must scale to many concurrent commits. The current transaction commit
6218c2ecf20Sopenharmony_cicode does not break down even when there are transactions coming from 2048
6228c2ecf20Sopenharmony_ciprocessors at once. The current transaction code does not go any faster than if
6238c2ecf20Sopenharmony_cithere was only one CPU using it, but it does not slow down either.
6248c2ecf20Sopenharmony_ci
6258c2ecf20Sopenharmony_ciAs a result, the delayed logging transaction commit code needs to be designed
6268c2ecf20Sopenharmony_cifor concurrency from the ground up. It is obvious that there are serialisation
6278c2ecf20Sopenharmony_cipoints in the design - the three important ones are:
6288c2ecf20Sopenharmony_ci
6298c2ecf20Sopenharmony_ci	1. Locking out new transaction commits while flushing the CIL
6308c2ecf20Sopenharmony_ci	2. Adding items to the CIL and updating item space accounting
6318c2ecf20Sopenharmony_ci	3. Checkpoint commit ordering
6328c2ecf20Sopenharmony_ci
6338c2ecf20Sopenharmony_ciLooking at the transaction commit and CIL flushing interactions, it is clear
6348c2ecf20Sopenharmony_cithat we have a many-to-one interaction here. That is, the only restriction on
6358c2ecf20Sopenharmony_cithe number of concurrent transactions that can be trying to commit at once is
6368c2ecf20Sopenharmony_cithe amount of space available in the log for their reservations. The practical
6378c2ecf20Sopenharmony_cilimit here is in the order of several hundred concurrent transactions for a
6388c2ecf20Sopenharmony_ci128MB log, which means that it is generally one per CPU in a machine.
6398c2ecf20Sopenharmony_ci
6408c2ecf20Sopenharmony_ciThe amount of time a transaction commit needs to hold out a flush is a
6418c2ecf20Sopenharmony_cirelatively long period of time - the pinning of log items needs to be done
6428c2ecf20Sopenharmony_ciwhile we are holding out a CIL flush, so at the moment that means it is held
6438c2ecf20Sopenharmony_ciacross the formatting of the objects into memory buffers (i.e. while memcpy()s
6448c2ecf20Sopenharmony_ciare in progress). Ultimately a two pass algorithm where the formatting is done
6458c2ecf20Sopenharmony_ciseparately to the pinning of objects could be used to reduce the hold time of
6468c2ecf20Sopenharmony_cithe transaction commit side.
6478c2ecf20Sopenharmony_ci
6488c2ecf20Sopenharmony_ciBecause of the number of potential transaction commit side holders, the lock
6498c2ecf20Sopenharmony_cireally needs to be a sleeping lock - if the CIL flush takes the lock, we do not
6508c2ecf20Sopenharmony_ciwant every other CPU in the machine spinning on the CIL lock. Given that
6518c2ecf20Sopenharmony_ciflushing the CIL could involve walking a list of tens of thousands of log
6528c2ecf20Sopenharmony_ciitems, it will get held for a significant time and so spin contention is a
6538c2ecf20Sopenharmony_cisignificant concern. Preventing lots of CPUs spinning doing nothing is the
6548c2ecf20Sopenharmony_cimain reason for choosing a sleeping lock even though nothing in either the
6558c2ecf20Sopenharmony_citransaction commit or CIL flush side sleeps with the lock held.
6568c2ecf20Sopenharmony_ci
6578c2ecf20Sopenharmony_ciIt should also be noted that CIL flushing is also a relatively rare operation
6588c2ecf20Sopenharmony_cicompared to transaction commit for asynchronous transaction workloads - only
6598c2ecf20Sopenharmony_citime will tell if using a read-write semaphore for exclusion will limit
6608c2ecf20Sopenharmony_citransaction commit concurrency due to cache line bouncing of the lock on the
6618c2ecf20Sopenharmony_ciread side.
6628c2ecf20Sopenharmony_ci
6638c2ecf20Sopenharmony_ciThe second serialisation point is on the transaction commit side where items
6648c2ecf20Sopenharmony_ciare inserted into the CIL. Because transactions can enter this code
6658c2ecf20Sopenharmony_ciconcurrently, the CIL needs to be protected separately from the above
6668c2ecf20Sopenharmony_cicommit/flush exclusion. It also needs to be an exclusive lock but it is only
6678c2ecf20Sopenharmony_ciheld for a very short time and so a spin lock is appropriate here. It is
6688c2ecf20Sopenharmony_cipossible that this lock will become a contention point, but given the short
6698c2ecf20Sopenharmony_cihold time once per transaction I think that contention is unlikely.
6708c2ecf20Sopenharmony_ci
6718c2ecf20Sopenharmony_ciThe final serialisation point is the checkpoint commit record ordering code
6728c2ecf20Sopenharmony_cithat is run as part of the checkpoint commit and log force sequencing. The code
6738c2ecf20Sopenharmony_cipath that triggers a CIL flush (i.e. whatever triggers the log force) will enter
6748c2ecf20Sopenharmony_cian ordering loop after writing all the log vectors into the log buffers but
6758c2ecf20Sopenharmony_cibefore writing the commit record. This loop walks the list of committing
6768c2ecf20Sopenharmony_cicheckpoints and needs to block waiting for checkpoints to complete their commit
6778c2ecf20Sopenharmony_cirecord write. As a result it needs a lock and a wait variable. Log force
6788c2ecf20Sopenharmony_cisequencing also requires the same lock, list walk, and blocking mechanism to
6798c2ecf20Sopenharmony_ciensure completion of checkpoints.
6808c2ecf20Sopenharmony_ci
6818c2ecf20Sopenharmony_ciThese two sequencing operations can use the mechanism even though the
6828c2ecf20Sopenharmony_cievents they are waiting for are different. The checkpoint commit record
6838c2ecf20Sopenharmony_cisequencing needs to wait until checkpoint contexts contain a commit LSN
6848c2ecf20Sopenharmony_ci(obtained through completion of a commit record write) while log force
6858c2ecf20Sopenharmony_cisequencing needs to wait until previous checkpoint contexts are removed from
6868c2ecf20Sopenharmony_cithe committing list (i.e. they've completed). A simple wait variable and
6878c2ecf20Sopenharmony_cibroadcast wakeups (thundering herds) has been used to implement these two
6888c2ecf20Sopenharmony_ciserialisation queues. They use the same lock as the CIL, too. If we see too
6898c2ecf20Sopenharmony_cimuch contention on the CIL lock, or too many context switches as a result of
6908c2ecf20Sopenharmony_cithe broadcast wakeups these operations can be put under a new spinlock and
6918c2ecf20Sopenharmony_cigiven separate wait lists to reduce lock contention and the number of processes
6928c2ecf20Sopenharmony_ciwoken by the wrong event.
6938c2ecf20Sopenharmony_ci
6948c2ecf20Sopenharmony_ci
6958c2ecf20Sopenharmony_ciLifecycle Changes
6968c2ecf20Sopenharmony_ci-----------------
6978c2ecf20Sopenharmony_ci
6988c2ecf20Sopenharmony_ciThe existing log item life cycle is as follows::
6998c2ecf20Sopenharmony_ci
7008c2ecf20Sopenharmony_ci	1. Transaction allocate
7018c2ecf20Sopenharmony_ci	2. Transaction reserve
7028c2ecf20Sopenharmony_ci	3. Lock item
7038c2ecf20Sopenharmony_ci	4. Join item to transaction
7048c2ecf20Sopenharmony_ci		If not already attached,
7058c2ecf20Sopenharmony_ci			Allocate log item
7068c2ecf20Sopenharmony_ci			Attach log item to owner item
7078c2ecf20Sopenharmony_ci		Attach log item to transaction
7088c2ecf20Sopenharmony_ci	5. Modify item
7098c2ecf20Sopenharmony_ci		Record modifications in log item
7108c2ecf20Sopenharmony_ci	6. Transaction commit
7118c2ecf20Sopenharmony_ci		Pin item in memory
7128c2ecf20Sopenharmony_ci		Format item into log buffer
7138c2ecf20Sopenharmony_ci		Write commit LSN into transaction
7148c2ecf20Sopenharmony_ci		Unlock item
7158c2ecf20Sopenharmony_ci		Attach transaction to log buffer
7168c2ecf20Sopenharmony_ci
7178c2ecf20Sopenharmony_ci	<log buffer IO dispatched>
7188c2ecf20Sopenharmony_ci	<log buffer IO completes>
7198c2ecf20Sopenharmony_ci
7208c2ecf20Sopenharmony_ci	7. Transaction completion
7218c2ecf20Sopenharmony_ci		Mark log item committed
7228c2ecf20Sopenharmony_ci		Insert log item into AIL
7238c2ecf20Sopenharmony_ci			Write commit LSN into log item
7248c2ecf20Sopenharmony_ci		Unpin log item
7258c2ecf20Sopenharmony_ci	8. AIL traversal
7268c2ecf20Sopenharmony_ci		Lock item
7278c2ecf20Sopenharmony_ci		Mark log item clean
7288c2ecf20Sopenharmony_ci		Flush item to disk
7298c2ecf20Sopenharmony_ci
7308c2ecf20Sopenharmony_ci	<item IO completion>
7318c2ecf20Sopenharmony_ci
7328c2ecf20Sopenharmony_ci	9. Log item removed from AIL
7338c2ecf20Sopenharmony_ci		Moves log tail
7348c2ecf20Sopenharmony_ci		Item unlocked
7358c2ecf20Sopenharmony_ci
7368c2ecf20Sopenharmony_ciEssentially, steps 1-6 operate independently from step 7, which is also
7378c2ecf20Sopenharmony_ciindependent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
7388c2ecf20Sopenharmony_ciat the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
7398c2ecf20Sopenharmony_ciat the same time. If the log item is in the AIL or between steps 6 and 7
7408c2ecf20Sopenharmony_ciand steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
7418c2ecf20Sopenharmony_ciare entered and completed is the object considered clean.
7428c2ecf20Sopenharmony_ci
7438c2ecf20Sopenharmony_ciWith delayed logging, there are new steps inserted into the life cycle::
7448c2ecf20Sopenharmony_ci
7458c2ecf20Sopenharmony_ci	1. Transaction allocate
7468c2ecf20Sopenharmony_ci	2. Transaction reserve
7478c2ecf20Sopenharmony_ci	3. Lock item
7488c2ecf20Sopenharmony_ci	4. Join item to transaction
7498c2ecf20Sopenharmony_ci		If not already attached,
7508c2ecf20Sopenharmony_ci			Allocate log item
7518c2ecf20Sopenharmony_ci			Attach log item to owner item
7528c2ecf20Sopenharmony_ci		Attach log item to transaction
7538c2ecf20Sopenharmony_ci	5. Modify item
7548c2ecf20Sopenharmony_ci		Record modifications in log item
7558c2ecf20Sopenharmony_ci	6. Transaction commit
7568c2ecf20Sopenharmony_ci		Pin item in memory if not pinned in CIL
7578c2ecf20Sopenharmony_ci		Format item into log vector + buffer
7588c2ecf20Sopenharmony_ci		Attach log vector and buffer to log item
7598c2ecf20Sopenharmony_ci		Insert log item into CIL
7608c2ecf20Sopenharmony_ci		Write CIL context sequence into transaction
7618c2ecf20Sopenharmony_ci		Unlock item
7628c2ecf20Sopenharmony_ci
7638c2ecf20Sopenharmony_ci	<next log force>
7648c2ecf20Sopenharmony_ci
7658c2ecf20Sopenharmony_ci	7. CIL push
7668c2ecf20Sopenharmony_ci		lock CIL flush
7678c2ecf20Sopenharmony_ci		Chain log vectors and buffers together
7688c2ecf20Sopenharmony_ci		Remove items from CIL
7698c2ecf20Sopenharmony_ci		unlock CIL flush
7708c2ecf20Sopenharmony_ci		write log vectors into log
7718c2ecf20Sopenharmony_ci		sequence commit records
7728c2ecf20Sopenharmony_ci		attach checkpoint context to log buffer
7738c2ecf20Sopenharmony_ci
7748c2ecf20Sopenharmony_ci	<log buffer IO dispatched>
7758c2ecf20Sopenharmony_ci	<log buffer IO completes>
7768c2ecf20Sopenharmony_ci
7778c2ecf20Sopenharmony_ci	8. Checkpoint completion
7788c2ecf20Sopenharmony_ci		Mark log item committed
7798c2ecf20Sopenharmony_ci		Insert item into AIL
7808c2ecf20Sopenharmony_ci			Write commit LSN into log item
7818c2ecf20Sopenharmony_ci		Unpin log item
7828c2ecf20Sopenharmony_ci	9. AIL traversal
7838c2ecf20Sopenharmony_ci		Lock item
7848c2ecf20Sopenharmony_ci		Mark log item clean
7858c2ecf20Sopenharmony_ci		Flush item to disk
7868c2ecf20Sopenharmony_ci	<item IO completion>
7878c2ecf20Sopenharmony_ci	10. Log item removed from AIL
7888c2ecf20Sopenharmony_ci		Moves log tail
7898c2ecf20Sopenharmony_ci		Item unlocked
7908c2ecf20Sopenharmony_ci
7918c2ecf20Sopenharmony_ciFrom this, it can be seen that the only life cycle differences between the two
7928c2ecf20Sopenharmony_cilogging methods are in the middle of the life cycle - they still have the same
7938c2ecf20Sopenharmony_cibeginning and end and execution constraints. The only differences are in the
7948c2ecf20Sopenharmony_cicommitting of the log items to the log itself and the completion processing.
7958c2ecf20Sopenharmony_ciHence delayed logging should not introduce any constraints on log item
7968c2ecf20Sopenharmony_cibehaviour, allocation or freeing that don't already exist.
7978c2ecf20Sopenharmony_ci
7988c2ecf20Sopenharmony_ciAs a result of this zero-impact "insertion" of delayed logging infrastructure
7998c2ecf20Sopenharmony_ciand the design of the internal structures to avoid on disk format changes, we
8008c2ecf20Sopenharmony_cican basically switch between delayed logging and the existing mechanism with a
8018c2ecf20Sopenharmony_cimount option. Fundamentally, there is no reason why the log manager would not
8028c2ecf20Sopenharmony_cibe able to swap methods automatically and transparently depending on load
8038c2ecf20Sopenharmony_cicharacteristics, but this should not be necessary if delayed logging works as
8048c2ecf20Sopenharmony_cidesigned.
805