xfs-delayed-logging-design.rst - OpenGrok cross reference for /Documentation/filesystems/xfs/xfs-delayed-logging-design.rst

Lines Matching +full:compare +full:- +full:and +full:- +full:swap
1 .. SPDX-License-Identifier: GPL-2.0
10 This document describes the design and algorithms that the XFS journalling
11 subsystem is based on. This document describes the design and algorithms that
16 transaction reservations are structured and accounted, and then move into how we
27 are atomic and recoverable. For reasons of space and time efficiency, the
28 logging mechanisms are varied and complex, combining intents, logical and
32 Some objects, such as inodes and dquots, are logged in logical format where the
33 details logged are made up of the changes to in-core structures rather than
34 on-disk structures. Other objects - typically buffers - have their physical
36 chained together by intents, ensuring that journal recovery can restart and
40 The reason for these differences is to keep the amount of log space and CPU time
41 required to process objects being modified as small as possible and hence the
43 and some parts of objects are more frequently modified than others, so keeping
49 together are different and are dependent on the object and/or modification being
51 followed to guarantee forwards progress and prevent deadlocks.
58 reservation they take. These are known as "one shot" and "permanent"
63 The type and size of reservation must be matched to the modification taking
64 place.  This means that permanent transactions can be used for one-shot
65 modifications, but one-shot reservations cannot be used for permanent
68 In the code, a one-shot transaction pattern looks somewhat like this::
82 transactions, and the pattern looks like this::
97 While this might look similar to a one-shot transaction, there is an important
107 transactions is running, nothing else can read from or write to the inode and
121 the high level operation must use intents and deferred operations to guarantee
123 the on-disk journal.
142 in the journal and wait for that to complete.
162 modifications to objects and items. As such, the reservation needs to be large
165 transaction, we have to reserve enough space to record a full leaf-to-root split
174 another btree which might require more space. And so on.  Hence the amount of
183 For one-shot transactions, a single unit space reservation is all that is
190 transaction rolling mechanism to re-reserve space on every transaction roll. We
194 For example, an inode allocation is typically two transactions - one to
195 physically allocate a free inode chunk on disk, and another to allocate an inode
204 reservations. That multiple is defined by the reservation log count, and this
205 means we can roll the transaction multiple times before we have to re-reserve
210 re-reserve physical space in the log. This is somewhat complex, and requires
219 of a cycle number - the number of times the log has been overwritten - and the
220 offset into the log.  A LSN carries the cycle in the upper 32 bits and the
228 grant head and the current log tail. That is, how much space can be
229 reserved/consumed before the grant heads would fully wrap the log and overtake
233 reservations currently held by active transactions. It is a purely in-memory
234 accounting of the space reservation and, as such, actually tracks byte offsets
240 reservations amounts and the exact byte count that modifications actually make
241 and need to write into the log. The reserve head is used to prevent new
243 tail. It will block new reservations in a FIFO queue and as the log tail moves
249 head contains an LSN and it tracks the physical space usage in the log. While
251 - and it mostly does track exactly the same location as the reserve grant head -
255 These differences when a permanent transaction is rolled and the internal "log
256 count" reaches zero and the initial set of unit reservations have been
260 available, as we may end up on the end of the FIFO queue and the items we have
262 enough free space in the log to fulfill all of the pending reservations and
269 grant head does not track physical space - it only accounts for the amount of
272 immediately and remain throttled until the log tail is moved forward far enough
273 to remove the overcommit and start taking new reservations. In other words, we
274 can overcommit the reserve head without violating the physical log head and tail
278 xfs_trans_commit() calls, while the physical log space reservation - tracked by
279 the write head - is then reserved separately by a call to xfs_log_reserve()
287 "Re-logging" the locked items on every transaction roll ensures that the items
289 physical head of the log and so do not pin the tail of the log. If a locked item
291 deadlock the log as we cannot take the locks needed to write back that item and
292 move the tail of the log forwards to free up write grant space. Re-logging the
293 locked items avoids this deadlock and guarantees that the log reservation we are
294 making cannot self-deadlock.
298 tail moving forwards and hence ensuring that write grant space is always
303 Re-logging Explained
309 method called "re-logging". Conceptually, this is quite simple - all it requires
313 That is, if we have a sequence of changes A through to F, and the object was
315 of transactions, their contents and the log sequence number (LSN) of the
333 of each subsequent transaction, and it's the technique that allows us to
334 implement long-running, multiple-commit permanent transactions. 
339 keeps relogging the inode and btree buffers as they get modified in each
347 the log - repeated operations to the same objects write the same changes to
348 the log over and over again. Worse is the fact that objects tend to get
352 It should now also be obvious how relogging and asynchronous transactions go
357 in memory - batching them, if you like - to minimise the impact of the log IO on
360 The limitation on asynchronous transaction throughput is the number and size of
362 buffers available and the size of each is 32kB - the size can be increased up
366 that can be made to the filesystem at any point in time - if all the log
367 buffers are full and under IO, then no more transactions can be committed until
369 be to able to issue enough transactions to keep the log buffers full and under
383 but only one of those copies needs to be there - the last one "D", as it
385 necessary copy in the log buffer, and three stale copies that are simply
389 log would greatly reduce the amount of metadata we write to the log, and this
399 Delayed logging is the name we've given to keeping and tracking transactional
402 actually relatively easy to do - all the changes to logged items are already
404 them and get them to the log in a consistent, recoverable manner.
405 Describing the problems and how they have been solved is the focus of this
410 metadata changes from the size and number of log buffers available. In other
426 log is used effectively in many filesystems including ext3 and ext4. Hence
428 concept is sound. Instead it is simply considered a "solved problem" and as
438 	4. No on-disk format change (metadata or log format).
439 	5. Enable and disable with a mount option.
446 ---------------
453 require us to lock every object, format them, and then unlock them again.
456 running. For example, a transaction has object A locked and modified, but needs
458 flushing thread has the delayed logging tracking lock already held, and is
460 to be an unsolvable deadlock condition, and it was solving this problem that
463 The solution is relatively simple - it just took a long time to recognise it.
471 If we then copy the vector into the memory buffer and rewrite the vector to
474 that does not require us to lock the item to access. This formatting and
476 resulting in a vector that is transactionally consistent and can be accessed
481 formatting method and the delayed logging formatting can be seen in the
486     Object    +---------------------------------------------+
487     Vector 1      +----+
488     Vector 2                    +----+
489     Vector 3                                   +----------+
493     Log Buffer    +-V1-+-V2-+----V3----+
497     Object    +---------------------------------------------+
498     Vector 1      +----+
499     Vector 2                    +----+
500     Vector 3                                   +----------+
504     Memory Buffer +-V1-+-V2-+----V3----+
505     Vector 1      +----+
506     Vector 2           +----+
507     Vector 3                +----------+
509 The memory buffer and associated vector need to be passed as a single object,
518 buffer writing (i.e. double encapsulation). This would be an on-disk format
519 change and as such is not desirable.  It also means we'd have to write the log
523 Hence we need to keep the vector, but by attaching the memory buffer to it and
525 self-describing object that can be passed to the log buffer write code to be
527 Hence we avoid needing a new on-disk format to handle items that have been
532 ----------------
535 them to be used without limitations, we need to be able to track and accumulate
537 log item is the natural place to store this vector and buffer, and also makes sense
543 and as such are stored in the Active Item List (AIL) which is a LSN-ordered
545 completion, after which they are unpinned and can be written to disk. An object
547 and then moved forward in the AIL when the log buffer IO completes for that
551 and relogged, so any tracking must be separate to the AIL infrastructure. As
554 committed item tracking needs its own locks, lists and state fields in the log
559 committed and have formatted memory buffers attached to them. It tracks objects
561 its place in the list and re-inserted at the tail. This is entirely arbitrary
562 and done to make it easy for debugging - the last items in the list are the
569 ----------------------------
573 We need to write these items in the order that they exist in the CIL, and they
575 written as an atomic transaction comes from the requirements of relogging and
576 log replay - all the changes in all the objects in a given transaction must
585 reason for this limit is that to find the head and tail of the log, there must
590 failure and an inconsistent filesystem and hence we must enforce the maximum
594 to any other transaction - it contains a transaction header, a series of
595 formatted log items and a commit record at the tail. From a recovery
596 perspective, the checkpoint transaction is also no different - just a lot
600 Because the checkpoint is just another transaction and all the changes to log
607 per-checkpoint context that travels through the log write process through to
615 context and attach that to the CIL for aggregation of new transactions.
618 committed items and effectively allows new transactions to be issued while we
626 the same time another transaction modifies the item and inserts the log item
631 buffer and log vector attached to each log item needs to be attached to the
638 	Log Item <-> log vector 1	-> memory buffer
639 	   |				-> vector array
641 	Log Item <-> log vector 2	-> memory buffer
642 	   |				-> vector array
647 	Log Item <-> log vector N-1	-> memory buffer
648 	   |				-> vector array
650 	Log Item <-> log vector N	-> memory buffer
651 					-> vector array
653 And after the flush the CIL head is empty, and the checkpoint context log
659 	log vector 1	-> memory buffer
660 	   |		-> vector array
661 	   |		-> Log Item
663 	log vector 2	-> memory buffer
664 	   |		-> vector array
665 	   |		-> Log Item
670 	log vector N-1	-> memory buffer
671 	   |		-> vector array
672 	   |		-> Log Item
674 	log vector N	-> memory buffer
675 			-> vector array
676 			-> Log Item
678 Once this transfer is done, the CIL can be unlocked and new transactions can
686 and unpin) in the log vector chain and then free the log vector chain and
692 vectors and break the link between the log item and the log vector means that
695 break the link between the log item and the log vector, which means we should
698 vectors in one checkpoint transaction. I'd guess this is a "measure and
699 compare" situation that can be done after a working and reviewed implementation
703 --------------------------------------
710 re-using a freed metadata extent for a data extent), a special, optimised log
720 As discussed in the checkpoint section, delayed logging uses per-checkpoint
721 contexts, and as such it is simple to assign a sequence number to each
725 atomic counter - we can just take the current context sequence number and add
754 else for such serialisation - it only matters when we do a log force.
758 is, we need to flush the CIL and potentially wait for it to complete. This is a
760 and push if required. Indeed, placing the current sequence checkpoint flush in
763 force the log at the LSN of that transaction) and so the higher level code
767 ------------------------------------------------
778 transaction and region headers, headers for split regions, buffer tail padding,
781 the size of the transaction and the number of regions being logged (the number
785 inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
786 there are lots of transactions that only contain an inode core and an inode log
791 each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
792 buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
794 not particularly flexible and is difficult to select the "optimal value" for
799 reservation by tracking the space currently used by the object in the CIL and
806 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
808 reservation needs to be made before the checkpoint is started, and we need to
810 reservation of around 150KB, which is a non-trivial amount of space.
812 A static reservation needs to manipulate the log grant counters - we can take a
821 space available in the log if we are to use static reservations, and that is
822 very difficult and complex to arrange. It is possible to do, but there is a
826 items in the CIL and using this to dynamically calculate the amount of log
832 maximal amount of log metadata space they require, and such a delta reservation
836 are added to the CIL and avoid the need for reserving and regranting log space
837 up front. This avoids deadlocks and removes a blocking point from the
844 a "background flush" and is done on demand. This is identical to
846 checkpoint commit to complete. This background push is checked and executed by
851 force will push the CIL to disk, and if the transaction subsystem stays idle,
859 ---------------------------------
868 only becomes unpinned when all the transactions complete and there are no
869 pending transactions. Thus the pinning and unpinning of a log item is symmetric
870 as there is a 1:1 relationship with transaction commit and log item completion.
875 That is, we now have a many-to-one relationship between transaction commit and
876 log item completion. The result of this is that pinning and unpinning of the
882 pinning and unpinning becomes symmetric around a checkpoint context. We have to
883 pin the object the first time it is inserted into the CIL - if it is already in
894 current CIL or not. If we don't pin the CIL first before we check and pin the
895 object, we have a race with CIL being flushed between the check and the pin
900 ---------------------------------------
910 points in the design - the three important ones are:
913 	2. Adding items to the CIL and updating item space accounting
916 Looking at the transaction commit and CIL flushing interactions, it is clear
917 that we have a many-to-one interaction here. That is, the only restriction on
924 relatively long period of time - the pinning of log items needs to be done
932 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
935 items, it will get held for a significant time and so spin contention is a
941 compared to transaction commit for asynchronous transaction workloads - only
942 time will tell if using a read-write semaphore for exclusion will limit
950 held for a very short time and so a spin lock is appropriate here. It is
955 that is run as part of the checkpoint commit and log force sequencing. The code
959 checkpoints and needs to block waiting for checkpoints to complete their commit
960 record write. As a result it needs a lock and a wait variable. Log force
961 sequencing also requires the same lock, list walk, and blocking mechanism to
969 the committing list (i.e. they've completed). A simple wait variable and
973 the broadcast wakeups these operations can be put under a new spinlock and
974 given separate wait lists to reduce lock contention and the number of processes
979 -----------------
1019 Essentially, steps 1-6 operate independently from step 7, which is also
1020 independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
1021 at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
1022 at the same time. If the log item is in the AIL or between steps 6 and 7
1023 and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
1024 are entered and completed is the object considered clean.
1041 		Attach log vector and buffer to log item
1050 		Chain log vectors and buffers together
1075 logging methods are in the middle of the life cycle - they still have the same
1076 beginning and end and execution constraints. The only differences are in the
1077 committing of the log items to the log itself and the completion processing.
1081 As a result of this zero-impact "insertion" of delayed logging infrastructure
1082 and the design of the internal structures to avoid on disk format changes, we
1083 can basically switch between delayed logging and the existing mechanism with a
1085 be able to swap methods automatically and transparently depending on load