xfs-delayed-logging-design.rst - OpenGrok cross reference for /Documentation/filesystems/xfs/xfs-delayed-logging-design.rst

1 .. SPDX-License-Identifier: GPL-2.0
11 subsystem is based on. This document describes the design and algorithms that
12 the XFS journalling subsystem is based on so that readers may familiarize
26 XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
33 details logged are made up of the changes to in-core structures rather than
34 on-disk structures. Other objects - typically buffers - have their physical
49 together are different and are dependent on the object and/or modification being
64 place.  This means that permanent transactions can be used for one-shot
65 modifications, but one-shot reservations cannot be used for permanent
68 In the code, a one-shot transaction pattern looks somewhat like this::
97 While this might look similar to a one-shot transaction, there is an important
107 transactions is running, nothing else can read from or write to the inode and
123 the on-disk journal.
142 in the journal and wait for that to complete.
146 tend to use log forces to ensure modifications are on stable storage only when
161 available to write the modification into the journal before we start making
165 transaction, we have to reserve enough space to record a full leaf-to-root split
174 another btree which might require more space. And so on.  Hence the amount of
180 so that when we come to write the dirty metadata into the log we don't run out
181 of log space half way through the write.
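The reserve-before-modify rule described above can be sketched as a toy model. This is only an illustration under stated assumptions: `toy_log`, `toy_trans`, `toy_trans_reserve()` and `toy_trans_commit()` are hypothetical names invented for this sketch and are not XFS interfaces, and a failed reservation returns `false` here where a real log would put the caller to sleep until space is available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified log: only tracks space not yet promised. */
struct toy_log {
	size_t free_bytes;
};

/* Hypothetical transaction carrying its own space reservation. */
struct toy_trans {
	struct toy_log *log;
	size_t reserved;	/* bytes promised to this transaction */
};

/*
 * Reserve worst-case space before any modification is made.
 * Returns false if the log cannot promise that much space.
 */
static bool toy_trans_reserve(struct toy_trans *tp, struct toy_log *log,
			      size_t worst_case)
{
	if (log->free_bytes < worst_case)
		return false;
	log->free_bytes -= worst_case;
	tp->log = log;
	tp->reserved = worst_case;
	return true;
}

/*
 * Commit: consume the bytes actually used and hand back the remainder.
 * The write cannot run out of space part way through because
 * used <= reserved was guaranteed when the reservation was taken.
 */
static void toy_trans_commit(struct toy_trans *tp, size_t used)
{
	assert(used <= tp->reserved);
	tp->log->free_bytes += tp->reserved - used;
	tp->reserved = 0;
}
```

In this model a second transaction that cannot get its worst-case reservation never starts modifying anything, which is the property the reservation exists to provide.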
183 For one-shot transactions, a single unit space reservation is all that is
190 transaction rolling mechanism to re-reserve space on every transaction roll. We
194 For example, an inode allocation is typically two transactions - one to
195 physically allocate a free inode chunk on disk, and another to allocate an inode
205 means we can roll the transaction multiple times before we have to re-reserve
210 re-reserve physical space in the log. This is somewhat complex, and requires
219 of a cycle number - the number of times the log has been overwritten - and the
233 reservations currently held by active transactions. It is a purely in-memory
241 and need to write into the log. The reserve head is used to prevent new
248 The other grant head is the "write" head. Unlike the reserve head, this grant
251 - and it mostly does track exactly the same location as the reserve grant head -
260 available, as we may end up on the end of the FIFO queue and the items we have
269 grant head does not track physical space - it only accounts for the amount of
278 xfs_trans_commit() calls, while the physical log space reservation - tracked by
279 the write head - is then reserved separately by a call to xfs_log_reserve()
281 physical log space to be reserved from the write grant head, but only if one
287 "Re-logging" the locked items on every transaction roll ensures that the items
290 pins the tail of the log when we sleep on the write reservation, then we will
291 deadlock the log as we cannot take the locks needed to write back that item and
292 move the tail of the log forwards to free up write grant space. Re-logging the
294 making cannot self-deadlock.
298 tail moving forwards and hence ensuring that write grant space is always
303 Re-logging Explained
309 method called "re-logging". Conceptually, this is quite simple - all it requires
334 implement long-running, multiple-commit permanent transactions.
347 the log - repeated operations to the same objects write the same changes to
357 in memory - batching them, if you like - to minimise the impact of the log IO on
360 The limitation on asynchronous transaction throughput is the number and size of
362 buffers available and the size of each is 32kB - the size can be increased up
366 that can be made to the filesystem at any point in time - if all the log
383 but only one of those copies needs to be there - the last one "D", as it
386 wasting space. When we are doing repeated operations on the same set of
389 log would greatly reduce the amount of metadata we write to the log, and this
402 actually relatively easy to do - all the changes to logged items are already
413 being accumulated in memory. Hence the potential for loss of metadata on a
438 	4. No on-disk format change (metadata or log format).
446 ---------------
459 trying to get the lock on object A to flush it to the log buffer. This appears
463 The solution is relatively simple - it just took a long time to recognise it.
465 vector array that points to the changed regions in the item. The log write code
486     Object    +---------------------------------------------+
487     Vector 1      +----+
488     Vector 2                    +----+
489     Vector 3                                   +----------+
493     Log Buffer    +-V1-+-V2-+----V3----+
497     Object    +---------------------------------------------+
498     Vector 1      +----+
499     Vector 2                    +----+
500     Vector 3                                   +----------+
504     Memory Buffer +-V1-+-V2-+----V3----+
505     Vector 1      +----+
506     Vector 2           +----+
507     Vector 3                +----------+
518 buffer writing (i.e. double encapsulation). This would be an on-disk format
519 change and as such is not desirable.  It also means we'd have to write the log
521 region state that needs to be placed into the headers during the log write.
525 self-describing object that can be passed to the log buffer write code to be
527 Hence we avoid needing a new on-disk format to handle items that have been
532 ----------------
543 and as such are stored in the Active Item List (AIL) which is a LSN-ordered
561 its place in the list and re-inserted at the tail. This is entirely arbitrary
562 and done to make it easy for debugging - the last items in the list are the
569 ----------------------------
573 We need to write these items in the order that they exist in the CIL, and they
576 log replay - all the changes in all the objects in a given transaction must
581 To fulfill this requirement, we need to write the entire CIL in a single log
582 transaction. Fortunately, the XFS log code has no fixed limit on the size of a
588 crash during the write of such a transaction could partially overwrite the
594 to any other transaction - it contains a transaction header, a series of
596 perspective, the checkpoint transaction is also no different - just a lot
602 code to write the changes into the log. To do this efficiently, we need to
604 transaction. The current log write code enables us to do this easily with the
607 per-checkpoint context that travels through the log write process through to
638 	Log Item <-> log vector 1	-> memory buffer
639 	   |				-> vector array
641 	Log Item <-> log vector 2	-> memory buffer
642 	   |				-> vector array
647 	Log Item <-> log vector N-1	-> memory buffer
648 	   |				-> vector array
650 	Log Item <-> log vector N	-> memory buffer
651 					-> vector array
659 	log vector 1	-> memory buffer
660 	   |		-> vector array
661 	   |		-> Log Item
663 	log vector 2	-> memory buffer
664 	   |		-> vector array
665 	   |		-> Log Item
670 	log vector N-1	-> memory buffer
671 	   |		-> vector array
672 	   |		-> Log Item
674 	log vector N	-> memory buffer
675 			-> vector array
676 			-> Log Item
703 --------------------------------------
710 re-using a freed metadata extent for a data extent), a special, optimised log
720 As discussed in the checkpoint section, delayed logging uses per-checkpoint
725 atomic counter - we can just take the current context sequence number and add
740 we can also wait on the log buffer that contains the commit record, thereby
750 are also committed to disk before the one we need to wait for. Therefore we
752 complete before waiting on the one we need to complete. We do this
753 synchronisation in the log force code so that we don't need to wait anywhere
754 else for such serialisation - it only matters when we do a log force.
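The ordering rule for a log force (wait for every earlier checkpoint before waiting on the one requested) can be sketched as follows. This is a single-threaded illustration, not the kernel implementation: `toy_ctx`, `toy_committing` and `toy_force_wait_order()` are hypothetical names, sequence numbers are assumed to be unique and to start at 1, and visiting the earlier checkpoints in ascending order is a choice made by this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CTX 8

/* Hypothetical checkpoint context: reduced to its sequence number. */
struct toy_ctx {
	unsigned long sequence;
};

/* Hypothetical committing list, held in no particular order. */
struct toy_committing {
	struct toy_ctx ctx[TOY_MAX_CTX];
	size_t count;
};

/*
 * Compute the order in which a log force for target_seq would wait:
 * every checkpoint with a lower sequence first, then the target itself.
 * Writes the sequences to order[] and returns how many were written.
 */
static size_t toy_force_wait_order(const struct toy_committing *cl,
				   unsigned long target_seq,
				   unsigned long *order)
{
	size_t n = 0;
	unsigned long last = 0;	/* sequences are assumed to start at 1 */

	for (;;) {
		unsigned long next = 0;
		int found = 0;

		/* Pick the smallest sequence in (last, target_seq]. */
		for (size_t i = 0; i < cl->count; i++) {
			unsigned long s = cl->ctx[i].sequence;

			if (s > last && s <= target_seq &&
			    (!found || s < next)) {
				next = s;
				found = 1;
			}
		}
		if (!found)
			break;
		order[n++] = next;
		last = next;
	}
	return n;
}
```

Checkpoints with a sequence above the target are never waited on, which matches the point that this serialisation only matters to the caller doing the log force.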
758 is, we need to flush the CIL and potentially wait for it to complete. This is a
767 ------------------------------------------------
771 ahead of time, nor how many log buffers it will take to write out, nor the
780 transaction. While some of this is fixed overhead, much of it is dependent on
785 inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
788 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
792 buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
810 reservation of around 150KB, which is a non-trivial amount of space.
812 A static reservation needs to manipulate the log grant counters - we can take a
813 permanent reservation on the space, but we still need to make sure we refresh
814 the write reservation (the actual space available to the transaction) after
844 a "background flush" and is done on demand. This is identical to
859 ---------------------------------
875 That is, we now have a many-to-one relationship between transaction commit and
877 log items becomes unbalanced if we retain the "pin on transaction commit, unpin
878 on transaction completion" model.
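The imbalance can be sketched with a toy pin counter. Everything here is hypothetical and simplified for illustration: `toy_item` and the three helpers are not XFS structures or functions, and the sketch folds "removed from the CIL" and "checkpoint completed" into a single step. It compares pinning on every commit with pinning only on first insertion into the CIL, against a single unpin per checkpoint.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical log item: a pin count and a CIL membership flag. */
struct toy_item {
	int pin_count;
	bool in_cil;
};

/* "Pin on transaction commit": every commit takes another pin. */
static void toy_commit_pin_always(struct toy_item *it)
{
	it->pin_count++;
	it->in_cil = true;
}

/* "Pin on insertion into the CIL": relogging an item already there is free. */
static void toy_commit_pin_on_insert(struct toy_item *it)
{
	if (!it->in_cil) {
		it->pin_count++;
		it->in_cil = true;
	}
}

/* One checkpoint completion drops exactly one pin per item it carried. */
static void toy_checkpoint_complete(struct toy_item *it)
{
	it->pin_count--;
	it->in_cil = false;
}
```

With three commits of the same item followed by one checkpoint completion, the first policy leaves two pins outstanding while the second returns the count to zero.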
880 To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
881 insertion into the CIL, unpin on checkpoint completion". In other words, the
883 pin the object the first time it is inserted into the CIL - if it is already in
893 the fact that pinning the item is dependent on whether the item is present in the
900 ---------------------------------------
910 points in the design - the three important ones are:
917 that we have a many-to-one interaction here. That is, the only restriction on
924 relatively long period of time - the pinning of log items needs to be done
932 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
933 want every other CPU in the machine spinning on the CIL lock. Given that
941 compared to transaction commit for asynchronous transaction workloads - only
942 time will tell if using a read-write semaphore for exclusion will limit
943 transaction commit concurrency due to cache line bouncing of the lock on the
946 The second serialisation point is on the transaction commit side where items
960 record write. As a result it needs a lock and a wait variable. Log force
966 sequencing needs to wait until checkpoint contexts contain a commit LSN
967 (obtained through completion of a commit record write) while log force
968 sequencing needs to wait until previous checkpoint contexts are removed from
969 the committing list (i.e. they've completed). A simple wait variable and
972 much contention on the CIL lock, or too many context switches as a result of
974 given separate wait lists to reduce lock contention and the number of processes
979 -----------------
996 		Write commit LSN into transaction
1006 			Write commit LSN into log item
1019 Essentially, steps 1-6 operate independently from step 7, which is also
1020 independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
1021 at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
1023 and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
1043 		Write CIL context sequence into transaction
1053 		write log vectors into log
1063 			Write commit LSN into log item
1075 logging methods are in the middle of the life cycle - they still have the same
1078 Hence delayed logging should not introduce any constraints on log item
1081 As a result of this zero-impact "insertion" of delayed logging infrastructure
1082 and the design of the internal structures to avoid on disk format changes, we
1085 be able to swap methods automatically and transparently depending on load