xfs-delayed-logging-design.rst - OpenGrok cross reference for /kernel/linux/linux-5.10/Documentation/filesystems/xfs-delayed-logging-design.rst

Lines Matching refs:log
14 logged. The reason for these differences is to reduce the amount of log space
21 modifications to a single object to be carried in the log at any given time.
22 This allows the log to avoid needing to flush each change to disk before
26 changes in the new transaction that is written to the log.
29 written to disk after change D, we would see in the log the following series
30 of transactions, their contents and the log sequence number (LSN) of the
43 the aggregation of all the previous changes currently held only in the log.
45 This relogging technique also allows objects to be moved forward in the log so
46 that an object being relogged does not prevent the tail of the log from ever
49 direct encoding of the location in the log of the transaction.
53 a special log reservation known as a permanent transaction reservation. A
58 removal operation. This keeps them moving forward in the log as the operation
60 log wraps around.
65 the log - repeated operations to the same objects write the same changes to
66 the log over and over again. Worse is the fact that objects tend to get
68 metadata into the log.
71 asynchronous. That is, they don't commit to disk until either a log buffer is
72 filled (a log buffer can hold multiple transactions) or a synchronous operation
73 forces the log buffers holding the transactions to disk. This means that XFS is
75 minimise the impact of the log IO on transaction throughput.
78 log buffers made available by the log manager. By default there are 8 log
83 that can be made to the filesystem at any point in time - if all the log
86 be able to issue enough transactions to keep the log buffers full and under
95 multiple times before they are committed to disk in the log buffers. If we
97 transactions A through D are committed to disk in the same log buffer.
99 That is, a single log buffer may contain multiple copies of the same object,
102 necessary copy in the log buffer, and three stale copies that are simply
104 objects, these "stale objects" can be over 90% of the space used in the log
106 log would greatly reduce the amount of metadata we write to the log, and this
110 memory == log buffer), only it is doing it extremely inefficiently. It is using
113 formatting the changes in a transaction to the log buffer. Hence we cannot avoid
114 accumulating stale objects in the log buffers.
117 changes to objects in memory outside the log buffer infrastructure. Because of
121 them and get them to the log in a consistent, recoverable manner.
127 metadata changes from the size and number of log buffers available. In other
129 written to the log at any point in time, there may be a much greater amount
133 It should be noted that this does not change the guarantee that log recovery
143 log is used effectively in many filesystems including ext3 and ext4. Hence
150 	1. Reduce the amount of metadata written to the log by at least
155 	4. No on-disk format change (metadata or log format).
166 existing log item dirty region tracking) is that when it comes to writing the
167 changes to the log buffers, we need to ensure that the object we are formatting
169 concurrent modification. Hence flushing the logical changes to the log would
176 trying to get the lock on object A to flush it to the log buffer. This appears
182 vector array that points to the changed regions in the item. The log write code
183 simply copies the memory these vectors point to into the log buffer during
185 using the log buffer as the destination of the formatting code, we can use an
190 the changes in a format that is compatible with the log buffer writing code.
197 asynchronous transactions to the log. The differences between the existing
201 Current format log vector::
232 buffer is to support splitting vectors across log buffer boundaries correctly.
234 are in the item, so we'd need a new encapsulation method for regions in the log
236 change and as such is not desirable.  It also means we'd have to write the log
238 region state that needs to be placed into the headers during the log write.
242 self-describing object that can be passed to the log buffer write code to be
243 handled in exactly the same manner as the existing log vectors are handled.
253 them so that they can be written to the log at some later point in time.  The
254 log item is the natural place to store this vector and buffer, and also makes sense
258 The log item is already used to track the log items that have been written to
259 the log but not yet written to disk. Such log items are considered "active"
261 doubly linked list. Items are inserted into this list during log buffer IO
264 and then moved forward in the AIL when the log buffer IO completes for that
271 committed item tracking needs its own locks, lists and state fields in the log
275 called the Committed Item List (CIL).  The list tracks log items that have been
288 When we have a log synchronisation event, commonly known as a "log force",
289 all the items in the CIL must be written into the log via the log buffers.
293 log replay - all the changes in all the objects in a given transaction must
294 either be completely replayed during log recovery, or not replayed at all. If
295 a transaction is not replayed because it is not complete in the log, then
298 To fulfill this requirement, we need to write the entire CIL in a single log
299 transaction. Fortunately, the XFS log code has no fixed limit on the size of a
300 transaction, nor does the log replay code. The only fundamental limit is that
301 the transaction cannot be larger than just under half the size of the log.  The
302 reason for this limit is that to find the head and tail of the log, there must
303 be at least one complete transaction in the log at any given time. If a
304 transaction is larger than half the log, then there is the possibility that a
306 only complete previous transaction in the log. This will result in a recovery
308 size of a checkpoint to be slightly less than half the log.
312 formatted log items and a commit record at the tail. From a recovery
317 Because the checkpoint is just another transaction and all the changes to log
318 items are stored as log vectors, we can use the existing log buffer writing
319 code to write the changes into the log. To do this efficiently, we need to
321 transaction. The current log write code enables us to do this easily with the
322 way it separates the writing of the transaction contents (the log vectors) from
324 per-checkpoint context that travels through the log write process through to
336 are formatting the checkpoint into the log. It also allows concurrent
337 checkpoints to be written into the log buffers in the case of log force heavy
339 requires that we strictly order the commit records in the log so that
340 checkpoint sequence order is maintained during log replay.
343 the same time another transaction modifies the item and inserts the log item
344 into the new CIL, then checkpoint transaction commit code cannot use log items
345 to store the list of log vectors that need to be written into the transaction.
346 Hence log vectors need to be able to be chained together to allow them to be
347 detached from the log items. That is, when the CIL is flushed the memory
348 buffer and log vector attached to each log item needs to be attached to the
349 checkpoint context so that the log item can be released. In diagrammatic form,
355 	Log Item <-> log vector 1	-> memory buffer
358 	Log Item <-> log vector 2	-> memory buffer
364 	Log Item <-> log vector N-1	-> memory buffer
367 	Log Item <-> log vector N	-> memory buffer
370 And after the flush the CIL head is empty, and the checkpoint context log
376 	log vector 1	-> memory buffer
380 	log vector 2	-> memory buffer
387 	log vector N-1	-> memory buffer
391 	log vector N	-> memory buffer
396 start, while the checkpoint flush code works over the log vector chain to
399 Once the checkpoint is written into the log buffers, the checkpoint context is
400 attached to the log buffer that the commit record was written to along with a
402 run transaction committed processing for the log items (i.e. insert into AIL
403 and unpin) in the log vector chain and then free the log vector chain and
406 Discussion Point: I am uncertain as to whether the log item is the most
408 it. The fact that we walk the log items (in the CIL) just to chain the log
409 vectors and break the link between the log item and the log vector means that
410 we take a cache line hit for the log item list modification, then another for
411 the log vector chaining. If we track by the log vectors, then we only need to
412 break the link between the log item and the log vector, which means we should
413 dirty only the log item cachelines. Normally I wouldn't be concerned about one
414 vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
423 committed transactions with the log sequence number of the transaction commit.
426 committed to the log. In the rare case that a dependent operation occurs (e.g.
427 re-using a freed metadata extent for a data extent), a special, optimised log
431 transaction. This LSN comes directly from the log buffer the transaction is
434 written directly into the log buffers. Hence some other method of sequencing
445 Then, instead of assigning a log buffer LSN to the transaction commit LSN
449 result, the code that forces the log to a specific LSN now needs to ensure that
450 the log forces to a specific checkpoint.
453 that are currently committing to the log. When we flush a checkpoint, the
457 we can also wait on the log buffer that contains the commit record, thereby
458 using the existing log force mechanisms to execute synchronous forces.
461 mitigation algorithms similar to the current log buffer code to allow
466 The main concern with log forces is to ensure that all the previous checkpoints
470 synchronisation in the log force code so that we don't need to wait anywhere
471 else for such serialisation - it only matters when we do a log force.
473 The only remaining complexity is that a log force now also has to handle the
476 simple addition to the existing log forcing code to check the sequence numbers
478 the log force code enables the current mechanism for issuing synchronous
480 force the log at the LSN of that transaction) and so the higher level code
486 The big issue for a checkpoint transaction is the log space reservation for the
488 ahead of time, nor how many log buffers it will take to write out, nor the
489 number of split log vector regions that are going to be used. We can track the
490 amount of log space required as we add items to the commit item list, but we
491 still need to reserve the space in the log for the checkpoint.
493 A typical transaction reserves enough space in the log for the worst case space
494 usage of the transaction. The reservation accounts for log record headers,
499 of log vectors in the transaction).
503 there are lots of transactions that only contain an inode core and an inode log
510 space.  From this, it should be obvious that a static log space reservation is
519 log buffer metadata used such as log header records.
521 However, even using a static reservation for just the log metadata is
522 problematic. Typically log record headers use at least 16KB of log space per
523 1MB of log space consumed (512 bytes per 32KB) and the reservation needs to be
529 A static reservation needs to manipulate the log grant counters - we can take a
536 checkpoints to be able to free up log space (refer back to the description of
538 space available in the log if we are to use static reservations, and that is
542 The simpler way of doing this is tracking the entire log space used by the
543 items in the CIL and using this to dynamically calculate the amount of log
544 space required by the log metadata. If this log metadata space changes as a
549 maximal amount of log metadata space they require, and such a delta reservation
553 are added to the CIL and avoid the need for reserving and regranting log space
558 log. Hence as part of the reservation growing, we need to also check the size
560 the maximum threshold, we need to push the CIL to the log. This is effectively
562 a CIL push triggered by a log force, only that there is no waiting for the
567 they will be flushed by the periodic log force issued by the xfssyncd. This log
569 allow the idle log to be covered (effectively marked clean) in exactly the same
571 whether this log force needs to be done more frequently than the current rate
578 Currently log items are pinned during transaction commit while the items are
581 that items get pinned once for every transaction that is committed to the log
582 buffers. Hence items that are relogged in the log buffers will have a pin count
586 pending transactions. Thus the pinning and unpinning of a log item is symmetric
587 as there is a 1:1 relationship with transaction commit and log item completion.
593 log item completion. The result of this is that pinning and unpinning of the
594 log items becomes unbalanced if we retain the "pin on transaction commit, unpin
636 the amount of space available in the log for their reservations. The practical
638 128MB log, which means that it is generally one per CPU in a machine.
641 relatively long period of time - the pinning of log items needs to be done
651 flushing the CIL could involve walking a list of tens of thousands of log
672 that is run as part of the checkpoint commit and log force sequencing. The code
673 path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
674 an ordering loop after writing all the log vectors into the log buffers but
684 (obtained through completion of a commit record write) while log force
698 The existing log item life cycle is as follows::
705 			Allocate log item
706 			Attach log item to owner item
707 		Attach log item to transaction
709 		Record modifications in log item
712 		Format item into log buffer
715 		Attach transaction to log buffer
717 	<log buffer IO dispatched>
718 	<log buffer IO completes>
721 		Mark log item committed
722 		Insert log item into AIL
723 			Write commit LSN into log item
724 		Unpin log item
727 		Mark log item clean
733 		Moves log tail
739 at the same time. If the log item is in the AIL or between steps 6 and 7
750 			Allocate log item
751 			Attach log item to owner item
752 		Attach log item to transaction
754 		Record modifications in log item
757 		Format item into log vector + buffer
758 		Attach log vector and buffer to log item
759 		Insert log item into CIL
763 	<next log force>
767 		Chain log vectors and buffers together
770 		write log vectors into log
772 		attach checkpoint context to log buffer
774 	<log buffer IO dispatched>
775 	<log buffer IO completes>
778 		Mark log item committed
780 			Write commit LSN into log item
781 		Unpin log item
784 		Mark log item clean
788 		Moves log tail
794 committing of the log items to the log itself and the completion processing.
795 Hence delayed logging should not introduce any constraints on log item
801 mount option. Fundamentally, there is no reason why the log manager would not