
1 .. SPDX-License-Identifier: GPL-2.0
6 Heading 1 uses "====" above and below
8 Heading 3 uses "----"
25 - To help kernel distributors understand exactly what the XFS online fsck
26 feature is, and issues about which they should be aware.
28 - To help people reading the code to familiarize themselves with the relevant
29 concepts and design points before they start digging into the code.
31 - To help developers maintaining the system by capturing the reasons
41 Part 1 defines what fsck tools are and the motivations for writing a new one.
42 Parts 2 and 3 present a high level overview of how the online fsck process works
43 and how it is tested to ensure correct functionality.
44 Part 4 discusses the user interface and the intended usage modes of the new
46 Parts 5 and 6 show off the high level components and how they fit together, and
48 Part 7 sums up what has been discussed so far and speculates about what else
59 - Provide a hierarchy of names through which application programs can associate
62 - Virtualize physical storage media across those names, and
64 - Retrieve the named data blobs at any time.
66 - Examine resource usage.
70 Secondary metadata (e.g. reverse mapping and directory parent pointers) support
72 and reorganization.
79 cross-references different types of metadata records with each other to look
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
92 it is now possible to regenerate data structures when non-catastrophic errors
95 +--------------------------------------------------------------------------+
97 +--------------------------------------------------------------------------+
99 | separate storage systems through the creation of backups; and they avoid |
103 +--------------------------------------------------------------------------+
106 -----------------------
109 …nges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`…
110 ….kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, a…
111 …ges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
113 name across the kernel, xfsprogs, and fstests git repos.
116 --------------
119 XFS (on Linux) to check and repair filesystems.
123 (``xfs_db``) and can only be used with unmounted filesystems.
126 Due to its high memory requirements and inability to repair things, this
127 program is now deprecated and will not be discussed further.
129 The second program, ``xfs_repair``, was created to be faster and more robust
132 It uses extent-based in-memory data structures to reduce memory consumption,
133 and tries to schedule readahead IO appropriately to reduce I/O waiting time
136 inconsistencies in file metadata and the directory tree by erasing things as needed
141 -----------------
147 These occur **unpredictably** and often without warning.
165 health when doing so requires **manual intervention** and downtime.
171 Given this definition of the problems to be solved and the actors who would
175 This new third program has three components: an in-kernel facility to check
176 metadata, an in-kernel facility to repair metadata, and a userspace driver
179 The rest of this document presents the goals and use cases of the new fsck
180 tool, describes its major design points in connection to those goals, and
181 discusses the similarities and differences with existing tools.
183 +--------------------------------------------------------------------------+
185 +--------------------------------------------------------------------------+
191 | "online scrub", and portion of the kernel that fixes metadata is called |
193 +--------------------------------------------------------------------------+
195 The naming hierarchy is broken up into objects known as directories and files
196 and the physical space is split into pieces known as allocation groups.
197 Sharding enables better performance on highly parallel systems and helps to
199 The division of the filesystem into principal objects (allocation groups and
200 inodes) means that there are ample opportunities to perform targeted checks and
208 In summary, online fsck takes advantage of resource sharding and redundant
209 metadata to enable targeted checking and repair operations while the system
212 autonomous self-healing of XFS maximizes service availability.
217 Because it is necessary for online fsck to lock and scan live metadata objects,
221 reacting to the outcomes appropriately, and reporting results to the system
223 The second and third are in the kernel, which implements functions to check
224 and repair each type of online fsck work item.
226 +------------------------------------------------------------------+
228 +------------------------------------------------------------------+
231 +------------------------------------------------------------------+
235 metadata structure, and handle it well.
238 -----
240 In principle, online fsck should be able to check and to repair everything that
248 sharing and lock acquisition rules as the regular filesystem.
251 In other words, online fsck is not a complete replacement for offline fsck, and
255 and to **increase predictability of operation**.
260 --------------
262 The userspace driver program ``xfs_scrub`` splits the work of checking and
264 Each phase concentrates on checking specific types of scrub items and depends
268 1. Collect geometry information about the mounted filesystem and computer,
269 discover the online fsck capabilities of the kernel, and open the
272 2. Check allocation group metadata, all realtime volume metadata, and all quota
275 If corruption is found in the inode header or inode btree and ``xfs_scrub``
281 Optimizations and all other repairs are deferred to phase 4.
285 If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
286 and there were no problems detected during phase 2, then those scrub items
288 Optimizations, deferred repairs, and unsuccessful repairs are deferred to
291 4. All remaining repairs and scheduled optimizations are performed during this
293 Before starting repairs, the summary counters are checked and any necessary
301 5. By the start of this phase, all primary and secondary filesystem metadata
303 Summary counters such as the free space counts and quota resource counts
304 are checked and corrected.
305 Directory entry names and extended attribute names are checked for
309 6. If the caller asks for a media scan, read all allocated and written data
311 The ability to use hardware-assisted data file integrity checking is new
313 If media errors occur, they will be mapped to the owning files and reported.
315 7. Re-check the summary counters and present the caller with a summary of
316 space usage and file counts.
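
Under the hood, each scrub item that ``xfs_scrub`` schedules in these phases is
submitted to the kernel through the metadata scrub ioctl.
The sketch below is illustrative only: it checks the free space btrees of AG 0
and assumes the ``XFS_IOC_SCRUB_METADATA`` ioctl, ``struct xfs_scrub_metadata``,
and the type and flag constants exported through the XFS uapi headers, with all
capability probing and retry logic omitted.

.. code-block:: c

	/* Illustrative only; the real xfs_scrub probes capabilities and retries. */
	#include <errno.h>
	#include <sys/ioctl.h>
	#include <xfs/xfs.h>	/* assumed to expose the scrub ioctl definitions */

	static int check_ag0_free_space_btrees(int fsfd)
	{
		struct xfs_scrub_metadata	sm = {
			.sm_type	= XFS_SCRUB_TYPE_BNOBT,
			.sm_agno	= 0,
		};

		if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
			return -errno;

		/* The kernel reports its findings through the output flags. */
		if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
			return 1;	/* corruption found; schedule a repair */
		return 0;		/* clean */
	}
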
322 -------------------------
324 The kernel scrub code uses a three-step strategy for checking and repairing
328 optimization; and for values that are directly controlled by the system
331 released and the positive scan results are returned to userspace.
333 this, resources are released and the negative scan results are returned to
349 --------------------------
351 Each type of metadata object (and therefore each type of scrub item) is
362 - Free space and reference count information
364 - Inode records and indexes
366 - Storage mapping information for file data
368 - Directories
370 - Extended attributes
372 - Symbolic links
374 - Quota limits
376 Scrub obeys the same rules as regular filesystem accesses for resource and lock
383 errors and cross-references healthy records against other metadata to look for
389 Next, it stages the observations in a new ondisk structure and commits it
399 As a result, indexed structures can be rebuilt very quickly, and programs
402 observations and a means to write new structures to disk.
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
408 V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
410 *Extending Database Technology*, pp. 293-309, 1992.
413 in-memory array prior to formatting the new ondisk structure, which is very
414 similar to the list-based algorithm discussed in section 2.3 ("List-Based
429 - Reverse mapping information
431 - Directory parent pointers
444 Instead, repair functions set up an in-memory staging structure to store
449 The next step is to release all locks and start the filesystem scan.
455 Once the scan is done, the owning object is re-locked, the live data is used to
456 write a new ondisk structure, and the repairs are committed atomically.
457 The hooks are disabled and the staging area is freed.
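
Putting those steps together, the control flow of such a repair looks roughly
like the sketch below.
Every helper name here is an illustrative placeholder rather than an actual
kernel symbol; the real code is spread across the scrub and repair source
files.

.. code-block:: c

	/* Outline only; the helper names below are placeholders. */
	int example_rebuild_secondary_metadata(struct example_repair *rr)
	{
		int error;

		/* Set up the in-memory staging structure and the live hooks. */
		error = example_staging_init(rr);
		if (error)
			return error;
		example_hooks_enable(rr);

		/* Drop the object locks so that the filesystem stays writable... */
		example_unlock_target(rr);

		/* ...scan the filesystem, merging live updates as they come in... */
		error = example_scan_filesystem(rr);

		/* ...then relock and atomically commit the new structure. */
		example_relock_target(rr);
		if (!error)
			error = example_commit_new_structure(rr);

		example_hooks_disable(rr);
		example_staging_free(rr);
		return error;
	}
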
466 Finally, the hook, the filesystem scan, and the inode locking model must be
471 primary metadata, but doing so would make it massively more complex and less
478 2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
479 and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
484 method mentioned in Srinivasan and Mohan.
486 build the new structure as quickly as possible; and an auxiliary structure that
491 To avoid conflicts between the index builder and other writer threads, the
494 To avoid duplication of work between the side file and the index builder, side
502 The complexity of such an approach would be very high and perhaps more
505 **Future Work Question**: Can the full scan and live update code used to
509 employed these live scans to build a shadow copy of the metadata and then
513 The live scans and hooks were developed much later.
521 These are often used to speed up resource usage queries, and are many times
526 - Summary counts of free space and inodes
528 - File link counts from directories
530 - Quota resource usage counts
532 Check and repair require full filesystem scans, but resource and lock
536 implementation of the incore counters, and will be treated separately.
537 Check and repair of the other types of summary counters (quota resource counts
538 and file link counts) employ the same filesystem scanning and hooking
543 Inspiration for quota and file link count repair strategies was drawn from
545 Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
546 and Their Indexes"
547 <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
549 Since quotas are non-negative integer counts of resource usage, online
551 track pending changes to the block and inode usage counts in each transaction,
552 and commit those changes to a dquot side file when the transaction commits.
555 Link count checking combines the view deltas and commit step into one because
562 ---------------
565 that may make the feature unsuitable for certain distributors and users.
569 - **Decreased performance**: Adding metadata indices to the filesystem
570 increases the time cost of persisting changes to disk, and the reverse space
571 mapping and directory parent pointers are no exception.
574 reduces the ability of online fsck to find inconsistencies and repair them.
576 - **Incorrect repairs**: As with all software, there might be defects in the
581 and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
583 The xfsprogs build system has a configure option (``--enable-scrub=no``) that
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
592 To reduce the chance that a repair will fail with a dirty transaction and
594 designed to stage and validate all new records before committing the new
597 - **Misbehavior**: Online fsck requires many privileges -- raw IO to block
599 and the ability to perform administrative changes.
604 escaping and reconfiguring the system.
607 - **Fuzz Kiddiez**: There are many people now who seem to think that running
608 automated fuzz testing of ondisk artifacts to find mischievous behavior and
609 spraying exploit code onto the public mailing list for instant zero-day
616 Automated testing should front-load some of the risk while the feature is
630 2. Eliminate those inconsistencies; and
637 of every aspect of a fsck tool until the introduction of low-cost virtual
638 machines with high-IOPS storage.
640 fsck project involves differential analysis against the existing fsck tools and
645 -------------------------------
648 inexpensive and widespread as possible to maximize the scaling advantages of
651 scenarios and hardware setups.
652 This improves code quality by enabling the authors of online fsck to find and
653 fix bugs early, and helps developers of new features to find integration
657 `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
658 functional and regression testing.
660 would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
662 This provides a level of assurance that the kernel and the fsck tools stay in
665 ``xfs_scrub -n`` between each test to ensure that the new checking code
672 This also established a baseline for what can and cannot be repaired offline.
679 ---------------------------------------
689 existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
693 in-kernel validation functions and the ability of the offline fsck tool to
694 detect and eliminate the inconsistent metadata.
707 2. Offline repair (``xfs_repair``) to detect and fix
708 3. Online repair (``xfs_scrub``) to detect and fix
711 -----------------------------------------
717 block in the filesystem to simulate the effects of memory corruption and
745 2. Offline checking (``xfs_repair -n``)
747 4. Online checking (``xfs_scrub -n``)
749 … 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
756 used to discover incorrect repair code and missing functionality for entire
762 These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
763 allow the online fsck developers to compare online fsck against offline fsck,
764 and they enable XFS developers to find deficiencies in the code base.
768 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements…
770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
771 and `improvements in fuzz testing comprehensiveness
772 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`…
775 --------------
781 inconsistencies into the filesystem metadata, and regular workloads should
790 * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
792 * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
793 force-repairing the whole filesystem doesn't cause problems.
794 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
795 freezing and thawing the filesystem.
796 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
797 remounting the filesystem read-only and read-write.
805 …git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-
806 and the `evolution of existing per-function stress testing
807 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stre…
815 A foreground CLI process for online fsck on demand, and a background service
816 that performs autonomous checking and repair.
819 ------------------
827 Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
833 program runtime and consume a lot of bandwidth on older storage hardware.
837 The ``xfs_scrub_all`` program walks the list of mounted filesystems and
843 ------------------
846 provides a suite of `systemd <https://systemd.io/>`_ timers and services that
849 possible, the lowest CPU and IO priority, and in a CPU-constrained single
852 and throughput requirements of customer workloads.
867 * ``xfs_scrub_all.cron`` on non-systemd systems
879 This was performed via ``systemd-analyze security``, after which privileges
881 extent possible with sandboxing and system call filtering; and access to the
882 filesystem tree was restricted to the minimum needed to start the program and
884 The service definition files restrict CPU usage to 80% of one CPU core, and
885 apply as nice of a priority to IO and CPU scheduling as possible.
891 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-se…
894 ----------------
901 download this information into a human-readable format.
910 notifications and initiate a repair?
913 conversation with early adopters and potential downstream users of XFS.
917 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-repo…
918 and
920 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-report…
922 5. Kernel Algorithms and Data Structures
925 This section discusses the key algorithms and data structures of the kernel
926 code that provide the ability to check and repair metadata while the system
934 ------------------------
939 and a log sequence number.
940 When loading a block buffer from disk, the magic number, UUID, owner, and
942 the current filesystem, and that the information contained in the block is
945 that doesn't belong to the filesystem, and the fourth component enables the
952 The logging code maintains the checksum and the log sequence number of the last
954 Checksums are useful for detecting torn writes and other discrepancies that can
955 be introduced between the computer and its storage devices.
965 Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
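
A buffer verifier therefore amounts to a handful of comparisons against the
self-describing header fields.
The following sketch is illustrative and does not mirror any particular
verifier in the XFS codebase; the ``example_*`` names and field layout are
assumptions.

.. code-block:: c

	/* Illustrative verifier; the example_* names are not kernel symbols. */
	static bool example_block_verify(const struct example_block_hdr *hdr,
					 const struct example_geometry *geo,
					 uint64_t daddr)
	{
		if (hdr->magic != cpu_to_be32(EXAMPLE_BLOCK_MAGIC))
			return false;		/* not the structure we wanted */
		if (!uuid_equal(&hdr->uuid, &geo->fs_uuid))
			return false;		/* block from some other filesystem */
		if (be64_to_cpu(hdr->blkno) != daddr)
			return false;		/* misdirected write */
		if (!example_checksum_ok(hdr))
			return false;		/* torn write or bitrot */
		return true;
	}
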
968 ---------------
972 In those days, storage density was expensive, CPU time was scarce, and
977 increase internal redundancy -- either storing nearly identical copies of
978 metadata, or more space-efficient encoding techniques.
985 file metadata (the directory tree, the file block map, and the allocation
987 Like any system that improves redundancy, the reverse-mapping feature increases
990 enabling online fsck and other requested functionality such as free space
991 defragmentation, better media failure reporting, and filesystem shrinking.
993 defeats device-level deduplication because the filesystem requires real
996 +--------------------------------------------------------------------------+
998 +--------------------------------------------------------------------------+
1003 | copy-writes, which age the filesystem prematurely. |
1007 | usage is much less than adding volume management and storage device |
1009 | Perfection of RAID and volume management are best left to existing |
1011 +--------------------------------------------------------------------------+
1015 .. code-block:: c
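
	/*
	 * Sketch of the reverse mapping record described below; the field
	 * names follow the incore struct xfs_rmap_irec in the kernel sources.
	 */
	struct xfs_rmap_irec {
		xfs_agblock_t	rm_startblock;	/* extent start block */
		xfs_extlen_t	rm_blockcount;	/* extent length */
		uint64_t	rm_owner;	/* extent owner */
		uint64_t	rm_offset;	/* offset within the owner */
		unsigned int	rm_flags;	/* state flags */
	};
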
1025 The first two fields capture the location and size of the physical space,
1031 Finally, the flags field provides extra information about the space usage --
1040 Program runtime and ease of resource acquisition are the only real limits to
1061 btree block requires locking the file and searching the entire btree to
1063 Instead, scrub relies on rigorous cross-referencing during the primary space
1066 3. Consistency scans must use non-blocking lock acquisition primitives if the
1077 The details of how these records are staged, written to disk, and committed
1080 Checking and Cross-Referencing
1081 ------------------------------
1084 contained within the structure and its relationship with the rest of the
1091 - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1092 - Is this structure inconsistent with the rest of the system
1094 - Is there so much damage around the filesystem that cross-referencing is not
1096 - Can the structure be optimized to improve performance or reduce the size of
1098 - Does the structure contain data that is not inconsistent but deserves review
1109 itself, and answer these questions:
1111 - Does the block belong to this filesystem?
1113 - Does the block belong to the structure that asked for the read?
1117 - Is the type of data stored in the block within a reasonable range of what
1120 - Does the physical location of the block match the location it was read from?
1122 - Does the block checksum match the data?
1124 The scope of the protections here is very limited -- verifiers can only
1126 and that the storage system is reasonably competent at retrieval.
1128 failed system calls, and in the extreme case, filesystem shutdowns if the
1134 userspace as corruption; during a cross-reference, they are reported as a
1135 failure to cross-reference once the full examination is complete.
1136 Reads satisfied by a buffer already in cache (and hence already verified)
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1145 the buffer cache, and the scrub code itself, depending on the amount of higher
1150 - Does the type of data stored in the block match what scrub is expecting?
1152 - Does the block belong to the owning structure that asked for the read?
1154 - If the block contains records, do the records fit within the block?
1156 - If the block tracks internal free space information, is it consistent with
1159 - Are the records contained inside the block free of obvious corruptions?
1161 Record checks in this category are more rigorous and more time-intensive.
1162 For example, block pointers and inumbers are checked to ensure that they point
1163 within the dynamically allocated parts of an allocation group and within
1165 Names are checked for invalid characters, and flags are checked for invalid
1169 correct order and lack of mergeability (except for file fork mappings).
1174 Validation of Userspace-Controlled Record Attributes
1182 - Superblock fields controlled by mount options
1183 - Filesystem labels
1184 - File timestamps
1185 - File permissions
1186 - File size
1187 - File flags
1188 - Names present in directory entries, extended attribute keys, and filesystem
1190 - Extended attribute key namespaces
1191 - Extended attribute values
1192 - File data block contents
1193 - Quota limits
1194 - Quota timer expiration (if resource usage exceeds the soft limit)
1196 Cross-Referencing Space Metadata
1200 cross-referencing records between metadata structures.
1204 The exact set of cross-referencing is highly dependent on the context of the
1216 Btree blocks undergo the following checks before cross-referencing:
1218 - Does the type of data stored in the block match what scrub is expecting?
1220 - Does the block belong to the owning structure that asked for the read?
1222 - Do the records fit within the block?
1224 - Are the records contained inside the block free of obvious corruptions?
1226 - Are the name hashes in the correct order?
1228 - Do node pointers within the btree point to valid block addresses for the type
1231 - Do child pointers point towards the leaves?
1233 - Do sibling pointers point across the same level?
1235 - For each node block record, does the record key accurately reflect the contents
1238 Space allocation records are cross-referenced as follows:
1240 1. Any space mentioned by any metadata structure is cross-referenced as
1243 - Does the reverse mapping index list only the appropriate owner as the
1246 - Are none of the blocks claimed as free space?
1248 - If these aren't file data blocks, are none of the blocks claimed as space
1251 2. Btree blocks are cross-referenced as follows:
1253 - Everything in class 1 above.
1255 - If there's a parent node block, do the keys listed for this block match the
1258 - Do the sibling pointers point to valid blocks? Of the same level?
1260 - Do the child pointers point to valid blocks? Of the next level down?
1262 3. Free space btree records are cross-referenced as follows:
1264 - Everything in class 1 and 2 above.
1266 - Does the reverse mapping index list no owners of this space?
1268 - Is this space not claimed by the inode index for inodes?
1270 - Is it not mentioned by the reference count index?
1272 - Is there a matching record in the other free space btree?
1274 4. Inode btree records are cross-referenced as follows:
1276 - Everything in class 1 and 2 above.
1278 - Is there a matching record in free inode btree?
1280 - Do cleared bits in the holemask correspond with inode clusters?
1282 - Do set bits in the freemask correspond with inode records with zero link
1285 5. Inode records are cross-referenced as follows:
1287 - Everything in class 1.
1289 - Do all the fields that summarize information about the file forks actually
1292 - Does each inode with zero link count correspond to a record in the free
1295 6. File fork space mapping records are cross-referenced as follows:
1297 - Everything in class 1 and 2 above.
1299 - Is this space not mentioned by the inode btrees?
1301 - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1304 7. Reference count records are cross-referenced as follows:
1306 - Everything in class 1 and 2 above.
1308 - Within the space subkeyspace of the rmap btree (that is to say, all
1309 records mapped to a particular space extent and ignoring the owner info),
1315 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-
1317 …//git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_,
1319 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-ga…
1322 …tps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-r…
1323 and to
1325 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-
1331 Extended attributes implement a key-value store that enables fragments of data
1333 Both the kernel and userspace can access the keys and values, subject to
1334 namespace and privilege restrictions.
1335 Most typically these fragments are metadata about the file -- origins, security
1336 contexts, user-supplied labels, indexing information, etc.
1338 Names can be as long as 255 bytes and can exist in several different
1345 Leaf blocks contain attribute key records that point to the name and the value.
1355 lack of separation between attr blocks and index blocks.
1356 Scrub must read each block mapped by the attr fork and ignore the non-leaf
1374 Checking and Cross-Referencing Directories
1378 constituting the nodes, and directory entries (dirents) constituting the edges.
1380 255-byte sequence (name) to an inumber.
1385 Each non-directory file may have multiple directories pointing to it.
1390 Each data block contains variable-sized records associating a user-provided
1391 name with an inumber and, optionally, a file type.
1393 exists as post-EOF extents) is populated with a block containing free space
1394 information and an index that maps hashes of the dirent names to directory data
1400 If the free space has been separated and the second partition grows again
1431 Checking operations involving :ref:`parents <dirparent>` and
1439 maps user-provided names to improve lookup times by avoiding linear scans.
1440 Internally, it maps a 32-bit hash of the name to a block offset within the
1444 fixed-size metadata records -- each dabtree block contains a magic number, a
1445 checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
1446 The format of leaf and node records is the same -- each entry points to the
1448 leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1456 - Does the block belong to the owning structure that asked for the read?
1458 - Do the records fit within the block?
1460 - Are the records contained inside the block free of obvious corruptions?
1462 - Are the name hashes in the correct order?
1464 - Do node pointers within the dabtree point to valid fork offsets for dabtree
1467 - Do leaf pointers within the dabtree point to valid fork offsets for directory
1470 - Do child pointers point towards the leaves?
1472 - Do sibling pointers point across the same level?
1474 - For each dabtree node record, does the record key accurately reflect the
1477 - For each dabtree leaf record, does the record key accurately reflect the
1480 Cross-Referencing Summary Counters
1484 resource usage, and file link counts.
1490 Cross-referencing these values against the filesystem metadata should be a
1491 simple matter of walking the free space and inode metadata in each AG and the
1495 :ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
1498 Post-Repair Reverification
1502 the new structure, and the results of the health assessment are recorded
1503 internally and returned to the calling process.
1505 of the filesystem and the progress of any repairs.
1507 and correction in the online and offline checking tools.
1510 ------------------------------------
1512 Complex operations can make modifications to multiple per-AG data structures
1520 the metadata are temporarily inconsistent with each other, and rebuilding is
1523 Only online fsck has this requirement of total consistency of AG metadata, and
1530 buffers and finished the work.
1541 :ref:`next section <chain_coordination>`, and details about the solution
1550 uncovered a misinteraction between online fsck and compound transaction chains
1554 the expansion of deferred work items and compound transaction chains when
1555 reverse mapping and reflink were introduced.
1561 extent in AG 7 and then try to free a now superfluous block mapping btree block
1570 It then attaches to the in-memory transaction an action item to schedule
1576 the unmapped space from AG 7 and the block mapping btree (BMBT) block from
1579 an EFI log item from the ``struct xfs_extent_free_item`` object and
1586 of AG 3 to release the former BMBT block and a second physical update to the
1598 Happily, log recovery corrects this inconsistency for us -- when recovery finds
1600 reconstruct the incore state of the intent item and finish it.
1608 In other words, all per-AG metadata updates for an unmapped block must be
1609 completed before the last update to free the extent, and extents should not
1622 and increase parallelism.
1624 During the design phase of the reverse mapping and reflink features, it was
1644 * Freeing any space that was unmapped and not owned by any other file
1654 For copy-on-write updates this is even worse, because this must be done once to
1655 remove the space from a staging area and again to map it into the file!
1658 work items to cover most reverse mapping updates and all refcount updates.
1665 However, online fsck changes the rules -- remember that although physical
1666 updates to per-AG structures are coordinated by locking the buffers for AG
1668 Once scrub acquires resources and takes locks for a data structure, it must do
1672 For example, if a thread performing a copy-on-write has completed a reverse
1674 will appear inconsistent to scrub and an observation of corruption will be
1679 flaw and rejected:
1681 1. Add a higher level lock to allocation groups and require writer threads to
1684 difficult to determine which locks need to be obtained, and in what order,
1690 targeting the same AG and have it hold the AG header buffers locked across
1700 The checking and repair operations must factor these pending operations into
1710 Online fsck uses an atomic intent item counter and lock cycling to coordinate
1714 transaction, and it is decremented after the associated intent done log item is
1717 holding an AG header lock, but per-AG work items cannot be marked done without
1718 locking that AG header buffer to log the physical updates and the intent done
1733 calls ``->finish_item`` to complete it.
1735 4. The ``->finish_item`` implementation logs some changes and calls
1736 ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
1745 For example, a scan of the refcount btree would lock the AGI and AGF header
1749 chains in progress and the operation may proceed.
1761 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
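
Conceptually, the drain is nothing more than an atomic counter paired with a
waitqueue, as in the sketch below.
The structure and function names are illustrative; the kernel's actual
implementation sits behind the ``xfs_defer_drain_*`` interfaces mentioned
above.

.. code-block:: c

	/* Illustrative drain; the names are not the kernel symbols. */
	#include <linux/atomic.h>
	#include <linux/wait.h>

	struct example_drain {
		atomic_t		pending;	/* intents outstanding in this AG */
		wait_queue_head_t	wait;
	};

	/* Called when an intent item is attached to a transaction. */
	static void example_drain_bump(struct example_drain *dr)
	{
		atomic_inc(&dr->pending);
	}

	/* Called after the corresponding intent done item is logged. */
	static void example_drain_drop(struct example_drain *dr)
	{
		if (atomic_dec_and_test(&dr->pending))
			wake_up(&dr->wait);
	}

	/* Scrub waits here until all chains touching the AG have finished. */
	static int example_drain_wait(struct example_drain *dr)
	{
		return wait_event_killable(dr->wait,
				atomic_read(&dr->pending) == 0);
	}
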
1768 Online fsck for XFS separates the regular filesystem from the checking and
1770 However, there are a few parts of online fsck (such as the intent drains, and
1778 to find that no further action is necessary is expensive -- on the author's
1779 computer, this has an overhead of 40-50ns per access.
1784 skip past the sled, which seems to be on the order of less than 1ns and
1790 program that invoked online fsck, and can be amortized if multiple threads
1793 Changing the branch direction requires taking the CPU hotplug lock, and since
1804 - The hooked part of XFS should declare a static-scoped static key that
1809 - When deciding to invoke code that's only used by scrub, the regular
1811 scrub-only hook code if the static key is not enabled.
1813 - The regular filesystem should export helper functions that call
1814 ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
1819 - Scrub functions wanting to turn on scrub-only XFS functionality should call
1828 handle locking AGI and AGF buffers for all scrubber functions.
1829 If it detects a conflict between scrub and the running transactions, it will
1832 return -EDEADLOCK, which should result in the scrub being restarted with the
1834 The scrub setup function should detect that flag, enable the static key, and
1839 Documentation/staging/static-keys.rst.
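
As a minimal sketch of that pattern (the key and hook names below are made up;
only the ``static_branch_*`` calls are the real jump label interfaces):

.. code-block:: c

	/* Hypothetical key and hook names; the pattern follows static-keys.rst. */
	#include <linux/jump_label.h>

	DEFINE_STATIC_KEY_FALSE(example_scrub_hooks_key);

	/* Hot path in the regular filesystem: nearly free when scrub is idle. */
	void example_fs_apply_update(struct example_update *upd)
	{
		if (static_branch_unlikely(&example_scrub_hooks_key))
			example_call_scrub_hooks(upd);
	}

	/* Helpers exported to scrub to flip the branch on and off. */
	void example_scrub_hooks_enable(void)
	{
		static_branch_inc(&example_scrub_hooks_key);
	}

	void example_scrub_hooks_disable(void)
	{
		static_branch_dec(&example_scrub_hooks_key);
	}
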
1844 ----------------------
1847 shadow copy of an ondisk metadata structure in memory and comparing the two
1860 difficult, especially on 32-bit systems.
1863 and eliminate the possibility of indexed lookups.
1873 Fortunately, the Linux kernel already has a facility for byte-addressable and
1875 In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1880 +--------------------------------------------------------------------------+
1882 +--------------------------------------------------------------------------+
1887 | The second edition solved the half-rebuilt structure problem by storing |
1892 +--------------------------------------------------------------------------+
1899 1. Arrays of fixed-sized records (space management btrees, directory and
1902 2. Sparse arrays of fixed-sized records (quotas and link counts)
1904 3. Large binary objects (BLOBs) of variable sizes (directory and extended
1905 attribute names and values)
1918 XFS is very record-based, which suggests that the ability to load and store
1920 To support these cases, a pair of ``xfile_load`` and ``xfile_store``
1921 functions are provided to read and persist objects into an xfile that treat any
1932 tmpfs can only push a pagecache folio to the swap cache if the folio is neither
1936 folio and mapping it into kernel address space. Object load and store uses this
1939 mapping it into kernel address space, and dropping the folio lock.
1943 The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to
1944 retrieve the (locked) folio that backs part of an xfile and to release it.
1946 :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
1954 must never be mapped into process file descriptor tables, and their pages must
1959 xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1961 page, and release the pages.
1964 In other words, xfiles ignore the VFS read and write code paths to avoid
1965 having to create a dummy ``struct kiocb`` and to avoid taking inode and
1967 tmpfs cannot be frozen, and xfiles must not be exposed to userspace.
1971 For example, if a scrub function stores scan results in an xfile and needs
1977 Arrays of Fixed-Sized Records
1981 counts, file fork space, and reverse mappings) consists of a set of fixed-size
1983 Directories have a set of fixed-size dirent records that point to the names,
1984 and extended attributes have a set of fixed-size attribute keys that point to
1985 names and values.
1986 Quota counters and file link counters index records with numbers.
1987 During a repair, scrub needs to stage new records during the gathering step and
1990 Although this requirement can be satisfied by calling the read and write
1993 iterator functions, and to deal with sparse records and sorting.
1994 The ``xfarray`` abstraction presents a linear array for fixed-size records atop
1995 the byte-accessible xfile.
2003 Iteration of records is assumed to be necessary for all cases and will be
2007 Gaps may exist between records, and a record may be updated multiple times
2011 Access to array elements is performed programmatically via ``xfarray_load`` and
2012 ``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2013 provide loading and storing of array elements at arbitrary array indices.
2014 Gaps are defined to be null records, and null records are defined to be a
2021 and do not require multiple updates to a record.
2022 The typical use case here is rebuilding space btrees and key/value btrees.
2034 at any time, and uniqueness of records is left to callers.
2036 null record slot in the bag; and the ``xfarray_unset`` function removes a
2040 `big in-memory array
2041 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
2050 .. code-block:: c
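
	/*
	 * Illustrative sketch only: visit every element of an xfarray using
	 * the load interface described above.  The exact iteration helpers
	 * provided by the kernel may differ; nr_records is assumed to be
	 * tracked by the caller, and error handling is abbreviated.
	 */
	struct example_rec	rec;	/* hypothetical fixed-size record */
	xfarray_idx_t		idx;
	int			error;

	for (idx = 0; idx < nr_records; idx++) {
		error = xfarray_load(array, idx, &rec);
		if (error)
			break;
		/* examine rec; callers decide how to treat null records */
	}
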
2068 .. code-block:: c
2092 quicksort and a heapsort subalgorithm in the spirit of
2093 `Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
2110 of the xfarray into a memory buffer, and sorting the buffer.
2116 A good pivot splits the set to sort in half, leading to the divide and conquer
2121 records into a memory buffer and using the kernel heapsort to identify the
2127 of the triads, and then sort the middle value of each triad to determine the
2131 memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2134 low-effort robust (resistant) location in large samples`, in *Contributions to
2135 Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
2138 The partitioning of quicksort is fairly textbook -- rearrange the record
2139 subset around the pivot, then set up the current and next stack frames to
2140 sort with the larger and the smaller halves of the pivot, respectively.
2143 As a final performance optimization, the hi and lo scanning phase of quicksort
2154 Extended attributes and directories add an additional requirement for staging
2157 and each extended attribute needs to store both the attribute name and value.
2158 The names, keys, and values can consume a large amount of memory, so the
2162 Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
2163 and persist objects.
2166 The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
2169 The details of repairing directories and extended attributes will be discussed
2177 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ serie…
2181 In-Memory B+Trees
2185 checking and repairing of secondary metadata commonly requires coordination
2186 between a live metadata scan of the filesystem and writer threads that are
2190 This *can* be done by appending concurrent updates into a separate log file and
2193 Another option is to skip the side-log and commit live updates from the
2202 Because xfarrays are not indexed and do not enforce record ordering, they
2204 Conveniently, however, XFS has a library to create and maintain ordered reverse
2211 The XFS buffer cache specializes in abstracting IO to block-oriented address
2218 `in-memory btree
2219 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2228 per-AG structure.
2230 pages from the xfile and "write" cached pages back to the xfile.
2233 With this adaptation in place, users of the xfile-backed buffer cache use
2234 exactly the same APIs as users of the disk-backed buffer cache.
2235 The separation between xfile and buffer cache implies higher memory usage since
2237 updates to an in-memory btree.
2243 Space management for an xfile is very simple -- each btree block is one memory
2245 These blocks use the same header format as an on-disk btree, but the in-memory
2247 corruption-prone than regular DRAM.
2251 The header describes the owner, height, and the block number of the root
2256 Preallocate space for the block with ``xfile_prealloc``, and hand back the
2272 3. Pass the buffer cache target, buffer ops, and other information to
2273 ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an
2286 and to update the in-memory btree.
2293 buffer target, and then destroy the xfile to release all resources.
2301 structure, the ephemeral nature of the in-memory btree block storage presents
2307 log transactions back into the filesystem, and certainly won't exist during
2310 remove the buffer log items from the transaction and write the updates into the
2313 The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
2334 ------------------------------
2337 structures by creating a new btree and adding observations individually.
2339 the incore records to be sorted prior to commit, but was very slow and leaked
2345 rebuilding a btree index from a collection of records -- bulk btree loading.
2346 This was implemented rather inefficiently code-wise, since ``xfs_repair``
2347 had separate copy-pasted implementations for each btree type.
2350 were taken, and the four were refactored into a single generic btree bulk
2352 Those notes in turn have been refreshed and are presented below.
2358 be stored in the new btree, and sort the records.
2360 btree from the record set, the type of btree, and any load factor preferences.
2363 First, the geometry computation computes the minimum and maximum records that
2364 will fit in a leaf block from the size of a btree block and the size of the
2368 maxrecs = (block_size - header_size) / record_size
2376 This must be at least minrecs and no more than maxrecs.
2392 btree key and pointer as the record size::
2394 maxrecs = (block_size - header_size) / (key_size + ptr_size)
2413 - For AG-rooted btrees, this level is the root level, so the height of the new
2414 tree is ``level + 1`` and the space needed is the summation of the number of
2417 - For inode-rooted btrees where the records in the top level do not fit in the
2419 summation of the number of blocks on each level, and the inode fork points to
2422 - For inode-rooted btrees where the records in the top level can be stored in
2424 height is ``level + 1``, and the space needed is one less than the summation
2426 This only becomes relevant when non-bmap btrees gain the ability to root in
2427 an inode, which is a future patchset and only included here for completeness.
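
As a worked example (with made-up sizes, not the kernel's actual geometry
code), the sketch below applies those formulas to one million records stored in
4096-byte blocks with a 56-byte header, 24-byte records, and 8-byte keys and
pointers:

.. code-block:: c

	/* Worked example with made-up sizes; not the kernel geometry code. */
	static uint64_t example_btree_geometry(unsigned int *heightp)
	{
		unsigned int	block_size = 4096, header_size = 56;
		unsigned int	record_size = 24, key_size = 8, ptr_size = 8;
		uint64_t	nr_records = 1000000;

		unsigned int	leaf_maxrecs = (block_size - header_size) /
					       record_size;
		unsigned int	node_maxrecs = (block_size - header_size) /
					       (key_size + ptr_size);
		/* leaf_maxrecs = 168 and node_maxrecs = 252 for these sizes */

		uint64_t	level_blocks = DIV_ROUND_UP(nr_records, leaf_maxrecs);
		uint64_t	total_blocks = level_blocks;
		unsigned int	height = 1;

		/* Each higher level holds one key/pointer per block beneath it. */
		while (level_blocks > 1) {
			level_blocks = DIV_ROUND_UP(level_blocks, node_maxrecs);
			total_blocks += level_blocks;
			height++;
		}

		/* 1,000,000 records: 5953 + 24 + 1 = 5978 blocks, height 3 */
		*heightp = height;
		return total_blocks;
	}
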
2438 Intent (EFI) item in the same transaction as each space allocation and attaches
2439 its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
2444 extent, it updates the in-memory reservation to reflect the claimed space.
2450 It's possible that other parts of the system will remain busy and push the head
2456 EFD for the old EFI and new EFI at the head.
2459 EFIs have a role to play during the commit and reaping phases; please see the
2460 next section and the section about :ref:`reaping<reaping>` for more details.
2464 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2465 and the
2467 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2473 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2475 rest of the block with records, and adds the new leaf block to a list of
2493 to compute the relevant keys and write them into the parent node::
2528 Blocks are queued for IO using a delwri list and written in one large batch
2534 clean up the space reservations that were made for the new btree, and reap the
2549 c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
2552 call ``xrep_defer_finish`` to clear out the deferred work and obtain a
2555 3. Clear out the deferred work a second time to finish the commit and clean
2558 The transaction rolling in steps 2c and 3 represents a weakness in the repair
2559 algorithm, because a log flush and a crash before the end of the reap step can
2573 records from the inode chunk information and a bitmap of the old inode btree
2585 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2610 ondisk inodes and to decide if the file is allocated
2619 This xfarray is walked twice during the btree creation step -- once to populate
2620 the inode btree with all inode chunk records, and a second time to populate the
2621 free inode btree with records for chunks that have free non-sparse inodes.
2628 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2638 physical blocks, and that the rectangles can be laid down to allow them to
2642 In other words, the record emission stimulus is level-triggered::
2653 Extents being used to stage copy-on-write operations should be the only records
2655 Single-owner file blocks aren't recorded in either the free space or the
2661 records for any space having more than one reverse mapping and add them to
2664 because these are extents allocated to stage a copy on write operation and
2679 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2689 - Until the reverse mapping btree runs out of records:
2691 - Retrieve the next record from the btree and put it in a bag.
2693 - Collect all records with the same starting block from the btree and put
2696 - While the bag isn't empty:
2698 - Among the mappings in the bag, compute the lowest block number where the
2704 - Remove all mappings from the bag that end at this position.
2706 - Collect all reverse mappings that start at this position from the btree
2707 and put them in the bag.
2709 - If the size of the bag changed and is greater than one, create a new
2713 The bag-like structure in this case is a type 2 xfarray as discussed in the
2715 Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
2721 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2730 records from the reverse mapping records for that inode and fork.
2741 records to that immediate area and skip to step 8.
2745 6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2754 immediate areas if the data and attr forks are not both in BMBT format.
2762 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2768 ---------------------------
2771 suspect, there is a question of how to find and dispose of the blocks that
2778 the files and directories that it decides not to clear, hence it can build new
2779 structures in the discovered free space and avoid the question of reaping.
2803 Repairs for file-based metadata such as extended attributes, directories,
2804 symbolic links, quota files and realtime bitmaps are performed by building a
2805 new structure attached to a temporary file and exchanging all mappings in the
2816 - If zero, the block has a single owner and can be freed.
2818 - If not, the block is part of a crosslinked structure and must not be
2825 structure being repaired and move on to the next region.
2830 8. Free the region and move on.
2848 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2854 Old reference count and inode btrees are the easiest to reap because they have
2856 btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
2860 1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.
2869 old data structures and hence is a candidate for reaping.
2893 5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2919 information changes the number of free space records, repair must re-estimate
2923 are created for the reserved blocks and that unused reserved blocks are
2925 Deferred rmap and freeing operations are used to ensure that this transition
2930 Blocks for the free space btrees and the reverse mapping btrees are supplied by
2940 blocks and the AGFL blocks (``rmap_agfl_bitmap``).
2948 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2958 btree blocks, and the reverse mapping btree blocks all have reverse mapping
2960 The full process of gathering reverse mapping records and building a new btree
2966 corresponding to the gaps in the new rmap btree records, and then clearing the
2967 bits corresponding to extents in the free space btrees and the current AGFL
2977 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2988 2. Subtract the space used by the two free space btrees and the rmap btree.
2991 other owner, to avoid re-adding crosslinked blocks to the AGFL.
2995 5. The next operation to fix the freelist will right-size the list.
3000 --------------------
3003 ("dinodes") and an in-memory ("cached") representation.
3006 badly damaged that the filesystem cannot load the in-memory representation.
3008 specialized resource acquisition functions that return either the in-memory
3013 is necessary to get the in-core structure loaded.
3014 This means fixing whatever is caught by the inode cluster buffer and inode fork
3015 verifiers, and retrying the ``iget`` operation.
3018 Once the in-memory representation is loaded, repair can lock the inode and can
3019 subject it to comprehensive checks, repairs, and optimizations.
3020 Most inode attributes are easy to check and constrain, or are user-controlled
3022 Dealing with the data and attr fork extent counts and the file block counts is
3024 forks, or if that fails, leaving the fields invalid and waiting for the fork
3029 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
3033 --------------------
3035 Similar to inodes, quota records ("dquots") also have both ondisk records and
3036 an in-memory representation, and hence are subject to the same cache coherency
3041 whatever is necessary to get the in-core structure loaded.
3042 Once the in-memory representation is loaded, the only attributes needing
3043 checking are obviously bad limits and timer values.
3045 Quota usage counters are checked, repaired, and discussed separately in the
3050 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
3056 --------------------------------
3059 as free blocks, free inodes, and allocated inodes.
3060 This information could be compiled by walking the free space and inode indexes,
3066 Writer threads reserve the worst-case quantities of resources from the
3067 incore counter and give back whatever they don't use at commit time.
3077 global incore counter and can satisfy small allocations from the local batch.
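
The incore side of this is the kernel's generic percpu counter.
A rough sketch of the reserve and give-back pattern described above (the
function names and sizes are made up) looks like:

.. code-block:: c

	/* Illustrative reserve/give-back pattern around a percpu counter. */
	#include <linux/percpu_counter.h>

	static struct percpu_counter example_free_blocks;

	static void example_reserve_and_commit(int64_t worst_case, int64_t used)
	{
		/* Writer: take the worst-case reservation up front... */
		percpu_counter_add(&example_free_blocks, -worst_case);

		/* ...do the work, then give back whatever was not consumed. */
		percpu_counter_add(&example_free_blocks, worst_case - used);
	}

	static int64_t example_count_free_blocks(void)
	{
		/* An exact answer requires summing every CPU's local batch. */
		return percpu_counter_sum(&example_free_blocks);
	}
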
3079 The high-performance nature of the summary counters makes it difficult for
3088 For repairs, the in-memory counters must be stabilized while walking the
3089 filesystem metadata to get an accurate reading and install it in the percpu
3094 garbage collection threads, and it must wait for existing writer programs to
3097 inode btrees, and the realtime bitmap to compute the correct value of all
3102 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3106 - It does not quiesce the log.
3109 long enough to check and correct the summary counters.
3111 +--------------------------------------------------------------------------+
3113 +--------------------------------------------------------------------------+
3120 | - Other programs can unfreeze the filesystem without our knowledge. |
3121 | This leads to incorrect scan results and incorrect repairs. |
3123 | - Adding an extra lock to prevent others from thawing the filesystem |
3124 | required the addition of a ``->freeze_super`` function to wrap |
3127 | the VFS ``freeze_super`` and ``thaw_super`` functions can drop the |
3128 | last reference to the VFS superblock, and any subsequent access |
3136 | - The log need not be quiesced to check the summary counters, but a VFS |
3140 | - Quiescing the log means that XFS flushes the (possibly incorrect) |
3143 | - A bug in the VFS meant that freeze could complete even when |
3144 | sync_filesystem fails to flush the filesystem and returns an error. |
3146 +--------------------------------------------------------------------------+
3150 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3154 ---------------------
3157 entire filesystem to record observations and comparing the observations against
3160 observations to disk in a replacement structure and committing it atomically.
3167 - How does scrub manage the scan while it is collecting data?
3169 - How does the scan keep abreast of changes being made to the system by other
3179 (*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3184 Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3186 <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3187 Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3188 1913-4.
3192 They form a continuous keyspace that can be expressed as a 64-bit integer,
3195 ``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
3206 Advancing the scan cursor is a multi-step process encapsulated in
3214 2. Use the per-AG inode btree to look up the next inumber after the one that
3228 c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
3249 and that it has stabilized this next inode so that it cannot disappear from
3252 6. Drop the AGI lock and return the incore inode to the caller.
3268 d. Unlock and release the inode.
3277 coordinator must release the AGI and push the main filesystem to get the inode
3282 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3286 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3299 and the initialization of the actual ondisk inode.
3305 - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3308 - Speculative preallocations need to be unreserved.
3310 - An unlinked file may have lost its last reference, in which case the entire
3312 the ondisk metadata and freeing the inode.
3315 Inactivation has two parts -- the VFS part, which initiates writeback on all
3316 dirty file pages, and the XFS part, which cleans up XFS-specific information
3317 and frees the inode if it was unlinked.
3337 7. Space on the data and realtime devices for the transaction.
3352 an object that normally is acquired in a later stage of the locking order, and
3353 then decide to cross-reference the object with an object that is acquired
3358 iget and irele During a Scrub
3362 context, and possibly with resources already locked and bound to it.
3369 save time if another process re-opens the file before the system runs out
3370 of memory and frees it.
3371 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3377 transaction, and XFS does not support nesting transactions.
3386 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3388 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`…
3395 In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
3396 in a well-known order: parent → child when updating the directory tree, and
3402 Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
3408 scanner, the scrub process holds the IOLOCK of the file being scanned and it
3411 cannot use the regular inode locking functions and avoid becoming trapped in an
3414 Solving both of these problems is straightforward -- any time online fsck
3417 If the trylock fails, scrub drops all inode locks and uses trylock loops to
3422 resource being scrubbed before and after the lock cycle to detect changes and
3432 parent directory, and that the parent directory contains exactly one dirent
3434 Fully validating this relationship (and repairing it if possible) requires a
3435 walk of every directory on the filesystem while holding the child locked, and
3440 if the scanner fails to lock a parent, it can drop and relock both the child
3441 and the prospective parent.
3443 rename operation must have changed the child's parentage, and the scan can
3448 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
3461 filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
3476 Regular blocking notifier chains use a rwsem and seem to have a much lower
3477 overhead for single-threaded applications.
3478 However, it may turn out that the combination of blocking chains and static
3483 - A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3484 a well-known incore filesystem object.
3486 - Each hook must define an action code and a structure containing more context
3489 - Hook providers should provide appropriate wrapper functions and structs
3490 around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
3493 - A callsite in the regular filesystem code must be chosen to call
3494 ``xfs_hooks_call`` with the action code and data structure.
3495 This place should be adjacent to (and not earlier than) the place where
3498 handle sleeping and should not be vulnerable to memory reclaim or locking
3501 caller and the callee.
3503 - The online fsck function should define a structure to hold scan data, a lock
3504 to coordinate access to the scan data, and a ``struct xfs_hook`` object.
3505 The scanner function and the regular filesystem code must acquire resources
3508 - The online fsck code must contain a C function to catch the hook action code
3509 and data structure.
3513 - Prior to unlocking inodes to start the scan, online fsck must call
3514 ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
3517 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3529 The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
3559 checking code and the code making an update to the filesystem:
3561 - Prior to invoking the notifier call chain, the filesystem function being
3565 - The scanning function and the scrub hook function must coordinate access to
3568 - Scrub hook functions must not add the live update information to the scan
3573 - Scrub hook functions must not change the caller's state, including the
3578 - The hook function can abort the inode scan to avoid breaking the other rules.
3582 - ``xchk_iscan_start`` starts a scan
3584 - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
3587 - ``xchk_iscan_want_live_update`` to decide if an inode has already been
3590 in-memory scan information.
3592 - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
3595 - ``xchk_iscan_teardown`` to finish the scan
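
To illustrate the coordination rule behind ``xchk_iscan_want_live_update``,
here is a minimal, self-contained userspace model of that decision.  The
``example_iscan`` structure and both function names are hypothetical
stand-ins for the real kernel objects and are only a sketch of the idea.

.. code-block:: c

    /*
     * Model of the live-update decision: updates are folded into the shadow
     * data only for inodes that the scan has already visited; anything past
     * the cursor will be observed when the scan reaches it.
     */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct example_iscan {
            pthread_mutex_t lock;
            uint64_t        visited;        /* highest inumber already scanned */
            bool            aborted;
    };

    static bool example_iscan_want_live_update(struct example_iscan *iscan,
                                               uint64_t ino)
    {
            bool ret;

            pthread_mutex_lock(&iscan->lock);
            ret = !iscan->aborted && ino <= iscan->visited;
            pthread_mutex_unlock(&iscan->lock);
            return ret;
    }

    static void example_iscan_mark_visited(struct example_iscan *iscan,
                                           uint64_t ino)
    {
            pthread_mutex_lock(&iscan->lock);
            if (ino > iscan->visited)
                    iscan->visited = ino;
            pthread_mutex_unlock(&iscan->lock);
    }

A live update for an inode beyond the visited cursor is simply dropped,
because the scanner will observe that inode's final state when the cursor
reaches it.
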
3599 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3607 It is useful to compare the mount time quotacheck code to the online repair
3613 dquots will actually load, and zero the resource usage counters in the
3629 index implemented with a sparse ``xfarray``, and only writes to the real dquots
3634 1. The inodes involved are joined and locked to a transaction.
3651 b. Quota usage changes are logged and unused reservation is given back to
3656 For online quotacheck, hooks are placed in steps 2 and 4.
3669 a. Grab and lock the inode.
3672 realtime blocks) and add that to the shadow dquots for the user, group,
3673 and project ids associated with the inode.
3675 c. Unlock and release the inode.
3679 a. Grab and lock the dquot.
3681 b. Check the dquot against the shadow dquots created by the scan and updated
3686 If repairs are desired, the real and shadow dquots are locked and their
3691 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3701 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3728 bumplink and droplink.
3732 Non-directories never have children of any kind.
3734 links pointing to child subdirectories and the number of dotdot entries
3738 both the inode and the shadow data, and comparing the link counts.
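
A rough sketch of what a shadow link count record might look like follows,
under the simplifying assumption that each record only tracks parent links
and child-subdirectory backlinks; the structure, field names, and the exact
accounting rules here are illustrative, not the kernel's.

.. code-block:: c

    /* Illustrative shadow link count record and comparison helper. */
    #include <stdbool.h>
    #include <stdint.h>

    struct example_nlink_rec {
            uint64_t        ino;
            uint32_t        parents;        /* dirents pointing at this file */
            uint32_t        children;       /* ".." entries in child subdirs */
    };

    /*
     * A directory's observed link count is its parent links plus one "."
     * entry plus one ".." backlink per child subdirectory; other files are
     * linked only by their parents.  (Simplified; ignores the root case.)
     */
    static uint32_t example_observed_nlink(const struct example_nlink_rec *rec,
                                           bool is_dir)
    {
            if (is_dir)
                    return rec->parents + 1 + rec->children;
            return rec->parents;
    }

    static bool example_nlink_ok(const struct example_nlink_rec *rec,
                                 bool is_dir, uint32_t ondisk_nlink)
    {
            return example_observed_nlink(rec, is_dir) == ondisk_nlink;
    }
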
3749 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
3759 and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3760 The primary advantage of this approach is the simplicity and modularity of the
3761 repair code -- code and data are entirely contained within the scrub module,
3762 do not require hooks in the main filesystem, and are usually the most efficient
3764 A secondary advantage of this repair approach is atomicity -- once the kernel
3766 the kernel finishes repairing and revalidating the metadata.
3773 every file in the filesystem, and the filesystem cannot stop.
3774 Therefore, rmap repair foregoes atomicity between scrub and repair.
3776 <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3781 2. While holding the locks on the AGI and AGF buffers acquired during the
3783 staging extents, and the internal log.
3795 a. Create a btree cursor for the in-memory btree.
3797 b. Use the rmap code to add the record to the in-memory btree.
3806 a. Create a btree cursor for the in-memory btree.
3808 b. Replay the operation into the in-memory btree.
3815 7. When the inode scan finishes, create a new scrub transaction and relock the
3823 10. Perform the usual btree bulk loading and commit to install the new rmap
3833 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3837 --------------------------------------------
3840 extended attributes, symbolic link targets, free space bitmaps and summary
3841 information for the realtime volume, and quota records.
3842 File forks map 64-bit logical file fork space extents to physical storage space
3843 extents, similar to how a memory management unit maps 64-bit virtual addresses
3845 Therefore, file-based tree structures (such as directories and extended
3847 to other blocks mapped within that same address space, and file-based linear
3848 structures (such as bitmaps and quota records) compute array element offsets in
3853 Therefore, online repair of file-based metadata creates a temporary file in
3855 temporary file, and atomically exchanges all file fork mappings (and hence the
3861 **Note**: All space usage and inode indices in the filesystem *must* be
3867 field of the block headers to match the file being repaired and not the
3869 The directory, extended attribute, and symbolic link functions were all
3872 There is a downside to the reaping process -- if the system crashes during the
3873 reap phase and the fork extents are crosslinked, the iunlink processing will
3874 fail because freeing space will find the extra reverse mappings and abort.
3878 They are not linked into a directory and the entire file will be reaped when
3882 opened by handle, and they must never be linked into the directory tree.
3884 +--------------------------------------------------------------------------+
3886 +--------------------------------------------------------------------------+
3889 | fork would be reaped; and then a new structure would be built in its |
3895 | offset in the fork from the salvage data, reaping the old extents, and |
3901 | - Array structures are linearly addressed, and the regular filesystem |
3905 | - Extended attributes are allowed to use the entire attr fork offset |
3908 | - Even if repair could build an alternate copy of a data structure in a |
3914 | - A crash after construction of the secondary tree but before the range |
3918 | - Reaping blocks after a repair is not a simple operation, and |
3922 | - Directory entry blocks and quota records record the file fork offset |
3929 | - Each block in a directory or extended attributes btree index contains |
3930 | sibling and child block pointers. |
3938 +--------------------------------------------------------------------------+
3945 This allocates an inode, marks the in-core inode private, and attaches it to
3948 and must be kept private.
3950 Temporary files only use two inode locks: the IOLOCK and the ILOCK.
3953 The usage patterns of these two locks are the same as for any other XFS file --
3954 access to file data are controlled via the IOLOCK, and access to file metadata
3956 Locking helpers are provided so that the temporary file and its lock state can
3967 2. The regular directory, symbolic link, and extended attribute functions can
3976 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
3980 -----------------------------
3984 It is not possible to swap the inumbers of two files, so instead the new
3986 This suggests the need for the ability to swap extents, but the existing extent
3990 a. When the reverse-mapping btree is enabled, the swap code must keep the
3992 Therefore, it can only exchange one mapping per transaction, and each
3995 b. Reverse-mapping is critical for the operation of online fsck, so the old
4001 For this use case, an incomplete exchange will not result in a user-visible
4004 d. Online repair needs to swap the contents of two files that are by definition
4006 For directory and xattr repairs, the user-visible contents might be the
4009 e. Old blocks in the file may be cross-linked with another structure and must
4010 not reappear if the system goes down mid-repair.
4012 These problems are overcome by creating a new deferred operation and a new type
4016 the reverse-mapping extent swap code, but records intermediate progress in the
4030 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
4033 +--------------------------------------------------------------------------+
4034 | **Sidebar: Using Log-Incompatible Feature Flags** |
4035 +--------------------------------------------------------------------------+
4057 | functions to obtain the log feature and call |
4064 | and the MMAPLOCK, but before allocating the transaction. |
4070 | Log-assisted extended attribute updates and file content exchanges both  |
4071 | use log incompat features and provide convenience wrappers around the |
4073 +--------------------------------------------------------------------------+
4081 There are likely to be many extent mappings in each fork, and the edges of
4086 This is roughly the format of the new deferred exchange-mapping work item:
4088 .. code-block:: c
4109 offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
4114 incremented and the blockcount field is decremented to reflect the progress
4117 mappings instead of the data fork and other work to be done after the exchange.
4129 This will log an extent swap intent item to the transaction for the deferred
4134 a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
4135 ``xmi_startoff2``, respectively, and compute the longest extent that can
4140 Mutual holes, unwritten extents, and extent mappings to the same physical
4144 from file 1 as "map1", and the mapping that came from file 2 as "map2".
4154 f. Log the block, quota, and extent count updates for both files.
4162 This quantity is ``(map1.br_startoff + map1.br_blockcount -
4165 j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
4166 by the number of blocks computed in the previous step, and decrease
4175 The operation manager completes the deferred work in steps 3b-3e before
4178 4. Perform any post-processing.
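
The progress bookkeeping described above can be modeled in a few lines of
userspace C.  The structure below is only a stand-in for the real intent
item, and the fixed 16-block step is an arbitrary assumption used purely to
show how the offsets advance and the remaining block count shrinks so that
recovery can resume from the logged intent.

.. code-block:: c

    /* Userspace model of the deferred mapping exchange progress tracking. */
    #include <stdint.h>
    #include <stdio.h>

    struct example_xmi {
            uint64_t        startoff1;      /* next unexchanged block, file 1 */
            uint64_t        startoff2;      /* next unexchanged block, file 2 */
            uint64_t        blockcount;     /* blocks left to exchange */
    };

    /* Pretend each pass manages to exchange up to 16 blocks. */
    static uint64_t example_exchange_one_step(struct example_xmi *xmi)
    {
            uint64_t len = xmi->blockcount < 16 ? xmi->blockcount : 16;

            /* ...unmap and remap "len" blocks in both files here... */
            xmi->startoff1 += len;
            xmi->startoff2 += len;
            xmi->blockcount -= len;
            return len;
    }

    int main(void)
    {
            struct example_xmi xmi = { .startoff1 = 0, .startoff2 = 0,
                                       .blockcount = 37 };

            while (xmi.blockcount > 0)
                    printf("exchanged %llu blocks, %llu left\n",
                           (unsigned long long)example_exchange_one_step(&xmi),
                           (unsigned long long)xmi.blockcount);
            return 0;
    }
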
4182 find the most recent unfinished mapping exchange log intent item and restart
4185 will either see the old broken structure or the new one, and never a mishmash of
4194 operation begins, and directio writes to be quiesced.
4196 maximum amount of disk space and quota that can be consumed on behalf of both
4197 files in the operation, and reserve that quantity of resources to avoid an
4201 - Data device blocks needed to handle the repeated updates to the fork
4203 - Change in data and realtime block counts for both files.
4204 - Increase in quota usage for both files, if the two files do not share the
4206 - The number of extent mappings that will be added to each file.
4207 - Whether or not there are partially written realtime extents.
4222 Extended attributes, symbolic links, and directories can set the fork format to
4223 "local" and treat the fork as a literal area for data storage.
4226 - If both forks are in local format and the fork areas are large enough, the
4228 forks, and committing.
4232 - If both forks map blocks, then the regular atomic file mapping exchange is
4235 - Otherwise, only one fork is in local format.
4247 Extended attributes and directories stamp the owning inode into every block,
4256 extent reaping <reaping>` mechanism that is done post-repair.
4258 iunlink processing at the end of recovery will free both the temporary file and
4260 However, this iunlink processing omits the cross-link detection of online
4261 repair, and is not completely foolproof.
4278 the appropriate resource reservations, locks, and fill out a ``struct
4293 the filesystem block size between 4KiB and 1GiB in size.
4302 partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
4305 and can satisfy a power-of-two allocation request.
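
Assuming the layout just described, a summary counter can be located with
simple arithmetic.  The helper below is an illustrative sketch, not the
kernel's macro; its name and parameters are assumptions for this example.

.. code-block:: c

    /*
     * Sketch of locating a realtime summary counter: one section per
     * log2(extent length), with one 32-bit counter per realtime bitmap
     * block in each section.
     */
    #include <stdint.h>

    static inline uint64_t example_rtsummary_offset(unsigned int log2_len,
                                                    uint64_t rbmblocks,
                                                    uint64_t rbmblock)
    {
            /* section for this extent size, then the counter for this block */
            return (uint64_t)log2_len * rbmblocks + rbmblock;
    }

During repair, the shadow copy of the summary file can be updated by
incrementing the 32-bit counter at this offset for every free extent found
while walking the realtime bitmap.
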
4309 1. Take the ILOCK of both the realtime bitmap and summary files.
4318 c. Increment it, and write it back to the xfile.
4320 3. Compare the contents of the xfile against the ondisk file.
4323 and use atomic mapping exchange to commit the new contents.
4328 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
4334 In XFS, extended attributes are implemented as a namespaced name-value store.
4339 index blocks, and remote value blocks are intermixed.
4340 Attribute leaf blocks contain variable-sized records that associate
4341 user-provided names with the user-provided values.
4342 Values larger than a block are allocated separate extents and written there.
4356 1. Check the name for problems, and ignore the name if there are any.
4359 If that succeeds, add the name and value to the staging xfarray and
4362 2. If the memory usage of the xfarray and xfblob exceed a certain amount of
4363 memory or there are no more attr fork blocks to examine, unlock the file and
4366 3. Use atomic file mapping exchange to exchange the new and old extended
4374 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
4378 ------------------
4383 and then it scans all directories to establish parentage of those linked files.
4384 Damaged files and directories are zapped, and files with no parent are
4389 blocks and salvage any dirents that look plausible, correct link counts, and
4393 and moving orphans to the ``/lost+found`` directory.
4413 i. Check the name for problems, and ignore the name if there are any.
4415 ii. Retrieve the inumber and grab the inode.
4416 If that succeeds, add the name, inode number, and file type to the
4417 staging xfarray and xfblob.
4419 3. If the memory usage of the xfarray and xfblob exceed a certain amount of
4421 directory and add the staged dirents into the temporary directory.
4424 4. Use atomic file mapping exchange to exchange the new and old directory
4441 directory and the dentry can be purged from the cache.
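
For illustration, a staged dirent produced by the salvage loop above might
be recorded along the lines of the sketch below.  The type and field names
are assumptions that model the xfarray/xfblob split rather than the actual
kernel structures.

.. code-block:: c

    /*
     * Illustrative record for a salvaged directory entry: the name bytes go
     * into an xfblob-like byte store that hands back a cookie, and the
     * fixed-size parts go into an xfarray-like array.
     */
    #include <stdint.h>

    typedef uint64_t example_blob_cookie;   /* handle from the blob store */

    struct example_staged_dirent {
            example_blob_cookie     name_cookie;    /* where the name bytes live */
            uint8_t                 namelen;
            uint8_t                 ftype;          /* DT_REG, DT_DIR, ... */
            uint64_t                ino;            /* target inode number */
    };

When the staging structures grow too large, or the scan finishes, each
record is replayed into the temporary directory by loading the name via its
cookie and calling the normal directory-entry creation code.
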
4453 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4475 each parent pointer is a directory and that it contains a dirent matching
4477 Both online and offline repair can use this strategy.
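
The following minimal sketch models that cross-check; the lookup callback
and the function names are hypothetical placeholders for a real directory
lookup, used here only to make the consistency rule concrete.

.. code-block:: c

    /*
     * A parent pointer (parent_ino, name) on a child is consistent if the
     * parent directory really contains a dirent with that name pointing
     * back at the child.
     */
    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true and fills *ino if @parent_ino has a dirent named @name. */
    typedef bool (*example_dir_lookup_fn)(uint64_t parent_ino, const char *name,
                                          uint64_t *ino);

    static bool example_pptr_matches_dirent(uint64_t child_ino,
                                            uint64_t parent_ino,
                                            const char *name,
                                            example_dir_lookup_fn lookup)
    {
            uint64_t found;

            if (!lookup(parent_ino, name, &found))
                    return false;           /* dangling parent pointer */
            return found == child_ino;      /* must point back at the child */
    }
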
4479 +--------------------------------------------------------------------------+
4481 +--------------------------------------------------------------------------+
4487 | Unfortunately, this early implementation had major shortcomings and was |
4496 | Checking and repairs were performed on mounted filesystems without |
4510 | Allison Henderson, Chandan Babu, and Catherine Hoang are working on a |
4515 | commit a dirent update and a parent pointer update in the same |
4517 | Chandan increased the maximum extent counts of both data and attribute |
4540 | and amending the xattr code to support updating an xattr key and |
4544 | 3. Same as above, but remove the old parent pointer entry and add a new |
4567 | In the end, it was decided that solution #6 was the most compact and the |
4569 +--------------------------------------------------------------------------+
4575 Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
4579 an xfblob for storing entry names, and an xfarray for stashing the fixed
4583 2. Set up an inode scanner and hook into the directory entry code to receive
4590 a. Stash the parent pointer name and an addname entry for this dirent in the
4591 xfblob and xfarray, respectively.
4601 dirent update in the xfblob and xfarray for later.
4604 Instead, we stash updates in the xfarray and rely on the scanner thread
4610 directory and the directory being repaired.
4617 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4627 an xfblob for storing parent pointer names, and an xfarray for stashing the
4631 2. Set up an inode scanner and hook into the directory entry code to receive
4638 a. Stash the dirent name and an addpptr entry for this parent pointer in the
4639 xfblob and xfarray, respectively.
4648 a. Stash the dirent name and an addpptr or removepptr entry for this dirent
4649 update in the xfblob and xfarray for later.
4652 Instead, we stash updates in the xfarray and rely on the scanner thread
4657 6. Copy all non-parent pointer extended attributes to the temporary file.
4660 forks of the temporary file and the file being repaired.
4667 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4685 and skip the next step.
4687 b. Otherwise, record the name in an xfblob, and remember the xfblob cookie.
4690 1. Deduplicating names to reduce memory usage, and
4696 name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
4702 a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
4703 ``name_hash``, and ``name_cookie``.
4718 cookie and skip the next step.
4720 c. Record the name in a per-file xfblob, and remember the xfblob
4724 name_cookie)`` tuples in a per-file slab.
4726 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
4727 and ``name_cookie``.
4730 per-AG tuple slab.
4731 This should be trivial since the per-AG tuples are in child inumber
4734 4. Position a second slab cursor at the start of the per-file tuple slab.
4737 ``name_hash``, and ``name_cookie`` fields of the records under each
4740 a. If the per-AG cursor is at a lower point in the keyspace than the
4741 per-file cursor, then the per-AG cursor points to a missing parent
4743 Add the parent pointer to the inode and advance the per-AG
4746 b. If the per-file cursor is at a lower point in the keyspace than
4747 the per-AG cursor, then the per-file cursor points to a dangling
4749 Remove the parent pointer from the inode and advance the per-file
4760 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
4764 challenging because xfs_repair currently uses two single-pass scans of the
4765 filesystem during phases 3 and 4 to decide which files are corrupt enough to be
4767 This scan would have to be converted into a multi-pass scan:
4769 1. The first pass of the scan zaps corrupt inodes, forks, and attributes
4784 the dirents and add them to the now-empty directories.
4797 Fortunately, non-directories are allowed to have multiple parents and cannot
4799 Directories typically constitute 5-10% of the files in a filesystem, which
4802 If the directory tree could be frozen, it would be easy to discover cycles and
4804 from the root directory and marking a bitmap for each directory found.
4818 For this to work, all directory entries and parent pointers must be internally
4819 consistent, each directory entry must have a parent pointer, and the link
4826 the parent -> child relationship by taking the ILOCKs and installing a dirent
4837 1. Create a path object for that parent pointer, and mark the
4840 2. Record the parent pointer name and inode number in a path structure.
4844 Mark the path for deletion and repeat step 1a with the next
4851 Mark the path as a cycle and repeat step 1a with the next subdirectory
4860 a. Record the parent pointer name and inode number in the path object
4866 c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
4877 If the entry matches part of a path, mark that path and the scan stale.
4879 all scan data and starts over.
4885 a. Corrupt paths and cycle paths are counted as suspect.
4896 parent and exit.
4901 5. If the subdirectory has no good paths and more than one suspect path, delete
4904 6. If the subdirectory has zero paths, attach it to the lost and found.
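
A greatly simplified model of the upward walk described above is shown
below.  It assumes a static single-parent table and ignores locking, live
updates, and multiple parent pointers, all of which the real scanner must
handle; the constants and function name are illustrative only.

.. code-block:: c

    /*
     * Follow the parent of each subdirectory towards the root and flag a
     * cycle if any directory repeats along the path.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define EXAMPLE_ROOT_INO        1
    #define EXAMPLE_MAX_PATH        256

    /* parent[i] is the (sole) parent directory of directory i. */
    static bool example_path_has_cycle(const uint64_t *parent, uint64_t dir)
    {
            uint64_t seen[EXAMPLE_MAX_PATH];
            unsigned int depth = 0;

            while (dir != EXAMPLE_ROOT_INO) {
                    unsigned int i;

                    for (i = 0; i < depth; i++)
                            if (seen[i] == dir)
                                    return true;    /* revisited an ancestor */
                    if (depth == EXAMPLE_MAX_PATH)
                            return true;            /* too deep; treat as suspect */
                    seen[depth++] = dir;
                    dir = parent[dir];
            }
            return false;                           /* reached the root */
    }
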
4908 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
4915 -------------
4917 Filesystems present files as a directed, and hopefully acyclic, graph.
4919 The root of the filesystem is a directory, and each entry in a directory points
4920 downwards either to more subdirectories or to non-directory files.
4927 back to the child directory and the file link count checker can detect a file
4932 and parent pointers can be rebuilt by scanning directories.
4937 serve as an orphanage, and linking orphan files into the orphanage by using the
4943 The directory and file link count repair setup functions must use the regular
4945 security attributes and dentry cache entries, just like a regular directory
4951 to try to ensure that the lost and found directory actually exists.
4955 orphanage and the file being reattached.
4966 reparent the orphaned file into the lost and found and invalidate the dentry
4970 orphanage ILOCK, and clean the scrub transaction. Call
4971 ``xrep_adoption_commit`` to commit the updates and the scrub transaction.
4978 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
4981 6. Userspace Algorithms and Data Structures
4984 This section discusses the key algorithms and data structures of the userspace
4985 program, ``xfs_scrub``, that provide the ability to drive metadata checks and
4986 repairs in the kernel, verify file data, and look for other potential problems.
4991 -----------------
4999 the allocation group space btrees, and the realtime volume space
5003 forks, inode indices, inode records, and the forks of every file on the
5006 c. The naming hierarchy depends on consistency within the directory and
5010 d. Directories, extended attributes, and file data depend on consistency within
5011 the file forks that map directory and extended attribute data to physical
5014 e. The file forks depend on consistency within inode records and the space
5015 metadata indices of the allocation groups and the realtime volume.
5016 This includes quota and realtime metadata files.
5020 g. Realtime space metadata depend on the inode records and data forks of the
5024 and reverse mapping btrees) depend on consistency within the AG headers and
5027 i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
5033 - Phase 1 checks that the provided path maps to an XFS filesystem and detects
5036 - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
5038 - Phase 3 scans inodes in parallel.
5039 For each inode, groups (f), (e), and (d) are checked, in that order.
5041 - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
5044 - Phase 5 starts by checking groups (b) and (c) in parallel before moving on
5047 - Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
5048 to read them, and to report which blocks of which files are affected.
5050 - Phase 7 checks group (a), having validated everything else.
5056 --------------------
5059 Given that XFS targets installations with large high-performance storage,
5066 workqueue and scheduled a single workqueue item per AG.
5068 inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
5073 filesystem contains one AG with a few large sparse files and the rest of the
5082 and it uses INUMBERS to find inode btree chunks.
5086 second workqueue, and it is this second workqueue that queries BULKSTAT,
5087 creates a file handle, and passes it to a function to generate scrub items for
5096 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-t…
5097 and the
5099 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalan…
5105 ------------------
5107 During phase 2, corruptions and inconsistencies reported in any AGI header or
5115 During phase 3, corruptions and inconsistencies reported in any part of a
5125 filesystem object, it became much more memory-efficient to track all eligible
5127 Each repair item represents a single lockable object -- AGs, metadata files,
5137 1. Start a round of repair with a workqueue and enough workers to keep the CPUs
5177 Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
5184 …://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warn…
5187 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-d…
5188 and
5190 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracki…
5191 and the
5193 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedu…
5197 -----------------------------------------------
5202 These names consist of the filesystem label, names in directory entries, and
5207 - Slashes and null bytes are not allowed in directory entries.
5209 - Null bytes are not allowed in userspace-visible extended attributes.
5211 - Null bytes are not allowed in the filesystem label.
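
The byte-level portion of these checks is straightforward, as the sketch
below shows.  The function name is illustrative, and the Unicode
normalization and confusable detection discussed later in this section
require a library such as libicu and are not reproduced here.

.. code-block:: c

    /* Reject directory entry names with bytes that are never acceptable. */
    #include <ctype.h>
    #include <stdbool.h>
    #include <stddef.h>

    static bool example_dirent_name_ok(const char *name, size_t len)
    {
            size_t i;

            for (i = 0; i < len; i++) {
                    if (name[i] == '/' || name[i] == '\0')
                            return false;   /* never allowed in a dirent */
                    if (iscntrl((unsigned char)name[i]))
                            return false;   /* suspicious control character */
            }
            return true;
    }
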
5213 Directory entries and attribute keys store the length of the name explicitly
5216 presented together -- all the names in a directory, or all the attributes of a
5220 modern-day Linux systems is that programs work with Unicode character code
5222 These programs typically encode those code points in UTF-8 when interfacing
5223 with the C library because the kernel expects null-terminated names.
5225 UTF-8 encoded Unicode data.
5233 The standard also permits characters to be constructed in multiple ways --
5243 For example, the character "Right-to-Left Override" U+202E can trick some
5257 sections 4 and 5 of the
5260 When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
5263 `libicu <https://github.com/unicode-org/icu>`_
5266 Names are also checked for control characters, non-rendering characters, and
5272 ---------------------------------------
5286 If the verification read fails, ``xfs_scrub`` retries with single-block reads
5287 to narrow down the failure to the specific region of the media and records it.
5290 and report what has been lost.
5292 construct file paths from inode numbers for user-friendly reporting.
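
A simplified sketch of the narrowing strategy follows.  The function name,
the fixed 4096-byte block size, and the plain ``pread`` calls are
assumptions for illustration; the real ``xfs_scrub`` tailors its I/O sizes
to the filesystem geometry and feeds the failed ranges back through the
reverse-mapping lookups described above.

.. code-block:: c

    /* Verify a file range; on failure, re-read per block to find bad spots. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define EXAMPLE_BLOCKSIZE       4096

    static void example_verify_range(int fd, off_t offset, size_t len)
    {
            char *buf = malloc(len);
            off_t pos;

            if (!buf)
                    return;
            if (pread(fd, buf, len, offset) == (ssize_t)len)
                    goto out;       /* whole range read cleanly */

            /* Retry one block at a time to localize the bad region. */
            for (pos = offset; pos < offset + (off_t)len;
                 pos += EXAMPLE_BLOCKSIZE) {
                    if (pread(fd, buf, EXAMPLE_BLOCKSIZE, pos) < 0)
                            fprintf(stderr, "bad block at offset %lld: %s\n",
                                    (long long)pos, strerror(errno));
            }
    out:
            free(buf);
    }
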
5294 7. Conclusion and Future Work
5298 in this document and now has some familiarity with how XFS performs online
5299 rebuilding of its metadata indices, and how filesystem users can interact with
5303 has been built, and why.
5307 ----------------------
5313 necessary refinements to online repair and lack of customer demand mean that
5319 As mentioned earlier, XFS has long had the ability to swap extents between
5321 The earliest form of this was the fork swap mechanism, where the entire
5324 When XFS v5 came along with self-describing metadata, this old mechanism grew
5329 develop an iterative mechanism that used deferred bmap and rmap operations to
5330 swap mappings one at a time.
5331 This mechanism is identical to steps 2-3 from the procedure above except for
5333 an iteration of an existing mechanism and not something totally novel.
5339 old and new contents even after a crash, and it can operate on two arbitrary
5343 - **Atomic commit of file writes**: A userspace process opens a file that it
5345 Next, it opens a temporary file and calls the file clone operation to reflink
5354 - **Transactional file updates**: The same mechanism as above, but the caller
5357 To make this happen, the calling process snapshots the file modification and
5366 - **Emulation of atomic block device writes**: Export a block device with a
5369 Stage all writes to a temporary file, and when that is complete, call the
5372 This emulates an atomic device write in software, and can support arbitrary
5376 ----------------
5386 filesystem object, a list of scrub types to run against that object, and a
5390 dependency that cannot be satisfied due to a corruption, and tells userspace
5398 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
5399 and
5401 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
5405 ------------------------------------
5414 operation and abort the operation if it exceeds budget.
5420 ------------------------
5431 (``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
5437 uses the new fallocate map-freespace call to map any free space in that region
5440 ``GETFSMAP`` and issues forced repair requests on the data structure.
5451 contents; any changes will be written somewhere else via copy-on-write.
5453 cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
5467 most shared data extents in the filesystem, and target them first.
5472 that creates a new file with the old contents and then locklessly runs around
5476 hidden behind a jump label, and a log item that tracks the kernel walking the
5490 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
5491 and
5493 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
5497 ---------------------
5500 the data and metadata at the end of the filesystem, and handing the freed space