
1 .. SPDX-License-Identifier: GPL-2.0
5 Mapping of heading styles within this document:
8 Heading 3 uses "----"
21 This document captures the design of the online filesystem check feature for
23 The purpose of this document is threefold:
25 - To help kernel distributors understand exactly what the XFS online fsck
28 - To help people reading the code to familiarize themselves with the relevant
31 - To help developers maintaining the system by capturing the reasons
37 This document is licensed under the terms of the GNU General Public License, v2.
42 Parts 2 and 3 present a high level overview of how the online fsck process works
44 Part 4 discusses the user interface and the intended usage modes of the new
47 then present case studies of how each repair function actually works.
51 .. contents:: Table of Contents
59 - Provide a hierarchy of names through which application programs can associate
60 arbitrary blobs of data for any length of time,
62 - Virtualize physical storage media across those names, and
64 - Retrieve the named data blobs at any time.
66 - Examine resource usage.
79 cross-references different types of metadata records with each other to look
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
88 Filesystems of the 20th century generally lacked any redundancy in the ondisk
92 it is now possible to regenerate data structures when non-catastrophic errors
95 +--------------------------------------------------------------------------+
97 +--------------------------------------------------------------------------+
98 | System administrators avoid data loss by increasing the number of |
99 | separate storage systems through the creation of backups; and they avoid |
100 | downtime by increasing the redundancy of each storage system through the |
101 | creation of RAID arrays. |
103 +--------------------------------------------------------------------------+
106 -----------------------
109 …nges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`…
110 …ps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-servi…
111 …ges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
116 --------------
118 The online fsck tool described here will be the third tool in the history of
122 The first program, ``xfs_check``, was created as part of the XFS debugger
132 It uses extent-based in-memory data structures to reduce memory consumption,
134 while it scans the metadata of the entire filesystem.
135 The most important feature of this tool is its ability to respond to
141 -----------------
146 shutdowns occur as a result of silent corruptions in the metadata.
149 2. **Users** experience a **total loss of service** during the recovery period
152 3. **Users** experience a **total loss of service** if the filesystem is taken
155 4. **Data owners** cannot **check the integrity** of their stored data without
156 reading all of it.
164 6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
168 malicious actors **exploit quirks of Unicode** to place misleading names
171 Given this definition of the problems to be solved and the actors who would
175 This new third program has three components: an in-kernel facility to check
176 metadata, an in-kernel facility to repair metadata, and a userspace driver
178 ``xfs_scrub`` is the name of the driver program.
179 The rest of this document presents the goals and use cases of the new fsck
183 +--------------------------------------------------------------------------+
185 +--------------------------------------------------------------------------+
190 | The kernel portion of online fsck that validates metadata is called |
191 | "online scrub", and the portion of the kernel that fixes metadata is called |
193 +--------------------------------------------------------------------------+
199 The division of the filesystem into principal objects (allocation groups and
201 repairs on a subset of the filesystem.
204 Even if a piece of filesystem metadata can only be regenerated by scanning the
208 In summary, online fsck takes advantage of resource sharding and redundant
212 autonomous self-healing of XFS maximizes service availability.
214 2. Theory of Operation
218 online fsck consists of three separate code components.
224 and repair each type of online fsck work item.
226 +------------------------------------------------------------------+
228 +------------------------------------------------------------------+
231 +------------------------------------------------------------------+
234 philosophy, which is to say that each item should handle one aspect of a
238 -----
242 However, online fsck cannot be running 100% of the time, which means that
246 This limitation means that maintenance of the offline fsck tool will continue.
247 A second limitation of online fsck is that it must follow the same resource
252 a complete run of online fsck may take longer than offline fsck.
253 However, both of these limitations are acceptable tradeoffs to satisfy the
254 different motivations of online fsck, which are to **minimize system downtime**
255 and to **increase predictability of operation**.
259 Phases of Work
260 --------------
262 The userspace driver program ``xfs_scrub`` splits the work of checking and
264 Each phase concentrates on checking specific types of scrub items and depends
265 on the success of all previous phases.
269 discover the online fsck capabilities of the kernel, and open the
283 3. Check all metadata of every file in the filesystem.
298 Free space in the filesystem is trimmed at the end of phase 4 if the
301 5. By the start of this phase, all primary and secondary filesystem metadata
311 The ability to use hardware-assisted data file integrity checking is new
312 to online fsck; neither of the previous tools has this capability.
315 7. Re-check the summary counters and present the caller with a summary of
318 This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
322 -------------------------
324 The kernel scrub code uses a three-step strategy for checking and repairing
325 the one aspect of a metadata object represented by a scrub item:
327 1. The scrub item of interest is checked for corruptions; opportunities for
345 item to assess the efficacy of the repairs.
346 The results of the reassessment are returned to userspace.
348 Classification of Metadata
349 --------------------------
351 Each type of metadata object (and therefore each type of scrub item) is
362 - Free space and reference count information
364 - Inode records and indexes
366 - Storage mapping information for file data
368 - Directories
370 - Extended attributes
372 - Symbolic links
374 - Quota limits
383 errors and cross-references healthy records against other metadata to look for
385 Repairs for this class of scrub item are simple, since the repair function
393 Because ``xfs_scrub`` locks a primary object for the duration of the repair,
394 this is effectively an offline repair operation performed on a subset of the
396 This minimizes the complexity of the repair code because it is not necessary to
398 any other part of the filesystem.
404 targeted work on individual shards of the filesystem avoids total loss of
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
408 V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
410 *Extending Database Technology*, pp. 293-309, 1992.
413 in-memory array prior to formatting the new ondisk structure, which is very
414 similar to the list-based algorithm discussed in section 2.3 ("List-Based
415 Algorithms") of Srinivasan.
417 duration of the repair is *always* an offline algorithm.
425 but are only needed for online fsck or for reorganization of the filesystem.
429 - Reverse mapping information
431 - Directory parent pointers
433 This class of metadata is difficult for scrub to process because scrub attaches
435 to the usual order of resource acquisition.
439 Repairs, however, require a full scan of primary metadata, which can take a
442 duration of the repair.
444 Instead, repair functions set up an in-memory staging structure to store
446 Depending on the requirements of the specific repair function, the staging
455 Once the scan is done, the owning object is re-locked, the live data is used to
478 2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
485 Their method consists of an index builder that extracts relevant record data to
492 builder maintains a publicly visible cursor that tracks the progress of the
494 To avoid duplication of work between the side file and the index builder, side
498 To minimize changes to the rest of the codebase, XFS online repair keeps the
500 In other words, there is no attempt to expose the keyspace of the new index
502 The complexity of such an approach would be very high and perhaps more
509 employed these live scans to build a shadow copy of the metadata and then
514 That in turn increases the runtime of those scrub functions.
519 Metadata structures in this last category summarize the contents of primary
524 Examples of summary information include:
526 - Summary counts of free space and inodes
528 - File link counts from directories
530 - Quota resource usage counts
536 implementation of the incore counters, and will be treated separately.
537 Check and repair of the other types of summary counters (quota resource counts
539 techniques as outlined above, but because the underlying data are sets of
540 integer counters, the staging data need not be a fully functional mirror of the
545 Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
547 <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
549 Since quotas are non-negative integer counts of resource usage, online
554 whereas the data structure being rebuilt is an index of dquots.
556 it sets attributes of the objects being scanned instead of writing them to a
562 ---------------
564 During the development of online fsck, several risk factors were identified
569 - **Decreased performance**: Adding metadata indices to the filesystem
570 increases the time cost of persisting changes to disk, and the reverse space
574 reduces the ability of online fsck to find inconsistencies and repair them.
576 - **Incorrect repairs**: As with all software, there might be defects in the
583 The xfsprogs build system has a configure option (``--enable-scrub=no``) that
584 disables building of the ``xfs_scrub`` binary, though this is not a risk
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
589 If the keyspaces of several metadata indices overlap in some manner but a
597 - **Misbehavior**: Online fsck requires many privileges -- raw IO to block
607 - **Fuzz Kiddiez**: There are many people now who seem to think that running
608 automated fuzz testing of ondisk artifacts to find mischievous behavior and
609 spraying exploit code onto the public mailing list for instant zero-day
610 disclosure is somehow a social benefit.
611 In the view of this author, the benefit is realized only when the fuzz
615 ongoing risk to the stability of the development process.
616 Automated testing should front-load some of the risk while the feature is
619 Many of these risks are inherent to software programming.
632 3. Minimize further loss of data.
634 Demonstrations of correct operation are necessary to build users' confidence
637 of every aspect of a fsck tool until the introduction of low-cost virtual
638 machines with high-IOPS storage.
641 systematic testing of every attribute of every type of metadata object.
645 -------------------------------
647 The primary goal of any free software QA effort is to make testing as
648 inexpensive and widespread as possible to maximize the scaling advantages of
650 In other words, testing should maximize the breadth of filesystem configuration
652 This improves code quality by enabling the authors of online fsck to find and
653 fix bugs early, and helps developers of new features to find integration
657 `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
660 would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
662 This provides a level of assurance that the kernel and the fsck tools stay in
664 During development of the online checking code, fstests was modified to run
665 ``xfs_scrub -n`` between each test to ensure that the new checking code
668 To start development of online repair, fstests was modified to run
673 To complete the first phase of development of online repair, fstests was
675 This enables a comparison of the effectiveness of online repair as compared to
678 General Fuzz Testing of Metadata Blocks
679 ---------------------------------------
683 Before development of online fsck even began, a set of fstests were created
685 This required the creation of fstests library code that can create a filesystem
686 containing every possible type of metadata object.
688 a single block of a specific type of metadata object, trash it with the
689 existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
692 This earlier test suite enabled XFS developers to test the ability of the
693 in-kernel validation functions and the ability of the offline fsck tool to
695 This part of the test suite was extended to cover online fsck in exactly the
704 * Test the reactions of:
710 Targeted Fuzz Testing of Metadata Records
711 -----------------------------------------
715 of every metadata field of every metadata object in the filesystem.
716 ``xfs_db`` can modify every field of every metadata structure in every
717 block in the filesystem to simulate the effects of memory corruption and
731 * For each conceivable type of transformation that can be applied to a bit field...
742 * ...test the reactions of:
745 2. Offline checking (``xfs_repair -n``)
747 4. Online checking (``xfs_scrub -n``)
754 check the responses of XFS' fsck tools.
755 Since the introduction of the fuzz testing framework, these tests have been
757 classes of metadata objects in ``xfs_repair``.
758 The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
762 These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
768 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements…
770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
772 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`…
775 --------------
779 Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
790 * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
792 * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
793 force-repairing the whole filesystem doesn't cause problems.
794 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
796 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
797 remounting the filesystem read-only and read-write.
798 * The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?)
800 Success is defined by the ability to run all of these tests without observing
802 check warnings, or any other sort of mischief.
805 …git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-
806 and the `evolution of existing per-function stress testing
807 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stre…
812 The primary user of online fsck is the system administrator, just like offline
814 Online fsck presents two modes of operation to administrators:
819 ------------------
824 The program checks every piece of metadata in the filesystem while the
827 Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
828 option to increase the verbosity of the information reported.
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
831 correction capabilities of the hardware to check data file contents.
833 program runtime and consume a lot of bandwidth on older storage hardware.
835 The output of a foreground invocation is captured in the system log.
837 The ``xfs_scrub_all`` program walks the list of mounted filesystems and
838 initiates ``xfs_scrub`` for each of them in parallel.
843 ------------------
845 To reduce the workload of system administrators, the ``xfs_scrub`` package
846 provides a suite of `systemd <https://systemd.io/>`_ timers and services that
849 possible, the lowest CPU and IO priority, and in a CPU-constrained single
852 and throughput requirements of customer workloads.
854 The output of the background service is also captured in the system log.
855 If desired, reports of failures (either due to inconsistencies or mere runtime
864 This can be done by enabling either of the following services:
867 * ``xfs_scrub_all.cron`` on non-systemd systems
869 This automatic weekly scan is configured out of the box to perform an
870 additional media scan of all file data once per month.
877 (as of systemd 249) to ensure that the xfs_scrub processes have as little
878 access to the rest of the system as possible.
879 This was performed via ``systemd-analyze security``, after which privileges
884 The service definition files restrict CPU usage to 80% of one CPU core, and
885 apply as low an IO and CPU scheduling priority as possible.
886 This measure was taken to minimize delays in the rest of the filesystem.
891 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-se…
894 ----------------
896 XFS caches a summary of each filesystem's health status in memory.
900 System administrators should use the ``health`` command of ``xfs_spaceman`` to
901 retrieve this information in a human-readable format.
912 *Answer*: These questions remain unanswered, but should be a part of the
913 conversation with early adopters and potential downstream users of XFS.
917 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-repo…
919 `preservation of sickness info during memory reclaim
920 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-report…
925 This section discusses the key algorithms and data structures of the kernel
930 The remainder of this section presents the mechanisms through which XFS
934 ------------------------
936 Starting with XFS version 5 in 2012, XFS updated the format of nearly every
938 "unique" identifier (UUID), an owner code, the ondisk address of the block,
941 ondisk address confirm that the retrieved block matches the specific owner of
949 to the log as part of a transaction.
952 The logging code maintains the checksum and the log sequence number of the last
956 Sequence number tracking enables log recovery to avoid applying out of date
965 Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
968 ---------------
970 The original design of XFS (circa 1993) is an improvement upon 1980s Unix
975 the filesystem, even at the cost of data integrity.
977 increase internal redundancy -- either storing nearly identical copies of
978 metadata, or more space-efficient encoding techniques.
983 By adding a new index, the filesystem retains most of its ability to scale
987 Like any system that improves redundancy, the reverse-mapping feature increases
992 Second, the different ondisk storage format of the reverse mapping btree
993 defeats device-level deduplication because the filesystem requires real
996 +--------------------------------------------------------------------------+
998 +--------------------------------------------------------------------------+
999 | A criticism of adding the secondary index is that it does nothing to |
1000 | improve the robustness of user data storage itself. |
1003 | copy-writes, which age the filesystem prematurely. |
1004 | In keeping with thirty years of precedent, users who want file data |
1006 | As for metadata, the complexity of adding a new secondary index of space |
1009 | Perfection of RAID and volume management is best left to existing  |
1011 +--------------------------------------------------------------------------+
1015 .. code-block:: c
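
    /*
     * Reverse mapping record; this matches struct xfs_rmap_irec in
     * fs/xfs/libxfs/xfs_rmap.h, with comments added here for clarity.
     */
    struct xfs_rmap_irec {
        xfs_agblock_t    rm_startblock;   /* extent start block */
        xfs_extlen_t     rm_blockcount;   /* extent length */
        uint64_t         rm_owner;        /* extent owner */
        uint64_t         rm_offset;       /* offset within the owner */
        unsigned int     rm_flags;        /* state flags */
    };
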
1025 The first two fields capture the location and size of the physical space,
1026 in units of filesystem blocks.
1031 Finally, the flags field provides extra information about the space usage --
1035 Online filesystem checking judges the consistency of each primary metadata
1038 because it contains a centralized alternate copy of all space allocation
1040 Program runtime and ease of resource acquisition are the only real limits to
1044 * The absence of an entry in the free space information.
1045 * The absence of an entry in the inode index.
1046 * The absence of an entry in the reference count data if the file is not
1048 * The correspondence of an entry in the reverse mapping information.
1052 1. Reverse mappings can provide a positive affirmation of correctness if any of
1057 2. Proving the consistency of secondary metadata with the primary metadata is
1058 difficult because that requires a full scan of all primary space metadata,
1063 Instead, scrub relies on rigorous cross-referencing during the primary space
1066 3. Consistency scans must use non-blocking lock acquisition primitives if the
1072 This means that forward progress during this part of a scan of the reverse
1075 In summary, reverse mappings play a key role in reconstruction of primary
1077 The details of how these records are staged, written to disk, and committed
1080 Checking and Cross-Referencing
1081 ------------------------------
1083 The first step of checking a metadata structure is to examine every record
1084 contained within the structure and its relationship with the rest of the
1086 XFS contains multiple layers of checking to try to prevent inconsistent
1088 Each of these layers contributes information that helps the kernel to make
1089 three decisions about the health of a metadata structure:
1091 - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1092 - Is this structure inconsistent with the rest of the system
1094 - Is there so much damage around the filesystem that cross-referencing is not
1096 - Can the structure be optimized to improve performance or reduce the size of
1098 - Does the structure contain data that is not inconsistent but deserves review
1106 The lowest layer of metadata protection in XFS consists of the metadata verifiers built
1108 These functions perform inexpensive internal consistency checking of the block
1111 - Does the block belong to this filesystem?
1113 - Does the block belong to the structure that asked for the read?
1117 - Is the type of data stored in the block within a reasonable range of what
1120 - Does the physical location of the block match the location it was read from?
1122 - Does the block checksum match the data?
1124 The scope of the protections here is very limited -- verifiers can only
1125 establish that the filesystem code is reasonably free of gross corruption bugs
1127 Corruption problems observed at runtime cause the generation of health reports,
1129 corrupt metadata force the cancellation of a dirty transaction.
1132 block of a structure in the course of checking the structure.
1134 userspace as corruption; during a cross-reference, they are reported as a
1135 failure to cross-reference once the full examination is complete.
1142 After the buffer cache, the next level of metadata protection is the internal
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1145 the buffer cache, and the scrub code itself, depending on the amount of higher
1147 The scope of checking is still internal to the block.
1150 - Does the type of data stored in the block match what scrub is expecting?
1152 - Does the block belong to the owning structure that asked for the read?
1154 - If the block contains records, do the records fit within the block?
1156 - If the block tracks internal free space information, is it consistent with
1159 - Are the records contained inside the block free of obvious corruptions?
1161 Record checks in this category are more rigorous and more time-intensive.
1163 within the dynamically allocated parts of an allocation group and within
1168 Btree records spanning an interval of the btree keyspace are checked for
1169 correct order and lack of mergeability (except for file fork mappings).
1170 For performance reasons, regular code may skip some of these checks unless
1172 Scrub functions, of course, must check all possible problems.
1174 Validation of Userspace-Controlled Record Attributes
1177 Various pieces of filesystem metadata are directly controlled by userspace.
1178 Because of this nature, validation work cannot be more precise than checking
1182 - Superblock fields controlled by mount options
1183 - Filesystem labels
1184 - File timestamps
1185 - File permissions
1186 - File size
1187 - File flags
1188 - Names present in directory entries, extended attribute keys, and filesystem
1190 - Extended attribute key namespaces
1191 - Extended attribute values
1192 - File data block contents
1193 - Quota limits
1194 - Quota timer expiration (if resource usage exceeds the soft limit)
1196 Cross-Referencing Space Metadata
1199 After internal block checks, the next higher level of checking is
1200 cross-referencing records between metadata structures.
1201 For regular runtime code, the cost of these checks is considered to be
1203 inconsistencies, it must pursue all avenues of inquiry.
1204 The exact set of cross-referencing is highly dependent on the context of the
1209 Specifically, scrub can scan the key space of an index to determine if that
1211 For the reverse mapping btree, it is possible to mask parts of the key for the
1212 purposes of performing a keyspace scan so that scrub can decide if the rmap
1213 btree contains records mapping a certain extent of physical space without the
1214 sparseness of the rest of the rmap keyspace getting in the way.
1216 Btree blocks undergo the following checks before cross-referencing:
1218 - Does the type of data stored in the block match what scrub is expecting?
1220 - Does the block belong to the owning structure that asked for the read?
1222 - Do the records fit within the block?
1224 - Are the records contained inside the block free of obvious corruptions?
1226 - Are the name hashes in the correct order?
1228 - Do node pointers within the btree point to valid block addresses for the type
1229 of btree?
1231 - Do child pointers point towards the leaves?
1233 - Do sibling pointers point across the same level?
1235 - For each node block record, does the record key accurately reflect the contents
1236 of the child block?
1238 Space allocation records are cross-referenced as follows:
1240 1. Any space mentioned by any metadata structure is cross-referenced as
1243 - Does the reverse mapping index list only the appropriate owner as the
1244 owner of each block?
1246 - Are none of the blocks claimed as free space?
1248 - If these aren't file data blocks, are none of the blocks claimed as space
1251 2. Btree blocks are cross-referenced as follows:
1253 - Everything in class 1 above.
1255 - If there's a parent node block, do the keys listed for this block match the
1256 keyspace of this block?
1258 - Do the sibling pointers point to valid blocks? Of the same level?
1260 - Do the child pointers point to valid blocks? Of the next level down?
1262 3. Free space btree records are cross-referenced as follows:
1264 - Everything in class 1 and 2 above.
1266 - Does the reverse mapping index list no owners of this space?
1268 - Is this space not claimed by the inode index for inodes?
1270 - Is it not mentioned by the reference count index?
1272 - Is there a matching record in the other free space btree?
1274 4. Inode btree records are cross-referenced as follows:
1276 - Everything in class 1 and 2 above.
1278 - Is there a matching record in the free inode btree?
1280 - Do cleared bits in the holemask correspond with inode clusters?
1282 - Do set bits in the freemask correspond with inode records with zero link
1285 5. Inode records are cross-referenced as follows:
1287 - Everything in class 1.
1289 - Do all the fields that summarize information about the file forks actually
1292 - Does each inode with zero link count correspond to a record in the free
1295 6. File fork space mapping records are cross-referenced as follows:
1297 - Everything in class 1 and 2 above.
1299 - Is this space not mentioned by the inode btrees?
1301 - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1304 7. Reference count records are cross-referenced as follows:
1306 - Everything in class 1 and 2 above.
1308 - Within the space subkeyspace of the rmap btree (that is to say, all
1310 are there the same number of reverse mapping records for each block as the
1315 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-
1317 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gap…
1319 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-ga…
1322 …tps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-r…
1325 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-
1331 Extended attributes implement a key-value store that enables fragments of data
1335 Most typically these fragments are metadata about the file -- origins, security
1336 contexts, user-supplied labels, indexing information, etc.
1343 Block 0 in the attribute fork is always the top of the structure, but otherwise
1344 each of the three types of blocks can be found at any offset in the attr fork.
1347 Values that are less than 3/4 the size of a filesystem block are also stored
1351 rooted at block 0) is created to map hashes of the attribute names to leaf
1355 lack of separation between attr blocks and index blocks.
1356 Scrub must read each block mapped by the attr fork and ignore the non-leaf
1363 2. Walk the blocks of the attr fork looking for leaf blocks.
1369 This performs a named lookup of the attr name to ensure the correctness
1370 of the dabtree.
1372 integrity of the remote value block.
1374 Checking and Cross-Referencing Directories
1379 Directories are a special type of file containing a set of mappings from a
1380 255-byte sequence (name) to an inumber.
1384 Directory entries point to files of any type.
1385 Each non-directory file may have multiple directories point to it.
1390 Each data block contains variable-sized records associating a user-provided
1393 exists as post-EOF extents) is populated with a block containing free space
1394 information and an index that maps hashes of the dirent names to directory data
1398 populated with a linear array of free space information for faster
1401 beyond one block, then a dabtree is used to map hashes of dirent names to
1410 2. Walk the blocks of the first partition looking for directory entries.
1419 d. If a file type is included in the dirent, does it match the type of the
1425 f. If the directory has a second partition, perform a named lookup of the
1426 dirent name to ensure the correctness of the dabtree.
1439 maps user-provided names to improve lookup times by avoiding linear scans.
1440 Internally, it maps a 32-bit hash of the name to a block offset within the
1443 The internal structure of a dabtree closely resembles the btrees that record
1444 fixed-size metadata records -- each dabtree block contains a magic number, a
1446 The format of leaf and node records is the same -- each entry points to the
1448 leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1456 - Does the block belong to the owning structure that asked for the read?
1458 - Do the records fit within the block?
1460 - Are the records contained inside the block free of obvious corruptions?
1462 - Are the name hashes in the correct order?
1464 - Do node pointers within the dabtree point to valid fork offsets for dabtree
1467 - Do leaf pointers within the dabtree point to valid fork offsets for directory
1470 - Do child pointers point towards the leaves?
1472 - Do sibling pointers point across the same level?
1474 - For each dabtree node record, does the record key accurately reflect the
1475 contents of the child dabtree block?
1477 - For each dabtree leaf record, does the record key accurately reflect the
1478 contents of the directory or attr block?
1480 Cross-Referencing Summary Counters
1483 XFS maintains three classes of summary counters: available resources, quota
1486 In theory, the amount of available resources (data blocks, inodes, realtime
1489 maintain summaries of this information in the superblock.
1490 Cross-referencing these values against the filesystem metadata should be a
1491 simple matter of walking the free space and inode metadata in each AG and the
1498 Post-Repair Reverification
1502 the new structure, and the results of the health assessment are recorded
1505 of the filesystem and the progress of any repairs.
1506 For developers, it is a useful means to judge the efficacy of error detection
1510 ------------------------------------
1512 Complex operations can make modifications to multiple per-AG data structures
1513 with a chain of transactions.
1523 Only online fsck has this requirement of total consistency of AG metadata, and
1527 * For each AG, maintain a count of intent items targeting that AG.
1540 Details about the discovery of this situation are presented in the
1546 Discovery of the Problem
1549 Midway through the development of online scrubbing, the fsstress tests
1551 created by other writer threads that resulted in false reports of metadata
1553 The root cause of these reports is the eventual consistency model introduced by
1554 the expansion of deferred work items and compound transaction chains when
1563 To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
1570 It then attaches to the in-memory transaction an action item to schedule
1571 deferred freeing of space.
1572 Concretely, each transaction maintains a list of ``struct
1573 xfs_defer_pending`` objects, each of which maintains a list of ``struct
1575 Returning to the example above, the action item tracks the freeing of both
1586 of AG 3 to release the former BMBT block and a second physical update to the
1587 free space btrees of AG 7 to release the unmapped file space.
1595 but before #2 is committed, a scan of the filesystem metadata would show
1597 of the unmapped space.
1598 Happily, log recovery corrects this inconsistency for us -- when recovery finds
1600 reconstruct the incore state of the intent item and finish it.
1608 In other words, all per-AG metadata updates for an unmapped block must be
1615 correctness of filesystem operations.
1621 In this manner, XFS employs a form of eventual consistency to avoid deadlocks
1624 During the design phase of the reverse mapping and reflink features, it was
1654 For copy-on-write updates this is even worse, because this must be done once to
1657 To deal with this explosion in a calm manner, XFS expands its use of deferred
1659 This reduces the worst case size of transaction reservations by breaking the
1660 work into a long chain of small updates, which increases the degree of eventual
1665 However, online fsck changes the rules -- remember that although physical
1666 updates to per-AG structures are coordinated by locking the buffers for AG
1672 For example, if a thread performing a copy-on-write has completed a reverse
1674 will appear inconsistent to scrub and an observation of corruption will be
1678 Several other solutions to this problem were evaluated upon discovery of this
1686 Performing a dry run of a file operation to discover necessary locks would
1689 2. Make the deferred work coordinator code aware of consecutive intent items
1692 This would introduce a lot of complexity into the coordinator since it is
1717 holding an AG header lock, but per-AG work items cannot be marked done without
1721 is an explicit deprioritization of online fsck to benefit file operations.
1722 The second property of the drain is key to the correct coordination of scrub,
1733 calls ``->finish_item`` to complete it.
1735 4. The ``->finish_item`` implementation logs some changes and calls
1745 For example, a scan of the refcount btree would lock the AGI and AGF header
1754 back to step 1 unless a signal has been caught.
1761 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
1770 However, there are a few parts of online fsck (such as the intent drains, and
1772 what's going on in the rest of the filesystem.
1776 running on behalf of userspace.
1777 Taking locks in the hot path of a writer thread to access a data structure only
1778 to find that no further action is necessary is expensive -- on the author's
1779 computer, this has an overhead of 40-50ns per access.
1783 This sled has an overhead of however long it takes the instruction decoder to
1784 skip past the sled, which seems to be on the order of less than 1ns and
1785 does not access memory outside of instruction fetching.
1804 - The hooked part of XFS should declare a static-scoped static key that
1806 The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
1809 - When deciding to invoke code that's only used by scrub, the regular
1811 scrub-only hook code if the static key is not enabled.
1813 - The regular filesystem should export helper functions that call
1819 - Scrub functions wanting to turn on scrub-only XFS functionality should call
1831 If the caller of the helper has not enabled the static key, the helper will
1832 return -EDEADLOCK, which should result in the scrub being restarted with the
1838 For more information, please see the kernel documentation of
1839 Documentation/staging/static-keys.rst.
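
As a rough sketch of this pattern: ``DEFINE_STATIC_KEY_FALSE``,
``static_branch_unlikely``, ``static_branch_inc``, and ``static_branch_dec``
are the real kernel primitives, but the hook and helper names below are
illustrative rather than actual XFS symbols:

.. code-block:: c

    #include <linux/jump_label.h>

    DEFINE_STATIC_KEY_FALSE(xfs_example_hooks_switch);

    void xfs_example_hook_slowpath(unsigned int value);

    /* Hot path in the regular filesystem: nearly free while scrub is idle. */
    static inline void xfs_example_hook(unsigned int value)
    {
            if (static_branch_unlikely(&xfs_example_hooks_switch))
                    xfs_example_hook_slowpath(value);
    }

    /* Helpers exported to scrub to flip the hooks on and off. */
    void xfs_example_hooks_enable(void)
    {
            static_branch_inc(&xfs_example_hooks_switch);
    }

    void xfs_example_hooks_disable(void)
    {
            static_branch_dec(&xfs_example_hooks_switch);
    }
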
1844 ----------------------
1847 shadow copy of an ondisk metadata structure in memory and comparing the two
1854 To meet these goals, the kernel needs to collect a large amount of information
1855 in a place that doesn't require the correct operation of the filesystem.
1859 * Allocating a contiguous region of memory to create a C array is very
1860 difficult, especially on 32-bit systems.
1862 * Linked lists of records introduce double pointer overhead which is very high
1863 and eliminate the possibility of indexed lookups.
1871 Continued development of online fsck demonstrated that the ability to perform
1873 Fortunately, the Linux kernel already has a facility for byte-addressable and
1875 In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1880 +--------------------------------------------------------------------------+
1882 +--------------------------------------------------------------------------+
1883 | The first edition of online repair inserted records into a new btree as |
1887 | The second edition solved the half-rebuilt structure problem by storing |
1888 | everything in memory, but frequently ran the system out of memory. |
1891 | memory overhead of the list pointers was extreme. |
1892 +--------------------------------------------------------------------------+
1897 A survey of the intended uses of xfiles suggested these use cases:
1899 1. Arrays of fixed-sized records (space management btrees, directory and
1902 2. Sparse arrays of fixed-sized records (quotas and link counts)
1904 3. Large binary objects (BLOBs) of variable sizes (directory and extended
1913 The rest of this section discusses the interfaces that the xfile presents to
1914 four of those five higher level data structures.
1918 XFS is very record-based, which suggests that the ability to load and store
1920 To support these cases, a pair of ``xfile_load`` and ``xfile_store``
1922 error as an out of memory error. For online repair, squashing error conditions
1926 However, no discussion of file access idioms is complete without answering the
1937 mechanism. Folio locks are not supposed to be held for long periods of time, so
1944 retrieve the (locked) folio that backs part of an xfile and to release it.
1946 :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
1959 xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1977 Arrays of Fixed-Sized Records
1980 In XFS, each type of indexed space metadata (free space, inodes, reference
1981 counts, file fork space, and reverse mappings) consists of a set of fixed-size
1983 Directories have a set of fixed-size dirent records that point to the names,
1984 and extended attributes have a set of fixed-size attribute keys that point to
1991 methods of the xfile directly, it is simpler for callers to have a
1992 higher level abstraction to take care of computing array offsets, to provide
1994 The ``xfarray`` abstraction presents a linear array for fixed-size records atop
1995 the byte-accessible xfile.
2003 Iteration of records is assumed to be necessary for all cases and will be
2006 The first type of caller handles records that are indexed by position.
2012 ``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2013 provide loading and storing of array elements at arbitrary array indices.
2015 sequence of all zero bytes.
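
For example, a brief sketch of this first usage type, assuming the load and
store helpers take an array, an index, and a record pointer:

.. code-block:: c

    /* Load, modify, and store the record at a fixed array index. */
    static int example_update_rec(struct xfarray *array, xfarray_idx_t idx)
    {
            struct xfs_rmap_irec    rec;
            int                     error;

            error = xfarray_load(array, idx, &rec);
            if (error)
                    return error;

            rec.rm_offset = 0;      /* arbitrary update for illustration */
            return xfarray_store(array, idx, &rec);
    }
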
2020 The second type of caller handles records that are not indexed by position
2024 via the ``xfarray_append`` function, which stores a record at the end of the
2030 The third type of caller is a bag, which is useful for counting records.
2034 at any time, and uniqueness of records is left to callers.
2040 `big in-memory array
2041 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
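
A sketch of the second usage type follows; ``xfarray_append`` is named above,
while the creation and teardown calls (and their arguments) are assumptions
made for illustration:

.. code-block:: c

    /* Gather records in creation order into a new xfarray. */
    static int example_gather(struct xfs_rmap_irec *recs, uint64_t nr,
                              struct xfarray **arrayp)
    {
            struct xfarray  *array;
            uint64_t        i;
            int             error;

            error = xfarray_create("example records", 0,
                                   sizeof(struct xfs_rmap_irec), &array);
            if (error)
                    return error;

            for (i = 0; i < nr; i++) {
                    error = xfarray_append(array, &recs[i]);
                    if (error) {
                            xfarray_destroy(array);
                            return error;
                    }
            }

            *arrayp = array;
            return 0;
    }
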
2046 Most users of the xfarray require the ability to iterate the records stored in
2050 .. code-block:: c
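
    /* Illustrative sketch; see xfarray.h for the exact helpers. */
    xfarray_idx_t i;

    foreach_xfarray_idx(array, i) {
        xfarray_load(array, i, &rec);

        /* do something with rec */
    }
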
2059 All users of this idiom must be prepared to handle null records or must already
2065 of the array that are not populated with memory pages.
2066 Once it finds a page, it will skip the zeroed areas of the page.
2068 .. code-block:: c
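
    /* Illustrative sketch of the hole-skipping iterator. */
    xfarray_idx_t i = XFARRAY_CURSOR_INIT;

    while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
        /* do something with rec */
    }
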
2080 During the fourth demonstration of online repair, a community reviewer remarked
2081 that for performance reasons, online repair ought to load batches of records
2082 into btree record blocks instead of inserting records into a new btree one at a
2085 of the records, so naturally the xfarray must also support sorting the record
2091 The sorting algorithm used in the xfarray is actually a combination of adaptive
2092 quicksort and a heapsort subalgorithm in the spirit of
2096 To sort records in a reasonably short amount of time, ``xfarray`` takes
2097 advantage of the binary subpartitioning offered by quicksort, but it also uses
2103 The Linux kernel already contains a reasonably fast implementation of heapsort.
2104 It only operates on regular C arrays, which limits the scope of its usefulness.
2109 * Loading a small number of xfarray records from potentially disparate parts
2110 of the xfarray into a memory buffer, and sorting the buffer.
2112 In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
2122 median of the nine.
2126 Typical ninther implementations pick three unique triads of records, sort each
2127 of the triads, and then sort the middle value of each triad to determine the
2131 memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2132 the 4th element of that buffer as the pivot.
2134 low-effort robust (resistant) location in large samples`, in *Contributions to
2138 The partitioning of quicksort is fairly textbook -- rearrange the record
2140 sort with the larger and the smaller halves of the pivot, respectively.
2143 As a final performance optimization, the hi and lo scanning phase of quicksort
2147 accounting for the application of heapsort directly onto xfile pages.
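
The pivot selection can be sketched against a plain C array; the helper below
is illustrative only and not the kernel's actual implementation.  Note that the
kernel's ``sort()`` from ``lib/sort.c`` is itself a heapsort:

.. code-block:: c

    #include <linux/sort.h>

    static int cmp_u64(const void *a, const void *b)
    {
            const u64 *x = a, *y = b;

            if (*x < *y)
                    return -1;
            return *x > *y;
    }

    /*
     * Pick a quicksort pivot: sample nine records spread evenly across the
     * subarray, heapsort the sample, and take the middle element.
     * Assumes hi - lo >= 8.
     */
    static u64 pick_pivot(const u64 *records, size_t lo, size_t hi)
    {
            u64     sample[9];
            size_t  step = (hi - lo) / 8;
            int     i;

            for (i = 0; i < 9; i++)
                    sample[i] = records[lo + i * step];

            sort(sample, 9, sizeof(u64), cmp_u64, NULL);
            return sample[4];
    }
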
2155 records: arbitrary byte sequences of finite length.
2158 The names, keys, and values can consume a large amount of memory, so the
2159 ``xfblob`` abstraction was created to simplify management of these blobs
2169 The details of repairing directories and extended attributes will be discussed
2172 to cache a small number of entries before adding them to a temporary ondisk
2175 The proposed patchset is at the start of the
2177 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ serie…
2181 In-Memory B+Trees
2185 checking and repairing of secondary metadata commonly requires coordination
2186 between a live metadata scan of the filesystem and writer threads that are
2192 unbounded memory consumption if the rest of the system is very busy.
2193 Another option is to skip the side-log and commit live updates from the
2199 Given that indexed lookups of scan data are required for both strategies, online
2200 fsck employs the second strategy of committing live updates directly into
2211 The XFS buffer cache specializes in abstracting IO to block-oriented address
2212 spaces, which means that adaptation of the buffer cache to interface with
2213 xfiles enables reuse of the entire btree library.
2218 `in-memory btree
2219 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2228 per-AG structure.
2233 With this adaptation in place, users of the xfile-backed buffer cache use
2234 exactly the same APIs as users of the disk-backed buffer cache.
2237 updates to an in-memory btree.
2243 Space management for an xfile is very simple -- each btree block is one memory
2245 These blocks use the same header format as an on-disk btree, but the in-memory
2247 corruption-prone than regular DRAM.
2250 The very first block of an xfile backing an xfbtree contains a header block.
2251 The header describes the owner, height, and the block number of the root
2255 If there are no gaps, create one by extending the length of the xfile.
2277 For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
2282 Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
2286 and to update the in-memory btree.
2301 structure, the ephemeral nature of the in-memory btree block storage presents
2302 some challenges of its own.
2318 2. Record the dirty/ordered status of the log item.
2333 Bulk Loading of Ondisk B+Trees
2334 ------------------------------
2336 As mentioned previously, early iterations of online repair built new btree
2338 Loading a btree one record at a time had a slight advantage of not requiring
2342 loading factor of the blocks in the new btree.
2345 rebuilding a btree index from a collection of records -- bulk btree loading.
2346 This was implemented rather inefficiently code-wise, since ``xfs_repair``
2347 had separate copy-pasted implementations for each btree type.
2349 To prepare for online fsck, each of the four bulk loaders was studied, notes
2357 The zeroth step of bulk loading is to assemble the entire record set that will
2359 Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
2360 btree from the record set, the type of btree, and any load factor preferences.
2364 will fit in a leaf block from the size of a btree block and the size of the
2366 Roughly speaking, the maximum number of records is::
2368 maxrecs = (block_size - header_size) / record_size
2371 which means the minimum number of records is half of maxrecs::
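
    minrecs = maxrecs / 2
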
2381 The default loading factor was chosen to be 75% of maxrecs, which provides a
2387 running out of space::
2391 Load factor is computed for btree node blocks using the combined size of the
2394 maxrecs = (block_size - header_size) / (key_size + ptr_size)
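
As a hypothetical worked example (these sizes are purely illustrative and do
not correspond to any particular XFS btree): given a 4096-byte block, a 56-byte
header, 16-byte records, 20-byte keys, and 4-byte pointers, a leaf block holds
(4096 - 56) / 16 = 252 records, a node block holds (4096 - 56) / 24 = 168
key/pointer pairs, and the corresponding minrecs values are 126 and 84.
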
2398 Once that's done, the number of leaf blocks required to store the record set
2403 The number of node blocks needed to point to the next level down in the tree
2413 - For AG-rooted btrees, this level is the root level, so the height of the new
2414 tree is ``level + 1`` and the space needed is the summation of the number of
2417 - For inode-rooted btrees where the records in the top level do not fit in the
2419 summation of the number of blocks on each level, and the inode fork points to
2422 - For inode-rooted btrees where the records in the top level can be stored in
2425 of the number of blocks on each level.
2426 This only becomes relevant when non-bmap btrees gain the ability to root in
2434 Once repair knows the number of blocks needed for the new btree, it allocates
2439 its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
2444 extent, it updates the in-memory reservation to reflect the claimed space.
2446 reduce the number of EFIs in play.
2449 reservations pin the tail of the ondisk log.
2450 It's possible that other parts of the system will remain busy and push the head
2451 of the log towards the pinned tail.
2452 To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
2454 To alleviate this problem, the dynamic relogging capability of the deferred ops
2464 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2467 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2473 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2475 rest of the block with records, and adds the new leaf block to a list of
2533 The repair function must log the location of the new root in a transaction,
2537 1. Commit the location of the new btree root.
2545 b. For unclaimed portions of incore reservations, create a regular deferred
2550 reservation of the committing transaction.
2559 algorithm, because a log flush and a crash before the end of the reap step can
2561 Online repair functions minimize the chances of this occurring by using very
2562 large transactions, which each can accommodate many thousands of block freeing
2565 subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
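
Putting these steps together, the skeleton of a repair function's use of the
bulk loader looks roughly like the following.  The geometry function is named
above; the loader entry point and the ``nr_blocks`` field are assumed here to
be its companions in the staging code, and the reservation, cursor setup, and
error handling are heavily abbreviated:

.. code-block:: c

    /* Rough skeleton of a repair function driving the bulk loader. */
    static int example_rebuild(struct xfs_btree_cur *cur, uint64_t nr_records,
                               void *priv)
    {
            struct xfs_btree_bload  bbl = { };
            int                     error;

            /* 1: compute the new btree geometry from the record count. */
            error = xfs_btree_bload_compute_geometry(cur, &bbl, nr_records);
            if (error)
                    return error;

            /* 2: reserve bbl.nr_blocks blocks for the new btree (not shown). */

            /* 3: write the records into the reserved blocks. */
            error = xfs_btree_bload(cur, &bbl, priv);
            if (error)
                    return error;

            /* 4: commit the new root and reap the old blocks (not shown). */
            return 0;
    }
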
2573 records from the inode chunk information and a bitmap of the old inode btree
2579 of blocks needed for the inode btree.
2581 geometry of the finobt.
2583 4. Allocate the number of blocks computed in the previous step.
2589 6. Commit the location of the new btree root block(s) to the AGI.
2595 The inode btree maps inumbers to the ondisk location of the associated
2598 Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` marks the
2599 location of the old inode btree blocks.
2600 Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
2601 location of at least one inode cluster buffer.
2602 A cluster is the smallest number of ondisk inodes that can be allocated or
2612 Accumulate the results of successive inode cluster buffer reads until there is
2617 Once the repair function accumulates one chunk's worth of data, it calls
2619 This xfarray is walked twice during the btree creation step -- once to populate
2621 free inode btree with records for chunks that have free non-sparse inodes.
2622 The number of records for the inode btree is the number of xfarray records,
2628 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2635 Reference counts are required for correct operation of copy on write for shared
2637 Imagine the reverse mapping entries as rectangles representing extents of
2641 or end wherever the height of the stack changes.
2642 In other words, the record emission stimulus is level-triggered::
2653 Extents being used to stage copy-on-write operations should be the only records
2655 Single-owner file blocks aren't recorded in either the free space or the
2667 Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
2671 at the end of the xfarray.
2672 This matches the sorting order of records in the refcount btree.
2675 of blocks needed for the new tree.
2677 4. Allocate the number of blocks computed in the previous step.
2682 6. Commit the location of the new btree root block to the AGF.
2689 - Until the reverse mapping btree runs out of records:
2691 - Retrieve the next record from the btree and put it in a bag.
2693 - Collect all records with the same starting block from the btree and put
2696 - While the bag isn't empty:
2698 - Among the mappings in the bag, compute the lowest block number where the
2700 This position will be either the starting block number of the next
2704 - Remove all mappings from the bag that end at this position.
2706 - Collect all reverse mappings that start at this position from the btree
2709 - If the size of the bag changed and is greater than one, create a new
2711 the size of the bag.
2713 The bag-like structure in this case is a type 2 xfarray as discussed in the
2721 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2732 Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
2736 of blocks needed for the new tree.
2743 5. Allocate the number of blocks computed in the previous step.
2753 First, it's possible to move the fork offset to adjust the sizes of the
2756 EXTENTS format instead of BMBT, which may require a conversion.
2762 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2768 ---------------------------
2771 suspect, there is a question of how to find and dispose of the blocks that
2773 The laziest method of course is not to deal with them at all, but this slowly
2774 leads to service degradations as space leaks out of the filesystem.
2775 Hopefully, someone will schedule a rebuild of the free space information to
2777 Offline repair rebuilds all space metadata after recording the usage of
2779 structures in the discovered free space and avoid the question of reaping.
2781 As part of a repair, online fsck relies heavily on the reverse mapping records
2784 there may be other data structures that also think they own some of those
2789 For space metadata, the process of finding extents to dispose of generally
2792 1. Create a bitmap of space used by data structures that must be preserved.
2794 the same rmap owner code is used to denote all of the objects being rebuilt.
2796 2. Survey the reverse mapping data to create a bitmap of space owned by the
2803 Repairs for file-based metadata such as extended attributes, directories,
2810 The process for disposing of old extents is as follows:
2812 4. For each candidate extent, count the number of reverse mapping records for
2816 - If zero, the block has a single owner and can be freed.
2818 - If not, the block is part of a crosslinked structure and must not be
2833 Transactions are of finite size, so the reaping process must be careful to roll
2837 a. EFIs logged on behalf of space that is no longer occupied
2844 minimize the chances of this occurring.
2848 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2857 Creating a list of extents to reap the old btree blocks is quite simple,
2888 of blocks needed for each new tree.
2890 4. Allocate the number of blocks computed in the previous step from the free
2897 6. Commit the locations of the new btree root blocks to the AGF.
2907 space component of the keyspace of the reverse mapping btree.
2910 new blocks are reserved out of the free space btrees.
2912 However, repair holds the AGF buffer lock for the duration of the free space
2919 information changes the number of free space records, repair must re-estimate
2922 As part of committing the new btrees, repair must ensure that reverse mappings
2937 creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
2948 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2960 The full process of gathering reverse mapping records and building a new btree
2961 is described in the case study of
2962 :ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
2965 The list of candidate reaping blocks is computed by setting the bits
2972 The rest of the process of rebuilding the reverse mapping btree is discussed
2977 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2991 other owner, to avoid re-adding crosslinked blocks to the AGFL.
2995 5. The next operation to fix the freelist will right-size the list.
3000 --------------------
3003 ("dinodes") and an in-memory ("cached") representation.
3006 badly damaged that the filesystem cannot load the in-memory representation.
3008 specialized resource acquisition functions that return either the in-memory
3013 is necessary to get the in-core structure loaded.
3018 Once the in-memory representation is loaded, repair can lock the inode and can
3020 Most inode attributes are easy to check and constrain, or are user-controlled
3029 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
3033 --------------------
3036 an in-memory representation, and hence are subject to the same cache coherency
3041 whatever is necessary to get the in-core structure loaded.
3042 Once the in-memory representation is loaded, the only attributes needing
3050 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
3056 --------------------------------
3058 Filesystem summary counters track availability of filesystem resources such
3064 For performance reasons, XFS also maintains incore copies of those counters,
3066 Writer threads reserve the worst-case quantities of resources from the
3076 percpu counter, which means that each CPU is allocated a batch of blocks from a
3079 The high-performance nature of the summary counters makes it difficult for
3083 values of the summary counters, there's no way to hold the value of a percpu
3084 counter stable, so it's quite possible that the counter will be out of date by
3086 Earlier versions of online scrub would return to userspace with an incomplete
3088 For repairs, the in-memory counters must be stabilized while walking the
3097 inode btrees, and the realtime bitmap to compute the correct value of all
3099 This is very similar to a filesystem freeze, though not all of the pieces are
3102 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3106 - It does not quiesce the log.
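
A conceptual model of why stabilization is needed is shown below.
It is not the kernel's ``percpu_counter`` or freeze implementation; a
reader/writer lock merely stands in for the freeze so that no reservations
race with the summation.

.. code-block:: c

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NR_CPUS 4

    /*
     * Toy model of a batched free-block counter: each "CPU" holds a
     * batch carved out of the global count, so summing the pieces while
     * writers are running gives a stale answer.  Stabilizing the
     * counter (here, a write lock standing in for the freeze) blocks
     * reservations so the sum is coherent while the checker walks the
     * ondisk metadata.
     */
    static int64_t global_free = 1000;
    static int64_t cpu_batch[NR_CPUS];
    static pthread_rwlock_t stabilize = PTHREAD_RWLOCK_INITIALIZER;

    /* A writer reserving blocks from its local batch. */
    static void reserve_blocks(int cpu, int64_t nr)
    {
        pthread_rwlock_rdlock(&stabilize);
        cpu_batch[cpu] -= nr;       /* may go negative until refilled */
        pthread_rwlock_unlock(&stabilize);
    }

    /* The checker: no reservations can race with this sum. */
    static int64_t stable_sum(void)
    {
        int64_t sum;

        pthread_rwlock_wrlock(&stabilize);
        sum = global_free;
        for (int i = 0; i < NR_CPUS; i++)
            sum += cpu_batch[i];
        pthread_rwlock_unlock(&stabilize);
        return sum;
    }

    int main(void)
    {
        reserve_blocks(0, 10);
        reserve_blocks(2, 25);
        printf("stabilized free block count: %lld\n",
               (long long)stable_sum());
        return 0;
    }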
3111 +--------------------------------------------------------------------------+
3113 +--------------------------------------------------------------------------+
3120 | - Other programs can unfreeze the filesystem without our knowledge. |
3123 | - Adding an extra lock to prevent others from thawing the filesystem |
3124 | required the addition of a ``->freeze_super`` function to wrap |
3133 | superblock, but it felt suboptimal given the other inadequacies of |
3136 | - The log need not be quiesced to check the summary counters, but a VFS |
3140 | - Quiescing the log means that XFS flushes the (possibly incorrect) |
3141 | counters to disk as part of cleaning the log. |
3143 | - A bug in the VFS meant that freeze could complete even when |
3146 +--------------------------------------------------------------------------+
3150 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3154 ---------------------
3156 Certain types of metadata can only be checked by walking every file in the
3159 Like every other type of online repair, repairs are made by writing those
3162 hundreds of billions of files because the downtime would be excessive.
3163 Therefore, online fsck must build the infrastructure to manage a live scan of
3167 - How does scrub manage the scan while it is collecting data?
3169 - How does the scan keep abreast of changes being made to the system by other
3177 In the original Unix filesystems of the 1970s, each directory entry contained
3179 (*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3183 UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
3184 Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3185 `"Implementation of the File System"
3186 <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3187 Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3188 1913-4.
3190 XFS retains most of this design, except now inumbers are search keys over all
3192 They form a continuous keyspace that can be expressed as a 64-bit integer,
3199 The first part of this scan cursor object tracks the inode that will be
3201 Somewhat less obviously, the scan cursor object must also track which parts of
3206 Advancing the scan cursor is a multi-step process encapsulated in
3209 1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
3214 2. Use the per-AG inode btree to look up the next inumber after the one that
3219 a. Move the examination cursor to the point of the inumber keyspace that
3220 corresponds to the start of the next AG.
3225 visited the entire keyspace up to just before the start of the next AG's
3231 d. If there are no more AGs to examine, set both cursors to the end of the
3243 created in the part of the inode keyspace that the visited inode cursor
3246 5. Get the incore inode for the inumber of the examination cursor.
3282 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
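
A drastically simplified model of the two-cursor scan follows.
The real cursor tracks AG boundaries, locks the AGI while probing the inode
btree, and copes with untrusted inumbers; here a sorted array stands in for
the inode btrees.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Toy model of the two-cursor inode scan: "visited" marks the
     * inumber keyspace that has been fully examined, and "examine"
     * points at the next allocated inode to load.
     */
    struct iscan {
        uint64_t visited;   /* keyspace <= visited has been scanned */
        uint64_t examine;   /* inode currently being examined */
    };

    /* Allocated inumbers, as the inode btrees would report them, sorted. */
    static const uint64_t itable[] = { 128, 129, 133, 1024, 4096 };
    #define NR_INODES (sizeof(itable) / sizeof(itable[0]))
    #define KEYSPACE_END UINT64_MAX

    /* Advance to the next allocated inode after the visited cursor. */
    static bool iscan_advance(struct iscan *sc)
    {
        for (size_t i = 0; i < NR_INODES; i++) {
            if (itable[i] > sc->visited) {
                sc->examine = itable[i];
                return true;
            }
        }
        /* Nothing left: mark the whole keyspace as visited. */
        sc->visited = KEYSPACE_END;
        return false;
    }

    int main(void)
    {
        struct iscan sc = { .visited = 0 };

        while (iscan_advance(&sc)) {
            printf("examining inode %llu\n",
                   (unsigned long long)sc.examine);
            sc.visited = sc.examine;    /* mark it visited afterwards */
        }
        return 0;
    }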
3284 The first user of the new functionality is the
3286 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3293 always obtained (``xfs_iget``) outside of transaction context because the
3294 creation of the incore context for an existing file does not require metadata
3297 part of file creation must be performed in transaction context because the
3298 filesystem must ensure the atomicity of the ondisk inode btree index updates
3299 and the initialization of the actual ondisk inode.
3301 References to incore inodes are always released (``xfs_irele``) outside of
3302 transaction context because there are a handful of activities that might
3305 - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3308 - Speculative preallocations need to be unreserved.
3310 - An unlinked file may have lost its last reference, in which case the entire
3311 file must be inactivated, which involves releasing all of its resources in
3315 Inactivation has two parts -- the VFS part, which initiates writeback on all
3316 dirty file pages, and the XFS part, which cleans up XFS-specific information
3352 an object that normally is acquired in a later stage of the locking order, and
3353 then decide to cross-reference the object with an object that is acquired
3361 An inode scan performed on behalf of a scrub operation runs in transaction
3363 This isn't much of a problem for ``iget`` since it can operate in the context
3364 of an existing transaction, as long as all of the bound resources are acquired
3369 save time if another process re-opens the file before the system runs out
3370 of memory and frees it.
3371 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3386 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3388 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`…
3396 in a well-known order: parent → child when updating the directory tree, and
3397 in numerical order of the addresses of their ``struct inode`` object otherwise.
3400 If two MMAPLOCKs must be acquired, they are acquired in numerical order of
3401 the addresses of their ``struct address_space`` objects.
3402 Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
3408 scanner, the scrub process holds the IOLOCK of the file being scanned and it
3409 needs to take the IOLOCK of the file at the other end of the directory link.
3414 Solving both of these problems is straightforward -- any time online fsck
3415 needs to take a second lock of the same class, it uses trylock to avoid an ABBA
3431 Online fsck must verify that the dotdot dirent of a directory points up to a
3435 walk of every directory on the filesystem while holding the child locked, and
3438 possibility of missing an inode.
3448 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
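
The trylock-and-back-off pattern described above can be sketched with
ordinary POSIX mutexes; the kernel code obviously uses the inode locks and
its own retry policy instead.

.. code-block:: c

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Trylock-and-back-off model: the scanner already holds the lock of
     * the file being checked and must not block waiting for a second
     * lock of the same class, or it risks an ABBA deadlock with a
     * thread locking in the opposite order.  If the trylock fails, drop
     * everything and retry.
     */
    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static bool lock_both(pthread_mutex_t *held, pthread_mutex_t *second)
    {
        pthread_mutex_lock(held);
        if (pthread_mutex_trylock(second) == 0)
            return true;            /* got both, safe to proceed */
        pthread_mutex_unlock(held); /* back off completely */
        return false;
    }

    int main(void)
    {
        int tries = 0;

        while (!lock_both(&lock_a, &lock_b)) {
            tries++;
            /* In the kernel this would relax, or give up and retry later. */
        }
        printf("locked both after %d retries\n", tries);
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return 0;
    }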
3456 The second piece of support that online fsck functions need during a full
3460 Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
3467 notifier call chain facility to dispatch updates to any number of interested
3474 The current implementation of XFS hooks uses SRCU notifier chains to reduce the
3477 overhead for single-threaded applications.
3478 However, it may turn out that the combination of blocking chains and static
3483 - A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3484 a well-known incore filesystem object.
3486 - Each hook must define an action code and a structure containing more context
3489 - Hook providers should provide appropriate wrapper functions and structs
3490 around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
3493 - A callsite in the regular filesystem code must be chosen to call
3500 However, the exact requirements are very dependent on the context of the hook
3503 - The online fsck function should define a structure to hold scan data, a lock
3508 - The online fsck code must contain a C function to catch the hook action code
3513 - Prior to unlocking inodes to start the scan, online fsck must call
3517 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3520 The number of hooks should be kept to a minimum to reduce complexity.
3521 Static keys are used to reduce the overhead of filesystem hooks to nearly
3529 The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
3561 - Prior to invoking the notifier call chain, the filesystem function being
3565 - The scanning function and the scrub hook function must coordinate access to
3568 - Scrub hook function must not add the live update information to the scan
3573 - Scrub hook functions must not change the caller's state, including the
3578 - The hook function can abort the inode scan to avoid breaking the other rules.
3582 - ``xchk_iscan_start`` starts a scan
3584 - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
3587 - ``xchk_iscan_want_live_update`` to decide if an inode has already been
3590 in-memory scan information.
3592 - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
3595 - ``xchk_iscan_teardown`` to finish the scan
3597 This functionality is also a part of the
3599 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
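
The coordination rule at the heart of these functions -- fold a live update
into the scan data only if the object has already been visited -- can be
sketched as follows.
The structure and function names are illustrative stand-ins, not the
``xchk_iscan_*`` interfaces.

.. code-block:: c

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Userspace model of live-scan coordination.  Updates for objects
     * that the scan cursor has already passed are folded into the scan
     * data; updates for objects not yet visited are ignored because the
     * scanner will observe them directly.
     */
    struct scan_state {
        pthread_mutex_t lock;
        uint64_t visited;       /* inumbers <= visited are in the scan data */
        uint64_t observations;  /* stand-in for the real per-inode records */
    };

    static bool want_live_update(struct scan_state *ss, uint64_t ino)
    {
        return ino <= ss->visited;
    }

    /* Called from the (modelled) filesystem hook after a metadata update. */
    static void hook_update(struct scan_state *ss, uint64_t ino, int delta)
    {
        pthread_mutex_lock(&ss->lock);
        if (want_live_update(ss, ino))
            ss->observations += delta;
        pthread_mutex_unlock(&ss->lock);
    }

    int main(void)
    {
        struct scan_state ss = {
            .lock = PTHREAD_MUTEX_INITIALIZER,
            .visited = 500,
        };

        hook_update(&ss, 100, 1);   /* already scanned: folded in */
        hook_update(&ss, 900, 1);   /* not yet scanned: dropped   */
        printf("observations = %llu\n",
               (unsigned long long)ss.observations);
        return 0;
    }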
3657 The step 2 hook creates a shadow version of the transaction dquot context
3691 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3701 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3706 1. If the entry is a dotdot (``'..'``) entry of the root directory, the
3710 2. If the entry is a dotdot entry of a subdirectory, the parent's backref
3726 Live update hooks are carefully placed in all parts of the filesystem that
3730 For any file, the correct link count is the number of parents plus the number
3731 of child subdirectories.
3732 Non-directories never have children of any kind.
3733 The backref information is used to detect inconsistencies in the number of
3734 links pointing to child subdirectories and the number of dotdot entries
3737 After the scan completes, the link count of each file can be checked by locking
3749 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
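
The counting rule stated above is simple enough to express directly; the
sketch below encodes it with hypothetical observation records in place of
the real xfarray entries.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Minimal model of the link count rule: a file's expected link
     * count is the number of parent directories pointing at it plus,
     * for a directory, one backreference per child subdirectory.
     */
    struct nlink_obs {
        uint64_t parents;       /* dirents pointing at this file */
        uint64_t child_subdirs; /* subdirectories whose '..' points here */
        bool     is_dir;
    };

    static uint64_t expected_nlink(const struct nlink_obs *obs)
    {
        if (!obs->is_dir)
            return obs->parents;    /* non-directories have no children */
        return obs->parents + obs->child_subdirs;
    }

    int main(void)
    {
        struct nlink_obs dir = {
            .parents = 1, .child_subdirs = 3, .is_dir = true,
        };
        struct nlink_obs reg = { .parents = 2, .is_dir = false };

        printf("directory expects %llu links, regular file expects %llu\n",
               (unsigned long long)expected_nlink(&dir),
               (unsigned long long)expected_nlink(&reg));
        return 0;
    }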
3759 and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3760 The primary advantage of this approach is the simplicity and modularity of the
3761 repair code -- code and data are entirely contained within the scrub module,
3764 A secondary advantage of this repair approach is atomicity -- once the kernel
3768 For repairs going on within a shard of the filesystem, these advantages
3769 outweigh the delays inherent in locking the shard while repairing parts of the
3772 btree repair strategy because it must scan every space mapping of every fork of
3776 <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3788 can receive updates to the rmap btree from the rest of the filesystem during
3791 5. For each space mapping found in either fork of each file scanned,
3792 decide if the mapping matches the AG of interest.
3795 a. Create a btree cursor for the in-memory btree.
3797 b. Use the rmap code to add the record to the in-memory btree.
3806 a. Create a btree cursor for the in-memory btree.
3808 b. Replay the operation into the in-memory btree.
3818 8. Compute the new btree geometry using the number of rmap records in the
3821 9. Allocate the number of blocks computed in the previous step.
3833 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3837 --------------------------------------------
3839 XFS stores a substantial amount of metadata in file forks: directories,
3842 File forks map 64-bit logical file fork space extents to physical storage space
3843 extents, similar to how a memory management unit maps 64-bit virtual addresses
3845 Therefore, file-based tree structures (such as directories and extended
3847 to other blocks mapped within that same address space, and file-based linear
3853 Therefore, online repair of file-based metadata creates a temporary file in
3867 field of the block headers to match the file being repaired and not the
3872 There is a downside to the reaping process -- if the system crashes during the
3884 +--------------------------------------------------------------------------+
3886 +--------------------------------------------------------------------------+
3887 | In the initial iteration of file metadata repair, the damaged metadata |
3891 | This strategy did not survive the introduction of the atomic repair |
3901 | - Array structures are linearly addressed, and the regular filesystem |
3902 | codebase does not have the concept of a linear offset that could be |
3905 | - Extended attributes are allowed to use the entire attr fork offset |
3908 | - Even if repair could build an alternate copy of a data structure in a |
3909 | different part of the fork address space, the atomic repair commit |
3914 | - A crash after construction of the secondary tree but before the range |
3918 | - Reaping blocks after a repair is not a simple operation, and |
3922 | - Directory entry blocks and quota records record the file fork offset |
3923 | in the header area of each block. |
3924 | An atomic range collapse operation would have to rewrite this part of |
3927 | it's something to be aware of. |
3929 | - Each block in a directory or extended attributes btree index contains |
3934 | Doing this as part of a range collapse means rewriting a large number |
3935 | of blocks repeatedly, which is not conducive to quick repairs. |
3937 This led to the introduction of temporary file staging.
3938 +--------------------------------------------------------------------------+
3945 This allocates an inode, marks the in-core inode private, and attaches it to
3953 The usage patterns of these two locks are the same as for any other XFS file --
3964 1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
3970 Once a good copy of a data file has been constructed in a temporary file, it
3971 must be conveyed to the file being repaired, which is the topic of the next
3976 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
3980 -----------------------------
3984 It is not possible to swap the inumbers of two files, so instead the new
3990 a. When the reverse-mapping btree is enabled, the swap code must keep the
3991 reverse mapping information up to date with every exchange of mappings.
3995 b. Reverse-mapping is critical for the operation of online fsck, so the old
4001 For this use case, an incomplete exchange will not result in a user-visible
4004 d. Online repair needs to swap the contents of two files that are by definition
4006 For directory and xattr repairs, the user-visible contents might be the
4007 same, but the contents of individual blocks may be very different.
4009 e. Old blocks in the file may be cross-linked with another structure and must
4010 not reappear if the system goes down mid-repair.
4013 of log intent item to track the progress of an operation to exchange two file
4016 the reverse-mapping extent swap code, but records intermediate progress in the
4021 The new log item records the progress of the exchange to ensure that once an
4030 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
4033 +--------------------------------------------------------------------------+
4034 | **Sidebar: Using Log-Incompatible Feature Flags** |
4035 +--------------------------------------------------------------------------+
4045 | filesystem, either as part of an unmount or because the system is |
4052 | The log coordinates access to incompatible features through the use of |
4061 | obtain log assistance must be called just prior to the creation of the |
4070 | Log-assisted extended attribute updates and file content exchanges both  |
4073 +--------------------------------------------------------------------------+
4075 Mechanics of a Logged File Content Exchange
4081 There are likely to be many extent mappings in each fork, and the edges of
4084 such as exchanging file sizes, inode flags, or conversion of fork data to local
4086 This is roughly the format of the new deferred exchange-mapping work item:
4088 .. code-block:: c
4111 Each step of an exchange operation exchanges the largest file range mapping
4117 mappings instead of the data fork and other work to be done after the exchange.
4118 The two isize fields are used to exchange the file sizes at the end of the
4119 operation if the file data fork is the target of the operation.
4121 When the exchange is initiated, the sequence of operations is as follows:
4124 At the start, it should contain the entirety of the file block ranges to be
4132 3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
4134 a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
4137 This is the minimum of the two ``br_blockcount`` values in the mappings.
4138 Keep advancing through the file forks until at least one of the mappings
4156 g. Extend the ondisk size of either file if necessary.
4159 item that was read at the start of step 3.
4161 i. Compute the amount of file range that has just been covered.
4162 This quantity is ``(map1.br_startoff + map1.br_blockcount -
4165 j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
4166 by the number of blocks computed in the previous step, and decrease
4171 of the work item.
4175 The operation manager completes the deferred work in steps 3b-3e before
4176 moving back to the start of step 3.
4178 4. Perform any post-processing.
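
The stepping logic of step 3 can be condensed into the following model.
The fields mirror the ``xmi_*`` members loosely; logging, reverse mapping
updates, quota adjustments, and the remapping itself are omitted, with the
``printf`` standing in for the actual exchange.

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    struct map {
        uint64_t startoff;      /* file offset, in blocks */
        uint64_t startblock;    /* physical block */
        uint64_t blockcount;
    };

    struct exchange_item {
        uint64_t startoff1;
        uint64_t startoff2;
        uint64_t blockcount;    /* blocks remaining to exchange */
    };

    /* Hypothetical lookup: find the mapping containing @off in a file. */
    static struct map lookup(const struct map *tbl, size_t nr, uint64_t off)
    {
        for (size_t i = 0; i < nr; i++)
            if (off >= tbl[i].startoff &&
                off < tbl[i].startoff + tbl[i].blockcount)
                return tbl[i];
        return (struct map){ 0 };
    }

    int main(void)
    {
        struct map file1[] = { { 0, 100, 8 } };
        struct map file2[] = { { 0, 500, 3 }, { 3, 900, 5 } };
        struct exchange_item xmi = { 0, 0, 8 };

        while (xmi.blockcount > 0) {
            struct map m1 = lookup(file1, 1, xmi.startoff1);
            struct map m2 = lookup(file2, 2, xmi.startoff2);

            /* Largest step covered by one mapping in each file. */
            uint64_t len1 = m1.startoff + m1.blockcount - xmi.startoff1;
            uint64_t len2 = m2.startoff + m2.blockcount - xmi.startoff2;
            uint64_t step = len1 < len2 ? len1 : len2;

            if (step > xmi.blockcount)
                step = xmi.blockcount;

            printf("exchange %llu blocks at offsets %llu/%llu\n",
                   (unsigned long long)step,
                   (unsigned long long)xmi.startoff1,
                   (unsigned long long)xmi.startoff2);

            /* Relog the work item with the remaining range. */
            xmi.startoff1 += step;
            xmi.startoff2 += step;
            xmi.blockcount -= step;
        }
        return 0;
    }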
4181 If the filesystem goes down in the middle of an operation, log recovery will
4185 will either see the old broken structure or the new one, and never a mishmash of
4191 There are a few things that need to be taken care of before initiating an
4196 maximum amount of disk space and quota that can be consumed on behalf of both
4197 files in the operation, and reserve that quantity of resources to avoid an
4198 unrecoverable out of space failure once it starts dirtying metadata.
4199 The preparation step scans the ranges of both files to estimate:
4201 - Data device blocks needed to handle the repeated updates to the fork
4203 - Change in data and realtime block counts for both files.
4204 - Increase in quota usage for both files, if the two files do not share the
4205 same set of quota ids.
4206 - The number of extent mappings that will be added to each file.
4207 - Whether or not there are partially written realtime extents.
4212 The need for precise estimation increases the run time of the exchange
4214 The filesystem must not run completely out of free space, nor can the mapping
4226 - If both forks are in local format and the fork areas are large enough, the
4232 - If both forks map blocks, then the regular atomic file mapping exchange is
4235 - Otherwise, only one fork is in local format.
4236 The contents of the local format fork are converted to a block to perform the
4251 repair builds every block in the new data structure with the owner field of the
4256 extent reaping <reaping>` mechanism that is done post-repair.
4257 If the filesystem should go down during the reap part of the repair, the
4258 iunlink processing at the end of recovery will free both the temporary file and
4260 However, this iunlink processing omits the cross-link detection of online
4279 xfs_exchmaps_req`` with the details of the exchange operation.
4290 In the "realtime" section of an XFS filesystem, free space is tracked via a
4292 Each bit in the bitmap represents one realtime extent, which is a multiple of
4294 The realtime summary file indexes the number of free extents of a given size to
4295 the offset of the block within the realtime free space bitmap where those free
4302 partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
4303 counters to match the number of blocks in the rt bitmap.
4304 Each counter records the number of free extents that start in that bitmap block
4305 and can satisfy a power-of-two allocation request.
4309 1. Take the ILOCK of both the realtime bitmap and summary files.
4320 3. Compare the contents of the xfile against the ondisk file.
4328 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
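
The computation that the shadow copy performs can be illustrated with a toy
bitmap.
The sizes below are arbitrary, and the real code accumulates the counters in
an xfile rather than a stack array.

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Toy recomputation of a realtime summary from a free-space bitmap:
     * each free run is accounted to summary[log2(length)][bitmap block
     * in which the run starts].
     */
    #define RTEXTS          32      /* realtime extents in this toy */
    #define EXTS_PER_BBLOCK 8       /* rtextents tracked per bitmap block */
    #define LOG2_MAX        6

    static int ilog2(uint64_t v)
    {
        int r = -1;

        while (v) {
            v >>= 1;
            r++;
        }
        return r;
    }

    int main(void)
    {
        /* 1 = free realtime extent, 0 = allocated. */
        static const int bitmap[RTEXTS] = {
            1, 1, 1, 1, 0, 0, 1, 0,
            0, 0, 0, 0, 1, 1, 0, 0,
            1, 1, 1, 1, 1, 1, 1, 1,
            0, 0, 0, 0, 0, 0, 1, 1,
        };
        uint32_t summary[LOG2_MAX][RTEXTS / EXTS_PER_BBLOCK] = { 0 };

        for (int i = 0; i < RTEXTS; ) {
            if (!bitmap[i]) {
                i++;
                continue;
            }
            int start = i, len = 0;

            while (i < RTEXTS && bitmap[i]) {
                len++;
                i++;
            }
            summary[ilog2(len)][start / EXTS_PER_BBLOCK]++;
        }

        for (int l = 0; l < LOG2_MAX; l++)
            for (int b = 0; b < RTEXTS / EXTS_PER_BBLOCK; b++)
                if (summary[l][b])
                    printf("log2 size %d, bitmap block %d: %u free runs\n",
                           l, b, summary[l][b]);
        return 0;
    }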
4334 In XFS, extended attributes are implemented as a namespaced name-value store.
4335 Values are limited in size to 64KiB, but there is no limit on the number of
4337 The attribute fork is unpartitioned, which means that the root of the attribute
4340 Attribute leaf blocks contain variable-sized records that associate
4341 user-provided names with the user-provided values.
4344 btree (``dabtree``) is created to map hashes of attribute names to entries
4349 1. Walk the attr fork mappings of the file being repaired to find the attribute
4362 2. If the memory usage of the xfarray and xfblob exceed a certain amount of
4374 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
4378 ------------------
4383 and then it scans all directories to establish parentage of those linked files.
4391 The salvage process is discussed in the case study at the end of this section.
4392 The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
4401 1. Find the parent of the directory.
4406 2. Walk the first partition of data fork of the directory to find the directory
4419 3. If the memory usage of the xfarray and xfblob exceed a certain amount of
4436 ensure that one of the following apply:
4448 every child dentry of a specific directory, which makes this a hard problem.
4453 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4459 A parent pointer is a piece of file metadata that enables a user to locate the
4462 Without them, reconstruction of directory trees is hindered in much the same
4463 way that the historic lack of reverse space mapping information once hindered
4464 reconstruction of filesystem space metadata.
4472 The directory checking process can be strengthened to ensure that the target of
4474 Likewise, each parent pointer can be checked by ensuring that the target of
4479 +--------------------------------------------------------------------------+
4481 +--------------------------------------------------------------------------+
4490 | 1. The XFS codebase of the late 2000s did not have the infrastructure to |
4500 | 3. The extended attribute did not record the name of the directory entry |
4511 | second implementation that solves all shortcomings of the first. |
4513 | manipulations of the extended attribute structures. |
4517 | Chandan increased the maximum extent counts of both data and attribute |
4519 | to handle the maximum hardlink count of any file. |
4524 | of repair tools needing to ensure that the ``dirent_pos`` field
4561 | parent_gen)``. This format doesn't require any of the complicated |
4562 | nested name hashing of the previous suggestions. However, it was |
4569 +--------------------------------------------------------------------------+
4587 pointer references the directory of interest.
4609 6. When the scan is complete, atomically exchange the contents of the temporary
4617 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4623 Online reconstruction of a file's parent pointer information works similarly to
4635 dirent references the file of interest.
4657 6. Copy all non-parent pointer extended attributes to the temporary file.
4659 7. When the scan is complete, atomically exchange the mappings of the attribute
4660 forks of the temporary file and the file being repaired.
4667 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4670 Digression: Offline Checking of Parent Pointers
4678 1. After the set of surviving files has been established (phase 6),
4679 walk the surviving directories of each AG in the filesystem.
4680 This is already performed as part of the connectivity checks.
4696 name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
4702 a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
4705 handling the uncommon case of a directory containing multiple hardlinks
4720 c. Record the name in a per-file xfblob, and remember the xfblob
4724 name_cookie)`` tuples in a per-file slab.
4726 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
4729 3. Position one slab cursor at the start of the inode's records in the
4730 per-AG tuple slab.
4731 This should be trivial since the per-AG tuples are in child inumber
4734 4. Position a second slab cursor at the start of the per-file tuple slab.
4737 ``name_hash``, and ``name_cookie`` fields of the records under each
4740 a. If the per-AG cursor is at a lower point in the keyspace than the
4741 per-file cursor, then the per-AG cursor points to a missing parent
4743 Add the parent pointer to the inode and advance the per-AG
4746 b. If the per-file cursor is at a lower point in the keyspace than
4747 the per-AG cursor, then the per-file cursor points to a dangling
4749 Remove the parent pointer from the inode and advance the per-file
4760 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
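
The two-cursor comparison in step 4 is an ordinary merge join over two
sorted tuple sets.
The sketch below uses made-up values and flat arrays in place of the per-AG
slab and the per-file slab.

.. code-block:: c

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Compare dirent-derived tuples against parent pointer-derived
     * tuples for one file, both sorted by (parent_inum, name_hash,
     * name_cookie).  Lower on the dirent side means a parent pointer is
     * missing; lower on the pptr side means it is dangling.
     */
    struct tuple {
        uint64_t parent_inum;
        uint32_t name_hash;
        uint64_t name_cookie;
    };

    static int cmp(const struct tuple *a, const struct tuple *b)
    {
        if (a->parent_inum != b->parent_inum)
            return a->parent_inum < b->parent_inum ? -1 : 1;
        if (a->name_hash != b->name_hash)
            return a->name_hash < b->name_hash ? -1 : 1;
        if (a->name_cookie != b->name_cookie)
            return a->name_cookie < b->name_cookie ? -1 : 1;
        return 0;
    }

    int main(void)
    {
        /* What the directories say about this file... */
        struct tuple from_dirents[] = { { 128, 0x1111, 1 }, { 256, 0x2222, 2 } };
        /* ...and what the file's own parent pointers say. */
        struct tuple from_pptrs[]   = { { 256, 0x2222, 2 }, { 512, 0x3333, 3 } };
        size_t i = 0, j = 0, ni = 2, nj = 2;

        while (i < ni || j < nj) {
            int c;

            if (i == ni)
                c = 1;
            else if (j == nj)
                c = -1;
            else
                c = cmp(&from_dirents[i], &from_pptrs[j]);

            if (c < 0) {
                printf("add missing parent pointer to parent %llu\n",
                       (unsigned long long)from_dirents[i++].parent_inum);
            } else if (c > 0) {
                printf("remove dangling parent pointer to %llu\n",
                       (unsigned long long)from_pptrs[j++].parent_inum);
            } else {
                i++;    /* both agree; nothing to do */
                j++;
            }
        }
        return 0;
    }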
4764 challenging because xfs_repair currently uses two single-pass scans of the
4767 This scan would have to be converted into a multi-pass scan:
4769 1. The first pass of the scan zaps corrupt inodes, forks, and attributes
4776 blocks, if phase 4 is also capable of zapping directories.
4782 4. At the start of phase 6, space metadata have been rebuilt.
4784 the dirents and add them to the now-empty directories.
4797 Fortunately, non-directories are allowed to have multiple parents and cannot
4799 Directories typically constitute 5-10% of the files in a filesystem, which
4800 reduces the amount of work dramatically.
4809 However, one of online repair's design goals is to avoid locking the entire
4814 Directory parent pointers enable an incremental approach to validation of the
4816 Instead of using one thread to scan the entire filesystem, multiple threads can
4820 counts of all directories must be correct.
4821 Each scanner thread must be able to take the IOLOCK of an alleged parent
4822 directory while holding the IOLOCK of the child directory to prevent either
4824 This is not possible since the VFS does not take the IOLOCK of a child
4826 the parent -> child relationship by taking the ILOCKs and installing a dirent
4835 a. For each parent pointer of that subdirectory,
4858 6. For each parent pointer of this alleged ancestor directory,
4866 c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
4877 If the entry matches part of a path, mark that path and the scan stale.
4883 1. Walk each path of the target subdirectory.
4908 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
4915 -------------
4919 The root of the filesystem is a directory, and each entry in a directory points
4920 downwards either to more subdirectories or to non-directory files.
4933 This should reduce the incidence of files ending up in ``/lost+found``.
4939 Reparenting a file to the orphanage does not reset any of its permissions or
4950 1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
4954 2. If the decision is made to reconnect a file, take the IOLOCK of both the
4978 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
4984 This section discusses the key algorithms and data structures of the userspace
4991 -----------------
4993 Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
4996 In XFS, there are several groups of metadata dependencies:
5003 forks, inode indices, inode records, and the forks of every file on the
5015 metadata indices of the allocation groups and the realtime volume.
5020 g. Realtime space metadata depend on the inode records and data forks of the
5033 - Phase 1 checks that the provided path maps to an XFS filesystem and detect
5036 - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
5038 - Phase 3 scans inodes in parallel.
5041 - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
5044 - Phase 5 starts by checking groups (b) and (c) in parallel before moving on
5047 - Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
5048 to read them, and to report which blocks of which files are affected.
5050 - Phase 7 checks group (a), having validated everything else.
5053 of the program flow.
5056 --------------------
5058 An XFS filesystem can easily contain hundreds of millions of inodes.
5059 Given that XFS targets installations with large high-performance storage,
5065 Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
5071 metadata object of each inode.
5073 filesystem contains one AG with a few large sparse files and the rest of the
5076 been dispatching at the level of individual inodes, or, to constrain memory
5084 of items that can be waiting to be run.
5088 each metadata object of each inode.
5096 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-t…
5099 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalan…
5105 ------------------
5109 functioning of the inode indices to find inodes to scan.
5115 During phase 3, corruptions and inconsistencies reported in any part of a
5120 In the original design of ``xfs_scrub``, it was thought that repairs would be
5124 With recent increases in the number of optimizations possible for a given
5125 filesystem object, it became much more memory-efficient to track all eligible
5127 Each repair item represents a single lockable object -- AGs, metadata files,
5128 individual inodes, or a class of summary information.
5130 Phase 4 is responsible for scheduling a lot of repair work in as quick a
5137 1. Start a round of repair with a workqueue and enough workers to keep the CPUs
5146 of repairs needed for this object.
5161 of repairs needed for this object.
5170 2. If step 1 made any repair progress of any kind, jump back to step 1 to start
5171 another round of repair.
5184 …://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warn…
5185 refactoring of the
5187 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-d…
5190 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracki…
5193 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedu…
5197 -----------------------------------------------
5199 If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
5202 These names consist of the filesystem label, names in directory entries, and
5203 the names of extended attributes.
5204 Like most Unix filesystems, XFS imposes the sparest of constraints on the
5205 contents of a name:
5207 - Slashes and null bytes are not allowed in directory entries.
5209 - Null bytes are not allowed in userspace-visible extended attributes.
5211 - Null bytes are not allowed in the filesystem label.
5213 Directory entries and attribute keys store the length of the name explicitly
5216 presented together -- all the names in a directory, or all the attributes of a
5219 Although the Unix naming constraints are very permissive, the reality of most
5220 modern-day Linux systems is that programs work with Unicode character code
5222 These programs typically encode those code points in UTF-8 when interfacing
5223 with the C library because the kernel expects null-terminated names.
5225 UTF-8 encoded Unicode data.
5233 The standard also permits characters to be constructed in multiple ways --
5242 characters to alter the presentation of text.
5243 For example, the character "Right-to-Left Override" U+202E can trick some
5245 A second category of rendering problems involves whitespace characters.
5257 sections 4 and 5 of the
5260 When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
5262 detection component of
5263 `libicu <https://github.com/unicode-org/icu>`_
5266 Names are also checked for control characters, non-rendering characters, and
5267 mixing of bidirectional characters.
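
For illustration only, a hand-rolled check for a few of these code points
might look like the sketch below; ``xfs_scrub`` itself relies on libicu for
normalization, confusable detection, and the full character classification.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Flag a few classes of troublesome code points in a UTF-8 name:
     * ASCII controls, zero-width marks, and bidirectional controls such
     * as U+202E (RIGHT-TO-LEFT OVERRIDE).
     */
    static bool suspicious_cp(uint32_t cp)
    {
        if (cp < 0x20 || cp == 0x7f)            /* ASCII controls */
            return true;
        if (cp >= 0x200b && cp <= 0x200f)       /* zero-width, marks */
            return true;
        if (cp >= 0x202a && cp <= 0x202e)       /* bidi controls */
            return true;
        return false;
    }

    /* Decode one UTF-8 sequence; returns bytes consumed, 0 on bad input. */
    static int decode_utf8(const unsigned char *s, uint32_t *cp)
    {
        if (s[0] < 0x80) {
            *cp = s[0];
            return 1;
        }
        if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
            *cp = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
            return 2;
        }
        if ((s[0] & 0xf0) == 0xe0 && (s[1] & 0xc0) == 0x80 &&
            (s[2] & 0xc0) == 0x80) {
            *cp = ((s[0] & 0x0f) << 12) | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
            return 3;
        }
        if ((s[0] & 0xf8) == 0xf0 && (s[1] & 0xc0) == 0x80 &&
            (s[2] & 0xc0) == 0x80 && (s[3] & 0xc0) == 0x80) {
            *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3f) << 12) |
                  ((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
            return 4;
        }
        return 0;
    }

    int main(void)
    {
        /* "invoice" + U+202E + "txt.exe" */
        const unsigned char name[] = "invoice\xe2\x80\xaetxt.exe";

        for (int i = 0; name[i]; ) {
            uint32_t cp;
            int len = decode_utf8(&name[i], &cp);

            if (!len) {
                printf("invalid UTF-8 at byte %d\n", i);
                break;
            }
            if (suspicious_cp(cp))
                printf("suspicious code point U+%04X at byte %d\n",
                       (unsigned int)cp, i);
            i += len;
        }
        return 0;
    }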
5268 All of these potential issues are reported to the system administrator during
5271 Media Verification of File Data Extents
5272 ---------------------------------------
5274 The system administrator can elect to initiate a media scan of all file data
5276 This scan runs after validation of all filesystem metadata (except for the summary
5283 verification request is sent to the disk as a directio read of the raw block
5286 If the verification read fails, ``xfs_scrub`` retries with single-block reads
5287 to narrow down the failure to the specific region of the media, which is then recorded.
5292 construct file paths from inode numbers for user-friendly reporting.
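
A stripped-down version of the verification read might look like this; the
device path and the byte range are placeholders, and the real program
derives both from the file mappings it has already checked.

.. code-block:: c

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLKSZ 4096

    /*
     * Read-verify a byte range of a block device with O_DIRECT,
     * retrying block by block on failure to localize the bad region.
     */
    static void media_verify(int fd, off_t start, off_t len, void *buf)
    {
        if (pread(fd, buf, len, start) == len)
            return;                     /* whole region read fine */

        for (off_t off = start; off < start + len; off += BLKSZ) {
            if (pread(fd, buf, BLKSZ, off) != BLKSZ)
                fprintf(stderr, "media error near byte %lld\n",
                        (long long)off);
        }
    }

    int main(void)
    {
        void *buf;
        int fd = open("/dev/sda1", O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, BLKSZ, 1024 * 1024))
            return 1;

        media_verify(fd, 0, 1024 * 1024, buf);
        free(buf);
        close(fd);
        return 0;
    }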
5297 It is hoped that the reader of this document has followed the designs laid out
5299 rebuilding of its metadata indices, and how filesystem users can interact with
5301 Although the scope of this work is daunting, it is hoped that this guide will
5307 ----------------------
5313 necessary refinements to online repair and lack of customer demand mean that
5321 The earliest form of this was the fork swap mechanism, where the entire
5322 contents of data forks could be exchanged between two files by exchanging the
5324 When XFS v5 came along with self-describing metadata, this old mechanism grew
5325 some log support to continue rewriting the owner fields of BMBT blocks during
5328 the consistency of the fork mappings with the reverse mapping index was to
5331 This mechanism is identical to steps 2-3 from the procedure above except for
5333 an iteration of an existing mechanism and not something totally novel.
5334 For the narrow case of file defragmentation, the file contents must be
5335 identical, so the recovery guarantees are not much of a gain.
5338 implementations because it can guarantee that the caller never sees a mix of
5343 - **Atomic commit of file writes**: A userspace process opens a file that it
5350 committing all of the updates to the original file, or none of them.
5354 - **Transactional file updates**: The same mechanism as above, but the caller
5358 change timestamps of the original file before reflinking its data to the
5366 - **Emulation of atomic block device writes**: Export a block device with a
5376 ----------------
5378 As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
5380 Since 2018, the cost of making a kernel call has increased considerably on some
5381 systems because of mitigations for speculative execution attacks.
5383 reduce the number of times an execution path crosses a security boundary.
5385 With vectorized scrub, userspace pushes to the kernel the identity of a
5386 filesystem object, a list of scrub types to run against that object, and a
5387 simple representation of the data dependencies between the selected scrub
5389 The kernel executes as much of the caller's plan as it can until it hits a
5392 It is hoped that ``io_uring`` will pick up enough of this functionality that
5393 online fsck can use that instead of adding a separate vectored scrub system
5398 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
5401 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
5404 Quality of Service Targets for Scrub
5405 ------------------------------------
5407 One serious shortcoming of the online fsck code is that the amount of time that
5409 Userspace is allowed to send a fatal signal to the process which will cause
5420 ------------------------
5422 Over the years, many XFS users have requested the creation of a program to
5423 clear a portion of the physical storage underlying a filesystem so that it
5424 becomes a contiguous chunk of free space.
5429 This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
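
A minimal query of that ioctl, loosely following the pattern documented in
ioctl_getfsmap(2), is sketched below; the mount point is a placeholder and
the loop needed to fetch more than one batch of records is omitted.

.. code-block:: c

    #include <fcntl.h>
    #include <limits.h>
    #include <linux/fsmap.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /*
     * Ask the kernel who owns each piece of physical space on the
     * filesystem containing the given path.
     */
    int main(void)
    {
        struct fsmap_head *head;
        int fd = open("/mnt", O_RDONLY);

        if (fd < 0)
            return 1;

        head = calloc(1, sizeof(*head) + 128 * sizeof(struct fsmap));
        if (!head)
            return 1;

        head->fmh_count = 128;
        /* Low key stays zeroed; high key covers the whole keyspace. */
        head->fmh_keys[1].fmr_device = UINT_MAX;
        head->fmh_keys[1].fmr_physical = ULLONG_MAX;
        head->fmh_keys[1].fmr_owner = ULLONG_MAX;
        head->fmh_keys[1].fmr_offset = ULLONG_MAX;
        head->fmh_keys[1].fmr_flags = UINT_MAX;

        if (ioctl(fd, FS_IOC_GETFSMAP, head) < 0) {
            perror("FS_IOC_GETFSMAP");
            return 1;
        }

        for (unsigned int i = 0; i < head->fmh_entries; i++)
            printf("dev %u: physical %llu, length %llu, owner %llu\n",
                   head->fmh_recs[i].fmr_device,
                   (unsigned long long)head->fmh_recs[i].fmr_physical,
                   (unsigned long long)head->fmh_recs[i].fmr_length,
                   (unsigned long long)head->fmh_recs[i].fmr_owner);

        free(head);
        close(fd);
        return 0;
    }

A real caller would keep advancing the low key past the last record returned
until the kernel reports no more entries.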
5436 To clear all the metadata out of a portion of physical storage, clearspace
5437 uses the new fallocate map-freespace call to map any free space in that region
5439 Next, clearspace finds all metadata blocks in that region by way of
5446 To clear all the file data out of a portion of the physical storage, clearspace
5449 of the file to try to share the physical space with a dummy file.
5451 contents; any changes will be written somewhere else via copy-on-write.
5452 Clearspace makes its own copy of the frozen extent in an area that is not being
5460 To clear a piece of physical storage that has a high sharing factor, it is
5481 **Future Work Question**: Can static keys be used to minimize the cost of
5490 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
5493 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
5497 ---------------------
5499 Removing the end of the filesystem ought to be a simple matter of evacuating
5500 the data and metadata at the end of the filesystem, and handing the freed space
5502 That requires an evacuation of the space at the end of the filesystem, which is a
5503 use of free space defragmentation!