1 .. SPDX-License-Identifier: GPL-2.0
8 Heading 3 uses "----"
25 - To help kernel distributors understand exactly what the XFS online fsck
28 - To help people reading the code to familiarize themselves with the relevant
31 - To help developers maintaining the system by capturing the reasons
41 Part 1 defines what fsck tools are and the motivations for writing a new one.
42 Parts 2 and 3 present a high level overview of how the online fsck process works
54 1. What is a Filesystem Check?
57 A Unix filesystem has four main responsibilities:
59 - Provide a hierarchy of names through which application programs can associate
62 - Virtualize physical storage media across those names, and
64 - Retrieve the named data blobs at any time.
66 - Examine resource usage.
76 The filesystem check (fsck) tool examines all the metadata in a filesystem
79 cross-references different types of metadata records with each other to look
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
84 the filesystem metadata to a consistent state, not to maximize the data
92 it is now possible to regenerate data structures when non-catastrophic errors
95 +--------------------------------------------------------------------------+
97 +--------------------------------------------------------------------------+
103 +--------------------------------------------------------------------------+
106 -----------------------
109 …nges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`…
110 …ps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-servi…
111 …ges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
116 --------------
132 It uses extent-based in-memory data structures to reduce memory consumption,
141 -----------------
146 shutdowns occur as a result of silent corruptions in the metadata.
149 2. **Users** experience a **total loss of service** during the recovery period
152 3. **Users** experience a **total loss of service** if the filesystem is taken
157 This may expose them to substantial billing costs when a linear media scan
160 5. **System administrators** cannot **schedule** a maintenance window to deal
172 benefit, the proposed solution is a third fsck tool that acts on a running
175 This new third program has three components: an in-kernel facility to check
176 metadata, an in-kernel facility to repair metadata, and a userspace driver
177 program to drive fsck activity on a live filesystem.
183 +--------------------------------------------------------------------------+
185 +--------------------------------------------------------------------------+
193 +--------------------------------------------------------------------------+
201 repairs on a subset of the filesystem.
204 Even if a piece of filesystem metadata can only be regenerated by scanning the
212 autonomous self-healing of XFS maximizes service availability.
217 Because it is necessary for online fsck to lock and scan live metadata objects,
226 +------------------------------------------------------------------+
228 +------------------------------------------------------------------+
231 +------------------------------------------------------------------+
233 Scrub item types are delineated in a manner consistent with the Unix design
234 philosophy, which is to say that each item should handle one aspect of a
238 -----
243 latent errors may creep in after a scrub completes.
247 A second limitation of online fsck is that it must follow the same resource
251 In other words, online fsck is not a complete replacement for offline fsck, and
252 a complete run of online fsck may take longer than offline fsck.
260 --------------
274 Each metadata structure is scheduled as a separate scrub item.
284 Each metadata structure is also scheduled as a separate scrub item.
309 6. If the caller asks for a media scan, read all allocated and written data
311 The ability to use hardware-assisted data file integrity checking is new
315 7. Re-check the summary counters and present the caller with a summary of
322 -------------------------
324 The kernel scrub code uses a three-step strategy for checking and repairing
325 the one aspect of a metadata object represented by a scrub item:
338 Repair functions generally choose to rebuild a structure from other metadata
349 --------------------------
362 - Free space and reference count information
364 - Inode records and indexes
366 - Storage mapping information for file data
368 - Directories
370 - Extended attributes
372 - Symbolic links
374 - Quota limits
383 errors and cross-references healthy records against other metadata to look for
389 Next, it stages the observations in a new ondisk structure and commits it
393 Because ``xfs_scrub`` locks a primary object for the duration of the repair,
394 this is effectively an offline repair operation performed on a subset of the
399 As a result, indexed structures can be rebuilt very quickly, and programs
402 observations and a means to write new structures to disk.
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
408 V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
410 *Extending Database Technology*, pp. 293-309, 1992.
413 in-memory array prior to formatting the new ondisk structure, which is very
414 similar to the list-based algorithm discussed in section 2.3 ("List-Based
416 However, any data structure builder that maintains a resource lock for the
429 - Reverse mapping information
431 - Directory parent pointers
439 Repairs, however, require a full scan of primary metadata, which can take a
444 Instead, repair functions set up an in-memory staging structure to store
447 index will either have the same format as the ondisk structure or a design
455 Once the scan is done, the owning object is re-locked, the live data is used to
456 write a new ondisk structure, and the repairs are committed atomically.
461 comes at a high cost to code complexity.
462 Live filesystem code has to be hooked so that the repair function can observe
464 The staging area has to become a fully functional parallel structure so that
467 sufficiently well integrated that a hook event can decide if a given update
478 2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
492 builder maintains a publicly visible cursor that tracks the progress of the
505 **Future Work Question**: Can the full scan and live update code used to
506 facilitate a repair also be used to implement a comprehensive check?
509 employed these live scans to build a shadow copy of the metadata and then
511 However, doing that is a fair amount more work than what the checking functions
513 The live scans and hooks were developed much later.
526 - Summary counts of free space and inodes
528 - File link counts from directories
530 - Quota resource usage counts
540 integer counters, the staging data need not be a fully functional mirror of the
547 <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
549 Since quotas are non-negative integer counts of resource usage, online
552 and commit those changes to a dquot side file when the transaction commits.
556 it sets attributes of the objects being scanned instead of writing them to a
562 ---------------
566 Steps can be taken to mitigate or eliminate those risks, though at a cost to
569 - **Decreased performance**: Adding metadata indices to the filesystem
576 - **Incorrect repairs**: As with all software, there might be defects in the
583 The xfsprogs build system has a configure option (``--enable-scrub=no``) that
584 disables building of the ``xfs_scrub`` binary, though this is not a risk
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
589 If the keyspaces of several metadata indices overlap in some manner but a
592 To reduce the chance that a repair will fail with a dirty transaction and
597 - **Misbehavior**: Online fsck requires many privileges -- raw IO to block
607 - **Fuzz Kiddiez**: There are many people now who seem to think that running
609 spraying exploit code onto the public mailing list for instant zero-day
616 Automated testing should front-load some of the risk while the feature is
637 of every aspect of a fsck tool until the introduction of low-cost virtual
638 machines with high-IOPS storage.
645 -------------------------------
656 The Linux filesystem community shares a common QA testing suite,
657 `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
660 would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
662 This provides a level of assurance that the kernel and the fsck tools stay in
665 ``xfs_scrub -n`` between each test to ensure that the new checking code
670 This ensures that offline repair does not crash, leave a corrupt filesystem
672 This also established a baseline for what can and cannot be repaired offline.
674 modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
675 This enables a comparison of the effectiveness of online repair as compared to
679 ---------------------------------------
681 XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
683 Before development of online fsck even began, a set of fstests were created
685 This required the creation of fstests library code that can create a filesystem
687 Next, individual test cases were created to create a test filesystem, identify
688 a single block of a specific type of metadata object, trash it with the
689 existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
693 in-kernel validation functions and the ability of the offline fsck tool to
698 In other words, for a given fstests filesystem configuration:
711 -----------------------------------------
714 infrastructure to provide a much more powerful facility: targeted fuzz testing
719 Given that fstests already contains the ability to create a filesystem
723 For a given fstests filesystem configuration:
731 * For each conceivable type of transformation that can be applied to a bit field...
738 6. Add a small quantity
739 7. Subtract a small quantity
745 2. Offline checking (``xfs_repair -n``)
747 4. Online checking (``xfs_scrub -n``)
762 These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
768 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements…
770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
772 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`…
775 --------------
777 A requirement unique to online fsck is the ability to operate on a filesystem
786 * For each scrub item type, create a test to exercise checking that item type
788 * For each scrub item type, create a test to exercise repairing that item type
790 * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
792 * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
793 force-repairing the whole filesystem doesn't cause problems.
794 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
796 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
797 remounting the filesystem read-only and read-write.
805 …git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-…
806 and the `evolution of existing per-function stress testing
807 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stre…
815 A foreground CLI process for online fsck on demand, and a background service
819 ------------------
822 metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
823 a command line.
827 Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
833 program runtime and consume a lot of bandwidth on older storage hardware.
835 The output of a foreground invocation is captured in the system log.
843 ------------------
846 provides a suite of `systemd <https://systemd.io/>`_ timers and services that
849 possible, the lowest CPU and IO priority, and in a CPU-constrained single
867 * ``xfs_scrub_all.cron`` on non-systemd systems
876 The systemd unit file definitions have been subjected to a security audit
879 This was performed via ``systemd-analyze security``, after which privileges
885 apply as low a priority to IO and CPU scheduling as possible.
891 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-se…
894 ----------------
896 XFS caches a summary of each filesystem's health status in memory.
901 download this information into a human-readable format.
902 If problems have been observed, the administrator can schedule a reduced
904 Failing that, the administrator can decide to schedule a maintenance window to
909 Would it be helpful for sysadmins to have a daemon to listen for corruption
910 notifications and initiate a repair?
912 *Answer*: These questions remain unanswered, but should be a part of the
917 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-repo…
920 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-report…
934 ------------------------
937 ondisk block header to record a magic number, a checksum, a universally
939 and a log sequence number.
940 When loading a block buffer from disk, the magic number, UUID, owner, and
948 Whenever a file system operation modifies a block, the change is submitted
949 to the log as part of a transaction.
959 These two features improve overall runtime resiliency by providing a means for
965 Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
968 ---------------
977 increase internal redundancy -- either storing nearly identical copies of
978 metadata, or more space-efficient encoding techniques.
980 For XFS, a different redundancy strategy was chosen to modernize the design:
981 a secondary space usage index that maps allocated disk extents back to their
983 By adding a new index, the filesystem retains most of its ability to scale
987 Like any system that improves redundancy, the reverse-mapping feature increases
993 defeats device-level deduplication because the filesystem requires real
996 +--------------------------------------------------------------------------+
998 +--------------------------------------------------------------------------+
999 | A criticism of adding the secondary index is that it does nothing to |
1001 | This is a valid point, but adding a new index for file data block |
1003 | copy-writes, which age the filesystem prematurely. |
1005 | integrity can supply as powerful a solution as they require. |
1006 | As for metadata, the complexity of adding a new secondary index of space |
1011 +--------------------------------------------------------------------------+
1013 The information captured in a reverse space mapping record is as follows:
1015 .. code-block:: c
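
    /*
     * A sketch of the reverse mapping record fields described below; the
     * kernel's actual declaration may differ slightly in type names.
     */
    struct xfs_rmap_irec {
        xfs_agblock_t    rm_startblock;   /* extent start block */
        xfs_extlen_t     rm_blockcount;   /* extent length */
        uint64_t         rm_owner;        /* extent owner */
        uint64_t         rm_offset;       /* offset within the owner */
        unsigned int     rm_flags;        /* state flags */
    };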
1031 Finally, the flags field provides extra information about the space usage --
1032 is this an attribute fork extent? A file mapping btree extent? Or an
1037 The reverse mapping index plays a key role in the consistency checking process
1038 because it contains a centralized alternate copy of all space allocation
1042 For example, a file data extent mapping can be checked against:
1052 1. Reverse mappings can provide a positive affirmation of correctness if any of
1054 The checking code for most primary metadata follows a path similar to the
1058 difficult because that requires a full scan of all primary space metadata,
1060 For example, checking a reverse mapping record for a file extent mapping
1063 Instead, scrub relies on rigorous cross-referencing during the primary space
1066 3. Consistency scans must use non-blocking lock acquisition primitives if the
1069 For example, if the filesystem normally takes a file ILOCK before taking
1070 the AGF buffer lock but scrub wants to take a file ILOCK while holding
1072 This means that forward progress during this part of a scan of the reverse
1075 In summary, reverse mappings play a key role in reconstruction of primary
1080 Checking and Cross-Referencing
1081 ------------------------------
1083 The first step of checking a metadata structure is to examine every record
1089 three decisions about the health of a metadata structure:
1091 - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``)?
1092 - Is this structure inconsistent with the rest of the system
1094 - Is there so much damage around the filesystem that cross-referencing is not
1096 - Can the structure be optimized to improve performance or reduce the size of
1098 - Does the structure contain data that is not inconsistent but deserves review
1111 - Does the block belong to this filesystem?
1113 - Does the block belong to the structure that asked for the read?
1117 - Is the type of data stored in the block within a reasonable range of what
1120 - Does the physical location of the block match the location it was read from?
1122 - Does the block checksum match the data?
1124 The scope of the protections here is very limited -- verifiers can only
1129 corrupt metadata force the cancellation of a dirty transaction.
1132 block of a structure in the course of checking the structure.
1133 Corruption problems observed during a check are immediately reported to
1134 userspace as corruption; during a cross-reference, they are reported as a
1135 failure to cross-reference once the full examination is complete.
1136 Reads satisfied by a buffer already in cache (and hence already verified)
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1150 - Does the type of data stored in the block match what scrub is expecting?
1152 - Does the block belong to the owning structure that asked for the read?
1154 - If the block contains records, do the records fit within the block?
1156 - If the block tracks internal free space information, is it consistent with
1159 - Are the records contained inside the block free of obvious corruptions?
1161 Record checks in this category are more rigorous and more time-intensive.
1171 debugging is enabled or a write is about to occur.
1174 Validation of Userspace-Controlled Record Attributes
1179 that a value is within the possible range.
1182 - Superblock fields controlled by mount options
1183 - Filesystem labels
1184 - File timestamps
1185 - File permissions
1186 - File size
1187 - File flags
1188 - Names present in directory entries, extended attribute keys, and filesystem
1190 - Extended attribute key namespaces
1191 - Extended attribute values
1192 - File data block contents
1193 - Quota limits
1194 - Quota timer expiration (if resource usage exceeds the soft limit)
1196 Cross-Referencing Space Metadata
1200 cross-referencing records between metadata structures.
1204 The exact set of cross-referencing is highly dependent on the context of the
1212 purposes of performing a keyspace scan so that scrub can decide if the rmap
1213 btree contains records mapping a certain extent of physical space without the
1216 Btree blocks undergo the following checks before cross-referencing:
1218 - Does the type of data stored in the block match what scrub is expecting?
1220 - Does the block belong to the owning structure that asked for the read?
1222 - Do the records fit within the block?
1224 - Are the records contained inside the block free of obvious corruptions?
1226 - Are the name hashes in the correct order?
1228 - Do node pointers within the btree point to valid block addresses for the type
1231 - Do child pointers point towards the leaves?
1233 - Do sibling pointers point across the same level?
1235 - For each node block record, does the record key accurately reflect the contents
1238 Space allocation records are cross-referenced as follows:
1240 1. Any space mentioned by any metadata structure is cross-referenced as
1243 - Does the reverse mapping index list only the appropriate owner as the
1246 - Are none of the blocks claimed as free space?
1248 - If these aren't file data blocks, are none of the blocks claimed as space
1251 2. Btree blocks are cross-referenced as follows:
1253 - Everything in class 1 above.
1255 - If there's a parent node block, do the keys listed for this block match the
1258 - Do the sibling pointers point to valid blocks? Of the same level?
1260 - Do the child pointers point to valid blocks? Of the next level down?
1262 3. Free space btree records are cross-referenced as follows:
1264 - Everything in class 1 and 2 above.
1266 - Does the reverse mapping index list no owners of this space?
1268 - Is this space not claimed by the inode index for inodes?
1270 - Is it not mentioned by the reference count index?
1272 - Is there a matching record in the other free space btree?
1274 4. Inode btree records are cross-referenced as follows:
1276 - Everything in class 1 and 2 above.
1278 - Is there a matching record in the free inode btree?
1280 - Do cleared bits in the holemask correspond with inode clusters?
1282 - Do set bits in the freemask correspond with inode records with zero link
1285 5. Inode records are cross-referenced as follows:
1287 - Everything in class 1.
1289 - Do all the fields that summarize information about the file forks actually
1292 - Does each inode with zero link count correspond to a record in the free
1295 6. File fork space mapping records are cross-referenced as follows:
1297 - Everything in class 1 and 2 above.
1299 - Is this space not mentioned by the inode btrees?
1301 - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1304 7. Reference count records are cross-referenced as follows:
1306 - Everything in class 1 and 2 above.
1308 - Within the space subkeyspace of the rmap btree (that is to say, all
1309 records mapped to a particular space extent and ignoring the owner info),
1315 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-…
1317 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gap…
1319 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-ga…
1322 …tps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-r…
1325 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-…
1326 before starting a repair.
1331 Extended attributes implement a key-value store that enables fragments of data
1335 Most typically these fragments are metadata about the file -- origins, security
1336 contexts, user-supplied labels, indexing information, etc.
1341 A file's extended attributes are stored in blocks mapped by the attr fork.
1347 Values that are less than 3/4 the size of a filesystem block are also stored
1349 Remote value blocks contain values that are too large to fit inside a leaf.
1350 If the leaf information exceeds a single filesystem block, a dabtree (also
1356 Scrub must read each block mapped by the attr fork and ignore the non-leaf
1364 For each entry inside a leaf:
1366 a. Validate that the name does not contain invalid characters.
1369 This performs a named lookup of the attr name to ensure the correctness
1371 If the value is stored in a remote block, this also validates the
1374 Checking and Cross-Referencing Directories
1377 The filesystem directory tree is a directed acyclic graph structure, with files
1379 Directories are a special type of file containing a set of mappings from a
1380 255-byte sequence (name) to an inumber.
1383 A root directory points to itself.
1385 Each non-directory file may have multiple directories point to it.
1387 In XFS, directories are implemented as a file containing up to three 32GB
1390 Each data block contains variable-sized records associating a user-provided
1391 name with an inumber and, optionally, a file type.
1393 exists as post-EOF extents) is populated with a block containing free space
1398 populated with a linear array of free space information for faster
1401 beyond one block, then a dabtree is used to map hashes of dirent names to
1404 Checking a directory is pretty straightforward:
1413 a. Does the name contain no invalid characters?
1417 c. Does the child inode have a nonzero link count?
1419 d. If a file type is included in the dirent, does it match the type of the
1422 e. If the child is a subdirectory, does the child's dotdot pointer point
1425 f. If the directory has a second partition, perform a named lookup of the
1439 maps user-provided names to improve lookup times by avoiding linear scans.
1440 Internally, it maps a 32-bit hash of the name to a block offset within the
1443 The internal structure of a dabtree closely resembles the btrees that record
1444 fixed-size metadata records -- each dabtree block contains a magic number, a
1445 checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
1446 The format of leaf and node records is the same -- each entry points to the
1448 leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1456 - Does the block belong to the owning structure that asked for the read?
1458 - Do the records fit within the block?
1460 - Are the records contained inside the block free of obvious corruptions?
1462 - Are the name hashes in the correct order?
1464 - Do node pointers within the dabtree point to valid fork offsets for dabtree
1467 - Do leaf pointers within the dabtree point to valid fork offsets for directory
1470 - Do child pointers point towards the leaves?
1472 - Do sibling pointers point across the same level?
1474 - For each dabtree node record, does the record key accurately reflect the
1477 - For each dabtree leaf record, does the record key accurately reflect the
1480 Cross-Referencing Summary Counters
1488 This would make for very slow reporting, so a transactional filesystem can
1490 Cross-referencing these values against the filesystem metadata should be a
1498 Post-Repair Reverification
1501 After performing a repair, the checking code is run a second time to validate
1506 For developers, it is a useful means to judge the efficacy of error detection
1510 ------------------------------------
1512 Complex operations can make modifications to multiple per-AG data structures
1513 with a chain of transactions.
1516 Because the AG header buffers are unlocked between transactions within a chain,
1527 * For each AG, maintain a count of intent items targeting that AG.
1528 The count should be bumped whenever a new item is added to the chain.
1538 This may lead to online fsck taking a long time to complete, but regular
1550 uncovered a misinteraction between online fsck and compound transaction chains
1560 which makes it impossible (say) to use a single transaction to free a space
1561 extent in AG 7 and then try to free a now superfluous block mapping btree block
1565 actual metadata updates to a fresh transaction.
1568 1. The first transaction contains a physical update to the file's block mapping
1570 It then attaches to the in-memory transaction an action item to schedule
1572 Concretely, each transaction maintains a list of ``struct
1573 xfs_defer_pending`` objects, each of which maintains a list of ``struct
1585 2. The second transaction contains a physical update to the free space btrees
1586 of AG 3 to release the former BMBT block and a second physical update to the
1590 Attached to the transaction is an extent free done (EFD) log item.
1591 The EFD contains a pointer to the EFI logged in transaction #1 so that log
1595 but before #2 is committed, a scan of the filesystem metadata would show
1598 Happily, log recovery corrects this inconsistency for us -- when recovery finds
1599 an intent log item but does not find a corresponding intent done item, it will
1606 * Log items must be added to a transaction in the correct order to prevent
1608 In other words, all per-AG metadata updates for an unmapped block must be
1612 * AG header buffers are released between each transaction in a chain.
1621 In this manner, XFS employs a form of eventual consistency to avoid deadlocks
1625 decided that it was impractical to cram all the reverse mapping updates for a
1626 single filesystem change into a single transaction because a single file
1630 * A reverse mapping update for the block mapping update
1632 * A reverse mapping update for the freelist fix
1634 * A shape change to the block mapping btree
1635 * A reverse mapping update for the btree update
1637 * A reverse mapping update for the freelist fix
1640 * A reverse mapping update for the refcount update
1641 * Fixing the freelist (a third time)
1642 * A reverse mapping update for the freelist fix
1645 * Fixing the freelist (a fourth time)
1646 * A reverse mapping update for the freelist fix
1649 * Fixing the freelist (a fifth time)
1650 * A reverse mapping update for the freelist fix
1654 For copy-on-write updates this is even worse, because this must be done once to
1655 remove the space from a staging area and again to map it into the file!
1657 To deal with this explosion in a calm manner, XFS expands its use of deferred
1660 work into a long chain of small updates, which increases the degree of eventual
1662 Again, this generally isn't a problem because XFS orders its deferred work
1665 However, online fsck changes the rules -- remember that although physical
1666 updates to per-AG structures are coordinated by locking the buffers for AG
1668 Once scrub acquires resources and takes locks for a data structure, it must do
1670 If the main lock for a space btree is an AG header buffer lock, scrub may have
1671 interrupted another thread that is midway through finishing a chain.
1672 For example, if a thread performing a copy-on-write has completed a reverse
1676 If a repair is attempted in this state, the results will be catastrophic!
1681 1. Add a higher level lock to allocation groups and require writer threads to
1686 Performing a dry run of a file operation to discover necessary locks would
1692 This would introduce a lot of complexity into the coordinator since it is
1696 work can start on a new sibling task.
1702 This solution is a nonstarter because it is *extremely* invasive to the main
1713 First, the counter is incremented when a deferred work item is *queued* to a
1716 The second property is that deferred work can be added to a transaction without
1717 holding an AG header lock, but per-AG work items cannot be marked done without
1723 since scrub will always be able to decide if a conflict is possible.
1727 1. Call the appropriate subsystem function to add a deferred work item to a
1733 calls ``->finish_item`` to complete it.
1735 4. The ``->finish_item`` implementation logs some changes and calls
1745 For example, a scan of the refcount btree would lock the AGI and AGF header
1754 back to step 1 unless a signal has been caught.
1756 To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
1761 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
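The shape of that drain is small enough to sketch here.  The structure and
helper names below are illustrative rather than the actual XFS symbols, and in
practice scrub drops the AG header buffers before sleeping and retries once it
wakes up:

.. code-block:: c

    #include <linux/atomic.h>
    #include <linux/wait.h>

    /* One drain per AG: a count of queued-but-unfinished intent items. */
    struct example_drain {
        atomic_t                   count;
        struct wait_queue_head     waiters;
    };

    /* Writer side: bump the count when an intent item is queued. */
    static inline void example_drain_grab(struct example_drain *dr)
    {
        atomic_inc(&dr->count);
    }

    /* Writer side: drop the count when the item is finished; wake scrub. */
    static inline void example_drain_rele(struct example_drain *dr)
    {
        if (atomic_dec_and_test(&dr->count))
            wake_up(&dr->waiters);
    }

    /* Scrub side: sleep until all queued intents for this AG are finished. */
    static inline int example_drain_wait(struct example_drain *dr)
    {
        return wait_event_killable(dr->waiters,
                atomic_read(&dr->count) == 0);
    }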
1770 However, there are a few parts of online fsck (such as the intent drains, and
1771 later, live update hooks) where it is useful for the online fsck code to know
1777 Taking locks in the hot path of a writer thread to access a data structure only
1778 to find that no further action is necessary is expensive -- on the author's
1779 computer, this has an overhead of 40-50ns per access.
1781 replace a static branch to hook code with ``nop`` sleds when online fsck isn't
1795 to change a static key while holding any locks or resources that could be
1804 - The hooked part of XFS should declare a static-scoped static key that
1807 The static key itself should be declared as a ``static`` variable.
1809 - When deciding to invoke code that's only used by scrub, the regular
1811 scrub-only hook code if the static key is not enabled.
1813 - The regular filesystem should export helper functions that call
1819 - Scrub functions wanting to turn on scrub-only XFS functionality should call
1820 the ``xchk_fsgates_enable`` function from the setup function to enable a specific
1829 If it detects a conflict between scrub and the running transactions, it will
1832 return -EDEADLOCK, which should result in the scrub being restarted with the
1839 Documentation/staging/static-keys.rst.
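To make the pattern concrete, here is a hedged sketch built on the kernel's
generic static key API; the ``example_*`` names are illustrative and are not
the actual scrub symbols:

.. code-block:: c

    #include <linux/jump_label.h>

    struct example_update;

    /* Patched to a nop sled while no scrub-owned hook is registered. */
    static DEFINE_STATIC_KEY_FALSE(example_hooks_switch);

    void example_hook_slowpath(const struct example_update *upd);

    /* Called from the regular filesystem write paths. */
    static inline void example_hook(const struct example_update *upd)
    {
        if (static_branch_unlikely(&example_hooks_switch))
            example_hook_slowpath(upd);
    }

    /*
     * Helpers exported to scrub; per the text above, these must only be
     * called without holding locks or transaction context, because
     * flipping a static key can sleep.
     */
    void example_hooks_enable(void)
    {
        static_branch_inc(&example_hooks_switch);
    }

    void example_hooks_disable(void)
    {
        static_branch_dec(&example_hooks_switch);
    }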
1844 ----------------------
1846 Some online checking functions work by scanning the filesystem to build a
1849 For online repair to rebuild a metadata structure, it must compute the record
1852 Ideally, repairs complete with a single atomic commit that introduces
1853 a new data structure.
1854 To meet these goals, the kernel needs to collect a large amount of information
1855 in a place that doesn't require the correct operation of the filesystem.
1859 * Allocating a contiguous region of memory to create a C array is very
1860 difficult, especially on 32-bit systems.
1869 At any given time, online fsck does not need to keep the entire record set in
1873 Fortunately, the Linux kernel already has a facility for byte-addressable and
1875 In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1880 +--------------------------------------------------------------------------+
1882 +--------------------------------------------------------------------------+
1883 | The first edition of online repair inserted records into a new btree as |
1884 | it found them, which failed because the filesystem could shut down with a |
1885 | half-built data structure, which would be live after recovery finished. |
1887 | The second edition solved the half-rebuilt structure problem by storing |
1892 +--------------------------------------------------------------------------+
1897 A survey of the intended uses of xfiles suggested these use cases:
1899 1. Arrays of fixed-sized records (space management btrees, directory and
1902 2. Sparse arrays of fixed-sized records (quotas and link counts)
1918 XFS is very record-based, which suggests that the ability to load and store
1920 To support these cases, a pair of ``xfile_load`` and ``xfile_store``
1932 tmpfs can only push a pagecache folio to the swap cache if the folio is neither
1946 :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
1959 xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1965 having to create a dummy ``struct kiocb`` and to avoid taking inode and
1971 For example, if a scrub function stores scan results in an xfile and needs
1973 provide a lock for all threads to share.
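As a usage illustration, a hedged sketch of staging fixed-size records in an
xfile follows; the prototypes shown are assumptions based on the function names
above, not the actual declarations:

.. code-block:: c

    /* An observation record staged in memory during a scan. */
    struct example_rec {
        uint64_t    key;
        uint64_t    count;
    };

    /* Store a record at a caller-computed byte offset in the xfile. */
    static int example_stage(struct xfile *xf, uint64_t idx,
                             const struct example_rec *rec)
    {
        /* Assumed prototype: xfile_store(xfile, buf, count, pos). */
        return xfile_store(xf, rec, sizeof(*rec), idx * sizeof(*rec));
    }

    /* Read the record back; never-written ranges read back as zeroes. */
    static int example_fetch(struct xfile *xf, uint64_t idx,
                             struct example_rec *rec)
    {
        /* Assumed prototype: xfile_load(xfile, buf, count, pos). */
        return xfile_load(xf, rec, sizeof(*rec), idx * sizeof(*rec));
    }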
1977 Arrays of Fixed-Sized Records
1981 counts, file fork space, and reverse mappings) consists of a set of fixed-size
1982 records indexed with a classic B+ tree.
1983 Directories have a set of fixed-size dirent records that point to the names,
1984 and extended attributes have a set of fixed-size attribute keys that point to
1987 During a repair, scrub needs to stage new records during the gathering step and
1991 methods of the xfile directly, it is simpler for callers if there is a
1994 The ``xfarray`` abstraction presents a linear array for fixed-size records atop
1995 the byte-accessible xfile.
2007 Gaps may exist between records, and a record may be updated multiple times
2009 In other words, these callers want a sparse linearly addressed table file.
2012 ``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2014 Gaps are defined to be null records, and null records are defined to be a
2021 and do not require multiple updates to a record.
2024 via the ``xfarray_append`` function, which stores a record at the end of the
2026 For callers that require records to be presentable in a specific order (e.g.
2030 The third type of caller is a bag, which is useful for counting records.
2035 The ``xfarray_store_anywhere`` function is used to insert a record in any
2036 null record slot in the bag; and the ``xfarray_unset`` function removes a
2040 `big in-memory array
2041 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
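A hedged sketch of the second calling pattern (append the records, then sort
them) described above; the prototypes are assumptions derived from the function
names in the text:

.. code-block:: c

    /* Stage records in scan order, then sort them prior to bulk loading. */
    static int example_collect_and_sort(struct xfarray *array,
                                        const void *records, size_t recsz,
                                        unsigned int nr)
    {
        const char      *rec = records;
        unsigned int    i;
        int             error;

        for (i = 0; i < nr; i++, rec += recsz) {
            /* Assumed prototype: xfarray_append(array, record). */
            error = xfarray_append(array, rec);
            if (error)
                return error;
        }

        /* Assumed prototype: xfarray_sort(array, flags). */
        return xfarray_sort(array, 0);
    }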
2050 .. code-block:: c
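
    /*
     * Sketch of whole-array iteration; foreach_xfarray_idx() and the
     * xfarray_load() prototype are assumptions, and example_rec is an
     * illustrative record type.
     */
    struct example_rec  rec;
    xfarray_idx_t       i;
    int                 error;

    foreach_xfarray_idx(array, i) {
        error = xfarray_load(array, i, &rec);
        if (error)
            break;

        /* do something with rec */
    }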
2062 For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
2066 Once it finds a page, it will skip the zeroed areas of the page.
2068 .. code-block:: c
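
    /*
     * Sketch of sparse iteration with xfarray_iter(); the cursor
     * initializer and the return convention (1 while records remain)
     * are assumptions.
     */
    struct example_rec  rec;
    xfarray_idx_t       i = XFARRAY_CURSOR_INIT;
    int                 ret;

    while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
        /* do something with rec */
    }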
2080 During the fourth demonstration of online repair, a community reviewer remarked
2082 into btree record blocks instead of inserting records into a new btree one at a
2091 The sorting algorithm used in the xfarray is actually a combination of adaptive
2092 quicksort and a heapsort subalgorithm in the spirit of
2096 To sort records in a reasonably short amount of time, ``xfarray`` takes
2100 Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
2103 The Linux kernel already contains a reasonably fast implementation of heapsort.
2107 * Sorting any record subset backed by a single xfile page.
2109 * Loading a small number of xfarray records from potentially disparate parts
2110 of the xfarray into a memory buffer, and sorting the buffer.
2115 Choosing a quicksort pivot is a tricky business.
2116 A good pivot splits the set to sort in half, leading to the divide and conquer
2118 A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
2120 The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
2121 records into a memory buffer and using the kernel heapsort to identify the
2124 Most modern quicksort implementations employ Tukey's "ninther" to select a
2125 pivot from a classic C array.
2130 It turned out to be much more performant to read the nine elements into a
2131 memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2133 Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
2134 low-effort robust (resistant) location in large samples`, in *Contributions to
2138 The partitioning of quicksort is fairly textbook -- rearrange the record
2143 As a final performance optimization, the hi and lo scanning phase of quicksort
2158 The names, keys, and values can consume a large amount of memory, so the
2164 The store function returns a magic cookie for every object that it persists.
2166 The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
2170 in a subsequent section about atomic file content exchanges.
2172 to cache a small number of entries before adding them to a temporary ondisk
2177 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ serie…
2181 In-Memory B+Trees
2186 between a live metadata scan of the filesystem and writer threads that are
2190 This *can* be done by appending concurrent updates into a separate log file and
2193 Another option is to skip the side-log and commit live updates from the
2194 filesystem directly into the scan data, which trades more overhead for a lower
2200 fsck employs the second strategy of committing live updates directly into
2204 Conveniently, however, XFS has a library to create and maintain ordered reverse
2206 If only there was a means to create one in memory.
2208 Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
2211 The XFS buffer cache specializes in abstracting IO to block-oriented address
2218 `in-memory btree
2219 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2222 Using xfiles as a Buffer Cache Target
2225 Two modifications are necessary to support xfiles as a buffer cache target.
2227 host the ``struct xfs_buf`` rhashtable, because normally those are held by a
2228 per-AG structure.
2233 With this adaptation in place, users of the xfile-backed buffer cache use
2234 exactly the same APIs as users of the disk-backed buffer cache.
2237 updates to an in-memory btree.
2243 Space management for an xfile is very simple -- each btree block is one memory
2245 These blocks use the same header format as an on-disk btree, but the in-memory
2247 corruption-prone than regular DRAM.
2250 The very first block of an xfile backing an xfbtree contains a header block.
2254 To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
2269 2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
2275 Each btree type should define a wrapper that passes necessary arguments to
2286 and to update the in-memory btree.
2287 For example, a btree cursor for an rmap xfbtree can be passed to the
2290 xfbtree updates that are logged to a transaction.
2300 Although it is a clever hack to reuse the rmap btree code to handle the staging
2301 structure, the ephemeral nature of the in-memory btree block storage presents
2322 4. Queue the buffer to a special delwri list.
2334 ------------------------------
2337 structures by creating a new btree and adding observations individually.
2338 Loading a btree one record at a time had a slight advantage of not requiring
2340 blocks if the system went down during a repair.
2341 Loading records one at a time also meant that repair could not control the
2344 Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
2345 rebuilding a btree index from a collection of records -- bulk btree loading.
2346 This was implemented rather inefficiently code-wise, since ``xfs_repair``
2347 had separate copy-pasted implementations for each btree type.
2350 were taken, and the four were refactored into a single generic btree bulk
2364 will fit in a leaf block from the size of a btree block and the size of the
2368 maxrecs = (block_size - header_size) / record_size
2378 Choosing maxrecs is also undesirable because adding a single record to each
2379 newly rebuilt leaf block will cause a tree split, which causes a noticeable
2381 The default loading factor was chosen to be 75% of maxrecs, which provides a
2394 maxrecs = (block_size - header_size) / (key_size + ptr_size)
2413 - For AG-rooted btrees, this level is the root level, so the height of the new
2417 - For inode-rooted btrees where the records in the top level do not fit in the
2422 - For inode-rooted btrees where the records in the top level can be stored in
2426 This only becomes relevant when non-bmap btrees gain the ability to root in
2427 an inode, which is a future patchset and only included here for completeness.
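Putting the two sizing formulas together with the 75% loading factor, a hedged
sketch of the geometry estimate follows; the real bulk loading planner
(``xfs_btree_bload_compute_geometry``) handles slack and per-level minimums
more carefully than this simplification:

.. code-block:: c

    #include <linux/math.h>
    #include <linux/types.h>

    /* Estimate the block count and height of a btree to be bulk loaded. */
    static unsigned int example_bload_estimate(uint64_t nr_records,
                                               unsigned int leaf_maxrecs,
                                               unsigned int node_maxrecs,
                                               uint64_t *nr_blocks)
    {
        /* Default loading factor is 75% of maxrecs, as described above. */
        unsigned int    leaf_load = leaf_maxrecs * 3 / 4;
        unsigned int    node_load = node_maxrecs * 3 / 4;
        uint64_t        level_blocks = DIV_ROUND_UP(nr_records, leaf_load);
        unsigned int    height = 1;

        *nr_blocks = level_blocks;

        /* Each level up needs one key/pointer per block in the level below. */
        while (level_blocks > 1) {
            level_blocks = DIV_ROUND_UP(level_blocks, node_load);
            *nr_blocks += level_blocks;
            height++;
        }

        return height;
    }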
2439 its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
2443 Each time the btree builder claims a block for the btree from a reserved
2444 extent, it updates the in-memory reservation to reflect the claimed space.
2455 mechanism is reused here to commit a transaction at the log head containing an
2457 This enables the log to release the old EFI to keep the log moving forwards.
2459 EFIs have a role to play during the commit and reaping phases; please see the
2464 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2467 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2473 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2474 a block from the reserved list, writes the new btree block header, fills the
2475 rest of the block with records, and adds the new leaf block to a list of
2483 Sibling pointers are set every time a new block is added to the level::
2492 To fill a node block, it walks each block in the next level down in the tree
2524 This is a little complicated because a new btree block could have been freed
2528 Blocks are queued for IO using a delwri list and written in one large batch
2533 The repair function must log the location of the new root in a transaction,
2541 a. Log Extent Freeing Done (EFD) items for all the space that was consumed
2545 b. For unclaimed portions of incore reservations, create a regular deferred
2549 c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
2552 call ``xrep_defer_finish`` to clear out the deferred work and obtain a
2555 3. Clear out the deferred work a second time to finish the commit and clean
2558 The transaction rolling in steps 2c and 3 represents a weakness in the repair
2559 algorithm, because a log flush and a crash before the end of the reap step can
2564 Repair moves on to reaping the old blocks, which will be presented in a
2565 subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
2573 records from the inode chunk information and a bitmap of the old inode btree
2602 A cluster is the smallest number of ondisk inodes that can be allocated or
2603 freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
2613 enough information to fill a single inode chunk record, which is 64 consecutive
2619 This xfarray is walked twice during the btree creation step -- once to populate
2620 the inode btree with all inode chunk records, and a second time to populate the
2621 free inode btree with records for chunks that have free non-sparse inodes.
2628 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2640 From the diagram below, it is apparent that a reference count record must start
2642 In other words, the record emission stimulus is level-triggered::
2653 Extents being used to stage copy-on-write operations should be the only records
2655 Single-owner file blocks aren't recorded in either the free space or the
2664 because these are extents allocated to stage a copy on write operation and
2667 Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
2689 - Until the reverse mapping btree runs out of records:
2691 - Retrieve the next record from the btree and put it in a bag.
2693 - Collect all records with the same starting block from the btree and put
2696 - While the bag isn't empty:
2698 - Among the mappings in the bag, compute the lowest block number where the
2704 - Remove all mappings from the bag that end at this position.
2706 - Collect all reverse mappings that start at this position from the btree
2709 - If the size of the bag changed and is greater than one, create a new
2713 The bag-like structure in this case is a type 2 xfarray as discussed in the
2721 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
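A hedged sketch of that sweep, reduced to its essentials: reverse mappings are
turned into +1/-1 events at their start and end blocks, and a record is emitted
whenever a span covered by two or more owners closes.  Every name below is
illustrative, and adjacent records with equal counts would still need merging:

.. code-block:: c

    #include <linux/types.h>

    /* One event per reverse mapping boundary, sorted by block number. */
    struct example_event {
        uint64_t    agbno;   /* block where the event occurs */
        int         delta;   /* +1 at rm_startblock, -1 at the end */
    };

    void example_emit(uint64_t start, uint64_t len, unsigned int refcount);

    static void example_sweep_refcounts(const struct example_event *ev,
                                        size_t nr)
    {
        uint64_t        span_start = 0;
        unsigned int    depth = 0;
        size_t          i;

        for (i = 0; i < nr; i++) {
            /* Close out a shared span before the depth changes. */
            if (depth >= 2 && span_start != ev[i].agbno)
                example_emit(span_start, ev[i].agbno - span_start, depth);

            depth += ev[i].delta;
            span_start = ev[i].agbno;
        }
    }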
2727 The high level process to rebuild a data/attr fork mapping btree is:
2756 EXTENTS format instead of BMBT, which may require a conversion.
2762 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2768 ---------------------------
2770 Whenever online fsck builds a new data structure to replace one that is
2771 suspect, there is a question of how to find and dispose of the blocks that
2775 Hopefully, someone will schedule a rebuild of the free space information to
2781 As part of a repair, online fsck relies heavily on the reverse mapping records
2792 1. Create a bitmap of space used by data structures that must be preserved.
2796 2. Survey the reverse mapping data to create a bitmap of space owned by the
2803 Repairs for file-based metadata such as extended attributes, directories,
2804 symbolic links, quota files and realtime bitmaps are performed by building a
2805 new structure attached to a temporary file and exchanging all mappings in the
2816 - If zero, the block has a single owner and can be freed.
2818 - If not, the block is part of a crosslinked structure and must not be
2837 a. EFIs logged on behalf of space that is no longer occupied
2841 This is also a window in which a crash during the reaping process can leak
2848 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2851 Case Study: Reaping After a Regular Btree Repair
2857 Creating a list of extents to reap the old btree blocks is quite simple,
2863 metadata structure being rebuilt, set the corresponding range in a bitmap.
2868 4. Each set bit in the bitmap represents a block that could be a block from the
2869 old data structures and hence is a candidate for reaping.
2902 Repairing the free space btrees has three key complications over a regular
2919 information changes the number of free space records, repair must re-estimate
2937 creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
2939 The repair context maintains a second bitmap corresponding to the rmap btree
2948 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2956 Old reverse mapping btrees are less difficult to reap after a repair.
2960 The full process of gathering reverse mapping records and building a new btree
2962 :ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
2973 in a separate :ref:`case study<rmap_repair>`.
2977 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2985 1. Create a bitmap for all the space that the reverse mapping data claims is
2991 other owner, to avoid re-adding crosslinked blocks to the AGFL.
2995 5. The next operation to fix the freelist will right-size the list.
3000 --------------------
3003 ("dinodes") and an in-memory ("cached") representation.
3004 There is a very high potential for cache coherency issues if online fsck is not
3006 badly damaged that the filesystem cannot load the in-memory representation.
3007 When online fsck wants to open a damaged file for scrubbing, it must use
3008 specialized resource acquisition functions that return either the in-memory
3009 representation *or* a lock on whichever object is necessary to prevent any
3013 is necessary to get the in-core structure loaded.
3018 Once the in-memory representation is loaded, repair can lock the inode and can
3020 Most inode attributes are easy to check and constrain, or are user-controlled
3029 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
3033 --------------------
3036 an in-memory representation, and hence are subject to the same cache coherency
3041 whatever is necessary to get the in-core structure loaded.
3042 Once the in-memory representation is loaded, the only attributes needing
3046 section about :ref:`live quotacheck <quotacheck>`.
3050 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
3056 --------------------------------
3061 but this is a slow process, so XFS maintains a copy in the ondisk superblock
3066 Writer threads reserve the worst-case quantities of resources from the
3075 To reduce contention even further, the incore counter is implemented as a
3076 percpu counter, which means that each CPU is allocated a batch of blocks from a
3079 The high-performance nature of the summary counters makes it difficult for
3080 online fsck to check them, since there is no way to quiesce a percpu counter
3083 values of the summary counters, there's no way to hold the value of a percpu
3087 scan flag, but this is not a satisfying outcome for a system administrator.
3088 For repairs, the in-memory counters must be stabilized while walking the
3099 This is very similar to a filesystem freeze, though not all of the pieces are
3102 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3106 - It does not quiesce the log.
3111 +--------------------------------------------------------------------------+
3113 +--------------------------------------------------------------------------+
3120 | - Other programs can unfreeze the filesystem without our knowledge. |
3123 | - Adding an extra lock to prevent others from thawing the filesystem |
3124 | required the addition of a ``->freeze_super`` function to wrap |
3129 | becomes a UAF bug! |
3136 | - The log need not be quiesced to check the summary counters, but a VFS |
3138 | This adds unnecessary runtime to live fscounter fsck operations. |
3140 | - Quiescing the log means that XFS flushes the (possibly incorrect) |
3143 | - A bug in the VFS meant that freeze could complete even when |
3146 +--------------------------------------------------------------------------+
3150 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3154 ---------------------
3160 observations to disk in a replacement structure and committing it atomically.
3163 Therefore, online fsck must build the infrastructure to manage a live scan of
3165 There are two questions that need to be solved to perform a live walk:
3167 - How does scrub manage the scan while it is collecting data?
3169 - How does the scan keep abreast of changes being made to the system by other
3179 (*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3184 Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3186 <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3187 Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3188 1913-4.
3192 They form a continuous keyspace that can be expressed as a 64-bit integer,
3194 Scans proceed in a linear fashion across the inumber keyspace, starting from
3196 Naturally, a scan through a keyspace requires a scan cursor object to track the
3202 the keyspace have already been visited, which is critical for deciding if a
3206 Advancing the scan cursor is a multi-step process encapsulated in
3214 2. Use the per-AG inode btree to look up the next inumber after the one that
3219 a. Move the examination cursor to the point of the inumber keyspace that
3237 a. Move the examination cursor ahead to the next inode marked as allocated
3256 1. Start a scan by calling ``xchk_iscan_start``.
3261 a. Lock the inode to prevent updates during the scan.
3278 back into a loadable state.
3282 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3286 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3302 transaction context because there are a handful of activities that might
3305 - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3308 - Speculative preallocations need to be unreserved.
3310 - An unlinked file may have lost its last reference, in which case the entire
3315 Inactivation has two parts -- the VFS part, which initiates writeback on all
3316 dirty file pages, and the XFS part, which cleans up XFS-specific information
3318 If the inode is unlinked (or unconnected after a file handle operation), the
3339 8. Incore dquot references, if a file is being repaired.
3352 an object that normally is acquired in a later stage of the locking order, and
3353 then decide to cross-reference the object with an object that is acquired
3358 iget and irele During a Scrub
3361 An inode scan performed on behalf of a scrub operation runs in transaction
3363 This isn't much of a problem for ``iget`` since it can operate in the context
3367 When the VFS ``iput`` function is given a linked inode with no other
3369 save time if another process re-opens the file before the system runs out
3371 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3376 inode, which was a problem for scrub because scrub may already hold a
3380 To capture these nuances, the online fsck code has a separate ``xchk_irele``
3386 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3388 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`…
3396 in a well-known order: parent → child when updating the directory tree, and
3406 Inode lock acquisition must be done carefully during a coordinated inode scan.
3407 Online fsck cannot abide these conventions, because for a directory tree
3410 If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
3414 Solving both of these problems is straightforward -- any time online fsck
3415 needs to take a second lock of the same class, it uses trylock to avoid an ABBA
3427 Case Study: Finding a Directory Parent
3431 Online fsck must verify that the dotdot dirent of a directory points up to a
3434 Fully validating this relationship (and repairing it if possible) requires a
3437 The coordinated inode scan provides a way to walk the filesystem without the
3440 if the scanner fails to lock a parent, it can drop and relock both the child
3442 If the dotdot entry changes while the directory is unlocked, then a move or
3448 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
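A minimal sketch of this drop-and-retry locking pattern follows.  It is an
illustration only, not the actual scrub code: the ``xchk_example_*`` helpers
are hypothetical, and only ``xfs_ilock``, ``xfs_ilock_nowait``,
``xfs_iunlock``, and ``xfs_irele`` are real XFS primitives.

.. code-block:: c

    /*
     * Sketch: verify the dotdot entry of @dp without risking an ABBA
     * deadlock on the ILOCKs.  A real implementation would bound or back
     * off the retries.
     */
    STATIC int
    xchk_example_check_parent(
            struct xfs_scrub        *sc,
            struct xfs_inode        *dp)    /* child directory */
    {
            struct xfs_inode        *pp;
            xfs_ino_t               parent_ino;
            int                     error;

    retry:
            xfs_ilock(dp, XFS_ILOCK_SHARED);

            /* Assumed helper: read the dotdot entry of @dp. */
            error = xchk_example_lookup_dotdot(sc, dp, &parent_ino);
            if (error)
                    goto out_child;

            /* Assumed helper: grab the alleged parent. */
            error = xchk_example_iget(sc, parent_ino, &pp);
            if (error)
                    goto out_child;

            /* Second lock of the same class: trylock to avoid ABBA deadlock. */
            if (!xfs_ilock_nowait(pp, XFS_ILOCK_SHARED)) {
                    xfs_irele(pp);
                    xfs_iunlock(dp, XFS_ILOCK_SHARED);
                    goto retry;
            }

            /* Assumed helper: does @pp have a dirent pointing to @dp? */
            error = xchk_example_parent_has_dirent(sc, pp, dp);

            xfs_iunlock(pp, XFS_ILOCK_SHARED);
            xfs_irele(pp);
    out_child:
            xfs_iunlock(dp, XFS_ILOCK_SHARED);
            return error;
    }
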
3456 The second piece of support that online fsck functions need during a full
3459 in a dynamic environment.
3464 a downstream consumer.
3469 Call chains are a dynamic list, which means that they can be configured at
3476 Regular blocking notifier chains use a rwsem and seem to have a much lower
3477 overhead for single-threaded applications.
3479 keys are a more performant combination; more study is needed here.
3481 The following pieces are necessary to hook a certain point in the filesystem:
3483 - A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3484 a well-known incore filesystem object.
3486 - Each hook must define an action code and a structure containing more context
3489 - Hook providers should provide appropriate wrapper functions and structs
3493 - A callsite in the regular filesystem code must be chosen to call
3497 In general, when the filesystem calls a hook chain, it should be able to
3503 - The online fsck function should define a structure to hold scan data, a lock
3504 to coordinate access to the scan data, and a ``struct xfs_hook`` object.
3508 - The online fsck code must contain a C function to catch the hook action code
3513 - Prior to unlocking inodes to start the scan, online fsck must call
3517 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3520 The number of hooks should be kept to a minimum to reduce complexity.
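A rough sketch of the embedding and registration pattern described above is
shown below.  The ``xfs_hooks_add``, ``xfs_hooks_del``, and ``xfs_hook_setup``
names follow this design, but their exact signatures, the notifier block
embedded in ``struct xfs_hook``, the ``m_example_hooks`` chain, and the
payload structure are assumptions made for illustration.

.. code-block:: c

    struct xchk_example_scan;

    struct xchk_example_hook {
            struct xfs_hook                 hook;   /* embedded hook */
            spinlock_t                      lock;   /* protects data */
            struct xchk_example_scan        *data;  /* scan observations */
    };

    /* Notifier-style hook function: action code plus a context payload. */
    static int
    xchk_example_hook_fn(
            struct notifier_block   *nb,
            unsigned long           action,
            void                    *priv)
    {
            struct xchk_example_hook *eh;

            /* Assumes struct xfs_hook embeds a notifier_block named nb. */
            eh = container_of(nb, struct xchk_example_hook, hook.nb);

            spin_lock(&eh->lock);
            /* ...fold the update described by @action/@priv into eh->data... */
            spin_unlock(&eh->lock);
            return NOTIFY_DONE;
    }

    static int
    xchk_example_hook_enable(
            struct xfs_mount                *mp,
            struct xchk_example_hook        *eh)
    {
            xfs_hook_setup(&eh->hook, xchk_example_hook_fn);
            return xfs_hooks_add(&mp->m_example_hooks, &eh->hook);
    }

    static void
    xchk_example_hook_disable(
            struct xfs_mount                *mp,
            struct xchk_example_hook        *eh)
    {
            xfs_hooks_del(&mp->m_example_hooks, &eh->hook);
    }
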
3526 Live Updates During a Scan
3561 - Prior to invoking the notifier call chain, the filesystem function being
3565 - The scanning function and the scrub hook function must coordinate access to
3566 the scan data by acquiring a lock on the scan data.
3568 - Scrub hook functions must not add the live update information to the scan
3570 The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
3573 - Scrub hook functions must not change the caller's state, including the
3578 - The hook function can abort the inode scan to avoid breaking the other rules.
3582 - ``xchk_iscan_start`` starts a scan
3584 - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
3587 - ``xchk_iscan_want_live_update`` to decide if an inode has already been
3590 in-memory scan information.
3592 - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
3595 - ``xchk_iscan_teardown`` to finish the scan
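Putting these helpers together, a scrub function might drive a coordinated
scan with a loop along the following lines.  This is a simplified sketch; the
helper arguments, the ``xchk_example_gather`` worker, and the locking details
are assumptions rather than the actual scrub code.

.. code-block:: c

    STATIC int
    xchk_example_scan_inodes(
            struct xfs_scrub        *sc)
    {
            struct xchk_iscan       iscan;
            struct xfs_inode        *ip;
            int                     error;

            xchk_iscan_start(sc, 0, 0, &iscan);

            /* Assume xchk_iscan_iter returns 1 with a referenced inode. */
            while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
                    xfs_ilock(ip, XFS_ILOCK_SHARED);

                    error = xchk_example_gather(sc, ip);    /* assumed worker */

                    xchk_iscan_mark_visited(&iscan, ip);
                    xfs_iunlock(ip, XFS_ILOCK_SHARED);
                    xchk_irele(sc, ip);
                    if (error)
                            break;
            }

            xchk_iscan_teardown(&iscan);
            return error < 0 ? error : 0;
    }
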
3597 This functionality is also a part of the
3599 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3621 incore dquot to a delayed write (delwri) list.
3628 Therefore, online quotacheck records file resource usage to a shadow dquot
3629 index implemented with a sparse ``xfarray``, and only writes to the real dquots
3634 1. The inodes involved are joined and locked to a transaction.
3638 a. The dquot is locked.
3640 b. A quota reservation is added to the dquot's resource usage.
3649 a. The dquot is locked again.
3657 The step 2 hook creates a shadow version of the transaction dquot context
3658 (``dqtrx``) that operates in a similar manner to the regular code.
3661 live update coordinates with the inode scanner.
3665 1. Set up a coordinated inode scan.
3669 a. Grab and lock the inode.
3679 a. Grab and lock the dquot.
3682 by the live hooks.
3684 Live updates are key to being able to walk every quota record without
3685 needing to hold any locks for a long duration.
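As an illustration of the shadow index, an observation might be folded into
the sparse ``xfarray`` roughly as follows.  The record layout and the
behavior of ``xfarray_load`` for a never-written index are assumptions here,
not the actual quotacheck code.

.. code-block:: c

    struct xqcheck_example_dquot {
            uint64_t                icount;         /* observed inodes */
            uint64_t                bcount;         /* observed blocks */
    };

    STATIC int
    xqcheck_example_account(
            struct xfarray          *shadow,        /* sparse shadow index */
            xfs_dqid_t              id,             /* dquot being updated */
            int64_t                 inodes,
            int64_t                 blocks)
    {
            struct xqcheck_example_dquot    dq;
            int                             error;

            error = xfarray_load(shadow, id, &dq);
            if (error == -ENODATA)
                    memset(&dq, 0, sizeof(dq));     /* first observation */
            else if (error)
                    return error;

            dq.icount += inodes;
            dq.bcount += blocks;

            return xfarray_store(shadow, id, &dq);
    }
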
3691 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3699 File link count checking also uses live update hooks.
3701 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3703 During the scanning phase, each entry in a directory generates observation
3706 1. If the entry is a dotdot (``'..'``) entry of the root directory, the
3710 2. If the entry is a dotdot entry of a subdirectory, the parent's backref
3713 3. If the entry is neither a dot nor a dotdot entry, the target file's parent
3716 4. If the target is a subdirectory, the parent's child link count is bumped.
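A sketch of the shadow record implied by these rules is shown below; the
field names are illustrative assumptions rather than the actual scrub
structures, which keep similar counters in a sparse ``xfarray`` indexed by
inumber.

.. code-block:: c

    struct xchk_example_nlink {
            uint32_t        parents;        /* dirents pointing at this file */
            uint32_t        backrefs;       /* dotdot entries pointing back here */
            uint32_t        children;       /* subdirectories beneath this dir */
    };

    /*
     * Rough mapping of the rules above:
     *  - a dirent pointing at a file bumps that file's parents count;
     *  - a dirent pointing at a subdirectory also bumps the parent
     *    directory's children count;
     *  - a subdirectory's dotdot entry is tracked through the backrefs
     *    count so that it can be reconciled with the parent's link counts
     *    during the comparison phase.
     */
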
3718 A crucial point to understand about how the link count inode scanner interacts
3719 with the live update hooks is that the scan cursor tracks which *parent*
3721 In other words, the live updates ignore any update about ``A → B`` when A has
3723 Furthermore, a subdirectory A with a dotdot entry pointing back to B is
3724 accounted as a backref counter in the shadow data for A, since child dotdot
3726 Live update hooks are carefully placed in all parts of the filesystem that
3732 Non-directories never have children of any kind.
3739 A second coordinated inode scan cursor is used for comparisons.
3740 Live updates are key to being able to walk every inode without needing to hold
3749 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
3759 and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3761 repair code -- code and data are entirely contained within the scrub module,
3764 A secondary advantage of this repair approach is atomicity -- once the kernel
3765 decides a structure is corrupt, no other threads can access the metadata until
3768 For repairs going on within a shard of the filesystem, these advantages
3775 It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
3776 <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3787 4. Hook into rmap updates for the AG being repaired so that the live scan data
3795 a. Create a btree cursor for the in-memory btree.
3797 b. Use the rmap code to add the record to the in-memory btree.
3802 6. For each live update received via the hook, decide if the owner has already
3804 If so, apply the live update into the scan data:
3806 a. Create a btree cursor for the in-memory btree.
3808 b. Replay the operation into the in-memory btree.
3815 7. When the inode scan finishes, create a new scrub transaction and relock the
3833 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3836 Staging Repairs with Temporary Files on Disk
3837 --------------------------------------------
3839 XFS stores a substantial amount of metadata in file forks: directories,
3842 File forks map 64-bit logical file fork space extents to physical storage space
3843 extents, similar to how a memory management unit maps 64-bit virtual addresses
3845 Therefore, file-based tree structures (such as directories and extended
3847 to other blocks mapped within that same address space, and file-based linear
3852 cannot be staged in memory, even when a paging scheme is available.
3853 Therefore, online repair of file-based metadata creates a temporary file in
3854 the XFS filesystem, writes a new structure at the correct offsets into the
3862 consistent to use a temporary file safely!
3866 Exchanging metadata file mappings with a temporary file requires the owner
3872 There is a downside to the reaping process -- if the system crashes during the
3878 They are not linked into a directory and the entire file will be reaped when
3884 +--------------------------------------------------------------------------+
3886 +--------------------------------------------------------------------------+
3889 | fork would be reaped; and then a new structure would be built in its |
3894 | The second iteration explored building a second structure at a high |
3896 | using a ``COLLAPSE_RANGE`` operation to slide the new extents into |
3901 | - Array structures are linearly addressed, and the regular filesystem |
3902 | codebase does not have the concept of a linear offset that could be |
3905 | - Extended attributes are allowed to use the entire attr fork offset |
3908 | - Even if repair could build an alternate copy of a data structure in a |
3911 | a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old |
3914 | - A crash after construction of the secondary tree but before the range |
3918 | - Reaping blocks after a repair is not a simple operation, and |
3919 | initiating a reap operation from a restarted range collapse operation |
3922 | - Directory entry blocks and quota records record the file fork offset |
3926 | Rewriting a single field in block headers is not a huge problem, but |
3929 | - Each block in a directory or extended attributes btree index contains |
3931 | Were the atomic commit to use a range collapse operation, each block |
3934 | Doing this as part of a range collapse means rewriting a large number |
3938 +--------------------------------------------------------------------------+
3940 Using a Temporary File
3943 Online repair code should use the ``xrep_tempfile_create`` function to create a
3945 This allocates an inode, marks the in-core inode private, and attaches it to
3953 The usage patterns of these two locks are the same as for any other XFS file --
3962 Data can be written to a temporary file by two means:
3964 1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
3970 Once a good copy of a data file has been constructed in a temporary file, it
3976 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
3979 Logged File Content Exchanges
3980 -----------------------------
3982 Once repair builds a temporary file with a new data structure written into
3990 a. When the reverse-mapping btree is enabled, the swap code must keep the
3995 b. Reverse-mapping is critical for the operation of online fsck, so the old
3996 defragmentation code (which swapped entire extent forks in a single
4001 For this use case, an incomplete exchange will not result in a user-visible
4006 For directory and xattr repairs, the user-visible contents might be the
4009 e. Old blocks in the file may be cross-linked with another structure and must
4010 not reappear if the system goes down mid-repair.
4012 These problems are overcome by creating a new deferred operation and a new type
4016 the reverse-mapping extent swap code, but records intermediate progress in the
4017 log so that operations can be restarted after a crash.
4030 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
4033 +--------------------------------------------------------------------------+
4034 | **Sidebar: Using Log-Incompatible Feature Flags** |
4035 +--------------------------------------------------------------------------+
4036 | Starting with XFS v5, the superblock contains a |
4047 | Because upper level code can be working on a transaction at the same |
4049 | communicate to the log when it is going to use a log incompatible |
4056 | The code supporting a log incompat feature should create wrapper |
4063 | For a file operation, this step must happen after taking the IOLOCK |
4070 | Log-assisted extended attribute updates and file content exchanges both  |
4073 +--------------------------------------------------------------------------+
4075 Mechanics of a Logged File Content Exchange
4078 Exchanging contents between file forks is a complex task.
4086 This is roughly the format of the new deferred exchange-mapping work item:
4088 .. code-block:: c
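
    /*
     * Sketch of the fields referenced in the steps below; the exact kernel
     * definition carries additional state.
     */
    struct xfs_exchmaps_intent {
            /* Inodes participating in the file content exchange. */
            struct xfs_inode        *xmi_ip1;
            struct xfs_inode        *xmi_ip2;

            /* Where the exchange starts in each file. */
            xfs_fileoff_t           xmi_startoff1;
            xfs_fileoff_t           xmi_startoff2;

            /* Number of file blocks still to be exchanged. */
            xfs_filblks_t           xmi_blockcount;

            /* Behavior flags for the exchange. */
            uint64_t                xmi_flags;
    };
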
4123 1. Create a deferred work item for the file mapping exchange.
4134 a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
4136 be exchanged in a single step.
4138 Keep advancing through the file forks until at least one of the mappings
4146 b. Create a deferred block mapping update to unmap map1 from file 1.
4148 c. Create a deferred block mapping update to unmap map2 from file 2.
4150 d. Create a deferred block mapping update to map map1 into file 2.
4152 e. Create a deferred block mapping update to map map2 into file 1.
4158 h. Log a mapping exchange done log item for the mapping exchange intent log
4162 This quantity is ``(map1.br_startoff + map1.br_blockcount -
4163 xmi_startoff1)``, because step 3a could have skipped holes.
4170 k. Log a new mapping exchange intent log item reflecting the advanced state
4175 The operation manager completes the deferred work in steps 3b-3e before
4178 4. Perform any post-processing.
4185 will either see the old broken structure or the new one, and never a mishmash of
4191 There are a few things that need to be taken care of before initiating an
4201 - Data device blocks needed to handle the repeated updates to the fork
4203 - Change in data and realtime block counts for both files.
4204 - Increase in quota usage for both files, if the two files do not share the
4206 - The number of extent mappings that will be added to each file.
4207 - Whether or not there are partially written realtime extents.
4208 User programs must never be able to access a realtime file extent that maps
4215 exchange ever add more extent mappings to a fork than it can support.
4223 "local" and treat the fork as a literal area for data storage.
4226 - If both forks are in local format and the fork areas are large enough, the
4230 be done with a single transaction.
4232 - If both forks map blocks, then the regular atomic file mapping exchange is
4235 - Otherwise, only one fork is in local format.
4236 The contents of the local format fork are converted to a block to perform the
4254 After a successful exchange operation, the repair operation must reap the old
4256 extent reaping <reaping>` mechanism that is done post-repair.
4260 However, this iunlink processing omits the cross-link detection of online
4266 To repair a metadata file, online repair proceeds as follows:
4268 1. Create a temporary repair file.
4277 4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
4278 the appropriate resource reservations, locks, and fill out a ``struct
4290 In the "realtime" section of an XFS filesystem, free space is tracked via a
4292 Each bit in the bitmap represents one realtime extent, which is a multiple of
4294 The realtime summary file indexes the number of free extents of a given size to
4301 The summary file itself is a flat file (with no block headers or checksums!)
4302 partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
4305 and can satisfy a power-of-two allocation request.
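A sketch of the offset computation implied by this layout follows.  The
helper and its signature are hypothetical, though ``sb_rbmblocks`` is the
real count of blocks in the realtime bitmap.

.. code-block:: c

    /*
     * The flat summary file holds log2(rt extents) sections, each with one
     * 32-bit counter per realtime bitmap block, so a counter's position is
     * a simple linear combination of the two indices.
     */
    static inline unsigned int
    xchk_example_rtsummary_offset(
            struct xfs_mount        *mp,
            unsigned int            log2_len,       /* log2 of free extent length */
            xfs_fileoff_t           bbno)           /* rt bitmap block number */
    {
            return log2_len * mp->m_sb.sb_rbmblocks + bbno;
    }
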
4313 a. Compute the position in the summary file that contains a counter that
4328 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
4334 In XFS, extended attributes are implemented as a namespaced name-value store.
4340 Attribute leaf blocks contain variable-sized records that associate
4341 user-provided names with the user-provided values.
4342 Values larger than a block are allocated separate extents and written there.
4343 If the leaf information expands beyond a single block, a directory/attribute
4353 a. Walk the attr leaf block to find candidate keys.
4362 2. If the memory usage of the xfarray and xfblob exceed a certain amount of
4374 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
4377 Fixing Directories
4378 ------------------
4403 parent has a child entry pointing back to the directory being repaired.
4410 a. Walk the directory data block to find candidate entries.
4419 3. If the memory usage of the xfarray and xfblob exceed a certain amount of
4431 rebuilding a directory?
4435 In theory it is necessary to scan all dentry cache entries for a directory to
4440 2. The cached dentry no longer has a corresponding ondisk dirent in the new
4447 Unfortunately, the current dentry cache design doesn't provide a means to walk
4448 every child dentry of a specific directory, which makes this a hard problem.
4453 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4459 A parent pointer is a piece of file metadata that enables a user to locate the
4473 each dirent also contains a parent pointer pointing back to the dirent.
4475 each parent pointer is a directory and that it contains a dirent matching
4479 +--------------------------------------------------------------------------+
4481 +--------------------------------------------------------------------------+
4483 | than a decade ago by SGI. |
4484 | Each link from a parent directory to a child file is mirrored with an |
4492 | It did not guarantee that a change in a forward link would always be |
4509 | a file system repair to depend on. |
4510 | Allison Henderson, Chandan Babu, and Catherine Hoang are working on a |
4515 | commit a dirent update and a parent pointer update in the same |
4525 | always matched when reconstructing a directory. |
4527 | There were a few other ways to have solved that problem: |
4544 | 3. Same as above, but remove the old parent pointer entry and add a new |
4568 | most performant. A new hash function was designed for parent pointers. |
4569 +--------------------------------------------------------------------------+
4575 Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
4576 a :ref:`directory entry live update hook <liveupdate>` as follows:
4578 1. Set up a temporary directory for generating the new directory structure,
4580 size fields involved in a directory update: ``(child inumber, add vs.
4590 a. Stash the parent pointer name and an addname entry for this dirent in the
4594 a threshold, flush the stashed updates to the temporary directory.
4596 4. For each live directory update received via the hook, decide if the child
4600 a. Stash the parent pointer name and an addname or removename entry for this
4617 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4623 Online reconstruction of a file's parent pointer information works similarly to
4626 1. Set up a temporary file for generating a new extended attribute structure,
4628 fixed size fields involved in a parent pointer update: ``(parent inumber,
4638 a. Stash the dirent name and an addpptr entry for this parent pointer in the
4642 exceeds a threshold, flush the stashed updates to the temporary file.
4644 4. For each live directory update received via the hook, decide if the parent
4648 a. Stash the dirent name and an addpptr or removepptr entry for this dirent
4657 6. Copy all non-parent pointer extended attributes to the temporary file.
4667 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4675 Parent pointer checks are therefore a second pass to be added to the existing
4684 a. If the name has already been stored in the xfblob, then use that cookie
4692 2. Creating a stable sort key for the parent pointer indexes so that the
4696 name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
4702 a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
4704 Having a single ``name_cookie`` for each ``name`` is critical for
4705 handling the uncommon case of a directory containing multiple hardlinks
4713 a. Validate the ondisk parent pointer.
4720 c. Record the name in a per-file xfblob, and remember the xfblob
4724 name_cookie)`` tuples in a per-file slab.
4726 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
4730 per-AG tuple slab.
4731 This should be trivial since the per-AG tuples are in child inumber
4734 4. Position a second slab cursor at the start of the per-file tuple slab.
4740 a. If the per-AG cursor is at a lower point in the keyspace than the
4741 per-file cursor, then the per-AG cursor points to a missing parent
4743 Add the parent pointer to the inode and advance the per-AG
4746 b. If the per-file cursor is at a lower point in the keyspace than
4747 the per-AG cursor, then the per-file cursor points to a dangling
4749 Remove the parent pointer from the inode and advance the per-file
4760 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
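The comparison described in these steps is a standard sorted merge.  A
minimal sketch is shown below; the record layout and the cursor helpers are
illustrative assumptions, not the actual repair code.

.. code-block:: c

    struct xrep_example_pptr {
            xfs_ino_t       parent_inum;
            uint32_t        name_hash;
            uint64_t        name_cookie;
    };

    static int
    xrep_example_cmp(
            const struct xrep_example_pptr  *a,
            const struct xrep_example_pptr  *b)
    {
            if (a->parent_inum != b->parent_inum)
                    return a->parent_inum < b->parent_inum ? -1 : 1;
            if (a->name_hash != b->name_hash)
                    return a->name_hash < b->name_hash ? -1 : 1;
            if (a->name_cookie != b->name_cookie)
                    return a->name_cookie < b->name_cookie ? -1 : 1;
            return 0;
    }

    /* Hypothetical cursors yield records in sorted order, NULL at the end. */
    static void
    xrep_example_reconcile(
            struct xrep_example_pptr        *ag_rec,
            struct xrep_example_pptr        *file_rec)
    {
            while (ag_rec || file_rec) {
                    int     cmp;

                    if (!file_rec)
                            cmp = -1;
                    else if (!ag_rec)
                            cmp = 1;
                    else
                            cmp = xrep_example_cmp(ag_rec, file_rec);

                    if (cmp < 0) {
                            /* Missing parent pointer: add it to the file. */
                            ag_rec = xrep_example_next_ag();        /* assumed */
                    } else if (cmp > 0) {
                            /* Dangling parent pointer: remove it. */
                            file_rec = xrep_example_next_file();    /* assumed */
                    } else {
                            ag_rec = xrep_example_next_ag();
                            file_rec = xrep_example_next_file();
                    }
            }
    }
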
4764 challenging because xfs_repair currently uses two single-pass scans of the
4767 This scan would have to be converted into a multi-pass scan:
4784 the dirents and add them to the now-empty directories.
4793 As mentioned earlier, the filesystem directory tree is supposed to be a
4795 However, each node in this graph is a separate ``xfs_inode`` object with its
4797 Fortunately, non-directories are allowed to have multiple parents and cannot
4799 Directories typically constitute 5-10% of the files in a filesystem, which
4803 disconnected regions by running a depth (or breadth) first search downwards
4804 from the root directory and marking a bitmap for each directory found.
4805 At any point in the walk, trying to set an already set bit means there is a
4811 Directory tree updates can move subtrees across the scanner wavefront on a live
4819 consistent, each directory entry must have a parent pointer, and the link
4824 This is not possible since the VFS does not take the IOLOCK of a child
4826 the parent -> child relationship by taking the ILOCKs and installing a dirent
4829 The scanning process uses a dirent hook to detect changes to the directories
4835 a. For each parent pointer of that subdirectory,
4837 1. Create a path object for that parent pointer, and mark the
4840 2. Record the parent pointer name and inode number in a path structure.
4843 a cycle.
4844 Mark the path for deletion and repeat step 1a with the next
4847 4. Try to mark the alleged parent inode number in a bitmap in the path
4849 If the bit is already set, then there is a cycle in the directory
4851 Mark the path as a cycle and repeat step 1a with the next subdirectory
4855 If the alleged parent is not a linked directory, abort the scan
4860 a. Record the parent pointer name and inode number in the path object
4864 Repeat step 1a with the next subdirectory parent pointer.
4866 c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
4877 If the entry matches part of a path, mark that path and the scan stale.
4885 a. Corrupt paths and cycle paths are counted as suspect.
4908 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
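The upward walk and cycle check described above can be sketched as follows.
The bitmap type and the parent pointer lookup helper are hypothetical, and
the real scanner records full path objects rather than a single cycle flag.

.. code-block:: c

    STATIC int
    xchk_example_walk_ancestors(
            struct xfs_scrub        *sc,
            xfs_ino_t               start_ino,
            bool                    *cycle)
    {
            struct xexample_bitmap  seen;           /* assumed bitmap type */
            xfs_ino_t               ino = start_ino;
            int                     error = 0;

            *cycle = false;
            xexample_bitmap_init(&seen);

            while (ino != sc->mp->m_sb.sb_rootino) {
                    xfs_ino_t       parent;

                    /* Assumed helper: read one parent pointer of @ino. */
                    error = xchk_example_first_parent(sc, ino, &parent);
                    if (error)
                            break;

                    /* An already-set bit means the alleged ancestry loops. */
                    if (xexample_bitmap_test_and_set(&seen, parent)) {
                            *cycle = true;
                            break;
                    }

                    ino = parent;
            }

            xexample_bitmap_destroy(&seen);
            return error;
    }
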
4914 The Orphanage
4915 -------------
4917 Filesystems present files as a directed, and hopefully acyclic, graph.
4918 In other words, a tree.
4919 The root of the filesystem is a directory, and each entry in a directory points
4920 downwards either to more subdirectories or to non-directory files.
4921 Unfortunately, a disruption in the directory graph pointers results in a
4926 detect a dotdot entry pointing to a parent directory that doesn't have a link
4927 back to the child directory and the file link count checker can detect a file
4929 If such a file has a positive link count, the file is an orphan.
4936 Offline fsck solves the problem by creating a directory ``/lost+found`` to
4939 Reparenting a file to the orphanage does not reset any of its permissions or
4945 security attributes and dentry cache entries, just like a regular directory
4954 2. If the decision is made to reconnect a file, take the IOLOCK of both the
4973 7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
4978 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
4990 Checking Metadata
4991 -----------------
4998 a. Filesystem summary counts depend on consistency within the inode indices,
5030 Therefore, a metadata dependency graph is a convenient way to schedule checking
5033 - Phase 1 checks that the provided path maps to an XFS filesystem and detect
5036 - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
5038 - Phase 3 scans inodes in parallel.
5041 - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
5044 - Phase 5 starts by checking groups (b) and (c) in parallel before moving on
5047 - Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
5050 - Phase 7 checks group (a), having validated everything else.
5055 Parallel Inode Scans
5056 --------------------
5059 Given that XFS targets installations with large high-performance storage,
5061 if the program has been invoked manually from a command line.
5062 This requires careful scheduling to keep the threads as evenly loaded as
5065 Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
5066 workqueue and scheduled a single workqueue item per AG.
5070 The file handle was then passed to a function to generate scrub items for each
5073 filesystem contains one AG with a few large sparse files and the rest of the
5080 avoid this problem with ease by adding a second workqueue.
5087 creates a file handle, and passes it to a function to generate scrub items for
5096 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-t…
5099 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalan…
5104 Scheduling Repairs
5105 ------------------
5115 During phase 3, corruptions and inconsistencies reported in any part of a
5124 With recent increases in the number of optimizations possible for a given
5125 filesystem object, it became much more memory-efficient to track all eligible
5126 repairs for a given filesystem object with a single repair item.
5127 Each repair item represents a single lockable object -- AGs, metadata files,
5128 individual inodes, or a class of summary information.
5130 Phase 4 is responsible for scheduling a lot of repair work in as quick a
5137 1. Start a round of repair with a workqueue and enough workers to keep the CPUs
5140 a. For each repair item queued by phase 2,
5142 i. Ask the kernel to repair everything listed in the repair item for a
5145 ii. Make a note if the kernel made any progress in reducing the number
5153 b. If any repairs were made, jump back to 1a to retry all the phase 2 items.
5157 i. Ask the kernel to repair everything listed in the repair item for a
5160 ii. Make a note if the kernel made any progress in reducing the number
5184 …://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warn…
5187 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-d…
5190 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracki…
5193 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedu…
5196 Checking Names for Confusable Unicode Sequences
5197 -----------------------------------------------
5205 contents of a name:
5207 - Slashes and null bytes are not allowed in directory entries.
5209 - Null bytes are not allowed in userspace-visible extended attributes.
5211 - Null bytes are not allowed in the filesystem label.
5216 presented together -- all the names in a directory, or all the attributes of a
5220 modern-day Linux systems is that programs work with Unicode character code
5222 These programs typically encode those code points in UTF-8 when interfacing
5223 with the C library because the kernel expects null-terminated names.
5225 UTF-8 encoded Unicode data.
5230 For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
5231 identically to "Latin Small Letter A" U+0061 "a".
5233 The standard also permits characters to be constructed in multiple ways --
5234 either by using a defined code point, or by combining one code point with
5237 as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
5238 U+030A "◌̊".
5243 For example, the character "Right-to-Left Override" U+202E can trick some
5245 A second category of rendering problems involves whitespace characters.
5246 If the character "Zero Width Space" U+200B is encountered in a file name, the
5247 name will render identically to a name that does not have the zero width
5250 If two names within a naming domain have different byte sequences but render
5251 identically, a user may be confused by it.
5260 When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
5263 `libicu <https://github.com/unicode-org/icu>`_
5264 to identify names within a directory or within a file's extended attributes that
5266 Names are also checked for control characters, non-rendering characters, and
5271 Media Verification of File Data Extents
5272 ---------------------------------------
5274 The system administrator can elect to initiate a media scan of all file data
5282 When the space map scan accumulates a region larger than 32MB, a media
5283 verification request is sent to the disk as a directio read of the raw block
5286 If the verification read fails, ``xfs_scrub`` retries with single-block reads
5292 construct file paths from inode numbers for user-friendly reporting.
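A simplified userspace sketch of this verification strategy is shown below.
It is illustrative only -- not the actual ``xfs_scrub`` source -- and assumes
the caller opened the block device with ``O_DIRECT``.

.. code-block:: c

    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define VERIFY_REGION   (32ULL << 20)   /* 32MB regions, as described above */

    static int
    verify_region(int fd, off_t start, size_t blocksize)
    {
            void    *buf;
            int     bad = 0;

            /* Direct reads need a suitably aligned buffer. */
            if (posix_memalign(&buf, blocksize, VERIFY_REGION))
                    return -1;

            /* One large read first; if it succeeds, the region is clean. */
            if (pread(fd, buf, VERIFY_REGION, start) == (ssize_t)VERIFY_REGION)
                    goto out;

            /* Retry one filesystem block at a time to localize the error. */
            for (off_t off = start; off < start + (off_t)VERIFY_REGION;
                 off += blocksize) {
                    if (pread(fd, buf, blocksize, off) != (ssize_t)blocksize)
                            bad++;  /* report this block's owners */
            }
    out:
            free(buf);
            return bad;
    }
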
5307 ----------------------
5309 As discussed earlier, a second frontend to the atomic file mapping exchange
5310 mechanism is a new ioctl call that userspace programs can use to commit updates
5324 When XFS v5 came along with self-describing metadata, this old mechanism grew
5330 swap mappings one at a time.
5331 This mechanism is identical to steps 2-3 from the procedure above except for
5335 identical, so the recovery guarantees are not much of a gain.
5338 implementations because it can guarantee that the caller never sees a mix of
5339 old and new contents even after a crash, and it can operate on two arbitrary
5343 - **Atomic commit of file writes**: A userspace process opens a file that it
5345 Next, it opens a temporary file and calls the file clone operation to reflink
5354 - **Transactional file updates**: The same mechanism as above, but the caller
5364 A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
5366 - **Emulation of atomic block device writes**: Export a block device with a
5369 Stage all writes to a temporary file, and when that is complete, call the
5370 atomic file mapping exchange system call with a flag to indicate that holes
5375 Vectorized Scrub
5376 ----------------
5379 earlier was a catalyst for enabling a vectorized scrub system call.
5380 Since 2018, the cost of making a kernel call has increased considerably on some
5383 reduce the number of times an execution path crosses a security boundary.
5385 With vectorized scrub, userspace pushes to the kernel the identity of a
5386 filesystem object, a list of scrub types to run against that object, and a
5389 The kernel executes as much of the caller's plan as it can until it hits a
5390 dependency that cannot be satisfied due to a corruption, and tells userspace
5393 online fsck can use that instead of adding a separate vectored scrub system
5398 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
5401 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
5404 Quality of Service Targets for Scrub
5405 ------------------------------------
5409 Userspace is allowed to send a fatal signal to the process which will cause
5410 ``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
5411 for userspace to provide a time budget to the kernel.
5413 be too much work to allow userspace to specify a timeout for a scrub/repair
5416 ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
5419 Defragmenting Free Space
5420 ------------------------
5422 Over the years, many XFS users have requested the creation of a program to
5423 clear a portion of the physical storage underlying a filesystem so that it
5424 becomes a contiguous chunk of free space.
5430 The second piece it needs is a new fallocate mode
5431 (``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
5432 maps it to a file.
5436 To clear all the metadata out of a portion of physical storage, clearspace
5437 uses the new fallocate map-freespace call to map any free space in that region
5446 To clear all the file data out of a portion of the physical storage, clearspace
5448 Having identified a good target, it uses the ``FICLONERANGE`` call on that part
5449 of the file to try to share the physical space with a dummy file.
5451 contents; any changes will be written somewhere else via copy-on-write.
5460 To clear a piece of physical storage that has a high sharing factor, it is
5464 To make this work smoothly, clearspace needs a new ioctl
5471 *Answer*: To move inode chunks, Dave Chinner constructed a prototype program
5472 that creates a new file with the old contents and then locklessly runs around
5476 hidden behind a jump label, and a log item that tracks the kernel walking the
5490 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
5493 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
5496 Shrinking Filesystems
5497 ---------------------
5499 Removing the end of the filesystem ought to be a simple matter of evacuating
5502 That requires an evacuation of the space at the end of the filesystem, which is a