xfs-online-fsck-design.rst - OpenGrok cross reference for /Documentation/filesystems/xfs/xfs-online-fsck-design.rst

Lines Matching +full:d +full:- +full:cache +full:- +full:block +full:- +full:size
1 .. SPDX-License-Identifier: GPL-2.0
8         Heading 3 uses "----"
25 - To help kernel distributors understand exactly what the XFS online fsck
28 - To help people reading the code to familiarize themselves with the relevant
31 - To help developers maintaining the system by capturing the reasons
59 - Provide a hierarchy of names through which application programs can associate
62 - Virtualize physical storage media across those names, and
64 - Retrieve the named data blobs at any time.
66 - Examine resource usage.
79 cross-references different types of metadata records with each other to look
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
92 it is now possible to regenerate data structures when non-catastrophic errors
95 +--------------------------------------------------------------------------+
97 +--------------------------------------------------------------------------+
103 +--------------------------------------------------------------------------+
106 -----------------------
109 …nges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`…
110 …ps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-servi…
111 …ges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
116 --------------
132 It uses extent-based in-memory data structures to reduce memory consumption,
141 -----------------
175 This new third program has three components: an in-kernel facility to check
176 metadata, an in-kernel facility to repair metadata, and a userspace driver
183 +--------------------------------------------------------------------------+
185 +--------------------------------------------------------------------------+
193 +--------------------------------------------------------------------------+
212 autonomous self-healing of XFS maximizes service availability.
226 +------------------------------------------------------------------+
228 +------------------------------------------------------------------+
231 +------------------------------------------------------------------+
238 -----
260 --------------
311    The ability to use hardware-assisted data file integrity checking is new
315 7. Re-check the summary counters and presents the caller with a summary of
322 -------------------------
324 The kernel scrub code uses a three-step strategy for checking and repairing
349 --------------------------
362 - Free space and reference count information
364 - Inode records and indexes
366 - Storage mapping information for file data
368 - Directories
370 - Extended attributes
372 - Symbolic links
374 - Quota limits
383 errors and cross-references healthy records against other metadata to look for
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
408 V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
410 *Extending Database Technology*, pp. 293-309, 1992.
413 in-memory array prior to formatting the new ondisk structure, which is very
414 similar to the list-based algorithm discussed in section 2.3 ("List-Based
429 - Reverse mapping information
431 - Directory parent pointers
444 Instead, repair functions set up an in-memory staging structure to store
455 Once the scan is done, the owning object is re-locked, the live data is used to
478 2.4 of Srinivasan above, and sections 2 ("NSF: Inded Build Without Side-File")
526 - Summary counts of free space and inodes
528 - File link counts from directories
530 - Quota resource usage counts
547 <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
549 Since quotas are non-negative integer counts of resource usage, online
551 track pending changes to the block and inode usage counts in each transaction,
562 ---------------
569 - **Decreased performance**: Adding metadata indices to the filesystem
576 - **Incorrect repairs**: As with all software, there might be defects in the
583   The xfsprogs build system has a configure option (``--enable-scrub=no``) that
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
597 - **Misbehavior**: Online fsck requires many privileges -- raw IO to block
607 - **Fuzz Kiddiez**: There are many people now who seem to think that running
609   spraying exploit code onto the public mailing list for instant zero-day
616   Automated testing should front-load some of the risk while the feature is
637 of every aspect of a fsck tool until the introduction of low-cost virtual
638 machines with high-IOPS storage.
645 -------------------------------
657 `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
660 would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
665 ``xfs_scrub -n`` between each test to ensure that the new checking code
679 ---------------------------------------
688 a single block of a specific type of metadata object, trash it with the
693 in-kernel validation functions and the ability of the offline fsck tool to
711 -----------------------------------------
717 block in the filesystem to simulate the effects of memory corruption and
745           2. Offline checking (``xfs_repair -n``)
747           4. Online checking (``xfs_scrub -n``)
762 These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
768 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements…
770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
772 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`…
775 --------------
790 * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
792 * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
793   force-repairing the whole filesystem doesn't cause problems.
794 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
796 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
797   remounting the filesystem read-only and read-write.
805 …git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-…
806 and the `evolution of existing per-function stress testing
807 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stre…
819 ------------------
827 Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
840 kernel block device to prevent resource overconsumption.
843 ------------------
849 possible, the lowest CPU and IO priority, and in a CPU-constrained single
867 * ``xfs_scrub_all.cron`` on non-systemd systems
871 This is less foolproof than, say, storing file data block checksums, but much
879 This was performed via ``systemd-analyze security``, after which privileges
891 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-se…
894 ----------------
901 download this information into a human-readable format.
917 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-repo…
920 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-report…
934 ------------------------
937 ondisk block header to record a magic number, a checksum, a universally
938 "unique" identifier (UUID), an owner code, the ondisk address of the block,
940 When loading a block buffer from disk, the magic number, UUID, owner, and
941 ondisk address confirm that the retrieved block matches the specific owner of
942 the current filesystem, and that the information contained in the block is
948 Whenever a file system operation modifies a block, the change is submitted
965 Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
968 ---------------
977 increase internal redundancy -- either storing nearly identical copies of
978 metadata, or more space-efficient encoding techniques.
985 file metadata (the directory tree, the file block map, and the allocation
987 Like any system that improves redundancy, the reverse-mapping feature increases
993 defeats device-level deduplication because the filesystem requires real
996 +--------------------------------------------------------------------------+
998 +--------------------------------------------------------------------------+
1001 | This is a valid point, but adding a new index for file data block        |
1003 | copy-writes, which age the filesystem prematurely.                       |
1011 +--------------------------------------------------------------------------+
1015 .. code-block:: c
1018 	    xfs_agblock_t    rm_startblock;   /* extent start block */
1025 The first two fields capture the location and size of the physical space,
1031 Finally, the flags field provides extra information about the space usage --
1061    btree block requires locking the file and searching the entire btree to
1062    confirm the block.
1063    Instead, scrub relies on rigorous cross-referencing during the primary space
1066 3. Consistency scans must use non-blocking lock acquisition primitives if the
1071    an AGF buffer lock, scrub cannot block on that second acquisition.
1080 Checking and Cross-Referencing
1081 ------------------------------
1091 - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1092 - Is this structure inconsistent with the rest of the system
1094 - Is there so much damage around the filesystem that cross-referencing is not
1096 - Can the structure be optimized to improve performance or reduce the size of
1098 - Does the structure contain data that is not inconsistent but deserves review
1107 into the buffer cache.
1108 These functions perform inexpensive internal consistency checking of the block
1111 - Does the block belong to this filesystem?
1113 - Does the block belong to the structure that asked for the read?
1117 - Is the type of data stored in the block within a reasonable range of what
1120 - Does the physical location of the block match the location it was read from?
1122 - Does the block checksum match the data?
1124 The scope of the protections here are very limited -- verifiers can only
1132 block of a structure in the course of checking the structure.
1134 userspace as corruption; during a cross-reference, they are reported as a
1135 failure to cross-reference once the full examination is complete.
1136 Reads satisfied by a buffer already in cache (and hence already verified)
1142 After the buffer cache, the next level of metadata protection is the internal
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1145 the buffer cache, and the scrub code itself, depending on the amount of higher
1147 The scope of checking is still internal to the block.
1150 - Does the type of data stored in the block match what scrub is expecting?
1152 - Does the block belong to the owning structure that asked for the read?
1154 - If the block contains records, do the records fit within the block?
1156 - If the block tracks internal free space information, is it consistent with
1159 - Are the records contained inside the block free of obvious corruptions?
1161 Record checks in this category are more rigorous and more time-intensive.
1162 For example, block pointers and inumbers are checked to ensure that they point
1174 Validation of Userspace-Controlled Record Attributes
1182 - Superblock fields controlled by mount options
1183 - Filesystem labels
1184 - File timestamps
1185 - File permissions
1186 - File size
1187 - File flags
1188 - Names present in directory entries, extended attribute keys, and filesystem
1190 - Extended attribute key namespaces
1191 - Extended attribute values
1192 - File data block contents
1193 - Quota limits
1194 - Quota timer expiration (if resource usage exceeds the soft limit)
1196 Cross-Referencing Space Metadata
1199 After internal block checks, the next higher level of checking is
1200 cross-referencing records between metadata structures.
1204 The exact set of cross-referencing is highly dependent on the context of the
1216 Btree blocks undergo the following checks before cross-referencing:
1218 - Does the type of data stored in the block match what scrub is expecting?
1220 - Does the block belong to the owning structure that asked for the read?
1222 - Do the records fit within the block?
1224 - Are the records contained inside the block free of obvious corruptions?
1226 - Are the name hashes in the correct order?
1228 - Do node pointers within the btree point to valid block addresses for the type
1231 - Do child pointers point towards the leaves?
1233 - Do sibling pointers point across the same level?
1235 - For each node block record, does the record key accurate reflect the contents
1236   of the child block?
1238 Space allocation records are cross-referenced as follows:
1240 1. Any space mentioned by any metadata structure are cross-referenced as
1243    - Does the reverse mapping index list only the appropriate owner as the
1244      owner of each block?
1246    - Are none of the blocks claimed as free space?
1248    - If these aren't file data blocks, are none of the blocks claimed as space
1251 2. Btree blocks are cross-referenced as follows:
1253    - Everything in class 1 above.
1255    - If there's a parent node block, do the keys listed for this block match the
1256      keyspace of this block?
1258    - Do the sibling pointers point to valid blocks?  Of the same level?
1260    - Do the child pointers point to valid blocks?  Of the next level down?
1262 3. Free space btree records are cross-referenced as follows:
1264    - Everything in class 1 and 2 above.
1266    - Does the reverse mapping index list no owners of this space?
1268    - Is this space not claimed by the inode index for inodes?
1270    - Is it not mentioned by the reference count index?
1272    - Is there a matching record in the other free space btree?
1274 4. Inode btree records are cross-referenced as follows:
1276    - Everything in class 1 and 2 above.
1278    - Is there a matching record in free inode btree?
1280    - Do cleared bits in the holemask correspond with inode clusters?
1282    - Do set bits in the freemask correspond with inode records with zero link
1285 5. Inode records are cross-referenced as follows:
1287    - Everything in class 1.
1289    - Do all the fields that summarize information about the file forks actually
1292    - Does each inode with zero link count correspond to a record in the free
1295 6. File fork space mapping records are cross-referenced as follows:
1297    - Everything in class 1 and 2 above.
1299    - Is this space not mentioned by the inode btrees?
1301    - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1304 7. Reference count records are cross-referenced as follows:
1306    - Everything in class 1 and 2 above.
1308    - Within the space subkeyspace of the rmap btree (that is to say, all
1310      are there the same number of reverse mapping records for each block as the
1315 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-…
1317 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gap…
1319 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-ga…
1322 …tps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-r…
1325 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-…
1331 Extended attributes implement a key-value store that enable fragments of data
1335 Most typically these fragments are metadata about the file -- origins, security
1336 contexts, user-supplied labels, indexing information, etc.
1343 Block 0 in the attribute fork is always the top of the structure, but otherwise
1346 Names are always stored elsewhere in the same leaf block.
1347 Values that are less than 3/4 the size of a filesystem block are also stored
1348 elsewhere in the same leaf block.
1350 If the leaf information exceeds a single filesystem block, a dabtree (also
1351 rooted at block 0) is created to map hashes of the attribute names to leaf
1356 Scrub must read each block mapped by the attr fork and ignore the non-leaf
1371       If the value is stored in a remote block, this also validates the
1372       integrity of the remote value block.
1374 Checking and Cross-Referencing Directories
1380 255-byte sequence (name) to an inumber.
1385 Each non-directory file may have multiple directories point to it.
1390 Each data block contains variable-sized records associating a user-provided
1392 If the directory entry data grows beyond one block, the second partition (which
1393 exists as post-EOF extents) is populated with a block containing free space
1397 If this second partition grows beyond one block, the third partition is
1401 beyond one block, then a dabtree is used to map hashes of dirent names to
1419    d. If a file type is included in the dirent, does it match the type of the
1439 maps user-provided names to improve lookup times by avoiding linear scans.
1440 Internally, it maps a 32-bit hash of the name to a block offset within the
1444 fixed-size metadata records -- each dabtree block contains a magic number, a
1446 The format of leaf and node records are the same -- each entry points to the
1448 leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1456 - Does the block belong to the owning structure that asked for the read?
1458 - Do the records fit within the block?
1460 - Are the records contained inside the block free of obvious corruptions?
1462 - Are the name hashes in the correct order?
1464 - Do node pointers within the dabtree point to valid fork offsets for dabtree
1467 - Do leaf pointers within the dabtree point to valid fork offsets for directory
1470 - Do child pointers point towards the leaves?
1472 - Do sibling pointers point across the same level?
1474 - For each dabtree node record, does the record key accurate reflect the
1475   contents of the child dabtree block?
1477 - For each dabtree leaf record, does the record key accurate reflect the
1478   contents of the directory or attr block?
1480 Cross-Referencing Summary Counters
1490 Cross-referencing these values against the filesystem metadata should be a
1498 Post-Repair Reverification
1510 ------------------------------------
1512 Complex operations can make modifications to multiple per-AG data structures
1561 extent in AG 7 and then try to free a now superfluous block mapping btree block
1568 1. The first transaction contains a physical update to the file's block mapping
1570    It then attaches to the in-memory transaction an action item to schedule
1576    the unmapped space from AG 7 and the block mapping btree (BMBT) block from
1586    of AG 3 to release the former BMBT block and a second physical update to the
1598 Happily, log recovery corrects this inconsistency for us -- when recovery finds
1608   In other words, all per-AG metadata updates for an unmapped block must be
1629 * The block mapping update itself
1630 * A reverse mapping update for the block mapping update
1634 * A shape change to the block mapping btree
1648 * Freeing the space used by the block mapping btree
1654 For copy-on-write updates this is even worse, because this must be done once to
1659 This reduces the worst case size of transaction reservations by breaking the
1665 However, online fsck changes the rules -- remember that although physical
1666 updates to per-AG structures are coordinated by locking the buffers for AG
1672 For example, if a thread performing a copy-on-write has completed a reverse
1717 holding an AG header lock, but per-AG work items cannot be marked done without
1733    calls ``->finish_item`` to complete it.
1735 4. The ``->finish_item`` implementation logs some changes and calls
1761 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
1778 to find that no further action is necessary is expensive -- on the author's
1779 computer, this have an overhead of 40-50ns per access.
1804 - The hooked part of XFS should declare a static-scoped static key that
1809 - When deciding to invoke code that's only used by scrub, the regular
1811   scrub-only hook code if the static key is not enabled.
1813 - The regular filesystem should export helper functions that call
1819 - Scrub functions wanting to turn on scrub-only XFS functionality should call
1832 return -EDEADLOCK, which should result in the scrub being restarted with the
1839 Documentation/staging/static-keys.rst.
1844 ----------------------
1860   difficult, especially on 32-bit systems.
1873 Fortunately, the Linux kernel already has a facility for byte-addressable and
1875 In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1880 +--------------------------------------------------------------------------+
1882 +--------------------------------------------------------------------------+
1887 | The second edition solved the half-rebuilt structure problem by storing  |
1892 +--------------------------------------------------------------------------+
1899 1. Arrays of fixed-sized records (space management btrees, directory and
1902 2. Sparse arrays of fixed-sized records (quotas and link counts)
1918 XFS is very record-based, which suggests that the ability to load and store
1932 tmpfs can only push a pagecache folio to the swap cache if the folio is neither
1946 :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
1958 are performed by manipulating the page cache directly.
1959 xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1977 Arrays of Fixed-Sized Records
1981 counts, file fork space, and reverse mappings) consists of a set of fixed-size
1983 Directories have a set of fixed-size dirent records that point to the names,
1984 and extended attributes have a set of fixed-size attribute keys that point to
1994 The ``xfarray`` abstraction presents a linear array for fixed-size records atop
1995 the byte-accessible xfile.
2012 ``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2040 `big in-memory array
2041 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
2050 .. code-block:: c
2068 .. code-block:: c
2131 memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2134 low-effort robust (resistant) location in large samples`, in *Contributions to
2138 The partitioning of quicksort is fairly textbook -- rearrange the record
2172 to cache a small number of entries before adding them to a temporary ondisk
2177 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ serie…
2181 In-Memory B+Trees
2193 Another option is to skip the side-log and commit live updates from the
2209 regular file, which means that the kernel can create byte or block addressable
2211 The XFS buffer cache specializes in abstracting IO to block-oriented  address
2212 spaces, which means that adaptation of the buffer cache to interface with
2218 `in-memory btree
2219 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2222 Using xfiles as a Buffer Cache Target
2225 Two modifications are necessary to support xfiles as a buffer cache target.
2228 per-AG structure.
2233 With this adaptation in place, users of the xfile-backed buffer cache use
2234 exactly the same APIs as users of the disk-backed buffer cache.
2235 The separation between xfile and buffer cache implies higher memory usage since
2237 updates to an in-memory btree.
2243 Space management for an xfile is very simple -- each btree block is one memory
2244 page in size.
2245 These blocks use the same header format as an on-disk btree, but the in-memory
2246 block verifiers ignore the checksums, assuming that xfile memory is no more
2247 corruption-prone than regular DRAM.
2250 The very first block of an xfile backing an xfbtree contains a header block.
2251 The header describes the owner, height, and the block number of the root
2252 xfbtree block.
2254 To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
2256 Preallocate space for the block with ``xfile_prealloc``, and hand back the
2258 To free an xfbtree block, use ``xfile_discard`` (which internally uses
2269 2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
2272 3. Pass the buffer cache target, buffer ops, and other information to
2274    initial root block to the xfile.
2286    and to update the in-memory btree.
2301 structure, the ephemeral nature of the in-memory btree block storage presents
2334 ------------------------------
2345 rebuilding a btree index from a collection of records -- bulk btree loading.
2346 This was implemented rather inefficiently code-wise, since ``xfs_repair``
2347 had separate copy-pasted implementations for each btree type.
2364 will fit in a leaf block from the size of a btree block and the size of the
2365 block header.
2368         maxrecs = (block_size - header_size) / record_size
2377 Choosing minrecs is undesirable because it wastes half the block.
2379 newly rebuilt leaf block will cause a tree split, which causes a noticeable
2391 Load factor is computed for btree node blocks using the combined size of the
2392 btree key and pointer as the record size::
2394         maxrecs = (block_size - header_size) / (key_size + ptr_size)
2410 needs one block.
2413 - For AG-rooted btrees, this level is the root level, so the height of the new
2417 - For inode-rooted btrees where the records in the top level do not fit in the
2420   the root block.
2422 - For inode-rooted btrees where the records in the top level can be stored in
2423   the inode fork area, then the root block can be stored in the inode, the
2426   This only becomes relevant when non-bmap btrees gain the ability to root in
2439 its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
2443 Each time the btree builder claims a block for the btree from a reserved
2444 extent, it updates the in-memory reservation to reflect the claimed space.
2445 Block reservation tries to allocate as much contiguous space as possible to
2464 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2467 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2473 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2474 a block from the reserved list, writes the new btree block header, fills the
2475 rest of the block with records, and adds the new leaf block to a list of
2483 Sibling pointers are set every time a new block is added to the level::
2492 To fill a node block, it walks each block in the next level down in the tree
2524 This is a little complicated because a new btree block could have been freed
2562 large transactions, which each can accommodate many thousands of block freeing
2589 6. Commit the location of the new btree root block(s) to the AGI.
2603 freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
2619 This xfarray is walked twice during the btree creation step -- once to populate
2621 free inode btree with records for chunks that have free non-sparse inodes.
2628 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2642 In other words, the record emission stimulus is level-triggered::
2653 Extents being used to stage copy-on-write operations should be the only records
2655 Single-owner file blocks aren't recorded in either the free space or the
2682 6. Commit the location of new btree root block to the AGF.
2689 - Until the reverse mapping btree runs out of records:
2691   - Retrieve the next record from the btree and put it in a bag.
2693   - Collect all records with the same starting block from the btree and put
2696   - While the bag isn't empty:
2698     - Among the mappings in the bag, compute the lowest block number where the
2700       This position will be either the starting block number of the next
2701       unprocessed reverse mapping or the next block after the shortest mapping
2704     - Remove all mappings from the bag that end at this position.
2706     - Collect all reverse mappings that start at this position from the btree
2709     - If the size of the bag changed and is greater than one, create a new
2710       refcount record associating the block number range that we just walked to
2711       the size of the bag.
2713 The bag-like structure in this case is a type 2 xfarray as discussed in the
2721 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2748 7. Commit the new btree root block to the inode fork immediate area.
2762 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2768 ---------------------------
2786 Permitting the block allocator to hand them out again will not push the system
2803 Repairs for file-based metadata such as extended attributes, directories,
2813    the first block in that extent that do not have the same rmap owner for the
2816    - If zero, the block has a single owner and can be freed.
2818    - If not, the block is part of a crosslinked structure and must not be
2821 5. Starting with the next block in the extent, figure out how many more blocks
2822    have the same zero/nonzero other owner status as that first block.
2828    cache as stale to prevent log writeback.
2833 Transactions are of finite size, so the reaping process must be careful to roll
2848 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2866    For each block visited, clear that range in the above bitmap.
2868 4. Each set bit in the bitmap represents a block that could be a block from the
2895    Call it again for the free space by block number index.
2919 information changes the number of free space records, repair must re-estimate
2948 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2977 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2983 The allocation group free block list (AGFL) is repaired as follows:
2991    other owner, to avoid re-adding crosslinked blocks to the AGFL.
2995 5. The next operation to fix the freelist will right-size the list.
3000 --------------------
3003 ("dinodes") and an in-memory ("cached") representation.
3004 There is a very high potential for cache coherency issues if online fsck is not
3006 badly damaged that the filesystem cannot load the in-memory representation.
3008 specialized resource acquisition functions that return either the in-memory
3013 is necessary to get the in-core structure loaded.
3018 Once the in-memory representation is loaded, repair can lock the inode and can
3020 Most inode attributes are easy to check and constrain, or are user-controlled
3022 Dealing with the data and attr fork extent counts and the file block counts is
3029 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
3033 --------------------
3036 an in-memory representation, and hence are subject to the same cache coherency
3041 whatever is necessary to get the in-core structure loaded.
3042 Once the in-memory representation is loaded, the only attributes needing
3050 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
3056 --------------------------------
3066 Writer threads reserve the worst-case quantities of resources from the
3079 The high-performance nature of the summary counters makes it difficult for
3088 For repairs, the in-memory counters must be stabilized while walking the
3102 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3106 - It does not quiesce the log.
3111 +--------------------------------------------------------------------------+
3113 +--------------------------------------------------------------------------+
3120 | - Other programs can unfreeze the filesystem without our knowledge.      |
3123 | - Adding an extra lock to prevent others from thawing the filesystem     |
3124 |   required the addition of a ``->freeze_super`` function to wrap         |
3131 |   block device has frozen the filesystem.                                |
3136 | - The log need not be quiesced to check the summary counters, but a VFS  |
3140 | - Quiescing the log means that XFS flushes the (possibly incorrect)      |
3143 | - A bug in the VFS meant that freeze could complete even when            |
3146 +--------------------------------------------------------------------------+
3150 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3154 ---------------------
3167 - How does scrub manage the scan while it is collecting data?
3169 - How does the scan keep abreast of changes being made to the system by other
3179 (*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3180 its data block mapping.
3184 Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3186 <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3187 Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3188 1913-4.
3192 They form a continuous keyspace that can be expressed as a 64-bit integer,
3206 Advancing the scan cursor is a multi-step process encapsulated in
3214 2. Use the per-AG inode btree to look up the next inumber after the one that
3231    d. If there are no more AGs to examine, set both cursors to the end of the
3268    d. Unlock and release the inode.
3272 There are subtleties with the inode cache that complicate grabbing the incore
3275 enough to load it into the inode cache.
3282 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3286 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3305 - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3308 - Speculative preallocations need to be unreserved.
3310 - An unlinked file may have lost its last reference, in which case the entire
3315 Inactivation has two parts -- the VFS part, which initiates writeback on all
3316 dirty file pages, and the XFS part, which cleans up XFS-specific information
3330 4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
3331    can update page cache mappings.
3353 then decide to cross-reference the object with an object that is acquired
3369 save time if another process re-opens the file before the system runs out
3371 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3386 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3388 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`…
3396 in a well-known order: parent → child when updating the directory tree, and
3414 Solving both of these problems is straightforward -- any time online fsck
3448 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
3477 overhead for single-threaded applications.
3483 - A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3484   a well-known incore filesystem object.
3486 - Each hook must define an action code and a structure containing more context
3489 - Hook providers should provide appropriate wrapper functions and structs
3493 - A callsite in the regular filesystem code must be chosen to call
3503 - The online fsck function should define a structure to hold scan data, a lock
3508 - The online fsck code must contain a C function to catch the hook action code
3513 - Prior to unlocking inodes to start the scan, online fsck must call
3517 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3561 - Prior to invoking the notifier call chain, the filesystem function being
3565 - The scanning function and the scrub hook function must coordinate access to
3568 - Scrub hook function must not add the live update information to the scan
3573 - Scrub hook functions must not change the caller's state, including the
3578 - The hook function can abort the inode scan to avoid breaking the other rules.
3582 - ``xchk_iscan_start`` starts a scan
3584 - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
3587 - ``xchk_iscan_want_live_update`` to decide if an inode has already been
3590   in-memory scan information.
3592 - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
3595 - ``xchk_iscan_teardown`` to finish the scan
3599 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3691 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3701 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3732 Non-directories never have children of any kind.
3749 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
3759 and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3761 repair code -- code and data are entirely contained within the scrub module,
3764 A secondary advantage of this repair approach is atomicity -- once the kernel
3776 <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3795    a. Create a btree cursor for the in-memory btree.
3797    b. Use the rmap code to add the record to the in-memory btree.
3806    a. Create a btree cursor for the in-memory btree.
3808    b. Replay the operation into the in-memory btree.
3833 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3837 --------------------------------------------
3842 File forks map 64-bit logical file fork space extents to physical storage space
3843 extents, similar to how a memory management unit maps 64-bit virtual addresses
3845 Therefore, file-based tree structures (such as directories and extended
3847 to other blocks mapped within that same address space, and file-based linear
3853 Therefore, online repair of file-based metadata createas a temporary file in
3867 field of the block headers to match the file being repaired and not the
3872 There is a downside to the reaping process -- if the system crashes during the
3884 +--------------------------------------------------------------------------+
3886 +--------------------------------------------------------------------------+
3901 | - Array structures are linearly addressed, and the regular filesystem    |
3905 | - Extended attributes are allowed to use the entire attr fork offset     |
3908 | - Even if repair could build an alternate copy of a data structure in a  |
3914 | - A crash after construction of the secondary tree but before the range  |
3918 | - Reaping blocks after a repair is not a simple operation, and           |
3922 | - Directory entry blocks and quota records record the file fork offset   |
3923 |   in the header area of each block.                                      |
3925 |   each block header.                                                     |
3926 |   Rewriting a single field in block headers is not a huge problem, but   |
3929 | - Each block in a directory or extended attributes btree index contains  |
3930 |   sibling and child block pointers.                                      |
3931 |   Were the atomic commit to use a range collapse operation, each block   |
3938 +--------------------------------------------------------------------------+
3945 This allocates an inode, marks the in-core inode private, and attaches it to
3953 The usage patterns of these two locks are the same as for any other XFS file --
3976 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
3980 -----------------------------
3990 a. When the reverse-mapping btree is enabled, the swap code must keep the
3995 b. Reverse-mapping is critical for the operation of online fsck, so the old
4001    For this use case, an incomplete exchange will not result in a user-visible
4004 d. Online repair needs to swap the contents of two files that are by definition
4006    For directory and xattr repairs, the user-visible contents might be the
4009 e. Old blocks in the file may be cross-linked with another structure and must
4010    not reappear if the system goes down mid-repair.
4016 the reverse-mapping extent swap code, but records intermedia progress in the
4030 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
4033 +--------------------------------------------------------------------------+
4034 | **Sidebar: Using Log-Incompatible Feature Flags**                        |
4035 +--------------------------------------------------------------------------+
4070 | Log-assisted extended attribute updates and file content exchanges bothe |
4073 +--------------------------------------------------------------------------+
4086 This is roughly the format of the new deferred exchange-mapping work item:
4088 .. code-block:: c
4124    At the start, it should contain the entirety of the file block ranges to be
4134    a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
4146    b. Create a deferred block mapping update to unmap map1 from file 1.
4148    c. Create a deferred block mapping update to unmap map2 from file 2.
4150    d. Create a deferred block mapping update to map map1 into file 2.
4152    e. Create a deferred block mapping update to map map2 into file 1.
4154    f. Log the block, quota, and extent count updates for both files.
4156    g. Extend the ondisk size of either file if necessary.
4162       This quantity is ``(map1.br_startoff + map1.br_blockcount -
4175       The operation manager completes the deferred work in steps 3b-3e before
4178 4. Perform any post-processing.
4193 First, regular files require the page cache to be flushed to disk before the
4201 - Data device blocks needed to handle the repeated updates to the fork
4203 - Change in data and realtime block counts for both files.
4204 - Increase in quota usage for both files, if the two files do not share the
4206 - The number of extent mappings that will be added to each file.
4207 - Whether or not there are partially written realtime extents.
4226 - If both forks are in local format and the fork areas are large enough, the
4232 - If both forks map blocks, then the regular atomic file mapping exchange is
4235 - Otherwise, only one fork is in local format.
4236   The contents of the local format fork are converted to a block to perform the
4238   The conversion to block format must be done in the same transaction that
4247 Extended attributes and directories stamp the owning inode into every block,
4251 repair builds every block in the new data structure with the owner field of the
4256 extent reaping <reaping>` mechanism that is done post-repair.
4260 However, this iunlink processing omits the cross-link detection of online
4293 the filesystem block size between 4KiB and 1GiB in size.
4294 The realtime summary file indexes the number of free extents of a given size to
4295 the offset of the block within the realtime free space bitmap where those free
4301 The summary file itself is a flat file (with no block headers or checksums!)
4302 partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
4304 Each counter records the number of free extents that start in that bitmap block
4305 and can satisfy a power-of-two allocation request.
4328 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
4334 In XFS, extended attributes are implemented as a namespaced name-value store.
4335 Values are limited in size to 64KiB, but there is no limit in the number of
4338 structure is always in logical block zero, but attribute leaf blocks, dabtree
4340 Attribute leaf blocks contain variable-sized records that associate
4341 user-provided names with the user-provided values.
4342 Values larger than a block are allocated separate extents and written there.
4343 If the leaf information expands beyond a single block, a directory/attribute
4353    a. Walk the attr leaf block to find candidate keys.
4374 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
4378 ------------------
4398 Unlike extended attributes, directory blocks are all the same size, so
4410    a. Walk the directory data block to find candidate entries.
4430 **Future Work Question**: Should repair revalidate the dentry cache when
4435 In theory it is necessary to scan all dentry cache entries for a directory to
4441    directory and the dentry can be purged from the cache.
4447 Unfortunately, the current dentry cache design doesn't provide a means to walk
4453 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4479 +--------------------------------------------------------------------------+
4481 +--------------------------------------------------------------------------+
4565 |    the parent inumber is now xor'd into the hash index.                  |
4569 +--------------------------------------------------------------------------+
4580    size fields involved in a directory update: ``(child inumber, add vs.
4617 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4628    fixed size fields involved in a parent pointer update: ``(parent inumber,
4657 6. Copy all non-parent pointer extended attributes to the temporary file.
4667 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
4696       name_cookie)`` tuples in a per-AG in-memory slab.  The ``name_hash``
4702    a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
4720          c. Record the name in a per-file xfblob, and remember the xfblob
4723          d. Store ``(parent_inum, parent_gen, name_hash, name_len,
4724             name_cookie)`` tuples in a per-file slab.
4726       2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
4730          per-AG tuple slab.
4731          This should be trivial since the per-AG tuples are in child inumber
4734       4. Position a second slab cursor at the start of the per-file tuple slab.
4740          a. If the per-AG cursor is at a lower point in the keyspace than the
4741             per-file cursor, then the per-AG cursor points to a missing parent
4743             Add the parent pointer to the inode and advance the per-AG
4746          b. If the per-file cursor is at a lower point in the keyspace than
4747             the per-AG cursor, then the per-file cursor points to a dangling
4749             Remove the parent pointer from the inode and advance the per-file
4760 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
4764 challenging because xfs_repair currently uses two single-pass scans of the
4767 This scan would have to be converted into a multi-pass scan:
4784    the dirents and add them to the now-empty directories.
4797 Fortunately, non-directories are allowed to have multiple parents and cannot
4799 Directories typically constitute 5-10% of the files in a filesystem, which
4826 the parent -> child relationship by taking the ILOCKs and installing a dirent
4866          c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
4908 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
4915 -------------
4920 downwards either to more subdirectories or to non-directory files.
4945 security attributes and dentry cache entries, just like a regular directory
4967    cache.
4978 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
4991 -----------------
5010 d. Directories, extended attributes, and file data depend on consistency within
5033 - Phase 1 checks that the provided path maps to an XFS filesystem and detect
5036 - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
5038 - Phase 3 scans inodes in parallel.
5039   For each inode, groups (f), (e), and (d) are checked, in that order.
5041 - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
5044 - Phase 5 starts by checking groups (b) and (c) in parallel before moving on
5047 - Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
5050 - Phase 7 checks group (a), having validated everything else.
5056 --------------------
5059 Given that XFS targets installations with large high-performance storage,
5096 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-t…
5099 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalan…
5105 ------------------
5125 filesystem object, it became much more memory-efficient to track all eligible
5127 Each repair item represents a single lockable object -- AGs, metadata files,
5168    d. If any repairs were made, jump back to 1c to retry all the phase 3 items.
5184 …://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warn…
5187 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-d…
5190 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracki…
5193 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedu…
5197 -----------------------------------------------
5207 - Slashes and null bytes are not allowed in directory entries.
5209 - Null bytes are not allowed in userspace-visible extended attributes.
5211 - Null bytes are not allowed in the filesystem label.
5216 presented together -- all the names in a directory, or all the attributes of a
5220 modern-day Linux systems is that programs work with Unicode character code
5222 These programs typically encode those code points in UTF-8 when interfacing
5223 with the C library because the kernel expects null-terminated names.
5225 UTF-8 encoded Unicode data.
5233 The standard also permits characters to be constructed in multiple ways --
5243 For example, the character "Right-to-Left Override" U+202E can trick some
5260 When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
5263 `libicu <https://github.com/unicode-org/icu>`_
5266 Names are also checked for control characters, non-rendering characters, and
5272 ---------------------------------------
5283 verification request is sent to the disk as a directio read of the raw block
5286 If the verification read fails, ``xfs_scrub`` retries with single-block reads
5292 construct file paths from inode numbers for user-friendly reporting.
5307 ----------------------
5324 When XFS v5 came along with self-describing metadata, this old mechanism grew
5331 This mechanism is identical to steps 2-3 from the procedure above except for
5343 - **Atomic commit of file writes**: A userspace process opens a file that it
5354 - **Transactional file updates**: The same mechanism as above, but the caller
5366 - **Emulation of atomic block device writes**: Export a block device with a
5367   logical sector size matching the filesystem block size to force all writes
5368   to be aligned to the filesystem block size.
5376 ----------------
5398 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
5401 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
5405 ------------------------------------
5420 ------------------------
5437 uses the new fallocate map-freespace call to map any free space in that region
5451 contents; any changes will be written somewhere else via copy-on-write.
5490 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
5493 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
5497 ---------------------