Lines Matching +full:reverse +full:- +full:data

1 .. SPDX-License-Identifier: GPL-2.0
8 Heading 3 uses "----"
25 - To help kernel distributors understand exactly what the XFS online fsck
28 - To help people reading the code to familiarize themselves with the relevant
31 - To help developers maintaining the system by capturing the reasons
59 - Provide a hierarchy of names through which application programs can associate
60 arbitrary blobs of data for any length of time,
62 - Virtualize physical storage media across those names, and
64 - Retrieve the named data blobs at any time.
66 - Examine resource usage.
70 Secondary metadata (e.g. reverse mapping and directory parent pointers) support
79 cross-references different types of metadata records with each other to look
81 People do not like losing data, so most fsck tools also contain some ability
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
84 the filesystem metadata to a consistent state, not to maximize the data
92 it is now possible to regenerate data structures when non-catastrophic errors
95 +--------------------------------------------------------------------------+
97 +--------------------------------------------------------------------------+
98 | System administrators avoid data loss by increasing the number of |
103 +--------------------------------------------------------------------------+
106 -----------------------
109 …nges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`…
110 …ps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-servi…
111 …ges <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
116 --------------
132 It uses extent-based in-memory data structures to reduce memory consumption,
141 -----------------
155 4. **Data owners** cannot **check the integrity** of their stored data without
175 This new third program has three components: an in-kernel facility to check
176 metadata, an in-kernel facility to repair metadata, and a userspace driver
183 +--------------------------------------------------------------------------+
185 +--------------------------------------------------------------------------+
193 +--------------------------------------------------------------------------+
212 autonomous self-healing of XFS maximizes service availability.
226 +------------------------------------------------------------------+
228 +------------------------------------------------------------------+
231 +------------------------------------------------------------------+
238 -----
260 --------------
309 6. If the caller asks for a media scan, read all allocated and written data
311 The ability to use hardware-assisted data file integrity checking is new
315 7. Re-check the summary counters and present the caller with a summary of
322 -------------------------
324 The kernel scrub code uses a three-step strategy for checking and repairing
337 2. The repair function is called to rebuild the data structure.
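A sketch of how these steps might chain together follows; the ``sc`` scrub
context and its operation vectors are assumptions for illustration, not the
exact kernel interfaces:

.. code-block:: c

    /*
     * Hypothetical shape of the three-step strategy: check the
     * structure, repair it only if corruption was flagged, then
     * re-run the check to verify the repair.
     */
    error = sc->ops->scrub(sc);
    if (error)
        return error;

    if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) {
        error = sc->ops->repair(sc);
        if (error)
            return error;

        error = sc->ops->scrub(sc);
    }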
349 --------------------------
362 - Free space and reference count information
364 - Inode records and indexes
366 - Storage mapping information for file data
368 - Directories
370 - Extended attributes
372 - Symbolic links
374 - Quota limits
383 errors and cross-references healthy records against other metadata to look for
391 Finally, the storage from the old data structure is carefully reaped.
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
408 V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
410 *Extending Database Technology*, pp. 293-309, 1992.
413 in-memory array prior to formatting the new ondisk structure, which is very
414 similar to the list-based algorithm discussed in section 2.3 ("List-Based
416 However, any data structure builder that maintains a resource lock for the
429 - Reverse mapping information
431 - Directory parent pointers
444 Instead, repair functions set up an in-memory staging structure to store
450 When the repair scanner needs to record an observation, the staging data are
455 Once the scan is done, the owning object is re-locked, the live data is used to
458 Finally, the storage from the old data structure is carefully reaped.
478 2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
485 Their method consists of an index builder that extracts relevant record data to
526 - Summary counts of free space and inodes
528 - File link counts from directories
530 - Quota resource usage counts
539 techniques as outlined above, but because the underlying data are sets of
540 integer counters, the staging data need not be a fully functional mirror of the
547 <http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
549 Since quotas are non-negative integer counts of resource usage, online
554 whereas the data structure being rebuilt is an index of dquots.
557 separate data structure.
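Since the staging data need only mirror integer counters, a shadow record can
be a handful of integers per quota id; a hypothetical sketch, not necessarily
the exact layout used by the quotacheck series:

.. code-block:: c

    /*
     * Hypothetical shadow dquot record: resource counts observed by
     * the scan, stored sparsely by quota id and bumped as each file
     * is visited.
     */
    struct xqcheck_dquot {
        int64_t         icount;     /* inodes observed */
        int64_t         bcount;     /* data blocks observed */
        int64_t         rtbcount;   /* rt blocks observed */
        unsigned int    flags;
    };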
562 ---------------
569 - **Decreased performance**: Adding metadata indices to the filesystem
570 increases the time cost of persisting changes to disk, and the reverse space
573 reverse mapping features at format time, though this choice dramatically
576 - **Incorrect repairs**: As with all software, there might be defects in the
583 The xfsprogs build system has a configure option (``--enable-scrub=no``) that
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
597 - **Misbehavior**: Online fsck requires many privileges -- raw IO to block
607 - **Fuzz Kiddiez**: There are many people now who seem to think that running
609 spraying exploit code onto the public mailing list for instant zero-day
616 Automated testing should front-load some of the risk while the feature is
632 3. Minimize further loss of data.
637 of every aspect of a fsck tool until the introduction of low-cost virtual
638 machines with high-IOPS storage.
645 -------------------------------
657 `fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
660 would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
665 ``xfs_scrub -n`` between each test to ensure that the new checking code
679 ---------------------------------------
693 in-kernel validation functions and the ability of the offline fsck tool to
711 -----------------------------------------
745 2. Offline checking (``xfs_repair -n``)
747 4. Online checking (``xfs_scrub -n``)
762 These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
768 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements…
770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
772 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`…
775 --------------
790 * Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
792 * Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
793 force-repairing the whole filesystem doesn't cause problems.
794 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
796 * Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
797 remounting the filesystem read-only and read-write.
805 …git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-
806 and the `evolution of existing per-function stress testing
807 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stre…
819 ------------------
827 Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
831 correction capabilities of the hardware to check data file contents.
843 ------------------
849 possible, the lowest CPU and IO priority, and in a CPU-constrained single
867 * ``xfs_scrub_all.cron`` on non-systemd systems
870 additional media scan of all file data once per month.
871 This is less foolproof than, say, storing file data block checksums, but much
879 This was performed via ``systemd-analyze security``, after which privileges
891 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-se…
894 ----------------
901 download this information into a human-readable format.
917 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-repo…
920 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-report…
922 5. Kernel Algorithms and Data Structures
925 This section discusses the key algorithms and data structures of the kernel
934 ------------------------
965 Documentation/filesystems/xfs-self-describing-metadata.rst
967 Reverse Mapping
968 ---------------
975 the filesystem, even at the cost of data integrity.
977 increase internal redundancy -- either storing nearly identical copies of
978 metadata, or more space-efficient encoding techniques.
987 Like any system that improves redundancy, the reverse-mapping feature increases
989 However, it has two critical advantages: first, the reverse index is key to
992 Second, the different ondisk storage format of the reverse mapping btree
993 defeats device-level deduplication because the filesystem requires real
996 +--------------------------------------------------------------------------+
998 +--------------------------------------------------------------------------+
1000 | improve the robustness of user data storage itself. |
1001 | This is a valid point, but adding a new index for file data block |
1002 | checksums increases write amplification by turning data overwrites into |
1003 | copy-writes, which age the filesystem prematurely. |
1004 | In keeping with thirty years of precedent, users who want file data |
1011 +--------------------------------------------------------------------------+
1013 The information captured in a reverse space mapping record is as follows:
1015 .. code-block:: c
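
    /*
     * In-memory reverse mapping record, sketched from the XFS
     * sources (fs/xfs/libxfs/xfs_rmap.h), which remain the
     * authoritative definition.
     */
    struct xfs_rmap_irec {
        xfs_agblock_t   rm_startblock;  /* extent start block */
        xfs_extlen_t    rm_blockcount;  /* extent length */
        uint64_t        rm_owner;       /* extent owner */
        uint64_t        rm_offset;      /* offset within the owner */
        unsigned int    rm_flags;       /* state flags */
    };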
1031 Finally, the flags field provides extra information about the space usage --
1033 unwritten data extent?
1037 The reverse mapping index plays a key role in the consistency checking process
1042 For example, a file data extent mapping can be checked against:
1046 * The absence of an entry in the reference count data if the file is not
1048 * The correspondence of an entry in the reverse mapping information.
1050 There are several observations to make about reverse mapping indices:
1052 1. Reverse mappings can provide a positive affirmation of correctness if any of
1060 For example, checking a reverse mapping record for a file extent mapping
1063 Instead, scrub relies on rigorous cross-referencing during the primary space
1066 3. Consistency scans must use non-blocking lock acquisition primitives if the
1072 This means that forward progress during this part of a scan of the reverse
1073 mapping data cannot be guaranteed if system load is heavy.
1075 In summary, reverse mappings play a key role in reconstruction of primary
1080 Checking and Cross-Referencing
1081 ------------------------------
1091 - Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1092 - Is this structure inconsistent with the rest of the system
1094 - Is there so much damage around the filesystem that cross-referencing is not
1096 - Can the structure be optimized to improve performance or reduce the size of
1098 - Does the structure contain data that is not inconsistent but deserves review
1111 - Does the block belong to this filesystem?
1113 - Does the block belong to the structure that asked for the read?
1117 - Is the type of data stored in the block within a reasonable range of what
1120 - Does the physical location of the block match the location it was read from?
1122 - Does the block checksum match the data?
1124 The scope of the protections here is very limited -- verifiers can only
1134 userspace as corruption; during a cross-reference, they are reported as a
1135 failure to cross-reference once the full examination is complete.
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1150 - Does the type of data stored in the block match what scrub is expecting?
1152 - Does the block belong to the owning structure that asked for the read?
1154 - If the block contains records, do the records fit within the block?
1156 - If the block tracks internal free space information, is it consistent with
1159 - Are the records contained inside the block free of obvious corruptions?
1161 Record checks in this category are more rigorous and more time-intensive.
1174 Validation of Userspace-Controlled Record Attributes
1182 - Superblock fields controlled by mount options
1183 - Filesystem labels
1184 - File timestamps
1185 - File permissions
1186 - File size
1187 - File flags
1188 - Names present in directory entries, extended attribute keys, and filesystem
1190 - Extended attribute key namespaces
1191 - Extended attribute values
1192 - File data block contents
1193 - Quota limits
1194 - Quota timer expiration (if resource usage exceeds the soft limit)
1196 Cross-Referencing Space Metadata
1200 cross-referencing records between metadata structures.
1204 The exact set of cross-references is highly dependent on the context of the
1205 data structure being checked.
1211 For the reverse mapping btree, it is possible to mask parts of the key for the
1216 Btree blocks undergo the following checks before cross-referencing:
1218 - Does the type of data stored in the block match what scrub is expecting?
1220 - Does the block belong to the owning structure that asked for the read?
1222 - Do the records fit within the block?
1224 - Are the records contained inside the block free of obvious corruptions?
1226 - Are the name hashes in the correct order?
1228 - Do node pointers within the btree point to valid block addresses for the type
1231 - Do child pointers point towards the leaves?
1233 - Do sibling pointers point across the same level?
1235 - For each node block record, does the record key accurately reflect the contents
1238 Space allocation records are cross-referenced as follows:
1240 1. Any space mentioned by any metadata structure is cross-referenced as
1243 - Does the reverse mapping index list only the appropriate owner as the
1246 - Are none of the blocks claimed as free space?
1248 - If these aren't file data blocks, are none of the blocks claimed as space
1251 2. Btree blocks are cross-referenced as follows:
1253 - Everything in class 1 above.
1255 - If there's a parent node block, do the keys listed for this block match the
1258 - Do the sibling pointers point to valid blocks? Of the same level?
1260 - Do the child pointers point to valid blocks? Of the next level down?
1262 3. Free space btree records are cross-referenced as follows:
1264 - Everything in class 1 and 2 above.
1266 - Does the reverse mapping index list no owners of this space?
1268 - Is this space not claimed by the inode index for inodes?
1270 - Is it not mentioned by the reference count index?
1272 - Is there a matching record in the other free space btree?
1274 4. Inode btree records are cross-referenced as follows:
1276 - Everything in class 1 and 2 above.
1278 - Is there a matching record in the free inode btree?
1280 - Do cleared bits in the holemask correspond with inode clusters?
1282 - Do set bits in the freemask correspond with inode records with zero link
1285 5. Inode records are cross-referenced as follows:
1287 - Everything in class 1.
1289 - Do all the fields that summarize information about the file forks actually
1292 - Does each inode with zero link count correspond to a record in the free
1295 6. File fork space mapping records are cross-referenced as follows:
1297 - Everything in class 1 and 2 above.
1299 - Is this space not mentioned by the inode btrees?
1301 - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1304 7. Reference count records are cross-referenced as follows:
1306 - Everything in class 1 and 2 above.
1308 - Within the space subkeyspace of the rmap btree (that is to say, all
1310 are there the same number of reverse mapping records for each block as the
1315 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-
1317 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gap…
1319 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-ga…
1322 …tps://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-r…
1325 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-
1331 Extended attributes implement a key-value store that enables fragments of data
1335 Most typically these fragments are metadata about the file -- origins, security
1336 contexts, user-supplied labels, indexing information, etc.
1356 Scrub must read each block mapped by the attr fork and ignore the non-leaf
1374 Checking and Cross-Referencing Directories
1380 255-byte sequence (name) to an inumber.
1385 Each non-directory file may have multiple directories point to it.
1389 The first partition contains directory entry data blocks.
1390 Each data block contains variable-sized records associating a user-provided
1392 If the directory entry data grows beyond one block, the second partition (which
1393 exists as post-EOF extents) is populated with a block containing free space
1394 information and an index that maps hashes of the dirent names to directory data
1402 directory data blocks.
1439 maps user-provided names to improve lookup times by avoiding linear scans.
1440 Internally, it maps a 32-bit hash of the name to a block offset within the
1444 fixed-size metadata records -- each dabtree block contains a magic number, a
1446 The format of leaf and node records are the same -- each entry points to the
1448 leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1456 - Does the block belong to the owning structure that asked for the read?
1458 - Do the records fit within the block?
1460 - Are the records contained inside the block free of obvious corruptions?
1462 - Are the name hashes in the correct order?
1464 - Do node pointers within the dabtree point to valid fork offsets for dabtree
1467 - Do leaf pointers within the dabtree point to valid fork offsets for directory
1470 - Do child pointers point towards the leaves?
1472 - Do sibling pointers point across the same level?
1474 - For each dabtree node record, does the record key accurately reflect the
1477 - For each dabtree leaf record, does the record key accurately reflect the
1480 Cross-Referencing Summary Counters
1486 In theory, the amount of available resources (data blocks, inodes, realtime
1490 Cross-referencing these values against the filesystem metadata should be a
1498 Post-Repair Reverification
1510 ------------------------------------
1512 Complex operations can make modifications to multiple per-AG data structures
1555 reverse mapping and reflink were introduced.
1570 It then attaches to the in-memory transaction an action item to schedule
1598 Happily, log recovery corrects this inconsistency for us -- when recovery finds
1608 In other words, all per-AG metadata updates for an unmapped block must be
1624 During the design phase of the reverse mapping and reflink features, it was
1625 decided that it was impractical to cram all the reverse mapping updates for a
1630 * A reverse mapping update for the block mapping update
1632 * A reverse mapping update for the freelist fix
1635 * A reverse mapping update for the btree update
1637 * A reverse mapping update for the freelist fix
1640 * A reverse mapping update for the refcount update
1642 * A reverse mapping update for the freelist fix
1646 * A reverse mapping update for the freelist fix
1650 * A reverse mapping update for the freelist fix
1654 For copy-on-write updates this is even worse, because this must be done once to
1658 work items to cover most reverse mapping updates and all refcount updates.
1665 However, online fsck changes the rules -- remember that although physical
1666 updates to per-AG structures are coordinated by locking the buffers for AG
1668 Once scrub acquires resources and takes locks for a data structure, it must do
1672 For example, if a thread performing a copy-on-write has completed a reverse
1699 protect the data structure being scrubbed to look for pending operations.
1717 holding an AG header lock, but per-AG work items cannot be marked done without
1733 calls ``->finish_item`` to complete it.
1735 4. The ``->finish_item`` implementation logs some changes and calls
1761 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
1777 Taking locks in the hot path of a writer thread to access a data structure only
1778 to find that no further action is necessary is expensive -- on the author's
1779 computer, this has an overhead of 40-50ns per access.
1804 - The hooked part of XFS should declare a static-scoped static key that
1809 - When deciding to invoke code that's only used by scrub, the regular
1811 scrub-only hook code if the static key is not enabled.
1813 - The regular filesystem should export helper functions that call
1819 - Scrub functions wanting to turn on scrub-only XFS functionality should call
1832 return -EDEADLOCK, which should result in the scrub being restarted with the
1839 Documentation/staging/static-keys.rst.
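A minimal sketch of this pattern using the kernel's static key API; the
structure, hook, and helper names below are illustrative stand-ins, not the
actual XFS symbols:

.. code-block:: c

    #include <linux/jump_label.h>

    /* Defaults to false; patched to true only while scrub runs. */
    static DEFINE_STATIC_KEY_FALSE(xfs_example_hooks_switch);

    /* Hot path: a no-op branch unless a scrubber enabled the key. */
    void xfs_example_update(struct xfs_example *ex)
    {
        if (static_branch_unlikely(&xfs_example_hooks_switch))
            xfs_example_call_hooks(ex);
    }

    /* Enabling may sleep, so call outside of transactions and locks. */
    void xfs_example_hooks_enable(void)
    {
        static_branch_inc(&xfs_example_hooks_switch);
    }

    void xfs_example_hooks_disable(void)
    {
        static_branch_dec(&xfs_example_hooks_switch);
    }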
1844 ----------------------
1853 a new data structure.
1860 difficult, especially on 32-bit systems.
1872 indexed data storage would also be very useful.
1873 Fortunately, the Linux kernel already has a facility for byte-addressable and
1875 In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1876 to store intermediate data that doesn't need to be in memory at all times, so
1880 +--------------------------------------------------------------------------+
1882 +--------------------------------------------------------------------------+
1885 | built data structure, which would be live after recovery finished. |
1887 | The second edition solved the half-rebuilt structure problem by storing |
1892 +--------------------------------------------------------------------------+
1899 1. Arrays of fixed-sized records (space management btrees, directory and
1902 2. Sparse arrays of fixed-sized records (quotas and link counts)
1907 4. Staging btrees in memory (reverse mapping btrees)
1911 To support the first four use cases, high level data structures wrap the xfile
1914 four of those five higher level data structures.
1919 and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
1922 XFS is very record-based, which suggests that the ability to load and store
1953 :ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
1966 xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1979 other threads to provide updates to the scanned data, the scrub function must
1984 Arrays of Fixed-Sized Records
1988 counts, file fork space, and reverse mappings) consists of a set of fixed-size
1990 Directories have a set of fixed-size dirent records that point to the names,
1991 and extended attributes have a set of fixed-size attribute keys that point to
2001 The ``xfarray`` abstraction presents a linear array for fixed-size records atop
2002 the byte-accessible xfile.
2019 ``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2034 rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted
2039 reverse mapping information.
2047 `big in-memory array
2048 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
2057 .. code-block:: c
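
    /*
     * Sketch of iterating every array element; the cursor type and
     * helpers follow the xfarray series and should be treated as an
     * assumed shape of the interface.
     */
    xfarray_idx_t   i;

    foreach_xfarray_idx(array, i) {
        error = xfarray_load(array, i, &rec);
        if (error)
            break;
        /* examine the record in @rec */
    }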
2075 .. code-block:: c
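
    /*
     * Sketch of iterating only the written elements of a sparse
     * array, skipping holes; again an assumed interface shape.
     */
    xfarray_idx_t   i = XFARRAY_CURSOR_INIT;

    while ((error = xfarray_iter(array, &i, &rec)) == 1) {
        /* examine the record in @rec */
    }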
2138 memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2141 low-effort robust (resistant) location in large samples`, in *Contributions to
2145 The partitioning of quicksort is fairly textbook -- rearrange the record
2184 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ serie…
2188 In-Memory B+Trees
2195 Keeping the scan data up to date requires the ability to propagate
2196 metadata updates from the filesystem into the data being collected by the scan.
2200 Another option is to skip the side-log and commit live updates from the
2201 filesystem directly into the scan data, which trades more overhead for a lower
2203 In both cases, the data structure holding the scan results must support indexed
2206 Given that indexed lookups of scan data is required for both strategies, online
2208 scan data.
2211 Conveniently, however, XFS has a library to create and maintain ordered reverse
2218 The XFS buffer cache specializes in abstracting IO to block-oriented address
2225 `in-memory btree
2226 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2235 per-AG structure.
2240 With this adaptation in place, users of the xfile-backed buffer cache use
2241 exactly the same APIs as users of the disk-backed buffer cache.
2244 updates to an in-memory btree.
2250 Space management for an xfile is very simple -- each btree block is one memory
2252 These blocks use the same header format as an on-disk btree, but the in-memory
2254 corruption-prone than regular DRAM.
2294 and to update the in-memory btree.
2309 structure, the ephemeral nature of the in-memory btree block storage presents
2313 other than the data device.
2342 ------------------------------
2353 rebuilding a btree index from a collection of records -- bulk btree loading.
2354 This was implemented rather inefficiently code-wise, since ``xfs_repair``
2355 had separate copy-pasted implementations for each btree type.
2376 maxrecs = (block_size - header_size) / record_size
2402 maxrecs = (block_size - header_size) / (key_size + ptr_size)
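To sketch how these quantities determine the size of the new btree, the
height can be estimated by repeated division; ``nr_records``,
``leaf_maxrecs``, and ``node_maxrecs`` are stand-in variables, and plain
integer arithmetic is used rather than any particular kernel helper:

.. code-block:: c

    /*
     * Estimate btree height from the record count: one division by
     * the leaf block capacity, then divisions by the node block
     * capacity until a single block suffices.  Loading factors are
     * ignored for brevity.
     */
    unsigned int    height = 1;
    uint64_t        nr_blocks;

    nr_blocks = (nr_records + leaf_maxrecs - 1) / leaf_maxrecs;
    while (nr_blocks > 1) {
        nr_blocks = (nr_blocks + node_maxrecs - 1) / node_maxrecs;
        height++;
    }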
2421 - For AG-rooted btrees, this level is the root level, so the height of the new
2425 - For inode-rooted btrees where the records in the top level do not fit in the
2430 - For inode-rooted btrees where the records in the top level can be stored in
2434 This only becomes relevant when non-bmap btrees gain the ability to root in
2444 Each reserved extent is tracked separately by the btree builder state data.
2447 its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
2452 extent, it updates the in-memory reservation to reflect the claimed space.
2472 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2475 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2481 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2580 1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
2605 reverse mapping information.
2606 Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
2608 Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
2625 Once the repair function accumulates one chunk's worth of data, it calls
2627 This xfarray is walked twice during the btree creation step -- once to populate
2629 free inode btree with records for chunks that have free non-sparse inodes.
2636 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2642 Reverse mapping records are used to rebuild the reference count information.
2644 file data.
2645 Imagine the reverse mapping entries as rectangles representing extents of
2650 In other words, the record emission stimulus is level-triggered::
2661 Extents being used to stage copy-on-write operations should be the only records
2663 Single-owner file blocks aren't recorded in either the free space or the
2668 1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
2669 records for any space having more than one reverse mapping and add them to
2695 generate refcount information from reverse mapping records.
2697 - Until the reverse mapping btree runs out of records:
2699 - Retrieve the next record from the btree and put it in a bag.
2701 - Collect all records with the same starting block from the btree and put
2704 - While the bag isn't empty:
2706 - Among the mappings in the bag, compute the lowest block number where the
2709 unprocessed reverse mapping or the next block after the shortest mapping
2712 - Remove all mappings from the bag that end at this position.
2714 - Collect all reverse mappings that start at this position from the btree
2717 - If the size of the bag changed and is greater than one, create a new
2721 The bag-like structure in this case is a type 2 xfarray as discussed in the
2723 Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
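The sweep itself can be modeled as a standalone program; the sketch below
operates on a plain array instead of an xfarray bag and exists only to
illustrate the level-triggered record emission:

.. code-block:: c

    #include <stdio.h>
    #include <stdint.h>

    struct rmap { uint64_t start, end; };   /* extent is [start, end) */

    int main(void)
    {
        /* Three reverse mappings sharing parts of one extent. */
        struct rmap rmaps[] = { {10, 20}, {12, 18}, {12, 25} };
        unsigned int nr = sizeof(rmaps) / sizeof(rmaps[0]);
        uint64_t pos = UINT64_MAX, next;
        unsigned int i, count;

        /* Start the sweep at the lowest mapped block. */
        for (i = 0; i < nr; i++)
            if (rmaps[i].start < pos)
                pos = rmaps[i].start;

        for (;;) {
            /* Count mappings covering @pos; find the next boundary. */
            next = UINT64_MAX;
            count = 0;
            for (i = 0; i < nr; i++) {
                if (rmaps[i].start > pos && rmaps[i].start < next)
                    next = rmaps[i].start;
                if (rmaps[i].end > pos && rmaps[i].end < next)
                    next = rmaps[i].end;
                if (rmaps[i].start <= pos && rmaps[i].end > pos)
                    count++;
            }
            if (next == UINT64_MAX)
                break;
            /* Only shared regions become refcount records. */
            if (count > 1)
                printf("refcount [%llu, %llu) = %u\n",
                       (unsigned long long)pos,
                       (unsigned long long)next, count);
            pos = next;
        }
        return 0;
    }

For the sample mappings above this prints ``refcount [12, 18) = 3`` and
``refcount [18, 20) = 2``, matching the rule that single-owner regions are
not recorded.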
2729 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2735 The high level process to rebuild a data/attr fork mapping btree is:
2737 1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
2738 records from the reverse mapping records for that inode and fork.
2762 immediate areas if the data and attr forks are not both in BMBT format.
2770 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2776 ---------------------------
2778 Whenever online fsck builds a new data structure to replace one that is
2789 As part of a repair, online fsck relies heavily on the reverse mapping records
2792 there may be other data structures that also think they own some of those
2800 1. Create a bitmap of space used by data structures that must be preserved.
2804 2. Survey the reverse mapping data to create a bitmap of space owned by the
2811 Repairs for file-based metadata such as extended attributes, directories,
2819 4. For each candidate extent, count the number of reverse mapping records for
2821 data structure being repaired.
2823 - If zero, the block has a single owner and can be freed.
2825 - If not, the block is part of a crosslinked structure and must not be
2831 6. If the region is crosslinked, delete the reverse mapping entry for the
2855 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-l…
2869 2. For each reverse mapping record with an rmap owner corresponding to the
2872 3. Walk the current data structures that have the same rmap owner.
2876 old data structures and hence is a candidate for reaping.
2881 common case), then step 2 can be performed at the same time as the reverse
2889 1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
2890 records from the gaps in the reverse mapping btree.
2907 reverse mapping btree, the new free space btrees, or the AGFL.
2912 First, free space is not explicitly tracked in the reverse mapping records.
2914 space component of the keyspace of the reverse mapping btree.
2926 information changes the number of free space records, repair must re-estimate
2929 As part of committing the new btrees, repair must ensure that reverse mappings
2937 Blocks for the free space btrees and the reverse mapping btrees are supplied by
2939 Blocks put onto the AGFL have reverse mapping records with the owner
2942 btrees or the reverse mapping btrees.
2943 When repair walks reverse mapping records to synthesize free space records, it
2955 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2960 Case Study: Reaping After Repairing Reverse Mapping Btrees
2963 Old reverse mapping btrees are less difficult to reap after a repair.
2965 btree blocks, and the reverse mapping btree blocks all have reverse mapping
2967 The full process of gathering reverse mapping records and building a new btree
2969 :ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
2979 The rest of the process of rebuilding the reverse mapping btree is discussed
2984 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2992 1. Create a bitmap for all the space that the reverse mapping data claims is
2997 3. Subtract any space that the reverse mapping data claims is owned by any
2998 other owner, to avoid re-adding crosslinked blocks to the AGFL.
3002 5. The next operation to fix the freelist will right-size the list.
3007 --------------------
3010 ("dinodes") and an in-memory ("cached") representation.
3013 badly damaged that the filesystem cannot load the in-memory representation.
3015 specialized resource acquisition functions that return either the in-memory
3020 is necessary to get the in-core structure loaded.
3025 Once the in-memory representation is loaded, repair can lock the inode and can
3027 Most inode attributes are easy to check and constrain, or are user-controlled
3029 Dealing with the data and attr fork extent counts and the file block counts is
3036 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
3040 --------------------
3043 an in-memory representation, and hence are subject to the same cache coherency
3048 whatever is necessary to get the in-core structure loaded.
3049 Once the in-memory representation is loaded, the only attributes needing
3057 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
3063 --------------------------------
3073 Writer threads reserve the worst-case quantities of resources from the
3086 The high-performance nature of the summary counters makes it difficult for
3095 For repairs, the in-memory counters must be stabilized while walking the
3109 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3113 - It does not quiesce the log.
3118 +--------------------------------------------------------------------------+
3120 +--------------------------------------------------------------------------+
3127 | - Other programs can unfreeze the filesystem without our knowledge. |
3130 | - Adding an extra lock to prevent others from thawing the filesystem |
3131 | required the addition of a ``->freeze_super`` function to wrap |
3143 | - The log need not be quiesced to check the summary counters, but a VFS |
3147 | - Quiescing the log means that XFS flushes the (possibly incorrect) |
3150 | - A bug in the VFS meant that freeze could complete even when |
3153 +--------------------------------------------------------------------------+
3157 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3161 ---------------------
3174 - How does scrub manage the scan while it is collecting data?
3176 - How does the scan keep abreast of changes being made to the system by other
3186 (*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3187 its data block mapping.
3191 Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3193 <https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3194 Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3195 1913-4.
3198 the space in the data section of the filesystem.
3199 They form a continuous keyspace that can be expressed as a 64-bit integer,
3210 concurrent filesystem update needs to be incorporated into the scan data.
3213 Advancing the scan cursor is a multi-step process encapsulated in
3221 2. Use the per-AG inode btree to look up the next inumber after the one that
3289 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3293 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3312 - The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3315 - Speculative preallocations need to be unreserved.
3317 - An unlinked file may have lost its last reference, in which case the entire
3322 Inactivation has two parts -- the VFS part, which initiates writeback on all
3323 dirty file pages, and the XFS part, which cleans up XFS-specific information
3344 7. Space on the data and realtime devices for the transaction.
3357 Resources are often released in the reverse order, though this is not required.
3360 then decide to cross-reference the object with an object that is acquired
3376 save time if another process re-opens the file before the system runs out
3378 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3393 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3395 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`…
3403 in a well-known order: parent → child when updating the directory tree, and
3421 Solving both of these problems is straightforward -- any time online fsck
3455 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
3484 overhead for single-threaded applications.
3490 - A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3491 a well-known incore filesystem object.
3493 - Each hook must define an action code and a structure containing more context
3496 - Hook providers should provide appropriate wrapper functions and structs
3500 - A callsite in the regular filesystem code must be chosen to call
3501 ``xfs_hooks_call`` with the action code and data structure.
3510 - The online fsck function should define a structure to hold scan data, a lock
3511 to coordinate access to the scan data, and a ``struct xfs_hook`` object.
3515 - The online fsck code must contain a C function to catch the hook action code
3516 and data structure.
3518 hook information must be applied to the scan data.
3520 - Prior to unlocking inodes to start the scan, online fsck must call
3524 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
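A hypothetical sketch of this life cycle; the scan structure and handler
below stand in for the real scrub types, and the registration call paired
with ``xfs_hooks_del`` is assumed to attach ``scan->hook`` to the embedded
``struct xfs_hooks`` object:

.. code-block:: c

    /* Hypothetical scan state, guarded by a mutex as described above. */
    struct xchk_example_scan {
        struct mutex        lock;
        struct xfs_hook     hook;
        /* ...shadow records collected so far... */
    };

    /* Catch the action code and context structure from the hook. */
    static int
    xchk_example_hook_fn(
        struct notifier_block   *nb,
        unsigned long           action,
        void                    *data)
    {
        struct xchk_example_scan *scan;

        /* Assumes struct xfs_hook wraps a notifier_block named nb. */
        scan = container_of(nb, struct xchk_example_scan, hook.nb);

        mutex_lock(&scan->lock);
        /* apply @action and @data to the scan data if relevant */
        mutex_unlock(&scan->lock);
        return NOTIFY_DONE;
    }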
3551 scan data mutex ←──┐ same │
3553 update scan data │ lock │
3555 scan data mutex ←──┘ │
3568 - Prior to invoking the notifier call chain, the filesystem function being
3572 - The scanning function and the scrub hook function must coordinate access to
3573 the scan data by acquiring a lock on the scan data.
3575 - Scrub hook functions must not add the live update information to the scan
3580 - Scrub hook functions must not change the caller's state, including the
3585 - The hook function can abort the inode scan to avoid breaking the other rules.
3589 - ``xchk_iscan_start`` starts a scan
3591 - ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
3594 - ``xchk_iscan_want_live_update`` to decide if an inode has already been
3597 in-memory scan information.
3599 - ``xchk_iscan_mark_visited`` to mark an inode as having been visited in the
3599 - ``xchk_iscan_teardown`` to finish the scan (a usage sketch follows below)
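A sketch of how a scrub function might drive this interface; the start
arguments and the locking around each inode are abbreviated assumptions:

.. code-block:: c

    struct xchk_iscan   iscan;
    struct xfs_inode    *ip;
    int                 error;

    xchk_iscan_start(sc, 0, 0, &iscan);

    while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
        /* lock @ip and record observations in the scan data */
        xchk_iscan_mark_visited(&iscan, ip);
        xfs_irele(ip);
    }

    xchk_iscan_teardown(&iscan);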
3606 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3678 b. Determine that inode's resource usage (data blocks, inode counts,
3698 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3708 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3711 data as follows:
3731 accounted as a backref counter in the shadow data for A, since child dotdot
3739 Non-directories never have children of any kind.
3745 both the inode and the shadow data, and comparing the link counts.
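A hypothetical shape for one of these shadow records; the actual series may
differ in detail:

.. code-block:: c

    /*
     * Shadow link count data for one inode: links from parent
     * directories, dotdot backrefs from child directories, and
     * links pointing down to children.
     */
    struct xchk_nlink {
        xfs_nlink_t     parents;
        xfs_nlink_t     backrefs;
        xfs_nlink_t     children;
        unsigned int    flags;
    };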
3756 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
3761 Case Study: Rebuilding Reverse Mapping Records
3766 and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3768 repair code -- code and data are entirely contained within the scrub module,
3771 A secondary advantage of this repair approach is atomicity -- once the kernel
3778 Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
3783 <liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3784 scan for reverse mapping records.
3789 scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
3794 4. Hook into rmap updates for the AG being repaired so that the live scan data
3802 a. Create a btree cursor for the in-memory btree.
3804 b. Use the rmap code to add the record to the in-memory btree.
3811 If so, apply the live update into the scan data:
3813 a. Create a btree cursor for the in-memory btree.
3815 b. Replay the operation into the in-memory btree.
3840 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3844 --------------------------------------------
3849 File forks map 64-bit logical file fork space extents to physical storage space
3850 extents, similar to how a memory management unit maps 64-bit virtual addresses
3852 Therefore, file-based tree structures (such as directories and extended
3854 to other blocks mapped within that same address space, and file-based linear
3860 Therefore, online repair of file-based metadata creates a temporary file in
3878 There is a downside to the reaping process -- if the system crashes during the
3880 fail because freeing space will find the extra reverse mappings and abort.
3890 +--------------------------------------------------------------------------+
3892 +--------------------------------------------------------------------------+
3894 | blocks would be scanned for salvageable data; the extents in the file |
3901 | offset in the fork from the salvage data, reaping the old extents, and |
3907 | - Array structures are linearly addressed, and the regular filesystem |
3911 | - Extended attributes are allowed to use the entire attr fork offset |
3914 | - Even if repair could build an alternate copy of a data structure in a |
3920 | - A crash after construction of the secondary tree but before the range |
3924 | - Reaping blocks after a repair is not a simple operation, and |
3928 | - Directory entry blocks and quota records record the file fork offset |
3935 | - Each block in a directory or extended attributes btree index contains |
3944 +--------------------------------------------------------------------------+
3951 This allocates an inode, marks the in-core inode private, and attaches it to
3958 userspace for data fork blocks.
3959 The usage patterns of these two locks are the same as for any other XFS file --
3960 access to file data are controlled via the IOLOCK, and access to file metadata
3968 Data can be written to a temporary file by two means:
3976 Once a good copy of a data file has been constructed in a temporary file, it
3982 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
3986 ----------------------
3988 Once repair builds a temporary file with a new data structure written into
3996 a. When the reverse-mapping btree is enabled, the swap code must keep the
3997 reverse mapping information up to date with every exchange of mappings.
4001 b. Reverse-mapping is critical for the operation of online fsck, so the old
4007 For this use case, an incomplete exchange will not result in a user-visible
4012 For directory and xattr repairs, the user-visible contents might be the
4015 e. Old blocks in the file may be cross-linked with another structure and must
4016 not reappear if the system goes down mid-repair.
4022 the reverse-mapping extent swap code.
4026 The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
4032 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
4035 +--------------------------------------------------------------------------+
4036 | **Sidebar: Using Log-Incompatible Feature Flags** |
4037 +--------------------------------------------------------------------------+
4075 | Log-assisted extended attribute updates and atomic extent swaps both use |
4078 +--------------------------------------------------------------------------+
4089 such as exchanging file sizes, inode flags, or conversion of fork data to local
4093 .. code-block:: c
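
    /*
     * Hedged reconstruction of the swap intent payload: the two
     * inodes, the fork offset ranges to exchange, the block count,
     * and operation flags (attr fork selection, post-swap fixups).
     * Consult the atomic-file-updates series for the real layout.
     */
    struct xfs_swap_extent {
        uint64_t    sx_inode1;
        uint64_t    sx_inode2;
        uint64_t    sx_startoff1;
        uint64_t    sx_startoff2;
        uint64_t    sx_blockcount;
        uint64_t    sx_flags;
    };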
4121 instead of the data fork and other work to be done after the extent swap.
4123 if the file data fork is the target of the swap operation.
4166 This quantity is ``(map1.br_startoff + map1.br_blockcount -
4179 The operation manager completes the deferred work in steps 3b-3e before
4182 4. Perform any post-processing.
4204 - Data device blocks needed to handle the repeated updates to the fork
4206 - Change in data and realtime block counts for both files.
4207 - Increase in quota usage for both files, if the two files do not share the
4209 - The number of extent mappings that will be added to each file.
4210 - Whether or not there are partially written realtime extents.
4226 "local" and treat the fork as a literal area for data storage.
4229 - If both forks are in local format and the fork areas are large enough, the
4235 - If both forks map blocks, then the regular atomic extent swap is used.
4237 - Otherwise, only one fork is in local format.
4251 builds every block in the new data structure with the owner field of the file
4256 reaping <reaping>` mechanism that is done post-repair.
4260 However, this iunlink processing omits the cross-link detection of online
4270 2. Use the staging data to write out new contents into the temporary repair
4298 length, similar to what the free space by count (cntbt) btree does for the data
4302 partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
4305 and can satisfy a power-of-two allocation request.
4328 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
4334 In XFS, extended attributes are implemented as a namespaced name-value store.
4340 Attribute leaf blocks contain variable-sized records that associate
4341 user-provided names with the user-provided values.
4374 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
4378 ------------------
4388 The best that online repair can do at this time is to read directory data
4406 2. Walk the first partition of the data fork of the directory to find the directory
4407 entry data blocks.
4410 a. Walk the directory data block to find candidate entries.
4420 memory or there are no more directory data blocks to examine, unlock the
4452 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4462 way that the historic lack of reverse space mapping information once hindered
4480 +--------------------------------------------------------------------------+
4482 +--------------------------------------------------------------------------+
4494 | followed up with the corresponding change to the reverse links. |
4518 | Chandan increased the maximum extent counts of both data and attribute |
4521 +--------------------------------------------------------------------------+
4568 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repai…
4605 <https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
4641 5. Copy all non-parent pointer extended attributes to the temporary file.
4651 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-re…
4668 per-AG in-memory slab.
4672 a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
4678 Record the names in a per-file xfblob, and store ``(parent_inum,
4679 parent_gen, dirent_pos)`` tuples in a per-file slab.
4681 2. Sort the per-file tuples in order of parent_inum, and dirent_pos.
4684 per-AG tuple slab.
4685 This should be trivial since the per-AG tuples are in child inumber
4688 4. Position a second slab cursor at the start of the per-file tuple slab.
4693 a. Tuples in the per-AG list but not the per-file list are missing and
4696 b. Tuples in the per-file list but not the per-AG list are dangling
4706 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
4710 challenging because it currently uses a single-pass scan of the filesystem
4712 This scan would have to be converted into a multi-pass scan:
4729 the dirents and add them to the now-empty directories.
4736 -------------
4741 downwards either to more subdirectories or to non-directory files.
4791 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
4794 6. Userspace Algorithms and Data Structures
4797 This section discusses the key algorithms and data structures of the userspace
4799 repairs in the kernel, verify file data, and look for other potential problems.
4804 -----------------
4807 That structure follows naturally from the data dependencies designed into the
4815 b. Quota resource counts depend on consistency within the quota file data
4823 d. Directories, extended attributes, and file data depend on consistency within
4824 the file forks that map directory and extended attribute data to physical
4833 g. Realtime space metadata depend on the inode records and data forks of the
4837 and reverse mapping btrees) depend on consistency within the AG headers and
4846 - Phase 1 checks that the provided path maps to an XFS filesystem and detects
4849 - Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
4851 - Phase 3 scans inodes in parallel.
4854 - Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
4857 - Phase 5 starts by checking groups (b) and (c) in parallel before moving on
4860 - Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
4863 - Phase 7 checks group (a), having validated everything else.
4865 Notice that the data dependencies between groups are enforced by the structure
4869 --------------------
4872 Given that XFS targets installations with large high-performance storage,
4909 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-t…
4912 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalan…
4918 ------------------
4938 filesystem object, it became much more memory-efficient to track all eligible
4940 Each repair item represents a single lockable object -- AGs, metadata files,
4945 The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
4992 Corrupt file data blocks reported by phase 6 cannot be recovered by the
4997 …://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warn…
4999 `repair data dependency
5000 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-d…
5003 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracki…
5006 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-schedu…
5010 -----------------------------------------------
5020 - Slashes and null bytes are not allowed in directory entries.
5022 - Null bytes are not allowed in userspace-visible extended attributes.
5024 - Null bytes are not allowed in the filesystem label.
5029 presented together -- all the names in a directory, or all the attributes of a
5033 modern-day Linux systems is that programs work with Unicode character code
5035 These programs typically encode those code points in UTF-8 when interfacing
5036 with the C library because the kernel expects null-terminated names.
5038 UTF-8 encoded Unicode data.
5046 The standard also permits characters to be constructed in multiple ways --
5056 For example, the character "Right-to-Left Override" U+202E can trick some
5073 When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
5076 `libicu <https://github.com/unicode-org/icu>`_
5079 Names are also checked for control characters, non-rendering characters, and
5084 Media Verification of File Data Extents
5085 ---------------------------------------
5087 The system administrator can elect to initiate a media scan of all file data
5092 to find areas that are allocated to file data fork extents.
5093 Gaps between data fork extents that are smaller than 64k are treated as if
5094 they were data fork extents to reduce the command setup overhead.
5099 If the verification read fails, ``xfs_scrub`` retries with single-block reads
5105 construct file paths from inode numbers for user-friendly reporting.
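The read strategy can be sketched as a small userspace helper;
``report_media_error`` and the direct-IO setup are simplifications for
illustration only:

.. code-block:: c

    #include <stdio.h>
    #include <unistd.h>

    /* Simplified stand-in; the real tool maps offsets to file paths. */
    static void report_media_error(off_t pos, size_t len)
    {
        fprintf(stderr, "media error at %lld (%zu bytes)\n",
                (long long)pos, len);
    }

    /*
     * Try one large read; if it fails, walk the region again in
     * single-block reads to pinpoint the bad sectors.
     */
    static void
    media_verify(int fd, off_t start, size_t len, size_t blksz, char *buf)
    {
        off_t   pos;

        if (pread(fd, buf, len, start) == (ssize_t)len)
            return;

        for (pos = start; pos < start + (off_t)len; pos += blksz)
            if (pread(fd, buf, blksz, pos) != (ssize_t)blksz)
                report_media_error(pos, blksz);
    }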
5120 ----------------
5135 contents of data forks could be exchanged between two files by exchanging the
5137 When XFS v5 came along with self-describing metadata, this old mechanism grew
5140 When the reverse mapping btree was later added to XFS, the only way to maintain
5141 the consistency of the fork mappings with the reverse mapping index was to
5144 This mechanism is identical to steps 2-3 from the procedure above except for
5156 - **Atomic commit of file writes**: A userspace process opens a file that it
5167 - **Transactional file updates**: The same mechanism as above, but the caller
5171 change timestamps of the original file before reflinking its data to the
5178 - **Emulation of atomic block device writes**: Export a block device with a
5188 ----------------
5199 simple representation of the data dependencies between the selected scrub
5210 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
5213 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
5217 ------------------------------------
5232 ------------------------
5240 reverse mapping index from userspace.
5249 uses the new fallocate map-freespace call to map any free space in that region
5252 ``GETFSMAP`` and issues forced repair requests on the data structure.
5258 To clear all the file data out of a portion of the physical storage, clearspace
5259 uses the FSMAP information to find relevant file data blocks.
5263 contents; any changes will be written somewhere else via copy-on-write.
5266 <swapext_if_unchanged>` feature) to change the target file's data extent
5279 most shared data extents in the filesystem, and target them first.
5302 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
5305 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
5309 ---------------------
5312 the data and metadata at the end of the filesystem, and handing the freed space