1XFS Self Describing Metadata 2---------------------------- 3 4Introduction 5------------ 6 7The largest scalability problem facing XFS is not one of algorithmic 8scalability, but of verification of the filesystem structure. Scalabilty of the 9structures and indexes on disk and the algorithms for iterating them are 10adequate for supporting PB scale filesystems with billions of inodes, however it 11is this very scalability that causes the verification problem. 12 13Almost all metadata on XFS is dynamically allocated. The only fixed location 14metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all 15other metadata structures need to be discovered by walking the filesystem 16structure in different ways. While this is already done by userspace tools for 17validating and repairing the structure, there are limits to what they can 18verify, and this in turn limits the supportable size of an XFS filesystem. 19 20For example, it is entirely possible to manually use xfs_db and a bit of 21scripting to analyse the structure of a 100TB filesystem when trying to 22determine the root cause of a corruption problem, but it is still mainly a 23manual task of verifying that things like single bit errors or misplaced writes 24weren't the ultimate cause of a corruption event. It may take a few hours to a 25few days to perform such forensic analysis, so for at this scale root cause 26analysis is entirely possible. 27 28However, if we scale the filesystem up to 1PB, we now have 10x as much metadata 29to analyse and so that analysis blows out towards weeks/months of forensic work. 30Most of the analysis work is slow and tedious, so as the amount of analysis goes 31up, the more likely that the cause will be lost in the noise. Hence the primary 32concern for supporting PB scale filesystems is minimising the time and effort 33required for basic forensic analysis of the filesystem structure. 34 35 36Self Describing Metadata 37------------------------ 38 39One of the problems with the current metadata format is that apart from the 40magic number in the metadata block, we have no other way of identifying what it 41is supposed to be. We can't even identify if it is the right place. Put simply, 42you can't look at a single metadata block in isolation and say "yes, it is 43supposed to be there and the contents are valid". 44 45Hence most of the time spent on forensic analysis is spent doing basic 46verification of metadata values, looking for values that are in range (and hence 47not detected by automated verification checks) but are not correct. Finding and 48understanding how things like cross linked block lists (e.g. sibling 49pointers in a btree end up with loops in them) are the key to understanding what 50went wrong, but it is impossible to tell what order the blocks were linked into 51each other or written to disk after the fact. 52 53Hence we need to record more information into the metadata to allow us to 54quickly determine if the metadata is intact and can be ignored for the purpose 55of analysis. We can't protect against every possible type of error, but we can 56ensure that common types of errors are easily detectable. Hence the concept of 57self describing metadata. 58 59The first, fundamental requirement of self describing metadata is that the 60metadata object contains some form of unique identifier in a well known 61location. This allows us to identify the expected contents of the block and 62hence parse and verify the metadata object. IF we can't independently identify 63the type of metadata in the object, then the metadata doesn't describe itself 64very well at all! 65 66Luckily, almost all XFS metadata has magic numbers embedded already - only the 67AGFL, remote symlinks and remote attribute blocks do not contain identifying 68magic numbers. Hence we can change the on-disk format of all these objects to 69add more identifying information and detect this simply by changing the magic 70numbers in the metadata objects. That is, if it has the current magic number, 71the metadata isn't self identifying. If it contains a new magic number, it is 72self identifying and we can do much more expansive automated verification of the 73metadata object at runtime, during forensic analysis or repair. 74 75As a primary concern, self describing metadata needs some form of overall 76integrity checking. We cannot trust the metadata if we cannot verify that it has 77not been changed as a result of external influences. Hence we need some form of 78integrity check, and this is done by adding CRC32c validation to the metadata 79block. If we can verify the block contains the metadata it was intended to 80contain, a large amount of the manual verification work can be skipped. 81 82CRC32c was selected as metadata cannot be more than 64k in length in XFS and 83hence a 32 bit CRC is more than sufficient to detect multi-bit errors in 84metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is 85fast. So while CRC32c is not the strongest of possible integrity checks that 86could be used, it is more than sufficient for our needs and has relatively 87little overhead. Adding support for larger integrity fields and/or algorithms 88does really provide any extra value over CRC32c, but it does add a lot of 89complexity and so there is no provision for changing the integrity checking 90mechanism. 91 92Self describing metadata needs to contain enough information so that the 93metadata block can be verified as being in the correct place without needing to 94look at any other metadata. This means it needs to contain location information. 95Just adding a block number to the metadata is not sufficient to protect against 96mis-directed writes - a write might be misdirected to the wrong LUN and so be 97written to the "correct block" of the wrong filesystem. Hence location 98information must contain a filesystem identifier as well as a block number. 99 100Another key information point in forensic analysis is knowing who the metadata 101block belongs to. We already know the type, the location, that it is valid 102and/or corrupted, and how long ago that it was last modified. Knowing the owner 103of the block is important as it allows us to find other related metadata to 104determine the scope of the corruption. For example, if we have a extent btree 105object, we don't know what inode it belongs to and hence have to walk the entire 106filesystem to find the owner of the block. Worse, the corruption could mean that 107no owner can be found (i.e. it's an orphan block), and so without an owner field 108in the metadata we have no idea of the scope of the corruption. If we have an 109owner field in the metadata object, we can immediately do top down validation to 110determine the scope of the problem. 111 112Different types of metadata have different owner identifiers. For example, 113directory, attribute and extent tree blocks are all owned by an inode, whilst 114freespace btree blocks are owned by an allocation group. Hence the size and 115contents of the owner field are determined by the type of metadata object we are 116looking at. The owner information can also identify misplaced writes (e.g. 117freespace btree block written to the wrong AG). 118 119Self describing metadata also needs to contain some indication of when it was 120written to the filesystem. One of the key information points when doing forensic 121analysis is how recently the block was modified. Correlation of set of corrupted 122metadata blocks based on modification times is important as it can indicate 123whether the corruptions are related, whether there's been multiple corruption 124events that lead to the eventual failure, and even whether there are corruptions 125present that the run-time verification is not detecting. 126 127For example, we can determine whether a metadata object is supposed to be free 128space or still allocated if it is still referenced by its owner by looking at 129when the free space btree block that contains the block was last written 130compared to when the metadata object itself was last written. If the free space 131block is more recent than the object and the object's owner, then there is a 132very good chance that the block should have been removed from the owner. 133 134To provide this "written timestamp", each metadata block gets the Log Sequence 135Number (LSN) of the most recent transaction it was modified on written into it. 136This number will always increase over the life of the filesystem, and the only 137thing that resets it is running xfs_repair on the filesystem. Further, by use of 138the LSN we can tell if the corrupted metadata all belonged to the same log 139checkpoint and hence have some idea of how much modification occurred between 140the first and last instance of corrupt metadata on disk and, further, how much 141modification occurred between the corruption being written and when it was 142detected. 143 144Runtime Validation 145------------------ 146 147Validation of self-describing metadata takes place at runtime in two places: 148 149 - immediately after a successful read from disk 150 - immediately prior to write IO submission 151 152The verification is completely stateless - it is done independently of the 153modification process, and seeks only to check that the metadata is what it says 154it is and that the metadata fields are within bounds and internally consistent. 155As such, we cannot catch all types of corruption that can occur within a block 156as there may be certain limitations that operational state enforces of the 157metadata, or there may be corruption of interblock relationships (e.g. corrupted 158sibling pointer lists). Hence we still need stateful checking in the main code 159body, but in general most of the per-field validation is handled by the 160verifiers. 161 162For read verification, the caller needs to specify the expected type of metadata 163that it should see, and the IO completion process verifies that the metadata 164object matches what was expected. If the verification process fails, then it 165marks the object being read as EFSCORRUPTED. The caller needs to catch this 166error (same as for IO errors), and if it needs to take special action due to a 167verification error it can do so by catching the EFSCORRUPTED error value. If we 168need more discrimination of error type at higher levels, we can define new 169error numbers for different errors as necessary. 170 171The first step in read verification is checking the magic number and determining 172whether CRC validating is necessary. If it is, the CRC32c is calculated and 173compared against the value stored in the object itself. Once this is validated, 174further checks are made against the location information, followed by extensive 175object specific metadata validation. If any of these checks fail, then the 176buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. 177 178Write verification is the opposite of the read verification - first the object 179is extensively verified and if it is OK we then update the LSN from the last 180modification made to the object, After this, we calculate the CRC and insert it 181into the object. Once this is done the write IO is allowed to continue. If any 182error occurs during this process, the buffer is again marked with a EFSCORRUPTED 183error for the higher layers to catch. 184 185Structures 186---------- 187 188A typical on-disk structure needs to contain the following information: 189 190struct xfs_ondisk_hdr { 191 __be32 magic; /* magic number */ 192 __be32 crc; /* CRC, not logged */ 193 uuid_t uuid; /* filesystem identifier */ 194 __be64 owner; /* parent object */ 195 __be64 blkno; /* location on disk */ 196 __be64 lsn; /* last modification in log, not logged */ 197}; 198 199Depending on the metadata, this information may be part of a header structure 200separate to the metadata contents, or may be distributed through an existing 201structure. The latter occurs with metadata that already contains some of this 202information, such as the superblock and AG headers. 203 204Other metadata may have different formats for the information, but the same 205level of information is generally provided. For example: 206 207 - short btree blocks have a 32 bit owner (ag number) and a 32 bit block 208 number for location. The two of these combined provide the same 209 information as @owner and @blkno in eh above structure, but using 8 210 bytes less space on disk. 211 212 - directory/attribute node blocks have a 16 bit magic number, and the 213 header that contains the magic number has other information in it as 214 well. hence the additional metadata headers change the overall format 215 of the metadata. 216 217A typical buffer read verifier is structured as follows: 218 219#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) 220 221static void 222xfs_foo_read_verify( 223 struct xfs_buf *bp) 224{ 225 struct xfs_mount *mp = bp->b_target->bt_mount; 226 227 if ((xfs_sb_version_hascrc(&mp->m_sb) && 228 !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), 229 XFS_FOO_CRC_OFF)) || 230 !xfs_foo_verify(bp)) { 231 XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); 232 xfs_buf_ioerror(bp, EFSCORRUPTED); 233 } 234} 235 236The code ensures that the CRC is only checked if the filesystem has CRCs enabled 237by checking the superblock of the feature bit, and then if the CRC verifies OK 238(or is not needed) it verifies the actual contents of the block. 239 240The verifier function will take a couple of different forms, depending on 241whether the magic number can be used to determine the format of the block. In 242the case it can't, the code is structured as follows: 243 244static bool 245xfs_foo_verify( 246 struct xfs_buf *bp) 247{ 248 struct xfs_mount *mp = bp->b_target->bt_mount; 249 struct xfs_ondisk_hdr *hdr = bp->b_addr; 250 251 if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) 252 return false; 253 254 if (!xfs_sb_version_hascrc(&mp->m_sb)) { 255 if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) 256 return false; 257 if (bp->b_bn != be64_to_cpu(hdr->blkno)) 258 return false; 259 if (hdr->owner == 0) 260 return false; 261 } 262 263 /* object specific verification checks here */ 264 265 return true; 266} 267 268If there are different magic numbers for the different formats, the verifier 269will look like: 270 271static bool 272xfs_foo_verify( 273 struct xfs_buf *bp) 274{ 275 struct xfs_mount *mp = bp->b_target->bt_mount; 276 struct xfs_ondisk_hdr *hdr = bp->b_addr; 277 278 if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { 279 if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) 280 return false; 281 if (bp->b_bn != be64_to_cpu(hdr->blkno)) 282 return false; 283 if (hdr->owner == 0) 284 return false; 285 } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) 286 return false; 287 288 /* object specific verification checks here */ 289 290 return true; 291} 292 293Write verifiers are very similar to the read verifiers, they just do things in 294the opposite order to the read verifiers. A typical write verifier: 295 296static void 297xfs_foo_write_verify( 298 struct xfs_buf *bp) 299{ 300 struct xfs_mount *mp = bp->b_target->bt_mount; 301 struct xfs_buf_log_item *bip = bp->b_fspriv; 302 303 if (!xfs_foo_verify(bp)) { 304 XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); 305 xfs_buf_ioerror(bp, EFSCORRUPTED); 306 return; 307 } 308 309 if (!xfs_sb_version_hascrc(&mp->m_sb)) 310 return; 311 312 313 if (bip) { 314 struct xfs_ondisk_hdr *hdr = bp->b_addr; 315 hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); 316 } 317 xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); 318} 319 320This will verify the internal structure of the metadata before we go any 321further, detecting corruptions that have occurred as the metadata has been 322modified in memory. If the metadata verifies OK, and CRCs are enabled, we then 323update the LSN field (when it was last modified) and calculate the CRC on the 324metadata. Once this is done, we can issue the IO. 325 326Inodes and Dquots 327----------------- 328 329Inodes and dquots are special snowflakes. They have per-object CRC and 330self-identifiers, but they are packed so that there are multiple objects per 331buffer. Hence we do not use per-buffer verifiers to do the work of per-object 332verification and CRC calculations. The per-buffer verifiers simply perform basic 333identification of the buffer - that they contain inodes or dquots, and that 334there are magic numbers in all the expected spots. All further CRC and 335verification checks are done when each inode is read from or written back to the 336buffer. 337 338The structure of the verifiers and the identifiers checks is very similar to the 339buffer code described above. The only difference is where they are called. For 340example, inode read verification is done in xfs_iread() when the inode is first 341read out of the buffer and the struct xfs_inode is instantiated. The inode is 342already extensively verified during writeback in xfs_iflush_int, so the only 343addition here is to add the LSN and CRC to the inode as it is copied back into 344the buffer. 345 346XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of 347the unlinked list modifications check or update CRCs, neither during unlink nor 348log recovery. So, it's gone unnoticed until now. This won't matter immediately - 349repair will probably complain about it - but it needs to be fixed. 350 351