Written by: Neil Brown <neilb@suse.de>

Overlay Filesystem
==================

This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result of overlaying one filesystem on top
of the other.

The result will inevitably fail to look exactly like a normal
filesystem for various technical reasons. The expectation is that
many use cases will be able to ignore these differences.

This approach is 'hybrid' because the objects that appear in the
filesystem do not all appear to belong to that filesystem. In many
cases accessing an object in the union will be indistinguishable
from accessing the corresponding object from the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).

While directories will report an st_dev from the overlay-filesystem,
all non-directory objects will report an st_dev from the lower or
upper filesystem that is providing the object. Similarly st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.

Upper and Lower
---------------

An overlay filesystem combines two filesystems - an 'upper' filesystem
and a 'lower' filesystem. When a name exists in both filesystems, the
object in the 'upper' filesystem is visible while the object in the
'lower' filesystem is either hidden or, in the case of directories,
merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory
tree' rather than 'filesystem' as it is quite possible for both
directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.
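
The st_dev and st_ino behaviour described in the introduction can be
observed from userspace with stat(2). The following is a minimal Python
sketch and not part of overlayfs itself; the helper names dev_and_inode
and same_device are illustrative assumptions, and the paths to inspect
are supplied by the caller.

```python
import os
import sys


def dev_and_inode(path):
    """Return ((major, minor), st_ino) for path, without following symlinks."""
    st = os.lstat(path)
    return (os.major(st.st_dev), os.minor(st.st_dev)), st.st_ino


def same_device(path_a, path_b):
    """Return True if both paths report the same st_dev."""
    return os.lstat(path_a).st_dev == os.lstat(path_b).st_dev


if __name__ == "__main__":
    # Print the device and inode numbers of every path given on the
    # command line, e.g. a directory and a regular file in an overlay.
    for p in sys.argv[1:]:
        (major, minor), ino = dev_and_inode(p)
        print(f"{p}: st_dev={major}:{minor} st_ino={ino}")
```

Given the description above, running this on a directory and on a
regular file inside the same overlay mount would be expected to show
two different st_dev values, although nothing in the sketch depends on
that.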

The lower filesystem can be any filesystem supported by Linux and does
not need to be writable. The lower filesystem can even be another
overlayfs. The upper filesystem will normally be writable and if it
is it must support the creation of trusted.* extended attributes, and
must provide valid d_type in readdir responses, so NFS is not suitable.

A read-only overlay of two read-only filesystems may use any
filesystem type.

Directories
-----------

Overlaying mainly involves directories. If a given name appears in both
upper and lower filesystems and refers to a non-directory in either,
then the lower object is hidden - the name refers only to the upper
object.

Where both upper and lower objects are directories, a merged directory
is formed.

At mount time, the two directories given as mount options "lowerdir" and
"upperdir" are combined into a merged directory:

  mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged

The "workdir" needs to be an empty directory on the same filesystem
as upperdir.

Then whenever a lookup is requested in such a merged directory, the
lookup is performed in each actual directory and the combined result
is cached in the dentry belonging to the overlay filesystem. If both
actual lookups find directories, both are stored and a merged
directory is created, otherwise only one is stored: the upper if it
exists, else the lower.

Only the lists of names from directories are merged. Other content
such as metadata and extended attributes are reported for the upper
directory only. These attributes of the lower directory are hidden.

Whiteouts and opaque directories
--------------------------------

In order to support rm and rmdir without changing the lower
filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed.
This is done using whiteouts and opaque
directories (non-directories are always opaque).

A whiteout is created as a character device with 0/0 device number.
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.

A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.

readdir
-------

When a 'readdir' request is made on a merged directory, the upper and
lower directories are each read and the name lists merged in the
obvious way (upper is read first, then lower - entries that already
exist are not re-added). This merged name list is cached in the
'struct file' and so remains as long as the file is kept open. If the
directory is opened and read by two processes at the same time, they
will each have separate caches. A seekdir to the start of the
directory (offset 0) followed by a readdir will cause the cache to be
discarded and rebuilt.

This means that changes to the merged directory do not appear while a
directory is being read. This is unlikely to be noticed by many
programs.

Seek offsets are assigned sequentially when the directories are read.
Thus if a program were to

  - read part of a directory
  - remember an offset, and close the directory
  - re-open the directory some time later
  - seek to the remembered offset

there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
directory.

Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).

Non-directories
---------------

Objects that are not directories (files, symlinks, device-special
files etc.)
are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
that requires write access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
not.

The copy_up may turn out to be unnecessary, for example if the file is
opened for read-write but the data is not modified.

The copy_up process first makes sure that the containing directory
exists in the upper filesystem - creating it and any parents as
necessary. It then creates the object with the same metadata (owner,
mode, mtime, symlink-target etc.) and then if the object is a file, the
data is copied from the lower to the upper filesystem. Finally any
extended attributes are copied up.

Once the copy_up is complete, the overlay filesystem simply
provides direct access to the newly created file in the upper
filesystem - future operations on the file are barely noticed by the
overlay filesystem (though an operation on the name of the file such as
rename or unlink will of course be noticed and handled).

Non-standard behavior
---------------------

The copy_up operation essentially creates a new, identical file and
moves it over to the old name. The new file may be on a different
filesystem, so both st_dev and st_ino of the file may change.

Any open files referring to this inode will access the old data and
metadata. Similarly any file locks obtained before copy_up will not
apply to the copied up file.

On a file opened with O_RDONLY, fchmod(2), fchown(2), futimesat(2) and
fsetxattr(2) will fail with EROFS.

If a file with multiple hard links is copied up, then this will
"break" the link.
Changes will not be propagated to other names
referring to the same inode.

Symlinks in /proc/PID/ and /proc/PID/fd which point to a non-directory
object in overlayfs will not contain valid absolute paths, only
relative paths leading up to the filesystem's root. This will be
fixed in the future.

Some operations are not atomic, for example a crash during copy_up or
rename will leave the filesystem in an inconsistent state. This will
be addressed in the future.

Changes to underlying filesystems
---------------------------------

Offline changes, when the overlay is not mounted, are allowed to either
the upper or the lower trees.

Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed. If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.
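
As an illustration of the name-merging rules given in the sections on
whiteouts and readdir, the following Python sketch models a readdir on
a merged directory using plain dictionaries. It is a toy model under
stated assumptions, not the kernel implementation: the function name
merged_names, the WHITEOUT marker and the upper_opaque flag are all
hypothetical, and real whiteouts are 0/0 character devices rather than
Python objects.

```python
# Hypothetical marker standing in for a whiteout entry in the upper tree.
WHITEOUT = object()


def merged_names(upper, lower, upper_opaque=False):
    """Model the name list returned by readdir on a merged directory.

    upper and lower map entry names to arbitrary values; an upper value
    of WHITEOUT models a whiteout. upper_opaque models an upper
    directory that has been marked opaque.
    """
    names = []
    seen = set()

    # The upper directory is read first.
    for name, entry in upper.items():
        seen.add(name)              # a whiteout still masks the lower name
        if entry is not WHITEOUT:
            names.append(name)      # the whiteout itself stays hidden

    # The lower directory is read next, unless the upper one is opaque.
    if not upper_opaque:
        for name in lower:
            if name not in seen:    # existing entries are not re-added
                names.append(name)
                seen.add(name)

    return names


if __name__ == "__main__":
    upper = {"a": "file", "b": WHITEOUT}
    lower = {"b": "file", "c": "file", "a": "file"}
    print(merged_names(upper, lower))
    print(merged_names(upper, lower, upper_opaque=True))
```

In this model "b" is removed by the whiteout, "a" is taken from the
upper tree only, and "c" shows through from the lower tree unless the
upper directory is opaque.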