Definitions
~~~~~~~~~~~

Userspace filesystem:

  A filesystem in which data and metadata are provided by an ordinary
  userspace process. The filesystem can be accessed normally through
  the kernel interface.

Filesystem daemon:

  The process(es) providing the data and metadata of the filesystem.

Non-privileged mount (or user mount):

  A userspace filesystem mounted by a non-privileged (non-root) user.
  The filesystem daemon is running with the privileges of the mounting
  user. NOTE: this is not the same as mounts allowed with the "user"
  option in /etc/fstab, which is not discussed here.

Filesystem connection:

  A connection between the filesystem daemon and the kernel. The
  connection exists until either the daemon dies, or the filesystem is
  umounted. Note that detaching (or lazy umounting) the filesystem
  does _not_ break the connection; in this case it will exist until
  the last reference to the filesystem is released.

Mount owner:

  The user who does the mounting.

User:

  The user who is performing filesystem operations.

What is FUSE?
~~~~~~~~~~~~~

FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount3).

One of the most important features of FUSE is allowing secure,
non-privileged mounts. This opens up new possibilities for the use of
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.

The userspace library and utilities are available from the FUSE
homepage:

  https://github.com/libfuse/libfuse/

Filesystem type
~~~~~~~~~~~~~~~

The filesystem type given to mount(2) can be one of the following:

'fuse'

  This is the usual way to mount a FUSE filesystem. The first
  argument of the mount system call may contain an arbitrary string,
  which is not interpreted by the kernel.

'fuseblk'

  The filesystem is block device based. The first argument of the
  mount system call is interpreted as the name of the device.

Mount options
~~~~~~~~~~~~~

See mount.fuse3(8).

Control filesystem
~~~~~~~~~~~~~~~~~~

There's a control filesystem for FUSE, which can be mounted by:

  mount -t fusectl none /sys/fs/fuse/connections

Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with versions before 2.6.0.

Under the fuse control filesystem each connection has a directory
named by a unique number.

For each connection the following files exist within this directory:

 'waiting'

  The number of requests which are waiting to be transferred to
  userspace or being processed by the filesystem daemon. If there is
  no filesystem activity and 'waiting' is non-zero, then the
  filesystem is hung or deadlocked.

 'abort'

  Writing anything into this file will abort the filesystem
  connection. This means that all waiting requests will be aborted and
  an error returned for all aborted and new requests.

Only the owner of the mount may read or write these files.

Interrupting filesystem operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a process issuing a FUSE filesystem request is interrupted, the
following will happen:

  1) If the request is not yet sent to userspace AND the signal is
     fatal (SIGKILL or unhandled fatal signal), then the request is
     dequeued and returns immediately.

  2) If the request is not yet sent to userspace AND the signal is not
     fatal, then an 'interrupted' flag is set for the request. When
     the request has been successfully transferred to userspace and
     this flag is set, an INTERRUPT request is queued.

  3) If the request is already sent to userspace, then an INTERRUPT
     request is queued.
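
The three cases above can be restated as a small decision function.
This is only a sketch in Python, not kernel code: the names Action,
on_signal and on_transfer are hypothetical and exist purely to restate
the list.

```python
from enum import Enum, auto


class Action(Enum):
    """Hypothetical labels for the three cases in the list above."""
    DEQUEUE_AND_RETURN = auto()    # case 1
    SET_INTERRUPTED_FLAG = auto()  # case 2
    QUEUE_INTERRUPT = auto()       # case 3


def on_signal(sent_to_userspace: bool, fatal: bool) -> Action:
    """Pick the action for a request whose issuing process got a signal."""
    if sent_to_userspace:
        # Case 3: the daemon may already be working on the request.
        return Action.QUEUE_INTERRUPT
    if fatal:
        # Case 1: the daemon never saw the request, so drop it.
        return Action.DEQUEUE_AND_RETURN
    # Case 2: remember the interruption until the transfer happens.
    return Action.SET_INTERRUPTED_FLAG


def on_transfer(interrupted_flag: bool) -> bool:
    """Return True if an INTERRUPT must be queued after the transfer."""
    return interrupted_flag
```

Note that whether the signal is fatal only matters while the request
has not yet been sent to userspace.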

INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the _original_ request, with
the error set to EINTR.

It is also possible that there's a race between processing the
original request and its INTERRUPT request. There are two possibilities:

  1) The INTERRUPT request is processed before the original request is
     processed

  2) The INTERRUPT request is processed after the original request has
     been answered

If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error. In case
1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT
reply will be ignored.

Aborting a filesystem connection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible to get into certain situations where the filesystem is
not responding. Reasons for this may be:

  a) Broken userspace filesystem implementation

  b) Network connection down

  c) Accidental deadlock

  d) Malicious deadlock

(For more on c) and d) see later sections.)

In any of these cases it may be useful to abort the connection to
the filesystem. There are several ways to do this:

  - Kill the filesystem daemon. Works in case of a) and b)

  - Kill the filesystem daemon and all users of the filesystem. Works
    in all cases except some malicious deadlocks

  - Use forced umount (umount -f). Works in all cases but only if
    the filesystem is still attached (it hasn't been lazy unmounted)

  - Abort filesystem through the FUSE control filesystem. Most
    powerful method, always works.

How do non-privileged mounts work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since the mount() system call is a privileged operation, a helper
program (fusermount3) is needed, which is installed setuid root.

The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system. Obvious requirements arising from this are:

 A) mount owner should not be able to get elevated privileges with the
    help of the mounted filesystem

 B) mount owner should not get illegitimate access to information from
    other users' and the super user's processes

 C) mount owner should not be able to induce undesired behavior in
    other users' or the super user's processes

How are requirements fulfilled?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 A) The mount owner could gain elevated privileges by either:

     1) creating a filesystem containing a device file, then opening
        this device

     2) creating a filesystem containing a suid or sgid application,
        then executing this application

    The solution is not to allow opening device files and to ignore
    setuid and setgid bits when executing programs. To ensure this,
    fusermount3 always adds "nosuid" and "nodev" to the mount options
    for non-privileged mounts.

 B) If another user is accessing files or directories in the
    filesystem, the filesystem daemon serving requests can record the
    exact sequence and timing of operations performed. This
    information is otherwise inaccessible to the mount owner, so this
    counts as an information leak.

    The solution to this problem will be presented in point 2) of C).

 C) There are several ways in which the mount owner can induce
    undesired behavior in other users' processes, such as:

     1) mounting a filesystem over a file or directory which the mount
        owner could not otherwise modify (or could only make limited
        modifications to).

        This is solved in fusermount3, by checking the access
        permissions on the mountpoint and only allowing the mount if
        the mount owner can do unlimited modification (has write
        access to the mountpoint, and mountpoint is not a "sticky"
        directory)

     2) Even if 1) is solved the mount owner can change the behavior
        of other users' processes.

         i) It can slow down or indefinitely delay the execution of a
            filesystem operation, creating a DoS against the user or
            the whole system. For example, a suid application locking
            a system file and then accessing a file on the mount
            owner's filesystem could be stopped, thus causing the
            system file to be locked forever.

         ii) It can present files or directories of unlimited length, or
             directory structures of unlimited depth, possibly causing a
             system process to eat up diskspace, memory or other
             resources, again causing DoS.

        The solution to this as well as B) is not to allow processes
        to access the filesystem, which could otherwise not be
        monitored or manipulated by the mount owner. Since if the
        mount owner can ptrace a process, it can do all of the above
        without using a FUSE mount, the same criteria as used in
        ptrace can be used to check if a process is allowed to access
        the filesystem or not.

        Note that the ptrace check is not strictly necessary to
        prevent C/2/i; it is enough to check if the mount owner has
        enough privilege to send a signal to the process accessing
        the filesystem, since SIGSTOP can be used to get a similar
        effect.

I think these limitations are unacceptable?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a sysadmin trusts the users enough, or can ensure through other
measures that system processes will never enter non-privileged
mounts, the sysadmin can relax the last limitation with a
"user_allow_other" config option.
If this config option is set, the mounting user can add the
"allow_other" mount option which disables the check for other
users' processes.

Kernel - userspace interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE.

NOTE: everything in this description is greatly simplified

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |                                    |  >sys_read()
 |                                    |    >fuse_dev_read()
 |                                    |      >request_wait()
 |                                    |        [sleep on fc->waitq]
 |                                    |
 |  >sys_unlink()                     |
 |    >fuse_unlink()                  |
 |      [get request from             |
 |       fc->unused_list]             |
 |      >request_send()               |
 |        [queue req on fc->pending]  |
 |        [wake up fc->waitq]         |        [woken up]
 |        >request_wait_answer()      |
 |          [sleep on req->waitq]     |
 |                                    |      <request_wait()
 |                                    |      [remove req from fc->pending]
 |                                    |      [copy req to read buffer]
 |                                    |      [add req to fc->processing]
 |                                    |    <fuse_dev_read()
 |                                    |  <sys_read()
 |                                    |
 |                                    |  [perform unlink]
 |                                    |
 |                                    |  >sys_write()
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |          [woken up]                |      [wake up req->waitq]
 |                                    |    <fuse_dev_write()
 |                                    |  <sys_write()
 |        <request_wait_answer()      |
 |      <request_send()               |
 |      [add request to               |
 |       fc->unused_list]             |
 |    <fuse_unlink()                  |
 |  <sys_unlink()                     |

There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.
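
Before looking at the deadlock scenarios, it may help to see the
request flow from the diagram as a toy model. The sketch below is
single-threaded Python, not kernel code. The class Connection and its
methods are hypothetical names loosely following the diagram, and
matching replies to requests by a 'unique' number is a modelling
assumption of this sketch.

```python
from collections import deque


class Connection:
    """Toy model of the pending/processing bookkeeping in the diagram."""

    def __init__(self):
        self.pending = deque()   # requests not yet read by the daemon
        self.processing = {}     # requests read but not yet answered
        self.next_unique = 1

    def request_send(self, opcode, arg):
        """Caller side: queue a request; the caller would now sleep."""
        req = {"unique": self.next_unique, "opcode": opcode,
               "arg": arg, "reply": None}
        self.next_unique += 1
        self.pending.append(req)
        return req

    def dev_read(self):
        """Daemon side: take the oldest pending request for processing."""
        req = self.pending.popleft()
        self.processing[req["unique"]] = req
        return req["unique"], req["opcode"], req["arg"]

    def dev_write(self, unique, error):
        """Daemon side: answer a request; the caller would be woken up."""
        req = self.processing.pop(unique)
        req["reply"] = error


conn = Connection()
req = conn.request_send("UNLINK", "file")
unique, opcode, arg = conn.dev_read()
# ... the daemon would perform the unlink here ...
conn.dev_write(unique, 0)
```

Until dev_write() is called, the 'reply' field stays empty, which
corresponds to the caller sleeping on req->waitq. A daemon that never
answers therefore keeps the caller blocked.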

Scenario 1 - Simple deadlock
-----------------------------

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |  >sys_unlink("/mnt/fuse/file")     |
 |    [acquire inode semaphore        |
 |     for "file"]                    |
 |    >fuse_unlink()                  |
 |      [sleep on req->waitq]         |
 |                                    |  <sys_read()
 |                                    |  >sys_unlink("/mnt/fuse/file")
 |                                    |    [acquire inode semaphore
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*

The solution for this is to allow the filesystem to be aborted.

Scenario 2 - Tricky deadlock
----------------------------

This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault.

 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
 |                                    |
 |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
 |  [mmap fd to 'addr']               |
 |  [close fd]                        |  [FLUSH triggers 'magic' flag]
 |  [read a byte from addr]           |
 |    >do_page_fault()                |
 |      [find or create page]         |
 |      [lock page]                   |
 |      >fuse_readpage()              |
 |         [queue READ request]       |
 |         [sleep on req->waitq]      |
 |                                    |  [read request to buffer]
 |                                    |  [create reply header before addr]
 |                                    |  >sys_write(addr - headerlength)
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |                                    |        >do_page_fault()
 |                                    |           [find or create page]
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *

The solution is basically the same as above.

An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
because the destination address of the copy may not be valid after the
request has returned.

This is solved by doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are faulted with
get_user_pages().
The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.
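
The delayed abort can be illustrated with a toy model. The sketch
below is Python, not kernel code. The class Request and the helper
functions are hypothetical and only mirror the rule stated above: an
abort that arrives while the copy is in progress takes effect once the
'locked' flag is cleared.

```python
class Request:
    """Toy request carrying only the flags needed for this sketch."""

    def __init__(self):
        self.locked = False    # set while the copy is taking place
        self.aborted = False   # set as soon as an abort is requested
        self.finished = False  # set once the abort has taken effect


def lock_request(req):
    """Begin the copy; refuse if the request was already aborted."""
    if req.aborted:
        return False
    req.locked = True
    return True


def unlock_request(req):
    """End the copy and complete an abort that arrived meanwhile."""
    req.locked = False
    if req.aborted:
        req.finished = True


def abort_request(req):
    """Abort now, or defer until the copy is done if it is in progress."""
    req.aborted = True
    if not req.locked:
        req.finished = True
```

In this model an abort never ends a request in the middle of a copy:
it either completes immediately or waits for unlock_request().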