Lines Matching +full:len +full:- +full:or +full:- +full:limit
5 This write-up is based on three articles published at lwn.net:
7 - <https://lwn.net/Articles/649115/> Pathname lookup in Linux
8 - <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
9 - <https://lwn.net/Articles/650786/> A walk among the symlinks
15 - per-directory parallel name lookup.
16 - ``openat2()`` resolution restriction flags.
27 the early parts of the analysis we will divide off symlinks - leaving
30 will allow us to review "REF-walk" and "RCU-walk" separately. But we
35 --------------------------
37 .. _openat: http://man7.org/linux/man-pages/man2/openat.2.html
41 of elements: "slashes" that are sequences of one or more "``/``"
42 characters, and "components" that are sequences of one or more
43 non-"``/``" characters. These form two kinds of paths. Those that
45 The others are "relative" and start from the current directory, or
49 .. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
63 such as ``ENOENT`` or ``ENOTDIR`` will be reported.
69 exist, it could be "``.``" or "``..``" which are handled quite differently
77 particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
81 A pathname that contains at least one non-<slash> character and
82 that ends with one or more trailing <slash> characters shall not
84 the trailing <slash> characters names an existing directory or a
106 ----------------------
129 the VFS to determine if a particular file does or doesn't exist
136 whether remote filesystems like NFS and 9P, or cluster filesystems
137 like ocfs2 or cephfs. These filesystems allow the VFS to revalidate
142 REF-walk: simple concurrency management with refcounts and spinlocks
143 --------------------------------------------------------------------
148 pathname, and focus on the "REF-walk" approach to concurrency
151 (indicating the use of RCU-walk) is set.
155 REF-walk is fairly heavy-handed with locks and reference counts. Not
156 as heavy-handed as in the old "big kernel lock" days, but certainly not
159 various primitives is assumed, or can be gleaned from elsewhere such
162 The locking mechanisms used by REF-walk include:
164 dentry->d_lockref
168 reference count. The special-sauce of this primitive is that the
174 will behave as expected. It also protects the ``->d_inode`` reference
184 setting ``d_inode`` to ``NULL``, or by removing it from the hash table
193 So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
196 dentry->d_lock
201 renamed or unlinked. In particular, its parent (``d_parent``), and its
205 When looking for a name in a directory, REF-walk takes ``d_lock`` on
211 REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
232 The name-lookup process (``d_lookup()``) does *not* try to prevent this
244 ``-EAGAIN``.
246 inode->i_rwsem
253 currently in the dcache or, optionally, when the list of entries in a
260 ``d_lock`` on one or more the dentries while the change happens. One
276 to take, or not take, ``i_rwsem`` is one of the
279 If two threads attempt to look up the same name at the same time - a
280 name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
284 per-dentry flag bit (``DCACHE_PAR_LOOKUP``).
289 is already a matching dentry in the primary or secondary hash
302 dentry from the secondary hash table - it will normally have been
321 mnt->mnt_count
324 ``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
325 Per-CPU here means that incrementing the count is cheap as it only
326 uses CPU-local memory, but checking if the count is zero is expensive as
331 in particular, doesn't stabilize the link to the mounted-on dentry. It
334 filesystem. So a reference through ``->mnt_count`` provides a stable
335 reference to the mounted dentry, but not the mounted-on dentry.
356 needed to stabilize the link to the mounted-on dentry, which the
364 ``-EAGAIN``.
377 ----------------------------------------------
379 .. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
382 in a ``struct nameidata``, "namei" being the traditional name - dating
383 all the way back to `First Edition Unix`_ - of the function that
393 starting point (the current working directory, the root directory, or some other
407 This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT`` or ``LAST_DOTDOT``.
415 only assigned the first time it is used, or when a non-standard root
420 It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
425 pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
429 ``sysctl()`` calls ``file_open_root()``, and when NFSv4 or Btrfs call
454 directly from walk_component() or from handle_dots(). It calls
463 This "hand-over-hand" sequencing of getting a reference to the new
466 analogue in the "RCU-walk" version.
469 ----------------------------
471 ``link_path_walk()`` only walks as far as setting ``nd->last`` and
472 ``nd->last_type`` to refer to the final component of the path. It does
479 ``path_parentat()`` is clearly the simplest - it just wraps a little bit
482 aiming to create a name (via ``filename_create()``) or remove or rename
487 ``path_lookupat()`` is nearly as simple - it is used when an existing
488 object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
500 or without O_EXCL), final "``/``" characters, and trailing symbolic
505 Each of these, or the functions which call them, need to be alert to
511 by testing ``last.name[last.len]``. If there is any character beyond
515 ---------------------------
517 Apart from symbolic links, there are only two parts of the "REF-walk"
522 ``->d_revalidate()`` dentry method to ensure that the cached information
523 is current. This will often confirm validity or update a few details
540 to three different flags that might be set in ``dentry->d_flags``:
559 ``d_manage()`` by returning ``-EISDIR``.
569 If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
583 report that there was an error, that there was nothing to mount, or
592 This will become more important next time when we examine RCU-walk
595 RCU-walk - faster pathname lookup in Linux
598 RCU-walk is another algorithm for performing pathname lookup in Linux.
599 It is in many ways similar to REF-walk and the two share quite a bit
600 of code. The significant difference in RCU-walk is how it allows for
603 We noted that REF-walk is complex because there are numerous details
604 and special cases. RCU-walk reduces this complexity by simply
605 refusing to handle a number of cases -- it instead falls back to
606 REF-walk. The difficulty with RCU-walk comes from a different
612 --------------------------
625 The REF-walk mechanism already described certainly doesn't follow this
627 be other threads modifying the data. RCU-walk, in contrast, is
631 other parts it is important that RCU-walk can quickly fall back to
632 using REF-walk.
634 Pathname lookup always starts in RCU-walk mode but only remains there
638 notices that something has changed or is changing, or if something
640 REF-walk.
643 ``vfsmount`` and ``dentry``, and ensuring that these are still valid -
644 that a path walk with REF-walk would have found the same entries.
645 This is an invariant that RCU-walk must guarantee. It can only make
647 REF-walk could also have made if it were walking down the tree at the
649 processed with the reliable, if slightly sluggish, REF-walk. If
650 RCU-walk finds it cannot stop gracefully, it simply gives up and
651 restarts from the top with REF-walk.
653 This pattern of "try RCU-walk, if that fails try REF-walk" can be
660 They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
662 special flag to request "REF-walk". If either of those report the
665 revalidated - normally entries are only revalidated if the filesystem
669 REF-walk, but will never then try to switch back to RCU-walk. Places
670 that trip up RCU-walk are much more likely to be near the leaves and
675 --------------------------------
677 RCU is, unsurprisingly, critical to RCU-walk mode. The
678 ``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
680 data structures - dentries, inodes, super_blocks, and mounts - will
681 not be freed while the lock is held. They might be unlinked or
682 invalidated in one way or another, but the memory will not be
687 As we saw above, REF-walk holds a counted reference to the current
689 before taking references to the "next" dentry or vfsmount. It also
691 taken to prevent certain changes from happening. RCU-walk must not
692 take those references or locks and so cannot prevent such changes.
693 Instead, it checks to see if a change has been made, and aborts or
696 To preserve the invariant mentioned above (that RCU-walk may only make
697 decisions that REF-walk could have made), it must make the checks at
698 or near the same places that REF-walk holds the references. So, when
699 REF-walk increments a reference count or takes a spinlock, RCU-walk
700 samples the status of a seqlock using ``read_seqcount_begin()`` or a
701 similar function. When REF-walk decrements the count or drops the
702 lock, RCU-walk checks if the sampled status is still valid using
703 ``read_seqcount_retry()`` or similar.
706 RCU-walk accesses two different fields in a seqlock-protected
707 structure, or accesses the same field twice, there is no a priori
709 is needed - which it usually is - RCU-walk must take a copy and then
713 imposes a memory barrier so that no memory-read instruction from
715 CPU or by the compiler. A simple example of this can be seen in
717 byte-wise name equality, calls into the filesystem to compare a name
720 are consistent, and only then is ``->d_compare()`` called. When
728 the bigger picture of how RCU-walk uses seqlocks.
730 ``mount_lock`` and ``nd->m_seq``
733 We already met the ``mount_lock`` seqlock when REF-walk used it to
734 ensure that crossing a mount point is performed safely. RCU-walk uses
738 descends the tree, RCU-walk samples the state of ``mount_lock`` at the
743 relatively rare, it is reasonable to fall back on REF-walk any time
744 that any "mount" or "unmount" happens.
746 ``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
747 sequence, whether switching to REF-walk for the rest of the path or
749 down over a mount point (in ``__follow_mount_rcu()``) or up (in
751 whole RCU-walk sequence is aborted and the path is processed again by
752 REF-walk.
754 If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
755 that, had REF-walk taken counted references on each vfsmount, the
759 ``dentry->d_seq`` and ``nd->seq``
762 In place of taking a count or lock on ``d_reflock``, RCU-walk samples
763 the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
764 ``seq`` field of the nameidata structure, so ``nd->seq`` should always be
765 the current sequence number of ``nd->dentry``. This number needs to be
766 revalidated after copying, and before using, the name, parent, or
775 ``mnt->mnt_mountpoint`` link to get a new dentry and collect its
778 mount point and follow the ``mnt->mnt_root`` link. This would imply a
783 The inode pointer, stored in ``->d_inode``, is a little more
791 ``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
804 the old dentry which we saw in REF-walk.
806 No ``inode->i_rwsem`` or even ``rename_lock``
811 ``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does
812 take ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
813 to notice, the result will be either that RCU-walk fails to find the
814 dentry that it is looking for, or it will find a dentry which
816 REF-walk mode which can take whatever locks are needed.
818 Though ``rename_lock`` could be used by RCU-walk as it doesn't require
819 any sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to
822 something that actually is there. When RCU-walk fails to find
823 something in the dentry cache, whether it is really there or not, it
824 already drops down to REF-walk and tries again with appropriate
829 -----------------------------------------
831 That "dropping down to REF-walk" typically involves a call to
832 ``unlazy_walk()``, so named because "RCU-walk" is also sometimes
837 permission checking or name revalidation couldn't be achieved while
839 automount point is found, or in a couple of cases involving symlinks.
841 the final component, or the very end of the path, depending on which
844 Other reasons for dropping out of RCU-walk that do not trigger a call
846 handled immediately, such as ``mount_lock`` or one of the ``d_seq``
848 will return ``-ECHILD`` which will percolate up until it triggers a new
849 attempt from the top using REF-walk.
855 it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
862 all. For ``dentry->d_lockref``, it is safe to increment the reference
864 "dead" which involves setting the counter to ``-128``.
867 For ``mnt->mnt_count`` it is safe to take a reference as long as
870 the standard way of calling ``mnt_put()`` - an unmount may have
874 correct, or if it should just decrement the count and pretend none of
878 --------------------------
880 RCU-walk depends almost entirely on cached information and often will
882 besides the already-mentioned component-name comparison, where the
883 file system might be included in RCU-walk, and it must know to be
886 If the filesystem has non-standard permission-checking requirements -
888 - the ``i_op->permission`` interface might be called during RCU-walk.
890 knows not to sleep, but to return ``-ECHILD`` if it cannot complete
891 promptly. ``i_op->permission`` is given the inode pointer, not the
895 held. This typically means they must be freed using ``kfree_rcu()`` or
901 ``d_op->d_revalidate`` may be called in RCU-walk too. This interface
902 *is* passed the dentry but does not have access to the ``inode`` or the
910 ------------------
912 In various places in the details of REF-walk and RCU-walk, and also in
917 can see that in the high-level approach of first trying RCU-walk and
918 then trying REF-walk, and in places where ``unlazy_walk()`` is used to
919 switch to REF-walk for the rest of the path. We also saw it earlier
924 again - repeatedly". This is seen with the use of ``rename_lock`` and
925 ``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern -
944 Then a consideration of access-time updates and summary of the various
948 -----------------
961 "``readlink -f``" command does, though it also edits out "``.``" and
976 If a symlink referred to itself either directly or through
978 successfully - the error ``ELOOP`` must be returned. Loops can be
987 true loops, but also to "very deep" non-loops. It's not about memory
990 Linux imposes a limit on the length of any pathname: ``PATH_MAX``, which
991 is 4096. There are a number of reasons for this limit; not letting the
994 sort of limit is needed for the same reason. Linux imposes a limit of
996 a further limit of eight on the maximum depth of recursion, but that was
998 just the one limit.
1012 ---------------------------------------
1016 to external storage. It is particularly important for RCU-walk to be
1018 it doesn't need to drop down into REF-walk.
1020 .. _object-oriented design pattern: https://lwn.net/Articles/446317/
1026 common `object-oriented design pattern`_ in the kernel). This will
1033 that the filesystem will allocate some temporary memory and copy or
1037 the inode which, itself, is protected by RCU or by a counted reference
1044 When the symlink is stored in the page cache or elsewhere, the
1045 situation is not so straightforward. A reference on a dentry or even
1053 Taking a reference to a cache page is often possible even in RCU-walk
1056 out of RCU-walk mode completely. Even filesystems that allocate
1058 allocate memory without the need to drop out of RCU-walk. If a
1059 filesystem cannot successfully get a reference in RCU-walk mode, it
1060 must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
1061 REF-walk mode in which the filesystem is allowed to sleep.
1063 The place for all this to happen is the ``i_op->get_link()`` inode
1064 method. This is called both in RCU-walk and REF-walk. In RCU-walk the
1065 ``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of
1066 RCU-walk. Much like the ``i_op->permission()`` method we
1067 looked at previously, ``->get_link()`` would need to be careful that
1070 ``struct delayed_called`` will be passed to ``->get_link()``:
1076 whether in RCU-walk or REF-walk, the symlink stack needs to contain,
1079 - the ``struct path`` to provide a reference to the previous path
1080 - the ``const char *`` to provide a reference to the to previous name
1081 - the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
1082 - the ``struct delayed_call`` for later invocation.
1086 remnant). On a 64-bit system, this is about 40 bytes per entry;
1096 ---------------------
1099 components in the path and all of the non-final symlinks. As symlinks
1101 symlink, or is restored from the stack, so that much of the loop
1116 want to pop the symlink-just-completed off the stack before pushing
1117 the symlink-just-found to avoid leaving empty path remnants that would
1137 A pair of special-case symlinks deserve a little further explanation.
1149 aren't really (and are therefore commonly referred to as "magic-links")::
1151 $ ls -l /proc/self/fd/1
1152 lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
1157 objects you get a name that might refer to the same file - unless it
1158 has been unlinked or mounted over. When ``walk_component()`` follows
1159 one of these, the ``->get_link()`` method in "procfs" doesn't return
1161 ``nameidata`` in place to point to that target. ``->get_link()`` then
1166 --------------------------------------------
1172 ``last`` name if it doesn't exist or give an error if it does. Other
1182 and then handles the final component by calling open_last_lookups() or
1184 open_last_lookups() or lookup_last() will set things up properly and
1202 the filesystem provides it) to combine the final lookup with the open, or
1203 will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
1204 directly. In the later case the actual "open" of this newly found or
1208 2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
1209 wasn't quite current enough. If it's in RCU-walk ``-ECHILD`` will be returned
1210 otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may
1216 ln -s bar /tmp/foo
1221 like for a non-creating open: lookup_last() or open_last_lookup()
1226 ------------------------
1228 We previously said of RCU-walk that it would "take no locks, increment
1231 reference (or even a memory allocation) may be needed. But these
1237 time", or "``atime``". Passing through a directory to access a file
1251 care about access-time updates during pathname lookup.
1260 quite complex. Trying to stay in RCU-walk while doing it is best
1266 ``relatime``, many filesystems record ``atime`` with a one-second
1269 It is easy to test if an ``atime`` update is needed while in RCU-walk
1270 mode and, if it isn't, the update can be skipped and RCU-walk mode
1272 path walk drop down to REF-walk. All of this is handled in the
1276 -----------
1294 to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
1307 or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
1310 to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
1311 ``ND_JUMPED`` is set when the look completes - which may be at the
1312 final component or, when creating, unlinking, or renaming, at the penultimate component.
1314 Resolution-restriction flags
1322 ``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
1326 ``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
1328 ``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
1331 bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
1332 lookup is determined by the first mountpoint the path lookup reaches --
1334 with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
1341 resolution of "..". Magic-links are also blocked.
1345 the starting point, and ".." at the starting point will act as a no-op. As with
1347 attacks against ".." resolution. Magic-links are also blocked.
1349 Final-component flags
1361 "``mount --bind``".
1364 symlinks. Some system calls set or clear it implicitly, while
1375 available to the filesystem and particularly the ``->d_revalidate()``
1377 if it knows that it will be asked to open or create the file soon.
1378 These flags were previously useful for ``->lookup()`` too but with the
1379 introduction of ``->atomic_open()`` they are less relevant there.
1382 ---------------
1385 in good shape - various parts are certainly easier to understand now
1387 "finished". As already mentioned, RCU-walk currently only follows
1389 symlinks, it doesn't help with NFS, XFS, or Btrfs. That support