===============
Pathname lookup
===============

This write-up is based on three articles published at lwn.net:

- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
- <https://lwn.net/Articles/650786/> A walk among the symlinks

Written by Neil Brown with help from Al Viro and Jon Corbet.
It has subsequently been updated to reflect changes in the kernel
including:

- per-directory parallel name lookup.
- ``openat2()`` resolution restriction flags.

Introduction to pathname lookup
===============================

The most obvious aspect of pathname lookup, which very little
exploration is needed to discover, is that it is complex. There are
many rules, special cases, and implementation alternatives that all
combine to confuse the unwary reader. Computer science has long been
acquainted with such complexity and has tools to help manage it. One
tool that we will make extensive use of is "divide and conquer". For
the early parts of the analysis we will divide off symlinks - leaving
them until the final part.
Well before we get to symlinks we have
another major division based on the VFS's approach to locking which
will allow us to review "REF-walk" and "RCU-walk" separately. But we
are getting ahead of ourselves. There are some important low level
distinctions we need to clarify first.

There are two sorts of ...
--------------------------

.. _openat: http://man7.org/linux/man-pages/man2/openat.2.html

Pathnames (sometimes "file names"), used to identify objects in the
filesystem, will be familiar to most readers. They contain two sorts
of elements: "slashes" that are sequences of one or more "``/``"
characters, and "components" that are sequences of one or more
non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
from some other location specified by a file descriptor given to
"``*at()``" system calls such as `openat() <openat_>`_.

.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html

It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words.
This is generally forbidden in POSIX, but some of those "``*at()``" system
calls in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
can execute it by calling `execveat() <execveat_>`_ passing
the file descriptor, an empty path, and the ``AT_EMPTY_PATH`` flag.

These paths can be divided into two sections: the final component and
everything else. The "everything else" is the easy bit. In all cases
it must identify a directory that already exists, otherwise an error
such as ``ENOENT`` or ``ENOTDIR`` will be reported.

The final component is not so simple. Not only do different system
calls interpret it quite differently (e.g. some create it, some do
not), but it might not even exist: neither the empty pathname nor the
pathname that is just slashes has a final component. If it does
exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.

.. _POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12

If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always.
In particular, ``mkdir()`` and ``rmdir()`` respectively create and remove a
directory named by the final component, and they are required to work with
pathnames ending in "``/``". According to POSIX_:

  A pathname that contains at least one non-<slash> character and
  that ends with one or more trailing <slash> characters shall not
  be resolved successfully unless the last pathname component before
  the trailing <slash> characters names an existing directory or a
  directory entry that is to be created for a directory immediately
  after the pathname is resolved.

The Linux pathname walking code (mostly in ``fs/namei.c``) deals with
all of these issues: breaking the path into components, handling the
"everything else" quite separately from the final component, and
checking that the trailing slash is not used where it isn't
permitted. It also addresses the important issue of concurrent
access.

While one process is looking up a pathname, another might be making
changes that affect that lookup. One fairly extreme case is that if
"a/b" were renamed to "a/c/b" while another process were looking up
"a/b/..", that process might successfully resolve on "a/c".
Most races are much more subtle, and a big part of the task of
pathname lookup is to prevent them from having damaging effects.
Many
of the possible races are seen most clearly in the context of the
"dcache" and an understanding of that is central to understanding
pathname lookup.

More than just a cache
----------------------

The "dcache" caches information about names in each filesystem to
make them quickly available for lookup. Each entry (known as a
"dentry") contains three significant fields: a component name, a
pointer to a parent dentry, and a pointer to the "inode" which
contains further information about the object in that parent with
the given name. The inode pointer can be ``NULL`` indicating that the
name doesn't exist in the parent. While there can be linkage in the
dentry of a directory to the dentries of the children, that linkage is
not used for pathname lookup, and so will not be considered here.

The dcache has a number of uses apart from accelerating lookup. One
that will be particularly relevant is that it is closely integrated
with the mount table that records which filesystem is mounted where.
What the mount table actually stores is which dentry is mounted on top
of which other dentry.

When considering the dcache, we have another of our "two types"
distinctions: there are two types of filesystems.

Some filesystems ensure that the information in the dcache is always
completely accurate (though not necessarily complete). This can allow
the VFS to determine if a particular file does or doesn't exist
without checking with the filesystem, and means that the VFS can
protect the filesystem against certain races and other problems.
These are typically "local" filesystems such as ext3, XFS, and Btrfs.

Other filesystems don't provide that guarantee because they cannot.
These are typically filesystems that are shared across a network,
whether remote filesystems like NFS and 9P, or cluster filesystems
like ocfs2 or cephfs. These filesystems allow the VFS to revalidate
cached information, and must provide their own protection against
awkward races. The VFS can detect these filesystems by the
``DCACHE_OP_REVALIDATE`` flag being set in the dentry.

REF-walk: simple concurrency management with refcounts and spinlocks
--------------------------------------------------------------------

With all of those divisions carefully classified, we can now start
looking at the actual process of walking along a path. In particular
we will start with the handling of the "everything else" part of a
pathname, and focus on the "REF-walk" approach to concurrency
management.
This code is found in the ``link_path_walk()`` function, if
you ignore all the places that only run when "``LOOKUP_RCU``"
(indicating the use of RCU-walk) is set.

.. _Meet the Lockers: https://lwn.net/Articles/453685/

REF-walk is fairly heavy-handed with locks and reference counts. Not
as heavy-handed as in the old "big kernel lock" days, but certainly not
afraid of taking a lock when one is needed. It uses a variety of
different concurrency controls. A background understanding of the
various primitives is assumed, or can be gleaned from elsewhere such
as in `Meet the Lockers`_.

The locking mechanisms used by REF-walk include:

dentry->d_lockref
~~~~~~~~~~~~~~~~~

This uses the lockref primitive to provide both a spinlock and a
reference count. The special-sauce of this primitive is that the
conceptual sequence "lock; inc_ref; unlock;" can often be performed
with a single atomic memory operation.

Holding a reference on a dentry ensures that the dentry won't suddenly
be freed and used for something else, so the values in various fields
will behave as expected. It also protects the ``->d_inode`` reference
to the inode to some extent.

The association between a dentry and its inode is fairly permanent.
For example, when a file is renamed, the dentry and inode move
together to the new location. When a file is created the dentry will
initially be negative (i.e. ``d_inode`` is ``NULL``), and will be assigned
to the new inode as part of the act of creation.

When a file is deleted, this can be reflected in the cache either by
setting ``d_inode`` to ``NULL``, or by removing it from the hash table
(described shortly) used to look up the name in the parent directory.
If the dentry is still in use the second option is used as it is
perfectly legal to keep using an open file after it has been deleted
and having the dentry around helps. If the dentry is not otherwise in
use (i.e. if the refcount in ``d_lockref`` is one), only then will
``d_inode`` be set to ``NULL``. Doing it this way is more efficient for a
very common case.

So as long as a counted reference is held to a dentry, a non-``NULL`` ``->d_inode``
value will never be changed.

dentry->d_lock
~~~~~~~~~~~~~~

``d_lock`` is a synonym for the spinlock that is part of ``d_lockref`` above.
For our purposes, holding this lock protects against the dentry being
renamed or unlinked. In particular, its parent (``d_parent``), and its
name (``d_name``) cannot be changed, and it cannot be removed from the
dentry hash table.

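
Because ``d_lock`` and the reference count live together in ``d_lockref``,
the "lock; inc_ref; unlock;" sequence can be collapsed into one
compare-and-swap whenever the lock is not held. The following is a minimal
user-space sketch of that idea using C11 atomics. It is not the kernel's
lockref code: the ``toy_`` names and the bit layout (lock flag in bit 0,
count above it) are assumptions made purely for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lockref: lock flag in bit 0, reference count in the bits above it.
 * This layout is an assumption for the sketch, not the kernel's layout. */
struct toy_lockref {
    _Atomic uint64_t lock_count;
};

#define TOY_LOCKED  ((uint64_t)1)
#define TOY_ONE_REF ((uint64_t)2)

/* Current reference count, ignoring the lock bit. */
static unsigned int toy_lockref_count(struct toy_lockref *ref)
{
    return (unsigned int)(atomic_load(&ref->lock_count) >> 1);
}

/*
 * Fast path for "lock; inc_ref; unlock;": while the lock bit is clear,
 * try to bump the count with a single compare-and-swap.  Returns false
 * if the lock is held; a fuller implementation would then fall back to
 * taking the spinlock and incrementing under it.
 */
static bool toy_lockref_get_fast(struct toy_lockref *ref)
{
    uint64_t old = atomic_load(&ref->lock_count);

    while (!(old & TOY_LOCKED)) {
        /* On failure 'old' is refreshed and the lock bit re-checked. */
        if (atomic_compare_exchange_weak(&ref->lock_count, &old,
                                         old + TOY_ONE_REF))
            return true;
    }
    return false;
}
```

A caller would try ``toy_lockref_get_fast()`` first and only take the
spinlock when it returns false, which keeps the common uncontended case
down to one atomic operation.
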

When looking for a name in a directory, REF-walk takes ``d_lock`` on
each candidate dentry that it finds in the hash table and then checks
that the parent and name are correct. So it doesn't lock the parent
while searching in the cache; it only locks children.

When looking for the parent for a given name (to handle "``..``"),
REF-walk can take ``d_lock`` to get a stable reference to ``d_parent``,
but it first tries a more lightweight approach. As seen in
``dget_parent()``, if a reference can be claimed on the parent, and if
subsequently ``d_parent`` can be seen to have not changed, then there is
no need to actually take the lock on the child.

rename_lock
~~~~~~~~~~~

Looking up a given name in a given directory involves computing a hash
from the two values (the name and the dentry of the directory),
accessing that slot in a hash table, and searching the linked list
that is found there.

When a dentry is renamed, the name and the parent dentry can both
change so the hash will almost certainly change too. This would move the
dentry to a different chain in the hash table. If a filename search
happened to be looking at a dentry that was moved in this way,
it might end up continuing the search down the wrong chain,
and so miss out on part of the correct chain.

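
The shape of that lookup can be sketched in a few lines of user-space C.
This is only an illustration under stated assumptions: the ``toy_`` types,
the table size and the hash function are all hypothetical, and no locking
is shown. What it does show is why both the parent and the name feed the
hash, and why each candidate on the chain must still be compared against
both.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_BITS 4
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)

/* Toy dentry holding only the fields the lookup needs. */
struct toy_dentry {
    struct toy_dentry *d_parent;  /* directory containing this name */
    const char *name;             /* component name */
    size_t len;                   /* explicit length of the name */
    struct toy_dentry *next;      /* next entry on the same hash chain */
};

static struct toy_dentry *toy_hashtable[TOY_HASH_SIZE];

/* Mix the parent's address with the name bytes (illustrative hash only). */
static unsigned int toy_hash(const struct toy_dentry *parent,
                             const char *name, size_t len)
{
    unsigned long h = (unsigned long)(uintptr_t)parent >> 4;

    for (size_t i = 0; i < len; i++)
        h = h * 31 + (unsigned char)name[i];
    return (unsigned int)(h & (TOY_HASH_SIZE - 1));
}

/* Insert a dentry at the head of the chain selected by (parent, name). */
static void toy_d_add(struct toy_dentry *d)
{
    unsigned int slot = toy_hash(d->d_parent, d->name, d->len);

    d->next = toy_hashtable[slot];
    toy_hashtable[slot] = d;
}

/* Walk one chain; a match needs the same parent AND the same name. */
static struct toy_dentry *toy_d_lookup(const struct toy_dentry *parent,
                                       const char *name, size_t len)
{
    struct toy_dentry *d = toy_hashtable[toy_hash(parent, name, len)];

    for (; d; d = d->next)
        if (d->d_parent == parent && d->len == len &&
            memcmp(d->name, name, len) == 0)
            return d;
    return NULL;
}
```

Renaming a dentry in this model would mean unlinking it from one chain and
adding it to another, which is exactly the move that a concurrent walker of
the old chain could trip over.
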

The name-lookup process (``d_lookup()``) does *not* try to prevent this
from happening, but only to detect when it happens.
``rename_lock`` is a seqlock that is updated whenever any dentry is
renamed. If ``d_lookup()`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.

``rename_lock`` is also used to detect and defend against potential attacks
against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
the parent directory is moved outside the root, bypassing the ``path_equal()``
check). If ``rename_lock`` is updated during the lookup and the path encounters
a "..", a potential attack occurred and ``handle_dots()`` will bail out with
``-EAGAIN``.

inode->i_rwsem
~~~~~~~~~~~~~~

``i_rwsem`` is a read/write semaphore that serializes all changes to a particular
directory. This ensures that, for example, an ``unlink()`` and a ``rename()``
cannot both happen at the same time. It also keeps the directory
stable while the filesystem is asked to look up a name that is not
currently in the dcache or, optionally, when the list of entries in a
directory is being retrieved with ``readdir()``.

This has a complementary role to that of ``d_lock``: ``i_rwsem`` on a
directory protects all of the names in that directory, while ``d_lock``
on a name protects just one name in a directory. Most changes to the
dcache hold ``i_rwsem`` on the relevant directory inode and briefly take
``d_lock`` on one or more of the dentries while the change happens. One
exception is when idle dentries are removed from the dcache due to
memory pressure. This uses ``d_lock``, but ``i_rwsem`` plays no role.

The semaphore affects pathname lookup in two distinct ways. Firstly it
prevents changes during lookup of a name in a directory. ``walk_component()`` uses
``lookup_fast()`` first which, in turn, checks to see if the name is in the cache,
using only ``d_lock`` locking. If the name isn't found, then ``walk_component()``
falls back to ``lookup_slow()`` which takes a shared lock on ``i_rwsem``, checks again that
the name isn't in the cache, and then calls in to the filesystem to get a
definitive answer. A new dentry will be added to the cache regardless of
the result.

Secondly, when pathname lookup reaches the final component, it will
sometimes need to take an exclusive lock on ``i_rwsem`` before performing the last lookup so
that the required exclusion can be achieved. How path lookup chooses
to take, or not take, ``i_rwsem`` is one of the
issues addressed in a subsequent section.

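
The first of these two roles, the fast-then-slow lookup, follows a common
"check, lock, check again" pattern. The sketch below models it in
single-threaded user-space C under loud assumptions: every ``toy_`` name is
hypothetical, the cache is a small array rather than a hash table, and the
shared hold on ``i_rwsem`` is represented only by a reader counter, so it
illustrates the control flow rather than real locking.

```c
#include <stddef.h>
#include <string.h>

#define TOY_CACHE_MAX 8

/* One cached name; exists == 0 models a negative dentry. */
struct toy_entry {
    const char *name;
    int exists;
};

/* Toy directory: a name cache, a reader count standing in for a shared
 * hold on i_rwsem, and a count of calls made to the "filesystem". */
struct toy_dir {
    struct toy_entry cache[TOY_CACHE_MAX];
    size_t ncached;
    int readers;
    int fs_calls;
    const char *const *on_disk;   /* NULL-terminated list of real names */
};

/* Fast path: consult the cache only. */
static struct toy_entry *toy_lookup_fast(struct toy_dir *dir, const char *name)
{
    for (size_t i = 0; i < dir->ncached; i++)
        if (strcmp(dir->cache[i].name, name) == 0)
            return &dir->cache[i];
    return NULL;
}

/* Slow path: "lock" the directory, recheck the cache, then ask the
 * filesystem and cache the answer whether or not the name exists.
 * Returns NULL only if the toy cache is full. */
static struct toy_entry *toy_lookup_slow(struct toy_dir *dir, const char *name)
{
    struct toy_entry *e;

    dir->readers++;                         /* take i_rwsem shared */
    e = toy_lookup_fast(dir, name);         /* someone may have added it */
    if (!e && dir->ncached < TOY_CACHE_MAX) {
        dir->fs_calls++;                    /* definitive answer from the fs */
        e = &dir->cache[dir->ncached++];
        e->name = name;
        e->exists = 0;
        for (const char *const *p = dir->on_disk; *p; p++)
            if (strcmp(*p, name) == 0)
                e->exists = 1;
    }
    dir->readers--;                         /* release i_rwsem */
    return e;
}

/* Try the cache first and fall back to the slow path on a miss. */
static struct toy_entry *toy_walk_component(struct toy_dir *dir,
                                            const char *name)
{
    struct toy_entry *e = toy_lookup_fast(dir, name);

    return e ? e : toy_lookup_slow(dir, name);
}
```

Looking up the same name twice reaches the "filesystem" only once, and a
name that does not exist still leaves a negative entry behind, so repeated
failed lookups are also served from the cache.
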

If two threads attempt to look up the same name at the same time - a
name that is not yet in the dcache - the shared lock on ``i_rwsem`` will
not prevent them both adding new dentries with the same name. As this
would result in confusion an extra level of interlocking is used,
based around a secondary hash table (``in_lookup_hashtable``) and a
per-dentry flag bit (``DCACHE_PAR_LOOKUP``).

To add a new dentry to the cache while only holding a shared lock on
``i_rwsem``, a thread must call ``d_alloc_parallel()``. This allocates a
dentry, stores the required name and parent in it, checks if there
is already a matching dentry in the primary or secondary hash
tables, and if not, stores the newly allocated dentry in the secondary
hash table, with ``DCACHE_PAR_LOOKUP`` set.

If a matching dentry was found in the primary hash table then that is
returned and the caller can know that it lost a race with some other
thread adding the entry. If no matching dentry is found in either
cache, the newly allocated dentry is returned and the caller can
detect this from the presence of ``DCACHE_PAR_LOOKUP``. In this case it
knows that it has won any race and now is responsible for asking the
filesystem to perform the lookup and find the matching inode.
When
the lookup is complete, it must call ``d_lookup_done()`` which clears
the flag and does some other housekeeping, including removing the
dentry from the secondary hash table - it will normally have been
added to the primary hash table already. Note that a
``struct waitqueue_head`` is passed to ``d_alloc_parallel()``, and
``d_lookup_done()`` must be called while this ``waitqueue_head`` is still
in scope.

If a matching dentry is found in the secondary hash table,
``d_alloc_parallel()`` has a little more work to do. It first waits for
``DCACHE_PAR_LOOKUP`` to be cleared, using a wait_queue that was passed
to the instance of ``d_alloc_parallel()`` that won the race and that
will be woken by the call to ``d_lookup_done()``. It then checks to see
if the dentry has now been added to the primary hash table. If it
has, the dentry is returned and the caller just sees that it lost any
race. If it hasn't been added to the primary hash table, the most
likely explanation is that some other dentry was added instead using
``d_splice_alias()``. In any case, ``d_alloc_parallel()`` repeats all the
lookups from the start and will normally return something from the
primary hash table.

mnt->mnt_count
~~~~~~~~~~~~~~

``mnt_count`` is a per-CPU reference counter on "``mount``" structures.
Per-CPU here means that incrementing the count is cheap as it only
uses CPU-local memory, but checking if the count is zero is expensive as
it needs to check with every CPU. Taking a ``mnt_count`` reference
prevents the mount structure from disappearing as the result of regular
unmount operations, but does not prevent a "lazy" unmount. So holding
``mnt_count`` doesn't ensure that the mount remains in the namespace and,
in particular, doesn't stabilize the link to the mounted-on dentry. It
does, however, ensure that the ``mount`` data structure remains coherent,
and it provides a reference to the root dentry of the mounted
filesystem. So a reference through ``->mnt_count`` provides a stable
reference to the mounted dentry, but not the mounted-on dentry.

mount_lock
~~~~~~~~~~

``mount_lock`` is a global seqlock, a bit like ``rename_lock``. It can be used to
check if any change has been made to any mount points.

While walking down the tree (away from the root) this lock is used when
crossing a mount point to check that the crossing was safe. That is,
the value in the seqlock is read, then the code finds the mount that
is mounted on the current directory, if there is one, and increments
the ``mnt_count``. Finally the value in ``mount_lock`` is checked against
the old value. If there is no change, then the crossing was safe.
If there
was a change, the ``mnt_count`` is decremented and the whole process is
retried.

When walking up the tree (towards the root) by following a ".." link,
a little more care is needed. In this case the seqlock (which
contains both a counter and a spinlock) is fully locked to prevent
any changes to any mount points while stepping up. This locking is
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.

``mount_lock`` is also used to detect and defend against potential attacks
against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
the parent directory is moved outside the root, bypassing the ``path_equal()``
check). If ``mount_lock`` is updated during the lookup and the path encounters
a "..", a potential attack occurred and ``handle_dots()`` will bail out with
``-EAGAIN``.

RCU
~~~

Finally the global (but extremely lightweight) RCU read lock is held
from time to time to ensure certain data structures don't get freed
unexpectedly.

In particular it is held while scanning chains in the dcache hash
table, and the mount point hash table.

Bringing it together with ``struct nameidata``
----------------------------------------------

.. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s

Throughout the process of walking a path, the current status is stored
in a ``struct nameidata``, "namei" being the traditional name - dating
all the way back to `First Edition Unix`_ - of the function that
converts a "name" to an "inode". ``struct nameidata`` contains (among
other fields):

``struct path path``
~~~~~~~~~~~~~~~~~~~~

A ``path`` contains a ``struct vfsmount`` (which is
embedded in a ``struct mount``) and a ``struct dentry``. Together these
record the current status of the walk. They start out referring to the
starting point (the current working directory, the root directory, or some other
directory identified by a file descriptor), and are updated on each
step. A reference through ``d_lockref`` and ``mnt_count`` is always
held.

``struct qstr last``
~~~~~~~~~~~~~~~~~~~~

This is a string together with a length (i.e. *not* ``nul`` terminated)
that is the "next" component in the pathname.

``int last_type``
~~~~~~~~~~~~~~~~~

This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT`` or ``LAST_DOTDOT``.
The ``last`` field is only valid if the type is ``LAST_NORM``.

``struct path root``
~~~~~~~~~~~~~~~~~~~~

This is used to hold a reference to the effective root of the
filesystem. Often that reference won't be needed, so this field is
only assigned the first time it is used, or when a non-standard root
is requested. Keeping a reference in the ``nameidata`` ensures that
only one root is in effect for the entire path walk, even if it races
with a ``chroot()`` system call.

It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).

The root is needed when either of two conditions holds: (1) either the
pathname or a symbolic link starts with a "``/``", or (2) a "``..``"
component is being handled, since "``..``" from the root must always stay
at the root. The value used is usually the current root directory of
the calling process. An alternate root can be provided as when
``sysctl()`` calls ``file_open_root()``, and when NFSv4 or Btrfs call
``mount_subtree()``.
In each case a pathname is being looked up in a very
specific part of the filesystem, and the lookup must not be allowed to
escape that subtree. It works a bit like a local ``chroot()``.

Ignoring the handling of symbolic links, we can now describe the
"``link_path_walk()``" function, which handles the lookup of everything
except the final component as:

  Given a path (``name``) and a nameidata structure (``nd``), check that the
  current directory has execute permission and then advance ``name``
  over one component while updating ``last_type`` and ``last``. If that
  was the final component, then return, otherwise call
  ``walk_component()`` and repeat from the top.

``walk_component()`` is even easier. If the component is not
``LAST_NORM`` (that is, it is ``LAST_DOT`` or ``LAST_DOTDOT``),
it calls ``handle_dots()`` which does the necessary locking as already
described. If it finds a ``LAST_NORM`` component it first calls
"``lookup_fast()``" which only looks in the dcache, but will ask the
filesystem to revalidate the result if it is that sort of filesystem.
If that doesn't get a good result, it calls "``lookup_slow()``" which
takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
to find a definitive answer. Each of these will call
``follow_managed()`` (as described below) to handle any mount points.
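
The loop described above can be modelled in user space. The sketch
below is a simplification under stated assumptions: it omits the
permission check, symbolic links and all locking, and it replaces the
call to ``walk_component()`` with a counter. ``struct walk_state``,
``classify()`` and ``sketch_link_path_walk()`` are hypothetical names
made up for this example; only the ``LAST_*`` type names come from the
text.

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    enum last_type { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT };

    /* Hypothetical, much reduced model of the state kept during a walk. */
    struct walk_state {
        const char *last;           /* start of the most recent component */
        size_t last_len;            /* its length (no terminator implied) */
        enum last_type last_type;
        int steps;                  /* stand-in for walk_component() calls */
    };

    static enum last_type classify(const char *s, size_t len)
    {
        if (len == 1 && s[0] == '.')
            return LAST_DOT;
        if (len == 2 && s[0] == '.' && s[1] == '.')
            return LAST_DOTDOT;
        return LAST_NORM;
    }

    /* Process every component except the last, which is left in *ws. */
    static void sketch_link_path_walk(const char *name, struct walk_state *ws)
    {
        ws->last = name;
        ws->last_len = 0;
        ws->last_type = LAST_ROOT;  /* stays so if there are no components */
        ws->steps = 0;

        while (*name == '/')
            name++;
        while (*name) {
            size_t len = strcspn(name, "/");

            ws->last = name;
            ws->last_len = len;
            ws->last_type = classify(name, len);

            name += len;
            while (*name == '/')
                name++;
            if (!*name)
                return;             /* final component: left for the caller */
            ws->steps++;            /* here the kernel calls walk_component() */
        }
    }

For ``/usr/lib/../bin`` this model takes three steps (``usr``, ``lib``
and ``..``) and returns with ``bin`` recorded as a ``LAST_NORM`` final
component.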

In the absence of symbolic links, ``walk_component()`` creates a new
``struct path`` containing a counted reference to the new dentry and a
reference to the new ``vfsmount`` which is only counted if it is
different from the previous ``vfsmount``. It then calls
``path_to_nameidata()`` to install the new ``struct path`` in the
``struct nameidata`` and drop the unneeded references.

This "hand-over-hand" sequencing of getting a reference to the new
dentry before dropping the reference to the previous dentry may
seem obvious, but is worth pointing out so that we will recognize its
analogue in the "RCU-walk" version.

Handling the final component
----------------------------

``link_path_walk()`` only walks as far as setting ``nd->last`` and
``nd->last_type`` to refer to the final component of the path. It does
not call ``walk_component()`` that last time. Handling that final
component remains for the caller to sort out. Those callers are
``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
``path_openat()`` each of which handles the differing requirements of
different system calls.

``path_parentat()`` is clearly the simplest - it just wraps a little bit
of housekeeping around ``link_path_walk()`` and returns the parent
directory and final component to the caller.
The caller will be aiming either
to create a name (via ``filename_create()``) or to remove or rename
a name (in which case ``user_path_parent()`` is used). They will use
``i_rwsem`` to exclude other changes while they validate and then
perform their operation.

``path_lookupat()`` is nearly as simple - it is used when an existing
object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
calls ``walk_component()`` on the final component through a call to
``lookup_last()``. ``path_lookupat()`` returns just the final dentry.

``path_mountpoint()`` handles the special case of unmounting which must
not try to revalidate the mounted filesystem. It effectively
contains, through a call to ``mountpoint_last()``, an alternate
implementation of ``lookup_slow()`` which skips that step. This is
important when unmounting a filesystem that is inaccessible, such as
one provided by a dead NFS server.

Finally ``path_openat()`` is used for the ``open()`` system call; it
contains, in support functions starting with "``do_last()``", all the
complexity needed to handle the different subtleties of ``O_CREAT`` (with
or without ``O_EXCL``), final "``/``" characters, and trailing symbolic
links. We will revisit this in the final part of this series, which
focuses on those symbolic links. "``do_last()``" will sometimes, but
not always, take ``i_rwsem``, depending on what it finds.

Each of these, or the functions which call them, needs to be alert to
the possibility that the final component is not ``LAST_NORM``. If the
goal of the lookup is to create something, then any value for
``last_type`` other than ``LAST_NORM`` will result in an error. For
example if ``path_parentat()`` reports ``LAST_DOTDOT``, then the caller
won't try to create that name. They also check for trailing slashes
by testing ``last.name[last.len]``. If there is any character beyond
the final component, it must be a trailing slash.

Revalidation and automounts
---------------------------

Apart from symbolic links, there are only two parts of the "REF-walk"
process not yet covered. One is the handling of stale cache entries
and the other is automounts.

On filesystems that require it, the lookup routines will call the
``->d_revalidate()`` dentry method to ensure that the cached information
is current. This will often confirm validity or update a few details
from a server. In some cases it may find that there has been a change
further up the path and that something that was thought to be valid
previously isn't really. When this happens the lookup of the whole
path is aborted and retried with the "``LOOKUP_REVAL``" flag set.
This forces revalidation to be more thorough. We will see more details of
this retry process in the next article.

Automount points are locations in the filesystem where an attempt to
look up a name can trigger changes to how that lookup should be
handled, in particular by mounting a filesystem there. These are
covered in greater detail in autofs.txt in the Linux documentation
tree, but a few notes specifically related to path lookup are in order
here.

The Linux VFS has a concept of "managed" dentries which is reflected
in function names such as "``follow_managed()``". There are three
potentially interesting things about these dentries corresponding
to three different flags that might be set in ``dentry->d_flags``:

``DCACHE_MANAGE_TRANSIT``
~~~~~~~~~~~~~~~~~~~~~~~~~

If this flag has been set, then the filesystem has requested that the
``d_manage()`` dentry operation be called before handling any possible
mount point. This can perform two particular services:

It can block to avoid races. If an automount point is being
unmounted, the ``d_manage()`` function will usually wait for that
process to complete before letting the new lookup proceed and possibly
trigger a new automount.

It can selectively allow only some processes to transit through a
mount point. When a server process is managing automounts, it may
need to access a directory without triggering normal automount
processing. That server process can identify itself to the ``autofs``
filesystem, which will then give it a special pass through
``d_manage()`` by returning ``-EISDIR``.

``DCACHE_MOUNTED``
~~~~~~~~~~~~~~~~~~

This flag is set on every dentry that is mounted on. As Linux
supports multiple filesystem namespaces, it is possible that the
dentry may not be mounted on in *this* namespace, just in some
other. So this flag is seen as a hint, not a promise.

If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``,
``lookup_mnt()`` is called to examine the mount hash table (honoring the
``mount_lock`` described earlier) and possibly return a new ``vfsmount``
and a new ``dentry`` (both with counted references).

``DCACHE_NEED_AUTOMOUNT``
~~~~~~~~~~~~~~~~~~~~~~~~~

If ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't
find a mount point, then this flag causes the ``d_automount()`` dentry
operation to be called.
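
The order in which the three flags are consulted can be summarised in a
small decision function. This is only a sketch of a single pass: the
flag values, ``struct sketch_dentry`` and ``sketch_follow_managed()``
are hypothetical, the result of ``d_manage()`` and of the mount hash
lookup are supplied as plain fields rather than computed, and the
repeated checking needed after actually crossing a mount point is left
out.

.. code-block:: c

    #include <errno.h>
    #include <stdbool.h>

    /* Hypothetical flag values, chosen only for this sketch. */
    #define SK_MANAGE_TRANSIT  0x1u
    #define SK_MOUNTED         0x2u
    #define SK_NEED_AUTOMOUNT  0x4u

    enum managed_action { ACT_NONE, ACT_CROSS_MOUNT, ACT_AUTOMOUNT };

    struct sketch_dentry {
        unsigned int flags;
        int manage_result;  /* what d_manage() would return */
        bool mounted_here;  /* whether the mount lookup finds something */
    };

    /* Returns 0 and sets *act, or a negative errno from the manage step. */
    static int sketch_follow_managed(const struct sketch_dentry *d,
                                     enum managed_action *act)
    {
        *act = ACT_NONE;

        if (d->flags & SK_MANAGE_TRANSIT) {
            int ret = d->manage_result;

            if (ret == -EISDIR)
                return 0;   /* special pass: treat as a plain directory */
            if (ret < 0)
                return ret;
        }
        /* The mounted flag is only a hint, so the lookup may find nothing. */
        if ((d->flags & SK_MOUNTED) && d->mounted_here) {
            *act = ACT_CROSS_MOUNT;
            return 0;
        }
        if (d->flags & SK_NEED_AUTOMOUNT)
            *act = ACT_AUTOMOUNT;
        return 0;
    }

Note how a dentry that carries the mounted hint but has nothing mounted
in the current namespace falls through to the automount check.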

The ``d_automount()`` operation can be arbitrarily complex and may
communicate with server processes etc. but it should ultimately either
report that there was an error, that there was nothing to mount, or
should provide an updated ``struct path`` with new ``dentry`` and ``vfsmount``.

In the last case, ``finish_automount()`` will be called to safely
install the new mount point into the mount table.

There is no new locking of import here and it is important that no
locks (only counted references) are held over this processing due to
the very real possibility of extended delays.
This will become more important next time when we examine RCU-walk
which is particularly sensitive to delays.

RCU-walk - faster pathname lookup in Linux
==========================================

RCU-walk is another algorithm for performing pathname lookup in Linux.
It is in many ways similar to REF-walk and the two share quite a bit
of code. The significant difference in RCU-walk is how it allows for
the possibility of concurrent access.

We noted that REF-walk is complex because there are numerous details
and special cases. RCU-walk reduces this complexity by simply
refusing to handle a number of cases -- it instead falls back to
REF-walk.
The difficulty with RCU-walk comes from a different
direction: unfamiliarity. The locking rules when depending on RCU are
quite different from traditional locking, so we will spend a little extra
time when we come to those.

Clear demarcation of roles
--------------------------

The easiest way to manage concurrency is to forcibly stop any other
thread from changing the data structures that a given thread is
looking at. In cases where no other thread would even think of
changing the data and lots of different threads want to read at the
same time, this can be very costly. Even when using locks that permit
multiple concurrent readers, the simple act of updating the count of
the number of current readers can impose an unwanted cost. So the
goal when reading a shared data structure that no other process is
changing is to avoid writing anything to memory at all. Take no
locks, increment no counts, leave no footprints.

The REF-walk mechanism already described certainly doesn't follow this
principle, but then it is really designed to work when there may well
be other threads modifying the data. RCU-walk, in contrast, is
designed for the common situation where there are lots of frequent
readers and only occasional writers. This may not be common in all
parts of the filesystem tree, but in many parts it will be.
For the
other parts it is important that RCU-walk can quickly fall back to
using REF-walk.

Pathname lookup always starts in RCU-walk mode but only remains there
as long as what it is looking for is in the cache and is stable. It
dances lightly down the cached filesystem image, leaving no footprints
and carefully watching where it is, to be sure it doesn't trip. If it
notices that something has changed or is changing, or if something
isn't in the cache, then it tries to stop gracefully and switch to
REF-walk.

This stopping requires getting a counted reference on the current
``vfsmount`` and ``dentry``, and ensuring that these are still valid -
that a path walk with REF-walk would have found the same entries.
This is an invariant that RCU-walk must guarantee. It can only make
decisions, such as selecting the next step, that are decisions which
REF-walk could also have made if it were walking down the tree at the
same time. If the graceful stop succeeds, the rest of the path is
processed with the reliable, if slightly sluggish, REF-walk. If
RCU-walk finds it cannot stop gracefully, it simply gives up and
restarts from the top with REF-walk.
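
The "give up and restart from the top" behaviour amounts to calling the
same lookup routine again in a more conservative mode. The sketch below
models that with a function pointer. ``lookup_with_fallback()``,
``fake_lookup()``, ``struct fake_fs`` and the numeric flag values are
hypothetical and exist only for this example; the error codes
``ECHILD`` and ``ESTALE`` and the roles of the RCU and revalidation
flags follow the kernel's usage.

.. code-block:: c

    #include <errno.h>

    /* Hypothetical flag values standing in for LOOKUP_RCU / LOOKUP_REVAL. */
    #define SKETCH_LOOKUP_RCU    0x1u
    #define SKETCH_LOOKUP_REVAL  0x2u

    typedef int (*lookup_fn)(const char *name, unsigned int flags, void *ctx);

    /* Try the cheapest mode first, then fall back step by step. */
    static int lookup_with_fallback(lookup_fn fn, const char *name,
                                    unsigned int flags, void *ctx)
    {
        int ret = fn(name, flags | SKETCH_LOOKUP_RCU, ctx);

        if (ret == -ECHILD)     /* RCU-walk could not finish: use REF-walk */
            ret = fn(name, flags, ctx);
        if (ret == -ESTALE)     /* cached data was stale: force revalidation */
            ret = fn(name, flags | SKETCH_LOOKUP_REVAL, ctx);
        return ret;
    }

    /* Toy lookup used only to exercise the wrapper above. */
    struct fake_fs {
        int calls;
    };

    static int fake_lookup(const char *name, unsigned int flags, void *ctx)
    {
        struct fake_fs *fs = ctx;

        (void)name;
        fs->calls++;
        if (flags & SKETCH_LOOKUP_RCU)
            return -ECHILD;
        if (!(flags & SKETCH_LOOKUP_REVAL))
            return -ESTALE;
        return 0;
    }

With ``fake_lookup()``, which refuses both of the first two modes, the
wrapper makes three calls before it succeeds.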

This pattern of "try RCU-walk, if that fails try REF-walk" can be
clearly seen in functions like ``filename_lookup()``,
``filename_parentat()``, ``filename_mountpoint()``,
``do_filp_open()``, and ``do_file_open_root()``. These five
correspond roughly to the four ``path_*()`` functions we met earlier,
each of which calls ``link_path_walk()``. The ``path_*()`` functions are
called using different mode flags until a mode is found which works.
They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
that fails with the error ``ECHILD`` they are called again with no
special flag to request "REF-walk". If either of those reports the
error ``ESTALE`` a final attempt is made with ``LOOKUP_REVAL`` set (and no
``LOOKUP_RCU``) to ensure that entries found in the cache are forcibly
revalidated - normally entries are only revalidated if the filesystem
determines that they are too old to trust.

The ``LOOKUP_RCU`` attempt may drop that flag internally and switch to
REF-walk, but will never then try to switch back to RCU-walk. Places
that trip up RCU-walk are much more likely to be near the leaves and
so it is very unlikely that there will be much, if any, benefit from
switching back.

RCU and seqlocks: fast and light
--------------------------------

RCU is, unsurprisingly, critical to RCU-walk mode.
The
``rcu_read_lock()`` is held for the entire time that RCU-walk is walking
down a path. The particular guarantee it provides is that the key
data structures - dentries, inodes, super_blocks, and mounts - will
not be freed while the lock is held. They might be unlinked or
invalidated in one way or another, but the memory will not be
repurposed so values in various fields will still be meaningful. This
is the only guarantee that RCU provides; everything else is done using
seqlocks.

As we saw above, REF-walk holds a counted reference to the current
dentry and the current vfsmount, and does not release those references
before taking references to the "next" dentry or vfsmount. It also
sometimes takes the ``d_lock`` spinlock. These references and locks are
taken to prevent certain changes from happening. RCU-walk must not
take those references or locks and so cannot prevent such changes.
Instead, it checks to see if a change has been made, and aborts or
retries if it has.

To preserve the invariant mentioned above (that RCU-walk may only make
decisions that REF-walk could have made), it must make the checks at
or near the same places that REF-walk holds the references. So, when
REF-walk increments a reference count or takes a spinlock, RCU-walk
samples the status of a seqlock using ``read_seqcount_begin()`` or a
similar function.
When REF-walk decrements the count or drops the
lock, RCU-walk checks if the sampled status is still valid using
``read_seqcount_retry()`` or similar.

However, there is a little bit more to seqlocks than that. If
RCU-walk accesses two different fields in a seqlock-protected
structure, or accesses the same field twice, there is no a priori
guarantee of any consistency between those accesses. When consistency
is needed - which it usually is - RCU-walk must take a copy and then
use ``read_seqcount_retry()`` to validate that copy.

``read_seqcount_retry()`` not only checks the sequence number, but also
imposes a memory barrier so that no memory-read instruction from
*before* the call can be delayed until *after* the call, either by the
CPU or by the compiler. A simple example of this can be seen in
``slow_dentry_cmp()`` which, for filesystems which do not use simple
byte-wise name equality, calls into the filesystem to compare a name
against a dentry. The length and name pointer are copied into local
variables, then ``read_seqcount_retry()`` is called to confirm the two
are consistent, and only then is ``->d_compare()`` called. When
standard filename comparison is used, ``dentry_cmp()`` is called
instead. Notably it does *not* use ``read_seqcount_retry()``, but
instead has a large comment explaining why the consistency guarantee
isn't necessary.
A subsequent ``read_seqcount_retry()`` will be
sufficient to catch any problem that could occur at this point.

With that little refresher on seqlocks out of the way we can look at
the bigger picture of how RCU-walk uses seqlocks.

``mount_lock`` and ``nd->m_seq``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We already met the ``mount_lock`` seqlock when REF-walk used it to
ensure that crossing a mount point is performed safely. RCU-walk uses
it for that too, but for quite a bit more.

Instead of taking a counted reference to each ``vfsmount`` as it
descends the tree, RCU-walk samples the state of ``mount_lock`` at the
start of the walk and stores this initial sequence number in the
``struct nameidata`` in the ``m_seq`` field. This one lock and one
sequence number are used to validate all accesses to all ``vfsmounts``,
and all mount point crossings. As changes to the mount table are
relatively rare, it is reasonable to fall back on REF-walk any time
that any "mount" or "unmount" happens.

``m_seq`` is checked (using ``read_seqretry()``) at the end of an RCU-walk
sequence, whether switching to REF-walk for the rest of the path or
when the end of the path is reached. It is also checked when stepping
down over a mount point (in ``__follow_mount_rcu()``) or up (in
``follow_dotdot_rcu()``).
If it is ever found to have changed, the
whole RCU-walk sequence is aborted and the path is processed again by
REF-walk.

If RCU-walk finds that ``mount_lock`` hasn't changed then it can be sure
that, had REF-walk taken counted references on each vfsmount, the
results would have been the same. This ensures the invariant holds,
at least for vfsmount structures.

``dentry->d_seq`` and ``nd->seq``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In place of taking a count or lock on ``d_lockref``, RCU-walk samples
the per-dentry ``d_seq`` seqlock, and stores the sequence number in the
``seq`` field of the nameidata structure, so ``nd->seq`` should always be
the current sequence number of ``nd->path.dentry``. This number needs to be
revalidated after copying, and before using, the name, parent, or
inode of the dentry.

The handling of the name we have already looked at, and the parent is
only accessed in ``follow_dotdot_rcu()`` which fairly trivially follows
the required pattern, though it does so for three different cases.

When not at a mount point, ``d_parent`` is followed and its ``d_seq`` is
collected. When we are at a mount point, we instead follow the
``mnt->mnt_mountpoint`` link to get a new dentry and collect its
``d_seq``.
Then, after finally finding a ``d_parent`` to follow, we must
check if we have landed on a mount point and, if so, must find that
mount point and follow the ``mnt->mnt_root`` link. This would imply a
somewhat unusual, but certainly possible, circumstance where the
starting point of the path lookup was in part of the filesystem that
was mounted on, and so not visible from the root.

The inode pointer, stored in ``->d_inode``, is a little more
interesting. The inode will always need to be accessed at least
twice, once to determine if it is NULL and once to verify access
permissions. Symlink handling requires a validated inode pointer too.
Rather than revalidating on each access, a copy is made on the first
access and it is stored in the ``inode`` field of ``nameidata`` from where
it can be safely accessed without further validation.

``lookup_fast()`` is the only lookup routine that is used in RCU-mode,
``lookup_slow()`` being too slow and requiring locks. It is in
``lookup_fast()`` that we find the important "hand over hand" tracking
of the current dentry.

The current ``dentry`` and current ``seq`` number are passed to
``__d_lookup_rcu()`` which, on success, returns a new ``dentry`` and a
new ``seq`` number. ``lookup_fast()`` then copies the inode pointer and
revalidates the new ``seq`` number.
It then validates the old ``dentry``
with the old ``seq`` number one last time and only then continues. This
process of getting the ``seq`` number of the new dentry and then
checking the ``seq`` number of the old exactly mirrors the process of
getting a counted reference to the new dentry before dropping that for
the old dentry which we saw in REF-walk.

No ``inode->i_rwsem`` or even ``rename_lock``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A semaphore is a fairly heavyweight lock that can only be taken when it is
permissible to sleep. As ``rcu_read_lock()`` forbids sleeping,
``inode->i_rwsem`` plays no role in RCU-walk. If some other thread does
take ``i_rwsem`` and modifies the directory in a way that RCU-walk needs
to notice, the result will be either that RCU-walk fails to find the
dentry that it is looking for, or it will find a dentry which
``read_seqretry()`` won't validate. In either case it will drop down to
REF-walk mode which can take whatever locks are needed.

Though ``rename_lock`` could be used by RCU-walk as it doesn't require
any sleeping, RCU-walk doesn't bother. REF-walk uses ``rename_lock`` to
protect against the possibility of hash chains in the dcache changing
while they are being searched. This can result in failing to find
something that actually is there.
When RCU-walk fails to find
something in the dentry cache, whether it is really there or not, it
already drops down to REF-walk and tries again with appropriate
locking. This neatly handles all cases, so adding extra checks on
``rename_lock`` would bring no significant value.

``unlazy_walk()`` and ``complete_walk()``
-----------------------------------------

That "dropping down to REF-walk" typically involves a call to
``unlazy_walk()``, so named because "RCU-walk" is also sometimes
referred to as "lazy walk". ``unlazy_walk()`` is called when
following the path down to the current vfsmount/dentry pair seems to
have proceeded successfully, but the next step is problematic. This
can happen if the next name cannot be found in the dcache, if
permission checking or name revalidation couldn't be achieved while
the ``rcu_read_lock()`` is held (which forbids sleeping), if an
automount point is found, or in a couple of cases involving symlinks.
It is also called from ``complete_walk()`` when the lookup has reached
the final component, or the very end of the path, depending on which
particular flavor of lookup is used.

Other reasons for dropping out of RCU-walk that do not trigger a call
to ``unlazy_walk()`` are when some inconsistency is found that cannot be
handled immediately, such as ``mount_lock`` or one of the ``d_seq``
seqlocks reporting a change. In these cases the relevant function
will return ``-ECHILD`` which will percolate up until it triggers a new
attempt from the top using REF-walk.

For those cases where ``unlazy_walk()`` is an option, it essentially
takes a reference on each of the pointers that it holds (vfsmount,
dentry, and possibly some symbolic links) and then verifies that the
relevant seqlocks have not been changed. If there have been changes,
it, too, aborts with ``-ECHILD``, otherwise the transition to REF-walk
has been a success and the lookup process continues.

Taking a reference on those pointers is not quite as simple as just
incrementing a counter. That works to take a second reference if you
already have one (often indirectly through another object), but it
isn't sufficient if you don't actually have a counted reference at
all. For ``dentry->d_lockref``, it is safe to increment the reference
counter to get a reference unless it has been explicitly marked as
"dead" which involves setting the counter to ``-128``.
``lockref_get_not_dead()`` achieves this.
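
The "increment unless dead" rule can be sketched in user-space C11.
This is only an illustration under stated assumptions, not the kernel's
``lockref`` code: ``struct toy_ref`` and the ``toy_*()`` functions are
hypothetical names, and the spinlock that the real ``lockref`` keeps next
to the count is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference count that can be marked dead. */
struct toy_ref {
    atomic_int count;
};

/* Mark the object dead by storing a negative count, as the text describes. */
static void toy_mark_dead(struct toy_ref *r)
{
    atomic_store(&r->count, -128);
}

/*
 * Take a reference only if the object has not been marked dead.
 * Returns true and increments the count on success; returns false and
 * leaves the count untouched if it is negative.
 */
static bool toy_get_not_dead(struct toy_ref *r)
{
    int old = atomic_load(&r->count);

    while (old >= 0) {
        /* On failure, "old" is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference that the caller really holds. */
static void toy_put(struct toy_ref *r)
{
    int old = atomic_fetch_sub(&r->count, 1);

    assert(old > 0);    /* dropping a reference we never had is a bug */
    (void)old;
}
```

The point of the compare-and-exchange loop is that the check for "dead"
and the increment happen as one step, so a concurrent ``toy_mark_dead()``
can never be overwritten by a late increment.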

For ``mnt->mnt_count`` it is safe to take a reference as long as
``mount_lock`` is then used to validate the reference. If that
validation fails, it may *not* be safe to just drop that reference in
the standard way of calling ``mnt_put()`` - an unmount may have
progressed too far. So the code in ``legitimize_mnt()``, when it
finds that the reference it got might not be safe, checks the
``MNT_SYNC_UMOUNT`` flag to determine if a simple ``mnt_put()`` is
correct, or if it should just decrement the count and pretend none of
this ever happened.

Taking care in filesystems
--------------------------

RCU-walk depends almost entirely on cached information and often will
not call into the filesystem at all. However there are two places,
besides the already-mentioned component-name comparison, where the
filesystem might be involved in RCU-walk, and it must know to be
careful.

If the filesystem has non-standard permission-checking requirements -
such as a networked filesystem which may need to check with the server
- the ``i_op->permission`` interface might be called during RCU-walk.
In this case an extra "``MAY_NOT_BLOCK``" flag is passed so that it
knows not to sleep, but to return ``-ECHILD`` if it cannot complete
promptly.
``i_op->permission`` is given the inode pointer, not the
dentry, so it doesn't need to worry about further consistency checks.
However if it accesses any other filesystem data structures, it must
ensure they are safe to be accessed with only the ``rcu_read_lock()``
held. This typically means they must be freed using ``kfree_rcu()`` or
similar.

.. _READ_ONCE: https://lwn.net/Articles/624126/

If the filesystem may need to revalidate dcache entries, then
``d_op->d_revalidate`` may be called in RCU-walk too. This interface
*is* passed the dentry but does not have access to the ``inode`` or the
``seq`` number from the ``nameidata``, so it needs to be extra careful
when accessing fields in the dentry. This "extra care" typically
involves using `READ_ONCE() <READ_ONCE_>`_ to access fields, and verifying the
result is not NULL before using it. This pattern can be seen in
``nfs_lookup_revalidate()``.

A pair of patterns
------------------

In various places in the details of REF-walk and RCU-walk, and also in
the big picture, there are a couple of related patterns that are worth
being aware of.

The first is "try quickly and check, if that fails try slowly".
We can see that in the high-level approach of first trying RCU-walk and
then trying REF-walk, and in places where ``unlazy_walk()`` is used to
switch to REF-walk for the rest of the path. We also saw it earlier
in ``dget_parent()`` when following a "``..``" link. It tries a quick way
to get a reference, then falls back to taking locks if needed.

The second pattern is "try quickly and check, if that fails try
again - repeatedly". This is seen with the use of ``rename_lock`` and
``mount_lock`` in REF-walk. RCU-walk doesn't make use of this pattern -
if anything goes wrong it is much safer to just abort and try a more
sedate approach.

The emphasis here is "try quickly and check". It should probably be
"try quickly *and carefully*, then check". The fact that checking is
needed is a reminder that the system is dynamic and only a limited
number of things are safe at all. The most likely cause of errors in
this whole process is assuming something is safe when in reality it
isn't. Careful consideration of what exactly guarantees the safety of
each access is sometimes necessary.
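
The first pattern can be illustrated with a small user-space sketch
built on a sequence counter. Everything here is an assumption made for
illustration: ``struct toy_pair`` and the ``toy_*()`` helpers are
hypothetical, and the kernel's own seqlock primitives are more careful
about memory ordering than this C11 version needs to be.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical pair of values guarded by a sequence counter and a lock. */
struct toy_pair {
    atomic_uint seq;    /* even: stable, odd: an update is in progress */
    atomic_flag lock;   /* taken by writers and by the slow read path */
    atomic_int a, b;
};

static void toy_pair_init(struct toy_pair *p)
{
    atomic_init(&p->seq, 0);
    atomic_flag_clear(&p->lock);
    atomic_init(&p->a, 0);
    atomic_init(&p->b, 0);
}

static void toy_lock(struct toy_pair *p)
{
    while (atomic_flag_test_and_set(&p->lock))
        ;   /* spin until the lock is free */
}

static void toy_unlock(struct toy_pair *p)
{
    atomic_flag_clear(&p->lock);
}

/* Writer: make the counter odd, update both values, make it even again. */
static void toy_write(struct toy_pair *p, int a, int b)
{
    toy_lock(p);
    assert(!(atomic_load(&p->seq) & 1));    /* no update may be pending */
    atomic_fetch_add(&p->seq, 1);
    atomic_store(&p->a, a);
    atomic_store(&p->b, b);
    atomic_fetch_add(&p->seq, 1);
    toy_unlock(p);
}

/* "Try quickly and check": no lock, but the result may be rejected. */
static bool toy_read_fast(struct toy_pair *p, int *a, int *b)
{
    unsigned int start = atomic_load(&p->seq);

    if (start & 1)
        return false;
    *a = atomic_load(&p->a);
    *b = atomic_load(&p->b);
    return atomic_load(&p->seq) == start;
}

/* "If that fails try slowly". Returns true if the fast path sufficed. */
static bool toy_read(struct toy_pair *p, int *a, int *b)
{
    if (toy_read_fast(p, a, b))
        return true;
    toy_lock(p);
    *a = atomic_load(&p->a);
    *b = atomic_load(&p->b);
    toy_unlock(p);
    return false;
}
```

The second pattern would instead call ``toy_read_fast()`` in a loop until
it succeeds, which is reasonable only when the caller is allowed to wait.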

A walk among the symlinks
=========================

There are several basic issues that we will examine to understand the
handling of symbolic links: the symlink stack, together with cache
lifetimes, will help us understand the overall recursive handling of
symlinks and lead to the special care needed for the final component.
Then a consideration of access-time updates and summary of the various
flags controlling lookup will finish the story.

The symlink stack
-----------------

There are only two sorts of filesystem objects that can usefully
appear in a path prior to the final component: directories and symlinks.
Handling directories is quite straightforward: the new directory
simply becomes the starting point at which to interpret the next
component on the path. Handling symbolic links requires a bit more
work.

Conceptually, symbolic links could be handled by editing the path. If
a component name refers to a symbolic link, then that component is
replaced by the body of the link and, if that body starts with a '/',
then all preceding parts of the path are discarded. This is what the
"``readlink -f``" command does, though it also edits out "``.``" and
"``..``" components.
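
As a purely conceptual illustration of that editing step, a minimal
user-space sketch might look as follows. ``toy_edit_path()`` is a
hypothetical helper, not something the kernel or ``readlink`` provides,
and it does not handle "``.``" or "``..``" components.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the path obtained by replacing one symlink component with the
 * body of the link.
 *   dir:  the part of the path before the symlink component
 *   body: the content of the symlink
 *   rest: the components after the symlink ("" if there are none)
 * Returns 0 on success, -1 if the result does not fit in "out".
 */
static int toy_edit_path(const char *dir, const char *body, const char *rest,
                         char *out, size_t outsz)
{
    size_t dlen = strlen(dir);
    /* Avoid a doubled slash when "dir" already ends with one. */
    const char *join = (dlen && dir[dlen - 1] == '/') ? "" : "/";
    const char *sep = rest[0] ? "/" : "";
    int n;

    assert(out && outsz > 0);
    if (body[0] == '/')     /* absolute body: preceding parts are discarded */
        n = snprintf(out, outsz, "%s%s%s", body, sep, rest);
    else                    /* relative body: replaces just the one component */
        n = snprintf(out, outsz, "%s%s%s%s%s", dir, join, body, sep, rest);

    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

With ``dir`` set to ``/usr``, a link body of ``lib64`` and a remainder of
``libc.so``, the edited path is ``/usr/lib64/libc.so``; with a body of
``/opt/lib`` it becomes ``/opt/lib/libc.so``.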

Directly editing the path string is not really necessary when looking
up a path, and discarding early components is pointless as they aren't
looked at anyway. Keeping track of all remaining components is
important, but they can of course be kept separately; there is no need
to concatenate them. As one symlink may easily refer to another,
which in turn can refer to a third, we may need to keep the remaining
components of several paths, each to be processed when the preceding
ones are completed. These path remnants are kept on a stack of
limited size.

There are two reasons for placing limits on how many symlinks can
occur in a single path lookup. The most obvious is to avoid loops.
If a symlink referred to itself either directly or through
intermediaries, then following the symlink can never complete
successfully - the error ``ELOOP`` must be returned. Loops can be
detected without imposing limits, but limits are the simplest solution
and, given the second reason for restriction, quite sufficient.

.. _outlined recently: http://thread.gmane.org/gmane.linux.kernel/1934390/focus=1934550

The second reason was `outlined recently`_ by Linus:

   Because it's a latency and DoS issue too. We need to react well to
   true loops, but also to "very deep" non-loops. It's not about memory
   use, it's about users triggering unreasonable CPU resources.

Linux imposes a limit on the length of any pathname: ``PATH_MAX``, which
is 4096. There are a number of reasons for this limit; not letting the
kernel spend too much time on just one path is one of them. With
symbolic links you can effectively generate much longer paths so some
sort of limit is needed for the same reason. Linux imposes a limit of
at most 40 symlinks in any one path lookup. It previously imposed a
further limit of eight on the maximum depth of recursion, but that was
raised to 40 when a separate stack was implemented, so there is now
just the one limit.

The ``nameidata`` structure that we met in an earlier article contains a
small stack that can be used to store the remaining part of up to two
symlinks. In many cases this will be sufficient. If it isn't, a
separate stack is allocated with room for 40 symlinks. Pathname
lookup will never exceed that stack as, once the 40th symlink is
detected, an error is returned.

It might seem that the name remnants are all that needs to be stored on
this stack, but we need a bit more. To see that, we need to move on to
cache lifetimes.
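
The "small stack first, larger stack on demand" arrangement can be
sketched in user-space C. This is not the kernel's ``nameidata`` code:
``struct toy_stack``, the ``toy_stack_*()`` functions and the ``TOY_*``
constants are hypothetical, and the sketch assumes that 40 symlinks are
accepted and the next one is refused.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define TOY_EMBEDDED_LEVELS 2   /* entries held inside the structure itself */
#define TOY_MAXSYMLINKS 40      /* symlinks allowed in one lookup */

struct toy_stack {
    const char *internal[TOY_EMBEDDED_LEVELS];
    const char **slots;         /* internal[] at first, heap memory later */
    unsigned int depth;         /* remnants currently stored */
    unsigned int total_links;   /* symlinks met so far in this lookup */
};

static void toy_stack_init(struct toy_stack *s)
{
    s->slots = s->internal;
    s->depth = 0;
    s->total_links = 0;
}

/* Store a path remnant; returns 0, -ELOOP or -ENOMEM. */
static int toy_stack_push(struct toy_stack *s, const char *remnant)
{
    if (s->total_links >= TOY_MAXSYMLINKS)
        return -ELOOP;
    s->total_links++;

    if (s->slots == s->internal && s->depth == TOY_EMBEDDED_LEVELS) {
        /* The small stack is full: switch to a full-sized one. */
        const char **big = malloc(TOY_MAXSYMLINKS * sizeof(*big));

        if (!big)
            return -ENOMEM;
        memcpy(big, s->internal, sizeof(s->internal));
        s->slots = big;
    }
    assert(s->depth < TOY_MAXSYMLINKS);
    s->slots[s->depth++] = remnant;
    return 0;
}

/* Return the most recently stored remnant, or NULL if there is none. */
static const char *toy_stack_pop(struct toy_stack *s)
{
    return s->depth ? s->slots[--s->depth] : NULL;
}

static void toy_stack_release(struct toy_stack *s)
{
    if (s->slots != s->internal)
        free(s->slots);
    toy_stack_init(s);
}
```

Because the depth can never exceed the number of symlinks met so far, the
single counter check is enough to guarantee that the larger array is
never overrun.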

Storage and lifetime of cached symlinks
---------------------------------------

Like other filesystem resources, such as inodes and directory
entries, symlinks are cached by Linux to avoid repeated costly access
to external storage. It is particularly important for RCU-walk to be
able to find and temporarily hold onto these cached entries, so that
it doesn't need to drop down into REF-walk.

.. _object-oriented design pattern: https://lwn.net/Articles/446317/

While each filesystem is free to make its own choice, symlinks are
typically stored in one of two places. Short symlinks are often
stored directly in the inode. When a filesystem allocates a
``struct inode`` it typically allocates extra space to store private data (a
common `object-oriented design pattern`_ in the kernel). This will
sometimes include space for a symlink. The other common location is
in the page cache, which normally stores the content of files. The
pathname in a symlink can be seen as the content of that symlink and
can easily be stored in the page cache just like file content.

When neither of these is suitable, the next most likely scenario is
that the filesystem will allocate some temporary memory and copy or
construct the symlink content into that memory whenever it is needed.

When the symlink is stored in the inode, it has the same lifetime as
the inode which, itself, is protected by RCU or by a counted reference
on the dentry. This means that the mechanisms that pathname lookup
uses to access the dcache and icache (inode cache) safely are quite
sufficient for accessing some cached symlinks safely. In these cases,
the ``i_link`` pointer in the inode is set to point to wherever the
symlink is stored and it can be accessed directly whenever needed.

When the symlink is stored in the page cache or elsewhere, the
situation is not so straightforward. A reference on a dentry or even
on an inode does not imply any reference on cached pages of that
inode, and even an ``rcu_read_lock()`` is not sufficient to ensure that
a page will not disappear. So for these symlinks the pathname lookup
code needs to ask the filesystem to provide a stable reference and,
significantly, needs to release that reference when it is finished
with it.

Taking a reference to a cache page is often possible even in RCU-walk
mode. It does require making changes to memory, which is best avoided,
but that isn't necessarily a big cost and it is better than dropping
out of RCU-walk mode completely.
Even filesystems that allocate
space to copy the symlink into can often allocate that memory
successfully with ``GFP_ATOMIC``, without needing to drop out of RCU-walk. If a
filesystem cannot successfully get a reference in RCU-walk mode, it
must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
REF-walk mode in which the filesystem is allowed to sleep.

The place for all this to happen is the ``i_op->follow_link()`` inode
method. In the present mainline code this is never actually called in
RCU-walk mode as the rewrite is not quite complete. It is likely that
in a future release this method will be passed an ``inode`` pointer when
called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
validated pointer. Much like the ``i_op->permission()`` method we
looked at previously, ``->follow_link()`` would need to be careful that
all the data structures it references are safe to be accessed while
holding no counted reference, only the RCU lock. Though getting a
reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
code is ready to release the reference when that does happen.

This need to drop the reference to a symlink adds significant
complexity. It requires a reference to the inode so that the
``i_op->put_link()`` inode operation can be called.
In REF-walk, that
reference is kept implicitly through a reference to the dentry, so
keeping the ``struct path`` of the symlink is easiest. For RCU-walk,
the pointer to the inode is kept separately. To allow switching from
RCU-walk back to REF-walk in the middle of processing nested symlinks
we also need the ``seq`` number for the dentry so we can confirm that
switching back was safe.

Finally, when providing a reference to a symlink, the filesystem also
provides an opaque "cookie" that must be passed to ``->put_link()`` so that it
knows what to free. This might be the allocated memory area, or a
pointer to the ``struct page`` in the page cache, or something else
completely. Only the filesystem knows what it is.

In order for the reference to each symlink to be dropped when the walk completes,
whether in RCU-walk or REF-walk, the symlink stack needs to contain,
along with the path remnants:

- the ``struct path`` to provide a reference to the inode in REF-walk
- the ``struct inode *`` to provide a reference to the inode in RCU-walk
- the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
- the ``cookie`` that tells ``->put_link()`` what to put.

This means that each entry in the symlink stack needs to hold five
pointers and an integer instead of just one pointer (the path
remnant).
On a 64-bit system, this is about 40 bytes per entry;
with 40 entries it adds up to 1600 bytes total, which is less than
half a page. So it might seem like a lot, but is by no means
excessive.

Note that, in a given stack frame, the path remnant (``name``) is not
part of the symlink that the other fields refer to. It is the remnant
to be followed once that symlink has been fully parsed.

Following the symlink
---------------------

The main loop in ``link_path_walk()`` iterates seamlessly over all
components in the path and all of the non-final symlinks. As symlinks
are processed, the ``name`` pointer is adjusted to point to a new
symlink, or is restored from the stack, so that much of the loop
doesn't need to notice. Getting this ``name`` variable on and off the
stack is very straightforward; pushing and popping the references is
a little more complex.

When a symlink is found, ``walk_component()`` returns the value ``1``
(``0`` is returned for any other sort of success, and a negative number
is, as usual, an error indicator). This causes ``get_link()`` to be
called; it then gets the link from the filesystem. Providing that
operation is successful, the old path ``name`` is placed on the stack,
and the new value is used as the ``name`` for a while.
When the end of
the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored
off the stack and path walking continues.

Pushing and popping the reference pointers (inode, cookie, etc.) is more
complex in part because of the desire to handle tail recursion. When
the last component of a symlink itself points to a symlink, we
want to pop the symlink-just-completed off the stack before pushing
the symlink-just-found to avoid leaving empty path remnants that would
just get in the way.

It is most convenient to push the new symlink references onto the
stack in ``walk_component()`` immediately when the symlink is found;
``walk_component()`` is also the last piece of code that needs to look at the
old symlink as it walks that last component. So it is quite
convenient for ``walk_component()`` to release the old symlink and pop
the references just before pushing the reference information for the
new symlink. It is guided in this by two flags: ``WALK_GET``, which
gives it permission to follow a symlink if it finds one, and
``WALK_PUT``, which tells it to release the current symlink after it has been
followed. ``WALK_PUT`` is tested first, leading to a call to
``put_link()``. ``WALK_GET`` is tested subsequently (by
``should_follow_link()``) leading to a call to ``pick_link()`` which sets
up the stack frame.
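
The way the ``name`` pointer moves on and off the stack, including the
tail case, can be seen in a toy user-space walker. It is a sketch under
heavy simplifying assumptions and is not how ``link_path_walk()`` is
written: symlinks come from a hypothetical in-memory table and are
matched by component name alone, whatever directory they appear in, every
walk starts at the root, "``..``" is not handled, and no references or
cookies are tracked. All ``toy_*`` names are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAXSYMLINKS 40

/* Hypothetical table entry: a component name that is a symlink, and its body. */
struct toy_link {
    const char *name;
    const char *body;
};

/* Return the body if the len-byte component is a symlink, else NULL. */
static const char *toy_readlink(const struct toy_link *links, size_t nlinks,
                                const char *comp, size_t len)
{
    for (size_t i = 0; i < nlinks; i++)
        if (strlen(links[i].name) == len &&
            memcmp(links[i].name, comp, len) == 0)
            return links[i].body;
    return NULL;
}

/*
 * Walk "name" and record in "out" the non-symlink components visited.
 * Returns 0 on success or a negative errno value.
 */
static int toy_walk(const struct toy_link *links, size_t nlinks,
                    const char *name, char *out, size_t outsz)
{
    const char *stack[TOY_MAXSYMLINKS];     /* saved path remnants */
    unsigned int depth = 0, total = 0;
    size_t used = 0;

    assert(outsz > 0);
    out[0] = '\0';
    for (;;) {
        while (*name == '/')
            name++;
        if (!*name) {                       /* end of this name */
            if (!depth)
                return 0;
            name = stack[--depth];          /* resume the saved remnant */
            continue;
        }

        size_t len = strcspn(name, "/");
        const char *body = toy_readlink(links, nlinks, name, len);

        if (!body) {                        /* ordinary component */
            if (used + len + 2 > outsz)
                return -ENAMETOOLONG;
            out[used++] = '/';
            memcpy(out + used, name, len);
            used += len;
            out[used] = '\0';
            name += len;
            continue;
        }

        if (total++ >= TOY_MAXSYMLINKS)
            return -ELOOP;
        name += len;
        while (*name == '/')
            name++;
        if (*name)                          /* tail case: nothing to save */
            stack[depth++] = name;
        if (body[0] == '/') {               /* absolute link: back to the root */
            used = 0;
            out[0] = '\0';
        }
        name = body;
    }
}
```

Not pushing an empty remnant is what keeps a chain of symlinks whose last
component is another symlink from consuming stack entries.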

Symlinks with no final component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A pair of special-case symlinks deserve a little further explanation.
Both result in a new ``struct path`` (with mount and dentry) being set
up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``.

The more obvious case is a symlink to "``/``". All symlinks starting
with "``/``" are detected in ``get_link()`` which resets the ``nameidata``
to point to the effective filesystem root. If the symlink only
contains "``/``" then there is nothing more to do, no components at all,
so ``NULL`` is returned to indicate that the symlink can be released and
the stack frame discarded.

The other case involves things in ``/proc`` that look like symlinks but
aren't really (and are therefore commonly referred to as "magic-links")::

     $ ls -l /proc/self/fd/1
     lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4

Every open file descriptor in any process is represented in ``/proc`` by
something that looks like a symlink. It is really a reference to the
target file, not just the name of it. When you ``readlink`` these
objects you get a name that might refer to the same file - unless it
has been unlinked or mounted over.
When ``walk_component()`` follows
one of these, the ``->follow_link()`` method in "procfs" doesn't return
a string name, but instead calls ``nd_jump_link()`` which updates the
``nameidata`` in place to point to that target. ``->follow_link()`` then
returns ``NULL``. Again there is no final component and ``get_link()``
reports this by leaving the ``last_type`` field of ``nameidata`` as
``LAST_BIND``.

Following the symlink in the final component
--------------------------------------------

All this leads to ``link_path_walk()`` walking down every component, and
following all symbolic links it finds, until it reaches the final
component. This is just returned in the ``last`` field of ``nameidata``.
For some callers, this is all they need; they want to create that
``last`` name if it doesn't exist or give an error if it does. Other
callers will want to follow a symlink if one is found, and possibly
apply special handling to the last component of that symlink, rather
than just the last component of the original file name. These callers
potentially need to call ``link_path_walk()`` again and again on
successive symlinks until one is found that doesn't point to another
symlink.

This case is handled by the relevant caller of ``link_path_walk()``, such as
``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
handles the final component. If the final component is a symlink
that needs to be followed, then ``trailing_symlink()`` is called to set
things up properly and the loop repeats, calling ``link_path_walk()``
again. This could loop as many as 40 times if the last component of
each symlink is another symlink.

The various functions that examine the final component and possibly
report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
and ``do_last()``, each of which uses the same convention as
``walk_component()`` of returning ``1`` if a symlink was found that needs
to be followed.

Of these, ``do_last()`` is the most interesting as it is used for
opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this
part is in a separate function: ``lookup_open()``.

Explaining ``do_last()`` completely is beyond the scope of this article,
but a few highlights should help those interested in exploring the
code.

1. Rather than just finding the target file, ``do_last()`` needs to open
   it. If the file was found in the dcache, then ``vfs_open()`` is used for
   this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if
   the filesystem provides it) to combine the final lookup with the open, or
   will perform the separate ``lookup_real()`` and ``vfs_create()`` steps
   directly. In the latter case the actual "open" of this newly found or
   created file will be performed by ``vfs_open()``, just as if the name
   were found in the dcache.

2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information
   wasn't quite current enough. Rather than restarting the lookup from
   the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead,
   giving the filesystem a chance to resolve small inconsistencies.
   If that doesn't work, only then is the lookup restarted from the top.

3. An open with ``O_CREAT`` **does** follow a symlink in the final component,
   unlike other creation system calls (like ``mkdir``). So the sequence::

          ln -s bar /tmp/foo
          echo hello > /tmp/foo

   will create a file called ``/tmp/bar``. This is not permitted if
   ``O_EXCL`` is set but otherwise is handled for an ``O_CREAT`` open much
   like for a non-creating open: ``should_follow_link()`` returns ``1``, and
   so does ``do_last()`` so that ``trailing_symlink()`` gets called and the
   open process continues on the symlink that was found.
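
The behaviour described in the third point can be observed from user
space. The following is a small sketch using only standard POSIX calls;
``toy_creat_through_symlink()`` is a hypothetical name, and the scratch
directory under ``/tmp`` is an assumption of the example.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for mkdtemp() and symlink() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 0 if O_CREAT followed a symlink in the final component and
 * O_CREAT|O_EXCL refused to, -1 otherwise.
 */
static int toy_creat_through_symlink(void)
{
    char dir[] = "/tmp/symlink-demo-XXXXXX";
    char linkpath[64], targetpath[64];
    struct stat st;
    int fd, n, ret = -1;

    if (!mkdtemp(dir))
        return -1;
    n = snprintf(linkpath, sizeof(linkpath), "%s/foo", dir);
    assert(n > 0 && (size_t)n < sizeof(linkpath));
    n = snprintf(targetpath, sizeof(targetpath), "%s/bar", dir);
    assert(n > 0 && (size_t)n < sizeof(targetpath));

    if (symlink("bar", linkpath) != 0)  /* foo -> bar; bar does not exist */
        goto out;

    /* With O_EXCL the dangling link must not be followed. */
    fd = open(linkpath, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        goto out;
    }
    if (errno != EEXIST)
        goto out;

    /* Without O_EXCL the link is followed and "bar" is created. */
    fd = open(linkpath, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        goto out;
    close(fd);
    if (stat(targetpath, &st) == 0 && S_ISREG(st.st_mode))
        ret = 0;
out:
    unlink(targetpath);
    unlink(linkpath);
    rmdir(dir);
    return ret;
}
```

A return value of ``0`` means that the plain ``O_CREAT`` open created
``bar`` through the link, while the ``O_EXCL`` attempt failed with
``EEXIST``.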

Updating the access time
------------------------

We previously said of RCU-walk that it would "take no locks, increment
no counts, leave no footprints." We have since seen that some
"footprints" can be needed when handling symlinks as a counted
reference (or even a memory allocation) may be needed. But these
footprints are best kept to a minimum.

One other place where walking down a symlink can involve leaving
footprints in a way that doesn't affect directories is in updating access times.
In Unix (and Linux) every filesystem object has a "last accessed
time", or "``atime``". Passing through a directory to access a file
within is not considered to be an access for the purposes of
``atime``; only listing the contents of a directory can update its ``atime``.
Symlinks are different, it seems. Both reading a symlink (with ``readlink()``)
and looking up a symlink on the way to some other destination can
update the ``atime`` on that symlink.

.. _clearest statement: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08

It is not clear why this is the case; POSIX has little to say on the
subject.
The `clearest statement`_ is that, if a particular implementation
updates a timestamp in a place not specified by POSIX, this must be
documented "except that any changes caused by pathname resolution need
not be documented". This seems to imply that POSIX doesn't really
care about access-time updates during pathname lookup.

.. _Linux 1.3.87: https://git.kernel.org/cgit/linux/kernel/git/history/history.git/diff/fs/ext2/symlink.c?id=f806c6db77b8eaa6e00dcfb6b567706feae8dbb8

An examination of history shows that prior to `Linux 1.3.87`_, the ext2
filesystem, at least, didn't update atime when following a link.
Unfortunately we have no record of why that behavior was changed.

In any case, access time must now be updated and that operation can be
quite complex. Trying to stay in RCU-walk while doing it is best
avoided. Fortunately it is often permitted to skip the ``atime``
update. Because ``atime`` updates cause performance problems in various
areas, Linux supports the ``relatime`` mount option, which generally
limits the updates of ``atime`` to once per day on files that aren't
being changed (and symlinks never change once created). Even without
``relatime``, many filesystems record ``atime`` with a one-second
granularity, so only one update per second is required.
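
The ``relatime`` rule just described can be sketched as a small predicate.
This is a simplified model under stated assumptions, not the kernel's code:
the name ``atime_update_needed()`` and the constant ``RELATIME_WINDOW`` are
made up for illustration, timestamps are reduced to whole seconds, and "being
changed" is modelled as the modification or change time being no older than
the recorded ``atime``.

```c
#include <stdbool.h>
#include <time.h>

/* Assumed window: at most one atime update per day for unchanged files. */
#define RELATIME_WINDOW (24 * 60 * 60)

/*
 * Hypothetical model of the relatime decision.  All arguments are
 * timestamps in seconds; "now" is the current time.
 */
bool atime_update_needed(time_t atime, time_t mtime, time_t ctime_sec,
                         time_t now)
{
    /* The object was modified since it was last accessed. */
    if (mtime >= atime)
        return true;
    /* The object's metadata changed since it was last accessed. */
    if (ctime_sec >= atime)
        return true;
    /* The recorded atime is at least a day old. */
    if (now - atime >= RELATIME_WINDOW)
        return true;
    /* Otherwise the update can be skipped. */
    return false;
}
```

For a symlink, which never changes once created, only the last test can
succeed in this model, so the update would be skipped for most lookups.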

It is easy to test if an ``atime`` update is needed while in RCU-walk
mode and, if it isn't, the update can be skipped and RCU-walk mode
continues. Only when an ``atime`` update is actually required does the
path walk drop down to REF-walk. All of this is handled in the
``get_link()`` function.

A few flags
-----------

A suitable way to wrap up this tour of pathname walking is to list
the various flags that can be stored in the ``nameidata`` to guide the
lookup process. Many of these are only meaningful on the final
component, others reflect the current state of the pathname lookup, and some
apply restrictions to all path components encountered in the path lookup.

And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
the others. If this is not set, an empty pathname causes an error
very early on. If it is set, empty pathnames are not considered to be
an error.

Global state flags
~~~~~~~~~~~~~~~~~~

We have already met two global state flags: ``LOOKUP_RCU`` and
``LOOKUP_REVAL``. These select between one of three overall approaches
to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.

``LOOKUP_PARENT`` indicates that the final component hasn't been reached
yet. This is primarily used to tell the audit subsystem the full
context of a particular access being audited.

``ND_ROOT_PRESET`` indicates that the ``root`` field in the ``nameidata`` was
provided by the caller, so it shouldn't be released when it is no
longer needed.

``ND_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
link"). In this case the filesystem has not been asked to revalidate the
name (with ``d_revalidate()``). In such cases the inode may still need
to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
``ND_JUMPED`` is set when the lookup completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.

Resolution-restriction flags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to allow userspace to protect itself against certain race conditions
and attack scenarios involving changing path components, a series of flags are
available which apply restrictions to all path components encountered during
path lookup.
These flags are exposed through ``openat2()``'s ``resolve`` field.

``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
This is distinctly different from ``LOOKUP_FOLLOW``, because the latter only
relates to restricting the following of trailing symlinks.

``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
ensure that they return errors from ``nd_jump_link()``, because that is how
``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.

``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both
bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
lookup is determined by the first mountpoint the path lookup reaches --
absolute paths start with the ``vfsmount`` of ``/``, and relative paths start
with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
``vfsmount`` of the path is unchanged.

``LOOKUP_BENEATH`` blocks any path components which resolve outside the
starting point of the resolution. This is done by blocking ``nd_jump_root()``
as well as blocking ".." if it would jump outside the starting point.
``rename_lock`` and ``mount_lock`` are used to detect attacks against the
resolution of "..". Magic-links are also blocked.
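
The ".." rule for ``LOOKUP_BENEATH`` can be illustrated with a toy model that
only counts how far below the starting point the walk currently is. This is a
sketch under heavy assumptions and not how the kernel implements it: the
function name ``beneath_walk()`` is hypothetical, and symlinks, mounts,
magic-links and the races that ``rename_lock`` and ``mount_lock`` guard
against are all ignored. The ``-EXDEV`` return value follows the error that
``openat2()`` is documented to report for such escapes.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical toy model of LOOKUP_BENEATH.  Returns the depth of the
 * final component below the starting point, or -EXDEV if the path would
 * leave the starting point.
 */
int beneath_walk(const char *path)
{
    const char *p = path;
    int depth = 0;

    /* An absolute path would jump to the root: blocked outright. */
    if (*p == '/')
        return -EXDEV;

    while (*p) {
        size_t len = strcspn(p, "/");   /* length of this component */

        if (len == 2 && p[0] == '.' && p[1] == '.') {
            /* ".." at the starting point would escape it. */
            if (depth == 0)
                return -EXDEV;
            depth--;
        } else if (len == 1 && p[0] == '.') {
            /* "." stays where it is. */
        } else {
            depth++;
        }
        p += len;
        p += strspn(p, "/");            /* skip one or more slashes */
    }
    return depth;
}
```

In this model ``a/b/../c`` stays two levels below the starting point, while
``a/../..`` and any absolute path are rejected.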

``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
were the filesystem root. ``nd_jump_root()`` brings the resolution back to
the starting point, and ".." at the starting point will act as a no-op. As with
``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
attacks against ".." resolution. Magic-links are also blocked.

Final-component flags
~~~~~~~~~~~~~~~~~~~~~

Some of these flags are only set when the final component is being
considered. Others are only checked for when considering that final
component.

``LOOKUP_AUTOMOUNT`` ensures that, if the final component is an automount
point, then the mount is triggered. Some operations would trigger it
anyway, but operations like ``stat()`` deliberately don't. ``statfs()``
needs to trigger the mount but otherwise behaves a lot like ``stat()``, so
it sets ``LOOKUP_AUTOMOUNT``, as do "``quotactl()``" and the handling of
"``mount --bind``".

``LOOKUP_FOLLOW`` has a similar function to ``LOOKUP_AUTOMOUNT`` but for
symlinks. Some system calls set or clear it implicitly, while
others have API flags such as ``AT_SYMLINK_FOLLOW`` and
``UMOUNT_NOFOLLOW`` to control it. Its effect is similar to
``WALK_GET`` that we already met, but it is used in a different way.
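
The userspace-visible effect of following, or not following, a trailing
symlink can be seen with ``fstatat()``. The sketch below is an illustration
only: ``AT_SYMLINK_NOFOLLOW`` is a related API flag that the text above does
not name, the assumption is that it results in ``LOOKUP_FOLLOW`` being left
clear, and the function name ``check_trailing_symlink_follow()`` and the
scratch directory are hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if fstatat() reports the regular file
 * behind a trailing symlink by default, and the symlink itself when
 * AT_SYMLINK_NOFOLLOW is passed; -1 otherwise.
 */
int check_trailing_symlink_follow(void)
{
    char dir[] = "/tmp/follow-demo-XXXXXX"; /* assumed scratch location */
    struct stat followed, not_followed;
    int dfd, fd, ok = 0;

    if (!mkdtemp(dir))
        return -1;
    dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        goto out_rmdir;

    /* A regular file and a symlink pointing at it. */
    fd = openat(dfd, "target", O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        goto out;
    close(fd);
    if (symlinkat("target", dfd, "link") != 0)
        goto out;

    /* Default: the final symlink is followed. */
    if (fstatat(dfd, "link", &followed, 0) != 0)
        goto out;
    /* With the flag: the final symlink itself is examined. */
    if (fstatat(dfd, "link", &not_followed, AT_SYMLINK_NOFOLLOW) != 0)
        goto out;

    ok = S_ISREG(followed.st_mode) && S_ISLNK(not_followed.st_mode);
out:
    /* Best-effort cleanup; errors are deliberately ignored here. */
    unlinkat(dfd, "link", 0);
    unlinkat(dfd, "target", 0);
    close(dfd);
out_rmdir:
    rmdir(dir);
    return ok ? 0 : -1;
}
```

Only the final component is affected here; a symlink appearing earlier in the
path would be followed in both calls.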

``LOOKUP_DIRECTORY`` insists that the final component is a directory.
Various callers set this and it is also set when the final component
is found to be followed by a slash.

Finally ``LOOKUP_OPEN``, ``LOOKUP_CREATE``, ``LOOKUP_EXCL``, and
``LOOKUP_RENAME_TARGET`` are not used directly by the VFS but are made
available to the filesystem and particularly the ``->d_revalidate()``
method. A filesystem can choose not to bother revalidating too hard
if it knows that it will be asked to open or create the file soon.
These flags were previously useful for ``->lookup()`` too but with the
introduction of ``->atomic_open()`` they are less relevant there.

End of the road
---------------

Despite its complexity, all this pathname lookup code appears to be
in good shape - various parts are certainly easier to understand now
than even a couple of releases ago. But that doesn't mean it is
"finished". As already mentioned, RCU-walk currently only follows
symlinks that are stored in the inode so, while it handles many ext4
symlinks, it doesn't help with NFS, XFS, or Btrfs. That support
is not likely to be long delayed.