:orphan:

Making Filesystems Exportable
=============================

Overview
--------

All filesystem operations require a dentry (or two) as a starting
point.  Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root.  However, remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry.  As the alternative
form of reference needs to be stable across renames, truncates, and
server reboots (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.

The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".



Dcache Issues
-------------

The dcache normally contains a proper prefix of any given filesystem
tree.  This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename, this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).

However, when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object.  This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.

1. The dcache must sometimes contain objects that are not part of the
   proper prefix, i.e. that are not connected to the root.
2. The dcache must be prepared for a newly found (via ->lookup) directory
   to already have a (non-connected) dentry, and must be able to move
   that dentry into place (based on the parent and name in the
   ->lookup).  This is particularly needed for directories as
   it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

a. A dentry flag DCACHE_DISCONNECTED which is set on
   any dentry that might not be part of the proper prefix.
   This is set when anonymous dentries are created, and cleared when a
   dentry is noticed to be a child of a dentry which is in the proper
   prefix.  If the refcount on a dentry with this flag set
   becomes zero, the dentry is immediately discarded, rather than being
   kept in the dcache.  If a dentry that is not already in the dcache
   is repeatedly accessed by filehandle (as NFSD might do), a new dentry
   will be allocated for each access, and discarded at the end of
   the access.

   Note that such a dentry can acquire children, name, ancestors, etc.
   without losing DCACHE_DISCONNECTED - that flag is only cleared when
   the subtree is successfully reconnected to root.  Until then dentries
   in such a subtree are retained only as long as there are references;
   refcount reaching zero means immediate eviction, same as for unhashed
   dentries.  That guarantees that we won't need to hunt them down upon
   umount.

b. A primitive for creation of secondary roots - d_obtain_root(inode).
   Those do _not_ bear DCACHE_DISCONNECTED.  They are placed on the
   per-superblock list (->s_roots), so they can be located at umount
   time for eviction purposes.

c. Helper routines to allocate anonymous dentries, and to help attach
   loose directory dentries at lookup time. They are:

    d_obtain_alias(inode) will return a dentry for the given inode.
      If the inode already has a dentry, one of those is returned.

      If it doesn't, a new anonymous (IS_ROOT and
      DCACHE_DISCONNECTED) dentry is allocated and attached.

      In the case of a directory, care is taken that only one dentry
      can ever be attached.

    d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
      either the passed-in dentry or a preexisting alias for the given inode
      (such as an anonymous one created by d_obtain_alias), if appropriate.
      It returns NULL when the passed-in dentry is used, following the calling
      convention of ->lookup.
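
The interplay of the two helpers can be illustrated with a small
user-space model.  This is only a sketch: every toy_* name below is
hypothetical, locking and reference counting are omitted, each inode
tracks a single alias, and only directories are spliced.  The model
clears its disconnected flag at splice time when the new parent is
itself connected; that is a simplification and does not claim to show
where the kernel clears DCACHE_DISCONNECTED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_DISCONNECTED 0x1	/* stand-in for DCACHE_DISCONNECTED */

struct toy_dentry;

struct toy_inode {
	int is_dir;
	struct toy_dentry *alias;	/* one alias is enough for this model */
};

struct toy_dentry {
	struct toy_dentry *parent;	/* points to itself when IS_ROOT */
	const char *name;
	struct toy_inode *inode;
	unsigned int flags;
};

/*
 * Model of d_obtain_alias(): reuse an existing alias, otherwise allocate
 * an anonymous dentry that is its own parent (IS_ROOT) and is flagged as
 * disconnected.  Returns NULL if allocation fails.
 */
static struct toy_dentry *toy_obtain_alias(struct toy_inode *inode)
{
	struct toy_dentry *d;

	assert(inode);
	if (inode->alias)
		return inode->alias;

	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->parent = d;
	d->name = "/";			/* placeholder name for an anonymous dentry */
	d->inode = inode;
	d->flags = TOY_DISCONNECTED;
	inode->alias = d;
	return d;
}

/*
 * Model of d_splice_alias(): returns NULL when the passed-in dentry is
 * used, or the preexisting directory alias after moving it into place.
 */
static struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
					   struct toy_dentry *dentry)
{
	struct toy_dentry *old;

	assert(dentry && dentry->parent);
	if (!inode)
		return NULL;		/* negative lookup: dentry used as is */

	old = inode->alias;
	if (inode->is_dir && old) {
		/* A directory may have only one dentry: move the old one. */
		old->parent = dentry->parent;
		old->name = dentry->name;
		if (!(dentry->parent->flags & TOY_DISCONNECTED))
			old->flags &= ~TOY_DISCONNECTED;
		return old;
	}

	dentry->inode = inode;
	if (!old)
		inode->alias = dentry;
	return NULL;
}
```

In this model, a directory first reached by filehandle gets an anonymous
dentry from toy_obtain_alias(); a later lookup by name under a connected
parent returns that same dentry from toy_splice_alias() instead of
creating a second one.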

Filesystem Issues
-----------------

For a filesystem to be exportable it must:

   1. provide the filehandle fragment routines described below.
   2. make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.

      If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::

		d_add(dentry, inode), NULL

      Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err).

      Typically the ->lookup routine will simply end with a::

		return d_splice_alias(inode, dentry);
	}



A filesystem implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block.  This field must point to a "struct export_operations"
which has the following members:

  encode_fh (optional)
    Takes a dentry and creates a filehandle fragment which may later be used
    to find or create a dentry for the same object.  The default
    implementation creates a filehandle fragment that encodes a 32-bit inode
    number and generation number for the inode encoded, and if necessary the
    same information for the parent.

  fh_to_dentry (mandatory)
    Given a filehandle fragment, this should find the implied object and
    create a dentry for it (possibly with d_obtain_alias).

  fh_to_parent (optional but strongly recommended)
    Given a filehandle fragment, this should find the parent of the
    implied object and create a dentry for it (possibly with
    d_obtain_alias).  May fail if the filehandle fragment is too small.

  get_parent (optional but strongly recommended)
    When given a dentry for a directory, this should return a dentry for
    the parent.  Quite possibly the parent dentry will have been allocated
    by d_obtain_alias.  The default get_parent function just returns an error
    so any filehandle lookup that requires finding a parent will fail.
    ->lookup("..") is *not* used as a default as it can leave ".." entries
    in the dcache which are too messy to work with.

  get_name (optional)
    When given a parent dentry and a child dentry, this should find a name
    in the directory identified by the parent dentry, which leads to the
    object identified by the child dentry.  If no get_name function is
    supplied, a default implementation is provided which reads the
    directory to find potential names, and matches inode numbers to find
    the correct match.

  flags
    Some filesystems may need to be handled differently than others. The
    export_operations struct also includes a flags field that allows the
    filesystem to communicate such information to nfsd. See the Export
    Operations Flags section below for more explanation.

A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decoding routines (fh_to_dentry and fh_to_parent) should not depend
on the stated size that is passed to them.  This size may be larger than
the original filehandle generated by encode_fh, in which case it will
have been padded with NUL bytes.  Rather, the encode_fh routine should
choose a "type" which tells the decoding routines how much of the
filehandle is valid, and how it should be interpreted.
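
The rule that the "type" rather than the stated size determines how
much of a fragment is valid can be sketched in user-space C.  This is a
model only: the demo_* names, the type values and the return
conventions are assumptions made for illustration and are not the
kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fragment types: the type alone says how many words count. */
enum {
	DEMO_FID_INO_GEN	= 1,	/* words: ino, gen */
	DEMO_FID_INO_GEN_PARENT	= 2,	/* words: ino, gen, parent ino, parent gen */
};

struct demo_obj {
	uint32_t ino, gen;
	uint32_t parent_ino, parent_gen;
};

/*
 * Encode an object into fh[].  *len is counted in 4-byte words: the
 * buffer capacity on entry, the number of words needed on return.
 * Returns the fragment type, or -1 if the buffer is too small.
 */
static int demo_encode_fh(const struct demo_obj *o, uint32_t *fh, int *len,
			  int want_parent)
{
	int need = want_parent ? 4 : 2;

	assert(o && fh && len);
	if (*len < need) {
		*len = need;
		return -1;
	}
	fh[0] = o->ino;
	fh[1] = o->gen;
	if (want_parent) {
		fh[2] = o->parent_ino;
		fh[3] = o->parent_gen;
	}
	*len = need;
	return want_parent ? DEMO_FID_INO_GEN_PARENT : DEMO_FID_INO_GEN;
}

/*
 * Decode a fragment.  The number of valid words is derived from the
 * type; len is only checked as a lower bound, since it may include
 * NUL padding beyond what the encoder wrote.  Returns 0 or -1.
 */
static int demo_decode_fh(const uint32_t *fh, int len, int type,
			  struct demo_obj *out)
{
	int valid;

	assert(fh && out);
	switch (type) {
	case DEMO_FID_INO_GEN:
		valid = 2;
		break;
	case DEMO_FID_INO_GEN_PARENT:
		valid = 4;
		break;
	default:
		return -1;		/* unknown type: refuse to guess */
	}
	if (len < valid)
		return -1;

	memset(out, 0, sizeof(*out));
	out->ino = fh[0];
	out->gen = fh[1];
	if (valid == 4) {
		out->parent_ino = fh[2];
		out->parent_gen = fh[3];
	}
	return 0;
}
```

A fragment encoded without parent information decodes identically
whether the decoder is told it is 2 words or 8 words long, because the
padding words are never consulted.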

Export Operations Flags
-----------------------

In addition to the operation vector pointers, struct export_operations also
contains a "flags" field that allows the filesystem to tell nfsd that it
needs to be handled differently from other filesystems. The following
flags are defined:

  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
    RFC 1813 recommends that servers always send weak cache consistency
    (WCC) data to the client after each operation. The server should
    atomically collect attributes about the inode, do an operation on it,
    and then collect the attributes afterward. This allows the client to
    skip issuing GETATTRs in some situations but means that the server
    is calling vfs_getattr for almost all RPCs. On some filesystems
    (particularly those that are clustered or networked) this is expensive
    and atomicity is difficult to guarantee. This flag indicates to nfsd
    that it should skip providing WCC attributes to the client in NFSv3
    replies when doing operations on this filesystem. Consider enabling
    this on filesystems that have an expensive ->getattr inode operation,
    or when atomicity between pre- and post-operation attribute collection
    is impossible to guarantee.

  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
    Many NFS operations deal with filehandles, which the server must then
    vet to ensure that they live inside of an exported tree. When the
    export consists of an entire filesystem, this is trivial. nfsd can just
    ensure that the filehandle lives on the filesystem. When only part of a
    filesystem is exported, however, nfsd must walk the ancestors of the
    inode to ensure that it's within an exported subtree. This is an
    expensive operation and not all filesystems can support it properly.
    This flag exempts the filesystem from subtree checking and causes
    exportfs to get back an error if it tries to enable subtree checking
    on it.

  EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
    On some exportable filesystems (such as NFS) unlinking a file that
    is still open can cause a fair bit of extra work. For instance,
    the NFS client will do a "sillyrename" to ensure that the file
    sticks around while it's still open. When reexporting, that open
    file is held by nfsd so we usually end up doing a sillyrename, and
    then immediately deleting the sillyrenamed file just afterward when
    the link count actually goes to zero. Sometimes this delete can race
    with other operations (for instance an rmdir of the parent directory).
    This flag causes nfsd to close any open files for this inode _before_
    calling into the vfs to do an unlink or a rename that would replace
    an existing file.

  EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
    PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
    write to one bdi (the final bdi) in order to free up writes queued
    to another bdi (the client bdi). Such threads get a private balance
    of dirty pages so that dirty pages for the client bdi do not impact
    the daemon writing to the final bdi. For filesystems whose durable
    storage is not local (such as exported NFS filesystems), this
    constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
    an export to disable writeback throttling.

  EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
    EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
    cannot provide the semantics required by the "atomic" boolean in
    NFSv4's change_info4. This boolean indicates to a client whether the
    returned before and after change attributes were obtained atomically
    with respect to the requested metadata operation (UNLINK,
    OPEN/CREATE, MKDIR, etc).

  EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
    On most filesystems, inodes can remain under writeback after the
    file is closed. NFSD relies on client activity or local flusher
    threads to handle writeback. Certain filesystems, such as NFS, flush
    all of an inode's dirty data on last close. Exports that behave this
    way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
    waiting for writeback when closing such files.
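
How a server might consult these flags can be sketched with a small
user-space model.  The DEMO_OP_* bit values, the demo_export_ops
structure and the helper names are all assumptions made for this
illustration; the real flag definitions live in the kernel headers and
nfsd's actual checks are not reproduced here.

```c
#include <assert.h>

/* Placeholder bit values for this sketch, one bit per flag. */
#define DEMO_OP_NOWCC			0x01
#define DEMO_OP_NOSUBTREECHK		0x02
#define DEMO_OP_CLOSE_BEFORE_UNLINK	0x04
#define DEMO_OP_REMOTE_FS		0x08
#define DEMO_OP_NOATOMIC_ATTR		0x10
#define DEMO_OP_FLUSH_ON_CLOSE		0x20

/* Stand-in for the flags member of struct export_operations. */
struct demo_export_ops {
	unsigned long flags;
};

/* Should WCC attributes be sent in NFSv3 replies for this filesystem? */
static int demo_send_wcc(const struct demo_export_ops *ops)
{
	assert(ops);
	return !(ops->flags & DEMO_OP_NOWCC);
}

/*
 * Validate an export request.  Returns -1 when subtree checking is
 * requested on a filesystem that disallows it, 0 otherwise.
 */
static int demo_validate_subtree_check(const struct demo_export_ops *ops,
				       int want_subtree_check)
{
	assert(ops);
	if (want_subtree_check && (ops->flags & DEMO_OP_NOSUBTREECHK))
		return -1;
	return 0;
}

/* Should cached open files be closed before an unlink or replacing rename? */
static int demo_close_before_unlink(const struct demo_export_ops *ops)
{
	assert(ops);
	return !!(ops->flags & DEMO_OP_CLOSE_BEFORE_UNLINK);
}

/* Can the wait for writeback be skipped when closing a file? */
static int demo_skip_writeback_wait(const struct demo_export_ops *ops)
{
	assert(ops);
	return !!(ops->flags & DEMO_OP_FLUSH_ON_CLOSE);
}
```

A filesystem combines the flags it needs with bitwise OR in its
operations structure, and each check above tests a single bit, so the
flags can be set independently of one another.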