:orphan:

Making Filesystems Exportable
=============================

Overview
--------

All filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root. However remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server-reboot (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.

The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".
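
As a rough illustration of what such a fragment might carry, the
following userspace sketch packs an inode number and generation number
(and optionally the same pair for the parent) into an array of 4-byte
words, with a small "type" code recording how many words are
meaningful. Everything here is hypothetical and for illustration only:
the names frag_id, frag_encode and frag_decode, the FRAG_* type codes
and the layout are assumptions, not kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type codes; a real filesystem chooses its own. */
enum { FRAG_INO_GEN = 1, FRAG_INO_GEN_PARENT = 2, FRAG_INVALID = 0xff };

/* Hypothetical identity of an object, plus (optionally) its parent. */
struct frag_id {
	uint32_t ino, gen;
	uint32_t parent_ino, parent_gen;
	int has_parent;
};

/*
 * Encode id into fh[], which has room for *max_words 4-byte words.
 * Returns the type code and sets *max_words to the words used.
 * If the buffer is too small, returns FRAG_INVALID and sets
 * *max_words to the number of words that would be required.
 */
int frag_encode(const struct frag_id *id, uint32_t *fh, size_t *max_words)
{
	size_t need = id->has_parent ? 4 : 2;

	if (*max_words < need) {
		*max_words = need;
		return FRAG_INVALID;
	}
	fh[0] = id->ino;
	fh[1] = id->gen;
	if (id->has_parent) {
		fh[2] = id->parent_ino;
		fh[3] = id->parent_gen;
	}
	*max_words = need;
	return id->has_parent ? FRAG_INO_GEN_PARENT : FRAG_INO_GEN;
}

/*
 * Decode a fragment. The stated length (words) may exceed what was
 * encoded, so the type alone decides how many words are interpreted.
 * Returns 0 on success, -1 for an unknown type or a short fragment.
 */
int frag_decode(const uint32_t *fh, size_t words, int type,
		struct frag_id *out)
{
	size_t need;

	switch (type) {
	case FRAG_INO_GEN:
		need = 2;
		break;
	case FRAG_INO_GEN_PARENT:
		need = 4;
		break;
	default:
		return -1;
	}
	if (words < need)
		return -1;

	out->ino = fh[0];
	out->gen = fh[1];
	out->has_parent = (type == FRAG_INO_GEN_PARENT);
	out->parent_ino = out->has_parent ? fh[2] : 0;
	out->parent_gen = out->has_parent ? fh[3] : 0;
	return 0;
}
```

The design point is that the decoder trusts the type rather than the
stated length, so a fragment that has been padded to a larger size still
decodes to the same object.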


Dcache Issues
-------------

The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).

However when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.

1. The dcache must sometimes contain objects that are not part of the
   proper prefix, i.e. that are not connected to the root.
2. The dcache must be prepared for a newly found (via ->lookup) directory
   to already have a (non-connected) dentry, and must be able to move
   that dentry into place (based on the parent and name in the
   ->lookup). This is particularly needed for directories as
   it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

a. A dentry flag DCACHE_DISCONNECTED which is set on
   any dentry that might not be part of the proper prefix.
   This is set when anonymous dentries are created, and cleared when a
   dentry is noticed to be a child of a dentry which is in the proper
   prefix. If the refcount on a dentry with this flag set
   becomes zero, the dentry is immediately discarded, rather than being
   kept in the dcache. If a dentry that is not already in the dcache
   is repeatedly accessed by filehandle (as NFSD might do), a new dentry
   will be allocated for each access, and discarded at the end of
   the access.

   Note that such a dentry can acquire children, name, ancestors, etc.
   without losing DCACHE_DISCONNECTED - that flag is only cleared when
   the subtree is successfully reconnected to root. Until then dentries
   in such a subtree are retained only as long as there are references;
   refcount reaching zero means immediate eviction, same as for unhashed
   dentries. That guarantees that we won't need to hunt them down upon
   umount.

b. A primitive for creation of secondary roots - d_obtain_root(inode).
   Those do _not_ bear DCACHE_DISCONNECTED. They are placed on the
   per-superblock list (->s_roots), so they can be located at umount
   time for eviction purposes.

c. Helper routines to allocate anonymous dentries, and to help attach
   loose directory dentries at lookup time.
   They are:

    d_obtain_alias(inode) will return a dentry for the given inode.
      If the inode already has a dentry, one of those is returned.

      If it doesn't, a new anonymous (IS_ROOT and
      DCACHE_DISCONNECTED) dentry is allocated and attached.

      In the case of a directory, care is taken that only one dentry
      can ever be attached.

    d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
      either the passed-in dentry or a preexisting alias for the given inode
      (such as an anonymous one created by d_obtain_alias), if appropriate.
      It returns NULL when the passed-in dentry is used, following the calling
      convention of ->lookup.

Filesystem Issues
-----------------

For a filesystem to be exportable it must:

   1. provide the filehandle fragment routines described below.
   2. make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.

      If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::

                d_add(dentry, inode), NULL

      Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)

      Typically the ->lookup routine will simply end with a::

                return d_splice_alias(inode, dentry);
        }



A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:

  encode_fh (optional)
    Takes a dentry and creates a filehandle fragment which may later be used
    to find or create a dentry for the same object. The default
    implementation creates a filehandle fragment that encodes a 32-bit inode
    number and a generation number for the object being encoded, and if
    necessary the same information for the parent.

  fh_to_dentry (mandatory)
    Given a filehandle fragment, this should find the implied object and
    create a dentry for it (possibly with d_obtain_alias).

  fh_to_parent (optional but strongly recommended)
    Given a filehandle fragment, this should find the parent of the
    implied object and create a dentry for it (possibly with
    d_obtain_alias). May fail if the filehandle fragment is too small.

  get_parent (optional but strongly recommended)
    When given a dentry for a directory, this should return a dentry for
    the parent. Quite possibly the parent dentry will have been allocated
    by d_obtain_alias. The default get_parent function just returns an error
    so any filehandle lookup that requires finding a parent will fail.
    ->lookup("..") is *not* used as a default as it can leave ".." entries
    in the dcache which are too messy to work with.

  get_name (optional)
    When given a parent dentry and a child dentry, this should find a name
    in the directory identified by the parent dentry, which leads to the
    object identified by the child dentry. If no get_name function is
    supplied, a default implementation is provided which uses iterate_dir
    to find potential names, and matches inode numbers to find the correct
    match.

  flags
    Some filesystems may need to be handled differently than others. The
    export_operations struct also includes a flags field that allows the
    filesystem to communicate such information to nfsd.
    See the Export Operations Flags section below for more explanation.

A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decoding routines (fh_to_dentry and fh_to_parent) should not depend
on the stated size that is passed to them. This size may be larger than
the original filehandle generated by encode_fh, in which case it will
have been padded with NULs. Rather, the encode_fh routine should choose
a "type" which indicates to the decoding routines how much of the
filehandle is valid, and how it should be interpreted.

Export Operations Flags
-----------------------

In addition to the operation vector pointers, struct export_operations also
contains a "flags" field that allows the filesystem to communicate to nfsd
that it may want to do things differently when dealing with it. The
following flags are defined:

  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
    RFC 1813 recommends that servers always send weak cache consistency
    (WCC) data to the client after each operation. The server should
    atomically collect attributes about the inode, do an operation on it,
    and then collect the attributes afterward. This allows the client to
    skip issuing GETATTRs in some situations but means that the server
    is calling vfs_getattr for almost all RPCs.
    On some filesystems
    (particularly those that are clustered or networked) this is expensive
    and atomicity is difficult to guarantee. This flag indicates to nfsd
    that it should skip providing WCC attributes to the client in NFSv3
    replies when doing operations on this filesystem. Consider enabling
    this on filesystems that have an expensive ->getattr inode operation,
    or when atomicity between pre and post operation attribute collection
    is impossible to guarantee.

  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
    Many NFS operations deal with filehandles, which the server must then
    vet to ensure that they live inside of an exported tree. When the
    export consists of an entire filesystem, this is trivial. nfsd can just
    ensure that the filehandle lives on the filesystem. When only part of a
    filesystem is exported however, then nfsd must walk the ancestors of the
    inode to ensure that it's within an exported subtree. This is an
    expensive operation and not all filesystems can support it properly.
    This flag exempts the filesystem from subtree checking and causes
    exportfs to get back an error if it tries to enable subtree checking
    on it.

  EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
    On some exportable filesystems (such as NFS) unlinking a file that
    is still open can cause a fair bit of extra work.
    For instance,
    the NFS client will do a "sillyrename" to ensure that the file
    sticks around while it's still open. When reexporting, that open
    file is held by nfsd so we usually end up doing a sillyrename, and
    then immediately deleting the sillyrenamed file just afterward when
    the link count actually goes to zero. Sometimes this delete can race
    with other operations (for instance an rmdir of the parent directory).
    This flag causes nfsd to close any open files for this inode _before_
    calling into the vfs to do an unlink or a rename that would replace
    an existing file.

  EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
    PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
    write to one bdi (the final bdi) in order to free up writes queued
    to another bdi (the client bdi). Such threads get a private balance
    of dirty pages so that dirty pages for the client bdi do not impact
    the daemon writing to the final bdi. For filesystems whose durable
    storage is not local (such as exported NFS filesystems), this
    constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
    an export to disable writeback throttling.

  EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
    EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
    cannot provide the semantics required by the "atomic" boolean in
    NFSv4's change_info4. This boolean indicates to a client whether the
    returned before and after change attributes were obtained atomically
    with respect to the requested metadata operation (UNLINK,
    OPEN/CREATE, MKDIR, etc.).

  EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
    On most filesystems, inodes can remain under writeback after the
    file is closed. NFSD relies on client activity or local flusher
    threads to handle writeback. Certain filesystems, such as NFS, flush
    all of an inode's dirty data on last close. Exports that behave this
    way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
    waiting for writeback when closing such files.
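
The flags described above are independent bits that a filesystem ORs
together and that a consumer such as nfsd tests individually. The
userspace sketch below mocks that pattern. It is not kernel code: the
bit values are stand-ins chosen for illustration, and the struct
export_operations_mock type and the mock_* helpers are hypothetical
names that do not exist in the kernel.

```c
#include <stdbool.h>

/* Stand-in bit values for illustration; see the kernel headers for
 * the real definitions. */
#define EXPORT_OP_NOWCC			0x1
#define EXPORT_OP_NOSUBTREECHK		0x2
#define EXPORT_OP_CLOSE_BEFORE_UNLINK	0x4

/* Hypothetical mock: only the flags field is modelled here. */
struct export_operations_mock {
	unsigned long flags;
};

/* WCC attributes are sent unless the filesystem opted out. */
bool mock_wants_wcc(const struct export_operations_mock *ops)
{
	return !(ops->flags & EXPORT_OP_NOWCC);
}

/* Subtree checking may be enabled unless the filesystem disallows it. */
bool mock_allows_subtree_check(const struct export_operations_mock *ops)
{
	return !(ops->flags & EXPORT_OP_NOSUBTREECHK);
}

/* Cached open files should be closed before an unlink or a replacing
 * rename when the filesystem asks for it. */
bool mock_close_before_unlink(const struct export_operations_mock *ops)
{
	return (ops->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK) != 0;
}
```

A filesystem that is itself a network client might, under these
assumptions, set EXPORT_OP_NOWCC | EXPORT_OP_CLOSE_BEFORE_UNLINK, while
a simple local filesystem would leave flags at zero.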