.. SPDX-License-Identifier: GPL-2.0

========
ORANGEFS
========

OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal
for large storage problems faced by HPC, BigData, Streaming Video,
Genomics, and Bioinformatics.

Orangefs, originally called PVFS, was first developed in 1993 by
Walt Ligon and Eric Blumer as a parallel file system for Parallel
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns
of parallel programs.

Orangefs features include:

  * Distributes file data among multiple file servers
  * Supports simultaneous access by multiple clients
  * Stores file data and metadata on servers using local file system
    and access methods
  * Userspace implementation is easy to install and maintain
  * Direct MPI support
  * Stateless


Mailing List Archives
=====================

http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/


Mailing List Submissions
========================

devel@lists.orangefs.org


Documentation
=============

http://www.orangefs.org/documentation/

Running ORANGEFS On a Single Server
===================================

OrangeFS is usually run in large installations with multiple servers and
clients, but a complete filesystem can be run on a single machine for
development and testing.

On Fedora, install orangefs and orangefs-server::

    dnf -y install orangefs orangefs-server

There is an example server configuration file in
/etc/orangefs/orangefs.conf.  Change localhost to your hostname if
necessary.

To generate a filesystem to run xfstests against, see the "Running
xfstests" section below.

There is an example client configuration file in /etc/pvfs2tab.  It is a
single line.  Uncomment it and change the hostname if necessary.  This
file controls clients which use libpvfs2.  It does not control the
pvfs2-client-core.

Create the filesystem::

    pvfs2-server -f /etc/orangefs/orangefs.conf

Start the server::

    systemctl start orangefs-server

Test the server::

    pvfs2-ping -m /pvfsmnt

Start the client.  The module must be compiled in or loaded before this
point::

    systemctl start orangefs-client

Mount the filesystem::

    mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt

Userspace Filesystem Source
===========================

http://www.orangefs.org/download

Orangefs versions prior to 2.9.3 are not compatible with the
upstream version of the kernel client.


Building ORANGEFS on a Single Server
====================================

Where OrangeFS cannot be installed from distribution packages, it may be
built from source.

You can omit --prefix if you don't care that things are sprinkled around
in /usr/local.  As of version 2.9.6, OrangeFS uses Berkeley DB by
default; we will probably be changing the default to LMDB soon.

::

    ./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint

    make

    make install

Create an orangefs config file by running pvfs2-genconfig and
specifying a target config file. Pvfs2-genconfig will prompt you
through the questions. Generally it works fine to take the defaults, but
you should use your server's hostname, rather than "localhost", when
it comes to that question::

    /opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf

Create an /etc/pvfs2tab file (localhost is fine)::

    echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \
        /etc/pvfs2tab

Create the mount point you specified in the tab file if needed::

    mkdir /pvfsmnt

Bootstrap the server::

    /opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf

Start the server::

    /opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf

Now the server should be running. Pvfs2-ls is a simple
test to verify that the server is running::

    /opt/ofs/bin/pvfs2-ls /pvfsmnt

If everything seems to be working, load the kernel module and
turn on the client core::

    /opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core

Mount your filesystem::

    mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt


Running xfstests
================

It is useful to use a scratch filesystem with xfstests.  This can be
done with only one server.

Make a second copy of the FileSystem section in the server configuration
file, which is /etc/orangefs/orangefs.conf.  Change the Name to scratch.
Change the ID to something other than the ID of the first FileSystem
section (2 is usually a good choice).

Then there are two FileSystem sections: orangefs and scratch.

This change should be made before creating the filesystem.

::

    pvfs2-server -f /etc/orangefs/orangefs.conf

To run xfstests, create /etc/xfsqa.config::

    TEST_DIR=/orangefs
    TEST_DEV=tcp://localhost:3334/orangefs
    SCRATCH_MNT=/scratch
    SCRATCH_DEV=tcp://localhost:3334/scratch

Then xfstests can be run::

    ./check -pvfs2


Options
=======

The following mount options are accepted:

  acl
    Allow the use of Access Control Lists on files and directories.

  intr
    Some operations between the kernel client and the user space
    filesystem can be interruptible, such as changes in debug levels
    and the setting of tunable parameters.

  local_lock
    Enable POSIX locking from the perspective of "this" kernel. The
    default file_operations lock action is to return ENOSYS. POSIX
    locking kicks in if the filesystem is mounted with -o local_lock.
    Distributed locking is being worked on for the future.


Debugging
=========

To send the debug (GOSSIP) statements from a particular
source file (inode.c for example) to syslog::

  echo inode > /sys/kernel/debug/orangefs/kernel-debug

No debugging (the default)::

  echo none > /sys/kernel/debug/orangefs/kernel-debug

Debugging from several source files::

  echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug

All debugging::

  echo all > /sys/kernel/debug/orangefs/kernel-debug

Get a list of all debugging keywords::

  cat /sys/kernel/debug/orangefs/debug-help


Protocol between Kernel Module and Userspace
============================================

Orangefs is a user space filesystem and an associated kernel module.
We'll just refer to the user space part of Orangefs as "userspace"
from here on out. Orangefs descends from PVFS, and userspace code
still uses PVFS for function and variable names. Userspace typedefs
many of the important structures. Function and variable names in
the kernel module have been transitioned to "orangefs", and the Linux
Coding Style avoids typedefs, so kernel module structures that
correspond to userspace structures are not typedefed.

The kernel module implements a pseudo device that userspace
can read from and write to. Userspace can also manipulate the
kernel module through the pseudo device with ioctl.

The Bufmap
----------

At startup userspace allocates two page-size-aligned (posix_memalign)
mlocked memory buffers: one is used for IO and one is used for readdir
operations. The IO buffer is 41943040 bytes and the readdir buffer is
4194304 bytes. Each buffer contains logical chunks, or partitions, and
a pointer to each buffer is added to its own PVFS_dev_map_desc structure
which also describes its total size, as well as the size and number of
the partitions.

A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a
mapping routine in the kernel module with an ioctl. The structure is
copied from user space to kernel space with copy_from_user and is used
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which
then contains:

  * refcnt - a reference counter.
  * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's
    partition size, which represents the filesystem's block size and
    is used for s_blocksize in super blocks.
  * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of
    partitions in the IO buffer.
  * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks.
  * total_size - the total size of the IO buffer.
  * page_count - the number of 4096 byte pages in the IO buffer.
  * page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes
    of kcalloced memory. This memory is used as an array of pointers
    to each of the pages in the IO buffer through a call to get_user_pages.
  * desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
    bytes of kcalloced memory. This memory is further initialized:

      user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
      structure. user_desc->ptr points to the IO buffer.

      ::

        pages_per_desc = bufmap->desc_size / PAGE_SIZE
        offset = 0

        bufmap->desc_array[0].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[0].array_count = pages_per_desc = 1024
        bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096)
        offset += 1024
                           .
                           .
                           .
        bufmap->desc_array[9].page_array = &bufmap->page_array[offset]
        bufmap->desc_array[9].array_count = pages_per_desc = 1024
        bufmap->desc_array[9].uaddr = (user_desc->ptr) +
                                               (9 * 1024 * 4096)
        offset += 1024

  * buffer_index_array - a desc_count sized array of ints, used to
    indicate which of the IO buffer's partitions are available to use.
  * buffer_index_lock - a spinlock to protect buffer_index_array during update.
  * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element
    int array used to indicate which of the readdir buffer's partitions are
    available to use.
  * readdir_index_lock - a spinlock to protect readdir_index_array during
    update.
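
The partition arithmetic above can be sketched in plain C. This is only an
illustration under stated assumptions: ``struct sketch_desc`` and the
``sketch_*`` names are hypothetical stand-ins, not the kernel's
``struct orangefs_bufmap_desc``, and the page size is fixed at the 4096 bytes
the text uses.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096u     /* page size assumed by the text */
#define SKETCH_DESC_SIZE   4194304u  /* PVFS2_BUFMAP_DEFAULT_DESC_SIZE */
#define SKETCH_DESC_COUNT  10u       /* PVFS2_BUFMAP_DEFAULT_DESC_COUNT */

/* Hypothetical stand-in for one entry of desc_array. */
struct sketch_desc {
    size_t first_page;   /* index of the partition's first page */
    size_t array_count;  /* number of pages in the partition */
    size_t byte_offset;  /* partition offset from user_desc->ptr */
};

/* Fill all SKETCH_DESC_COUNT entries; returns the total page count. */
static size_t sketch_layout(struct sketch_desc *descs)
{
    size_t pages_per_desc = SKETCH_DESC_SIZE / SKETCH_PAGE_SIZE;
    size_t offset = 0;

    for (size_t i = 0; i < SKETCH_DESC_COUNT; i++) {
        descs[i].first_page = offset;
        descs[i].array_count = pages_per_desc;
        descs[i].byte_offset = i * pages_per_desc * SKETCH_PAGE_SIZE;
        offset += pages_per_desc;
    }
    return offset;
}

/* desc_shift is log2(desc_size); desc_size is assumed a power of two. */
static unsigned sketch_shift(size_t desc_size)
{
    unsigned shift = 0;

    while (desc_size > 1) {
        desc_size >>= 1;
        shift++;
    }
    return shift;
}
```

With the default values, ten partitions of 1024 pages each cover the whole
41943040 byte IO buffer, and the shift for a 4194304 byte partition is 22.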

Operations
----------

The kernel module builds an "op" (struct orangefs_kernel_op_s) when it
needs to communicate with userspace. Part of the op contains the "upcall"
which expresses the request to userspace. Part of the op eventually
contains the "downcall" which expresses the results of the request.

The slab allocator is used to keep a cache of op structures handy.

At init time the kernel module defines and initializes a request list
and an in_progress hash table to keep track of all the ops that are
in flight at any given time.

Ops are stateful:

 * unknown - op was just initialized
 * waiting - op is on request_list (upward bound)
 * inprogr - op is in progress (waiting for downcall)
 * serviced - op has matching downcall; ok
 * purged - op has to start a timer since client-core
   exited uncleanly before servicing op
 * given up - submitter has given up waiting for it

When some arbitrary userspace program needs to perform a
filesystem operation on Orangefs (readdir, I/O, create, whatever)
an op structure is initialized and tagged with a distinguishing ID
number. The upcall part of the op is filled out, and the op is
passed to the "service_operation" function.

Service_operation changes the op's state to "waiting", puts
it on the request list, and signals the Orangefs file_operations.poll
function through a wait queue. Userspace is polling the pseudo-device
and thus becomes aware of the upcall request that needs to be read.

When the Orangefs file_operations.read function is triggered, the
request list is searched for an op that seems ready-to-process.
The op is removed from the request list. The tag from the op and
the filled-out upcall struct are copy_to_user'ed back to userspace.

If any of these (and some additional protocol) copy_to_users fail,
the op's state is set to "waiting" and the op is added back to
the request list. Otherwise, the op's state is changed to "in progress",
and the op is hashed on its tag and put onto the end of a list in the
in_progress hash table at the index the tag hashed to.

When userspace has assembled the response to the upcall, it
writes the response, which includes the distinguishing tag, back to
the pseudo device in a series of io_vecs. This triggers the Orangefs
file_operations.write_iter function to find the op with the associated
tag and remove it from the in_progress hash table. As long as the op's
state is not "canceled" or "given up", its state is set to "serviced".
The file_operations.write_iter function returns to the waiting vfs,
and back to service_operation through wait_for_matching_downcall.

Service_operation returns to its caller with the op's downcall
part (the response to the upcall) filled out.
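
The life cycle described above can be summarized as a small state machine.
The sketch below is a simplification under stated assumptions: the enum and
function names are hypothetical, the "canceled" state mentioned in the text is
left out because it is not among the listed states, and the request list, hash
table and wait queue are not modeled.

```c
#include <stdbool.h>

/* Hypothetical names mirroring the op states listed in the text. */
enum sketch_op_state {
    SKETCH_OP_UNKNOWN,
    SKETCH_OP_WAITING,
    SKETCH_OP_INPROGR,
    SKETCH_OP_SERVICED,
    SKETCH_OP_PURGED,
    SKETCH_OP_GIVEN_UP
};

/* service_operation: the op is queued on the request list. */
static enum sketch_op_state sketch_submit(enum sketch_op_state s)
{
    (void)s;
    return SKETCH_OP_WAITING;
}

/*
 * read path: a waiting op moves to in-progress if the copy to
 * userspace succeeded, otherwise it stays on the request list.
 */
static enum sketch_op_state sketch_read(enum sketch_op_state s, bool copy_ok)
{
    if (s != SKETCH_OP_WAITING)
        return s;
    return copy_ok ? SKETCH_OP_INPROGR : SKETCH_OP_WAITING;
}

/*
 * write_iter path: an in-progress op becomes serviced; an op the
 * submitter has given up on keeps its state.
 */
static enum sketch_op_state sketch_write(enum sketch_op_state s)
{
    if (s != SKETCH_OP_INPROGR)
        return s;
    return SKETCH_OP_SERVICED;
}
```

A normal request walks unknown, waiting, inprogr, serviced; a failed
copy_to_user leaves the op in "waiting" so that it can be read again.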

The "client-core" is the bridge between the kernel module and
userspace. The client-core is a daemon. The client-core has an
associated watchdog daemon. If the client-core is ever signaled
to die, the watchdog daemon restarts the client-core. Even though
the client-core is restarted "right away", there is a period of
time during such an event that the client-core is dead. A dead client-core
can't be triggered by the Orangefs file_operations.poll function.
Ops that pass through service_operation during a "dead spell" can time out
on the wait queue and one attempt is made to recycle them. Obviously,
if the client-core stays dead too long, the arbitrary userspace processes
trying to use Orangefs will be negatively affected. Waiting ops
that can't be serviced will be removed from the request list and
have their states set to "given up". In-progress ops that can't
be serviced will be removed from the in_progress hash table and
have their states set to "given up".

Readdir and I/O ops are atypical with respect to their payloads.

  - readdir ops use the smaller of the two pre-allocated pre-partitioned
    memory buffers. The readdir buffer is only available to userspace.
    The kernel module obtains an index to a free partition before launching
    a readdir op. Userspace deposits the results into the indexed partition
    and then writes them back to the pvfs device.

  - io (read and write) ops use the larger of the two pre-allocated
    pre-partitioned memory buffers. The IO buffer is accessible from
    both userspace and the kernel module. The kernel module obtains an
    index to a free partition before launching an io op. The kernel module
    deposits write data into the indexed partition, to be consumed
    directly by userspace. Userspace deposits the results of read
    requests into the indexed partition, to be consumed directly
    by the kernel module.
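
Obtaining "an index to a free partition" can be pictured as a scan over an
int array such as buffer_index_array. The sketch below is an assumption about
the general technique, not the kernel's code: the function names are
hypothetical, a value of 0 is taken to mean "free", and the spinlock that
protects the real array is omitted because the example is single-threaded.

```c
#define SKETCH_SLOTS 10  /* desc_count for the IO buffer */

/*
 * Claim the first free partition and return its index,
 * or -1 when every partition is in use.
 */
static int sketch_get_slot(int *index_array, int count)
{
    for (int i = 0; i < count; i++) {
        if (index_array[i] == 0) {
            index_array[i] = 1;  /* mark the partition as in use */
            return i;
        }
    }
    return -1;
}

/* Release a partition once the op that used it has completed. */
static void sketch_put_slot(int *index_array, int count, int i)
{
    if (i >= 0 && i < count)
        index_array[i] = 0;
}
```

A caller that receives -1 has to wait for another op to release its
partition before it can launch a new readdir or io op.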

Responses to kernel requests are all packaged in pvfs2_downcall_t
structs. Besides a few other members, pvfs2_downcall_t contains a
union of structs, each of which is associated with a particular
response type.

The several members outside of the union are:

 ``int32_t type``
    - type of operation.
 ``int32_t status``
    - return code for the operation.
 ``int64_t trailer_size``
    - 0 unless readdir operation.
 ``char *trailer_buf``
    - initialized to NULL, used during readdir operations.

The appropriate member inside the union is filled out for any
particular response.

  PVFS2_VFS_OP_FILE_IO
    fill a pvfs2_io_response_t

  PVFS2_VFS_OP_LOOKUP
    fill a PVFS_object_kref

  PVFS2_VFS_OP_CREATE
    fill a PVFS_object_kref

  PVFS2_VFS_OP_SYMLINK
    fill a PVFS_object_kref

  PVFS2_VFS_OP_GETATTR
    fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need);
    fill in a string with the link target when the object is a symlink.

  PVFS2_VFS_OP_MKDIR
    fill a PVFS_object_kref

  PVFS2_VFS_OP_STATFS
    fill a pvfs2_statfs_response_t with useless info <g>. It is hard for
    us to know, in a timely fashion, these statistics about our
    distributed network filesystem.

  PVFS2_VFS_OP_FS_MOUNT
    fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref
    except its members are in a different order and "__pad1" is replaced
    with "id".

  PVFS2_VFS_OP_GETXATTR
    fill a pvfs2_getxattr_response_t

  PVFS2_VFS_OP_LISTXATTR
    fill a pvfs2_listxattr_response_t

  PVFS2_VFS_OP_PARAM
    fill a pvfs2_param_response_t

  PVFS2_VFS_OP_PERF_COUNT
    fill a pvfs2_perf_count_response_t

  PVFS2_VFS_OP_FSKEY
    fill a pvfs2_fs_key_response_t

  PVFS2_VFS_OP_READDIR
    jam everything needed to represent a pvfs2_readdir_response_t into
    the readdir buffer descriptor specified in the upcall.

Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests
made by the kernel side.

A buffer_list containing:

  - a pointer to the prepared response to the request from the
    kernel (struct pvfs2_downcall_t).
  - and also, in the case of a readdir request, a pointer to a
    buffer containing descriptors for the objects in the target
    directory.

... is sent to the function (PINT_dev_write_list) which performs
the writev.

PINT_dev_write_list has a local iovec array: struct iovec io_array[10];

The first four elements of io_array are initialized like this for all
responses::

  io_array[0].iov_base = address of local variable "proto_ver" (int32_t)
  io_array[0].iov_len = sizeof(int32_t)

  io_array[1].iov_base = address of global variable "pdev_magic" (int32_t)
  io_array[1].iov_len = sizeof(int32_t)

  io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t)
  io_array[2].iov_len = sizeof(int64_t)

  io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t)
                         of global variable vfs_request (vfs_request_t)
  io_array[3].iov_len = sizeof(pvfs2_downcall_t)

Readdir responses initialize the fifth element of io_array like this::

  io_array[4].iov_base = contents of member trailer_buf (char *)
                         from out_downcall member of global variable
                         vfs_request
  io_array[4].iov_len = contents of member trailer_size (PVFS_size)
                        from out_downcall member of global variable
                        vfs_request
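
The iovec layout above might be assembled roughly as follows. This is a
sketch, not the code of PINT_dev_write_list: ``struct sketch_downcall`` is a
hypothetical stand-in holding only the members outside the union, and using
a non-zero trailer_size to detect a readdir response is an assumption based
on the member descriptions earlier in this section.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-in; the real pvfs2_downcall_t also has a union. */
struct sketch_downcall {
    int32_t type;
    int32_t status;
    int64_t trailer_size;
    char *trailer_buf;
};

/*
 * Fill io_array (room for at least five entries) and return how
 * many entries should be handed to writev().
 */
static int sketch_fill_iov(struct iovec *io_array, int32_t *proto_ver,
                           int32_t *magic, int64_t *tag,
                           struct sketch_downcall *dc)
{
    int n = 4;

    io_array[0].iov_base = proto_ver;
    io_array[0].iov_len = sizeof(*proto_ver);
    io_array[1].iov_base = magic;
    io_array[1].iov_len = sizeof(*magic);
    io_array[2].iov_base = tag;
    io_array[2].iov_len = sizeof(*tag);
    io_array[3].iov_base = dc;
    io_array[3].iov_len = sizeof(*dc);

    /* Readdir responses carry a trailer as a fifth element. */
    if (dc->trailer_size > 0 && dc->trailer_buf != NULL) {
        io_array[4].iov_base = dc->trailer_buf;
        io_array[4].iov_len = (size_t)dc->trailer_size;
        n = 5;
    }
    return n;
}
```

The returned count and the array would then be passed to writev() on the
file descriptor of the pseudo device.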

Orangefs exploits the dcache in order to avoid sending redundant
requests to userspace. We keep object inode attributes up-to-date with
orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to
help it decide whether or not to update an inode: "new" and "bypass".
Orangefs keeps private data in an object's inode that includes a short
timeout value, getattr_time, which allows any invocation of
orangefs_inode_getattr to know how long it has been since the inode was
updated. When the object is not new (new == 0) and the bypass flag is not
set (bypass == 0), orangefs_inode_getattr returns without updating the inode
if getattr_time has not timed out. Getattr_time is updated each time the
inode is updated.
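
The decision described above reduces to a short predicate. The sketch below
is hypothetical: the function name is invented, and it assumes that
getattr_time holds the tick count at which the cached attributes expire,
which the text does not state explicitly.

```c
#include <stdbool.h>

/*
 * Return true when the inode attributes should be fetched again.
 * now and getattr_time are tick counts (jiffies in the kernel).
 */
static bool sketch_needs_refresh(bool is_new, bool bypass,
                                 unsigned long now,
                                 unsigned long getattr_time)
{
    /* New objects and explicit bypass requests always refresh. */
    if (is_new || bypass)
        return true;

    /* Otherwise refresh only once the deadline has been reached. */
    return (long)(now - getattr_time) >= 0;
}
```

When the predicate is false the cached attributes are used and no request is
sent to userspace.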

Creation of a new object (file, dir, symlink) includes the evaluation of
its pathname, resulting in a negative directory entry for the object.
A new inode is allocated and associated with the dentry, turning it from
a negative dentry into a "productive full member of society". Orangefs
obtains the new inode from Linux with new_inode() and associates
the inode with the dentry by sending the pair back to Linux with
d_instantiate().

The evaluation of a pathname for an object resolves to its corresponding
dentry. If there is no corresponding dentry, one is created for it in
the dcache. Whenever a dentry is modified or verified, Orangefs stores a
short timeout value in the dentry's d_time, and the dentry will be trusted
for that amount of time. Orangefs is a network filesystem, and objects
can potentially change out-of-band with any particular Orangefs kernel module
instance, so trusting a dentry is risky. The alternative to trusting
dentries is to always obtain the needed information from userspace - at
least a trip to the client-core, maybe to the servers. Obtaining information
from a dentry is cheap; obtaining it from userspace is relatively expensive,
hence the motivation to use the dentry when possible.

The timeout values d_time and getattr_time are jiffy based, and the
code is designed to avoid the jiffy-wrap problem::

    "In general, if the clock may have wrapped around more than once, there
    is no way to tell how much time has elapsed. However, if the times t1
    and t2 are known to be fairly close, we can reliably compute the
    difference in a way that takes into account the possibility that the
    clock may have wrapped between times."

from course notes by instructor Andy Wang
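
The wrap-safe computation in the quotation relies on unsigned subtraction,
which is performed modulo the counter range. The sketch below illustrates the
general technique with hypothetical function names; it is not a copy of the
kernel's jiffies helpers, and it assumes the two times are less than half the
counter range apart.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Ticks elapsed from t1 to t2. Correct even if the counter wrapped
 * once in between, because unsigned arithmetic is modular.
 */
static unsigned long sketch_elapsed(unsigned long t1, unsigned long t2)
{
    return t2 - t1;
}

/*
 * True if a is later than b. The signed view of the difference gives
 * the right answer as long as a and b are fairly close together.
 */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}
```

For example, a timestamp taken six ticks before the counter wraps and one
taken ten ticks later compare correctly with these helpers, while a plain
``>`` comparison of the raw values would get the order wrong.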