Direct Access for files
-----------------------

Motivation
----------

The page cache is usually used to buffer reads and writes to files.
It is also used to provide the pages which are mapped into userspace
by a call to mmap.

For block devices that are memory-like, the page cache pages would be
unnecessary copies of the original storage.  The DAX code removes the
extra copy by performing reads and writes directly to the storage device.
For file mappings, the storage device is mapped directly into userspace.


Usage
-----

If you have a block device which supports DAX, you can make a filesystem
on it as usual.  The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
size when creating the filesystem.

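As a rough sketch of the block size requirement above, the following shell
helper compares a candidate block size with the page size reported by
getconf.  The helper name check_dax_block_size and the device /dev/pmem0
in the comments are placeholders chosen for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given filesystem block size
# (in bytes) equals the page size of the running kernel.
check_dax_block_size() {
    block_size=$1
    page_size=$(getconf PAGESIZE)
    [ "$block_size" -eq "$page_size" ]
}

# Creating the filesystem with an explicit block size (run as root,
# /dev/pmem0 is a placeholder device name):
#   mkfs.ext4 -b "$(getconf PAGESIZE)" /dev/pmem0
#   mkfs.xfs -b size="$(getconf PAGESIZE)" /dev/pmem0
```

The check only reads the page size; the mkfs commands are left as comments
because they destroy any data on the target device.
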
Currently 3 filesystems support DAX: ext2, ext4 and xfs.  The way DAX is
enabled differs between them.

Enabling DAX on ext2
--------------------

When mounting the filesystem, use the "-o dax" option on the command line or
add 'dax' to the options in /etc/fstab.  This enables DAX on all files
within the filesystem.  It is equivalent to the '-o dax=always' behavior below.


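For illustration, an /etc/fstab entry for the ext2 case could be generated
as below.  The helper name dax_fstab_line, the device name and the mount
point are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: print an /etc/fstab line that mounts an ext2
# filesystem with the "dax" option.  Fields are: device, mount point,
# filesystem type, mount options, dump frequency, fsck pass number.
dax_fstab_line() {
    device=$1
    mountpoint=$2
    printf '%s %s ext2 dax 0 0\n' "$device" "$mountpoint"
}

# One-off mount from the command line (run as root, placeholder names):
#   mount -t ext2 -o dax /dev/pmem0 /mnt/pmem
```

Calling dax_fstab_line /dev/pmem0 /mnt/pmem prints a single line that can
be reviewed before it is appended to /etc/fstab.

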
Enabling DAX on xfs and ext4
----------------------------

Summary
-------

 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
    the statx flag STATX_ATTR_DAX.  See the manpage for statx(2) for details
    about this access mode.

 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
    files and directories. This advisory flag can be set or cleared at any
    time, but doing so does not immediately affect the S_DAX state.

 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
    be inherited by all regular files and subdirectories that are subsequently
    created in this directory. Files and subdirectories that exist at the time
    this flag is set or cleared on the parent directory are not affected by
    the change to the parent directory.

 4. There exist dax mount options which can override FS_XFLAG_DAX in the
    setting of the S_DAX flag.  Given underlying storage which supports DAX the
    following hold:

    "-o dax=inode"  means "follow FS_XFLAG_DAX" and is the default.

    "-o dax=never"  means "never set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax=always" means "always set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax"        is a legacy option which is an alias for "dax=always".
                    This may be removed in the future so "-o dax=always" is
                    the preferred method for specifying this behavior.

    NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
    the same even when the filesystem is mounted with a dax option.  However,
    in-core inode state (S_DAX) will be overridden until the filesystem is
    remounted with dax=inode and the inode is evicted from kernel memory.

 5. The S_DAX policy can be changed via:

    a) Setting the parent directory FS_XFLAG_DAX as needed before files are
       created

    b) Setting the appropriate dax="foo" mount option

    c) Changing the FS_XFLAG_DAX flag on existing regular files and
       directories.  This has runtime constraints and limitations that are
       described in 6) below.

 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
    the change in behavior for existing regular files may not occur
    immediately.  If the change must take effect immediately, the administrator
    needs to:

    a) stop the application so there are no active references to the data set
       the policy change will affect

    b) evict the data set from kernel caches so it will be re-instantiated when
       the application is restarted. This can be achieved by:

       i. drop-caches
       ii. a filesystem unmount and mount cycle
       iii. a system reboot


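The mount options in item 4 and the eviction steps in item 6 can be sketched
in shell as follows.  The helper name dax_mount_option and all paths and
device names are placeholders for this example.  Writing 2 to
/proc/sys/vm/drop_caches asks the kernel to reclaim dentries and inodes, but
it cannot evict inodes that are still in use, which is why the application
has to be stopped first.

```shell
#!/bin/sh
# Hypothetical helper: turn a policy name into one of the dax mount
# options listed in item 4, rejecting anything that is not documented.
dax_mount_option() {
    case $1 in
        inode|never|always) printf 'dax=%s\n' "$1" ;;
        *) echo "unknown dax mode: $1" >&2; return 1 ;;
    esac
}

# Mounting with an explicit policy (run as root, placeholder names):
#   mount -o "$(dax_mount_option always)" /dev/pmem0 /mnt/pmem
#
# Possible sequence for item 6 (run as root, placeholder paths):
#   xfs_io -c 'chattr +x' /mnt/pmem/data/file   # toggle FS_XFLAG_DAX
#   ... stop the application that uses the file ...
#   sync
#   echo 2 > /proc/sys/vm/drop_caches           # reclaim dentries and inodes
```

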
Details
-------

There are 2 per-file dax flags.  One is a persistent inode setting (FS_XFLAG_DAX)
and the other is a volatile flag indicating the active state of the feature
(S_DAX).

FS_XFLAG_DAX is preserved within the filesystem.  This persistent config
setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
(see ioctl_xfs_fsgetxattr(2)) or a utility such as 'xfs_io'.

New files and directories automatically inherit FS_XFLAG_DAX from
their parent directory _when_ _created_.  Therefore, setting FS_XFLAG_DAX at
directory creation time can be used to set a default behavior for an entire
sub-tree.

To clarify inheritance, here are 3 examples:

Example A:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a
mkdir a/b/c/d
mkdir a/e

        dax: a,e
        no dax: b,c,d

Example B:

mkdir a
xfs_io -c 'chattr +x' a
mkdir -p a/b/c/d

        dax: a,b,c,d
        no dax:

Example C:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a/b/c
mkdir a/b/c/d

        dax: c,d
        no dax: a,b


The current enabled state (S_DAX) is set when a file inode is instantiated in
memory by the kernel.  It is set based on the underlying media support, the
value of FS_XFLAG_DAX and the filesystem's dax mount option.

statx can be used to query S_DAX.  NOTE that only regular files will ever have
S_DAX set and therefore statx will never indicate that S_DAX is set on
directories.

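A minimal sketch of decoding the statx result: given the numeric
stx_attributes value of a file, the helper below tests the STATX_ATTR_DAX
bit (0x00200000 in <linux/stat.h>).  The helper name has_dax_attr is
hypothetical, and how the attributes value is obtained (for example from the
statx command of xfs_io, or from a small program calling statx(2)) is left
to the caller.

```shell
#!/bin/sh
# Bit value of STATX_ATTR_DAX as defined in <linux/stat.h>.
STATX_ATTR_DAX=$((0x00200000))

# Hypothetical helper: succeed if the DAX bit is set in the given
# stx_attributes value (decimal or 0x-prefixed hexadecimal).
has_dax_attr() {
    [ $(( $1 & $STATX_ATTR_DAX )) -ne 0 ]
}
```

For example, has_dax_attr 0x00200000 succeeds, while has_dax_attr 0x0
fails.  A value read from a directory is expected to fail the check, as
noted above.
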
Setting the FS_XFLAG_DAX flag (explicitly or through inheritance) occurs even
if the underlying media does not support dax and/or the filesystem is
overridden with a mount option.


Implementation Tips for Block Driver Writers
--------------------------------------------

To support DAX in your block driver, implement the 'direct_access'
block device operation.  It is used to translate the sector number
(expressed in units of 512-byte sectors) to a page frame number (pfn)
that identifies the physical page for the memory.  It also returns a
kernel virtual address that can be used to access the memory.

The direct_access method takes a 'size' parameter that indicates the
number of bytes being requested.  The function should return the number
of bytes that can be contiguously accessed at that offset.  It may also
return a negative errno if an error occurs.

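The sector-to-pfn translation can be illustrated with plain arithmetic.
This is not driver code: sector_to_pfn is a hypothetical shell helper, and
the physical base address of the device memory and the 4096-byte default
page size are assumptions chosen for the example.

```shell
#!/bin/sh
# Hypothetical illustration: compute the page frame number that backs a
# 512-byte sector, for a device whose memory starts at physical address
# $1.  $2 is the sector number, $3 the page size (defaults to 4096).
sector_to_pfn() {
    phys_base=$1
    sector=$2
    page_size=${3:-4096}
    echo $(( ($phys_base + $sector * 512) / $page_size ))
}
```

With a base address of 0x100000000 and 4096-byte pages, sectors 0 to 7 map
to pfn 1048576 and sector 8 is the first sector of pfn 1048577.
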
In order to support this method, the storage must be byte-accessible by
the CPU at all times.  If your device uses paging techniques to expose
a large amount of memory through a smaller window, then you cannot
implement direct_access.  Equally, if your device can occasionally
stall the CPU for an extended period, you should also not attempt to
implement direct_access.

These block devices may be used for inspiration:
- brd: RAM backed block device driver
- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver


Implementation Tips for Filesystem Writers
------------------------------------------

Filesystem support consists of
- adding support to mark inodes as being DAX by setting the S_DAX flag in
  i_flags
- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
  when the inode has the S_DAX flag set
- implementing an mmap file operation for DAX files which sets the
  VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
  include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
  handlers should probably call dax_iomap_fault() passing the appropriate
  fault size and iomap operations.
- calling iomap_zero_range() passing appropriate iomap operations instead of
  block_truncate_page() for DAX files
- ensuring that there is sufficient locking between reads, writes,
  truncates and page faults

The iomap handlers for allocating blocks must make sure that allocated blocks
are zeroed out and converted to written extents before being returned to avoid
exposure of uninitialized data through mmap.

These filesystems may be used for inspiration:
- ext2: see Documentation/filesystems/ext2.rst
- ext4: see Documentation/filesystems/ext4/
- xfs:  see Documentation/admin-guide/xfs.rst


Handling Media Errors
---------------------

The libnvdimm subsystem stores a record of known media error locations for
each pmem block device (in gendisk->badblocks). If we fault at such a location,
or one with a latent error not yet discovered, the application can expect
to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
writing the affected sectors (through the pmem driver, and if the underlying
NVDIMM supports the clear_poison DSM defined by ACPI).

Since DAX IO normally doesn't go through the driver/bio path, applications or
sysadmins have an option to restore the lost data from a prior backup/inbuilt
redundancy in the following ways:

1. Delete the affected file, and restore from a backup (sysadmin route):
   This will free the filesystem blocks that were being used by the file,
   and the next time they're allocated, they will be zeroed first, which
   happens through the driver, and will clear bad sectors.

2. Truncate or hole-punch the part of the file that has a bad-block (at least
   an entire aligned sector has to be hole-punched, but not necessarily an
   entire filesystem block).

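For the hole-punch route, the byte range has to be widened to whole aligned
sectors first.  The sketch below does that arithmetic, assuming 512-byte
sectors.  The helper name aligned_punch_range and the file name in the
comment are hypothetical; the actual punch would be done with a tool such as
fallocate(1).

```shell
#!/bin/sh
# Hypothetical helper: print "offset length" of the smallest range of
# whole 512-byte sectors that covers the bytes [start, start + len).
aligned_punch_range() {
    start=$1
    len=$2
    sector=512
    lo=$(( $start / $sector * $sector ))                         # round down
    hi=$(( ($start + $len + $sector - 1) / $sector * $sector ))  # round up
    echo "$lo $(( $hi - $lo ))"
}

# Punching the resulting range (placeholder file name):
#   set -- $(aligned_punch_range 1000 100)
#   fallocate --punch-hole --offset "$1" --length "$2" /mnt/pmem/data/file
```

For a damaged byte range starting at 1000 with length 100, the helper prints
"512 1024", which covers the two sectors that the range touches.
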
These are the two basic paths that allow DAX filesystems to continue operating
in the presence of media errors. More robust error recovery mechanisms can be
built on top of this in the future, for example, involving redundancy/mirroring
provided at the block layer through DM, or additionally, at the filesystem
level. These would have to rely on the above two tenets, that error clearing
can happen either by sending an IO through the driver, or zeroing (also through
the driver).


Shortcomings
------------

Even if the kernel or its modules are stored on a filesystem that supports
DAX on a block device that supports DAX, they will still be copied into RAM.

The DAX code does not work correctly on architectures which have virtually
mapped caches such as ARM, MIPS and SPARC.

Calling get_user_pages() on a range of user memory that has been mmapped
from a DAX file will fail when there are no 'struct page' to describe
those pages.  This problem has been addressed in some device drivers
by adding optional struct page support for pages under the control of
the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of
how to do this). In the non struct page cases O_DIRECT reads/writes to
those memory ranges from a non-DAX file will fail (note that O_DIRECT
reads/writes _of a DAX file_ do work, it is the memory that is being
accessed that is key here).  Other things that will not work in the
non struct page case include RDMA, sendfile() and splice().