Direct Access for files
-----------------------

Motivation
----------

The page cache is usually used to buffer reads and writes to files.
It is also used to provide the pages which are mapped into userspace
by a call to mmap.

For block devices that are memory-like, the page cache pages would be
unnecessary copies of the original storage. The DAX code removes the
extra copy by performing reads and writes directly to the storage device.
For file mappings, the storage device is mapped directly into userspace.


Usage
-----

If you have a block device which supports DAX, you can make a filesystem
on it as usual. The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
size when creating the filesystem.

Currently 3 filesystems support DAX: ext2, ext4 and xfs. The way DAX is
enabled differs between them.

Enabling DAX on ext2
--------------------

When mounting the filesystem, use the "-o dax" option on the command line or
add 'dax' to the options in /etc/fstab. This enables DAX on all files
within the filesystem. It is equivalent to the '-o dax=always' behavior below.
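
Before mounting with a dax option, it can be useful to confirm the block-size
requirement described under Usage. The following is a minimal userspace sketch,
not part of the DAX code: the function name is made up for illustration, and it
assumes that statvfs() reports the filesystem block size in f_bsize, which holds
for ext2, ext4 and xfs but not necessarily for every filesystem.

```c
#include <sys/statvfs.h>
#include <unistd.h>

/*
 * Compare the block size of the filesystem containing 'path' with the
 * page size of the running kernel.
 *
 * Returns 1 if they match, 0 if they differ, and -1 if either value
 * could not be read. The function name is hypothetical.
 */
int fs_block_size_matches_page_size(const char *path)
{
	struct statvfs sv;
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size <= 0)
		return -1;
	if (statvfs(path, &sv) != 0)
		return -1;

	/* f_bsize is the block size as reported by the filesystem. */
	return sv.f_bsize == (unsigned long)page_size ? 1 : 0;
}
```

A return value of 0 for a mounted ext2, ext4 or xfs filesystem suggests it was
created with a block size that the DAX code does not currently support.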


Enabling DAX on xfs and ext4
----------------------------

Summary
-------

 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
    the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details
    about this access mode.

 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
    files and directories. This advisory flag can be set or cleared at any
    time, but doing so does not immediately affect the S_DAX state.

 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
    be inherited by all regular files and subdirectories that are subsequently
    created in this directory. Files and subdirectories that exist at the time
    this flag is set or cleared on the parent directory are not modified by
    this modification of the parent directory.

 4. There exist dax mount options which can override FS_XFLAG_DAX in the
    setting of the S_DAX flag. Given underlying storage which supports DAX the
    following hold:

    "-o dax=inode"  means "follow FS_XFLAG_DAX" and is the default.

    "-o dax=never"  means "never set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax=always" means "always set S_DAX, ignore FS_XFLAG_DAX."

    "-o dax"        is a legacy option which is an alias for "dax=always".
                    This may be removed in the future so "-o dax=always" is
                    the preferred method for specifying this behavior.

    NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
    the same even when the filesystem is mounted with a dax option. However,
    in-core inode state (S_DAX) will be overridden until the filesystem is
    remounted with dax=inode and the inode is evicted from kernel memory.

 5. The S_DAX policy can be changed via:

    a) Setting the parent directory FS_XFLAG_DAX as needed before files are
       created

    b) Setting the appropriate dax="foo" mount option

    c) Changing the FS_XFLAG_DAX flag on existing regular files and
       directories. This has runtime constraints and limitations that are
       described in 6) below.

 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
    the change in behaviour for existing regular files may not occur
    immediately.
    If the change must take effect immediately, the administrator
    needs to:

    a) stop the application so there are no active references to the data set
       the policy change will affect

    b) evict the data set from kernel caches so it will be re-instantiated when
       the application is restarted. This can be achieved by:

       i. drop-caches
       ii. a filesystem unmount and mount cycle
       iii. a system reboot


Details
-------

There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX)
and the other is a volatile flag indicating the active state of the feature
(S_DAX).

FS_XFLAG_DAX is preserved within the filesystem. This persistent config
setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
(see ioctl_xfs_fsgetxattr(2)) or a utility such as 'xfs_io'.

New files and directories automatically inherit FS_XFLAG_DAX from
their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
directory creation time can be used to set a default behavior for an entire
sub-tree.
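
The FS_IOC_FS[GS]ETXATTR interface mentioned above can be driven from a C
program roughly as follows. This is a sketch rather than a reference
implementation: the helper names are invented for illustration, the fallback
value of FS_XFLAG_DAX is taken from <linux/fs.h> for headers that predate it,
and the calls fail on filesystems that do not implement these ioctls.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Fallback for older kernel headers; value as defined in <linux/fs.h>. */
#ifndef FS_XFLAG_DAX
#define FS_XFLAG_DAX 0x00008000
#endif

/* Close 'fd' without disturbing errno and pass 'ret' through. */
static int close_keep_errno(int fd, int ret)
{
	int saved = errno;

	close(fd);
	errno = saved;
	return ret;
}

/* Returns 1 if FS_XFLAG_DAX is set on 'path', 0 if clear, -1 on error. */
int dax_xflag_get(const char *path)
{
	struct fsxattr fsx;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return close_keep_errno(fd, -1);
	return close_keep_errno(fd, (fsx.fsx_xflags & FS_XFLAG_DAX) ? 1 : 0);
}

/* Sets (enable != 0) or clears FS_XFLAG_DAX on 'path'. Returns 0 or -1. */
int dax_xflag_set(const char *path, int enable)
{
	struct fsxattr fsx;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	/* Read-modify-write so that the other xflags are preserved. */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return close_keep_errno(fd, -1);
	if (enable)
		fsx.fsx_xflags |= FS_XFLAG_DAX;
	else
		fsx.fsx_xflags &= ~(__u32)FS_XFLAG_DAX;
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		return close_keep_errno(fd, -1);
	return close_keep_errno(fd, 0);
}
```

Calling dax_xflag_set() on a directory is intended to have the same effect as
"xfs_io -c 'chattr +x'" on that directory. As described in the Summary, this
changes only the persistent flag and does not immediately change S_DAX.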

To clarify inheritance, here are 3 examples:

Example A:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a
mkdir a/b/c/d
mkdir a/e

        dax: a,e
        no dax: b,c,d

Example B:

mkdir a
xfs_io -c 'chattr +x' a
mkdir -p a/b/c/d

        dax: a,b,c,d
        no dax:

Example C:

mkdir -p a/b/c
xfs_io -c 'chattr +x' a/b/c
mkdir a/b/c/d

        dax: c,d
        no dax: a,b


The current enabled state (S_DAX) is set when a file inode is instantiated in
memory by the kernel. It is set based on the underlying media support, the
value of FS_XFLAG_DAX and the filesystem's dax mount option.

statx can be used to query S_DAX. NOTE that only regular files will ever have
S_DAX set and therefore statx will never indicate that S_DAX is set on
directories.

Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
if the underlying media does not support dax and/or the filesystem is
overridden with a mount option.



Implementation Tips for Block Driver Writers
--------------------------------------------

To support DAX in your block driver, implement the 'direct_access'
block device operation. It is used to translate the sector number
(expressed in units of 512-byte sectors) to a page frame number (pfn)
that identifies the physical page for the memory. It also returns a
kernel virtual address that can be used to access the memory.

The direct_access method takes a 'size' parameter that indicates the
number of bytes being requested. The function should return the number
of bytes that can be contiguously accessed at that offset. It may also
return a negative errno if an error occurs.

In order to support this method, the storage must be byte-accessible by
the CPU at all times. If your device uses paging techniques to expose
a large amount of memory through a smaller window, then you cannot
implement direct_access. Equally, if your device can occasionally
stall the CPU for an extended period, you should also not attempt to
implement direct_access.
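
The return-value contract described above can be illustrated with a userspace
toy model. This is not kernel code and does not use the kernel's direct_access
signature: struct toy_dev, toy_direct_access() and TOY_SECTOR_SIZE are invented
for illustration, the pfn output is left out, and the "device" is assumed to be
a single contiguous memory region.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_SECTOR_SIZE 512L

/* Hypothetical userspace stand-in for a memory-like device. */
struct toy_dev {
	char *base;	/* start of the byte-addressable region */
	long size;	/* length of the region in bytes */
};

/*
 * Translate 'sector' into an address inside the region.
 *
 * On success, stores the address in *kaddr and returns the number of
 * bytes that are contiguously accessible from that address. This may be
 * more or less than the 'size' requested; the caller is expected to
 * clamp. Returns a negative errno on error.
 */
long toy_direct_access(struct toy_dev *dev, long sector, long size,
		       void **kaddr)
{
	long offset;

	if (size <= 0)
		return -EINVAL;
	if (sector < 0 || sector >= dev->size / TOY_SECTOR_SIZE)
		return -ERANGE;

	offset = sector * TOY_SECTOR_SIZE;
	*kaddr = dev->base + offset;

	/* Everything from 'offset' to the end of the region is contiguous. */
	return dev->size - offset;
}
```

A device made of several discontiguous regions would instead return only the
bytes remaining in the region that contains the requested sector.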

These block devices may be used for inspiration:
- brd: RAM backed block device driver
- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver


Implementation Tips for Filesystem Writers
------------------------------------------

Filesystem support consists of
- adding support to mark inodes as being DAX by setting the S_DAX flag in
  i_flags
- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
  when the inode has the S_DAX flag set
- implementing an mmap file operation for DAX files which sets the
  VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
  include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
  handlers should probably call dax_iomap_fault() passing the appropriate
  fault size and iomap operations.
- calling iomap_zero_range() passing appropriate iomap operations instead of
  block_truncate_page() for DAX files
- ensuring that there is sufficient locking between reads, writes,
  truncates and page faults

The iomap handlers for allocating blocks must make sure that allocated blocks
are zeroed out and converted to written extents before being returned to avoid
exposure of uninitialized data through mmap.

These filesystems may be used for inspiration:
- ext2: see Documentation/filesystems/ext2.rst
- ext4: see Documentation/filesystems/ext4/
- xfs:  see Documentation/admin-guide/xfs.rst


Handling Media Errors
---------------------

The libnvdimm subsystem stores a record of known media error locations for
each pmem block device (in gendisk->badblocks). If we fault at such a location,
or one with a latent error not yet discovered, the application can expect
to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
writing the affected sectors (through the pmem driver, and if the underlying
NVDIMM supports the clear_poison DSM defined by ACPI).

Since DAX IO normally doesn't go through the driver/bio path, applications or
sysadmins have an option to restore the lost data from a prior backup/inbuilt
redundancy in the following ways:

1. Delete the affected file, and restore from a backup (sysadmin route):
   This will free the filesystem blocks that were being used by the file,
   and the next time they're allocated, they will be zeroed first, which
   happens through the driver, and will clear bad sectors.

2. Truncate or hole-punch the part of the file that has a bad-block (at least
   an entire aligned sector has to be hole-punched, but not necessarily an
   entire filesystem block).

These are the two basic paths that allow DAX filesystems to continue operating
in the presence of media errors. More robust error recovery mechanisms can be
built on top of this in the future, for example, involving redundancy/mirroring
provided at the block layer through DM, or additionally, at the filesystem
level. These would have to rely on the above two tenets, that error clearing
can happen either by sending an IO through the driver, or zeroing (also through
the driver).


Shortcomings
------------

Even if the kernel or its modules are stored on a filesystem that supports
DAX on a block device that supports DAX, they will still be copied into RAM.

The DAX code does not work correctly on architectures which have virtually
mapped caches such as ARM, MIPS and SPARC.

Calling get_user_pages() on a range of user memory that has been mmapped
from a DAX file will fail when there are no 'struct page' to describe
those pages.
This problem has been addressed in some device drivers
by adding optional struct page support for pages under the control of
the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of
how to do this). In the non struct page cases O_DIRECT reads/writes to
those memory ranges from a non-DAX file will fail (note that O_DIRECT
reads/writes _of a DAX file_ do work, it is the memory that is being
accessed that is key here). Other things that will not work in the
non struct page case include RDMA, sendfile() and splice().