.. SPDX-License-Identifier: GPL-2.0

================================================
ZoneFS - Zone filesystem for Zoned block devices
================================================

Introduction
============

zonefs is a very simple file system exposing each zone of a zoned block device
as a file. Unlike a regular POSIX-compliant file system with native zoned block
device support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing sequential
write zones of the device must be written sequentially starting from the end
of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access interface
than to a full-featured POSIX file system. The goal of zonefs is to simplify
the implementation of zoned block device support in applications by replacing
raw block device file accesses with a richer file API, avoiding relying on
direct block device file ioctls which may be more obscure to developers. One
example of this approach is the implementation of LSM (log-structured merge)
tree structures (such as used in RocksDB and LevelDB) on zoned block devices
by allowing SSTables to be stored in a zone file similarly to a regular file
system rather than as a range of sectors of the entire disk.
The introduction
of the higher level construct "one file is one zone" can help reduce the
number of changes needed in the application and make it easier to introduce
support for different application programming languages.

Zoned block devices
-------------------

Zoned storage devices belong to a class of storage devices with an address
space that is divided into zones. A zone is a group of consecutive LBAs and all
zones are contiguous (there are no LBA gaps). Zones may have different types.

* Conventional zones: there are no access constraints to LBAs belonging to
  conventional zones. Any read or write access can be executed, similarly to a
  regular block device.
* Sequential zones: these zones accept random reads but must be written
  sequentially. Each sequential zone has a write pointer maintained by the
  device that keeps track of the mandatory start LBA position of the next write
  to the device. As a result of this write constraint, LBAs in a sequential zone
  cannot be overwritten. Sequential zones must first be erased using a special
  command (zone reset) before rewriting.

Zoned storage devices can be implemented using various recording and media
technologies. The most common form of zoned storage today uses the SCSI Zoned
Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
Magnetic Recording (SMR) HDDs.

Solid State Disks (SSD) storage devices can also implement a zoned interface
to, for instance, reduce internal write amplification due to garbage collection.
The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
committee aiming at adding a zoned storage interface to the NVMe protocol.

Zonefs Overview
===============

Zonefs exposes the zones of a zoned block device as files. The files
representing zones are grouped by zone type, with each zone type represented
by a sub-directory. This file structure is built entirely using zone information
provided by the device and so does not require any complex on-disk metadata
structure.

On-disk metadata
----------------

zonefs on-disk metadata is reduced to an immutable super block which
persistently stores a magic number and optional feature flags and values. On
mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
and populates the mount point with a static file tree solely based on this
information. File sizes come from the device zone type and write pointer
position managed by the device itself.

The super block is always written on disk at sector 0. The first zone of the
device storing the super block is never exposed as a zone file by zonefs.
If
the zone containing the super block is a sequential zone, the mkzonefs format
tool always "finishes" the zone, that is, it transitions the zone to a full
state to make it read-only, preventing any data write.

Zone type sub-directories
-------------------------

Files representing zones of the same type are grouped together under the same
sub-directory automatically created on mount.

For conventional zones, the sub-directory "cnv" is used. This directory is
however created if and only if the device has usable conventional zones. If
the device only has a single conventional zone at sector 0, the zone will not
be exposed as a file as it will be used to store the zonefs super block. For
such devices, the "cnv" sub-directory will not be created.

For sequential write zones, the sub-directory "seq" is used.

These two directories are the only directories that exist in zonefs. Users
cannot create other directories and cannot rename nor delete the "cnv" and
"seq" sub-directories.

The size of a directory, as reported in the st_size field of struct stat
obtained with the stat() or fstat() system calls, indicates the number of files
existing under the directory.
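
The st_size convention for directories can be used to count zone files without
listing the directory. The following is a minimal Python sketch, not part of
zonefs or zonefs-tools: the function name ``zone_file_count`` and its
``stat_fn`` parameter are hypothetical, and the result is only meaningful when
the path is a "cnv" or "seq" directory of a mounted zonefs volume.

```python
import os


def zone_file_count(dir_path, stat_fn=os.stat):
    """Return the number of zone files under a zonefs sub-directory.

    Assumption: dir_path is the "cnv" or "seq" directory of a mounted
    zonefs volume, where st_size holds the file count. On other file
    systems the st_size of a directory has a different meaning.
    stat_fn is a hypothetical hook that lets the caller substitute
    another stat implementation, e.g. for testing.
    """
    return stat_fn(dir_path).st_size
```

Assuming the volume is mounted on /mnt, ``zone_file_count("/mnt/seq")`` would
return the number of sequential zone files.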

Zone files
----------

Zone files are named using the number of the zone they represent within the set
of zones of a particular type. That is, both the "cnv" and "seq" directories
contain files named "0", "1", "2", ... The file numbers also represent
increasing zone start sector on the device.

Read and write operations to zone files are not allowed beyond the file
maximum size, that is, beyond the zone capacity. Any access exceeding the zone
capacity is failed with the -EFBIG error.

Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.

The number of blocks of a file as reported by stat() and fstat() indicates the
capacity of the zone file, or in other words, the maximum file size.

Conventional zone files
-----------------------

The size of conventional zone files is fixed to the size of the zone they
represent. Conventional zone files cannot be truncated.

These files can be randomly read and written using any type of I/O operation:
buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O
constraints for these files beyond the file size limit mentioned above.
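
The file size limit described in the "Zone files" section can be checked in
user space before issuing an I/O. The sketch below is an illustration of the
documented rule only: ``check_zone_access`` is a hypothetical helper, and the
exact conditions under which the kernel returns -EFBIG may differ from this
simplified model.

```python
import errno
import os


def check_zone_access(offset, length, capacity):
    """Raise OSError(EFBIG) if the access extends beyond the zone capacity.

    offset, length and capacity are byte counts. capacity is the maximum
    file size of the zone file. This is a user-space pre-check that
    mirrors the documented rule, not the kernel implementation.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if offset + length > capacity:
        raise OSError(errno.EFBIG, os.strerror(errno.EFBIG))
```

For a 256 MB zone, ``check_zone_access(0, 4096, 268435456)`` returns silently,
while an access ending past byte 268435456 raises an OSError with errno set to
EFBIG.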

Sequential zone files
---------------------

The size of sequential zone files grouped in the "seq" sub-directory represents
the file's zone write pointer position relative to the zone start sector.

Sequential zone files can only be written sequentially, starting from the file
end, that is, write operations can only be append writes. Zonefs makes no
attempt at accepting random writes and will fail any write request that has a
start offset not corresponding to the end of the file, or to the end of the last
write issued and still in-flight (for asynchronous I/O operations).

Since dirty page writeback by the page cache does not guarantee a sequential
write pattern, zonefs prevents buffered writes and writeable shared mappings
on sequential files. Only direct I/O writes are accepted for these files.
zonefs relies on the sequential delivery of write I/O requests to the device
implemented by the block layer elevator. An elevator implementing the sequential
write feature for zoned block devices (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature)
must be used. This type of elevator (e.g. mq-deadline) is set by default
for zoned block devices on device initialization.

There are no restrictions on the type of I/O used for read operations in
sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
all accepted.
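
The append-only write rule can be illustrated with a toy in-memory model. This
is a sketch for reasoning about offsets only: the class ``SeqZoneFileModel`` is
hypothetical, performs no real I/O, handles synchronous writes only (it does not
track in-flight asynchronous writes), and raises ValueError where zonefs would
return an error code to the system call.

```python
class SeqZoneFileModel:
    """Toy model of a zonefs sequential zone file (no real I/O)."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum file size in bytes
        self.size = 0             # file size, mirrors the write pointer offset

    def write(self, offset, length):
        """Accept a write only if it starts exactly at the file end."""
        if offset != self.size:
            raise ValueError(
                f"write at offset {offset} rejected: file end is {self.size}"
            )
        if offset + length > self.capacity:
            raise ValueError("write would exceed the zone capacity")
        # An accepted write advances the file size by the written length.
        self.size += length
        return length
```

With ``z = SeqZoneFileModel(capacity=8192)``, the call ``z.write(0, 4096)``
succeeds and moves the file end to 4096. Repeating ``z.write(0, 4096)`` is then
rejected, because the only valid start offset is now 4096.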

Truncating sequential zone files is allowed only down to 0, in which case, the
zone is reset to rewind the file zone write pointer position to the start of
the zone, or up to the zone capacity, in which case the file's zone is
transitioned to the FULL state (finish zone operation).

Format options
--------------

Several optional features of zonefs can be enabled at format time.

* Conventional zone aggregation: ranges of contiguous conventional zones can be
  aggregated into a single larger file instead of the default one file per zone.
* File ownership: The owner UID and GID of zone files are by default 0 (root)
  but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.

IO error handling
-----------------

Zoned block devices may fail I/O requests for reasons similar to regular block
devices, e.g. due to bad sectors. However, in addition to such known I/O
failure patterns, the standards governing zoned block device behavior define
additional conditions that result in I/O errors.

* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
  While the data already written in the zone is still readable, the zone can
  no longer be written.
  No user action on the zone (zone management command or
  read/write access) can change the zone condition back to a normal read/write
  state. While the reasons for the device to transition a zone to read-only
  state are not defined by the standards, a typical cause for such transition
  would be a defective write head on an HDD (all zones under this head are
  changed to read-only).

* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
  An offline zone cannot be read nor written. No user action can transition an
  offline zone back to an operational good state. Similarly to zone read-only
  transitions, the reasons for a drive to transition a zone to the offline
  condition are undefined. A typical cause would be a defective read-write head
  on an HDD causing all zones on the platter under the broken head to be
  inaccessible.

* Unaligned write errors: These errors result from the host issuing write
  requests with a start sector that does not correspond to a zone write pointer
  position when the write request is executed by the device. Even though zonefs
  enforces sequential file write for sequential zones, unaligned write errors
  may still happen in the case of a partial failure of a very large direct I/O
  operation split into multiple BIOs/requests or asynchronous I/O operations.
  If one of the write requests within the set of sequential write requests
  issued to the device fails, all write requests queued after it will
  become unaligned and fail.

* Delayed write errors: similarly to regular block devices, if the device side
  write cache is enabled, write errors may occur in ranges of previously
  completed writes when the device write cache is flushed, e.g. on fsync().
  Similarly to the previous immediate unaligned write error case, delayed write
  errors can propagate through a stream of cached sequential data for a zone
  causing all data to be dropped after the sector that caused the error.

All I/O errors detected by zonefs are notified to the user with an error code
return for the system call that triggered or detected the error. The recovery
actions taken by zonefs in response to I/O errors depend on the I/O type (read
vs write) and on the reason for the error (bad sector, unaligned writes or zone
condition change).

* For read I/O errors, zonefs does not execute any particular recovery action,
  but only if the file zone is still in a good condition and there is no
  inconsistency between the file inode size and its zone write pointer position.
  If a problem is detected, I/O error recovery is executed (see below table).

* For write I/O errors, zonefs I/O error recovery is always executed.

* A zone condition change to read-only or offline also always triggers zonefs
  I/O error recovery.

Zonefs minimal I/O error recovery may change a file size and file access
permissions.

* File size changes:
  Immediate or delayed write errors in a sequential zone file may cause the file
  inode size to be inconsistent with the amount of data successfully written in
  the file zone. For instance, the partial failure of a multi-BIO large write
  operation will cause the zone write pointer to advance partially, even though
  the entire write operation will be reported as failed to the user. In such
  case, the file inode size must be advanced to reflect the zone write pointer
  change and eventually allow the user to restart writing at the end of the
  file.
  A file size may also be reduced to reflect a delayed write error detected on
  fsync(): in this case, the amount of data effectively written in the zone may
  be less than originally indicated by the file inode size. After such I/O
  error, zonefs always fixes the file inode size to reflect the amount of data
  persistently stored in the file zone.

* Access permission changes:
  A zone condition change to read-only is indicated with a change in the file
  access permissions to render the file read-only.
  This disables changes to the
  file attributes and data modification. For offline zones, all permissions
  (read and write) to the file are disabled.

Further action taken by zonefs I/O error recovery can be controlled by the user
with the "errors=xxx" mount option. The table below summarizes the result of
zonefs I/O error processing depending on the mount option and on the zone
conditions::

    +--------------+-----------+-----------------------------------------+
    |              |           |            Post error state             |
    | "errors=xxx" |  device   |                 access permissions      |
    |    mount     |   zone    | file         file          device zone  |
    |    option    | condition | size     read    write    read    write |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | remount-ro   | read-only | as is    yes     no       yes     no    |
    | (default)    | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | zone-ro      | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      |   0      no      no       yes     yes   |
    | zone-offline | read-only |   0      no      no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     yes      yes     yes   |
    | repair       | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+

Further notes:

* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
  error processing if no errors mount option is specified.
* With the "errors=remount-ro" mount option, the change of the file access
  permissions to read-only applies to all files. The file system is remounted
  read-only.
* Access permission and file size changes due to the device transitioning zones
  to the offline condition are permanent. Remounting or reformatting the device
  with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
  state.
* File access permission changes to read-only due to the device transitioning
  zones to the read-only condition are permanent. Remounting or reformatting
  the device will not re-enable file write access.
* File access permission changes implied by the remount-ro, zone-ro and
  zone-offline mount options are temporary for zones in a good condition.
  Unmounting and remounting the file system will restore the previous default
  (format time values) access rights to the files affected.
* The repair mount option triggers only the minimal set of I/O error recovery
  actions, that is, file size fixes for zones in a good condition.
  Zones
  indicated as being read-only or offline by the device still imply changes to
  the zone file access permissions as noted in the table above.

Mount options
-------------

zonefs defines several mount options:

* errors=<behavior>
* explicit-open

"errors=<behavior>" option
~~~~~~~~~~~~~~~~~~~~~~~~~~

The "errors=<behavior>" mount option allows the user to specify zonefs
behavior in response to I/O errors, inode size inconsistencies or zone
condition changes. The defined behaviors are as follows:

* remount-ro (default)
* zone-ro
* zone-offline
* repair

The run-time I/O error actions defined for each behavior are detailed in the
previous section. Mount time I/O errors will cause the mount operation to fail.
The handling of read-only zones also differs between mount-time and run-time.
If a read-only zone is found at mount time, the zone is always treated in the
same manner as offline zones, that is, all accesses are disabled and the zone
file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone.
In the case of a
read-only zone discovered at run-time, as indicated in the previous section,
the size of the zone file is left unchanged from its last updated value.

"explicit-open" option
~~~~~~~~~~~~~~~~~~~~~~

A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
the number of zones that can be active, that is, zones that are in the
implicit open, explicit open or closed conditions. This potential limitation
translates into a risk for applications to see write IO errors due to this
limit being exceeded if the zone of a file is not already active when a write
request is issued by the user.

To avoid these potential errors, the "explicit-open" mount option forces zones
to be made active using an open zone command when a file is opened for writing
for the first time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. Conversely, the
"explicit-open" mount option will result in a zone close command being issued
to the device on the last close() of a zone file if the zone is not full nor
empty.

Runtime sysfs attributes
------------------------

zonefs defines several sysfs attributes for mounted devices.
All attributes
are user readable and can be found in the directory /sys/fs/zonefs/<dev>/,
where <dev> is the name of the mounted zoned block device.

The attributes defined are as follows.

* **max_wro_seq_files**: This attribute reports the maximum number of
  sequential zone files that can be open for writing. This number corresponds
  to the maximum number of explicitly or implicitly open zones that the device
  supports. A value of 0 means that the device has no limit and that any zone
  (any file) can be open for writing and written at any time, regardless of the
  state of other zones. When the *explicit-open* mount option is used, zonefs
  will fail any open() system call requesting to open a sequential zone file for
  writing when the number of sequential zone files already open for writing has
  reached the *max_wro_seq_files* limit.
* **nr_wro_seq_files**: This attribute reports the current number of sequential
  zone files open for writing. When the *explicit-open* mount option is used,
  this number can never exceed *max_wro_seq_files*. If the *explicit-open*
  mount option is not used, the reported number can be greater than
  *max_wro_seq_files*. In such case, it is the responsibility of the
  application to not write simultaneously more than *max_wro_seq_files*
  sequential zone files. Failure to do so can result in write errors.
* **max_active_seq_files**: This attribute reports the maximum number of
  sequential zone files that are in an active state, that is, sequential zone
  files that are partially written (not empty nor full) or that have a zone that
  is explicitly open (which happens only if the *explicit-open* mount option is
  used). This number is always equal to the maximum number of active zones that
  the device supports. A value of 0 means that the mounted device has no limit
  on the number of sequential zone files that can be active.
* **nr_active_seq_files**: This attribute reports the current number of
  sequential zone files that are active. If *max_active_seq_files* is not 0,
  then the value of *nr_active_seq_files* can never exceed the value of
  *max_active_seq_files*, regardless of the use of the *explicit-open* mount
  option.

Zonefs User Space Tools
=======================

The mkzonefs tool is used to format zoned block devices for use with zonefs.
This tool is available on GitHub at:

https://github.com/damien-lemoal/zonefs-tools

zonefs-tools also includes a test suite which can be run against any zoned
block device, including a null_blk block device created in zoned mode.

Examples
--------

The following formats a 15TB host-managed SMR HDD with 256 MB zones
with the conventional zones aggregation feature enabled::

    # mkzonefs -o aggr_cnv /dev/sdX
    # mount -t zonefs /dev/sdX /mnt
    # ls -l /mnt/
    total 0
    dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of the zone file sub-directories indicates the number of files
existing for each type of zones. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a single
file)::

    # ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file::

    # mkfs.ext4 /mnt/cnv/0
    # mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has in this
example 55356 zones::

    # ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is appended at
the end of the file, similarly to any regular file system::

    # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s

    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any further
write operation::

    # truncate -s 268435456 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and restarting
append-writes to the file::

    # truncate -s 0 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of blocks
of a file as reported by stat() and fstat() indicates the capacity of the file
zone::

    # stat /mnt/seq/0
    File: /mnt/seq/0
    Size: 0             Blocks: 524288     IO Block: 4096   regular empty file
    Device: 870h/2160d  Inode: 50431       Links: 1
    Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
    Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks gives the
maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
capacity in this example. Of note is that the "IO Block" field always
indicates the minimum I/O size for writes and corresponds to the device
physical sector size.
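
The same arithmetic can be done programmatically from the st_blocks field. The
following Python sketch is an illustration only: the helper name
``zone_capacity_bytes`` is hypothetical, and the value 524288 is taken from the
stat output of this example rather than read from a device.

```python
SECTOR_SIZE = 512  # st_blocks is expressed in units of 512 bytes


def zone_capacity_bytes(st_blocks):
    """Convert the block count of a zone file into a capacity in bytes."""
    return st_blocks * SECTOR_SIZE


# Block count from the stat example. On a mounted zonefs volume it could
# be obtained with os.stat("/mnt/seq/0").st_blocks instead.
capacity = zone_capacity_bytes(524288)
print(capacity, capacity // (1024 * 1024))  # 268435456 256
```

The result matches the size used with ``truncate -s 268435456`` earlier in
these examples.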