.. SPDX-License-Identifier: GPL-2.0

================================================
ZoneFS - Zone filesystem for Zoned block devices
================================================

Introduction
============

zonefs is a very simple file system exposing each zone of a zoned block device
as a file. Unlike a regular POSIX-compliant file system with native zoned block
device support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing sequential
write zones of the device must be written sequentially starting from the end
of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access interface
than to a full-featured POSIX file system. The goal of zonefs is to simplify
the implementation of zoned block device support in applications by replacing
raw block device file accesses with a richer file API, which avoids relying on
direct block device file ioctls that may be more obscure to developers. One
example of this approach is the implementation of LSM (log-structured merge)
tree structures (such as used in RocksDB and LevelDB) on zoned block devices
by allowing SSTables to be stored in a zone file similarly to a regular file
system rather than as a range of sectors of the entire disk. The introduction
of the higher level construct "one file is one zone" can help reduce the
number of changes needed in the application as well as ease the introduction
of support for different application programming languages.

Zoned block devices
-------------------

Zoned storage devices belong to a class of storage devices with an address
space that is divided into zones. A zone is a group of consecutive LBAs and all
zones are contiguous (there are no LBA gaps). Zones may have different types.

* Conventional zones: there are no access constraints on LBAs belonging to
  conventional zones. Any read or write access can be executed, similarly to a
  regular block device.
* Sequential zones: these zones accept random reads but must be written
  sequentially. Each sequential zone has a write pointer maintained by the
  device that keeps track of the mandatory start LBA position of the next write
  to the zone. As a result of this write constraint, LBAs in a sequential zone
  cannot be overwritten. Sequential zones must first be erased using a special
  command (zone reset) before rewriting.
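
The write pointer rule for sequential zones can be illustrated with a small
model. This is only a sketch: the SequentialZone class and its method names are
hypothetical and do not correspond to any kernel or device interface. It only
mimics the constraints described above.

```python
class SequentialZone:
    """Toy model of a sequential write zone (hypothetical, for illustration)."""

    def __init__(self, start_lba, num_lbas):
        self.start_lba = start_lba
        self.num_lbas = num_lbas
        self.write_pointer = start_lba  # next LBA that may be written

    def write(self, lba, count):
        """Accept a write only if it starts exactly at the write pointer."""
        if lba != self.write_pointer:
            raise ValueError("unaligned write: must start at the write pointer")
        if lba + count > self.start_lba + self.num_lbas:
            raise ValueError("write crosses the end of the zone")
        self.write_pointer += count

    def reset(self):
        """Zone reset: rewind the write pointer so the zone can be rewritten."""
        self.write_pointer = self.start_lba


zone = SequentialZone(start_lba=1000, num_lbas=8)
zone.write(1000, 4)        # accepted, the write pointer moves to LBA 1004
try:
    zone.write(1000, 4)    # rejected: LBAs below the write pointer cannot be overwritten
except ValueError as err:
    print("rejected:", err)
zone.reset()               # after a reset, writing restarts at LBA 1000
zone.write(1000, 8)        # fills the zone
```

A conventional zone would accept both writes in this model, since it has no
write pointer to check against.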

Zoned storage devices can be implemented using various recording and media
technologies. The most common form of zoned storage today uses the SCSI Zoned
Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
Magnetic Recording (SMR) HDDs.

Solid State Disk (SSD) storage devices can also implement a zoned interface
to, for instance, reduce internal write amplification due to garbage collection.
The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
committee aiming at adding a zoned storage interface to the NVMe protocol.

Zonefs Overview
===============

Zonefs exposes the zones of a zoned block device as files. The files
representing zones are grouped by zone type, and each zone type is represented
by a sub-directory. This file structure is built entirely using zone information
provided by the device and so does not require any complex on-disk metadata
structure.

On-disk metadata
----------------

zonefs on-disk metadata is reduced to an immutable super block which
persistently stores a magic number and optional feature flags and values. On
mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
and populates the mount point with a static file tree solely based on this
information. File sizes come from the device zone type and write pointer
position managed by the device itself.

The super block is always written on disk at sector 0. The first zone of the
device storing the super block is never exposed as a zone file by zonefs. If
the zone containing the super block is a sequential zone, the mkzonefs format
tool always "finishes" the zone, that is, it transitions the zone to a full
state to make it read-only, preventing any data write.

Zone type sub-directories
-------------------------

Files representing zones of the same type are grouped together under the same
sub-directory automatically created on mount.

For conventional zones, the sub-directory "cnv" is used. This directory is
however created if and only if the device has usable conventional zones. If
the device only has a single conventional zone at sector 0, the zone will not
be exposed as a file as it will be used to store the zonefs super block. For
such devices, the "cnv" sub-directory will not be created.

For sequential write zones, the sub-directory "seq" is used.

These two directories are the only directories that exist in zonefs. Users
cannot create other directories and cannot rename or delete the "cnv" and
"seq" sub-directories.

The size of a directory, as reported by the st_size field of struct stat
obtained with the stat() or fstat() system calls, indicates the number of
files existing under the directory.

Zone files
----------

Zone files are named using the number of the zone they represent within the set
of zones of a particular type. That is, both the "cnv" and "seq" directories
contain files named "0", "1", "2", ... The file numbers also follow the
increasing order of zone start sectors on the device.
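
Because zone file names are plain decimal numbers, a lexical sort would order
them as "0", "1", "10", "2", ... The sketch below sorts them numerically so that
the result follows increasing zone start sectors. The list_zone_files() helper
is a hypothetical name introduced for this illustration.

```python
import os


def list_zone_files(type_dir):
    """Return zone file paths under type_dir in increasing zone number order."""
    names = [name for name in os.listdir(type_dir) if name.isdigit()]
    return [os.path.join(type_dir, name) for name in sorted(names, key=int)]
```

For a file system mounted on /mnt, list_zone_files("/mnt/seq") would return
"/mnt/seq/0", "/mnt/seq/1", "/mnt/seq/2" and so on.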

Read and write operations on zone files are not allowed beyond the maximum
file size, that is, beyond the zone capacity. Any access exceeding the zone
capacity fails with the -EFBIG error.

Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.

The number of blocks of a file as reported by stat() and fstat() indicates the
capacity of the zone file, or in other words, the maximum file size.
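
As an illustration, the capacity of a zone file and the space left in it can be
derived from a stat result. This is a minimal sketch: the helper names are
hypothetical, and it assumes that the number of blocks is reported in units of
512 bytes, as is the case for st_blocks on Linux.

```python
SECTOR_SIZE = 512  # st_blocks is counted in 512-byte units on Linux


def zone_capacity_bytes(st):
    """Maximum file size of a zone file, from its os.stat() result."""
    return st.st_blocks * SECTOR_SIZE


def remaining_bytes(st):
    """Bytes that can still be appended before the zone capacity is reached."""
    return zone_capacity_bytes(st) - st.st_size
```

Both helpers take the object returned by os.stat() or os.fstat(), for instance
os.stat("/mnt/seq/0") for a file system mounted on /mnt.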

Conventional zone files
-----------------------

The size of conventional zone files is fixed to the size of the zone they
represent. Conventional zone files cannot be truncated.

These files can be randomly read and written using any type of I/O operation:
buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O
constraints for these files beyond the file size limit mentioned above.

Sequential zone files
---------------------

The size of sequential zone files grouped in the "seq" sub-directory represents
the file's zone write pointer position relative to the zone start sector.

Sequential zone files can only be written sequentially, starting from the file
end, that is, write operations can only be append writes. Zonefs makes no
attempt at accepting random writes and will fail any write request that has a
start offset not corresponding to the end of the file, or to the end of the last
write issued and still in-flight (for asynchronous I/O operations).

Since dirty page writeback by the page cache does not guarantee a sequential
write pattern, zonefs prevents buffered writes and writeable shared mappings
on sequential files. Only direct I/O writes are accepted for these files.
zonefs relies on the sequential delivery of write I/O requests to the device
implemented by the block layer elevator. An elevator implementing the sequential
write feature for zoned block devices (ELEVATOR_F_ZBD_SEQ_WRITE elevator
feature) must be used. This type of elevator (e.g. mq-deadline) is set by
default for zoned block devices on device initialization.
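
The append-only, direct I/O write pattern could look as follows from an
application. This is a sketch under several assumptions: append_block() is a
hypothetical helper, it writes exactly one block whose size is taken from
st_blksize, and the "direct" parameter exists only so that the code can be
tried on a regular file where O_DIRECT may not be supported. An anonymous
memory mapping is used as the I/O buffer because direct I/O generally needs an
aligned buffer.

```python
import mmap
import os


def append_block(path, payload, direct=True):
    """Append one zero-padded block holding payload at the end of the file.

    Returns (offset, written): the offset used and the number of bytes written.
    """
    flags = os.O_WRONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    try:
        st = os.fstat(fd)
        block_size = st.st_blksize       # minimum write size reported by stat
        if len(payload) > block_size:
            raise ValueError("payload larger than one block")
        buf = mmap.mmap(-1, block_size)  # anonymous mapping: page-aligned, zero-filled
        try:
            buf[:len(payload)] = payload
            # Append write: the start offset is the current end of the file.
            written = os.pwrite(fd, buf, st.st_size)
        finally:
            buf.close()
        return st.st_size, written
    finally:
        os.close(fd)
```

On a sequential zone file, a write at any offset other than the current file
size is expected to fail, so the offset is always re-read with fstat() here
rather than tracked by the caller.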

There are no restrictions on the type of I/O used for read operations in
sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
all accepted.

Truncating sequential zone files is allowed only down to 0, in which case the
zone is reset to rewind the file zone write pointer position to the start of
the zone, or up to the zone capacity, in which case the file's zone is
transitioned to the FULL state (finish zone operation).
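
The two truncation cases map directly to the truncate() system call. The helper
names below are hypothetical, and the capacity computation assumes that the
number of blocks of the file is reported in units of 512 bytes.

```python
import os


def reset_zone(path):
    """Truncate to 0: zonefs resets the zone and rewinds its write pointer."""
    os.truncate(path, 0)


def finish_zone(path, capacity=None):
    """Truncate up to the zone capacity: zonefs transitions the zone to FULL."""
    if capacity is None:
        # Assumption: st_blocks reports the zone capacity in 512-byte units.
        capacity = os.stat(path).st_blocks * 512
    os.truncate(path, capacity)
```

Any other truncation size is expected to be rejected by zonefs for sequential
zone files.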

Format options
--------------

Several optional features of zonefs can be enabled at format time.

* Conventional zone aggregation: ranges of contiguous conventional zones can be
  aggregated into a single larger file instead of the default one file per zone.
* File ownership: the owner UID and GID of zone files are by default 0 (root)
  but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.

IO error handling
-----------------

Zoned block devices may fail I/O requests for reasons similar to regular block
devices, e.g. due to bad sectors. However, in addition to such known I/O
failure patterns, the standards governing zoned block device behavior define
additional conditions that result in I/O errors.

* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
  While the data already written in the zone is still readable, the zone can
  no longer be written. No user action on the zone (zone management command or
  read/write access) can change the zone condition back to a normal read/write
  state. While the reasons for the device to transition a zone to read-only
  state are not defined by the standards, a typical cause for such transition
  would be a defective write head on an HDD (all zones under this head are
  changed to read-only).

* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
  An offline zone can be neither read nor written. No user action can
  transition an offline zone back to a good operational state. Similarly to
  zone read-only transitions, the reasons for a drive to transition a zone to
  the offline condition are undefined. A typical cause would be a defective
  read-write head on an HDD causing all zones on the platter under the broken
  head to be inaccessible.

* Unaligned write errors: These errors result from the host issuing write
  requests with a start sector that does not correspond to a zone write pointer
  position when the write request is executed by the device. Even though zonefs
  enforces sequential file write for sequential zones, unaligned write errors
  may still happen in the case of a partial failure of a very large direct I/O
  operation split into multiple BIOs/requests or asynchronous I/O operations.
  If one of the write requests within the set of sequential write requests
  issued to the device fails, all write requests queued after it will
  become unaligned and fail.

* Delayed write errors: similarly to regular block devices, if the device side
  write cache is enabled, write errors may occur in ranges of previously
  completed writes when the device write cache is flushed, e.g. on fsync().
  Similarly to the previous immediate unaligned write error case, delayed write
  errors can propagate through a stream of cached sequential data for a zone,
  causing all data to be dropped after the sector that caused the error.

All I/O errors detected by zonefs are notified to the user with an error code
return for the system call that triggered or detected the error. The recovery
actions taken by zonefs in response to I/O errors depend on the I/O type (read
vs write) and on the reason for the error (bad sector, unaligned writes or zone
condition change).

* For read I/O errors, zonefs does not execute any particular recovery action,
  but only if the file zone is still in a good condition and there is no
  inconsistency between the file inode size and its zone write pointer position.
  If a problem is detected, I/O error recovery is executed (see the table
  below).

* For write I/O errors, zonefs I/O error recovery is always executed.

* A zone condition change to read-only or offline also always triggers zonefs
  I/O error recovery.

Zonefs minimal I/O error recovery may change a file size and file access
permissions.

* File size changes:
  Immediate or delayed write errors in a sequential zone file may cause the file
  inode size to be inconsistent with the amount of data successfully written in
  the file zone. For instance, the partial failure of a multi-BIO large write
  operation will cause the zone write pointer to advance partially, even though
  the entire write operation will be reported as failed to the user. In such a
  case, the file inode size must be advanced to reflect the zone write pointer
  change and eventually allow the user to restart writing at the end of the
  file.
  A file size may also be reduced to reflect a delayed write error detected on
  fsync(): in this case, the amount of data effectively written in the zone may
  be less than originally indicated by the file inode size. After such I/O
  error, zonefs always fixes the file inode size to reflect the amount of data
  persistently stored in the file zone.

* Access permission changes:
  A zone condition change to read-only is indicated with a change in the file
  access permissions to render the file read-only. This disables changes to the
  file attributes and data modification. For offline zones, all permissions
  (read and write) to the file are disabled.
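
Since zonefs fixes the file size after a write error, an application can
compare the file size seen after the error with the offset at which its failed
write started in order to decide where to resume. The helper below is a
hypothetical sketch of that arithmetic only. It assumes the application
remembers the start offset of the failed write and re-reads the file size with
stat() or fstat() after the error.

```python
def bytes_already_written(start_offset, payload_len, size_after_error):
    """Number of leading payload bytes that are stored in the zone.

    start_offset:     file offset at which the failed write started
    payload_len:      total length of the failed write
    size_after_error: file size reported after zonefs fixed the inode size
    """
    landed = size_after_error - start_offset
    # Clamp to [0, payload_len]: the size may not have moved at all, or may
    # even be below start_offset after a delayed write error.
    return max(0, min(landed, payload_len))
```

A result of 0 with a file size below the start offset indicates that data
written earlier was also lost, in which case resuming the failed write alone is
not sufficient.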

Further action taken by zonefs I/O error recovery can be controlled by the user
with the "errors=xxx" mount option. The table below summarizes the result of
zonefs I/O error processing depending on the mount option and on the zone
conditions::

    +--------------+-----------+-----------------------------------------+
    |              |           |            Post error state             |
    | "errors=xxx" |  device   |                 access permissions      |
    |    mount     |   zone    | file         file          device zone  |
    |    option    | condition | size     read    write    read    write |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | remount-ro   | read-only | as is    yes     no       yes     no    |
    | (default)    | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | zone-ro      | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      |   0      no      no       yes     yes   |
    | zone-offline | read-only |   0      no      no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     yes      yes     yes   |
    | repair       | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+

Further notes:

* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
  error processing if no errors mount option is specified.
* With the "errors=remount-ro" mount option, the change of the file access
  permissions to read-only applies to all files. The file system is remounted
  read-only.
* Access permission and file size changes due to the device transitioning zones
  to the offline condition are permanent. Remounting or reformatting the device
  with mkfs.zonefs (mkzonefs) will not change offline zone files back to a good
  state.
* File access permission changes to read-only due to the device transitioning
  zones to the read-only condition are permanent. Remounting or reformatting
  the device will not re-enable file write access.
* File access permission changes implied by the remount-ro, zone-ro and
  zone-offline mount options are temporary for zones in a good condition.
  Unmounting and remounting the file system will restore the previous default
  (format time values) access rights to the files affected.
* The repair mount option triggers only the minimal set of I/O error recovery
  actions, that is, file size fixes for zones in a good condition. Zones
  indicated as being read-only or offline by the device still imply changes to
  the zone file access permissions as noted in the table above.
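
For applications that need to reason about the state of a file after an error,
the file-related columns of the table above can be written down as a lookup
table. This is only a transcription of the table for illustration. The
dictionary and function names are hypothetical and are not part of any zonefs
interface.

```python
# (mount option, zone condition) -> (file size, file readable, file writable)
POST_ERROR_STATE = {
    ("remount-ro", "good"): ("fixed", True, False),
    ("remount-ro", "read-only"): ("as is", True, False),
    ("remount-ro", "offline"): ("0", False, False),
    ("zone-ro", "good"): ("fixed", True, False),
    ("zone-ro", "read-only"): ("as is", True, False),
    ("zone-ro", "offline"): ("0", False, False),
    ("zone-offline", "good"): ("0", False, False),
    ("zone-offline", "read-only"): ("0", False, False),
    ("zone-offline", "offline"): ("0", False, False),
    ("repair", "good"): ("fixed", True, True),
    ("repair", "read-only"): ("as is", True, False),
    ("repair", "offline"): ("0", False, False),
}


def post_error_state(errors_option, zone_condition):
    """Return (file size, file readable, file writable) after an I/O error."""
    return POST_ERROR_STATE[(errors_option, zone_condition)]
```

Only the "repair" behavior with a zone in good condition leaves the file
writable after an error.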

Mount options
-------------

zonefs defines the "errors=<behavior>" mount option to allow the user to specify
zonefs behavior in response to I/O errors, inode size inconsistencies or zone
condition changes. The defined behaviors are as follows:

* remount-ro (default)
* zone-ro
* zone-offline
* repair

The run-time I/O error actions defined for each behavior are detailed in the
previous section. Mount time I/O errors will cause the mount operation to fail.
The handling of read-only zones also differs between mount-time and run-time.
If a read-only zone is found at mount time, the zone is always treated in the
same manner as offline zones, that is, all accesses are disabled and the zone
file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone. In the case of a
read-only zone discovered at run-time, as indicated in the previous section,
the size of the zone file is left unchanged from its last updated value.

A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
the number of zones that can be active, that is, zones that are in the
implicit open, explicit open or closed conditions. This potential limitation
translates into a risk for applications to see write I/O errors due to this
limit being exceeded if the zone of a file is not already active when a write
request is issued by the user.

To avoid these potential errors, the "explicit-open" mount option forces zones
to be made active using an open zone command when a file is opened for writing
for the first time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. Conversely, the
"explicit-open" mount option will result in a zone close command being issued
to the device on the last close() of a zone file if the zone is neither full
nor empty.
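
The mount options described above are passed with the usual "-o" argument of
the mount command. As a sketch, a management script could assemble the command
line as follows. The zonefs_mount_command() helper is hypothetical, and the
device and mount point used in the note below are placeholders.

```python
ERRORS_BEHAVIORS = ("remount-ro", "zone-ro", "zone-offline", "repair")


def zonefs_mount_command(device, mount_point, errors="remount-ro",
                         explicit_open=False):
    """Build the argument list of a mount command for a zonefs device."""
    if errors not in ERRORS_BEHAVIORS:
        raise ValueError("unknown errors behavior: " + errors)
    options = ["errors=" + errors]
    if explicit_open:
        options.append("explicit-open")
    return ["mount", "-t", "zonefs", "-o", ",".join(options),
            device, mount_point]
```

The returned list can be handed to subprocess.run() by a process with
sufficient privileges, for instance
zonefs_mount_command("/dev/sdX", "/mnt", errors="zone-ro", explicit_open=True).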

Zonefs User Space Tools
=======================

The mkzonefs tool is used to format zoned block devices for use with zonefs.
This tool is available on GitHub at:

https://github.com/damien-lemoal/zonefs-tools

zonefs-tools also includes a test suite which can be run against any zoned
block device, including a null_blk block device created in zoned mode.

Examples
--------

The following formats a 15TB host-managed SMR HDD with 256 MB zones
with the conventional zones aggregation feature enabled::

    # mkzonefs -o aggr_cnv /dev/sdX
    # mount -t zonefs /dev/sdX /mnt
    # ls -l /mnt/
    total 0
    dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for each type of zones. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a single
file)::

    # ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file::

    # mkfs.ext4 /mnt/cnv/0
    # mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory grouping files for sequential write zones has in this
example 55356 zones::

    # ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355

For sequential write zone files, the file size changes as data is appended at
the end of the file, similarly to any regular file system::

    # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s

    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0

The written file can be truncated to the zone size, preventing any further
write operation::

    # truncate -s 268435456 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size allows freeing the file zone storage space and restarting
append writes to the file::

    # truncate -s 0 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0

Since files are statically mapped to zones on the disk, the number of blocks
of a file as reported by stat() and fstat() indicates the capacity of the file
zone::

    # stat /mnt/seq/0
    File: /mnt/seq/0
    Size: 0               Blocks: 524288     IO Block: 4096   regular empty file
    Device: 870h/2160d    Inode: 50431       Links: 1
    Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
    Birth: -

The number of blocks of the file ("Blocks") in units of 512B blocks gives the
maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
capacity in this example. Of note is that the "IO Block" field always
indicates the minimum I/O size for writes and corresponds to the device
physical sector size.