.. SPDX-License-Identifier: GPL-2.0

================================================
ZoneFS - Zone filesystem for Zoned block devices
================================================

Introduction
============

zonefs is a very simple file system exposing each zone of a zoned block device
as a file. Unlike a regular POSIX-compliant file system with native zoned block
device support (e.g. f2fs), zonefs does not hide the sequential write
constraint of zoned block devices from the user. Files representing sequential
write zones of the device must be written sequentially starting from the end
of the file (append only writes).

As such, zonefs is in essence closer to a raw block device access interface
than to a full-featured POSIX file system. The goal of zonefs is to simplify
the implementation of zoned block device support in applications by replacing
raw block device file accesses with a richer file API, avoiding reliance on
direct block device file ioctls which may be more obscure to developers. One
example of this approach is the implementation of LSM (log-structured merge)
tree structures (such as used in RocksDB and LevelDB) on zoned block devices
by allowing SSTables to be stored in a zone file similarly to a regular file
system rather than as a range of sectors of the entire disk. The introduction
of the higher level construct "one file is one zone" can help reduce the
amount of changes needed in the application as well as introduce support for
different application programming languages.

Zoned block devices
-------------------

Zoned storage devices belong to a class of storage devices with an address
space that is divided into zones. A zone is a group of consecutive LBAs and all
zones are contiguous (there are no LBA gaps). Zones may have different types.

* Conventional zones: there are no access constraints to LBAs belonging to
  conventional zones. Any read or write access can be executed, similarly to a
  regular block device.
* Sequential zones: these zones accept random reads but must be written
  sequentially. Each sequential zone has a write pointer maintained by the
  device that keeps track of the mandatory start LBA position of the next write
  to the device. As a result of this write constraint, LBAs in a sequential zone
  cannot be overwritten. Sequential zones must first be erased using a special
  command (zone reset) before rewriting.

Zoned storage devices can be implemented using various recording and media
technologies. The most common form of zoned storage today uses the SCSI Zoned
Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
Magnetic Recording (SMR) HDDs.

Solid State Disk (SSD) storage devices can also implement a zoned interface
to, for instance, reduce internal write amplification due to garbage collection.
The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
committee aiming at adding a zoned storage interface to the NVMe protocol.

Zonefs Overview
===============

Zonefs exposes the zones of a zoned block device as files. The files
representing zones are grouped by zone type, and each zone type is represented
by a sub-directory. This file structure is built entirely using zone information
provided by the device and so does not require any complex on-disk metadata
structure.

On-disk metadata
----------------

zonefs on-disk metadata is reduced to an immutable super block which
persistently stores a magic number and optional feature flags and values. On
mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
and populates the mount point with a static file tree solely based on this
information. File sizes come from the device zone type and write pointer
position managed by the device itself.

The super block is always written on disk at sector 0. The first zone of the
device storing the super block is never exposed as a zone file by zonefs. If
the zone containing the super block is a sequential zone, the mkzonefs format
tool always "finishes" the zone, that is, it transitions the zone to a full
state to make it read-only, preventing any data write.

Zone type sub-directories
-------------------------

Files representing zones of the same type are grouped together under the same
sub-directory automatically created on mount.

For conventional zones, the sub-directory "cnv" is used. However, this
directory is created if and only if the device has usable conventional zones.
If the device only has a single conventional zone at sector 0, the zone will
not be exposed as a file as it will be used to store the zonefs super block.
For such devices, the "cnv" sub-directory will not be created.

For sequential write zones, the sub-directory "seq" is used.

These two directories are the only directories that exist in zonefs. Users
cannot create other directories and can neither rename nor delete the "cnv"
and "seq" sub-directories.

The size of a directory, as reported by the st_size field of struct stat
obtained with the stat() or fstat() system calls, indicates the number of
files existing under the directory.
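
As an illustration, the following sketch compares the st_size value of a
directory with the number of entries found in it. The helper name
zonefs_dir_count is hypothetical, and the GNU coreutils stat and findutils
find commands are assumed to be available.

.. code-block:: shell

    # Hypothetical helper: print the st_size of a directory next to the
    # number of entries it contains. On zonefs, both values are expected
    # to match.
    zonefs_dir_count() {
        dir=$1
        # %s prints the st_size field returned by the stat() system call.
        size=$(stat -c %s "$dir") || return 1
        # Count the directory entries ("." and ".." are not listed).
        entries=$(( $(find "$dir" -mindepth 1 -maxdepth 1 | wc -l) ))
        echo "st_size=$size entries=$entries"
    }

For a file system mounted on /mnt (an assumed mount point), running
"zonefs_dir_count /mnt/seq" should print the same number twice.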

Zone files
----------

Zone files are named using the number of the zone they represent within the set
of zones of a particular type. That is, both the "cnv" and "seq" directories
contain files named "0", "1", "2", ... The file numbers also represent
increasing zone start sector on the device.

Read and write operations on zone files are not allowed beyond the file
maximum size, that is, beyond the zone capacity. Any access exceeding the zone
capacity fails with the -EFBIG error.

Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.

The number of blocks of a file as reported by stat() and fstat() indicates the
capacity of the zone file, or in other words, the maximum file size.
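
A minimal sketch of how an application could combine the file size and the
number of blocks to compute the space left in a zone file follows. The helper
name zone_file_usage is hypothetical. The sketch assumes GNU coreutils stat,
whose %b and %B formats print the number of blocks and the size in bytes of
each block counted by %b.

.. code-block:: shell

    # Hypothetical helper: print the size, capacity and remaining space of
    # a zone file, all in bytes.
    zone_file_usage() {
        # %s: file size in bytes
        # %b: number of blocks, %B: size in bytes of each %b block
        info=$(stat -c '%s %b %B' "$1") || return 1
        # Split the three values into $1 (size), $2 (blocks), $3 (unit).
        set -- $info
        capacity=$(( $2 * $3 ))
        echo "written=$1 capacity=$capacity remaining=$(( capacity - $1 ))"
    }

On a regular file system the block count reflects allocated space rather than
a zone capacity, so the "remaining" value is only meaningful for zonefs files.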

Conventional zone files
-----------------------

The size of conventional zone files is fixed to the size of the zone they
represent. Conventional zone files cannot be truncated.

These files can be randomly read and written using any type of I/O operation:
buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There is no I/O
constraint for these files beyond the file size limit mentioned above.

Sequential zone files
---------------------

The size of sequential zone files grouped in the "seq" sub-directory represents
the file's zone write pointer position relative to the zone start sector.

Sequential zone files can only be written sequentially, starting from the file
end, that is, write operations can only be append writes. Zonefs makes no
attempt at accepting random writes and will fail any write request that has a
start offset not corresponding to the end of the file, or to the end of the last
write issued and still in-flight (for asynchronous I/O operations).

Since dirty page writeback by the page cache does not guarantee a sequential
write pattern, zonefs prevents buffered writes and writeable shared mappings
on sequential files. Only direct I/O writes are accepted for these files.
zonefs relies on the sequential delivery of write I/O requests to the device
implemented by the block layer elevator. An elevator implementing the sequential
write feature for zoned block devices (ELEVATOR_F_ZBD_SEQ_WRITE elevator
feature) must be used. This type of elevator (e.g. mq-deadline) is set by
default for zoned block devices on device initialization.
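
The elevator in use can be checked from user space. The sketch below extracts
the active scheduler name from the contents of the block layer
queue/scheduler sysfs attribute, in which the kernel marks the active
scheduler with square brackets. The helper name active_scheduler is
hypothetical.

.. code-block:: shell

    # Hypothetical helper: print the active I/O scheduler found in a
    # queue/scheduler sysfs attribute file, for example "mq-deadline" for
    # a file containing "[mq-deadline] none".
    active_scheduler() {
        # Keep only the text enclosed in square brackets.
        sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
    }

For a zoned block device named sdX (a placeholder), the helper would be called
as "active_scheduler /sys/block/sdX/queue/scheduler".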

There are no restrictions on the type of I/O used for read operations in
sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
all accepted.

Truncating sequential zone files is allowed only down to 0, in which case the
zone is reset to rewind the file zone write pointer position to the start of
the zone, or up to the zone capacity, in which case the file's zone is
transitioned to the FULL state (finish zone operation).
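
From the shell, both operations can be sketched with the truncate utility. The
helper names zone_reset and zone_finish are hypothetical. zone_finish assumes
that the number of blocks reported by stat reflects the zone capacity, as
described in the "Zone files" section, and that GNU coreutils stat is
available.

.. code-block:: shell

    # Hypothetical helper: truncate a sequential zone file down to 0,
    # which resets its zone.
    zone_reset() {
        truncate -s 0 "$1"
    }

    # Hypothetical helper: truncate a sequential zone file up to its
    # maximum size, which finishes its zone.
    zone_finish() {
        # %b: number of blocks, %B: size in bytes of each %b block
        info=$(stat -c '%b %B' "$1") || return 1
        # Keep the path in $1 and place blocks and unit in $2 and $3.
        set -- "$1" $info
        truncate -s "$(( $2 * $3 ))" "$1"
    }

For example, "zone_reset /mnt/seq/0" discards the data of the first sequential
zone file, assuming a file system mounted on /mnt.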

Format options
--------------

Several optional features of zonefs can be enabled at format time.

* Conventional zone aggregation: ranges of contiguous conventional zones can be
  aggregated into a single larger file instead of the default one file per zone.
* File ownership: The owner UID and GID of zone files are by default 0 (root)
  but can be changed to any valid UID/GID.
* File access permissions: the default 640 access permissions can be changed.

IO error handling
-----------------

Zoned block devices may fail I/O requests for reasons similar to regular block
devices, e.g. due to bad sectors. However, in addition to such known I/O
failure patterns, the standards governing zoned block device behavior define
additional conditions that result in I/O errors.

* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
  While the data already written in the zone is still readable, the zone can
  no longer be written. No user action on the zone (zone management command or
  read/write access) can change the zone condition back to a normal read/write
  state. While the reasons for the device to transition a zone to read-only
  state are not defined by the standards, a typical cause for such transition
  would be a defective write head on an HDD (all zones under this head are
  changed to read-only).

* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
  An offline zone can be neither read nor written. No user action can
  transition an offline zone back to an operational good state. Similarly to
  zone read-only transitions, the reasons for a drive to transition a zone to
  the offline condition are undefined. A typical cause would be a defective
  read-write head on an HDD causing all zones on the platter under the broken
  head to be inaccessible.

* Unaligned write errors: These errors result from the host issuing write
  requests with a start sector that does not correspond to a zone write pointer
  position when the write request is executed by the device. Even though zonefs
  enforces sequential file write for sequential zones, unaligned write errors
  may still happen in the case of a partial failure of a very large direct I/O
  operation split into multiple BIOs/requests or asynchronous I/O operations.
  If one of the write requests within the set of sequential write requests
  issued to the device fails, all write requests queued after it will
  become unaligned and fail.

* Delayed write errors: similarly to regular block devices, if the device side
  write cache is enabled, write errors may occur in ranges of previously
  completed writes when the device write cache is flushed, e.g. on fsync().
  Similarly to the previous immediate unaligned write error case, delayed write
  errors can propagate through a stream of cached sequential data for a zone
  causing all data to be dropped after the sector that caused the error.

All I/O errors detected by zonefs are notified to the user with an error code
return for the system call that triggered or detected the error. The recovery
actions taken by zonefs in response to I/O errors depend on the I/O type (read
vs write) and on the reason for the error (bad sector, unaligned writes or zone
condition change).

* For read I/O errors, zonefs does not execute any particular recovery action,
  provided that the file zone is still in a good condition and there is no
  inconsistency between the file inode size and its zone write pointer position.
  If a problem is detected, I/O error recovery is executed (see the table
  below).

* For write I/O errors, zonefs I/O error recovery is always executed.

* A zone condition change to read-only or offline also always triggers zonefs
  I/O error recovery.

Zonefs minimal I/O error recovery may change a file size and file access
permissions.

* File size changes:
  Immediate or delayed write errors in a sequential zone file may cause the file
  inode size to be inconsistent with the amount of data successfully written in
  the file zone. For instance, the partial failure of a multi-BIO large write
  operation will cause the zone write pointer to advance partially, even though
  the entire write operation will be reported as failed to the user. In such
  a case, the file inode size must be advanced to reflect the zone write
  pointer change and eventually allow the user to restart writing at the end of
  the file.
  A file size may also be reduced to reflect a delayed write error detected on
  fsync(): in this case, the amount of data effectively written in the zone may
  be less than originally indicated by the file inode size. After such an I/O
  error, zonefs always fixes the file inode size to reflect the amount of data
  persistently stored in the file zone.

* Access permission changes:
  A zone condition change to read-only is indicated with a change in the file
  access permissions to render the file read-only. This disables changes to the
  file attributes and data modification. For offline zones, all permissions
  (read and write) to the file are disabled.

Further action taken by zonefs I/O error recovery can be controlled by the user
with the "errors=xxx" mount option. The table below summarizes the result of
zonefs I/O error processing depending on the mount option and on the zone
conditions::

    +--------------+-----------+-----------------------------------------+
    |              |           |            Post error state             |
    | "errors=xxx" |  device   |                 access permissions      |
    |    mount     |   zone    | file         file          device zone  |
    |    option    | condition | size     read    write    read    write |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | remount-ro   | read-only | as is    yes     no       yes     no    |
    | (default)    | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     no       yes     yes   |
    | zone-ro      | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      |   0      no      no       yes     yes   |
    | zone-offline | read-only |   0      no      no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+
    |              | good      | fixed    yes     yes      yes     yes   |
    | repair       | read-only | as is    yes     no       yes     no    |
    |              | offline   |   0      no      no       no      no    |
    +--------------+-----------+-----------------------------------------+

Further notes:

* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
  error processing if no errors mount option is specified.
* With the "errors=remount-ro" mount option, the change of the file access
  permissions to read-only applies to all files. The file system is remounted
  read-only.
* Access permission and file size changes due to the device transitioning zones
  to the offline condition are permanent. Remounting or reformatting the device
  with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
  state.
* File access permission changes to read-only due to the device transitioning
  zones to the read-only condition are permanent. Remounting or reformatting
  the device will not re-enable file write access.
* File access permission changes implied by the remount-ro, zone-ro and
  zone-offline mount options are temporary for zones in a good condition.
  Unmounting and remounting the file system will restore the previous default
  (format time values) access rights to the files affected.
* The repair mount option triggers only the minimal set of I/O error recovery
  actions, that is, file size fixes for zones in a good condition. Zones
  indicated as being read-only or offline by the device still imply changes to
  the zone file access permissions as noted in the table above.

Mount options
-------------

zonefs defines several mount options:

* errors=<behavior>
* explicit-open

"errors=<behavior>" option
~~~~~~~~~~~~~~~~~~~~~~~~~~

The "errors=<behavior>" mount option allows the user to specify zonefs
behavior in response to I/O errors, inode size inconsistencies or zone
condition changes. The defined behaviors are as follows:

* remount-ro (default)
* zone-ro
* zone-offline
* repair

The run-time I/O error actions defined for each behavior are detailed in the
previous section. Mount time I/O errors will cause the mount operation to fail.
The handling of read-only zones also differs between mount-time and run-time.
If a read-only zone is found at mount time, the zone is always treated in the
same manner as offline zones, that is, all accesses are disabled and the zone
file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone. In the case of a
read-only zone discovered at run-time, as indicated in the previous section,
the size of the zone file is left unchanged from its last updated value.
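
Selecting one of the behaviors listed above at mount time can be sketched as
follows. The helper name zonefs_errors_opt is hypothetical; it only checks the
behavior name against the list above and prints the corresponding mount option
string.

.. code-block:: shell

    # Hypothetical helper: validate an "errors" behavior name and print
    # the matching mount option string.
    zonefs_errors_opt() {
        case "$1" in
            remount-ro|zone-ro|zone-offline|repair)
                echo "errors=$1"
                ;;
            *)
                echo "unknown errors behavior: $1" >&2
                return 1
                ;;
        esac
    }

The result can then be passed to mount, where /dev/sdX and /mnt are
placeholders::

    # mount -t zonefs -o "$(zonefs_errors_opt zone-ro)" /dev/sdX /mnt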

"explicit-open" option
~~~~~~~~~~~~~~~~~~~~~~

A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
the number of zones that can be active, that is, zones that are in the
implicit open, explicit open or closed conditions.  This potential limitation
translates into a risk for applications to see write IO errors due to this
limit being exceeded if the zone of a file is not already active when a write
request is issued by the user.

To avoid these potential errors, the "explicit-open" mount option forces zones
to be made active using an open zone command when a file is opened for writing
for the first time. If the zone open command succeeds, the application is then
guaranteed that write requests can be processed. Conversely, the
"explicit-open" mount option will result in a zone close command being issued
to the device on the last close() of a zone file if the zone is neither full
nor empty.

Runtime sysfs attributes
------------------------

zonefs defines several sysfs attributes for mounted devices.  All attributes
are user readable and can be found in the directory /sys/fs/zonefs/<dev>/,
where <dev> is the name of the mounted zoned block device.

The attributes defined are as follows.

* **max_wro_seq_files**:  This attribute reports the maximum number of
  sequential zone files that can be open for writing.  This number corresponds
  to the maximum number of explicitly or implicitly open zones that the device
  supports.  A value of 0 means that the device has no limit and that any zone
  (any file) can be open for writing and written at any time, regardless of the
  state of other zones.  When the *explicit-open* mount option is used, zonefs
  will fail any open() system call requesting to open a sequential zone file for
  writing when the number of sequential zone files already open for writing has
  reached the *max_wro_seq_files* limit.
* **nr_wro_seq_files**:  This attribute reports the current number of sequential
  zone files open for writing.  When the *explicit-open* mount option is used,
  this number can never exceed *max_wro_seq_files*.  If the *explicit-open*
  mount option is not used, the reported number can be greater than
  *max_wro_seq_files*.  In such a case, it is the responsibility of the
  application to not write simultaneously more than *max_wro_seq_files*
  sequential zone files.  Failure to do so can result in write errors.
* **max_active_seq_files**:  This attribute reports the maximum number of
  sequential zone files that are in an active state, that is, sequential zone
  files that are partially written (neither empty nor full) or that have a zone
  that is explicitly open (which happens only if the *explicit-open* mount
  option is used).  This number is always equal to the maximum number of active
  zones that the device supports.  A value of 0 means that the mounted device
  has no limit on the number of sequential zone files that can be active.
* **nr_active_seq_files**:  This attribute reports the current number of
  sequential zone files that are active. If *max_active_seq_files* is not 0,
  then the value of *nr_active_seq_files* can never exceed the value of
  *max_active_seq_files*, regardless of the use of the *explicit-open* mount
  option.
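
An application can use these attributes to estimate how many more sequential
zone files it may open for writing. The sketch below is a minimal example of
this check. The helper name wro_headroom is hypothetical, and its argument is
the sysfs directory of the mounted device.

.. code-block:: shell

    # Hypothetical helper: print how many more sequential zone files can be
    # open for writing, or "unlimited" if the device reports no limit.
    wro_headroom() {
        sysdir=$1
        max=$(cat "$sysdir/max_wro_seq_files") || return 1
        cur=$(cat "$sysdir/nr_wro_seq_files") || return 1
        if [ "$max" -eq 0 ]; then
            # A value of 0 means that the device has no limit.
            echo "unlimited"
        else
            echo $(( max - cur ))
        fi
    }

For a device named sdX (a placeholder), the helper would be called as
"wro_headroom /sys/fs/zonefs/sdX". Without the explicit-open mount option, the
printed value can be negative since nr_wro_seq_files may then exceed
max_wro_seq_files.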

Zonefs User Space Tools
=======================

The mkzonefs tool is used to format zoned block devices for use with zonefs.
This tool is available on GitHub at:

https://github.com/damien-lemoal/zonefs-tools

zonefs-tools also includes a test suite which can be run against any zoned
block device, including a null_blk block device created with zoned mode
enabled.
40262306a36Sopenharmony_ci
40362306a36Sopenharmony_ciExamples
40462306a36Sopenharmony_ci--------
40562306a36Sopenharmony_ci
40662306a36Sopenharmony_ciThe following formats a 15TB host-managed SMR HDD with 256 MB zones
40762306a36Sopenharmony_ciwith the conventional zones aggregation feature enabled::
40862306a36Sopenharmony_ci
40962306a36Sopenharmony_ci    # mkzonefs -o aggr_cnv /dev/sdX
41062306a36Sopenharmony_ci    # mount -t zonefs /dev/sdX /mnt
41162306a36Sopenharmony_ci    # ls -l /mnt/
41262306a36Sopenharmony_ci    total 0
41362306a36Sopenharmony_ci    dr-xr-xr-x 2 root root     1 Nov 25 13:23 cnv
41462306a36Sopenharmony_ci    dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq

The size of each zone file sub-directory indicates the number of files
existing for that type of zone. In this example, there is only one
conventional zone file (all conventional zones are aggregated under a single
file)::

    # ls -l /mnt/cnv
    total 137101312
    -rw-r----- 1 root root 140391743488 Nov 25 13:23 0

This aggregated conventional zone file can be used as a regular file::

    # mkfs.ext4 /mnt/cnv/0
    # mount -o loop /mnt/cnv/0 /data

The "seq" sub-directory, which groups the files for sequential write zones,
contains 55356 files in this example, one per sequential write zone::

    # ls -lv /mnt/seq
    total 14511243264
    -rw-r----- 1 root root 0 Nov 25 13:23 0
    -rw-r----- 1 root root 0 Nov 25 13:23 1
    -rw-r----- 1 root root 0 Nov 25 13:23 2
    ...
    -rw-r----- 1 root root 0 Nov 25 13:23 55354
    -rw-r----- 1 root root 0 Nov 25 13:23 55355
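
The numbers in the listings above can be cross-checked with a little
arithmetic. The sketch below assumes that the "256 MB" zone size is the binary
value 268435456 B (the size used by the truncate example in this section) and
that ``ls`` prints its "total" in 1024-byte units, which is the GNU ``ls``
default. The function name is hypothetical.

```python
ZONE_SIZE = 256 * 1024 * 1024  # 268435456 B, assumed zone size


def zones_in(total_bytes, zone_size=ZONE_SIZE):
    """Return how many whole zones make up total_bytes."""
    count, remainder = divmod(total_bytes, zone_size)
    if remainder:
        raise ValueError("size is not a multiple of the zone size")
    return count


# Size of the aggregated conventional zone file /mnt/cnv/0.
cnv_zones = zones_in(140391743488)
print(cnv_zones)  # prints 523

# "total" reported for /mnt/seq: 55356 zone files, in 1024-byte units.
seq_total_kib = 55356 * ZONE_SIZE // 1024
print(seq_total_kib)  # prints 14511243264
```

Under these assumptions, the aggregated file spans 523 conventional zones, and
the "total" shown for the "seq" sub-directory matches 55356 zones of 256 MB
each.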

For sequential write zone files, the file size changes as data is appended at
the end of the file, similarly to any regular file system::

    # dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
    1+0 records in
    1+0 records out
    4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s

    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
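
An application would typically issue the same kind of append itself rather
than through ``dd``. The following Python sketch is one possible way to do it,
under stated assumptions: the function name and its parameters are
hypothetical, the file is opened with ``O_APPEND`` and ``O_DIRECT`` to mirror
``oflag=direct``, and a page-aligned anonymous ``mmap`` buffer is used because
direct I/O generally requires aligned buffers. The ``direct=False`` switch
exists only so that the sketch can be tried on a regular file.

```python
import mmap
import os


def append_block(path, payload, block_size=4096, direct=True):
    """Append one zero-padded block to the end of path; return bytes written."""
    if len(payload) > block_size:
        raise ValueError("payload does not fit in one block")
    # An anonymous mapping is page-aligned and zero-filled.
    buf = mmap.mmap(-1, block_size)
    buf.write(payload)  # fills the start of the block, the rest stays zero
    flags = os.O_WRONLY | os.O_APPEND
    if direct:
        flags |= os.O_DIRECT  # Linux-specific flag
    fd = os.open(path, flags)
    try:
        # The whole block is written, regardless of the payload length.
        return os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
```

On the example mount, ``append_block("/mnt/seq/0", b"hello")`` would be
expected to grow the file by 4096 bytes, like the ``dd`` command above.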

The written file can be truncated to the zone size, preventing any further
write operation::

    # truncate -s 268435456 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0

Truncation to 0 size frees the storage space of the file zone and allows
restarting append-writes to the file::

    # truncate -s 0 /mnt/seq/0
    # ls -l /mnt/seq/0
    -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
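
The two truncation operations can also be issued programmatically. The sketch
below wraps ``os.truncate()`` in two helpers whose names are hypothetical; the
zone size is passed in by the caller and is assumed to match the device (it is
268435456 in this example).

```python
import os


def finish_zone_file(path, zone_size):
    """Truncate the file up to the zone size so it accepts no more writes."""
    os.truncate(path, zone_size)


def reset_zone_file(path):
    """Truncate the file to 0 so append-writes can restart from the beginning."""
    os.truncate(path, 0)
```

For instance, ``finish_zone_file("/mnt/seq/0", 268435456)`` followed by
``reset_zone_file("/mnt/seq/0")`` would mirror the two ``truncate`` commands
shown above.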

Since files are statically mapped to zones on the disk, the number of blocks
of a file as reported by stat() and fstat() indicates the capacity of the file
zone::

    # stat /mnt/seq/0
    File: /mnt/seq/0
    Size: 0         	Blocks: 524288     IO Block: 4096   regular empty file
    Device: 870h/2160d	Inode: 50431       Links: 1
    Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2019-11-25 13:23:57.048971997 +0900
    Modify: 2019-11-25 13:52:25.553805765 +0900
    Change: 2019-11-25 13:52:25.553805765 +0900
    Birth: -

The number of blocks of the file ("Blocks"), expressed in units of 512 B
blocks, gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
to the device zone capacity in this example. Note that the "IO Block" field
always indicates the minimum I/O size for writes and corresponds to the device
physical sector size.
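
The same computation can be done from an application with ``os.stat()``,
whose ``st_blocks`` field is counted in 512-byte units on Linux. The sketch
below uses hypothetical helper names; the values printed at the end are the
ones from the ``stat`` output above and from the earlier 4096-byte append.

```python
import os

SECTOR = 512  # st_blocks is counted in 512-byte units


def capacity_from_blocks(blocks):
    """Convert a block count as reported by stat() into bytes."""
    return blocks * SECTOR


def zone_file_capacity(path):
    """Capacity in bytes of the zone backing a zone file."""
    return capacity_from_blocks(os.stat(path).st_blocks)


def bytes_left(capacity, size):
    """Bytes that can still be appended to a file of the given size."""
    return max(capacity - size, 0)


print(capacity_from_blocks(524288))  # prints 268435456
print(bytes_left(268435456, 4096))   # prints 268431360
```

On a regular file system ``st_blocks`` reflects allocated space instead, so
``zone_file_capacity()`` is only meaningful for files on a zonefs mount.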