.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4 KiB are supported, up to a
maximum of 1 MiB (default block size 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net

Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

============================== ========== ==========
                               Squashfs   Cramfs
============================== ========== ==========
Max filesystem size            2^64 bytes 256 MiB
Max file size                  ~ 2 TiB    16 MiB
Max files                      unlimited  unlimited
Max directories                unlimited  unlimited
Max entries per directory      unlimited  unlimited
Max block size                 1 MiB      4 KiB
Metadata compression           yes        no
Directory indexes              yes        no
Sparse file support            yes        no
Tail-end packing (fragments)   yes        no
Exportable (NFS etc.)          yes        no
Hard link support              yes        no
"." and ".." in readdir        yes        no
Real inode numbers             yes        no
32-bit uids/gids               yes        no
File creation time             yes        no
Xattr support                  yes        no
ACL support                    no         no
============================== ========== ==========

Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
can be obtained from http://www.squashfs.org. Usage instructions are also
available from this site.

The squashfs-tools development tree is now located on kernel.org::

    git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

     ---------------
    |  superblock   |
    |---------------|
    |  compression  |
    |    options    |
    |---------------|
    |  datablocks   |
    |  & fragments  |
    |---------------|
    |  inode table  |
    |---------------|
    |   directory   |
    |     table     |
    |---------------|
    |   fragment    |
    |     table     |
    |---------------|
    |    export     |
    |     table     |
    |---------------|
    |    uid/gid    |
    | lookup table  |
    |---------------|
    |     xattr     |
    |     table     |
     ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates. Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size). If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) are compressed in 8 KiB blocks. Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed. A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode can straddle two compressed blocks. Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block where
the inode is placed (<block, offset>).
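
The two encodings described in this section can be illustrated with a minimal
Python sketch. It is not taken from the kernel sources. The function names are
hypothetical, and two details are assumptions that the text above does not
state: that the two-byte length is little-endian, and that the 48-bit inode
number keeps the block location in its upper 32 bits and the byte offset in its
lower 16 bits.

.. code-block:: python

    import struct

    UNCOMPRESSED_FLAG = 0x8000   # top bit of the two-byte length


    def parse_metadata_header(raw):
        """Return (stored_length, is_compressed) for a metadata block.

        Assumption: the two-byte length is stored little-endian.
        """
        (word,) = struct.unpack("<H", raw[:2])
        return word & 0x7FFF, not (word & UNCOMPRESSED_FLAG)


    def make_inode_ref(block, offset):
        """Pack <block, offset> into a 48-bit number (assumed 32/16 split)."""
        if not 0 <= block < 1 << 32 or not 0 <= offset < 1 << 16:
            raise ValueError("block or offset out of range")
        return (block << 16) | offset


    def split_inode_ref(ref):
        """Recover <block, offset> from a 48-bit inode number."""
        if not 0 <= ref < 1 << 48:
            raise ValueError("inode reference must fit in 48 bits")
        return ref >> 16, ref & 0xFFFF

Under these assumptions, a header of ``b"\x10\x80"`` describes a 16-byte block
stored uncompressed, and ``split_inode_ref`` reverses ``make_inode_ref``.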

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metadata block containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names. The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised in a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block. A new directory header
is written whenever the inode start block changes. The directory
header/directory entry list is repeated as many times as necessary.

Directories are sorted, and can contain a directory index to speed up
file lookup. Directory indexes store one entry per metadata block, each entry
storing the index/filename mapping to the first directory header
in each metadata block. Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up. At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block). The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.
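
Because the block list holds only compressed sizes, finding the on-disk
location of block *n* means adding up the sizes of the blocks before it. The
Python sketch below illustrates that walk, and how caching the location of
every ``stride``-th block shortens it. It is a simplified illustration rather
than the kernel's implementation: the function names, the ``stride`` parameter,
and the treatment of block list entries as plain byte counts are all
assumptions.

.. code-block:: python

    def locate_block(start, compressed_sizes, n):
        """On-disk offset of block n, found by walking the whole block list.

        Blocks are contiguous, so block n begins where block n - 1 ends.
        """
        return start + sum(compressed_sizes[:n])


    def build_index(start, compressed_sizes, stride):
        """Record the offset of every stride-th block (hypothetical layout)."""
        index = []
        offset = start
        for i, size in enumerate(compressed_sizes):
            if i % stride == 0:
                index.append(offset)
            offset += size
        return index


    def locate_block_indexed(index, compressed_sizes, stride, n):
        """On-disk offset of block n, adding at most stride - 1 sizes."""
        base = (n // stride) * stride
        return index[n // stride] + sum(compressed_sizes[base:n])

Both lookups return the same offset; the indexed form only has to read the
part of the block list that follows the nearest cached entry.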

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table. This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these. For speed of access (and
because it is small), this second index table is read at mount time and
cached in memory.

3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table. This table is
stored compressed into metadata blocks. A second index table is used to
locate these. For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.
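
The two-level lookup used for ids can be sketched in Python as follows. This
is a minimal illustration under stated assumptions, not the kernel code: it
assumes ids are packed back to back as little-endian 32-bit values, so that an
8 KiB metadata block holds 2048 of them, and ``read_metadata_block`` is a
hypothetical callable that returns the decompressed contents of the metadata
block at a given location.

.. code-block:: python

    import struct

    METADATA_BLOCK_SIZE = 8192   # 8 KiB metadata blocks (section 3.2)
    ID_SIZE = 4                  # one 32-bit id per entry
    IDS_PER_BLOCK = METADATA_BLOCK_SIZE // ID_SIZE


    def lookup_id(index, index_table, read_metadata_block):
        """Convert an id index stored in an inode to a 32-bit uid or gid.

        index_table is the small second-level table cached at mount time:
        a list of on-disk locations, one per metadata block of ids.
        """
        block_no, slot = divmod(index, IDS_PER_BLOCK)
        block = read_metadata_block(index_table[block_no])
        (value,) = struct.unpack_from("<I", block, slot * ID_SIZE)
        return value

Only the index table needs to stay in memory; the metadata block holding the
requested id is read and decompressed on demand.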

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
can optionally (disabled with the -no-exports Mksquashfs option) contain
an inode number to inode disk location lookup table. This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks. A second index table is
used to locate these. For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode. The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field. The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted. Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored). This allows large values
to be stored out of line, improving scanning and lookup performance. It
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out-of-line reference to that value.

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed. To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata and
one for fragments.

The cache is not used for file datablocks; these are decompressed and cached in
the page-cache in the normal way. The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access. Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it. Because of locality of reference, these may be
read in the near future. Temporarily caching them ensures they are available
for near-future access without requiring an additional read and decompress.

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache. Because the page cache operates on page-sized
units, this may introduce additional complexity in terms of locking and
associated race conditions.
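
The benefit of the internal cache described in this section can be shown with
a small Python sketch. It is only an illustration: the text does not say how
entries are chosen for eviction, so least-recently-used eviction is an
assumption made here for simplicity, and ``read_and_decompress`` is a
hypothetical callable standing in for the read and decompress step.

.. code-block:: python

    from collections import OrderedDict


    class BlockCache:
        """Small cache of decompressed blocks, keyed by on-disk location."""

        def __init__(self, capacity, read_and_decompress):
            self.capacity = capacity
            self.read_and_decompress = read_and_decompress
            self.entries = OrderedDict()
            self.misses = 0   # number of reads that had to decompress

        def get(self, location):
            if location in self.entries:
                # Cache hit: no read or decompress needed.
                self.entries.move_to_end(location)
                return self.entries[location]
            self.misses += 1
            data = self.read_and_decompress(location)
            self.entries[location] = data
            if len(self.entries) > self.capacity:
                # Assumed policy: drop the least recently used block.
                self.entries.popitem(last=False)
            return data

Two inodes packed into the same metadata block share one cache entry, so
reading the second one after the first costs no additional decompression.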