.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk IO. On an SSD there are of course no moving parts,
but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent.

A second related trick that ext4 uses is delayed allocation. Under this
scheme, when a file needs more blocks to absorb file writes, the
filesystem defers deciding the exact placement on the disk until the
dirty buffers are actually being written out to disk. By not committing
to a particular placement until it's absolutely necessary (the commit
timeout is hit, or sync() is called, or the kernel runs out of memory),
the hope is that the filesystem can make better location decisions.

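The benefit of deferring placement can be illustrated with a toy model.
The sketch below is not the ext4 allocator: the function names and the
trivial next-free-block policy are assumptions made purely for
illustration. It interleaves single-block writes to two files and counts
how many extents each file ends up with when blocks are placed
immediately versus when placement waits until writeback.

.. code-block:: python

    def count_extents(blocks):
        """Count runs of consecutive block numbers in an ascending list."""
        if not blocks:
            return 0
        extents = 1
        for prev, cur in zip(blocks, blocks[1:]):
            if cur != prev + 1:
                extents += 1
        return extents


    def allocate_immediately(writes):
        """Hypothetical policy: place each block the moment it is written."""
        layout = {}
        next_free = 0
        for name in writes:
            layout.setdefault(name, []).append(next_free)
            next_free += 1
        return layout


    def allocate_delayed(writes):
        """Hypothetical policy: only count dirty blocks per file while
        writes arrive, then place each file contiguously at writeback."""
        pending = {}
        for name in writes:
            pending[name] = pending.get(name, 0) + 1
        layout = {}
        next_free = 0
        for name, nblocks in pending.items():
            layout[name] = list(range(next_free, next_free + nblocks))
            next_free += nblocks
        return layout


    # Two files, "a" and "b", written one block at a time in alternation.
    writes = ["a", "b"] * 4
    immediate = {n: count_extents(b)
                 for n, b in allocate_immediately(writes).items()}
    delayed = {n: count_extents(b)
               for n, b in allocate_delayed(writes).items()}

In this model, the eight interleaved writes leave each file with four
one-block extents under immediate placement, and with a single
four-block extent under delayed placement.
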
The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.

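To make "same block group as its inode" concrete, the following sketch
maps an inode number and a set of block numbers to block groups and
checks whether they coincide. The constants are assumed example values
(a 4KiB block size and 8192 inodes per group), and the helper names are
hypothetical; a real filesystem records the actual values in its
superblock, and features such as bigalloc change the block arithmetic.

.. code-block:: python

    BLOCK_SIZE = 4096                  # assumed block size in bytes
    # One bitmap block of BLOCK_SIZE bytes tracks 8 * BLOCK_SIZE blocks.
    BLOCKS_PER_GROUP = 8 * BLOCK_SIZE  # 32768 blocks for 4KiB blocks
    INODES_PER_GROUP = 8192            # assumed example value
    FIRST_DATA_BLOCK = 0               # assumed; 0 for block sizes > 1KiB


    def group_of_inode(ino):
        """Block group holding inode number `ino` (inodes start at 1)."""
        return (ino - 1) // INODES_PER_GROUP


    def group_of_block(block):
        """Block group holding physical block number `block`."""
        return (block - FIRST_DATA_BLOCK) // BLOCKS_PER_GROUP


    def data_is_local(ino, blocks):
        """True if every data block lives in the inode's block group."""
        home = group_of_inode(ino)
        return all(group_of_block(b) == home for b in blocks)

With these assumed values, inode 8193 is the first inode of group 1, so
``data_is_local(8193, blocks)`` holds only if every block number falls
in the range 32768 through 65535.
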
The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related, so it is
useful to try to keep them all together.

The fifth trick is that the disk volume is cut up into 128MiB block
groups (with the default 4KiB block size); these mini-containers are
used as outlined above to try to maintain data locality. However, there
is a deliberate quirk: when a directory is created in the root
directory, the inode allocator scans the block groups and puts that
directory into the least heavily loaded block group that it can find.
This encourages directories to spread out over a disk; as the top-level
directory/file blobs fill up one block group, the allocators simply move
on to the next block group. Allegedly this scheme evens out the loading
on the block groups, though the author suspects that the directories
which are so unlucky as to land towards the end of a spinning drive get
a raw deal performance-wise.

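The interplay between "keep a directory's inodes together" and "spread
top-level directories out" can be sketched as a toy inode allocator.
This is a simplified model under stated assumptions, not the kernel's
algorithm: the class and method names are hypothetical, "load" is
measured only by free inode count, and an overflowing group simply
spills into the next group that has room.

.. code-block:: python

    class ToyInodeAllocator:
        """Hypothetical model tracking free inodes per block group."""

        def __init__(self, num_groups, inodes_per_group):
            self.free = [inodes_per_group] * num_groups

        def _take(self, group):
            self.free[group] -= 1
            return group

        def place_top_level_dir(self):
            """Pick the least loaded group (most free inodes); ties go
            to the lowest-numbered group."""
            group = max(range(len(self.free)),
                        key=lambda g: self.free[g])
            if self.free[group] == 0:
                raise OSError("no free inodes")
            return self._take(group)

        def place_in_parent_group(self, parent_group):
            """Prefer the parent directory's group; when it is full,
            move on to the next group with a free inode."""
            n = len(self.free)
            for offset in range(n):
                group = (parent_group + offset) % n
                if self.free[group] > 0:
                    return self._take(group)
            raise OSError("no free inodes")

For example, with three groups of two inodes each, three successive
calls to ``place_top_level_dir()`` land in groups 0, 1 and 2. A file
created in the first directory then stays in group 0, and the next one
spills over to group 1 because group 0 is full.
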
Of course, if all of these mechanisms fail, one can always use e4defrag
to defragment files.