119ea8026Sopenharmony_ci## The design of littlefs 219ea8026Sopenharmony_ci 319ea8026Sopenharmony_ciA little fail-safe filesystem designed for microcontrollers. 419ea8026Sopenharmony_ci 519ea8026Sopenharmony_ci``` 619ea8026Sopenharmony_ci | | | .---._____ 719ea8026Sopenharmony_ci .-----. | | 819ea8026Sopenharmony_ci--|o |---| littlefs | 919ea8026Sopenharmony_ci--| |---| | 1019ea8026Sopenharmony_ci '-----' '----------' 1119ea8026Sopenharmony_ci | | | 1219ea8026Sopenharmony_ci``` 1319ea8026Sopenharmony_ci 1419ea8026Sopenharmony_cilittlefs was originally built as an experiment to learn about filesystem design 1519ea8026Sopenharmony_ciin the context of microcontrollers. The question was: How would you build a 1619ea8026Sopenharmony_cifilesystem that is resilient to power-loss and flash wear without using 1719ea8026Sopenharmony_ciunbounded memory? 1819ea8026Sopenharmony_ci 1919ea8026Sopenharmony_ciThis document covers the high-level design of littlefs, how it is different 2019ea8026Sopenharmony_cithan other filesystems, and the design decisions that got us here. For the 2119ea8026Sopenharmony_cilow-level details covering every bit on disk, check out [SPEC.md](SPEC.md). 2219ea8026Sopenharmony_ci 2319ea8026Sopenharmony_ci## The problem 2419ea8026Sopenharmony_ci 2519ea8026Sopenharmony_ciThe embedded systems littlefs targets are usually 32-bit microcontrollers with 2619ea8026Sopenharmony_ciaround 32 KiB of RAM and 512 KiB of ROM. These are often paired with SPI NOR 2719ea8026Sopenharmony_ciflash chips with about 4 MiB of flash storage. These devices are too small for 2819ea8026Sopenharmony_ciLinux and most existing filesystems, requiring code written specifically with 2919ea8026Sopenharmony_cisize in mind. 3019ea8026Sopenharmony_ci 3119ea8026Sopenharmony_ciFlash itself is an interesting piece of technology with its own quirks and 3219ea8026Sopenharmony_cinuance. Unlike other forms of storage, writing to flash requires two 3319ea8026Sopenharmony_cioperations: erasing and programming. Programming (setting bits to 0) is 3419ea8026Sopenharmony_cirelatively cheap and can be very granular. Erasing however (setting bits to 1), 3519ea8026Sopenharmony_cirequires an expensive and destructive operation which gives flash its name. 3619ea8026Sopenharmony_ci[Wikipedia][wikipedia-flash] has more information on how exactly flash works. 3719ea8026Sopenharmony_ci 3819ea8026Sopenharmony_ciTo make the situation more annoying, it's very common for these embedded 3919ea8026Sopenharmony_cisystems to lose power at any time. Usually, microcontroller code is simple and 4019ea8026Sopenharmony_cireactive, with no concept of a shutdown routine. This presents a big challenge 4119ea8026Sopenharmony_cifor persistent storage, where an unlucky power loss can corrupt the storage and 4219ea8026Sopenharmony_cileave a device unrecoverable. 4319ea8026Sopenharmony_ci 4419ea8026Sopenharmony_ciThis leaves us with three major requirements for an embedded filesystem. 4519ea8026Sopenharmony_ci 4619ea8026Sopenharmony_ci1. **Power-loss resilience** - On these systems, power can be lost at any time. 4719ea8026Sopenharmony_ci If a power loss corrupts any persistent data structures, this can cause the 4819ea8026Sopenharmony_ci device to become unrecoverable. An embedded filesystem must be designed to 4919ea8026Sopenharmony_ci recover from a power loss during any write operation. 5019ea8026Sopenharmony_ci 5119ea8026Sopenharmony_ci1. **Wear leveling** - Writing to flash is destructive. If a filesystem 5219ea8026Sopenharmony_ci repeatedly writes to the same block, eventually that block will wear out. 5319ea8026Sopenharmony_ci Filesystems that don't take wear into account can easily burn through blocks 5419ea8026Sopenharmony_ci used to store frequently updated metadata and cause a device's early death. 5519ea8026Sopenharmony_ci 5619ea8026Sopenharmony_ci1. **Bounded RAM/ROM** - If the above requirements weren't enough, these 5719ea8026Sopenharmony_ci systems also have very limited amounts of memory. This prevents many 5819ea8026Sopenharmony_ci existing filesystem designs, which can lean on relatively large amounts of 5919ea8026Sopenharmony_ci RAM to temporarily store filesystem metadata. 6019ea8026Sopenharmony_ci 6119ea8026Sopenharmony_ci For ROM, this means we need to keep our design simple and reuse code paths 6219ea8026Sopenharmony_ci were possible. For RAM we have a stronger requirement, all RAM usage is 6319ea8026Sopenharmony_ci bounded. This means RAM usage does not grow as the filesystem changes in 6419ea8026Sopenharmony_ci size or number of files. This creates a unique challenge as even presumably 6519ea8026Sopenharmony_ci simple operations, such as traversing the filesystem, become surprisingly 6619ea8026Sopenharmony_ci difficult. 6719ea8026Sopenharmony_ci 6819ea8026Sopenharmony_ci## Existing designs? 6919ea8026Sopenharmony_ci 7019ea8026Sopenharmony_ciSo, what's already out there? There are, of course, many different filesystems, 7119ea8026Sopenharmony_cihowever they often share and borrow feature from each other. If we look at 7219ea8026Sopenharmony_cipower-loss resilience and wear leveling, we can narrow these down to a handful 7319ea8026Sopenharmony_ciof designs. 7419ea8026Sopenharmony_ci 7519ea8026Sopenharmony_ci1. First we have the non-resilient, block based filesystems, such as [FAT] and 7619ea8026Sopenharmony_ci [ext2]. These are the earliest filesystem designs and often the most simple. 7719ea8026Sopenharmony_ci Here storage is divided into blocks, with each file being stored in a 7819ea8026Sopenharmony_ci collection of blocks. Without modifications, these filesystems are not 7919ea8026Sopenharmony_ci power-loss resilient, so updating a file is a simple as rewriting the blocks 8019ea8026Sopenharmony_ci in place. 8119ea8026Sopenharmony_ci 8219ea8026Sopenharmony_ci ``` 8319ea8026Sopenharmony_ci .--------. 8419ea8026Sopenharmony_ci | root | 8519ea8026Sopenharmony_ci | | 8619ea8026Sopenharmony_ci | | 8719ea8026Sopenharmony_ci '--------' 8819ea8026Sopenharmony_ci .-' '-. 8919ea8026Sopenharmony_ci v v 9019ea8026Sopenharmony_ci .--------. .--------. 9119ea8026Sopenharmony_ci | A | | B | 9219ea8026Sopenharmony_ci | | | | 9319ea8026Sopenharmony_ci | | | | 9419ea8026Sopenharmony_ci '--------' '--------' 9519ea8026Sopenharmony_ci .-' .-' '-. 9619ea8026Sopenharmony_ci v v v 9719ea8026Sopenharmony_ci .--------. .--------. .--------. 9819ea8026Sopenharmony_ci | C | | D | | E | 9919ea8026Sopenharmony_ci | | | | | | 10019ea8026Sopenharmony_ci | | | | | | 10119ea8026Sopenharmony_ci '--------' '--------' '--------' 10219ea8026Sopenharmony_ci ``` 10319ea8026Sopenharmony_ci 10419ea8026Sopenharmony_ci Because of their simplicity, these filesystems are usually both the fastest 10519ea8026Sopenharmony_ci and smallest. However the lack of power resilience is not great, and the 10619ea8026Sopenharmony_ci binding relationship of storage location and data removes the filesystem's 10719ea8026Sopenharmony_ci ability to manage wear. 10819ea8026Sopenharmony_ci 10919ea8026Sopenharmony_ci2. In a completely different direction, we have logging filesystems, such as 11019ea8026Sopenharmony_ci [JFFS], [YAFFS], and [SPIFFS], storage location is not bound to a piece of 11119ea8026Sopenharmony_ci data, instead the entire storage is used for a circular log which is 11219ea8026Sopenharmony_ci appended with every change made to the filesystem. Writing appends new 11319ea8026Sopenharmony_ci changes, while reading requires traversing the log to reconstruct a file. 11419ea8026Sopenharmony_ci Some logging filesystems cache files to avoid the read cost, but this comes 11519ea8026Sopenharmony_ci at a tradeoff of RAM. 11619ea8026Sopenharmony_ci 11719ea8026Sopenharmony_ci ``` 11819ea8026Sopenharmony_ci v 11919ea8026Sopenharmony_ci .--------.--------.--------.--------.--------.--------.--------.--------. 12019ea8026Sopenharmony_ci | C | new B | new A | | A | B | 12119ea8026Sopenharmony_ci | | | |-> | | | 12219ea8026Sopenharmony_ci | | | | | | | 12319ea8026Sopenharmony_ci '--------'--------'--------'--------'--------'--------'--------'--------' 12419ea8026Sopenharmony_ci ``` 12519ea8026Sopenharmony_ci 12619ea8026Sopenharmony_ci Logging filesystem are beautifully elegant. With a checksum, we can easily 12719ea8026Sopenharmony_ci detect power-loss and fall back to the previous state by ignoring failed 12819ea8026Sopenharmony_ci appends. And if that wasn't good enough, their cyclic nature means that 12919ea8026Sopenharmony_ci logging filesystems distribute wear across storage perfectly. 13019ea8026Sopenharmony_ci 13119ea8026Sopenharmony_ci The main downside is performance. If we look at garbage collection, the 13219ea8026Sopenharmony_ci process of cleaning up outdated data from the end of the log, I've yet to 13319ea8026Sopenharmony_ci see a pure logging filesystem that does not have one of these two costs: 13419ea8026Sopenharmony_ci 13519ea8026Sopenharmony_ci 1. _O(n²)_ runtime 13619ea8026Sopenharmony_ci 2. _O(n)_ RAM 13719ea8026Sopenharmony_ci 13819ea8026Sopenharmony_ci SPIFFS is a very interesting case here, as it uses the fact that repeated 13919ea8026Sopenharmony_ci programs to NOR flash is both atomic and masking. This is a very neat 14019ea8026Sopenharmony_ci solution, however it limits the type of storage you can support. 14119ea8026Sopenharmony_ci 14219ea8026Sopenharmony_ci3. Perhaps the most common type of filesystem, a journaling filesystem is the 14319ea8026Sopenharmony_ci offspring that happens when you mate a block based filesystem with a logging 14419ea8026Sopenharmony_ci filesystem. [ext4] and [NTFS] are good examples. Here, we take a normal 14519ea8026Sopenharmony_ci block based filesystem and add a bounded log where we note every change 14619ea8026Sopenharmony_ci before it occurs. 14719ea8026Sopenharmony_ci 14819ea8026Sopenharmony_ci ``` 14919ea8026Sopenharmony_ci journal 15019ea8026Sopenharmony_ci .--------.--------. 15119ea8026Sopenharmony_ci .--------. | C'| D'| | E'| 15219ea8026Sopenharmony_ci | root |-->| | |-> | | 15319ea8026Sopenharmony_ci | | | | | | | 15419ea8026Sopenharmony_ci | | '--------'--------' 15519ea8026Sopenharmony_ci '--------' 15619ea8026Sopenharmony_ci .-' '-. 15719ea8026Sopenharmony_ci v v 15819ea8026Sopenharmony_ci .--------. .--------. 15919ea8026Sopenharmony_ci | A | | B | 16019ea8026Sopenharmony_ci | | | | 16119ea8026Sopenharmony_ci | | | | 16219ea8026Sopenharmony_ci '--------' '--------' 16319ea8026Sopenharmony_ci .-' .-' '-. 16419ea8026Sopenharmony_ci v v v 16519ea8026Sopenharmony_ci .--------. .--------. .--------. 16619ea8026Sopenharmony_ci | C | | D | | E | 16719ea8026Sopenharmony_ci | | | | | | 16819ea8026Sopenharmony_ci | | | | | | 16919ea8026Sopenharmony_ci '--------' '--------' '--------' 17019ea8026Sopenharmony_ci ``` 17119ea8026Sopenharmony_ci 17219ea8026Sopenharmony_ci 17319ea8026Sopenharmony_ci This sort of filesystem takes the best from both worlds. Performance can be 17419ea8026Sopenharmony_ci as fast as a block based filesystem (though updating the journal does have 17519ea8026Sopenharmony_ci a small cost), and atomic updates to the journal allow the filesystem to 17619ea8026Sopenharmony_ci recover in the event of a power loss. 17719ea8026Sopenharmony_ci 17819ea8026Sopenharmony_ci Unfortunately, journaling filesystems have a couple of problems. They are 17919ea8026Sopenharmony_ci fairly complex, since there are effectively two filesystems running in 18019ea8026Sopenharmony_ci parallel, which comes with a code size cost. They also offer no protection 18119ea8026Sopenharmony_ci against wear because of the strong relationship between storage location 18219ea8026Sopenharmony_ci and data. 18319ea8026Sopenharmony_ci 18419ea8026Sopenharmony_ci4. Last but not least we have copy-on-write (COW) filesystems, such as 18519ea8026Sopenharmony_ci [btrfs] and [ZFS]. These are very similar to other block based filesystems, 18619ea8026Sopenharmony_ci but instead of updating block inplace, all updates are performed by creating 18719ea8026Sopenharmony_ci a copy with the changes and replacing any references to the old block with 18819ea8026Sopenharmony_ci our new block. This recursively pushes all of our problems upwards until we 18919ea8026Sopenharmony_ci reach the root of our filesystem, which is often stored in a very small log. 19019ea8026Sopenharmony_ci 19119ea8026Sopenharmony_ci ``` 19219ea8026Sopenharmony_ci .--------. .--------. 19319ea8026Sopenharmony_ci | root | write |new root| 19419ea8026Sopenharmony_ci | | ==> | | 19519ea8026Sopenharmony_ci | | | | 19619ea8026Sopenharmony_ci '--------' '--------' 19719ea8026Sopenharmony_ci .-' '-. | '-. 19819ea8026Sopenharmony_ci | .-------|------------------' v 19919ea8026Sopenharmony_ci v v v .--------. 20019ea8026Sopenharmony_ci .--------. .--------. | new B | 20119ea8026Sopenharmony_ci | A | | B | | | 20219ea8026Sopenharmony_ci | | | | | | 20319ea8026Sopenharmony_ci | | | | '--------' 20419ea8026Sopenharmony_ci '--------' '--------' .-' | 20519ea8026Sopenharmony_ci .-' .-' '-. .------------|------' 20619ea8026Sopenharmony_ci | | | | v 20719ea8026Sopenharmony_ci v v v v .--------. 20819ea8026Sopenharmony_ci .--------. .--------. .--------. | new D | 20919ea8026Sopenharmony_ci | C | | D | | E | | | 21019ea8026Sopenharmony_ci | | | | | | | | 21119ea8026Sopenharmony_ci | | | | | | '--------' 21219ea8026Sopenharmony_ci '--------' '--------' '--------' 21319ea8026Sopenharmony_ci ``` 21419ea8026Sopenharmony_ci 21519ea8026Sopenharmony_ci COW filesystems are interesting. They offer very similar performance to 21619ea8026Sopenharmony_ci block based filesystems while managing to pull off atomic updates without 21719ea8026Sopenharmony_ci storing data changes directly in a log. They even disassociate the storage 21819ea8026Sopenharmony_ci location of data, which creates an opportunity for wear leveling. 21919ea8026Sopenharmony_ci 22019ea8026Sopenharmony_ci Well, almost. The unbounded upwards movement of updates causes some 22119ea8026Sopenharmony_ci problems. Because updates to a COW filesystem don't stop until they've 22219ea8026Sopenharmony_ci reached the root, an update can cascade into a larger set of writes than 22319ea8026Sopenharmony_ci would be needed for the original data. On top of this, the upward motion 22419ea8026Sopenharmony_ci focuses these writes into the block, which can wear out much earlier than 22519ea8026Sopenharmony_ci the rest of the filesystem. 22619ea8026Sopenharmony_ci 22719ea8026Sopenharmony_ci## littlefs 22819ea8026Sopenharmony_ci 22919ea8026Sopenharmony_ciSo what does littlefs do? 23019ea8026Sopenharmony_ci 23119ea8026Sopenharmony_ciIf we look at existing filesystems, there are two interesting design patterns 23219ea8026Sopenharmony_cithat stand out, but each have their own set of problems. Logging, which 23319ea8026Sopenharmony_ciprovides independent atomicity, has poor runtime performance. And COW data 23419ea8026Sopenharmony_cistructures, which perform well, push the atomicity problem upwards. 23519ea8026Sopenharmony_ci 23619ea8026Sopenharmony_ciCan we work around these limitations? 23719ea8026Sopenharmony_ci 23819ea8026Sopenharmony_ciConsider logging. It has either a _O(n²)_ runtime or _O(n)_ RAM cost. We 23919ea8026Sopenharmony_cican't avoid these costs, _but_ if we put an upper bound on the size we can at 24019ea8026Sopenharmony_cileast prevent the theoretical cost from becoming problem. This relies on the 24119ea8026Sopenharmony_cisuper secret computer science hack where you can pretend any algorithmic 24219ea8026Sopenharmony_cicomplexity is _O(1)_ by bounding the input. 24319ea8026Sopenharmony_ci 24419ea8026Sopenharmony_ciIn the case of COW data structures, we can try twisting the definition a bit. 24519ea8026Sopenharmony_ciLet's say that our COW structure doesn't copy after a single write, but instead 24619ea8026Sopenharmony_cicopies after _n_ writes. This doesn't change most COW properties (assuming you 24719ea8026Sopenharmony_cican write atomically!), but what it does do is prevent the upward motion of 24819ea8026Sopenharmony_ciwear. This sort of copy-on-bounded-writes (CObW) still focuses wear, but at 24919ea8026Sopenharmony_cieach level we divide the propagation of wear by _n_. With a sufficiently 25019ea8026Sopenharmony_cilarge _n_ (> branching factor) wear propagation is no longer a problem. 25119ea8026Sopenharmony_ci 25219ea8026Sopenharmony_ciSee where this is going? Separate, logging and COW are imperfect solutions and 25319ea8026Sopenharmony_cihave weaknesses that limit their usefulness. But if we merge the two they can 25419ea8026Sopenharmony_cimutually solve each other's limitations. 25519ea8026Sopenharmony_ci 25619ea8026Sopenharmony_ciThis is the idea behind littlefs. At the sub-block level, littlefs is built 25719ea8026Sopenharmony_ciout of small, two block logs that provide atomic updates to metadata anywhere 25819ea8026Sopenharmony_cion the filesystem. At the super-block level, littlefs is a CObW tree of blocks 25919ea8026Sopenharmony_cithat can be evicted on demand. 26019ea8026Sopenharmony_ci 26119ea8026Sopenharmony_ci``` 26219ea8026Sopenharmony_ci root 26319ea8026Sopenharmony_ci .--------.--------. 26419ea8026Sopenharmony_ci | A'| B'| | 26519ea8026Sopenharmony_ci | | |-> | 26619ea8026Sopenharmony_ci | | | | 26719ea8026Sopenharmony_ci '--------'--------' 26819ea8026Sopenharmony_ci .----' '--------------. 26919ea8026Sopenharmony_ci A v B v 27019ea8026Sopenharmony_ci .--------.--------. .--------.--------. 27119ea8026Sopenharmony_ci | C'| D'| | | E'|new| | 27219ea8026Sopenharmony_ci | | |-> | | | E'|-> | 27319ea8026Sopenharmony_ci | | | | | | | | 27419ea8026Sopenharmony_ci '--------'--------' '--------'--------' 27519ea8026Sopenharmony_ci .-' '--. | '------------------. 27619ea8026Sopenharmony_ci v v .-' v 27719ea8026Sopenharmony_ci.--------. .--------. v .--------. 27819ea8026Sopenharmony_ci| C | | D | .--------. write | new E | 27919ea8026Sopenharmony_ci| | | | | E | ==> | | 28019ea8026Sopenharmony_ci| | | | | | | | 28119ea8026Sopenharmony_ci'--------' '--------' | | '--------' 28219ea8026Sopenharmony_ci '--------' .-' | 28319ea8026Sopenharmony_ci .-' '-. .-------------|------' 28419ea8026Sopenharmony_ci v v v v 28519ea8026Sopenharmony_ci .--------. .--------. .--------. 28619ea8026Sopenharmony_ci | F | | G | | new F | 28719ea8026Sopenharmony_ci | | | | | | 28819ea8026Sopenharmony_ci | | | | | | 28919ea8026Sopenharmony_ci '--------' '--------' '--------' 29019ea8026Sopenharmony_ci``` 29119ea8026Sopenharmony_ci 29219ea8026Sopenharmony_ciThere are still some minor issues. Small logs can be expensive in terms of 29319ea8026Sopenharmony_cistorage, in the worst case a small log costs 4x the size of the original data. 29419ea8026Sopenharmony_ciCObW structures require an efficient block allocator since allocation occurs 29519ea8026Sopenharmony_cievery _n_ writes. And there is still the challenge of keeping the RAM usage 29619ea8026Sopenharmony_ciconstant. 29719ea8026Sopenharmony_ci 29819ea8026Sopenharmony_ci## Metadata pairs 29919ea8026Sopenharmony_ci 30019ea8026Sopenharmony_ciMetadata pairs are the backbone of littlefs. These are small, two block logs 30119ea8026Sopenharmony_cithat allow atomic updates anywhere in the filesystem. 30219ea8026Sopenharmony_ci 30319ea8026Sopenharmony_ciWhy two blocks? Well, logs work by appending entries to a circular buffer 30419ea8026Sopenharmony_cistored on disk. But remember that flash has limited write granularity. We can 30519ea8026Sopenharmony_ciincrementally program new data onto erased blocks, but we need to erase a full 30619ea8026Sopenharmony_ciblock at a time. This means that in order for our circular buffer to work, we 30719ea8026Sopenharmony_cineed more than one block. 30819ea8026Sopenharmony_ci 30919ea8026Sopenharmony_ciWe could make our logs larger than two blocks, but the next challenge is how 31019ea8026Sopenharmony_cido we store references to these logs? Because the blocks themselves are erased 31119ea8026Sopenharmony_ciduring writes, using a data structure to track these blocks is complicated. 31219ea8026Sopenharmony_ciThe simple solution here is to store a two block addresses for every metadata 31319ea8026Sopenharmony_cipair. This has the added advantage that we can change out blocks in the 31419ea8026Sopenharmony_cimetadata pair independently, and we don't reduce our block granularity for 31519ea8026Sopenharmony_ciother operations. 31619ea8026Sopenharmony_ci 31719ea8026Sopenharmony_ciIn order to determine which metadata block is the most recent, we store a 31819ea8026Sopenharmony_cirevision count that we compare using [sequence arithmetic][wikipedia-sna] 31919ea8026Sopenharmony_ci(very handy for avoiding problems with integer overflow). Conveniently, this 32019ea8026Sopenharmony_cirevision count also gives us a rough idea of how many erases have occurred on 32119ea8026Sopenharmony_cithe block. 32219ea8026Sopenharmony_ci 32319ea8026Sopenharmony_ci``` 32419ea8026Sopenharmony_cimetadata pair pointer: {block 0, block 1} 32519ea8026Sopenharmony_ci | '--------------------. 32619ea8026Sopenharmony_ci '-. | 32719ea8026Sopenharmony_cidisk v v 32819ea8026Sopenharmony_ci.--------.--------.--------.--------.--------.--------.--------.--------. 32919ea8026Sopenharmony_ci| | |metadata| |metadata| | 33019ea8026Sopenharmony_ci| | |block 0 | |block 1 | | 33119ea8026Sopenharmony_ci| | | | | | | 33219ea8026Sopenharmony_ci'--------'--------'--------'--------'--------'--------'--------'--------' 33319ea8026Sopenharmony_ci '--. .----' 33419ea8026Sopenharmony_ci v v 33519ea8026Sopenharmony_ci metadata pair .----------------.----------------. 33619ea8026Sopenharmony_ci | revision 11 | revision 12 | 33719ea8026Sopenharmony_ci block 1 is |----------------|----------------| 33819ea8026Sopenharmony_ci most recent | A | A'' | 33919ea8026Sopenharmony_ci |----------------|----------------| 34019ea8026Sopenharmony_ci | checksum | checksum | 34119ea8026Sopenharmony_ci |----------------|----------------| 34219ea8026Sopenharmony_ci | B | A''' | <- most recent A 34319ea8026Sopenharmony_ci |----------------|----------------| 34419ea8026Sopenharmony_ci | A'' | checksum | 34519ea8026Sopenharmony_ci |----------------|----------------| 34619ea8026Sopenharmony_ci | checksum | | | 34719ea8026Sopenharmony_ci |----------------| v | 34819ea8026Sopenharmony_ci '----------------'----------------' 34919ea8026Sopenharmony_ci``` 35019ea8026Sopenharmony_ci 35119ea8026Sopenharmony_ciSo how do we atomically update our metadata pairs? Atomicity (a type of 35219ea8026Sopenharmony_cipower-loss resilience) requires two parts: redundancy and error detection. 35319ea8026Sopenharmony_ciError detection can be provided with a checksum, and in littlefs's case we 35419ea8026Sopenharmony_ciuse a 32-bit [CRC][wikipedia-crc]. Maintaining redundancy, on the other hand, 35519ea8026Sopenharmony_cirequires multiple stages. 35619ea8026Sopenharmony_ci 35719ea8026Sopenharmony_ci1. If our block is not full and the program size is small enough to let us 35819ea8026Sopenharmony_ci append more entries, we can simply append the entries to the log. Because 35919ea8026Sopenharmony_ci we don't overwrite the original entries (remember rewriting flash requires 36019ea8026Sopenharmony_ci an erase), we still have the original entries if we lose power during the 36119ea8026Sopenharmony_ci append. 36219ea8026Sopenharmony_ci 36319ea8026Sopenharmony_ci ``` 36419ea8026Sopenharmony_ci commit A 36519ea8026Sopenharmony_ci .----------------.----------------. .----------------.----------------. 36619ea8026Sopenharmony_ci | revision 1 | revision 0 | => | revision 1 | revision 0 | 36719ea8026Sopenharmony_ci |----------------|----------------| |----------------|----------------| 36819ea8026Sopenharmony_ci | | | | | A | | 36919ea8026Sopenharmony_ci | v | | |----------------| | 37019ea8026Sopenharmony_ci | | | | checksum | | 37119ea8026Sopenharmony_ci | | | |----------------| | 37219ea8026Sopenharmony_ci | | | | | | | 37319ea8026Sopenharmony_ci | | | | v | | 37419ea8026Sopenharmony_ci | | | | | | 37519ea8026Sopenharmony_ci | | | | | | 37619ea8026Sopenharmony_ci | | | | | | 37719ea8026Sopenharmony_ci | | | | | | 37819ea8026Sopenharmony_ci '----------------'----------------' '----------------'----------------' 37919ea8026Sopenharmony_ci ``` 38019ea8026Sopenharmony_ci 38119ea8026Sopenharmony_ci Note that littlefs doesn't maintain a checksum for each entry. Many logging 38219ea8026Sopenharmony_ci filesystems do this, but it limits what you can update in a single atomic 38319ea8026Sopenharmony_ci operation. What we can do instead is group multiple entries into a commit 38419ea8026Sopenharmony_ci that shares a single checksum. This lets us update multiple unrelated pieces 38519ea8026Sopenharmony_ci of metadata as long as they reside on the same metadata pair. 38619ea8026Sopenharmony_ci 38719ea8026Sopenharmony_ci ``` 38819ea8026Sopenharmony_ci commit B and A' 38919ea8026Sopenharmony_ci .----------------.----------------. .----------------.----------------. 39019ea8026Sopenharmony_ci | revision 1 | revision 0 | => | revision 1 | revision 0 | 39119ea8026Sopenharmony_ci |----------------|----------------| |----------------|----------------| 39219ea8026Sopenharmony_ci | A | | | A | | 39319ea8026Sopenharmony_ci |----------------| | |----------------| | 39419ea8026Sopenharmony_ci | checksum | | | checksum | | 39519ea8026Sopenharmony_ci |----------------| | |----------------| | 39619ea8026Sopenharmony_ci | | | | | B | | 39719ea8026Sopenharmony_ci | v | | |----------------| | 39819ea8026Sopenharmony_ci | | | | A' | | 39919ea8026Sopenharmony_ci | | | |----------------| | 40019ea8026Sopenharmony_ci | | | | checksum | | 40119ea8026Sopenharmony_ci | | | |----------------| | 40219ea8026Sopenharmony_ci '----------------'----------------' '----------------'----------------' 40319ea8026Sopenharmony_ci ``` 40419ea8026Sopenharmony_ci 40519ea8026Sopenharmony_ci2. If our block _is_ full of entries, we need to somehow remove outdated 40619ea8026Sopenharmony_ci entries to make space for new ones. This process is called garbage 40719ea8026Sopenharmony_ci collection, but because littlefs has multiple garbage collectors, we 40819ea8026Sopenharmony_ci also call this specific case compaction. 40919ea8026Sopenharmony_ci 41019ea8026Sopenharmony_ci Compared to other filesystems, littlefs's garbage collector is relatively 41119ea8026Sopenharmony_ci simple. We want to avoid RAM consumption, so we use a sort of brute force 41219ea8026Sopenharmony_ci solution where for each entry we check to see if a newer entry has been 41319ea8026Sopenharmony_ci written. If the entry is the most recent we append it to our new block. This 41419ea8026Sopenharmony_ci is where having two blocks becomes important, if we lose power we still have 41519ea8026Sopenharmony_ci everything in our original block. 41619ea8026Sopenharmony_ci 41719ea8026Sopenharmony_ci During this compaction step we also erase the metadata block and increment 41819ea8026Sopenharmony_ci the revision count. Because we can commit multiple entries at once, we can 41919ea8026Sopenharmony_ci write all of these changes to the second block without worrying about power 42019ea8026Sopenharmony_ci loss. It's only when the commit's checksum is written that the compacted 42119ea8026Sopenharmony_ci entries and revision count become committed and readable. 42219ea8026Sopenharmony_ci 42319ea8026Sopenharmony_ci ``` 42419ea8026Sopenharmony_ci commit B', need to compact 42519ea8026Sopenharmony_ci .----------------.----------------. .----------------.----------------. 42619ea8026Sopenharmony_ci | revision 1 | revision 0 | => | revision 1 | revision 2 | 42719ea8026Sopenharmony_ci |----------------|----------------| |----------------|----------------| 42819ea8026Sopenharmony_ci | A | | | A | A' | 42919ea8026Sopenharmony_ci |----------------| | |----------------|----------------| 43019ea8026Sopenharmony_ci | checksum | | | checksum | B' | 43119ea8026Sopenharmony_ci |----------------| | |----------------|----------------| 43219ea8026Sopenharmony_ci | B | | | B | checksum | 43319ea8026Sopenharmony_ci |----------------| | |----------------|----------------| 43419ea8026Sopenharmony_ci | A' | | | A' | | | 43519ea8026Sopenharmony_ci |----------------| | |----------------| v | 43619ea8026Sopenharmony_ci | checksum | | | checksum | | 43719ea8026Sopenharmony_ci |----------------| | |----------------| | 43819ea8026Sopenharmony_ci '----------------'----------------' '----------------'----------------' 43919ea8026Sopenharmony_ci ``` 44019ea8026Sopenharmony_ci 44119ea8026Sopenharmony_ci3. If our block is full of entries _and_ we can't find any garbage, then what? 44219ea8026Sopenharmony_ci At this point, most logging filesystems would return an error indicating no 44319ea8026Sopenharmony_ci more space is available, but because we have small logs, overflowing a log 44419ea8026Sopenharmony_ci isn't really an error condition. 44519ea8026Sopenharmony_ci 44619ea8026Sopenharmony_ci Instead, we split our original metadata pair into two metadata pairs, each 44719ea8026Sopenharmony_ci containing half of the entries, connected by a tail pointer. Instead of 44819ea8026Sopenharmony_ci increasing the size of the log and dealing with the scalability issues 44919ea8026Sopenharmony_ci associated with larger logs, we form a linked list of small bounded logs. 45019ea8026Sopenharmony_ci This is a tradeoff as this approach does use more storage space, but at the 45119ea8026Sopenharmony_ci benefit of improved scalability. 45219ea8026Sopenharmony_ci 45319ea8026Sopenharmony_ci Despite writing to two metadata pairs, we can still maintain power 45419ea8026Sopenharmony_ci resilience during this split step by first preparing the new metadata pair, 45519ea8026Sopenharmony_ci and then inserting the tail pointer during the commit to the original 45619ea8026Sopenharmony_ci metadata pair. 45719ea8026Sopenharmony_ci 45819ea8026Sopenharmony_ci ``` 45919ea8026Sopenharmony_ci commit C and D, need to split 46019ea8026Sopenharmony_ci .----------------.----------------. .----------------.----------------. 46119ea8026Sopenharmony_ci | revision 1 | revision 2 | => | revision 3 | revision 2 | 46219ea8026Sopenharmony_ci |----------------|----------------| |----------------|----------------| 46319ea8026Sopenharmony_ci | A | A' | | A' | A' | 46419ea8026Sopenharmony_ci |----------------|----------------| |----------------|----------------| 46519ea8026Sopenharmony_ci | checksum | B' | | B' | B' | 46619ea8026Sopenharmony_ci |----------------|----------------| |----------------|----------------| 46719ea8026Sopenharmony_ci | B | checksum | | tail ---------------------. 46819ea8026Sopenharmony_ci |----------------|----------------| |----------------|----------------| | 46919ea8026Sopenharmony_ci | A' | | | | checksum | | | 47019ea8026Sopenharmony_ci |----------------| v | |----------------| | | 47119ea8026Sopenharmony_ci | checksum | | | | | | | 47219ea8026Sopenharmony_ci |----------------| | | v | | | 47319ea8026Sopenharmony_ci '----------------'----------------' '----------------'----------------' | 47419ea8026Sopenharmony_ci .----------------.---------' 47519ea8026Sopenharmony_ci v v 47619ea8026Sopenharmony_ci .----------------.----------------. 47719ea8026Sopenharmony_ci | revision 1 | revision 0 | 47819ea8026Sopenharmony_ci |----------------|----------------| 47919ea8026Sopenharmony_ci | C | | 48019ea8026Sopenharmony_ci |----------------| | 48119ea8026Sopenharmony_ci | D | | 48219ea8026Sopenharmony_ci |----------------| | 48319ea8026Sopenharmony_ci | checksum | | 48419ea8026Sopenharmony_ci |----------------| | 48519ea8026Sopenharmony_ci | | | | 48619ea8026Sopenharmony_ci | v | | 48719ea8026Sopenharmony_ci | | | 48819ea8026Sopenharmony_ci | | | 48919ea8026Sopenharmony_ci '----------------'----------------' 49019ea8026Sopenharmony_ci ``` 49119ea8026Sopenharmony_ci 49219ea8026Sopenharmony_ciThere is another complexity the crops up when dealing with small logs. The 49319ea8026Sopenharmony_ciamortized runtime cost of garbage collection is not only dependent on its 49419ea8026Sopenharmony_cione time cost (_O(n²)_ for littlefs), but also depends on how often 49519ea8026Sopenharmony_cigarbage collection occurs. 49619ea8026Sopenharmony_ci 49719ea8026Sopenharmony_ciConsider two extremes: 49819ea8026Sopenharmony_ci 49919ea8026Sopenharmony_ci1. Log is empty, garbage collection occurs once every _n_ updates 50019ea8026Sopenharmony_ci2. Log is full, garbage collection occurs **every** update 50119ea8026Sopenharmony_ci 50219ea8026Sopenharmony_ciClearly we need to be more aggressive than waiting for our metadata pair to 50319ea8026Sopenharmony_cibe full. As the metadata pair approaches fullness the frequency of compactions 50419ea8026Sopenharmony_cigrows very rapidly. 50519ea8026Sopenharmony_ci 50619ea8026Sopenharmony_ciLooking at the problem generically, consider a log with ![n] bytes for each 50719ea8026Sopenharmony_cientry, ![d] dynamic entries (entries that are outdated during garbage 50819ea8026Sopenharmony_cicollection), and ![s] static entries (entries that need to be copied during 50919ea8026Sopenharmony_cigarbage collection). If we look at the amortized runtime complexity of updating 51019ea8026Sopenharmony_cithis log we get this formula: 51119ea8026Sopenharmony_ci 51219ea8026Sopenharmony_ci![cost = n + n (s / d+1)][metadata-formula1] 51319ea8026Sopenharmony_ci 51419ea8026Sopenharmony_ciIf we let ![r] be the ratio of static space to the size of our log in bytes, we 51519ea8026Sopenharmony_cifind an alternative representation of the number of static and dynamic entries: 51619ea8026Sopenharmony_ci 51719ea8026Sopenharmony_ci![s = r (size/n)][metadata-formula2] 51819ea8026Sopenharmony_ci 51919ea8026Sopenharmony_ci![d = (1 - r) (size/n)][metadata-formula3] 52019ea8026Sopenharmony_ci 52119ea8026Sopenharmony_ciSubstituting these in for ![d] and ![s] gives us a nice formula for the cost of 52219ea8026Sopenharmony_ciupdating an entry given how full the log is: 52319ea8026Sopenharmony_ci 52419ea8026Sopenharmony_ci![cost = n + n (r (size/n) / ((1-r) (size/n) + 1))][metadata-formula4] 52519ea8026Sopenharmony_ci 52619ea8026Sopenharmony_ciAssuming 100 byte entries in a 4 KiB log, we can graph this using the entry 52719ea8026Sopenharmony_cisize to find a multiplicative cost: 52819ea8026Sopenharmony_ci 52919ea8026Sopenharmony_ci![Metadata pair update cost graph][metadata-cost-graph] 53019ea8026Sopenharmony_ci 53119ea8026Sopenharmony_ciSo at 50% usage, we're seeing an average of 2x cost per update, and at 75% 53219ea8026Sopenharmony_ciusage, we're already at an average of 4x cost per update. 53319ea8026Sopenharmony_ci 53419ea8026Sopenharmony_ciTo avoid this exponential growth, instead of waiting for our metadata pair 53519ea8026Sopenharmony_cito be full, we split the metadata pair once we exceed 50% capacity. We do this 53619ea8026Sopenharmony_cilazily, waiting until we need to compact before checking if we fit in our 50% 53719ea8026Sopenharmony_cilimit. This limits the overhead of garbage collection to 2x the runtime cost, 53819ea8026Sopenharmony_cigiving us an amortized runtime complexity of _O(1)_. 53919ea8026Sopenharmony_ci 54019ea8026Sopenharmony_ci--- 54119ea8026Sopenharmony_ci 54219ea8026Sopenharmony_ciIf we look at metadata pairs and linked-lists of metadata pairs at a high 54319ea8026Sopenharmony_cilevel, they have fairly nice runtime costs. Assuming _n_ metadata pairs, 54419ea8026Sopenharmony_cieach containing _m_ metadata entries, the _lookup_ cost for a specific 54519ea8026Sopenharmony_cientry has a worst case runtime complexity of _O(nm)_. For _updating_ a specific 54619ea8026Sopenharmony_cientry, the worst case complexity is _O(nm²)_, with an amortized complexity 54719ea8026Sopenharmony_ciof only _O(nm)_. 54819ea8026Sopenharmony_ci 54919ea8026Sopenharmony_ciHowever, splitting at 50% capacity does mean that in the best case our 55019ea8026Sopenharmony_cimetadata pairs will only be 1/2 full. If we include the overhead of the second 55119ea8026Sopenharmony_ciblock in our metadata pair, each metadata entry has an effective storage cost 55219ea8026Sopenharmony_ciof 4x the original size. I imagine users would not be happy if they found 55319ea8026Sopenharmony_cithat they can only use a quarter of their original storage. Metadata pairs 55419ea8026Sopenharmony_ciprovide a mechanism for performing atomic updates, but we need a separate 55519ea8026Sopenharmony_cimechanism for storing the bulk of our data. 55619ea8026Sopenharmony_ci 55719ea8026Sopenharmony_ci## CTZ skip-lists 55819ea8026Sopenharmony_ci 55919ea8026Sopenharmony_ciMetadata pairs provide efficient atomic updates but unfortunately have a large 56019ea8026Sopenharmony_cistorage cost. But we can work around this storage cost by only using the 56119ea8026Sopenharmony_cimetadata pairs to store references to more dense, copy-on-write (COW) data 56219ea8026Sopenharmony_cistructures. 56319ea8026Sopenharmony_ci 56419ea8026Sopenharmony_ci[Copy-on-write data structures][wikipedia-cow], also called purely functional 56519ea8026Sopenharmony_cidata structures, are a category of data structures where the underlying 56619ea8026Sopenharmony_cielements are immutable. Making changes to the data requires creating new 56719ea8026Sopenharmony_cielements containing a copy of the updated data and replacing any references 56819ea8026Sopenharmony_ciwith references to the new elements. Generally, the performance of a COW data 56919ea8026Sopenharmony_cistructure depends on how many old elements can be reused after replacing parts 57019ea8026Sopenharmony_ciof the data. 57119ea8026Sopenharmony_ci 57219ea8026Sopenharmony_cilittlefs has several requirements of its COW structures. They need to be 57319ea8026Sopenharmony_ciefficient to read and write, but most frustrating, they need to be traversable 57419ea8026Sopenharmony_ciwith a constant amount of RAM. Notably this rules out 57519ea8026Sopenharmony_ci[B-trees][wikipedia-B-tree], which can not be traversed with constant RAM, and 57619ea8026Sopenharmony_ci[B+-trees][wikipedia-B+-tree], which are not possible to update with COW 57719ea8026Sopenharmony_cioperations. 57819ea8026Sopenharmony_ci 57919ea8026Sopenharmony_ci--- 58019ea8026Sopenharmony_ci 58119ea8026Sopenharmony_ciSo, what can we do? First let's consider storing files in a simple COW 58219ea8026Sopenharmony_cilinked-list. Appending a block, which is the basis for writing files, means we 58319ea8026Sopenharmony_cihave to update the last block to point to our new block. This requires a COW 58419ea8026Sopenharmony_cioperation, which means we need to update the second-to-last block, and then the 58519ea8026Sopenharmony_cithird-to-last, and so on until we've copied out the entire file. 58619ea8026Sopenharmony_ci 58719ea8026Sopenharmony_ci``` 58819ea8026Sopenharmony_ciA linked-list 58919ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. .--------. 59019ea8026Sopenharmony_ci| data 0 |->| data 1 |->| data 2 |->| data 4 |->| data 5 |->| data 6 | 59119ea8026Sopenharmony_ci| | | | | | | | | | | | 59219ea8026Sopenharmony_ci| | | | | | | | | | | | 59319ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' '--------' 59419ea8026Sopenharmony_ci``` 59519ea8026Sopenharmony_ci 59619ea8026Sopenharmony_ciTo avoid a full copy during appends, we can store the data backwards. Appending 59719ea8026Sopenharmony_ciblocks just requires adding the new block and no other blocks need to be 59819ea8026Sopenharmony_ciupdated. If we update a block in the middle, we still need to copy the 59919ea8026Sopenharmony_cifollowing blocks, but can reuse any blocks before it. Since most file writes 60019ea8026Sopenharmony_ciare linear, this design gambles that appends are the most common type of data 60119ea8026Sopenharmony_ciupdate. 60219ea8026Sopenharmony_ci 60319ea8026Sopenharmony_ci``` 60419ea8026Sopenharmony_ciA backwards linked-list 60519ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. .--------. 60619ea8026Sopenharmony_ci| data 0 |<-| data 1 |<-| data 2 |<-| data 4 |<-| data 5 |<-| data 6 | 60719ea8026Sopenharmony_ci| | | | | | | | | | | | 60819ea8026Sopenharmony_ci| | | | | | | | | | | | 60919ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' '--------' 61019ea8026Sopenharmony_ci``` 61119ea8026Sopenharmony_ci 61219ea8026Sopenharmony_ciHowever, a backwards linked-list does have a rather glaring problem. Iterating 61319ea8026Sopenharmony_ciover a file _in order_ has a runtime cost of _O(n²)_. A quadratic runtime 61419ea8026Sopenharmony_cijust to read a file! That's awful. 61519ea8026Sopenharmony_ci 61619ea8026Sopenharmony_ciFortunately we can do better. Instead of a singly linked list, littlefs 61719ea8026Sopenharmony_ciuses a multilayered linked-list often called a 61819ea8026Sopenharmony_ci[skip-list][wikipedia-skip-list]. However, unlike the most common type of 61919ea8026Sopenharmony_ciskip-list, littlefs's skip-lists are strictly deterministic built around some 62019ea8026Sopenharmony_ciinteresting properties of the count-trailing-zeros (CTZ) instruction. 62119ea8026Sopenharmony_ci 62219ea8026Sopenharmony_ciThe rules CTZ skip-lists follow are that for every _n_‍th block where _n_ 62319ea8026Sopenharmony_ciis divisible by 2‍_ˣ_, that block contains a pointer to block 62419ea8026Sopenharmony_ci_n_-2‍_ˣ_. This means that each block contains anywhere from 1 to 62519ea8026Sopenharmony_cilog₂_n_ pointers that skip to different preceding elements of the 62619ea8026Sopenharmony_ciskip-list. 62719ea8026Sopenharmony_ci 62819ea8026Sopenharmony_ciThe name comes from heavy use of the [CTZ instruction][wikipedia-ctz], which 62919ea8026Sopenharmony_cilets us calculate the power-of-two factors efficiently. For a give block _n_, 63019ea8026Sopenharmony_cithat block contains ctz(_n_)+1 pointers. 63119ea8026Sopenharmony_ci 63219ea8026Sopenharmony_ci``` 63319ea8026Sopenharmony_ciA backwards CTZ skip-list 63419ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. .--------. 63519ea8026Sopenharmony_ci| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 | 63619ea8026Sopenharmony_ci| |<-| |--| |<-| |--| | | | 63719ea8026Sopenharmony_ci| |<-| |--| |--| |--| | | | 63819ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' '--------' 63919ea8026Sopenharmony_ci``` 64019ea8026Sopenharmony_ci 64119ea8026Sopenharmony_ciThe additional pointers let us navigate the data-structure on disk much more 64219ea8026Sopenharmony_ciefficiently than in a singly linked list. 64319ea8026Sopenharmony_ci 64419ea8026Sopenharmony_ciConsider a path from data block 5 to data block 1. You can see how data block 3 64519ea8026Sopenharmony_ciwas completely skipped: 64619ea8026Sopenharmony_ci``` 64719ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. .--------. 64819ea8026Sopenharmony_ci| data 0 | | data 1 |<-| data 2 | | data 3 | | data 4 |<-| data 5 | 64919ea8026Sopenharmony_ci| | | | | |<-| |--| | | | 65019ea8026Sopenharmony_ci| | | | | | | | | | | | 65119ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' '--------' 65219ea8026Sopenharmony_ci``` 65319ea8026Sopenharmony_ci 65419ea8026Sopenharmony_ciThe path to data block 0 is even faster, requiring only two jumps: 65519ea8026Sopenharmony_ci``` 65619ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. .--------. 65719ea8026Sopenharmony_ci| data 0 | | data 1 | | data 2 | | data 3 | | data 4 |<-| data 5 | 65819ea8026Sopenharmony_ci| | | | | | | | | | | | 65919ea8026Sopenharmony_ci| |<-| |--| |--| |--| | | | 66019ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' '--------' 66119ea8026Sopenharmony_ci``` 66219ea8026Sopenharmony_ci 66319ea8026Sopenharmony_ciWe can find the runtime complexity by looking at the path to any block from 66419ea8026Sopenharmony_cithe block containing the most pointers. Every step along the path divides 66519ea8026Sopenharmony_cithe search space for the block in half, giving us a runtime of _O(log n)_. 66619ea8026Sopenharmony_ciTo get _to_ the block with the most pointers, we can perform the same steps 66719ea8026Sopenharmony_cibackwards, which puts the runtime at _O(2 log n)_ = _O(log n)_. An interesting 66819ea8026Sopenharmony_cinote is that this optimal path occurs naturally if we greedily choose the 66919ea8026Sopenharmony_cipointer that covers the most distance without passing our target. 67019ea8026Sopenharmony_ci 67119ea8026Sopenharmony_ciSo now we have a [COW] data structure that is cheap to append with a runtime 67219ea8026Sopenharmony_ciof _O(1)_, and can be read with a worst case runtime of _O(n log n)_. Given 67319ea8026Sopenharmony_cithat this runtime is also divided by the amount of data we can store in a 67419ea8026Sopenharmony_ciblock, this cost is fairly reasonable. 67519ea8026Sopenharmony_ci 67619ea8026Sopenharmony_ci--- 67719ea8026Sopenharmony_ci 67819ea8026Sopenharmony_ciThis is a new data structure, so we still have several questions. What is the 67919ea8026Sopenharmony_cistorage overhead? Can the number of pointers exceed the size of a block? How do 68019ea8026Sopenharmony_ciwe store a CTZ skip-list in our metadata pairs? 68119ea8026Sopenharmony_ci 68219ea8026Sopenharmony_ciTo find the storage overhead, we can look at the data structure as multiple 68319ea8026Sopenharmony_cilinked-lists. Each linked-list skips twice as many blocks as the previous, 68419ea8026Sopenharmony_cior from another perspective, each linked-list uses half as much storage as 68519ea8026Sopenharmony_cithe previous. As we approach infinity, the storage overhead forms a geometric 68619ea8026Sopenharmony_ciseries. Solving this tells us that on average our storage overhead is only 68719ea8026Sopenharmony_ci2 pointers per block. 68819ea8026Sopenharmony_ci 68919ea8026Sopenharmony_ci![lim,n->inf((1/n)sum,i,0->n(ctz(i)+1)) = sum,i,0->inf(1/2^i) = 2][ctz-formula1] 69019ea8026Sopenharmony_ci 69119ea8026Sopenharmony_ciBecause our file size is limited the word width we use to store sizes, we can 69219ea8026Sopenharmony_cialso solve for the maximum number of pointers we would ever need to store in a 69319ea8026Sopenharmony_ciblock. If we set the overhead of pointers equal to the block size, we get the 69419ea8026Sopenharmony_cifollowing equation. Note that both a smaller block size (![B][bigB]) and larger 69519ea8026Sopenharmony_ciword width (![w]) result in more storage overhead. 69619ea8026Sopenharmony_ci 69719ea8026Sopenharmony_ci![B = (w/8)ceil(log2(2^w / (B-2w/8)))][ctz-formula2] 69819ea8026Sopenharmony_ci 69919ea8026Sopenharmony_ciSolving the equation for ![B][bigB] gives us the minimum block size for some 70019ea8026Sopenharmony_cicommon word widths: 70119ea8026Sopenharmony_ci 70219ea8026Sopenharmony_ci1. 32-bit CTZ skip-list => minimum block size of 104 bytes 70319ea8026Sopenharmony_ci2. 64-bit CTZ skip-list => minimum block size of 448 bytes 70419ea8026Sopenharmony_ci 70519ea8026Sopenharmony_cilittlefs uses a 32-bit word width, so our blocks can only overflow with 70619ea8026Sopenharmony_cipointers if they are smaller than 104 bytes. This is an easy requirement, as 70719ea8026Sopenharmony_ciin practice, most block sizes start at 512 bytes. As long as our block size 70819ea8026Sopenharmony_ciis larger than 104 bytes, we can avoid the extra logic needed to handle 70919ea8026Sopenharmony_cipointer overflow. 71019ea8026Sopenharmony_ci 71119ea8026Sopenharmony_ciThis last question is how do we store CTZ skip-lists? We need a pointer to the 71219ea8026Sopenharmony_cihead block, the size of the skip-list, the index of the head block, and our 71319ea8026Sopenharmony_cioffset in the head block. But it's worth noting that each size maps to a unique 71419ea8026Sopenharmony_ciindex + offset pair. So in theory we can store only a single pointer and size. 71519ea8026Sopenharmony_ci 71619ea8026Sopenharmony_ciHowever, calculating the index + offset pair from the size is a bit 71719ea8026Sopenharmony_cicomplicated. We can start with a summation that loops through all of the blocks 71819ea8026Sopenharmony_ciup until our given size. Let ![B][bigB] be the block size in bytes, ![w] be the 71919ea8026Sopenharmony_ciword width in bits, ![n] be the index of the block in the skip-list, and 72019ea8026Sopenharmony_ci![N][bigN] be the file size in bytes: 72119ea8026Sopenharmony_ci 72219ea8026Sopenharmony_ci![N = sum,i,0->n(B-(w/8)(ctz(i)+1))][ctz-formula3] 72319ea8026Sopenharmony_ci 72419ea8026Sopenharmony_ciThis works quite well, but requires _O(n)_ to compute, which brings the full 72519ea8026Sopenharmony_ciruntime of reading a file up to _O(n² log n)_. Fortunately, that summation 72619ea8026Sopenharmony_cidoesn't need to touch the disk, so the practical impact is minimal. 72719ea8026Sopenharmony_ci 72819ea8026Sopenharmony_ciHowever, despite the integration of a bitwise operation, we can actually reduce 72919ea8026Sopenharmony_cithis equation to a _O(1)_ form. While browsing the amazing resource that is 73019ea8026Sopenharmony_cithe [On-Line Encyclopedia of Integer Sequences (OEIS)][oeis], I managed to find 73119ea8026Sopenharmony_ci[A001511], which matches the iteration of the CTZ instruction, 73219ea8026Sopenharmony_ciand [A005187], which matches its partial summation. Much to my 73319ea8026Sopenharmony_cisurprise, these both result from simple equations, leading us to a rather 73419ea8026Sopenharmony_ciunintuitive property that ties together two seemingly unrelated bitwise 73519ea8026Sopenharmony_ciinstructions: 73619ea8026Sopenharmony_ci 73719ea8026Sopenharmony_ci![sum,i,0->n(ctz(i)+1) = 2n-popcount(n)][ctz-formula4] 73819ea8026Sopenharmony_ci 73919ea8026Sopenharmony_ciwhere: 74019ea8026Sopenharmony_ci 74119ea8026Sopenharmony_ci1. ctz(![x]) = the number of trailing bits that are 0 in ![x] 74219ea8026Sopenharmony_ci2. popcount(![x]) = the number of bits that are 1 in ![x] 74319ea8026Sopenharmony_ci 74419ea8026Sopenharmony_ciInitial tests of this surprising property seem to hold. As ![n] approaches 74519ea8026Sopenharmony_ciinfinity, we end up with an average overhead of 2 pointers, which matches our 74619ea8026Sopenharmony_ciassumption from earlier. During iteration, the popcount function seems to 74719ea8026Sopenharmony_cihandle deviations from this average. Of course, just to make sure I wrote a 74819ea8026Sopenharmony_ciquick script that verified this property for all 32-bit integers. 74919ea8026Sopenharmony_ci 75019ea8026Sopenharmony_ciNow we can substitute into our original equation to find a more efficient 75119ea8026Sopenharmony_ciequation for file size: 75219ea8026Sopenharmony_ci 75319ea8026Sopenharmony_ci![N = Bn - (w/8)(2n-popcount(n))][ctz-formula5] 75419ea8026Sopenharmony_ci 75519ea8026Sopenharmony_ciUnfortunately, the popcount function is non-injective, so we can't solve this 75619ea8026Sopenharmony_ciequation for our index. But what we can do is solve for an ![n'] index that 75719ea8026Sopenharmony_ciis greater than ![n] with error bounded by the range of the popcount function. 75819ea8026Sopenharmony_ciWe can repeatedly substitute ![n'] into the original equation until the error 75919ea8026Sopenharmony_ciis smaller than our integer resolution. As it turns out, we only need to 76019ea8026Sopenharmony_ciperform this substitution once, which gives us this formula for our index: 76119ea8026Sopenharmony_ci 76219ea8026Sopenharmony_ci![n = floor((N-(w/8)popcount(N/(B-2w/8))) / (B-2w/8))][ctz-formula6] 76319ea8026Sopenharmony_ci 76419ea8026Sopenharmony_ciNow that we have our index ![n], we can just plug it back into the above 76519ea8026Sopenharmony_ciequation to find the offset. We run into a bit of a problem with integer 76619ea8026Sopenharmony_cioverflow, but we can avoid this by rearranging the equation a bit: 76719ea8026Sopenharmony_ci 76819ea8026Sopenharmony_ci![off = N - (B-2w/8)n - (w/8)popcount(n)][ctz-formula7] 76919ea8026Sopenharmony_ci 77019ea8026Sopenharmony_ciOur solution requires quite a bit of math, but computers are very good at math. 77119ea8026Sopenharmony_ciNow we can find both our block index and offset from a size in _O(1)_, letting 77219ea8026Sopenharmony_cius store CTZ skip-lists with only a pointer and size. 77319ea8026Sopenharmony_ci 77419ea8026Sopenharmony_ciCTZ skip-lists give us a COW data structure that is easily traversable in 77519ea8026Sopenharmony_ci_O(n)_, can be appended in _O(1)_, and can be read in _O(n log n)_. All of 77619ea8026Sopenharmony_cithese operations work in a bounded amount of RAM and require only two words of 77719ea8026Sopenharmony_cistorage overhead per block. In combination with metadata pairs, CTZ skip-lists 77819ea8026Sopenharmony_ciprovide power resilience and compact storage of data. 77919ea8026Sopenharmony_ci 78019ea8026Sopenharmony_ci``` 78119ea8026Sopenharmony_ci .--------. 78219ea8026Sopenharmony_ci .|metadata| 78319ea8026Sopenharmony_ci || | 78419ea8026Sopenharmony_ci || | 78519ea8026Sopenharmony_ci |'--------' 78619ea8026Sopenharmony_ci '----|---' 78719ea8026Sopenharmony_ci v 78819ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. 78919ea8026Sopenharmony_ci| data 0 |<-| data 1 |<-| data 2 |<-| data 3 | 79019ea8026Sopenharmony_ci| |<-| |--| | | | 79119ea8026Sopenharmony_ci| | | | | | | | 79219ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' 79319ea8026Sopenharmony_ci 79419ea8026Sopenharmony_ciwrite data to disk, create copies 79519ea8026Sopenharmony_ci=> 79619ea8026Sopenharmony_ci .--------. 79719ea8026Sopenharmony_ci .|metadata| 79819ea8026Sopenharmony_ci || | 79919ea8026Sopenharmony_ci || | 80019ea8026Sopenharmony_ci |'--------' 80119ea8026Sopenharmony_ci '----|---' 80219ea8026Sopenharmony_ci v 80319ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. 80419ea8026Sopenharmony_ci| data 0 |<-| data 1 |<-| data 2 |<-| data 3 | 80519ea8026Sopenharmony_ci| |<-| |--| | | | 80619ea8026Sopenharmony_ci| | | | | | | | 80719ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' 80819ea8026Sopenharmony_ci ^ ^ ^ 80919ea8026Sopenharmony_ci | | | .--------. .--------. .--------. .--------. 81019ea8026Sopenharmony_ci | | '----| new |<-| new |<-| new |<-| new | 81119ea8026Sopenharmony_ci | '----------------| data 2 |<-| data 3 |--| data 4 | | data 5 | 81219ea8026Sopenharmony_ci '------------------| |--| |--| | | | 81319ea8026Sopenharmony_ci '--------' '--------' '--------' '--------' 81419ea8026Sopenharmony_ci 81519ea8026Sopenharmony_cicommit to metadata pair 81619ea8026Sopenharmony_ci=> 81719ea8026Sopenharmony_ci .--------. 81819ea8026Sopenharmony_ci .|new | 81919ea8026Sopenharmony_ci ||metadata| 82019ea8026Sopenharmony_ci || | 82119ea8026Sopenharmony_ci |'--------' 82219ea8026Sopenharmony_ci '----|---' 82319ea8026Sopenharmony_ci | 82419ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. | 82519ea8026Sopenharmony_ci| data 0 |<-| data 1 |<-| data 2 |<-| data 3 | | 82619ea8026Sopenharmony_ci| |<-| |--| | | | | 82719ea8026Sopenharmony_ci| | | | | | | | | 82819ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' | 82919ea8026Sopenharmony_ci ^ ^ ^ v 83019ea8026Sopenharmony_ci | | | .--------. .--------. .--------. .--------. 83119ea8026Sopenharmony_ci | | '----| new |<-| new |<-| new |<-| new | 83219ea8026Sopenharmony_ci | '----------------| data 2 |<-| data 3 |--| data 4 | | data 5 | 83319ea8026Sopenharmony_ci '------------------| |--| |--| | | | 83419ea8026Sopenharmony_ci '--------' '--------' '--------' '--------' 83519ea8026Sopenharmony_ci``` 83619ea8026Sopenharmony_ci 83719ea8026Sopenharmony_ci## The block allocator 83819ea8026Sopenharmony_ci 83919ea8026Sopenharmony_ciSo we now have the framework for an atomic, wear leveling filesystem. Small two 84019ea8026Sopenharmony_ciblock metadata pairs provide atomic updates, while CTZ skip-lists provide 84119ea8026Sopenharmony_cicompact storage of data in COW blocks. 84219ea8026Sopenharmony_ci 84319ea8026Sopenharmony_ciBut now we need to look at the [elephant] in the room. Where do all these 84419ea8026Sopenharmony_ciblocks come from? 84519ea8026Sopenharmony_ci 84619ea8026Sopenharmony_ciDeciding which block to use next is the responsibility of the block allocator. 84719ea8026Sopenharmony_ciIn filesystem design, block allocation is often a second-class citizen, but in 84819ea8026Sopenharmony_cia COW filesystem its role becomes much more important as it is needed for 84919ea8026Sopenharmony_cinearly every write to the filesystem. 85019ea8026Sopenharmony_ci 85119ea8026Sopenharmony_ciNormally, block allocation involves some sort of free list or bitmap stored on 85219ea8026Sopenharmony_cithe filesystem that is updated with free blocks. However, with power 85319ea8026Sopenharmony_ciresilience, keeping these structures consistent becomes difficult. It doesn't 85419ea8026Sopenharmony_cihelp that any mistake in updating these structures can result in lost blocks 85519ea8026Sopenharmony_cithat are impossible to recover. 85619ea8026Sopenharmony_ci 85719ea8026Sopenharmony_cilittlefs takes a cautious approach. Instead of trusting a free list on disk, 85819ea8026Sopenharmony_cilittlefs relies on the fact that the filesystem on disk is a mirror image of 85919ea8026Sopenharmony_cithe free blocks on the disk. The block allocator operates much like a garbage 86019ea8026Sopenharmony_cicollector in a scripting language, scanning for unused blocks on demand. 86119ea8026Sopenharmony_ci 86219ea8026Sopenharmony_ci``` 86319ea8026Sopenharmony_ci .----. 86419ea8026Sopenharmony_ci |root| 86519ea8026Sopenharmony_ci | | 86619ea8026Sopenharmony_ci '----' 86719ea8026Sopenharmony_ci v-------' '-------v 86819ea8026Sopenharmony_ci.----. . . .----. 86919ea8026Sopenharmony_ci| A | . . | B | 87019ea8026Sopenharmony_ci| | . . | | 87119ea8026Sopenharmony_ci'----' . . '----' 87219ea8026Sopenharmony_ci. . . . v--' '------------v---------v 87319ea8026Sopenharmony_ci. . . .----. . .----. .----. 87419ea8026Sopenharmony_ci. . . | C | . | D | | E | 87519ea8026Sopenharmony_ci. . . | | . | | | | 87619ea8026Sopenharmony_ci. . . '----' . '----' '----' 87719ea8026Sopenharmony_ci. . . . . . . . . . 87819ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----.----.----. 87919ea8026Sopenharmony_ci| A | |root| C | B | | D | | E | | 88019ea8026Sopenharmony_ci| | | | | | | | | | | 88119ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----'----'----' 88219ea8026Sopenharmony_ci ^ ^ ^ ^ ^ 88319ea8026Sopenharmony_ci '-------------------'----'-------------------'----'-- free blocks 88419ea8026Sopenharmony_ci``` 88519ea8026Sopenharmony_ci 88619ea8026Sopenharmony_ciWhile this approach may sound complicated, the decision to not maintain a free 88719ea8026Sopenharmony_cilist greatly simplifies the overall design of littlefs. Unlike programming 88819ea8026Sopenharmony_cilanguages, there are only a handful of data structures we need to traverse. 88919ea8026Sopenharmony_ciAnd block deallocation, which occurs nearly as often as block allocation, 89019ea8026Sopenharmony_ciis simply a noop. This "drop it on the floor" strategy greatly reduces the 89119ea8026Sopenharmony_cicomplexity of managing on disk data structures, especially when handling 89219ea8026Sopenharmony_cihigh-risk error conditions. 89319ea8026Sopenharmony_ci 89419ea8026Sopenharmony_ci--- 89519ea8026Sopenharmony_ci 89619ea8026Sopenharmony_ciOur block allocator needs to find free blocks efficiently. You could traverse 89719ea8026Sopenharmony_cithrough every block on storage and check each one against our filesystem tree; 89819ea8026Sopenharmony_cihowever, the runtime would be abhorrent. We need to somehow collect multiple 89919ea8026Sopenharmony_ciblocks per traversal. 90019ea8026Sopenharmony_ci 90119ea8026Sopenharmony_ciLooking at existing designs, some larger filesystems that use a similar "drop 90219ea8026Sopenharmony_ciit on the floor" strategy store a bitmap of the entire storage in [RAM]. This 90319ea8026Sopenharmony_ciworks well because bitmaps are surprisingly compact. We can't use the same 90419ea8026Sopenharmony_cistrategy here, as it violates our constant RAM requirement, but we may be able 90519ea8026Sopenharmony_cito modify the idea into a workable solution. 90619ea8026Sopenharmony_ci 90719ea8026Sopenharmony_ci``` 90819ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----.----.----. 90919ea8026Sopenharmony_ci| A | |root| C | B | | D | | E | | 91019ea8026Sopenharmony_ci| | | | | | | | | | | 91119ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----'----'----' 91219ea8026Sopenharmony_ci 1 0 1 1 1 0 0 1 0 1 0 0 91319ea8026Sopenharmony_ci \---------------------------+----------------------------/ 91419ea8026Sopenharmony_ci v 91519ea8026Sopenharmony_ci bitmap: 0xb94 (0b101110010100) 91619ea8026Sopenharmony_ci``` 91719ea8026Sopenharmony_ci 91819ea8026Sopenharmony_ciThe block allocator in littlefs is a compromise between a disk-sized bitmap and 91919ea8026Sopenharmony_cia brute force traversal. Instead of a bitmap the size of storage, we keep track 92019ea8026Sopenharmony_ciof a small, fixed-size bitmap called the lookahead buffer. During block 92119ea8026Sopenharmony_ciallocation, we take blocks from the lookahead buffer. If the lookahead buffer 92219ea8026Sopenharmony_ciis empty, we scan the filesystem for more free blocks, populating our lookahead 92319ea8026Sopenharmony_cibuffer. In each scan we use an increasing offset, circling the storage as 92419ea8026Sopenharmony_ciblocks are allocated. 92519ea8026Sopenharmony_ci 92619ea8026Sopenharmony_ciHere's what it might look like to allocate 4 blocks on a decently busy 92719ea8026Sopenharmony_cifilesystem with a 32 bit lookahead and a total of 128 blocks (512 KiB 92819ea8026Sopenharmony_ciof storage if blocks are 4 KiB): 92919ea8026Sopenharmony_ci``` 93019ea8026Sopenharmony_ciboot... lookahead: 93119ea8026Sopenharmony_ci fs blocks: fffff9fffffffffeffffffffffff0000 93219ea8026Sopenharmony_ciscanning... lookahead: fffff9ff 93319ea8026Sopenharmony_ci fs blocks: fffff9fffffffffeffffffffffff0000 93419ea8026Sopenharmony_cialloc = 21 lookahead: fffffdff 93519ea8026Sopenharmony_ci fs blocks: fffffdfffffffffeffffffffffff0000 93619ea8026Sopenharmony_cialloc = 22 lookahead: ffffffff 93719ea8026Sopenharmony_ci fs blocks: fffffffffffffffeffffffffffff0000 93819ea8026Sopenharmony_ciscanning... lookahead: fffffffe 93919ea8026Sopenharmony_ci fs blocks: fffffffffffffffeffffffffffff0000 94019ea8026Sopenharmony_cialloc = 63 lookahead: ffffffff 94119ea8026Sopenharmony_ci fs blocks: ffffffffffffffffffffffffffff0000 94219ea8026Sopenharmony_ciscanning... lookahead: ffffffff 94319ea8026Sopenharmony_ci fs blocks: ffffffffffffffffffffffffffff0000 94419ea8026Sopenharmony_ciscanning... lookahead: ffffffff 94519ea8026Sopenharmony_ci fs blocks: ffffffffffffffffffffffffffff0000 94619ea8026Sopenharmony_ciscanning... lookahead: ffff0000 94719ea8026Sopenharmony_ci fs blocks: ffffffffffffffffffffffffffff0000 94819ea8026Sopenharmony_cialloc = 112 lookahead: ffff8000 94919ea8026Sopenharmony_ci fs blocks: ffffffffffffffffffffffffffff8000 95019ea8026Sopenharmony_ci``` 95119ea8026Sopenharmony_ci 95219ea8026Sopenharmony_ciThis lookahead approach has a runtime complexity of _O(n²)_ to completely 95319ea8026Sopenharmony_ciscan storage; however, bitmaps are surprisingly compact, and in practice only 95419ea8026Sopenharmony_cione or two passes are usually needed to find free blocks. Additionally, the 95519ea8026Sopenharmony_ciperformance of the allocator can be optimized by adjusting the block size or 95619ea8026Sopenharmony_cisize of the lookahead buffer, trading either write granularity or RAM for 95719ea8026Sopenharmony_ciallocator performance. 95819ea8026Sopenharmony_ci 95919ea8026Sopenharmony_ci## Wear leveling 96019ea8026Sopenharmony_ci 96119ea8026Sopenharmony_ciThe block allocator has a secondary role: wear leveling. 96219ea8026Sopenharmony_ci 96319ea8026Sopenharmony_ciWear leveling is the process of distributing wear across all blocks in the 96419ea8026Sopenharmony_cistorage to prevent the filesystem from experiencing an early death due to 96519ea8026Sopenharmony_ciwear on a single block in the storage. 96619ea8026Sopenharmony_ci 96719ea8026Sopenharmony_cilittlefs has two methods of protecting against wear: 96819ea8026Sopenharmony_ci1. Detection and recovery from bad blocks 96919ea8026Sopenharmony_ci2. Evenly distributing wear across dynamic blocks 97019ea8026Sopenharmony_ci 97119ea8026Sopenharmony_ci--- 97219ea8026Sopenharmony_ci 97319ea8026Sopenharmony_ciRecovery from bad blocks doesn't actually have anything to do with the block 97419ea8026Sopenharmony_ciallocator itself. Instead, it relies on the ability of the filesystem to detect 97519ea8026Sopenharmony_ciand evict bad blocks when they occur. 97619ea8026Sopenharmony_ci 97719ea8026Sopenharmony_ciIn littlefs, it is fairly straightforward to detect bad blocks at write time. 97819ea8026Sopenharmony_ciAll writes must be sourced by some form of data in RAM, so immediately after we 97919ea8026Sopenharmony_ciwrite to a block, we can read the data back and verify that it was written 98019ea8026Sopenharmony_cicorrectly. If we find that the data on disk does not match the copy we have in 98119ea8026Sopenharmony_ciRAM, a write error has occurred and we most likely have a bad block. 98219ea8026Sopenharmony_ci 98319ea8026Sopenharmony_ciOnce we detect a bad block, we need to recover from it. In the case of write 98419ea8026Sopenharmony_cierrors, we have a copy of the corrupted data in RAM, so all we need to do is 98519ea8026Sopenharmony_cievict the bad block, allocate a new, hopefully good block, and repeat the write 98619ea8026Sopenharmony_cithat previously failed. 98719ea8026Sopenharmony_ci 98819ea8026Sopenharmony_ciThe actual act of evicting the bad block and replacing it with a new block is 98919ea8026Sopenharmony_cileft up to the filesystem's copy-on-bounded-writes (CObW) data structures. One 99019ea8026Sopenharmony_ciproperty of CObW data structures is that any block can be replaced during a 99119ea8026Sopenharmony_ciCOW operation. The bounded-writes part is normally triggered by a counter, but 99219ea8026Sopenharmony_cinothing prevents us from triggering a COW operation as soon as we find a bad 99319ea8026Sopenharmony_ciblock. 99419ea8026Sopenharmony_ci 99519ea8026Sopenharmony_ci``` 99619ea8026Sopenharmony_ci .----. 99719ea8026Sopenharmony_ci |root| 99819ea8026Sopenharmony_ci | | 99919ea8026Sopenharmony_ci '----' 100019ea8026Sopenharmony_ci v--' '----------------------v 100119ea8026Sopenharmony_ci.----. .----. 100219ea8026Sopenharmony_ci| A | | B | 100319ea8026Sopenharmony_ci| | | | 100419ea8026Sopenharmony_ci'----' '----' 100519ea8026Sopenharmony_ci. . v---' . 100619ea8026Sopenharmony_ci. . .----. . 100719ea8026Sopenharmony_ci. . | C | . 100819ea8026Sopenharmony_ci. . | | . 100919ea8026Sopenharmony_ci. . '----' . 101019ea8026Sopenharmony_ci. . . . . 101119ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 101219ea8026Sopenharmony_ci| A |root| | C | B | | 101319ea8026Sopenharmony_ci| | | | | | | 101419ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 101519ea8026Sopenharmony_ci 101619ea8026Sopenharmony_ciupdate C 101719ea8026Sopenharmony_ci=> 101819ea8026Sopenharmony_ci .----. 101919ea8026Sopenharmony_ci |root| 102019ea8026Sopenharmony_ci | | 102119ea8026Sopenharmony_ci '----' 102219ea8026Sopenharmony_ci v--' '----------------------v 102319ea8026Sopenharmony_ci.----. .----. 102419ea8026Sopenharmony_ci| A | | B | 102519ea8026Sopenharmony_ci| | | | 102619ea8026Sopenharmony_ci'----' '----' 102719ea8026Sopenharmony_ci. . v---' . 102819ea8026Sopenharmony_ci. . .----. . 102919ea8026Sopenharmony_ci. . |bad | . 103019ea8026Sopenharmony_ci. . |blck| . 103119ea8026Sopenharmony_ci. . '----' . 103219ea8026Sopenharmony_ci. . . . . 103319ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 103419ea8026Sopenharmony_ci| A |root| |bad | B | | 103519ea8026Sopenharmony_ci| | | |blck| | | 103619ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 103719ea8026Sopenharmony_ci 103819ea8026Sopenharmony_cioh no! bad block! relocate C 103919ea8026Sopenharmony_ci=> 104019ea8026Sopenharmony_ci .----. 104119ea8026Sopenharmony_ci |root| 104219ea8026Sopenharmony_ci | | 104319ea8026Sopenharmony_ci '----' 104419ea8026Sopenharmony_ci v--' '----------------------v 104519ea8026Sopenharmony_ci.----. .----. 104619ea8026Sopenharmony_ci| A | | B | 104719ea8026Sopenharmony_ci| | | | 104819ea8026Sopenharmony_ci'----' '----' 104919ea8026Sopenharmony_ci. . v---' . 105019ea8026Sopenharmony_ci. . .----. . 105119ea8026Sopenharmony_ci. . |bad | . 105219ea8026Sopenharmony_ci. . |blck| . 105319ea8026Sopenharmony_ci. . '----' . 105419ea8026Sopenharmony_ci. . . . . 105519ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 105619ea8026Sopenharmony_ci| A |root| |bad | B |bad | | 105719ea8026Sopenharmony_ci| | | |blck| |blck| | 105819ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 105919ea8026Sopenharmony_ci ---------> 106019ea8026Sopenharmony_cioh no! bad block! relocate C 106119ea8026Sopenharmony_ci=> 106219ea8026Sopenharmony_ci .----. 106319ea8026Sopenharmony_ci |root| 106419ea8026Sopenharmony_ci | | 106519ea8026Sopenharmony_ci '----' 106619ea8026Sopenharmony_ci v--' '----------------------v 106719ea8026Sopenharmony_ci.----. .----. 106819ea8026Sopenharmony_ci| A | | B | 106919ea8026Sopenharmony_ci| | | | 107019ea8026Sopenharmony_ci'----' '----' 107119ea8026Sopenharmony_ci. . v---' . 107219ea8026Sopenharmony_ci. . .----. . .----. 107319ea8026Sopenharmony_ci. . |bad | . | C' | 107419ea8026Sopenharmony_ci. . |blck| . | | 107519ea8026Sopenharmony_ci. . '----' . '----' 107619ea8026Sopenharmony_ci. . . . . . . 107719ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 107819ea8026Sopenharmony_ci| A |root| |bad | B |bad | C' | | 107919ea8026Sopenharmony_ci| | | |blck| |blck| | | 108019ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 108119ea8026Sopenharmony_ci --------------> 108219ea8026Sopenharmony_cisuccessfully relocated C, update B 108319ea8026Sopenharmony_ci=> 108419ea8026Sopenharmony_ci .----. 108519ea8026Sopenharmony_ci |root| 108619ea8026Sopenharmony_ci | | 108719ea8026Sopenharmony_ci '----' 108819ea8026Sopenharmony_ci v--' '----------------------v 108919ea8026Sopenharmony_ci.----. .----. 109019ea8026Sopenharmony_ci| A | |bad | 109119ea8026Sopenharmony_ci| | |blck| 109219ea8026Sopenharmony_ci'----' '----' 109319ea8026Sopenharmony_ci. . v---' . 109419ea8026Sopenharmony_ci. . .----. . .----. 109519ea8026Sopenharmony_ci. . |bad | . | C' | 109619ea8026Sopenharmony_ci. . |blck| . | | 109719ea8026Sopenharmony_ci. . '----' . '----' 109819ea8026Sopenharmony_ci. . . . . . . 109919ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 110019ea8026Sopenharmony_ci| A |root| |bad |bad |bad | C' | | 110119ea8026Sopenharmony_ci| | | |blck|blck|blck| | | 110219ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 110319ea8026Sopenharmony_ci 110419ea8026Sopenharmony_cioh no! bad block! relocate B 110519ea8026Sopenharmony_ci=> 110619ea8026Sopenharmony_ci .----. 110719ea8026Sopenharmony_ci |root| 110819ea8026Sopenharmony_ci | | 110919ea8026Sopenharmony_ci '----' 111019ea8026Sopenharmony_ci v--' '----------------------v 111119ea8026Sopenharmony_ci.----. .----. .----. 111219ea8026Sopenharmony_ci| A | |bad | |bad | 111319ea8026Sopenharmony_ci| | |blck| |blck| 111419ea8026Sopenharmony_ci'----' '----' '----' 111519ea8026Sopenharmony_ci. . v---' . . . 111619ea8026Sopenharmony_ci. . .----. . .----. . 111719ea8026Sopenharmony_ci. . |bad | . | C' | . 111819ea8026Sopenharmony_ci. . |blck| . | | . 111919ea8026Sopenharmony_ci. . '----' . '----' . 112019ea8026Sopenharmony_ci. . . . . . . . 112119ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 112219ea8026Sopenharmony_ci| A |root| |bad |bad |bad | C' |bad | 112319ea8026Sopenharmony_ci| | | |blck|blck|blck| |blck| 112419ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 112519ea8026Sopenharmony_ci --------------> 112619ea8026Sopenharmony_cioh no! bad block! relocate B 112719ea8026Sopenharmony_ci=> 112819ea8026Sopenharmony_ci .----. 112919ea8026Sopenharmony_ci |root| 113019ea8026Sopenharmony_ci | | 113119ea8026Sopenharmony_ci '----' 113219ea8026Sopenharmony_ci v--' '----------------------v 113319ea8026Sopenharmony_ci.----. .----. .----. 113419ea8026Sopenharmony_ci| A | | B' | |bad | 113519ea8026Sopenharmony_ci| | | | |blck| 113619ea8026Sopenharmony_ci'----' '----' '----' 113719ea8026Sopenharmony_ci. . . | . .---' . 113819ea8026Sopenharmony_ci. . . '--------------v-------------v 113919ea8026Sopenharmony_ci. . . . .----. . .----. 114019ea8026Sopenharmony_ci. . . . |bad | . | C' | 114119ea8026Sopenharmony_ci. . . . |blck| . | | 114219ea8026Sopenharmony_ci. . . . '----' . '----' 114319ea8026Sopenharmony_ci. . . . . . . . . 114419ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 114519ea8026Sopenharmony_ci| A |root| B' | |bad |bad |bad | C' |bad | 114619ea8026Sopenharmony_ci| | | | |blck|blck|blck| |blck| 114719ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 114819ea8026Sopenharmony_ci------------> ------------------ 114919ea8026Sopenharmony_cisuccessfully relocated B, update root 115019ea8026Sopenharmony_ci=> 115119ea8026Sopenharmony_ci .----. 115219ea8026Sopenharmony_ci |root| 115319ea8026Sopenharmony_ci | | 115419ea8026Sopenharmony_ci '----' 115519ea8026Sopenharmony_ci v--' '--v 115619ea8026Sopenharmony_ci.----. .----. 115719ea8026Sopenharmony_ci| A | | B' | 115819ea8026Sopenharmony_ci| | | | 115919ea8026Sopenharmony_ci'----' '----' 116019ea8026Sopenharmony_ci. . . '---------------------------v 116119ea8026Sopenharmony_ci. . . . .----. 116219ea8026Sopenharmony_ci. . . . | C' | 116319ea8026Sopenharmony_ci. . . . | | 116419ea8026Sopenharmony_ci. . . . '----' 116519ea8026Sopenharmony_ci. . . . . . 116619ea8026Sopenharmony_ci.----.----.----.----.----.----.----.----.----.----. 116719ea8026Sopenharmony_ci| A |root| B' | |bad |bad |bad | C' |bad | 116819ea8026Sopenharmony_ci| | | | |blck|blck|blck| |blck| 116919ea8026Sopenharmony_ci'----'----'----'----'----'----'----'----'----'----' 117019ea8026Sopenharmony_ci``` 117119ea8026Sopenharmony_ci 117219ea8026Sopenharmony_ciWe may find that the new block is also bad, but hopefully after repeating this 117319ea8026Sopenharmony_cicycle we'll eventually find a new block where a write succeeds. If we don't, 117419ea8026Sopenharmony_cithat means that all blocks in our storage are bad, and we've reached the end of 117519ea8026Sopenharmony_ciour device's usable life. At this point, littlefs will return an "out of space" 117619ea8026Sopenharmony_cierror. This is technically true, as there are no more good blocks, but as an 117719ea8026Sopenharmony_ciadded benefit it also matches the error condition expected by users of 117819ea8026Sopenharmony_cidynamically sized data. 117919ea8026Sopenharmony_ci 118019ea8026Sopenharmony_ci--- 118119ea8026Sopenharmony_ci 118219ea8026Sopenharmony_ciRead errors, on the other hand, are quite a bit more complicated. We don't have 118319ea8026Sopenharmony_cia copy of the data lingering around in RAM, so we need a way to reconstruct the 118419ea8026Sopenharmony_cioriginal data even after it has been corrupted. One such mechanism for this is 118519ea8026Sopenharmony_ci[error-correction-codes (ECC)][wikipedia-ecc]. 118619ea8026Sopenharmony_ci 118719ea8026Sopenharmony_ciECC is an extension to the idea of a checksum. Where a checksum such as CRC can 118819ea8026Sopenharmony_cidetect that an error has occurred in the data, ECC can detect and actually 118919ea8026Sopenharmony_cicorrect some amount of errors. However, there is a limit to how many errors ECC 119019ea8026Sopenharmony_cican detect: the [Hamming bound][wikipedia-hamming-bound]. As the number of 119119ea8026Sopenharmony_cierrors approaches the Hamming bound, we may still be able to detect errors, but 119219ea8026Sopenharmony_cican no longer fix the data. If we've reached this point the block is 119319ea8026Sopenharmony_ciunrecoverable. 119419ea8026Sopenharmony_ci 119519ea8026Sopenharmony_cilittlefs by itself does **not** provide ECC. The block nature and relatively 119619ea8026Sopenharmony_cilarge footprint of ECC does not work well with the dynamically sized data of 119719ea8026Sopenharmony_cifilesystems, correcting errors without RAM is complicated, and ECC fits better 119819ea8026Sopenharmony_ciwith the geometry of block devices. In fact, several NOR flash chips have extra 119919ea8026Sopenharmony_cistorage intended for ECC, and many NAND chips can even calculate ECC on the 120019ea8026Sopenharmony_cichip itself. 120119ea8026Sopenharmony_ci 120219ea8026Sopenharmony_ciIn littlefs, ECC is entirely optional. Read errors can instead be prevented 120319ea8026Sopenharmony_ciproactively by wear leveling. But it's important to note that ECC can be used 120419ea8026Sopenharmony_ciat the block device level to modestly extend the life of a device. littlefs 120519ea8026Sopenharmony_cirespects any errors reported by the block device, allowing a block device to 120619ea8026Sopenharmony_ciprovide additional aggressive error detection. 120719ea8026Sopenharmony_ci 120819ea8026Sopenharmony_ci--- 120919ea8026Sopenharmony_ci 121019ea8026Sopenharmony_ciTo avoid read errors, we need to be proactive, as opposed to reactive as we 121119ea8026Sopenharmony_ciwere with write errors. 121219ea8026Sopenharmony_ci 121319ea8026Sopenharmony_ciOne way to do this is to detect when the number of errors in a block exceeds 121419ea8026Sopenharmony_cisome threshold, but is still recoverable. With ECC we can do this at write 121519ea8026Sopenharmony_citime, and treat the error as a write error, evicting the block before fatal 121619ea8026Sopenharmony_ciread errors have a chance to develop. 121719ea8026Sopenharmony_ci 121819ea8026Sopenharmony_ciA different, more generic strategy, is to proactively distribute wear across 121919ea8026Sopenharmony_ciall blocks in the storage, with the hope that no single block fails before the 122019ea8026Sopenharmony_cirest of storage is approaching the end of its usable life. This is called 122119ea8026Sopenharmony_ciwear leveling. 122219ea8026Sopenharmony_ci 122319ea8026Sopenharmony_ciGenerally, wear leveling algorithms fall into one of two categories: 122419ea8026Sopenharmony_ci 122519ea8026Sopenharmony_ci1. [Dynamic wear leveling][wikipedia-dynamic-wear-leveling], where we 122619ea8026Sopenharmony_ci distribute wear over "dynamic" blocks. The can be accomplished by 122719ea8026Sopenharmony_ci only considering unused blocks. 122819ea8026Sopenharmony_ci 122919ea8026Sopenharmony_ci2. [Static wear leveling][wikipedia-static-wear-leveling], where we 123019ea8026Sopenharmony_ci distribute wear over both "dynamic" and "static" blocks. To make this work, 123119ea8026Sopenharmony_ci we need to consider all blocks, including blocks that already contain data. 123219ea8026Sopenharmony_ci 123319ea8026Sopenharmony_ciAs a tradeoff for code size and complexity, littlefs (currently) only provides 123419ea8026Sopenharmony_cidynamic wear leveling. This is a best effort solution. Wear is not distributed 123519ea8026Sopenharmony_ciperfectly, but it is distributed among the free blocks and greatly extends the 123619ea8026Sopenharmony_cilife of a device. 123719ea8026Sopenharmony_ci 123819ea8026Sopenharmony_ciOn top of this, littlefs uses a statistical wear leveling algorithm. What this 123919ea8026Sopenharmony_cimeans is that we don’t actively track wear, instead we rely on a uniform 124019ea8026Sopenharmony_cidistribution of wear across storage to approximate a dynamic wear leveling 124119ea8026Sopenharmony_cialgorithm. Despite the long name, this is actually a simplification of dynamic 124219ea8026Sopenharmony_ciwear leveling. 124319ea8026Sopenharmony_ci 124419ea8026Sopenharmony_ciThe uniform distribution of wear is left up to the block allocator, which 124519ea8026Sopenharmony_cicreates a uniform distribution in two parts. The easy part is when the device 124619ea8026Sopenharmony_ciis powered, in which case we allocate the blocks linearly, circling the device. 124719ea8026Sopenharmony_ciThe harder part is what to do when the device loses power. We can't just 124819ea8026Sopenharmony_cirestart the allocator at the beginning of storage, as this would bias the wear. 124919ea8026Sopenharmony_ciInstead, we start the allocator as a random offset every time we mount the 125019ea8026Sopenharmony_cifilesystem. As long as this random offset is uniform, the combined allocation 125119ea8026Sopenharmony_cipattern is also a uniform distribution. 125219ea8026Sopenharmony_ci 125319ea8026Sopenharmony_ci![Cumulative wear distribution graph][wear-distribution-graph] 125419ea8026Sopenharmony_ci 125519ea8026Sopenharmony_ciInitially, this approach to wear leveling looks like it creates a difficult 125619ea8026Sopenharmony_cidependency on a power-independent random number generator, which must return 125719ea8026Sopenharmony_cidifferent random numbers on each boot. However, the filesystem is in a 125819ea8026Sopenharmony_cirelatively unique situation in that it is sitting on top of a large of amount 125919ea8026Sopenharmony_ciof entropy that persists across power loss. 126019ea8026Sopenharmony_ci 126119ea8026Sopenharmony_ciWe can actually use the data on disk to directly drive our random number 126219ea8026Sopenharmony_cigenerator. In practice, this is implemented by xoring the checksums of each 126319ea8026Sopenharmony_cimetadata pair, which is already calculated to fetch and mount the filesystem. 126419ea8026Sopenharmony_ci 126519ea8026Sopenharmony_ci``` 126619ea8026Sopenharmony_ci .--------. \ probably random 126719ea8026Sopenharmony_ci .|metadata| | ^ 126819ea8026Sopenharmony_ci || | +-> crc ----------------------> xor 126919ea8026Sopenharmony_ci || | | ^ 127019ea8026Sopenharmony_ci |'--------' / | 127119ea8026Sopenharmony_ci '---|--|-' | 127219ea8026Sopenharmony_ci .-' '-------------------------. | 127319ea8026Sopenharmony_ci | | | 127419ea8026Sopenharmony_ci | .--------------> xor ------------> xor 127519ea8026Sopenharmony_ci | | ^ | ^ 127619ea8026Sopenharmony_ci v crc crc v crc 127719ea8026Sopenharmony_ci .--------. \ ^ .--------. \ ^ .--------. \ ^ 127819ea8026Sopenharmony_ci .|metadata|-|--|-->|metadata| | | .|metadata| | | 127919ea8026Sopenharmony_ci || | +--' || | +--' || | +--' 128019ea8026Sopenharmony_ci || | | || | | || | | 128119ea8026Sopenharmony_ci |'--------' / |'--------' / |'--------' / 128219ea8026Sopenharmony_ci '---|--|-' '----|---' '---|--|-' 128319ea8026Sopenharmony_ci .-' '-. | .-' '-. 128419ea8026Sopenharmony_ci v v v v v 128519ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. 128619ea8026Sopenharmony_ci| data | | data | | data | | data | | data | 128719ea8026Sopenharmony_ci| | | | | | | | | | 128819ea8026Sopenharmony_ci| | | | | | | | | | 128919ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' 129019ea8026Sopenharmony_ci``` 129119ea8026Sopenharmony_ci 129219ea8026Sopenharmony_ciNote that this random number generator is not perfect. It only returns unique 129319ea8026Sopenharmony_cirandom numbers when the filesystem is modified. This is exactly what we want 129419ea8026Sopenharmony_cifor distributing wear in the allocator, but means this random number generator 129519ea8026Sopenharmony_ciis not useful for general use. 129619ea8026Sopenharmony_ci 129719ea8026Sopenharmony_ci--- 129819ea8026Sopenharmony_ci 129919ea8026Sopenharmony_ciTogether, bad block detection and dynamic wear leveling provide a best effort 130019ea8026Sopenharmony_cisolution for avoiding the early death of a filesystem due to wear. Importantly, 130119ea8026Sopenharmony_cilittlefs's wear leveling algorithm provides a key feature: You can increase the 130219ea8026Sopenharmony_cilife of a device simply by increasing the size of storage. And if more 130319ea8026Sopenharmony_ciaggressive wear leveling is desired, you can always combine littlefs with a 130419ea8026Sopenharmony_ci[flash translation layer (FTL)][wikipedia-ftl] to get a small power resilient 130519ea8026Sopenharmony_cifilesystem with static wear leveling. 130619ea8026Sopenharmony_ci 130719ea8026Sopenharmony_ci## Files 130819ea8026Sopenharmony_ci 130919ea8026Sopenharmony_ciNow that we have our building blocks out of the way, we can start looking at 131019ea8026Sopenharmony_ciour filesystem as a whole. 131119ea8026Sopenharmony_ci 131219ea8026Sopenharmony_ciThe first step: How do we actually store our files? 131319ea8026Sopenharmony_ci 131419ea8026Sopenharmony_ciWe've determined that CTZ skip-lists are pretty good at storing data compactly, 131519ea8026Sopenharmony_ciso following the precedent found in other filesystems we could give each file 131619ea8026Sopenharmony_cia skip-list stored in a metadata pair that acts as an inode for the file. 131719ea8026Sopenharmony_ci 131819ea8026Sopenharmony_ci 131919ea8026Sopenharmony_ci``` 132019ea8026Sopenharmony_ci .--------. 132119ea8026Sopenharmony_ci .|metadata| 132219ea8026Sopenharmony_ci || | 132319ea8026Sopenharmony_ci || | 132419ea8026Sopenharmony_ci |'--------' 132519ea8026Sopenharmony_ci '----|---' 132619ea8026Sopenharmony_ci v 132719ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. 132819ea8026Sopenharmony_ci| data 0 |<-| data 1 |<-| data 2 |<-| data 3 | 132919ea8026Sopenharmony_ci| |<-| |--| | | | 133019ea8026Sopenharmony_ci| | | | | | | | 133119ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' 133219ea8026Sopenharmony_ci``` 133319ea8026Sopenharmony_ci 133419ea8026Sopenharmony_ciHowever, this doesn't work well when files are small, which is common for 133519ea8026Sopenharmony_ciembedded systems. Compared to PCs, _all_ data in an embedded system is small. 133619ea8026Sopenharmony_ci 133719ea8026Sopenharmony_ciConsider a small 4-byte file. With a two block metadata-pair and one block for 133819ea8026Sopenharmony_cithe CTZ skip-list, we find ourselves using a full 3 blocks. On most NOR flash 133919ea8026Sopenharmony_ciwith 4 KiB blocks, this is 12 KiB of overhead. A ridiculous 3072x increase. 134019ea8026Sopenharmony_ci 134119ea8026Sopenharmony_ci``` 134219ea8026Sopenharmony_cifile stored as inode, 4 bytes costs ~12 KiB 134319ea8026Sopenharmony_ci 134419ea8026Sopenharmony_ci .----------------. \ 134519ea8026Sopenharmony_ci.| revision | | 134619ea8026Sopenharmony_ci||----------------| \ | 134719ea8026Sopenharmony_ci|| skiplist ---. +- metadata | 134819ea8026Sopenharmony_ci||----------------| | / 4x8 bytes | 134919ea8026Sopenharmony_ci|| checksum | | 32 bytes | 135019ea8026Sopenharmony_ci||----------------| | | 135119ea8026Sopenharmony_ci|| | | | +- metadata pair 135219ea8026Sopenharmony_ci|| v | | | 2x4 KiB 135319ea8026Sopenharmony_ci|| | | | 8 KiB 135419ea8026Sopenharmony_ci|| | | | 135519ea8026Sopenharmony_ci|| | | | 135619ea8026Sopenharmony_ci|| | | | 135719ea8026Sopenharmony_ci|'----------------' | | 135819ea8026Sopenharmony_ci'----------------' | / 135919ea8026Sopenharmony_ci .--------' 136019ea8026Sopenharmony_ci v 136119ea8026Sopenharmony_ci .----------------. \ \ 136219ea8026Sopenharmony_ci | data | +- data | 136319ea8026Sopenharmony_ci |----------------| / 4 bytes | 136419ea8026Sopenharmony_ci | | | 136519ea8026Sopenharmony_ci | | | 136619ea8026Sopenharmony_ci | | | 136719ea8026Sopenharmony_ci | | +- data block 136819ea8026Sopenharmony_ci | | | 4 KiB 136919ea8026Sopenharmony_ci | | | 137019ea8026Sopenharmony_ci | | | 137119ea8026Sopenharmony_ci | | | 137219ea8026Sopenharmony_ci | | | 137319ea8026Sopenharmony_ci | | | 137419ea8026Sopenharmony_ci '----------------' / 137519ea8026Sopenharmony_ci``` 137619ea8026Sopenharmony_ci 137719ea8026Sopenharmony_ciWe can make several improvements. First, instead of giving each file its own 137819ea8026Sopenharmony_cimetadata pair, we can store multiple files in a single metadata pair. One way 137919ea8026Sopenharmony_cito do this is to directly associate a directory with a metadata pair (or a 138019ea8026Sopenharmony_cilinked list of metadata pairs). This makes it easy for multiple files to share 138119ea8026Sopenharmony_cithe directory's metadata pair for logging and reduces the collective storage 138219ea8026Sopenharmony_cioverhead. 138319ea8026Sopenharmony_ci 138419ea8026Sopenharmony_ciThe strict binding of metadata pairs and directories also gives users 138519ea8026Sopenharmony_cidirect control over storage utilization depending on how they organize their 138619ea8026Sopenharmony_cidirectories. 138719ea8026Sopenharmony_ci 138819ea8026Sopenharmony_ci``` 138919ea8026Sopenharmony_cimultiple files stored in metadata pair, 4 bytes costs ~4 KiB 139019ea8026Sopenharmony_ci 139119ea8026Sopenharmony_ci .----------------. 139219ea8026Sopenharmony_ci .| revision | 139319ea8026Sopenharmony_ci ||----------------| 139419ea8026Sopenharmony_ci || A name | 139519ea8026Sopenharmony_ci || A skiplist -----. 139619ea8026Sopenharmony_ci ||----------------| | \ 139719ea8026Sopenharmony_ci || B name | | +- metadata 139819ea8026Sopenharmony_ci || B skiplist ---. | | 4x8 bytes 139919ea8026Sopenharmony_ci ||----------------| | | / 32 bytes 140019ea8026Sopenharmony_ci || checksum | | | 140119ea8026Sopenharmony_ci ||----------------| | | 140219ea8026Sopenharmony_ci || | | | | 140319ea8026Sopenharmony_ci || v | | | 140419ea8026Sopenharmony_ci |'----------------' | | 140519ea8026Sopenharmony_ci '----------------' | | 140619ea8026Sopenharmony_ci .----------------' | 140719ea8026Sopenharmony_ci v v 140819ea8026Sopenharmony_ci.----------------. .----------------. \ \ 140919ea8026Sopenharmony_ci| A data | | B data | +- data | 141019ea8026Sopenharmony_ci| | |----------------| / 4 bytes | 141119ea8026Sopenharmony_ci| | | | | 141219ea8026Sopenharmony_ci| | | | | 141319ea8026Sopenharmony_ci| | | | | 141419ea8026Sopenharmony_ci| | | | + data block 141519ea8026Sopenharmony_ci| | | | | 4 KiB 141619ea8026Sopenharmony_ci| | | | | 141719ea8026Sopenharmony_ci|----------------| | | | 141819ea8026Sopenharmony_ci| | | | | 141919ea8026Sopenharmony_ci| | | | | 142019ea8026Sopenharmony_ci| | | | | 142119ea8026Sopenharmony_ci'----------------' '----------------' / 142219ea8026Sopenharmony_ci``` 142319ea8026Sopenharmony_ci 142419ea8026Sopenharmony_ciThe second improvement we can make is noticing that for very small files, our 142519ea8026Sopenharmony_ciattempts to use CTZ skip-lists for compact storage backfires. Metadata pairs 142619ea8026Sopenharmony_cihave a ~4x storage cost, so if our file is smaller than 1/4 the block size, 142719ea8026Sopenharmony_cithere's actually no benefit in storing our file outside of our metadata pair. 142819ea8026Sopenharmony_ci 142919ea8026Sopenharmony_ciIn this case, we can store the file directly in our directory's metadata pair. 143019ea8026Sopenharmony_ciWe call this an inline file, and it allows a directory to store many small 143119ea8026Sopenharmony_cifiles quite efficiently. Our previous 4 byte file now only takes up a 143219ea8026Sopenharmony_citheoretical 16 bytes on disk. 143319ea8026Sopenharmony_ci 143419ea8026Sopenharmony_ci``` 143519ea8026Sopenharmony_ciinline files stored in metadata pair, 4 bytes costs ~16 bytes 143619ea8026Sopenharmony_ci 143719ea8026Sopenharmony_ci .----------------. 143819ea8026Sopenharmony_ci.| revision | 143919ea8026Sopenharmony_ci||----------------| 144019ea8026Sopenharmony_ci|| A name | 144119ea8026Sopenharmony_ci|| A skiplist ---. 144219ea8026Sopenharmony_ci||----------------| | \ 144319ea8026Sopenharmony_ci|| B name | | +- data 144419ea8026Sopenharmony_ci|| B data | | | 4x4 bytes 144519ea8026Sopenharmony_ci||----------------| | / 16 bytes 144619ea8026Sopenharmony_ci|| checksum | | 144719ea8026Sopenharmony_ci||----------------| | 144819ea8026Sopenharmony_ci|| | | | 144919ea8026Sopenharmony_ci|| v | | 145019ea8026Sopenharmony_ci|'----------------' | 145119ea8026Sopenharmony_ci'----------------' | 145219ea8026Sopenharmony_ci .---------' 145319ea8026Sopenharmony_ci v 145419ea8026Sopenharmony_ci .----------------. 145519ea8026Sopenharmony_ci | A data | 145619ea8026Sopenharmony_ci | | 145719ea8026Sopenharmony_ci | | 145819ea8026Sopenharmony_ci | | 145919ea8026Sopenharmony_ci | | 146019ea8026Sopenharmony_ci | | 146119ea8026Sopenharmony_ci | | 146219ea8026Sopenharmony_ci | | 146319ea8026Sopenharmony_ci |----------------| 146419ea8026Sopenharmony_ci | | 146519ea8026Sopenharmony_ci | | 146619ea8026Sopenharmony_ci | | 146719ea8026Sopenharmony_ci '----------------' 146819ea8026Sopenharmony_ci``` 146919ea8026Sopenharmony_ci 147019ea8026Sopenharmony_ciOnce the file exceeds 1/4 the block size, we switch to a CTZ skip-list. This 147119ea8026Sopenharmony_cimeans that our files never use more than 4x storage overhead, decreasing as 147219ea8026Sopenharmony_cithe file grows in size. 147319ea8026Sopenharmony_ci 147419ea8026Sopenharmony_ci![File storage cost graph][file-cost-graph] 147519ea8026Sopenharmony_ci 147619ea8026Sopenharmony_ci## Directories 147719ea8026Sopenharmony_ci 147819ea8026Sopenharmony_ciNow we just need directories to store our files. As mentioned above we want 147919ea8026Sopenharmony_cia strict binding of directories and metadata pairs, but there are a few 148019ea8026Sopenharmony_cicomplications we need to sort out. 148119ea8026Sopenharmony_ci 148219ea8026Sopenharmony_ciOn their own, each directory is a linked-list of metadata pairs. This lets us 148319ea8026Sopenharmony_cistore an unlimited number of files in each directory, and we don't need to 148419ea8026Sopenharmony_ciworry about the runtime complexity of unbounded logs. We can store other 148519ea8026Sopenharmony_cidirectory pointers in our metadata pairs, which gives us a directory tree, much 148619ea8026Sopenharmony_cilike what you find on other filesystems. 148719ea8026Sopenharmony_ci 148819ea8026Sopenharmony_ci``` 148919ea8026Sopenharmony_ci .--------. 149019ea8026Sopenharmony_ci .| root | 149119ea8026Sopenharmony_ci || | 149219ea8026Sopenharmony_ci || | 149319ea8026Sopenharmony_ci |'--------' 149419ea8026Sopenharmony_ci '---|--|-' 149519ea8026Sopenharmony_ci .-' '-------------------------. 149619ea8026Sopenharmony_ci v v 149719ea8026Sopenharmony_ci .--------. .--------. .--------. 149819ea8026Sopenharmony_ci .| dir A |------->| dir A | .| dir B | 149919ea8026Sopenharmony_ci || | || | || | 150019ea8026Sopenharmony_ci || | || | || | 150119ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 150219ea8026Sopenharmony_ci '---|--|-' '----|---' '---|--|-' 150319ea8026Sopenharmony_ci .-' '-. | .-' '-. 150419ea8026Sopenharmony_ci v v v v v 150519ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. 150619ea8026Sopenharmony_ci| file C | | file D | | file E | | file F | | file G | 150719ea8026Sopenharmony_ci| | | | | | | | | | 150819ea8026Sopenharmony_ci| | | | | | | | | | 150919ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' 151019ea8026Sopenharmony_ci``` 151119ea8026Sopenharmony_ci 151219ea8026Sopenharmony_ciThe main complication is, once again, traversal with a constant amount of 151319ea8026Sopenharmony_ci[RAM]. The directory tree is a tree, and the unfortunate fact is you can't 151419ea8026Sopenharmony_citraverse a tree with constant RAM. 151519ea8026Sopenharmony_ci 151619ea8026Sopenharmony_ciFortunately, the elements of our tree are metadata pairs, so unlike CTZ 151719ea8026Sopenharmony_ciskip-lists, we're not limited to strict COW operations. One thing we can do is 151819ea8026Sopenharmony_cithread a linked-list through our tree, explicitly enabling cheap traversal 151919ea8026Sopenharmony_ciover the entire filesystem. 152019ea8026Sopenharmony_ci 152119ea8026Sopenharmony_ci``` 152219ea8026Sopenharmony_ci .--------. 152319ea8026Sopenharmony_ci .| root |-. 152419ea8026Sopenharmony_ci || | | 152519ea8026Sopenharmony_ci .-------|| |-' 152619ea8026Sopenharmony_ci | |'--------' 152719ea8026Sopenharmony_ci | '---|--|-' 152819ea8026Sopenharmony_ci | .-' '-------------------------. 152919ea8026Sopenharmony_ci | v v 153019ea8026Sopenharmony_ci | .--------. .--------. .--------. 153119ea8026Sopenharmony_ci '->| dir A |------->| dir A |------->| dir B | 153219ea8026Sopenharmony_ci || | || | || | 153319ea8026Sopenharmony_ci || | || | || | 153419ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 153519ea8026Sopenharmony_ci '---|--|-' '----|---' '---|--|-' 153619ea8026Sopenharmony_ci .-' '-. | .-' '-. 153719ea8026Sopenharmony_ci v v v v v 153819ea8026Sopenharmony_ci.--------. .--------. .--------. .--------. .--------. 153919ea8026Sopenharmony_ci| file C | | file D | | file E | | file F | | file G | 154019ea8026Sopenharmony_ci| | | | | | | | | | 154119ea8026Sopenharmony_ci| | | | | | | | | | 154219ea8026Sopenharmony_ci'--------' '--------' '--------' '--------' '--------' 154319ea8026Sopenharmony_ci``` 154419ea8026Sopenharmony_ci 154519ea8026Sopenharmony_ciUnfortunately, not sticking to pure COW operations creates some problems. Now, 154619ea8026Sopenharmony_ciwhenever we want to manipulate the directory tree, multiple pointers need to be 154719ea8026Sopenharmony_ciupdated. If you're familiar with designing atomic data structures this should 154819ea8026Sopenharmony_ciset off a bunch of red flags. 154919ea8026Sopenharmony_ci 155019ea8026Sopenharmony_ciTo work around this, our threaded linked-list has a bit of leeway. Instead of 155119ea8026Sopenharmony_cionly containing metadata pairs found in our filesystem, it is allowed to 155219ea8026Sopenharmony_cicontain metadata pairs that have no parent because of a power loss. These are 155319ea8026Sopenharmony_cicalled orphaned metadata pairs. 155419ea8026Sopenharmony_ci 155519ea8026Sopenharmony_ciWith the possibility of orphans, we can build power loss resilient operations 155619ea8026Sopenharmony_cithat maintain a filesystem tree threaded with a linked-list for traversal. 155719ea8026Sopenharmony_ci 155819ea8026Sopenharmony_ciAdding a directory to our tree: 155919ea8026Sopenharmony_ci 156019ea8026Sopenharmony_ci``` 156119ea8026Sopenharmony_ci .--------. 156219ea8026Sopenharmony_ci .| root |-. 156319ea8026Sopenharmony_ci || | | 156419ea8026Sopenharmony_ci.-------|| |-' 156519ea8026Sopenharmony_ci| |'--------' 156619ea8026Sopenharmony_ci| '---|--|-' 156719ea8026Sopenharmony_ci| .-' '-. 156819ea8026Sopenharmony_ci| v v 156919ea8026Sopenharmony_ci| .--------. .--------. 157019ea8026Sopenharmony_ci'->| dir A |->| dir C | 157119ea8026Sopenharmony_ci || | || | 157219ea8026Sopenharmony_ci || | || | 157319ea8026Sopenharmony_ci |'--------' |'--------' 157419ea8026Sopenharmony_ci '--------' '--------' 157519ea8026Sopenharmony_ci 157619ea8026Sopenharmony_ciallocate dir B 157719ea8026Sopenharmony_ci=> 157819ea8026Sopenharmony_ci .--------. 157919ea8026Sopenharmony_ci .| root |-. 158019ea8026Sopenharmony_ci || | | 158119ea8026Sopenharmony_ci.-------|| |-' 158219ea8026Sopenharmony_ci| |'--------' 158319ea8026Sopenharmony_ci| '---|--|-' 158419ea8026Sopenharmony_ci| .-' '-. 158519ea8026Sopenharmony_ci| v v 158619ea8026Sopenharmony_ci| .--------. .--------. 158719ea8026Sopenharmony_ci'->| dir A |--->| dir C | 158819ea8026Sopenharmony_ci || | .->| | 158919ea8026Sopenharmony_ci || | | || | 159019ea8026Sopenharmony_ci |'--------' | |'--------' 159119ea8026Sopenharmony_ci '--------' | '--------' 159219ea8026Sopenharmony_ci | 159319ea8026Sopenharmony_ci .--------. | 159419ea8026Sopenharmony_ci .| dir B |-' 159519ea8026Sopenharmony_ci || | 159619ea8026Sopenharmony_ci || | 159719ea8026Sopenharmony_ci |'--------' 159819ea8026Sopenharmony_ci '--------' 159919ea8026Sopenharmony_ci 160019ea8026Sopenharmony_ciinsert dir B into threaded linked-list, creating an orphan 160119ea8026Sopenharmony_ci=> 160219ea8026Sopenharmony_ci .--------. 160319ea8026Sopenharmony_ci .| root |-. 160419ea8026Sopenharmony_ci || | | 160519ea8026Sopenharmony_ci.-------|| |-' 160619ea8026Sopenharmony_ci| |'--------' 160719ea8026Sopenharmony_ci| '---|--|-' 160819ea8026Sopenharmony_ci| .-' '-------------. 160919ea8026Sopenharmony_ci| v v 161019ea8026Sopenharmony_ci| .--------. .--------. .--------. 161119ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 161219ea8026Sopenharmony_ci || | || orphan!| || | 161319ea8026Sopenharmony_ci || | || | || | 161419ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 161519ea8026Sopenharmony_ci '--------' '--------' '--------' 161619ea8026Sopenharmony_ci 161719ea8026Sopenharmony_ciadd dir B to parent directory 161819ea8026Sopenharmony_ci=> 161919ea8026Sopenharmony_ci .--------. 162019ea8026Sopenharmony_ci .| root |-. 162119ea8026Sopenharmony_ci || | | 162219ea8026Sopenharmony_ci.-------------|| |-' 162319ea8026Sopenharmony_ci| |'--------' 162419ea8026Sopenharmony_ci| '--|-|-|-' 162519ea8026Sopenharmony_ci| .------' | '-------. 162619ea8026Sopenharmony_ci| v v v 162719ea8026Sopenharmony_ci| .--------. .--------. .--------. 162819ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 162919ea8026Sopenharmony_ci || | || | || | 163019ea8026Sopenharmony_ci || | || | || | 163119ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 163219ea8026Sopenharmony_ci '--------' '--------' '--------' 163319ea8026Sopenharmony_ci``` 163419ea8026Sopenharmony_ci 163519ea8026Sopenharmony_ciRemoving a directory: 163619ea8026Sopenharmony_ci 163719ea8026Sopenharmony_ci``` 163819ea8026Sopenharmony_ci .--------. 163919ea8026Sopenharmony_ci .| root |-. 164019ea8026Sopenharmony_ci || | | 164119ea8026Sopenharmony_ci.-------------|| |-' 164219ea8026Sopenharmony_ci| |'--------' 164319ea8026Sopenharmony_ci| '--|-|-|-' 164419ea8026Sopenharmony_ci| .------' | '-------. 164519ea8026Sopenharmony_ci| v v v 164619ea8026Sopenharmony_ci| .--------. .--------. .--------. 164719ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 164819ea8026Sopenharmony_ci || | || | || | 164919ea8026Sopenharmony_ci || | || | || | 165019ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 165119ea8026Sopenharmony_ci '--------' '--------' '--------' 165219ea8026Sopenharmony_ci 165319ea8026Sopenharmony_ciremove dir B from parent directory, creating an orphan 165419ea8026Sopenharmony_ci=> 165519ea8026Sopenharmony_ci .--------. 165619ea8026Sopenharmony_ci .| root |-. 165719ea8026Sopenharmony_ci || | | 165819ea8026Sopenharmony_ci.-------|| |-' 165919ea8026Sopenharmony_ci| |'--------' 166019ea8026Sopenharmony_ci| '---|--|-' 166119ea8026Sopenharmony_ci| .-' '-------------. 166219ea8026Sopenharmony_ci| v v 166319ea8026Sopenharmony_ci| .--------. .--------. .--------. 166419ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 166519ea8026Sopenharmony_ci || | || orphan!| || | 166619ea8026Sopenharmony_ci || | || | || | 166719ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 166819ea8026Sopenharmony_ci '--------' '--------' '--------' 166919ea8026Sopenharmony_ci 167019ea8026Sopenharmony_ciremove dir B from threaded linked-list, returning dir B to free blocks 167119ea8026Sopenharmony_ci=> 167219ea8026Sopenharmony_ci .--------. 167319ea8026Sopenharmony_ci .| root |-. 167419ea8026Sopenharmony_ci || | | 167519ea8026Sopenharmony_ci.-------|| |-' 167619ea8026Sopenharmony_ci| |'--------' 167719ea8026Sopenharmony_ci| '---|--|-' 167819ea8026Sopenharmony_ci| .-' '-. 167919ea8026Sopenharmony_ci| v v 168019ea8026Sopenharmony_ci| .--------. .--------. 168119ea8026Sopenharmony_ci'->| dir A |->| dir C | 168219ea8026Sopenharmony_ci || | || | 168319ea8026Sopenharmony_ci || | || | 168419ea8026Sopenharmony_ci |'--------' |'--------' 168519ea8026Sopenharmony_ci '--------' '--------' 168619ea8026Sopenharmony_ci``` 168719ea8026Sopenharmony_ci 168819ea8026Sopenharmony_ciIn addition to normal directory tree operations, we can use orphans to evict 168919ea8026Sopenharmony_ciblocks in a metadata pair when the block goes bad or exceeds its allocated 169019ea8026Sopenharmony_cierases. If we lose power while evicting a metadata block we may end up with 169119ea8026Sopenharmony_cia situation where the filesystem references the replacement block while the 169219ea8026Sopenharmony_cithreaded linked-list still contains the evicted block. We call this a 169319ea8026Sopenharmony_cihalf-orphan. 169419ea8026Sopenharmony_ci 169519ea8026Sopenharmony_ci``` 169619ea8026Sopenharmony_ci .--------. 169719ea8026Sopenharmony_ci .| root |-. 169819ea8026Sopenharmony_ci || | | 169919ea8026Sopenharmony_ci.-------------|| |-' 170019ea8026Sopenharmony_ci| |'--------' 170119ea8026Sopenharmony_ci| '--|-|-|-' 170219ea8026Sopenharmony_ci| .------' | '-------. 170319ea8026Sopenharmony_ci| v v v 170419ea8026Sopenharmony_ci| .--------. .--------. .--------. 170519ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 170619ea8026Sopenharmony_ci || | || | || | 170719ea8026Sopenharmony_ci || | || | || | 170819ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 170919ea8026Sopenharmony_ci '--------' '--------' '--------' 171019ea8026Sopenharmony_ci 171119ea8026Sopenharmony_citry to write to dir B 171219ea8026Sopenharmony_ci=> 171319ea8026Sopenharmony_ci .--------. 171419ea8026Sopenharmony_ci .| root |-. 171519ea8026Sopenharmony_ci || | | 171619ea8026Sopenharmony_ci.----------------|| |-' 171719ea8026Sopenharmony_ci| |'--------' 171819ea8026Sopenharmony_ci| '-|-||-|-' 171919ea8026Sopenharmony_ci| .--------' || '-----. 172019ea8026Sopenharmony_ci| v |v v 172119ea8026Sopenharmony_ci| .--------. .--------. .--------. 172219ea8026Sopenharmony_ci'->| dir A |---->| dir B |->| dir C | 172319ea8026Sopenharmony_ci || |-. | | || | 172419ea8026Sopenharmony_ci || | | | | || | 172519ea8026Sopenharmony_ci |'--------' | '--------' |'--------' 172619ea8026Sopenharmony_ci '--------' | v '--------' 172719ea8026Sopenharmony_ci | .--------. 172819ea8026Sopenharmony_ci '->| dir B | 172919ea8026Sopenharmony_ci | bad | 173019ea8026Sopenharmony_ci | block! | 173119ea8026Sopenharmony_ci '--------' 173219ea8026Sopenharmony_ci 173319ea8026Sopenharmony_cioh no! bad block detected, allocate replacement 173419ea8026Sopenharmony_ci=> 173519ea8026Sopenharmony_ci .--------. 173619ea8026Sopenharmony_ci .| root |-. 173719ea8026Sopenharmony_ci || | | 173819ea8026Sopenharmony_ci.----------------|| |-' 173919ea8026Sopenharmony_ci| |'--------' 174019ea8026Sopenharmony_ci| '-|-||-|-' 174119ea8026Sopenharmony_ci| .--------' || '-------. 174219ea8026Sopenharmony_ci| v |v v 174319ea8026Sopenharmony_ci| .--------. .--------. .--------. 174419ea8026Sopenharmony_ci'->| dir A |---->| dir B |--->| dir C | 174519ea8026Sopenharmony_ci || |-. | | .->| | 174619ea8026Sopenharmony_ci || | | | | | || | 174719ea8026Sopenharmony_ci |'--------' | '--------' | |'--------' 174819ea8026Sopenharmony_ci '--------' | v | '--------' 174919ea8026Sopenharmony_ci | .--------. | 175019ea8026Sopenharmony_ci '->| dir B | | 175119ea8026Sopenharmony_ci | bad | | 175219ea8026Sopenharmony_ci | block! | | 175319ea8026Sopenharmony_ci '--------' | 175419ea8026Sopenharmony_ci | 175519ea8026Sopenharmony_ci .--------. | 175619ea8026Sopenharmony_ci | dir B |--' 175719ea8026Sopenharmony_ci | | 175819ea8026Sopenharmony_ci | | 175919ea8026Sopenharmony_ci '--------' 176019ea8026Sopenharmony_ci 176119ea8026Sopenharmony_ciinsert replacement in threaded linked-list, creating a half-orphan 176219ea8026Sopenharmony_ci=> 176319ea8026Sopenharmony_ci .--------. 176419ea8026Sopenharmony_ci .| root |-. 176519ea8026Sopenharmony_ci || | | 176619ea8026Sopenharmony_ci.----------------|| |-' 176719ea8026Sopenharmony_ci| |'--------' 176819ea8026Sopenharmony_ci| '-|-||-|-' 176919ea8026Sopenharmony_ci| .--------' || '-------. 177019ea8026Sopenharmony_ci| v |v v 177119ea8026Sopenharmony_ci| .--------. .--------. .--------. 177219ea8026Sopenharmony_ci'->| dir A |---->| dir B |--->| dir C | 177319ea8026Sopenharmony_ci || |-. | | .->| | 177419ea8026Sopenharmony_ci || | | | | | || | 177519ea8026Sopenharmony_ci |'--------' | '--------' | |'--------' 177619ea8026Sopenharmony_ci '--------' | v | '--------' 177719ea8026Sopenharmony_ci | .--------. | 177819ea8026Sopenharmony_ci | | dir B | | 177919ea8026Sopenharmony_ci | | bad | | 178019ea8026Sopenharmony_ci | | block! | | 178119ea8026Sopenharmony_ci | '--------' | 178219ea8026Sopenharmony_ci | | 178319ea8026Sopenharmony_ci | .--------. | 178419ea8026Sopenharmony_ci '->| dir B |--' 178519ea8026Sopenharmony_ci | half | 178619ea8026Sopenharmony_ci | orphan!| 178719ea8026Sopenharmony_ci '--------' 178819ea8026Sopenharmony_ci 178919ea8026Sopenharmony_cifix reference in parent directory 179019ea8026Sopenharmony_ci=> 179119ea8026Sopenharmony_ci .--------. 179219ea8026Sopenharmony_ci .| root |-. 179319ea8026Sopenharmony_ci || | | 179419ea8026Sopenharmony_ci.-------------|| |-' 179519ea8026Sopenharmony_ci| |'--------' 179619ea8026Sopenharmony_ci| '--|-|-|-' 179719ea8026Sopenharmony_ci| .------' | '-------. 179819ea8026Sopenharmony_ci| v v v 179919ea8026Sopenharmony_ci| .--------. .--------. .--------. 180019ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 180119ea8026Sopenharmony_ci || | || | || | 180219ea8026Sopenharmony_ci || | || | || | 180319ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 180419ea8026Sopenharmony_ci '--------' '--------' '--------' 180519ea8026Sopenharmony_ci``` 180619ea8026Sopenharmony_ci 180719ea8026Sopenharmony_ciFinding orphans and half-orphans is expensive, requiring a _O(n²)_ 180819ea8026Sopenharmony_cicomparison of every metadata pair with every directory entry. But the tradeoff 180919ea8026Sopenharmony_ciis a power resilient filesystem that works with only a bounded amount of RAM. 181019ea8026Sopenharmony_ciFortunately, we only need to check for orphans on the first allocation after 181119ea8026Sopenharmony_ciboot, and a read-only littlefs can ignore the threaded linked-list entirely. 181219ea8026Sopenharmony_ci 181319ea8026Sopenharmony_ciIf we only had some sort of global state, then we could also store a flag and 181419ea8026Sopenharmony_ciavoid searching for orphans unless we knew we were specifically interrupted 181519ea8026Sopenharmony_ciwhile manipulating the directory tree (foreshadowing!). 181619ea8026Sopenharmony_ci 181719ea8026Sopenharmony_ci## The move problem 181819ea8026Sopenharmony_ci 181919ea8026Sopenharmony_ciWe have one last challenge: the move problem. Phrasing the problem is simple: 182019ea8026Sopenharmony_ci 182119ea8026Sopenharmony_ciHow do you atomically move a file between two directories? 182219ea8026Sopenharmony_ci 182319ea8026Sopenharmony_ciIn littlefs we can atomically commit to directories, but we can't create 182419ea8026Sopenharmony_cian atomic commit that spans multiple directories. The filesystem must go 182519ea8026Sopenharmony_cithrough a minimum of two distinct states to complete a move. 182619ea8026Sopenharmony_ci 182719ea8026Sopenharmony_ciTo make matters worse, file moves are a common form of synchronization for 182819ea8026Sopenharmony_cifilesystems. As a filesystem designed for power-loss, it's important we get 182919ea8026Sopenharmony_ciatomic moves right. 183019ea8026Sopenharmony_ci 183119ea8026Sopenharmony_ciSo what can we do? 183219ea8026Sopenharmony_ci 183319ea8026Sopenharmony_ci- We definitely can't just let power-loss result in duplicated or lost files. 183419ea8026Sopenharmony_ci This could easily break users' code and would only reveal itself in extreme 183519ea8026Sopenharmony_ci cases. We were only able to be lazy about the threaded linked-list because 183619ea8026Sopenharmony_ci it isn't user facing and we can handle the corner cases internally. 183719ea8026Sopenharmony_ci 183819ea8026Sopenharmony_ci- Some filesystems propagate COW operations up the tree until a common parent 183919ea8026Sopenharmony_ci is found. Unfortunately this interacts poorly with our threaded tree and 184019ea8026Sopenharmony_ci brings back the issue of upward propagation of wear. 184119ea8026Sopenharmony_ci 184219ea8026Sopenharmony_ci- In a previous version of littlefs we tried to solve this problem by going 184319ea8026Sopenharmony_ci back and forth between the source and destination, marking and unmarking the 184419ea8026Sopenharmony_ci file as moving in order to make the move atomic from the user perspective. 184519ea8026Sopenharmony_ci This worked, but not well. Finding failed moves was expensive and required 184619ea8026Sopenharmony_ci a unique identifier for each file. 184719ea8026Sopenharmony_ci 184819ea8026Sopenharmony_ciIn the end, solving the move problem required creating a new mechanism for 184919ea8026Sopenharmony_cisharing knowledge between multiple metadata pairs. In littlefs this led to the 185019ea8026Sopenharmony_ciintroduction of a mechanism called "global state". 185119ea8026Sopenharmony_ci 185219ea8026Sopenharmony_ci--- 185319ea8026Sopenharmony_ci 185419ea8026Sopenharmony_ciGlobal state is a small set of state that can be updated from _any_ metadata 185519ea8026Sopenharmony_cipair. Combining global state with metadata pairs' ability to update multiple 185619ea8026Sopenharmony_cientries in one commit gives us a powerful tool for crafting complex atomic 185719ea8026Sopenharmony_cioperations. 185819ea8026Sopenharmony_ci 185919ea8026Sopenharmony_ciHow does global state work? 186019ea8026Sopenharmony_ci 186119ea8026Sopenharmony_ciGlobal state exists as a set of deltas that are distributed across the metadata 186219ea8026Sopenharmony_cipairs in the filesystem. The actual global state can be built out of these 186319ea8026Sopenharmony_cideltas by xoring together all of the deltas in the filesystem. 186419ea8026Sopenharmony_ci 186519ea8026Sopenharmony_ci``` 186619ea8026Sopenharmony_ci .--------. .--------. .--------. .--------. .--------. 186719ea8026Sopenharmony_ci.| |->| gdelta |->| |->| gdelta |->| gdelta | 186819ea8026Sopenharmony_ci|| | || 0x23 | || | || 0xff | || 0xce | 186919ea8026Sopenharmony_ci|| | || | || | || | || | 187019ea8026Sopenharmony_ci|'--------' |'--------' |'--------' |'--------' |'--------' 187119ea8026Sopenharmony_ci'--------' '----|---' '--------' '----|---' '----|---' 187219ea8026Sopenharmony_ci v v v 187319ea8026Sopenharmony_ci 0x00 --> xor ------------------> xor ------> xor --> gstate 0x12 187419ea8026Sopenharmony_ci``` 187519ea8026Sopenharmony_ci 187619ea8026Sopenharmony_ciTo update the global state from a metadata pair, we take the global state we 187719ea8026Sopenharmony_ciknow and xor it with both our changes and any existing delta in the metadata 187819ea8026Sopenharmony_cipair. Committing this new delta to the metadata pair commits the changes to 187919ea8026Sopenharmony_cithe filesystem's global state. 188019ea8026Sopenharmony_ci 188119ea8026Sopenharmony_ci``` 188219ea8026Sopenharmony_ci .--------. .--------. .--------. .--------. .--------. 188319ea8026Sopenharmony_ci.| |->| gdelta |->| |->| gdelta |->| gdelta | 188419ea8026Sopenharmony_ci|| | || 0x23 | || | || 0xff | || 0xce | 188519ea8026Sopenharmony_ci|| | || | || | || | || | 188619ea8026Sopenharmony_ci|'--------' |'--------' |'--------' |'--------' |'--------' 188719ea8026Sopenharmony_ci'--------' '----|---' '--------' '--|---|-' '----|---' 188819ea8026Sopenharmony_ci v v | v 188919ea8026Sopenharmony_ci 0x00 --> xor ----------------> xor -|------> xor --> gstate = 0x12 189019ea8026Sopenharmony_ci | | 189119ea8026Sopenharmony_ci | | 189219ea8026Sopenharmony_cichange gstate to 0xab --> xor <------------|--------------------------' 189319ea8026Sopenharmony_ci=> | v 189419ea8026Sopenharmony_ci '------------> xor 189519ea8026Sopenharmony_ci | 189619ea8026Sopenharmony_ci v 189719ea8026Sopenharmony_ci .--------. .--------. .--------. .--------. .--------. 189819ea8026Sopenharmony_ci.| |->| gdelta |->| |->| gdelta |->| gdelta | 189919ea8026Sopenharmony_ci|| | || 0x23 | || | || 0x46 | || 0xce | 190019ea8026Sopenharmony_ci|| | || | || | || | || | 190119ea8026Sopenharmony_ci|'--------' |'--------' |'--------' |'--------' |'--------' 190219ea8026Sopenharmony_ci'--------' '----|---' '--------' '----|---' '----|---' 190319ea8026Sopenharmony_ci v v v 190419ea8026Sopenharmony_ci 0x00 --> xor ------------------> xor ------> xor --> gstate = 0xab 190519ea8026Sopenharmony_ci``` 190619ea8026Sopenharmony_ci 190719ea8026Sopenharmony_ciTo make this efficient, we always keep a copy of the global state in RAM. We 190819ea8026Sopenharmony_cionly need to iterate over our metadata pairs and build the global state when 190919ea8026Sopenharmony_cithe filesystem is mounted. 191019ea8026Sopenharmony_ci 191119ea8026Sopenharmony_ciYou may have noticed that global state is very expensive. We keep a copy in 191219ea8026Sopenharmony_ciRAM and a delta in an unbounded number of metadata pairs. Even if we reset 191319ea8026Sopenharmony_cithe global state to its initial value, we can't easily clean up the deltas on 191419ea8026Sopenharmony_cidisk. For this reason, it's very important that we keep the size of global 191519ea8026Sopenharmony_cistate bounded and extremely small. But, even with a strict budget, global 191619ea8026Sopenharmony_cistate is incredibly valuable. 191719ea8026Sopenharmony_ci 191819ea8026Sopenharmony_ci--- 191919ea8026Sopenharmony_ci 192019ea8026Sopenharmony_ciNow we can solve the move problem. We can create global state describing our 192119ea8026Sopenharmony_cimove atomically with the creation of the new file, and we can clear this move 192219ea8026Sopenharmony_cistate atomically with the removal of the old file. 192319ea8026Sopenharmony_ci 192419ea8026Sopenharmony_ci``` 192519ea8026Sopenharmony_ci .--------. gstate = no move 192619ea8026Sopenharmony_ci .| root |-. 192719ea8026Sopenharmony_ci || | | 192819ea8026Sopenharmony_ci.-------------|| |-' 192919ea8026Sopenharmony_ci| |'--------' 193019ea8026Sopenharmony_ci| '--|-|-|-' 193119ea8026Sopenharmony_ci| .------' | '-------. 193219ea8026Sopenharmony_ci| v v v 193319ea8026Sopenharmony_ci| .--------. .--------. .--------. 193419ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 193519ea8026Sopenharmony_ci || | || | || | 193619ea8026Sopenharmony_ci || | || | || | 193719ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 193819ea8026Sopenharmony_ci '----|---' '--------' '--------' 193919ea8026Sopenharmony_ci v 194019ea8026Sopenharmony_ci .--------. 194119ea8026Sopenharmony_ci | file D | 194219ea8026Sopenharmony_ci | | 194319ea8026Sopenharmony_ci | | 194419ea8026Sopenharmony_ci '--------' 194519ea8026Sopenharmony_ci 194619ea8026Sopenharmony_cibegin move, add reference in dir C, change gstate to have move 194719ea8026Sopenharmony_ci=> 194819ea8026Sopenharmony_ci .--------. gstate = moving file D in dir A (m1) 194919ea8026Sopenharmony_ci .| root |-. 195019ea8026Sopenharmony_ci || | | 195119ea8026Sopenharmony_ci.-------------|| |-' 195219ea8026Sopenharmony_ci| |'--------' 195319ea8026Sopenharmony_ci| '--|-|-|-' 195419ea8026Sopenharmony_ci| .------' | '-------. 195519ea8026Sopenharmony_ci| v v v 195619ea8026Sopenharmony_ci| .--------. .--------. .--------. 195719ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 195819ea8026Sopenharmony_ci || | || | || gdelta | 195919ea8026Sopenharmony_ci || | || | || =m1 | 196019ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 196119ea8026Sopenharmony_ci '----|---' '--------' '----|---' 196219ea8026Sopenharmony_ci | .----------------' 196319ea8026Sopenharmony_ci v v 196419ea8026Sopenharmony_ci .--------. 196519ea8026Sopenharmony_ci | file D | 196619ea8026Sopenharmony_ci | | 196719ea8026Sopenharmony_ci | | 196819ea8026Sopenharmony_ci '--------' 196919ea8026Sopenharmony_ci 197019ea8026Sopenharmony_cicomplete move, remove reference in dir A, change gstate to no move 197119ea8026Sopenharmony_ci=> 197219ea8026Sopenharmony_ci .--------. gstate = no move (m1^~m1) 197319ea8026Sopenharmony_ci .| root |-. 197419ea8026Sopenharmony_ci || | | 197519ea8026Sopenharmony_ci.-------------|| |-' 197619ea8026Sopenharmony_ci| |'--------' 197719ea8026Sopenharmony_ci| '--|-|-|-' 197819ea8026Sopenharmony_ci| .------' | '-------. 197919ea8026Sopenharmony_ci| v v v 198019ea8026Sopenharmony_ci| .--------. .--------. .--------. 198119ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 198219ea8026Sopenharmony_ci || gdelta | || | || gdelta | 198319ea8026Sopenharmony_ci || =~m1 | || | || =m1 | 198419ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 198519ea8026Sopenharmony_ci '--------' '--------' '----|---' 198619ea8026Sopenharmony_ci v 198719ea8026Sopenharmony_ci .--------. 198819ea8026Sopenharmony_ci | file D | 198919ea8026Sopenharmony_ci | | 199019ea8026Sopenharmony_ci | | 199119ea8026Sopenharmony_ci '--------' 199219ea8026Sopenharmony_ci``` 199319ea8026Sopenharmony_ci 199419ea8026Sopenharmony_ci 199519ea8026Sopenharmony_ciIf, after building our global state during mount, we find information 199619ea8026Sopenharmony_cidescribing an ongoing move, we know we lost power during a move and the file 199719ea8026Sopenharmony_ciis duplicated in both the source and destination directories. If this happens, 199819ea8026Sopenharmony_ciwe can resolve the move using the information in the global state to remove 199919ea8026Sopenharmony_cione of the files. 200019ea8026Sopenharmony_ci 200119ea8026Sopenharmony_ci``` 200219ea8026Sopenharmony_ci .--------. gstate = moving file D in dir A (m1) 200319ea8026Sopenharmony_ci .| root |-. ^ 200419ea8026Sopenharmony_ci || |------------> xor 200519ea8026Sopenharmony_ci.---------------|| |-' ^ 200619ea8026Sopenharmony_ci| |'--------' | 200719ea8026Sopenharmony_ci| '--|-|-|-' | 200819ea8026Sopenharmony_ci| .--------' | '---------. | 200919ea8026Sopenharmony_ci| | | | | 201019ea8026Sopenharmony_ci| | .----------> xor --------> xor 201119ea8026Sopenharmony_ci| v | v ^ v ^ 201219ea8026Sopenharmony_ci| .--------. | .--------. | .--------. | 201319ea8026Sopenharmony_ci'->| dir A |-|->| dir B |-|->| dir C | | 201419ea8026Sopenharmony_ci || |-' || |-' || gdelta |-' 201519ea8026Sopenharmony_ci || | || | || =m1 | 201619ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 201719ea8026Sopenharmony_ci '----|---' '--------' '----|---' 201819ea8026Sopenharmony_ci | .---------------------' 201919ea8026Sopenharmony_ci v v 202019ea8026Sopenharmony_ci .--------. 202119ea8026Sopenharmony_ci | file D | 202219ea8026Sopenharmony_ci | | 202319ea8026Sopenharmony_ci | | 202419ea8026Sopenharmony_ci '--------' 202519ea8026Sopenharmony_ci``` 202619ea8026Sopenharmony_ci 202719ea8026Sopenharmony_ciWe can also move directories the same way we move files. There is the threaded 202819ea8026Sopenharmony_cilinked-list to consider, but leaving the threaded linked-list unchanged works 202919ea8026Sopenharmony_cifine as the order doesn't really matter. 203019ea8026Sopenharmony_ci 203119ea8026Sopenharmony_ci``` 203219ea8026Sopenharmony_ci .--------. gstate = no move (m1^~m1) 203319ea8026Sopenharmony_ci .| root |-. 203419ea8026Sopenharmony_ci || | | 203519ea8026Sopenharmony_ci.-------------|| |-' 203619ea8026Sopenharmony_ci| |'--------' 203719ea8026Sopenharmony_ci| '--|-|-|-' 203819ea8026Sopenharmony_ci| .------' | '-------. 203919ea8026Sopenharmony_ci| v v v 204019ea8026Sopenharmony_ci| .--------. .--------. .--------. 204119ea8026Sopenharmony_ci'->| dir A |->| dir B |->| dir C | 204219ea8026Sopenharmony_ci || gdelta | || | || gdelta | 204319ea8026Sopenharmony_ci || =~m1 | || | || =m1 | 204419ea8026Sopenharmony_ci |'--------' |'--------' |'--------' 204519ea8026Sopenharmony_ci '--------' '--------' '----|---' 204619ea8026Sopenharmony_ci v 204719ea8026Sopenharmony_ci .--------. 204819ea8026Sopenharmony_ci | file D | 204919ea8026Sopenharmony_ci | | 205019ea8026Sopenharmony_ci | | 205119ea8026Sopenharmony_ci '--------' 205219ea8026Sopenharmony_ci 205319ea8026Sopenharmony_cibegin move, add reference in dir C, change gstate to have move 205419ea8026Sopenharmony_ci=> 205519ea8026Sopenharmony_ci .--------. gstate = moving dir B in root (m1^~m1^m2) 205619ea8026Sopenharmony_ci .| root |-. 205719ea8026Sopenharmony_ci || | | 205819ea8026Sopenharmony_ci.--------------|| |-' 205919ea8026Sopenharmony_ci| |'--------' 206019ea8026Sopenharmony_ci| '--|-|-|-' 206119ea8026Sopenharmony_ci| .-------' | '----------. 206219ea8026Sopenharmony_ci| v | v 206319ea8026Sopenharmony_ci| .--------. | .--------. 206419ea8026Sopenharmony_ci'->| dir A |-. | .->| dir C | 206519ea8026Sopenharmony_ci || gdelta | | | | || gdelta | 206619ea8026Sopenharmony_ci || =~m1 | | | | || =m1^m2 | 206719ea8026Sopenharmony_ci |'--------' | | | |'--------' 206819ea8026Sopenharmony_ci '--------' | | | '---|--|-' 206919ea8026Sopenharmony_ci | | .-------' | 207019ea8026Sopenharmony_ci | v v | v 207119ea8026Sopenharmony_ci | .--------. | .--------. 207219ea8026Sopenharmony_ci '->| dir B |-' | file D | 207319ea8026Sopenharmony_ci || | | | 207419ea8026Sopenharmony_ci || | | | 207519ea8026Sopenharmony_ci |'--------' '--------' 207619ea8026Sopenharmony_ci '--------' 207719ea8026Sopenharmony_ci 207819ea8026Sopenharmony_cicomplete move, remove reference in root, change gstate to no move 207919ea8026Sopenharmony_ci=> 208019ea8026Sopenharmony_ci .--------. gstate = no move (m1^~m1^m2^~m2) 208119ea8026Sopenharmony_ci .| root |-. 208219ea8026Sopenharmony_ci || gdelta | | 208319ea8026Sopenharmony_ci.-----------|| =~m2 |-' 208419ea8026Sopenharmony_ci| |'--------' 208519ea8026Sopenharmony_ci| '---|--|-' 208619ea8026Sopenharmony_ci| .-----' '-----. 208719ea8026Sopenharmony_ci| v v 208819ea8026Sopenharmony_ci| .--------. .--------. 208919ea8026Sopenharmony_ci'->| dir A |-. .->| dir C | 209019ea8026Sopenharmony_ci || gdelta | | | || gdelta | 209119ea8026Sopenharmony_ci || =~m1 | | '-|| =m1^m2 |-------. 209219ea8026Sopenharmony_ci |'--------' | |'--------' | 209319ea8026Sopenharmony_ci '--------' | '---|--|-' | 209419ea8026Sopenharmony_ci | .-' '-. | 209519ea8026Sopenharmony_ci | v v | 209619ea8026Sopenharmony_ci | .--------. .--------. | 209719ea8026Sopenharmony_ci '->| dir B |--| file D |-' 209819ea8026Sopenharmony_ci || | | | 209919ea8026Sopenharmony_ci || | | | 210019ea8026Sopenharmony_ci |'--------' '--------' 210119ea8026Sopenharmony_ci '--------' 210219ea8026Sopenharmony_ci``` 210319ea8026Sopenharmony_ci 210419ea8026Sopenharmony_ciGlobal state gives us a powerful tool we can use to solve the move problem. 210519ea8026Sopenharmony_ciAnd the result is surprisingly performant, only needing the minimum number 210619ea8026Sopenharmony_ciof states and using the same number of commits as a naive move. Additionally, 210719ea8026Sopenharmony_ciglobal state gives us a bit of persistent state we can use for some other 210819ea8026Sopenharmony_cismall improvements. 210919ea8026Sopenharmony_ci 211019ea8026Sopenharmony_ci## Conclusion 211119ea8026Sopenharmony_ci 211219ea8026Sopenharmony_ciAnd that's littlefs, thanks for reading! 211319ea8026Sopenharmony_ci 211419ea8026Sopenharmony_ci 211519ea8026Sopenharmony_ci[wikipedia-flash]: https://en.wikipedia.org/wiki/Flash_memory 211619ea8026Sopenharmony_ci[wikipedia-sna]: https://en.wikipedia.org/wiki/Serial_number_arithmetic 211719ea8026Sopenharmony_ci[wikipedia-crc]: https://en.wikipedia.org/wiki/Cyclic_redundancy_check 211819ea8026Sopenharmony_ci[wikipedia-cow]: https://en.wikipedia.org/wiki/Copy-on-write 211919ea8026Sopenharmony_ci[wikipedia-B-tree]: https://en.wikipedia.org/wiki/B-tree 212019ea8026Sopenharmony_ci[wikipedia-B+-tree]: https://en.wikipedia.org/wiki/B%2B_tree 212119ea8026Sopenharmony_ci[wikipedia-skip-list]: https://en.wikipedia.org/wiki/Skip_list 212219ea8026Sopenharmony_ci[wikipedia-ctz]: https://en.wikipedia.org/wiki/Count_trailing_zeros 212319ea8026Sopenharmony_ci[wikipedia-ecc]: https://en.wikipedia.org/wiki/Error_correction_code 212419ea8026Sopenharmony_ci[wikipedia-hamming-bound]: https://en.wikipedia.org/wiki/Hamming_bound 212519ea8026Sopenharmony_ci[wikipedia-dynamic-wear-leveling]: https://en.wikipedia.org/wiki/Wear_leveling#Dynamic_wear_leveling 212619ea8026Sopenharmony_ci[wikipedia-static-wear-leveling]: https://en.wikipedia.org/wiki/Wear_leveling#Static_wear_leveling 212719ea8026Sopenharmony_ci[wikipedia-ftl]: https://en.wikipedia.org/wiki/Flash_translation_layer 212819ea8026Sopenharmony_ci 212919ea8026Sopenharmony_ci[oeis]: https://oeis.org 213019ea8026Sopenharmony_ci[A001511]: https://oeis.org/A001511 213119ea8026Sopenharmony_ci[A005187]: https://oeis.org/A005187 213219ea8026Sopenharmony_ci 213319ea8026Sopenharmony_ci[fat]: https://en.wikipedia.org/wiki/Design_of_the_FAT_file_system 213419ea8026Sopenharmony_ci[ext2]: http://e2fsprogs.sourceforge.net/ext2intro.html 213519ea8026Sopenharmony_ci[jffs]: https://www.sourceware.org/jffs2/jffs2-html 213619ea8026Sopenharmony_ci[yaffs]: https://yaffs.net/documents/how-yaffs-works 213719ea8026Sopenharmony_ci[spiffs]: https://github.com/pellepl/spiffs/blob/master/docs/TECH_SPEC 213819ea8026Sopenharmony_ci[ext4]: https://ext4.wiki.kernel.org/index.php/Ext4_Design 213919ea8026Sopenharmony_ci[ntfs]: https://en.wikipedia.org/wiki/NTFS 214019ea8026Sopenharmony_ci[btrfs]: https://btrfs.wiki.kernel.org/index.php/Btrfs_design 214119ea8026Sopenharmony_ci[zfs]: https://en.wikipedia.org/wiki/ZFS 214219ea8026Sopenharmony_ci 214319ea8026Sopenharmony_ci[cow]: https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg 214419ea8026Sopenharmony_ci[elephant]: https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg 214519ea8026Sopenharmony_ci[ram]: https://upload.wikimedia.org/wikipedia/commons/9/97/New_Mexico_Bighorn_Sheep.JPG 214619ea8026Sopenharmony_ci 214719ea8026Sopenharmony_ci[metadata-formula1]: https://latex.codecogs.com/svg.latex?cost%20%3D%20n%20+%20n%20%5Cfrac%7Bs%7D%7Bd+1%7D 214819ea8026Sopenharmony_ci[metadata-formula2]: https://latex.codecogs.com/svg.latex?s%20%3D%20r%20%5Cfrac%7Bsize%7D%7Bn%7D 214919ea8026Sopenharmony_ci[metadata-formula3]: https://latex.codecogs.com/svg.latex?d%20%3D%20%281-r%29%20%5Cfrac%7Bsize%7D%7Bn%7D 215019ea8026Sopenharmony_ci[metadata-formula4]: https://latex.codecogs.com/svg.latex?cost%20%3D%20n%20+%20n%20%5Cfrac%7Br%5Cfrac%7Bsize%7D%7Bn%7D%7D%7B%281-r%29%5Cfrac%7Bsize%7D%7Bn%7D+1%7D 215119ea8026Sopenharmony_ci 215219ea8026Sopenharmony_ci[ctz-formula1]: https://latex.codecogs.com/svg.latex?%5Clim_%7Bn%5Cto%5Cinfty%7D%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D0%7D%5E%7Bn%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29+1%5Cright%29%20%3D%20%5Csum_%7Bi%3D0%7D%5Cfrac%7B1%7D%7B2%5Ei%7D%20%3D%202 215319ea8026Sopenharmony_ci[ctz-formula2]: https://latex.codecogs.com/svg.latex?B%20%3D%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%5Clceil%5Clog_2%5Cleft%28%5Cfrac%7B2%5Ew%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%29%5Cright%5Crceil 215419ea8026Sopenharmony_ci[ctz-formula3]: https://latex.codecogs.com/svg.latex?N%20%3D%20%5Csum_i%5En%5Cleft%5BB-%5Cfrac%7Bw%7D%7B8%7D%5Cleft%28%5Ctext%7Bctz%7D%28i%29+1%5Cright%29%5Cright%5D 215519ea8026Sopenharmony_ci[ctz-formula4]: https://latex.codecogs.com/svg.latex?%5Csum_i%5En%5Cleft%28%5Ctext%7Bctz%7D%28i%29+1%5Cright%29%20%3D%202n-%5Ctext%7Bpopcount%7D%28n%29 215619ea8026Sopenharmony_ci[ctz-formula5]: https://latex.codecogs.com/svg.latex?N%20%3D%20Bn%20-%20%5Cfrac%7Bw%7D%7B8%7D%5Cleft%282n-%5Ctext%7Bpopcount%7D%28n%29%5Cright%29 215719ea8026Sopenharmony_ci[ctz-formula6]: https://latex.codecogs.com/svg.latex?n%20%3D%20%5Cleft%5Clfloor%5Cfrac%7BN-%5Cfrac%7Bw%7D%7B8%7D%5Cleft%28%5Ctext%7Bpopcount%7D%5Cleft%28%5Cfrac%7BN%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D-1%5Cright%29+2%5Cright%29%7D%7BB-2%5Cfrac%7Bw%7D%7B8%7D%7D%5Cright%5Crfloor 215819ea8026Sopenharmony_ci[ctz-formula7]: https://latex.codecogs.com/svg.latex?%5Cmathit%7Boff%7D%20%3D%20N%20-%20%5Cleft%28B-2%5Cfrac%7Bw%7D%7B8%7D%5Cright%29n%20-%20%5Cfrac%7Bw%7D%7B8%7D%5Ctext%7Bpopcount%7D%28n%29 215919ea8026Sopenharmony_ci 216019ea8026Sopenharmony_ci[bigB]: https://latex.codecogs.com/svg.latex?B 216119ea8026Sopenharmony_ci[d]: https://latex.codecogs.com/svg.latex?d 216219ea8026Sopenharmony_ci[m]: https://latex.codecogs.com/svg.latex?m 216319ea8026Sopenharmony_ci[bigN]: https://latex.codecogs.com/svg.latex?N 216419ea8026Sopenharmony_ci[n]: https://latex.codecogs.com/svg.latex?n 216519ea8026Sopenharmony_ci[n']: https://latex.codecogs.com/svg.latex?n%27 216619ea8026Sopenharmony_ci[r]: https://latex.codecogs.com/svg.latex?r 216719ea8026Sopenharmony_ci[s]: https://latex.codecogs.com/svg.latex?s 216819ea8026Sopenharmony_ci[w]: https://latex.codecogs.com/svg.latex?w 216919ea8026Sopenharmony_ci[x]: https://latex.codecogs.com/svg.latex?x 217019ea8026Sopenharmony_ci 217119ea8026Sopenharmony_ci[metadata-cost-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/metadata-cost.svg?sanitize=true 217219ea8026Sopenharmony_ci[wear-distribution-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/wear-distribution.svg?sanitize=true 217319ea8026Sopenharmony_ci[file-cost-graph]: https://raw.githubusercontent.com/geky/littlefs/gh-images/file-cost.svg?sanitize=true 2174