162306a36Sopenharmony_ci============================= 262306a36Sopenharmony_ciBTT - Block Translation Table 362306a36Sopenharmony_ci============================= 462306a36Sopenharmony_ci 562306a36Sopenharmony_ci 662306a36Sopenharmony_ci1. Introduction 762306a36Sopenharmony_ci=============== 862306a36Sopenharmony_ci 962306a36Sopenharmony_ciPersistent memory based storage is able to perform IO at byte (or more 1062306a36Sopenharmony_ciaccurately, cache line) granularity. However, we often want to expose such 1162306a36Sopenharmony_cistorage as traditional block devices. The block drivers for persistent memory 1262306a36Sopenharmony_ciwill do exactly this. However, they do not provide any atomicity guarantees. 1362306a36Sopenharmony_ciTraditional SSDs typically provide protection against torn sectors in hardware, 1462306a36Sopenharmony_ciusing stored energy in capacitors to complete in-flight block writes, or perhaps 1562306a36Sopenharmony_ciin firmware. We don't have this luxury with persistent memory - if a write is in 1662306a36Sopenharmony_ciprogress, and we experience a power failure, the block will contain a mix of old 1762306a36Sopenharmony_ciand new data. Applications may not be prepared to handle such a scenario. 1862306a36Sopenharmony_ci 1962306a36Sopenharmony_ciThe Block Translation Table (BTT) provides atomic sector update semantics for 2062306a36Sopenharmony_cipersistent memory devices, so that applications that rely on sector writes not 2162306a36Sopenharmony_cibeing torn can continue to do so. The BTT manifests itself as a stacked block 2262306a36Sopenharmony_cidevice, and reserves a portion of the underlying storage for its metadata. At 2362306a36Sopenharmony_cithe heart of it, is an indirection table that re-maps all the blocks on the 2462306a36Sopenharmony_civolume. It can be thought of as an extremely simple file system that only 2562306a36Sopenharmony_ciprovides atomic sector updates. 2662306a36Sopenharmony_ci 2762306a36Sopenharmony_ci 2862306a36Sopenharmony_ci2. Static Layout 2962306a36Sopenharmony_ci================ 3062306a36Sopenharmony_ci 3162306a36Sopenharmony_ciThe underlying storage on which a BTT can be laid out is not limited in any way. 3262306a36Sopenharmony_ciThe BTT, however, splits the available space into chunks of up to 512 GiB, 3362306a36Sopenharmony_cicalled "Arenas". 3462306a36Sopenharmony_ci 3562306a36Sopenharmony_ciEach arena follows the same layout for its metadata, and all references in an 3662306a36Sopenharmony_ciarena are internal to it (with the exception of one field that points to the 3762306a36Sopenharmony_cinext arena). The following depicts the "On-disk" metadata layout:: 3862306a36Sopenharmony_ci 3962306a36Sopenharmony_ci 4062306a36Sopenharmony_ci Backing Store +-------> Arena 4162306a36Sopenharmony_ci +---------------+ | +------------------+ 4262306a36Sopenharmony_ci | | | | Arena info block | 4362306a36Sopenharmony_ci | Arena 0 +---+ | 4K | 4462306a36Sopenharmony_ci | 512G | +------------------+ 4562306a36Sopenharmony_ci | | | | 4662306a36Sopenharmony_ci +---------------+ | | 4762306a36Sopenharmony_ci | | | | 4862306a36Sopenharmony_ci | Arena 1 | | Data Blocks | 4962306a36Sopenharmony_ci | 512G | | | 5062306a36Sopenharmony_ci | | | | 5162306a36Sopenharmony_ci +---------------+ | | 5262306a36Sopenharmony_ci | . | | | 5362306a36Sopenharmony_ci | . | | | 5462306a36Sopenharmony_ci | . | | | 5562306a36Sopenharmony_ci | | | | 5662306a36Sopenharmony_ci | | | | 5762306a36Sopenharmony_ci +---------------+ +------------------+ 5862306a36Sopenharmony_ci | | 5962306a36Sopenharmony_ci | BTT Map | 6062306a36Sopenharmony_ci | | 6162306a36Sopenharmony_ci | | 6262306a36Sopenharmony_ci +------------------+ 6362306a36Sopenharmony_ci | | 6462306a36Sopenharmony_ci | BTT Flog | 6562306a36Sopenharmony_ci | | 6662306a36Sopenharmony_ci +------------------+ 6762306a36Sopenharmony_ci | Info block copy | 6862306a36Sopenharmony_ci | 4K | 6962306a36Sopenharmony_ci +------------------+ 7062306a36Sopenharmony_ci 7162306a36Sopenharmony_ci 7262306a36Sopenharmony_ci3. Theory of Operation 7362306a36Sopenharmony_ci====================== 7462306a36Sopenharmony_ci 7562306a36Sopenharmony_ci 7662306a36Sopenharmony_cia. The BTT Map 7762306a36Sopenharmony_ci-------------- 7862306a36Sopenharmony_ci 7962306a36Sopenharmony_ciThe map is a simple lookup/indirection table that maps an LBA to an internal 8062306a36Sopenharmony_ciblock. Each map entry is 32 bits. The two most significant bits are special 8162306a36Sopenharmony_ciflags, and the remaining form the internal block number. 8262306a36Sopenharmony_ci 8362306a36Sopenharmony_ci======== ============================================================= 8462306a36Sopenharmony_ciBit Description 8562306a36Sopenharmony_ci======== ============================================================= 8662306a36Sopenharmony_ci31 - 30 Error and Zero flags - Used in the following way:: 8762306a36Sopenharmony_ci 8862306a36Sopenharmony_ci == == ==================================================== 8962306a36Sopenharmony_ci 31 30 Description 9062306a36Sopenharmony_ci == == ==================================================== 9162306a36Sopenharmony_ci 0 0 Initial state. Reads return zeroes; Premap = Postmap 9262306a36Sopenharmony_ci 0 1 Zero state: Reads return zeroes 9362306a36Sopenharmony_ci 1 0 Error state: Reads fail; Writes clear 'E' bit 9462306a36Sopenharmony_ci 1 1 Normal Block – has valid postmap 9562306a36Sopenharmony_ci == == ==================================================== 9662306a36Sopenharmony_ci 9762306a36Sopenharmony_ci29 - 0 Mappings to internal 'postmap' blocks 9862306a36Sopenharmony_ci======== ============================================================= 9962306a36Sopenharmony_ci 10062306a36Sopenharmony_ci 10162306a36Sopenharmony_ciSome of the terminology that will be subsequently used: 10262306a36Sopenharmony_ci 10362306a36Sopenharmony_ci============ ================================================================ 10462306a36Sopenharmony_ciExternal LBA LBA as made visible to upper layers. 10562306a36Sopenharmony_ciABA Arena Block Address - Block offset/number within an arena 10662306a36Sopenharmony_ciPremap ABA The block offset into an arena, which was decided upon by range 10762306a36Sopenharmony_ci checking the External LBA 10862306a36Sopenharmony_ciPostmap ABA The block number in the "Data Blocks" area obtained after 10962306a36Sopenharmony_ci indirection from the map 11062306a36Sopenharmony_cinfree The number of free blocks that are maintained at any given time. 11162306a36Sopenharmony_ci This is the number of concurrent writes that can happen to the 11262306a36Sopenharmony_ci arena. 11362306a36Sopenharmony_ci============ ================================================================ 11462306a36Sopenharmony_ci 11562306a36Sopenharmony_ci 11662306a36Sopenharmony_ciFor example, after adding a BTT, we surface a disk of 1024G. We get a read for 11762306a36Sopenharmony_cithe external LBA at 768G. This falls into the second arena, and of the 512G 11862306a36Sopenharmony_ciworth of blocks that this arena contributes, this block is at 256G. Thus, the 11962306a36Sopenharmony_cipremap ABA is 256G. We now refer to the map, and find out the mapping for block 12062306a36Sopenharmony_ci'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. 12162306a36Sopenharmony_ci 12262306a36Sopenharmony_ci 12362306a36Sopenharmony_cib. The BTT Flog 12462306a36Sopenharmony_ci--------------- 12562306a36Sopenharmony_ci 12662306a36Sopenharmony_ciThe BTT provides sector atomicity by making every write an "allocating write", 12762306a36Sopenharmony_cii.e. Every write goes to a "free" block. A running list of free blocks is 12862306a36Sopenharmony_cimaintained in the form of the BTT flog. 'Flog' is a combination of the words 12962306a36Sopenharmony_ci"free list" and "log". The flog contains 'nfree' entries, and an entry contains: 13062306a36Sopenharmony_ci 13162306a36Sopenharmony_ci======== ===================================================================== 13262306a36Sopenharmony_cilba The premap ABA that is being written to 13362306a36Sopenharmony_ciold_map The old postmap ABA - after 'this' write completes, this will be a 13462306a36Sopenharmony_ci free block. 13562306a36Sopenharmony_cinew_map The new postmap ABA. The map will up updated to reflect this 13662306a36Sopenharmony_ci lba->postmap_aba mapping, but we log it here in case we have to 13762306a36Sopenharmony_ci recover. 13862306a36Sopenharmony_ciseq Sequence number to mark which of the 2 sections of this flog entry is 13962306a36Sopenharmony_ci valid/newest. It cycles between 01->10->11->01 (binary) under normal 14062306a36Sopenharmony_ci operation, with 00 indicating an uninitialized state. 14162306a36Sopenharmony_cilba' alternate lba entry 14262306a36Sopenharmony_ciold_map' alternate old postmap entry 14362306a36Sopenharmony_cinew_map' alternate new postmap entry 14462306a36Sopenharmony_ciseq' alternate sequence number. 14562306a36Sopenharmony_ci======== ===================================================================== 14662306a36Sopenharmony_ci 14762306a36Sopenharmony_ciEach of the above fields is 32-bit, making one entry 32 bytes. Entries are also 14862306a36Sopenharmony_cipadded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are 14962306a36Sopenharmony_cidone such that for any entry being written, it: 15062306a36Sopenharmony_cia. overwrites the 'old' section in the entry based on sequence numbers 15162306a36Sopenharmony_cib. writes the 'new' section such that the sequence number is written last. 15262306a36Sopenharmony_ci 15362306a36Sopenharmony_ci 15462306a36Sopenharmony_cic. The concept of lanes 15562306a36Sopenharmony_ci----------------------- 15662306a36Sopenharmony_ci 15762306a36Sopenharmony_ciWhile 'nfree' describes the number of concurrent IOs an arena can process 15862306a36Sopenharmony_ciconcurrently, 'nlanes' is the number of IOs the BTT device as a whole can 15962306a36Sopenharmony_ciprocess:: 16062306a36Sopenharmony_ci 16162306a36Sopenharmony_ci nlanes = min(nfree, num_cpus) 16262306a36Sopenharmony_ci 16362306a36Sopenharmony_ciA lane number is obtained at the start of any IO, and is used for indexing into 16462306a36Sopenharmony_ciall the on-disk and in-memory data structures for the duration of the IO. If 16562306a36Sopenharmony_cithere are more CPUs than the max number of available lanes, than lanes are 16662306a36Sopenharmony_ciprotected by spinlocks. 16762306a36Sopenharmony_ci 16862306a36Sopenharmony_ci 16962306a36Sopenharmony_cid. In-memory data structure: Read Tracking Table (RTT) 17062306a36Sopenharmony_ci------------------------------------------------------ 17162306a36Sopenharmony_ci 17262306a36Sopenharmony_ciConsider a case where we have two threads, one doing reads and the other, 17362306a36Sopenharmony_ciwrites. We can hit a condition where the writer thread grabs a free block to do 17462306a36Sopenharmony_cia new IO, but the (slow) reader thread is still reading from it. In other words, 17562306a36Sopenharmony_cithe reader consulted a map entry, and started reading the corresponding block. A 17662306a36Sopenharmony_ciwriter started writing to the same external LBA, and finished the write updating 17762306a36Sopenharmony_cithe map for that external LBA to point to its new postmap ABA. At this point the 17862306a36Sopenharmony_ciinternal, postmap block that the reader is (still) reading has been inserted 17962306a36Sopenharmony_ciinto the list of free blocks. If another write comes in for the same LBA, it can 18062306a36Sopenharmony_cigrab this free block, and start writing to it, causing the reader to read 18162306a36Sopenharmony_ciincorrect data. To prevent this, we introduce the RTT. 18262306a36Sopenharmony_ci 18362306a36Sopenharmony_ciThe RTT is a simple, per arena table with 'nfree' entries. Every reader inserts 18462306a36Sopenharmony_ciinto rtt[lane_number], the postmap ABA it is reading, and clears it after the 18562306a36Sopenharmony_ciread is complete. Every writer thread, after grabbing a free block, checks the 18662306a36Sopenharmony_ciRTT for its presence. If the postmap free block is in the RTT, it waits till the 18762306a36Sopenharmony_cireader clears the RTT entry, and only then starts writing to it. 18862306a36Sopenharmony_ci 18962306a36Sopenharmony_ci 19062306a36Sopenharmony_cie. In-memory data structure: map locks 19162306a36Sopenharmony_ci-------------------------------------- 19262306a36Sopenharmony_ci 19362306a36Sopenharmony_ciConsider a case where two writer threads are writing to the same LBA. There can 19462306a36Sopenharmony_cibe a race in the following sequence of steps:: 19562306a36Sopenharmony_ci 19662306a36Sopenharmony_ci free[lane] = map[premap_aba] 19762306a36Sopenharmony_ci map[premap_aba] = postmap_aba 19862306a36Sopenharmony_ci 19962306a36Sopenharmony_ciBoth threads can update their respective free[lane] with the same old, freed 20062306a36Sopenharmony_cipostmap_aba. This has made the layout inconsistent by losing a free entry, and 20162306a36Sopenharmony_ciat the same time, duplicating another free entry for two lanes. 20262306a36Sopenharmony_ci 20362306a36Sopenharmony_ciTo solve this, we could have a single map lock (per arena) that has to be taken 20462306a36Sopenharmony_cibefore performing the above sequence, but we feel that could be too contentious. 20562306a36Sopenharmony_ciInstead we use an array of (nfree) map_locks that is indexed by 20662306a36Sopenharmony_ci(premap_aba modulo nfree). 20762306a36Sopenharmony_ci 20862306a36Sopenharmony_ci 20962306a36Sopenharmony_cif. Reconstruction from the Flog 21062306a36Sopenharmony_ci------------------------------- 21162306a36Sopenharmony_ci 21262306a36Sopenharmony_ciOn startup, we analyze the BTT flog to create our list of free blocks. We walk 21362306a36Sopenharmony_cithrough all the entries, and for each lane, of the set of two possible 21462306a36Sopenharmony_ci'sections', we always look at the most recent one only (based on the sequence 21562306a36Sopenharmony_cinumber). The reconstruction rules/steps are simple: 21662306a36Sopenharmony_ci 21762306a36Sopenharmony_ci- Read map[log_entry.lba]. 21862306a36Sopenharmony_ci- If log_entry.new matches the map entry, then log_entry.old is free. 21962306a36Sopenharmony_ci- If log_entry.new does not match the map entry, then log_entry.new is free. 22062306a36Sopenharmony_ci (This case can only be caused by power-fails/unsafe shutdowns) 22162306a36Sopenharmony_ci 22262306a36Sopenharmony_ci 22362306a36Sopenharmony_cig. Summarizing - Read and Write flows 22462306a36Sopenharmony_ci------------------------------------- 22562306a36Sopenharmony_ci 22662306a36Sopenharmony_ciRead: 22762306a36Sopenharmony_ci 22862306a36Sopenharmony_ci1. Convert external LBA to arena number + pre-map ABA 22962306a36Sopenharmony_ci2. Get a lane (and take lane_lock) 23062306a36Sopenharmony_ci3. Read map to get the entry for this pre-map ABA 23162306a36Sopenharmony_ci4. Enter post-map ABA into RTT[lane] 23262306a36Sopenharmony_ci5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) 23362306a36Sopenharmony_ci6. If ERROR flag set in map, end IO with EIO (go to step 8) 23462306a36Sopenharmony_ci7. Read data from this block 23562306a36Sopenharmony_ci8. Remove post-map ABA entry from RTT[lane] 23662306a36Sopenharmony_ci9. Release lane (and lane_lock) 23762306a36Sopenharmony_ci 23862306a36Sopenharmony_ciWrite: 23962306a36Sopenharmony_ci 24062306a36Sopenharmony_ci1. Convert external LBA to Arena number + pre-map ABA 24162306a36Sopenharmony_ci2. Get a lane (and take lane_lock) 24262306a36Sopenharmony_ci3. Use lane to index into in-memory free list and obtain a new block, next flog 24362306a36Sopenharmony_ci index, next sequence number 24462306a36Sopenharmony_ci4. Scan the RTT to check if free block is present, and spin/wait if it is. 24562306a36Sopenharmony_ci5. Write data to this free block 24662306a36Sopenharmony_ci6. Read map to get the existing post-map ABA entry for this pre-map ABA 24762306a36Sopenharmony_ci7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] 24862306a36Sopenharmony_ci8. Write new post-map ABA into map. 24962306a36Sopenharmony_ci9. Write old post-map entry into the free list 25062306a36Sopenharmony_ci10. Calculate next sequence number and write into the free list entry 25162306a36Sopenharmony_ci11. Release lane (and lane_lock) 25262306a36Sopenharmony_ci 25362306a36Sopenharmony_ci 25462306a36Sopenharmony_ci4. Error Handling 25562306a36Sopenharmony_ci================= 25662306a36Sopenharmony_ci 25762306a36Sopenharmony_ciAn arena would be in an error state if any of the metadata is corrupted 25862306a36Sopenharmony_ciirrecoverably, either due to a bug or a media error. The following conditions 25962306a36Sopenharmony_ciindicate an error: 26062306a36Sopenharmony_ci 26162306a36Sopenharmony_ci- Info block checksum does not match (and recovering from the copy also fails) 26262306a36Sopenharmony_ci- All internal available blocks are not uniquely and entirely addressed by the 26362306a36Sopenharmony_ci sum of mapped blocks and free blocks (from the BTT flog). 26462306a36Sopenharmony_ci- Rebuilding free list from the flog reveals missing/duplicate/impossible 26562306a36Sopenharmony_ci entries 26662306a36Sopenharmony_ci- A map entry is out of bounds 26762306a36Sopenharmony_ci 26862306a36Sopenharmony_ciIf any of these error conditions are encountered, the arena is put into a read 26962306a36Sopenharmony_cionly state using a flag in the info block. 27062306a36Sopenharmony_ci 27162306a36Sopenharmony_ci 27262306a36Sopenharmony_ci5. Usage 27362306a36Sopenharmony_ci======== 27462306a36Sopenharmony_ci 27562306a36Sopenharmony_ciThe BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem 27662306a36Sopenharmony_ci(pmem, or blk mode). The easiest way to set up such a namespace is using the 27762306a36Sopenharmony_ci'ndctl' utility [1]: 27862306a36Sopenharmony_ci 27962306a36Sopenharmony_ciFor example, the ndctl command line to setup a btt with a 4k sector size is:: 28062306a36Sopenharmony_ci 28162306a36Sopenharmony_ci ndctl create-namespace -f -e namespace0.0 -m sector -l 4k 28262306a36Sopenharmony_ci 28362306a36Sopenharmony_ciSee ndctl create-namespace --help for more options. 28462306a36Sopenharmony_ci 28562306a36Sopenharmony_ci[1]: https://github.com/pmem/ndctl 286