162306a36Sopenharmony_ci=============================
262306a36Sopenharmony_ciBTT - Block Translation Table
362306a36Sopenharmony_ci=============================
462306a36Sopenharmony_ci
562306a36Sopenharmony_ci
662306a36Sopenharmony_ci1. Introduction
762306a36Sopenharmony_ci===============
862306a36Sopenharmony_ci
962306a36Sopenharmony_ciPersistent memory based storage is able to perform IO at byte (or more
1062306a36Sopenharmony_ciaccurately, cache line) granularity. However, we often want to expose such
1162306a36Sopenharmony_cistorage as traditional block devices. The block drivers for persistent memory
1262306a36Sopenharmony_ciwill do exactly this. However, they do not provide any atomicity guarantees.
1362306a36Sopenharmony_ciTraditional SSDs typically provide protection against torn sectors in hardware,
1462306a36Sopenharmony_ciusing stored energy in capacitors to complete in-flight block writes, or perhaps
1562306a36Sopenharmony_ciin firmware. We don't have this luxury with persistent memory - if a write is in
1662306a36Sopenharmony_ciprogress, and we experience a power failure, the block will contain a mix of old
1762306a36Sopenharmony_ciand new data. Applications may not be prepared to handle such a scenario.
1862306a36Sopenharmony_ci
1962306a36Sopenharmony_ciThe Block Translation Table (BTT) provides atomic sector update semantics for
2062306a36Sopenharmony_cipersistent memory devices, so that applications that rely on sector writes not
2162306a36Sopenharmony_cibeing torn can continue to do so. The BTT manifests itself as a stacked block
2262306a36Sopenharmony_cidevice, and reserves a portion of the underlying storage for its metadata. At
2362306a36Sopenharmony_cithe heart of it, is an indirection table that re-maps all the blocks on the
2462306a36Sopenharmony_civolume. It can be thought of as an extremely simple file system that only
2562306a36Sopenharmony_ciprovides atomic sector updates.
2662306a36Sopenharmony_ci
2762306a36Sopenharmony_ci
2862306a36Sopenharmony_ci2. Static Layout
2962306a36Sopenharmony_ci================
3062306a36Sopenharmony_ci
3162306a36Sopenharmony_ciThe underlying storage on which a BTT can be laid out is not limited in any way.
3262306a36Sopenharmony_ciThe BTT, however, splits the available space into chunks of up to 512 GiB,
3362306a36Sopenharmony_cicalled "Arenas".
3462306a36Sopenharmony_ci
3562306a36Sopenharmony_ciEach arena follows the same layout for its metadata, and all references in an
3662306a36Sopenharmony_ciarena are internal to it (with the exception of one field that points to the
3762306a36Sopenharmony_cinext arena). The following depicts the "On-disk" metadata layout::
3862306a36Sopenharmony_ci
3962306a36Sopenharmony_ci
4062306a36Sopenharmony_ci    Backing Store     +------->  Arena
4162306a36Sopenharmony_ci  +---------------+   |   +------------------+
4262306a36Sopenharmony_ci  |               |   |   | Arena info block |
4362306a36Sopenharmony_ci  |    Arena 0    +---+   |       4K         |
4462306a36Sopenharmony_ci  |     512G      |       +------------------+
4562306a36Sopenharmony_ci  |               |       |                  |
4662306a36Sopenharmony_ci  +---------------+       |                  |
4762306a36Sopenharmony_ci  |               |       |                  |
4862306a36Sopenharmony_ci  |    Arena 1    |       |   Data Blocks    |
4962306a36Sopenharmony_ci  |     512G      |       |                  |
5062306a36Sopenharmony_ci  |               |       |                  |
5162306a36Sopenharmony_ci  +---------------+       |                  |
5262306a36Sopenharmony_ci  |       .       |       |                  |
5362306a36Sopenharmony_ci  |       .       |       |                  |
5462306a36Sopenharmony_ci  |       .       |       |                  |
5562306a36Sopenharmony_ci  |               |       |                  |
5662306a36Sopenharmony_ci  |               |       |                  |
5762306a36Sopenharmony_ci  +---------------+       +------------------+
5862306a36Sopenharmony_ci                          |                  |
5962306a36Sopenharmony_ci                          |     BTT Map      |
6062306a36Sopenharmony_ci                          |                  |
6162306a36Sopenharmony_ci                          |                  |
6262306a36Sopenharmony_ci                          +------------------+
6362306a36Sopenharmony_ci                          |                  |
6462306a36Sopenharmony_ci                          |     BTT Flog     |
6562306a36Sopenharmony_ci                          |                  |
6662306a36Sopenharmony_ci                          +------------------+
6762306a36Sopenharmony_ci                          | Info block copy  |
6862306a36Sopenharmony_ci                          |       4K         |
6962306a36Sopenharmony_ci                          +------------------+
7062306a36Sopenharmony_ci
7162306a36Sopenharmony_ci
7262306a36Sopenharmony_ci3. Theory of Operation
7362306a36Sopenharmony_ci======================
7462306a36Sopenharmony_ci
7562306a36Sopenharmony_ci
7662306a36Sopenharmony_cia. The BTT Map
7762306a36Sopenharmony_ci--------------
7862306a36Sopenharmony_ci
7962306a36Sopenharmony_ciThe map is a simple lookup/indirection table that maps an LBA to an internal
8062306a36Sopenharmony_ciblock. Each map entry is 32 bits. The two most significant bits are special
8162306a36Sopenharmony_ciflags, and the remaining form the internal block number.
8262306a36Sopenharmony_ci
8362306a36Sopenharmony_ci======== =============================================================
8462306a36Sopenharmony_ciBit      Description
8562306a36Sopenharmony_ci======== =============================================================
8662306a36Sopenharmony_ci31 - 30	 Error and Zero flags - Used in the following way::
8762306a36Sopenharmony_ci
8862306a36Sopenharmony_ci	   == ==  ====================================================
8962306a36Sopenharmony_ci	   31 30  Description
9062306a36Sopenharmony_ci	   == ==  ====================================================
9162306a36Sopenharmony_ci	   0  0	  Initial state. Reads return zeroes; Premap = Postmap
9262306a36Sopenharmony_ci	   0  1	  Zero state: Reads return zeroes
9362306a36Sopenharmony_ci	   1  0	  Error state: Reads fail; Writes clear 'E' bit
9462306a36Sopenharmony_ci	   1  1	  Normal Block – has valid postmap
9562306a36Sopenharmony_ci	   == ==  ====================================================
9662306a36Sopenharmony_ci
9762306a36Sopenharmony_ci29 - 0	 Mappings to internal 'postmap' blocks
9862306a36Sopenharmony_ci======== =============================================================
9962306a36Sopenharmony_ci
10062306a36Sopenharmony_ci
10162306a36Sopenharmony_ciSome of the terminology that will be subsequently used:
10262306a36Sopenharmony_ci
10362306a36Sopenharmony_ci============	================================================================
10462306a36Sopenharmony_ciExternal LBA	LBA as made visible to upper layers.
10562306a36Sopenharmony_ciABA		Arena Block Address - Block offset/number within an arena
10662306a36Sopenharmony_ciPremap ABA	The block offset into an arena, which was decided upon by range
10762306a36Sopenharmony_ci		checking the External LBA
10862306a36Sopenharmony_ciPostmap ABA	The block number in the "Data Blocks" area obtained after
10962306a36Sopenharmony_ci		indirection from the map
11062306a36Sopenharmony_cinfree		The number of free blocks that are maintained at any given time.
11162306a36Sopenharmony_ci		This is the number of concurrent writes that can happen to the
11262306a36Sopenharmony_ci		arena.
11362306a36Sopenharmony_ci============	================================================================
11462306a36Sopenharmony_ci
11562306a36Sopenharmony_ci
11662306a36Sopenharmony_ciFor example, after adding a BTT, we surface a disk of 1024G. We get a read for
11762306a36Sopenharmony_cithe external LBA at 768G. This falls into the second arena, and of the 512G
11862306a36Sopenharmony_ciworth of blocks that this arena contributes, this block is at 256G. Thus, the
11962306a36Sopenharmony_cipremap ABA is 256G. We now refer to the map, and find out the mapping for block
12062306a36Sopenharmony_ci'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
12162306a36Sopenharmony_ci
12262306a36Sopenharmony_ci
12362306a36Sopenharmony_cib. The BTT Flog
12462306a36Sopenharmony_ci---------------
12562306a36Sopenharmony_ci
12662306a36Sopenharmony_ciThe BTT provides sector atomicity by making every write an "allocating write",
12762306a36Sopenharmony_cii.e. Every write goes to a "free" block. A running list of free blocks is
12862306a36Sopenharmony_cimaintained in the form of the BTT flog. 'Flog' is a combination of the words
12962306a36Sopenharmony_ci"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
13062306a36Sopenharmony_ci
13162306a36Sopenharmony_ci========  =====================================================================
13262306a36Sopenharmony_cilba       The premap ABA that is being written to
13362306a36Sopenharmony_ciold_map   The old postmap ABA - after 'this' write completes, this will be a
13462306a36Sopenharmony_ci	  free block.
13562306a36Sopenharmony_cinew_map   The new postmap ABA. The map will up updated to reflect this
13662306a36Sopenharmony_ci	  lba->postmap_aba mapping, but we log it here in case we have to
13762306a36Sopenharmony_ci	  recover.
13862306a36Sopenharmony_ciseq	  Sequence number to mark which of the 2 sections of this flog entry is
13962306a36Sopenharmony_ci	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
14062306a36Sopenharmony_ci	  operation, with 00 indicating an uninitialized state.
14162306a36Sopenharmony_cilba'	  alternate lba entry
14262306a36Sopenharmony_ciold_map'  alternate old postmap entry
14362306a36Sopenharmony_cinew_map'  alternate new postmap entry
14462306a36Sopenharmony_ciseq'	  alternate sequence number.
14562306a36Sopenharmony_ci========  =====================================================================
14662306a36Sopenharmony_ci
14762306a36Sopenharmony_ciEach of the above fields is 32-bit, making one entry 32 bytes. Entries are also
14862306a36Sopenharmony_cipadded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
14962306a36Sopenharmony_cidone such that for any entry being written, it:
15062306a36Sopenharmony_cia. overwrites the 'old' section in the entry based on sequence numbers
15162306a36Sopenharmony_cib. writes the 'new' section such that the sequence number is written last.
15262306a36Sopenharmony_ci
15362306a36Sopenharmony_ci
15462306a36Sopenharmony_cic. The concept of lanes
15562306a36Sopenharmony_ci-----------------------
15662306a36Sopenharmony_ci
15762306a36Sopenharmony_ciWhile 'nfree' describes the number of concurrent IOs an arena can process
15862306a36Sopenharmony_ciconcurrently, 'nlanes' is the number of IOs the BTT device as a whole can
15962306a36Sopenharmony_ciprocess::
16062306a36Sopenharmony_ci
16162306a36Sopenharmony_ci	nlanes = min(nfree, num_cpus)
16262306a36Sopenharmony_ci
16362306a36Sopenharmony_ciA lane number is obtained at the start of any IO, and is used for indexing into
16462306a36Sopenharmony_ciall the on-disk and in-memory data structures for the duration of the IO. If
16562306a36Sopenharmony_cithere are more CPUs than the max number of available lanes, than lanes are
16662306a36Sopenharmony_ciprotected by spinlocks.
16762306a36Sopenharmony_ci
16862306a36Sopenharmony_ci
16962306a36Sopenharmony_cid. In-memory data structure: Read Tracking Table (RTT)
17062306a36Sopenharmony_ci------------------------------------------------------
17162306a36Sopenharmony_ci
17262306a36Sopenharmony_ciConsider a case where we have two threads, one doing reads and the other,
17362306a36Sopenharmony_ciwrites. We can hit a condition where the writer thread grabs a free block to do
17462306a36Sopenharmony_cia new IO, but the (slow) reader thread is still reading from it. In other words,
17562306a36Sopenharmony_cithe reader consulted a map entry, and started reading the corresponding block. A
17662306a36Sopenharmony_ciwriter started writing to the same external LBA, and finished the write updating
17762306a36Sopenharmony_cithe map for that external LBA to point to its new postmap ABA. At this point the
17862306a36Sopenharmony_ciinternal, postmap block that the reader is (still) reading has been inserted
17962306a36Sopenharmony_ciinto the list of free blocks. If another write comes in for the same LBA, it can
18062306a36Sopenharmony_cigrab this free block, and start writing to it, causing the reader to read
18162306a36Sopenharmony_ciincorrect data. To prevent this, we introduce the RTT.
18262306a36Sopenharmony_ci
18362306a36Sopenharmony_ciThe RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
18462306a36Sopenharmony_ciinto rtt[lane_number], the postmap ABA it is reading, and clears it after the
18562306a36Sopenharmony_ciread is complete. Every writer thread, after grabbing a free block, checks the
18662306a36Sopenharmony_ciRTT for its presence. If the postmap free block is in the RTT, it waits till the
18762306a36Sopenharmony_cireader clears the RTT entry, and only then starts writing to it.
18862306a36Sopenharmony_ci
18962306a36Sopenharmony_ci
19062306a36Sopenharmony_cie. In-memory data structure: map locks
19162306a36Sopenharmony_ci--------------------------------------
19262306a36Sopenharmony_ci
19362306a36Sopenharmony_ciConsider a case where two writer threads are writing to the same LBA. There can
19462306a36Sopenharmony_cibe a race in the following sequence of steps::
19562306a36Sopenharmony_ci
19662306a36Sopenharmony_ci	free[lane] = map[premap_aba]
19762306a36Sopenharmony_ci	map[premap_aba] = postmap_aba
19862306a36Sopenharmony_ci
19962306a36Sopenharmony_ciBoth threads can update their respective free[lane] with the same old, freed
20062306a36Sopenharmony_cipostmap_aba. This has made the layout inconsistent by losing a free entry, and
20162306a36Sopenharmony_ciat the same time, duplicating another free entry for two lanes.
20262306a36Sopenharmony_ci
20362306a36Sopenharmony_ciTo solve this, we could have a single map lock (per arena) that has to be taken
20462306a36Sopenharmony_cibefore performing the above sequence, but we feel that could be too contentious.
20562306a36Sopenharmony_ciInstead we use an array of (nfree) map_locks that is indexed by
20662306a36Sopenharmony_ci(premap_aba modulo nfree).
20762306a36Sopenharmony_ci
20862306a36Sopenharmony_ci
20962306a36Sopenharmony_cif. Reconstruction from the Flog
21062306a36Sopenharmony_ci-------------------------------
21162306a36Sopenharmony_ci
21262306a36Sopenharmony_ciOn startup, we analyze the BTT flog to create our list of free blocks. We walk
21362306a36Sopenharmony_cithrough all the entries, and for each lane, of the set of two possible
21462306a36Sopenharmony_ci'sections', we always look at the most recent one only (based on the sequence
21562306a36Sopenharmony_cinumber). The reconstruction rules/steps are simple:
21662306a36Sopenharmony_ci
21762306a36Sopenharmony_ci- Read map[log_entry.lba].
21862306a36Sopenharmony_ci- If log_entry.new matches the map entry, then log_entry.old is free.
21962306a36Sopenharmony_ci- If log_entry.new does not match the map entry, then log_entry.new is free.
22062306a36Sopenharmony_ci  (This case can only be caused by power-fails/unsafe shutdowns)
22162306a36Sopenharmony_ci
22262306a36Sopenharmony_ci
22362306a36Sopenharmony_cig. Summarizing - Read and Write flows
22462306a36Sopenharmony_ci-------------------------------------
22562306a36Sopenharmony_ci
22662306a36Sopenharmony_ciRead:
22762306a36Sopenharmony_ci
22862306a36Sopenharmony_ci1.  Convert external LBA to arena number + pre-map ABA
22962306a36Sopenharmony_ci2.  Get a lane (and take lane_lock)
23062306a36Sopenharmony_ci3.  Read map to get the entry for this pre-map ABA
23162306a36Sopenharmony_ci4.  Enter post-map ABA into RTT[lane]
23262306a36Sopenharmony_ci5.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
23362306a36Sopenharmony_ci6.  If ERROR flag set in map, end IO with EIO (go to step 8)
23462306a36Sopenharmony_ci7.  Read data from this block
23562306a36Sopenharmony_ci8.  Remove post-map ABA entry from RTT[lane]
23662306a36Sopenharmony_ci9.  Release lane (and lane_lock)
23762306a36Sopenharmony_ci
23862306a36Sopenharmony_ciWrite:
23962306a36Sopenharmony_ci
24062306a36Sopenharmony_ci1.  Convert external LBA to Arena number + pre-map ABA
24162306a36Sopenharmony_ci2.  Get a lane (and take lane_lock)
24262306a36Sopenharmony_ci3.  Use lane to index into in-memory free list and obtain a new block, next flog
24362306a36Sopenharmony_ci    index, next sequence number
24462306a36Sopenharmony_ci4.  Scan the RTT to check if free block is present, and spin/wait if it is.
24562306a36Sopenharmony_ci5.  Write data to this free block
24662306a36Sopenharmony_ci6.  Read map to get the existing post-map ABA entry for this pre-map ABA
24762306a36Sopenharmony_ci7.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
24862306a36Sopenharmony_ci8.  Write new post-map ABA into map.
24962306a36Sopenharmony_ci9.  Write old post-map entry into the free list
25062306a36Sopenharmony_ci10. Calculate next sequence number and write into the free list entry
25162306a36Sopenharmony_ci11. Release lane (and lane_lock)
25262306a36Sopenharmony_ci
25362306a36Sopenharmony_ci
25462306a36Sopenharmony_ci4. Error Handling
25562306a36Sopenharmony_ci=================
25662306a36Sopenharmony_ci
25762306a36Sopenharmony_ciAn arena would be in an error state if any of the metadata is corrupted
25862306a36Sopenharmony_ciirrecoverably, either due to a bug or a media error. The following conditions
25962306a36Sopenharmony_ciindicate an error:
26062306a36Sopenharmony_ci
26162306a36Sopenharmony_ci- Info block checksum does not match (and recovering from the copy also fails)
26262306a36Sopenharmony_ci- All internal available blocks are not uniquely and entirely addressed by the
26362306a36Sopenharmony_ci  sum of mapped blocks and free blocks (from the BTT flog).
26462306a36Sopenharmony_ci- Rebuilding free list from the flog reveals missing/duplicate/impossible
26562306a36Sopenharmony_ci  entries
26662306a36Sopenharmony_ci- A map entry is out of bounds
26762306a36Sopenharmony_ci
26862306a36Sopenharmony_ciIf any of these error conditions are encountered, the arena is put into a read
26962306a36Sopenharmony_cionly state using a flag in the info block.
27062306a36Sopenharmony_ci
27162306a36Sopenharmony_ci
27262306a36Sopenharmony_ci5. Usage
27362306a36Sopenharmony_ci========
27462306a36Sopenharmony_ci
27562306a36Sopenharmony_ciThe BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
27662306a36Sopenharmony_ci(pmem, or blk mode). The easiest way to set up such a namespace is using the
27762306a36Sopenharmony_ci'ndctl' utility [1]:
27862306a36Sopenharmony_ci
27962306a36Sopenharmony_ciFor example, the ndctl command line to setup a btt with a 4k sector size is::
28062306a36Sopenharmony_ci
28162306a36Sopenharmony_ci    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
28262306a36Sopenharmony_ci
28362306a36Sopenharmony_ciSee ndctl create-namespace --help for more options.
28462306a36Sopenharmony_ci
28562306a36Sopenharmony_ci[1]: https://github.com/pmem/ndctl
286