==============
Data Integrity
==============

1. Introduction
===============

Modern filesystems feature checksumming of data and metadata to
protect against data corruption.  However, the detection of the
corruption is done at read time which could potentially be months
after the data was written.  At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to.  Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O.  The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order.  For
some protection schemes, it also ensures that the I/O is written to
the right place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing.  But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path.  The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected.  This
allows not only corruption prevention but also isolation of the point
of failure.

2. The Data Integrity Extensions
================================

As written, the protocol extensions only protect the path between
controller and storage device.  However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD).  We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector.  The data + integrity metadata is stored
in 520 byte sectors on disk.  Data + IMD are interleaved when
transferred between the controller and target.  The T13 proposal is
similar.
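
As a rough illustration of the layout described above, the 8 bytes of
protection information can be modelled in plain userspace C.  The field
names follow common T10 terminology (guard, application and reference
tag); the names ``dif_tuple`` and ``protected_sector_size()`` are
hypothetical and are not kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of one 8-byte DIF tuple.  The fields
 * are big-endian on the wire, so they are kept as raw bytes here.
 */
struct dif_tuple {
	uint8_t guard_tag[2];	/* checksum of the sector's data */
	uint8_t app_tag[2];	/* opaque tag, see the tag discussion later */
	uint8_t ref_tag[4];	/* incrementing counter */
};

/* Size of a sector once the protection information is appended. */
static size_t protected_sector_size(size_t data_bytes)
{
	return data_bytes + sizeof(struct dif_tuple);
}
```

With this model a 512-byte sector grows to 520 bytes and a 4096-byte
sector to 4104 bytes, matching the sizes quoted in this section.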

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.

The controller will interleave the buffers on write and split them on
read.  This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.
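
The interleave and split steps are performed by the controller, not by
Linux, but a small userspace sketch may make the buffer relationship
easier to picture.  It assumes 512-byte sectors with 8 bytes of IMD
each; ``interleave()`` and ``split()`` are illustrative names only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE	512	/* assumed data bytes per sector */
#define TUPLE_SIZE	8	/* protection information per sector */

/* Write path: merge separate data and IMD buffers into 520-byte sectors. */
static void interleave(uint8_t *wire, const uint8_t *data,
		       const uint8_t *imd, size_t nr_sectors)
{
	for (size_t i = 0; i < nr_sectors; i++) {
		memcpy(wire, data + i * SECTOR_SIZE, SECTOR_SIZE);
		memcpy(wire + SECTOR_SIZE, imd + i * TUPLE_SIZE, TUPLE_SIZE);
		wire += SECTOR_SIZE + TUPLE_SIZE;
	}
}

/* Read path: split 520-byte sectors back into the two buffers. */
static void split(const uint8_t *wire, uint8_t *data,
		  uint8_t *imd, size_t nr_sectors)
{
	for (size_t i = 0; i < nr_sectors; i++) {
		memcpy(data + i * SECTOR_SIZE, wire, SECTOR_SIZE);
		memcpy(imd + i * TUPLE_SIZE, wire + SECTOR_SIZE, TUPLE_SIZE);
		wire += SECTOR_SIZE + TUPLE_SIZE;
	}
}
```

The data buffer keeps its familiar power-of-two sector size on the host
side, which is the property the paragraph above relies on.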

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software.  Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads.  Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system.  Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa.  This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).
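
To give a feel for the cost difference, here is a minimal userspace
sketch of both checksums.  It assumes the T10 DIF CRC parameters
(polynomial 0x8BB7, initial value 0, no bit reflection) and the
Internet checksum as described in RFC 1071.  The function names are
illustrative, the CRC is a deliberately simple bit-by-bit version, and
neither function is the kernel's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16 with the T10 DIF polynomial, computed one bit at a time. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)((uint16_t)buf[i] << 8);
		for (int bit = 0; bit < 8; bit++) {
			if (crc & 0x8000)
				crc = (uint16_t)((crc << 1) ^ 0x8BB7);
			else
				crc = (uint16_t)(crc << 1);
		}
	}
	return crc;
}

/*
 * Internet checksum: one's complement of the one's-complement sum of
 * big-endian 16-bit words.  len is assumed to be even, which holds
 * for whole sectors.
 */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	while (sum >> 16)
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The CRC needs shift and XOR work for every bit (or a lookup table),
while the IP checksum is a plain sum over 16-bit words, which is why it
is so much cheaper to compute in software.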

The IP checksum is weaker than the CRC in terms of detecting bit
errors.  However, the strength is really in the separation of the data
buffers and the integrity metadata.  These two distinct buffers must
match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions.  As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

3. Kernel Changes
=================

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage of the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device.  However, at the same time this is also the biggest
disadvantage.  It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing.  The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk.  Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application.  As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O.  Eventually the
intent is to move the integrity metadata calculation to userspace for
user data.  Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value.  The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases.  The filesystem can use
this extra space to tag sectors as it sees fit.  Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving.  This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block (eight 512-byte sectors).
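
The interleaving can be pictured with a short userspace sketch.  It
assumes 8-byte tuples in which the 16-bit tag follows a 2-byte
checksum, and it spreads a 16-byte tag over the eight tuples that cover
one 4KB block.  ``set_block_tag()`` and ``get_block_tag()`` are
hypothetical helpers, not part of the block layer interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TUPLE_SIZE	8	/* bytes of protection information per sector */
#define APP_TAG_OFFSET	2	/* assumed: tag follows the 2-byte checksum */
#define APP_TAG_SIZE	2	/* 16 bits of tag space per sector */

/* Scatter a tag of nr_sectors * 2 bytes across consecutive tuples. */
static void set_block_tag(uint8_t *imd, const uint8_t *tag, size_t nr_sectors)
{
	for (size_t i = 0; i < nr_sectors; i++)
		memcpy(imd + i * TUPLE_SIZE + APP_TAG_OFFSET,
		       tag + i * APP_TAG_SIZE, APP_TAG_SIZE);
}

/* Gather the tag bytes back out of consecutive tuples. */
static void get_block_tag(const uint8_t *imd, uint8_t *tag, size_t nr_sectors)
{
	for (size_t i = 0; i < nr_sectors; i++)
		memcpy(tag + i * APP_TAG_SIZE,
		       imd + i * TUPLE_SIZE + APP_TAG_OFFSET, APP_TAG_SIZE);
}
```

Calling ``set_block_tag(imd, tag, 8)`` stores 8*16 = 128 bits for one
4KB block while leaving the other bytes of each tuple untouched.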

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space.  A passthrough
interface for this is being worked on.


4. Block Layer Implementation Details
=====================================

4.1 Bio
-------

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled.  bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.).

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio).  This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 Block Device
----------------

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity).  This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices.  blk_integrity_compare() can help with that.  DM
and MD linear, RAID0 and RAID1 are currently supported.  RAID4/5/6
will require extra work due to the application tag.
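
The idea of picking a profile that suits all subdevices can be sketched
in userspace as follows.  This is only a model: ``profile_model``
mirrors the name, tuple size and tag size fields mentioned in this
document, and both helpers are hypothetical.  The checks performed by
the kernel's blk_integrity_compare() may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of a few profile fields. */
struct profile_model {
	const char *name;
	unsigned int tuple_size;
	unsigned int tag_size;
};

/* Two profiles are treated as compatible only if every field agrees. */
static bool profiles_match(const struct profile_model *a,
			   const struct profile_model *b)
{
	return a->tuple_size == b->tuple_size &&
	       a->tag_size == b->tag_size &&
	       strcmp(a->name, b->name) == 0;
}

/* A stacked device can only pass IMD through if all members agree. */
static bool stack_supports_integrity(const struct profile_model *members,
				     size_t nr)
{
	for (size_t i = 1; i < nr; i++)
		if (!profiles_match(&members[0], &members[i]))
			return false;
	return nr > 0;
}
```

In this model a mirror built from one disk that exposes tag space and
one that does not would fall back to running without integrity
metadata.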


5. Block Layer Integrity API
============================

5.1 Normal Filesystem
---------------------

    The normal filesystem is unaware that the underlying block device
    is capable of sending/receiving integrity metadata.  The IMD will
    be automatically generated by the block layer at submit_bio() time
    in case of a WRITE.  A READ request will cause the I/O integrity
    to be verified upon completion.

    IMD generation and verification can be toggled using the::

      /sys/block/<bdev>/integrity/write_generate

    and::

      /sys/block/<bdev>/integrity/read_verify

    flags.
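
A management tool could flip these flags from C along the following
lines.  This is a minimal sketch that assumes the flags accept "0" and
"1"; ``set_flag()`` is a hypothetical helper, and writing to sysfs
normally requires root privileges.

```c
#include <stdio.h>

/*
 * Write "1" or "0" to a sysfs-style flag file.
 * Returns 0 on success and -1 on failure.
 */
static int set_flag(const char *path, int enable)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", enable ? 1 : 0) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

For example, ``set_flag("/sys/block/sda/integrity/read_verify", 0)``
would ask the block layer to stop verifying reads on ``sda``, where
``sda`` is only a placeholder device name.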


5.2 Integrity-Aware Filesystem
------------------------------

    A filesystem that is integrity-aware can prepare I/Os with IMD
    attached.  It can also use the application tag space if this is
    supported by the block device.


    `bool bio_integrity_prep(bio);`

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).

      Prior to calling this function, the bio data direction and start
      sector must be set, and the bio should have all data pages
      added.  It is up to the caller to ensure that the bio does not
      change while I/O is in progress.
      The bio is completed with an error if preparation fails for
      some reason.


5.3 Passing Existing Integrity Metadata
---------------------------------------

    Filesystems that either generate their own integrity metadata or
    are capable of transferring IMD from user space can use the
    following calls:


    `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`

      Allocates the bio integrity payload and hangs it off of the bio.
      nr_pages indicates how many pages of protection data need to be
      stored in the integrity bio_vec list (similar to bio_alloc()).

      The integrity payload will be freed at bio_free() time.


    `int bio_integrity_add_page(bio, page, len, offset);`

      Attaches a page containing integrity metadata to an existing
      bio.  The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called.  For a WRITE,
      the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception that
      the sector numbers will be remapped as the request traverses the
      I/O stack.  This implies that the pages added using this call
      will be modified during I/O!  The first reference tag in the
      integrity metadata must have a value of bip->bip_sector.

      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).

      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage device.
      It is up to the receiver to process them and verify data
      integrity upon completion.
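
The remapping of sector numbers mentioned above is the reason the
attached pages get modified.  The following userspace sketch shows the
general idea for 8-byte tuples whose last four bytes hold a big-endian
reference tag.  The layout, the helper names and the exact matching
rule are assumptions made for illustration; the kernel's remapping code
may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE	8	/* bytes of protection information per sector */
#define REF_TAG_OFFSET	4	/* assumed: ref tag occupies the last 4 bytes */

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Rewrite reference tags from the submitter's view (virt) to the
 * device's view (phys).  A tuple whose tag does not hold the expected
 * virtual value is left alone so that a later check can flag it.
 */
static void remap_ref_tags(uint8_t *imd, size_t nr_sectors,
			   uint32_t virt, uint32_t phys)
{
	for (size_t i = 0; i < nr_sectors; i++, virt++, phys++) {
		uint8_t *ref = imd + i * TUPLE_SIZE + REF_TAG_OFFSET;

		if (get_be32(ref) == virt)
			put_be32(ref, phys);
	}
}
```

A submitter that numbered its tuples from 100 on a partition starting
at device sector 2048 would, in this model, see the tags rewritten to
start at 2148 on the way down the stack.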


5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
--------------------------------------------------------------------------

    To enable integrity exchange on a block device the gendisk must be
    registered as capable:

    `int blk_integrity_register(gendisk, blk_integrity);`

      The blk_integrity struct is a template and should contain the
      following::

        static struct blk_integrity my_profile = {
            .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn            = my_generate_fn,
            .verify_fn              = my_verify_fn,
            .tuple_size             = sizeof(struct my_tuple_size),
            .tag_size               = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs.  This is
      part of the userland API so choose it carefully and never change
      it.  The format is standards body-type-variant.
      E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

      'generate_fn' generates appropriate integrity metadata (for WRITE).

      'verify_fn' verifies that the data buffer matches the integrity
      metadata.

      'tuple_size' must be set to match the size of the integrity
      metadata per sector.  I.e. 8 for DIF and EPP.

      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector.  For DIF this is either 2 or
      0 depending on the value of the Control Mode Page ATO bit.
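
To make the roles of 'generate_fn' and 'verify_fn' more concrete, here
is a userspace sketch of what such a pair might compute for a
DIF-style format.  It assumes 512-byte sectors, an 8-byte tuple made of
a 16-bit CRC, a 16-bit tag left at zero and a 32-bit big-endian
reference tag taken from the sector number.  The function names and
signatures are hypothetical and are not the callback prototypes used by
the kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE	512	/* assumed data bytes per sector */
#define TUPLE_SIZE	8	/* matches tuple_size for DIF */

/* Bit-by-bit CRC-16 with the T10 DIF polynomial 0x8BB7. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)((uint16_t)buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/* WRITE side: fill in one tuple per sector of data. */
static void sketch_generate(const uint8_t *data, uint8_t *imd,
			    size_t nr_sectors, uint64_t sector)
{
	for (size_t i = 0; i < nr_sectors; i++, sector++) {
		uint8_t *t = imd + i * TUPLE_SIZE;
		uint16_t guard = crc16_t10dif(data + i * SECTOR_SIZE,
					      SECTOR_SIZE);
		uint32_t ref = (uint32_t)sector;

		t[0] = (uint8_t)(guard >> 8);
		t[1] = (uint8_t)guard;
		t[2] = 0;			/* tag space left unused */
		t[3] = 0;
		t[4] = (uint8_t)(ref >> 24);
		t[5] = (uint8_t)(ref >> 16);
		t[6] = (uint8_t)(ref >> 8);
		t[7] = (uint8_t)ref;
	}
}

/* READ side: return 0 if all tuples match, -1 on the first mismatch. */
static int sketch_verify(const uint8_t *data, const uint8_t *imd,
			 size_t nr_sectors, uint64_t sector)
{
	for (size_t i = 0; i < nr_sectors; i++, sector++) {
		const uint8_t *t = imd + i * TUPLE_SIZE;
		uint16_t guard = (uint16_t)((t[0] << 8) | t[1]);
		uint32_t ref = ((uint32_t)t[4] << 24) | ((uint32_t)t[5] << 16) |
			       ((uint32_t)t[6] << 8) | t[7];

		if (guard != crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE))
			return -1;
		if (ref != (uint32_t)sector)
			return -1;
	}
	return 0;
}
```

In this sketch a flipped data bit fails the checksum comparison, and
data that lands at the wrong sector fails the reference tag comparison.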

----------------------------------------------------------------------

2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>