==============
Data Integrity
==============

1. Introduction
===============

Modern filesystems feature checksumming of data and metadata to
protect against data corruption. However, the detection of the
corruption is done at read time which could potentially be months
after the data was written. At that point the original data that the
application tried to write is most likely lost.

The solution is to ensure that the disk is actually storing what the
application meant it to. Recent additions to both the SCSI family
protocols (SBC Data Integrity Field, SCC protection proposal) as well
as SATA/T13 (External Path Protection) try to remedy this by adding
support for appending integrity metadata to an I/O. The integrity
metadata (or protection information in SCSI terminology) includes a
checksum for each sector as well as an incrementing counter that
ensures the individual sectors are written in the right order. For
some protection schemes it also ensures that the I/O is written to
the right place on disk.

Current storage controllers and devices implement various protective
measures, for instance checksumming and scrubbing. But these
technologies are working in their own isolated domains or at best
between adjacent nodes in the I/O path.
The interesting thing about
DIF and the other integrity extensions is that the protection format
is well defined and every node in the I/O path can verify the
integrity of the I/O and reject it if corruption is detected. This
allows not only corruption prevention but also isolation of the point
of failure.

2. The Data Integrity Extensions
================================

As written, the protocol extensions only protect the path between
controller and storage device. However, many controllers actually
allow the operating system to interact with the integrity metadata
(IMD). We have been working with several FC/SAS HBA vendors to enable
the protection information to be transferred to and from their
controllers.

The SCSI Data Integrity Field works by appending 8 bytes of protection
information to each sector. The data + integrity metadata is stored
in 520 byte sectors on disk. Data + IMD are interleaved when
transferred between the controller and target. The T13 proposal is
similar.

Because it is highly inconvenient for operating systems to deal with
520 (and 4104) byte sectors, we approached several HBA vendors and
encouraged them to allow separation of the data and integrity metadata
scatter-gather lists.
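
The sector arithmetic above can be made concrete with a small userspace
sketch. The struct name, field names and helper functions are
hypothetical and exist only for illustration. The split into a 2-byte
guard tag, a 2-byte application tag and a 4-byte reference tag is the
commonly described DIF layout and is stated here as an assumption, since
this document only says that the tuple is 8 bytes long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the 8-byte DIF tuple. The fields are
 * kept as raw bytes so that the layout has no padding and does not
 * depend on host endianness.
 */
struct dif_tuple {
	uint8_t guard_tag[2];	/* checksum of the sector's data */
	uint8_t app_tag[2];	/* tag owned by the block device owner */
	uint8_t ref_tag[4];	/* incrementing per-sector counter */
};

static_assert(sizeof(struct dif_tuple) == 8, "DIF tuple must be 8 bytes");

/* Size of one sector once the tuple has been appended to the data. */
static size_t protected_sector_size(size_t data_bytes)
{
	return data_bytes + sizeof(struct dif_tuple);
}

/* Byte offset of sector n's tuple inside an interleaved data + IMD buffer. */
static size_t tuple_offset(size_t data_bytes, size_t n)
{
	return n * protected_sector_size(data_bytes) + data_bytes;
}
```

With 512 and 4096 bytes of data per sector, protected_sector_size()
yields the 520 and 4104 byte sectors mentioned above. tuple_offset()
shows why the interleaved format is awkward for an operating system:
the data of consecutive sectors is no longer contiguous in memory.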

The controller will interleave the buffers on write and split them on
read. This means that Linux can DMA the data buffers to and from
host memory without changes to the page cache.

Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
is somewhat heavy to compute in software. Benchmarks found that
calculating this checksum had a significant impact on system
performance for a number of workloads. Some controllers allow a
lighter-weight checksum to be used when interfacing with the operating
system. Emulex, for instance, supports the TCP/IP checksum instead.
The IP checksum received from the OS is converted to the 16-bit CRC
when writing and vice versa. This allows the integrity metadata to be
generated by Linux or the application at very low cost (comparable to
software RAID5).

The IP checksum is weaker than the CRC in terms of detecting bit
errors. However, the strength is really in the separation of the data
buffers and the integrity metadata. These two distinct buffers must
match up for an I/O to complete.

The separation of the data and integrity metadata buffers as well as
the choice in checksums is referred to as the Data Integrity
Extensions.
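
The cost difference between the two guard checksums discussed above can
be seen in a minimal userspace sketch. The function names are
hypothetical. The CRC parameters (polynomial 0x8BB7, initial value 0,
no bit reflection, no final XOR) are the ones commonly published for
the T10 DIF CRC and are an assumption as far as this document is
concerned. The bit-by-bit loop is deliberately naive and is not how a
kernel or controller would implement it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC16 using the T10 DIF polynomial: eight shifts per byte. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/*
 * Internet (TCP/IP) checksum: one's-complement sum of big-endian 16-bit
 * words, then complemented. One addition per word. len must be even,
 * which holds for sector-sized buffers.
 */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += (uint32_t)(buf[i] << 8 | buf[i + 1]);
	while (sum >> 16)		/* fold the carries back in */
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The IP checksum needs a single addition per 16-bit word, while the CRC
needs a shift and a conditional XOR per bit unless lookup tables or
dedicated instructions are used. That difference is the motivation for
letting the controller convert between the two.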
As these extensions are outside the scope of the protocol
bodies (T10, T13), Oracle and its partners are trying to standardize
them within the Storage Networking Industry Association.

3. Kernel Changes
=================

The data integrity framework in Linux enables protection information
to be pinned to I/Os and sent to/received from controllers that
support it.

The advantage to the integrity extensions in SCSI and SATA is that
they enable us to protect the entire path from application to storage
device. However, at the same time this is also the biggest
disadvantage. It means that the protection information must be in a
format that can be understood by the disk.

Generally Linux/POSIX applications are agnostic to the intricacies of
the storage devices they are accessing. The virtual filesystem switch
and the block layer make things like hardware sector size and
transport protocols completely transparent to the application.

However, this level of detail is required when preparing the
protection information to send to a disk. Consequently, the very
concept of an end-to-end protection scheme is a layering violation.
It is completely unreasonable for an application to be aware whether
it is accessing a SCSI or SATA disk.

The data integrity support implemented in Linux attempts to hide this
from the application. As far as the application (and to some extent
the kernel) is concerned, the integrity metadata is opaque information
that's attached to the I/O.

The current implementation allows the block layer to automatically
generate the protection information for any I/O. Eventually the
intent is to move the integrity metadata calculation to userspace for
user data. Metadata and other I/O that originates within the kernel
will still use the automatic generation interface.

Some storage devices allow each hardware sector to be tagged with a
16-bit value. The owner of this tag space is the owner of the block
device, i.e. the filesystem in most cases. The filesystem can use
this extra space to tag sectors as it sees fit. Because the tag
space is limited, the block interface allows tagging bigger chunks by
way of interleaving. This way, 8*16 bits of information can be
attached to a typical 4KB filesystem block.

This also means that applications such as fsck and mkfs will need
access to manipulate the tags from user space. A passthrough
interface for this is being worked on.


4. Block Layer Implementation Details
=====================================

4.1 Bio
-------

The data integrity patches add a new field to struct bio when
CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a
pointer to a struct bip which contains the bio integrity payload.
Essentially a bip is a trimmed down struct bio which holds a bio_vec
containing the integrity metadata and the required housekeeping
information (bvec pool, vector count, etc.).

A kernel subsystem can enable data integrity protection on a bio by
calling bio_integrity_alloc(bio). This will allocate and attach the
bip to the bio.

Individual pages containing integrity metadata can subsequently be
attached using bio_integrity_add_page().

bio_free() will automatically free the bip.


4.2 Block Device
----------------

Because the format of the protection data is tied to the physical
disk, each block device has been extended with a block integrity
profile (struct blk_integrity). This optional profile is registered
with the block layer using blk_integrity_register().

The profile contains callback functions for generating and verifying
the protection data, as well as getting and setting application tags.
The profile also contains a few constants to aid in completing,
merging and splitting the integrity metadata.

Layered block devices will need to pick a profile that's appropriate
for all subdevices. blk_integrity_compare() can help with that. DM
and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
will require extra work due to the application tag.


5.0 Block Layer Integrity API
=============================

5.1 Normal Filesystem
---------------------

    The normal filesystem is unaware that the underlying block device
    is capable of sending/receiving integrity metadata. The IMD will
    be automatically generated by the block layer at submit_bio() time
    in case of a WRITE. A READ request will cause the I/O integrity
    to be verified upon completion.

    IMD generation and verification can be toggled using the::

      /sys/block/<bdev>/integrity/write_generate

    and::

      /sys/block/<bdev>/integrity/read_verify

    flags.
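
As a rough illustration of how a userspace tool might flip these flags,
the sketch below writes to and reads from a flag file given by path.
The helper names are hypothetical, and the assumption that the flags
accept the values "0" and "1" is ours; it is not stated in this
document. Writing to the real sysfs files requires suitable privileges.

```c
#include <assert.h>
#include <stdio.h>

/* Write "1" or "0" to a flag file. Returns 0 on success, -1 on failure. */
static int set_integrity_flag(const char *path, int enable)
{
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(enable ? "1" : "0", f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/* Read a flag file. Returns 0 or 1, or -1 if it cannot be read or parsed. */
static int get_integrity_flag(const char *path)
{
	FILE *f;
	int value;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value < 0 ? -1 : (value != 0);
}
```

A caller would pass a path such as
/sys/block/<bdev>/integrity/read_verify with <bdev> replaced by the
actual device name, for example to disable verification temporarily
while benchmarking.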


5.2 Integrity-Aware Filesystem
------------------------------

    A filesystem that is integrity-aware can prepare I/Os with IMD
    attached. It can also use the application tag space if this is
    supported by the block device.


    `bool bio_integrity_prep(bio);`

      To generate IMD for WRITE and to set up buffers for READ, the
      filesystem must call bio_integrity_prep(bio).

      Prior to calling this function, the bio data direction and start
      sector must be set, and the bio should have all data pages
      added. It is up to the caller to ensure that the bio does not
      change while I/O is in progress.
      The bio is completed with an error if the preparation fails for
      some reason.


5.3 Passing Existing Integrity Metadata
---------------------------------------

    Filesystems that either generate their own integrity metadata or
    are capable of transferring IMD from user space can use the
    following calls:


    `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`

      Allocates the bio integrity payload and hangs it off of the bio.
      nr_pages indicates how many pages of protection data need to be
      stored in the integrity bio_vec list (similar to bio_alloc()).

      The integrity payload will be freed at bio_free() time.


    `int bio_integrity_add_page(bio, page, len, offset);`

      Attaches a page containing integrity metadata to an existing
      bio. The bio must have an existing bip,
      i.e. bio_integrity_alloc() must have been called. For a WRITE,
      the integrity metadata in the pages must be in a format
      understood by the target device with the notable exception that
      the sector numbers will be remapped as the request traverses the
      I/O stack. This implies that the pages added using this call
      will be modified during I/O! The first reference tag in the
      integrity metadata must have a value of bip->bip_sector.

      Pages can be added using bio_integrity_add_page() as long as
      there is room in the bip bio_vec array (nr_pages).

      Upon completion of a READ operation, the attached pages will
      contain the integrity metadata received from the storage device.
      It is up to the receiver to process them and verify data
      integrity upon completion.
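
The reference tag rules above (the first tag equals the starting
sector, and tags are rewritten as the request moves down the stack) can
be illustrated with a userspace simulation. This is not kernel code.
The function names are hypothetical, and the tuple layout (a 32-bit
big-endian reference tag at byte offset 4 of an 8-byte tuple) is an
assumption based on the usual description of DIF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE  8	/* bytes of integrity metadata per sector */
#define REF_TAG_OFF 4	/* assumed offset of the 32-bit reference tag */

static uint32_t get_ref_tag(const uint8_t *tuple)
{
	const uint8_t *p = tuple + REF_TAG_OFF;

	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

static void set_ref_tag(uint8_t *tuple, uint32_t v)
{
	uint8_t *p = tuple + REF_TAG_OFF;

	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Give nr consecutive tuples the reference tags seed, seed + 1, ... */
static void seed_ref_tags(uint8_t *imd, size_t nr, uint32_t seed)
{
	assert(imd != NULL);
	for (size_t i = 0; i < nr; i++)
		set_ref_tag(imd + i * TUPLE_SIZE, seed + (uint32_t)i);
}

/*
 * Rewrite tags in place from a virtual to a physical start sector.
 * Only tags that carry the expected virtual value are touched.
 * Returns the number of tuples that were remapped.
 */
static size_t remap_ref_tags(uint8_t *imd, size_t nr,
			     uint32_t virt, uint32_t phys)
{
	size_t remapped = 0;

	assert(imd != NULL);
	for (size_t i = 0; i < nr; i++) {
		uint8_t *t = imd + i * TUPLE_SIZE;

		if (get_ref_tag(t) == virt + (uint32_t)i) {
			set_ref_tag(t, phys + (uint32_t)i);
			remapped++;
		}
	}
	return remapped;
}
```

A filesystem-like caller would seed the tags with the starting sector
it sees, and a lower layer would remap them to the sector of the
underlying device. Because the rewrite happens in place, the buffer the
caller handed over no longer holds the values it wrote, which is the
practical meaning of the warning that the pages will be modified during
I/O.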


5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
--------------------------------------------------------------------------

    To enable integrity exchange on a block device the gendisk must be
    registered as capable:

    `int blk_integrity_register(gendisk, blk_integrity);`

      The blk_integrity struct is a template and should contain the
      following::

        static struct blk_integrity my_profile = {
            .name                   = "STANDARDSBODY-TYPE-VARIANT-CSUM",
            .generate_fn            = my_generate_fn,
            .verify_fn              = my_verify_fn,
            .tuple_size             = sizeof(struct my_tuple_size),
            .tag_size               = <tag bytes per hw sector>,
        };

      'name' is a text string which will be visible in sysfs. This is
      part of the userland API so choose it carefully and never change
      it. The format is standards body-type-variant.
      E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.

      'generate_fn' generates appropriate integrity metadata (for WRITE).

      'verify_fn' verifies that the data buffer matches the integrity
      metadata.

      'tuple_size' must be set to match the size of the integrity
      metadata per sector. I.e. 8 for DIF and EPP.

      'tag_size' must be set to identify how many bytes of tag space
      are available per hardware sector. For DIF this is either 2 or
      0 depending on the value of the Control Mode Page ATO bit.

----------------------------------------------------------------------

2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>