Fio End-to-End Data Protection, Part 1: Background
This blog post is co-authored by Ankit Kumar and Vincent Fu. We are grateful to Klaus Jensen, Adam Manzanares, Jinhwan Park, Kanchan Joshi, Krishna Kanth Reddy, and Dan Helmick for their support and feedback.
Recently we added NVMe end to end data protection (E2EDP) support to fio via enhancements to the io_uring_cmd ioengine. This blog post is Part 1 of a series and provides an introduction to NVMe E2EDP. We borrow liberally from the text and figures in version 1.0c of the NVM command set specification, seeking to present the information in a structured manner with helpful context. Part 2 of this series will provide examples that exercise this feature in fio and describe related tools.
E2EDP provides a means to assess data integrity from application to the NVM media and back to the application. This optional mechanism adds protection information to the logical block data that may be evaluated by the controller and application to assess the integrity of the logical block data.
Five parameters characterize the on-device format for E2EDP. These are (1) whether the metadata is contiguous with the logical block data or stored in a separate buffer; (2) the size of the metadata buffer for each LBA; (3) the Guard Protection Information format; (4) whether the protection information is stored in the first or last bytes of the metadata buffer; and (5) Protection Information Type 1, 2, or 3. Below we describe each of these parameters in more detail.
First is the distinction between Data Integrity Field (DIF) and Data Integrity Extension (DIX). Using the same terminology as SCSI, DIF configures the metadata to be contiguous with the logical block data (Figure 1a) whereas DIX configures metadata and logical block data to be in separate buffers (Figure 1b). The choice between DIF and DIX is made at the time the NVMe namespace is formatted.
Figure 1: DIF and DIX Configuration of Protection Information
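As a rough illustration of the two layouts in Figure 1, the following Python sketch builds host buffers for a DIF-style (contiguous) transfer and a DIX-style (separate buffer) transfer. The function names and the toy sizes are assumptions made for this example; they are not part of fio or the NVMe specification.

```python
def build_dif_buffer(blocks, metadata):
    """Interleave each logical block with its metadata (contiguous layout)."""
    if len(blocks) != len(metadata):
        raise ValueError("need one metadata buffer per logical block")
    return b"".join(data + md for data, md in zip(blocks, metadata))


def build_dix_buffers(blocks, metadata):
    """Keep logical block data and metadata in two separate buffers."""
    if len(blocks) != len(metadata):
        raise ValueError("need one metadata buffer per logical block")
    return b"".join(blocks), b"".join(metadata)


# Toy example: two 512-byte blocks, each with 8 bytes of metadata.
blocks = [bytes([0xAA]) * 512, bytes([0xBB]) * 512]
metadata = [bytes([0x01]) * 8, bytes([0x02]) * 8]

dif = build_dif_buffer(blocks, metadata)
dix_data, dix_md = build_dix_buffers(blocks, metadata)

print(len(dif))                    # one buffer holding 2 * (512 + 8) bytes
print(len(dix_data), len(dix_md))  # two buffers: 2 * 512 bytes and 2 * 8 bytes
```

In the contiguous layout the metadata for the first block sits at byte offsets 512 through 519 of the single buffer, whereas in the separate layout it sits at the start of the metadata buffer.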
Second, the on-device format also determines the size of the metadata buffer for each LBA. Supported metadata buffer sizes can vary between devices. Below is an example of the LBA formats (LBAFs) available by default for a QEMU v8.1 emulated NVMe device, as displayed by nvme-cli:
root@localhost:~# nvme id-ns -H /dev/ng0n1
NVME Identify Namespace 1:
...
nguid : 00000000000000000000000000000000
eui64 : 0000000000000000
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 1 : Metadata Size: 8 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 2 : Metadata Size: 16 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 3 : Metadata Size: 64 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 4 : Metadata Size: 0 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format 5 : Metadata Size: 8 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format 6 : Metadata Size: 16 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format 7 : Metadata Size: 64 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
For this device, the metadata buffer varies in size from 8 bytes to 64 bytes for LBAFs supporting E2EDP. The metadata buffer includes protection information fields as specified by the NVMe standard. The metadata buffer also optionally includes additional host metadata if the metadata buffer size is larger than the protection information size.
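To work with such a listing programmatically, a small parser can pull out the metadata and data sizes for each LBAF. The sketch below is a hypothetical helper (the name `parse_lbafs` and the regular expression are our own) that assumes the human-readable line layout shown above; nvme-cli can also emit structured output, which would be more robust in practice.

```python
import re

# Matches lines such as:
# "LBA Format 1 : Metadata Size: 8 bytes - Data Size: 512 bytes - ..."
LBAF_RE = re.compile(
    r"LBA Format\s+(\d+)\s*:\s*Metadata Size:\s*(\d+)\s+bytes\s+-\s+"
    r"Data Size:\s*(\d+)\s+bytes"
)


def parse_lbafs(text):
    """Return one dict per LBA format line found in the given text."""
    formats = []
    for line in text.splitlines():
        match = LBAF_RE.search(line)
        if match:
            index, md_size, data_size = (int(g) for g in match.groups())
            formats.append({
                "index": index,
                "metadata": md_size,
                "data": data_size,
                "in_use": "(in use)" in line,
            })
    return formats


SAMPLE = """\
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 1 : Metadata Size: 8 bytes - Data Size: 512 bytes - Relative Performance: 0 Best
LBA Format 7 : Metadata Size: 64 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best (in use)
"""

lbafs = parse_lbafs(SAMPLE)
# Only formats with a non-zero metadata size can carry protection information.
with_metadata = [f["index"] for f in lbafs if f["metadata"] > 0]
print(with_metadata)
```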
The third parameter is the protection information format and is denoted by the number of bits used for the CRC value calculated for each LBA. This is determined by the LBA format chosen for the NVMe namespace. The format also determines how the protection information is split into components. Each format is specified by the number of bits in the Guard field (which contains the computed CRC value) as described in Table 1 below.
Protection Information Format | Total Protection Information Size | Guard Protection Information Size | Application Tag Field Size | Storage and Reference Tag Field Size | Notes |
---|---|---|---|---|---|
16b Guard Protection Information | 8 bytes | 16 bits / 2 bytes | 2 bytes | 4 bytes | |
32b Guard Protection Information | 16 bytes | 32 bits / 4 bytes | 2 bytes | 10 bytes | only available for LBA formats where the LBA data size is 4KiB or greater |
64b Guard Protection Information | 16 bytes | 64 bits / 8 bytes | 2 bytes | 6 bytes | only available for LBA formats where the LBA data size is 4KiB or greater |
The protection information formats have 16, 32, or 64 bits for the Guard field. 32b and 64b Guard protection information formats are only available for LBA formats where the LBA data size is 4KiB or greater. The total protection information size is 8 bytes for the format with a 16b Guard field but 16 bytes for formats with 32b and 64b Guard fields. All of the formats reserve 2 bytes for the Application Tag. The Storage and Reference Tag Field sizes vary from 4 to 10 bytes among the three formats. How these bytes are split between Storage and Reference Tags depends on the LBA format. For the sake of simplicity, the remainder of this discussion will consider only the Reference Tag. The Storage Tag is an optional feature that will be omitted from the following discussion.
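The field sizes in Table 1 can be checked with simple arithmetic, and the 16b Guard format is small enough to pack by hand. The sketch below is illustrative only: the helper names are hypothetical, and the big-endian field order (Guard, Application Tag, Reference Tag) is an assumption based on our reading of the specification that should be verified against it. The Storage Tag is ignored, as in the rest of this discussion.

```python
import struct

# Field sizes in bytes, copied from Table 1.
PI_FORMATS = {
    "16b": {"total": 8, "guard": 2, "app_tag": 2, "storage_ref_tag": 4},
    "32b": {"total": 16, "guard": 4, "app_tag": 2, "storage_ref_tag": 10},
    "64b": {"total": 16, "guard": 8, "app_tag": 2, "storage_ref_tag": 6},
}

# The three fields account for the whole protection information size.
for name, f in PI_FORMATS.items():
    assert f["guard"] + f["app_tag"] + f["storage_ref_tag"] == f["total"], name


def pack_pi_16b(guard, app_tag, ref_tag):
    """Pack a 16b Guard PI tuple (assumed big-endian: guard, app tag, ref tag)."""
    return struct.pack(">HHI", guard, app_tag, ref_tag)


def unpack_pi_16b(pi):
    """Inverse of pack_pi_16b; returns (guard, app_tag, ref_tag)."""
    return struct.unpack(">HHI", pi)


pi = pack_pi_16b(guard=0x1234, app_tag=0x0000, ref_tag=0x00000010)
print(pi.hex())  # 1234000000000010
```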
The fourth parameter describing the on-device format specifies the location of the protection information within the metadata buffer. The protection information resides either in the first or last bytes of the metadata buffer. Note that the location of the protection information has consequences for the calculation of the Guard PI value.
- If the PI resides in the last bytes of metadata as in Figure 2a, the CRC covers logical block data as well as the portion of the metadata buffer excluding the protection information.
- If the PI resides in the first bytes of metadata as in Figure 2b, the CRC covers only the logical block data.
Figure 2: Protection Information Inside Metadata Buffer
Note that compliance with version 1.0 or later of the NVM Command Set Specification requires protection information to be in the last bytes of the metadata buffer as depicted in Figure 2a.
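To illustrate how the PI location changes what the Guard covers, here is a minimal sketch for the 16b Guard format. It assumes the Guard is the CRC-16 used by T10 DIF (polynomial 0x8BB7, zero initial value, no bit reflection), which matches our understanding of the specification; the function names and the `pi_first` parameter are hypothetical.

```python
def crc16_t10dif(data, crc=0):
    """Bitwise CRC-16 with polynomial 0x8BB7, processed MSB first."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def guard_16b(block, metadata, pi_size=8, pi_first=False):
    """Compute the Guard over the bytes it covers for the given PI location."""
    if pi_first:
        # PI in the first bytes of metadata: only logical block data is covered.
        covered = block
    else:
        # PI in the last bytes: block data plus the metadata that precedes the PI.
        covered = block + metadata[: len(metadata) - pi_size]
    return crc16_t10dif(covered)


block = bytes(range(256)) * 2   # one 512-byte logical block
metadata = bytes(16)            # 8 bytes of host metadata + 8 bytes of PI

print(hex(guard_16b(block, metadata, pi_first=False)))
print(hex(guard_16b(block, metadata, pi_first=True)))
```

When the metadata size equals the PI size there is no additional metadata to cover, so both locations yield a CRC over the logical block data alone.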
Finally, the fifth parameter describing the on-device format is the E2EDP Type which governs how the Reference Tag portion of the protection information is used. These are numbered as Type 1, Type 2, and Type 3, as described in Table 2 below.
End to End Data Protection Type | Reference Tag Requirements |
---|---|
Type 1 | Reference Tag value is least significant bytes of the LBA (the number of bytes depends on the Reference Tag size) |
Type 2 | Reference Tag value specified by user |
Type 3 | Reference Tag not checked |
If the namespace is formatted with Type 1 E2EDP, then the Reference Tag for each LBA is the least significant bytes of the LBA. For Type 2 E2EDP, the Reference Tag value is specified by the user. For each Type 1 and Type 2 write operation, the application provides an Initial Logical Block Reference Tag (ILBRT) which is used for the starting LBA of the write (for Type 1 this must be the least significant bytes of the SLBA; for Type 2 this value is selected by the application). For each subsequent LBA of the write operation, the Reference Tag value is incremented by one. For read operations with Type 2 E2EDP, the application provides an Expected Initial Logical Block Reference Tag (EILBRT) which is compared to the stored Reference Tag for the first LBA of the read operation. For each subsequent LBA of the read request, the expected Reference Tag is incremented by one. Finally, for Type 3 E2EDP, the Reference Tag is not checked.
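The Reference Tag rules above can be summarized in a few lines of code. This is a sketch under stated assumptions: the function name is hypothetical, the Reference Tag is taken to be 4 bytes wide (as in the 16b Guard format with no Storage Tag), and the increment is assumed to wrap around at the width of the Reference Tag.

```python
def expected_ref_tags(pi_type, slba, nlb, ilbrt=None, ref_tag_bytes=4):
    """Return the Reference Tag expected for each LBA of a command.

    Returns None entries for Type 3, where the Reference Tag is not checked.
    """
    mask = (1 << (8 * ref_tag_bytes)) - 1
    if pi_type == 3:
        return [None] * nlb
    if pi_type == 1:
        # Type 1: the initial tag must be the least significant bytes of the SLBA.
        first = slba & mask
        if ilbrt is not None and ilbrt != first:
            raise ValueError("Type 1 ILBRT must match the low bytes of the SLBA")
    elif pi_type == 2:
        # Type 2: the initial tag is chosen by the application.
        if ilbrt is None:
            raise ValueError("Type 2 requires an application-supplied ILBRT")
        first = ilbrt & mask
    else:
        raise ValueError("pi_type must be 1, 2, or 3")
    # Each subsequent LBA increments the tag by one (wrap-around is assumed).
    return [(first + i) & mask for i in range(nlb)]


print(expected_ref_tags(1, slba=0x100000005, nlb=3))
print(expected_ref_tags(2, slba=0, nlb=2, ilbrt=0xFFFFFFFF))
print(expected_ref_tags(3, slba=0, nlb=2))
```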
These Protection Types are the same as those defined in the SCSI protection information model specified in SBC-3.
The previous sections describe how the LBA format determines aspects of protection information for the NVMe device. This section describes how protection information handling is controlled by the PRACT and PRCHK fields of NVMe read and write commands.
The PRACT bit determines whether the controller or the host is responsible for computing the protection information fields in the metadata buffer. The effect of this bit also depends on whether the size of the metadata buffer is the same as or exceeds the size of the protection information for each LBA. Table 3 below summarizes how the PRACT bit affects protection information processing.
PRACT | Metadata Size Equals Protection Information Size: Transferred Between Host Buffer and Controller | Metadata Size Equals Protection Information Size: Generates PI for Write Commands | Metadata Size Exceeds Protection Information Size: Transferred Between Host Buffer and Controller | Metadata Size Exceeds Protection Information Size: Generates PI for Write Commands |
---|---|---|---|---|
PRACT=0 | Logical block data and metadata | Host | Logical block data and metadata | Host |
PRACT=1 | Logical block data only | Controller | Logical block data and metadata | Controller |
When the PRACT bit is set to 0, the host is responsible for generating protection information for write commands. When the PRACT bit is set to 1, this is the responsibility of the controller, but there are a few wrinkles. If PRACT=1 and the metadata buffer size equals the protection information size (e.g., namespace is formatted with 8 bytes of metadata per LBA and uses 16b Guard PI with a PI size of 8 bytes), then E2EDP happens completely behind the scenes and no metadata is transferred between the host and the controller at all. In all other cases, both logical block data and metadata are transferred between the host buffer and controller. With PRACT=1 and the metadata buffer size exceeding protection information size, the controller overwrites the original contents of the protection information in the metadata buffer when committing to NVM.
For read commands both the logical block data and metadata are transferred from the device to the host buffer except when PRACT=1 and protection information size equals metadata size. When the metadata is transferred to the host, the application also may use it to check data integrity.
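The PRACT rules for write commands reduce to a small decision function. The sketch below simply encodes Table 3; the function name and the returned strings are our own choices for illustration.

```python
def write_transfer_plan(pract, metadata_size, pi_size):
    """Return (what is transferred, who generates PI) for a write command."""
    if metadata_size < pi_size:
        raise ValueError("metadata buffer is too small to hold the PI")
    pi_generator = "controller" if pract else "host"
    if pract and metadata_size == pi_size:
        # The controller inserts the PI itself; no metadata crosses the bus.
        transferred = "logical block data only"
    else:
        transferred = "logical block data and metadata"
    return transferred, pi_generator


# 16b Guard PI (8 bytes) on namespaces with 8 and 64 bytes of metadata.
for pract in (False, True):
    for metadata_size in (8, 64):
        print(pract, metadata_size, write_transfer_plan(pract, metadata_size, 8))
```

The first element of the result also describes what a read command returns to the host, since metadata is omitted only when PRACT=1 and the metadata size equals the PI size.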
The PRCHK field consists of three bits that determine whether the controller checks the Guard, Application Tag, and Reference Tag fields of the protection information format. Table 4 below outlines the effects of the PRCHK bits.
PRCHK bit | Effect |
---|---|
Guard Check | Setting this bit enables protection information checking of the Guard field. The controller computes the CRC for the logical block data (and, depending on the PI location, the metadata) and compares the computed value with the value in the Guard field. |
Application Tag Check | Setting this bit enables protection information checking of the Application Tag field. The controller compares the unmasked bits in the protection information Application Tag field to the Logical Block Application Tag (LBAT) field in the command. Whether a bit is masked or unmasked is determined by the Logical Block Application Tag Mask (LBATM) field in the command. |
Reference Tag Check | Setting this bit enables reference tag checking for namespaces formatted with Type 1 and Type 2 protection. Setting this bit has no effect for Type 3 protection. |
Finally, note that sentinel values exist for the protection information fields, and they depend on the E2EDP Type. For Type 1 and Type 2 protection, if the protection information Application Tag has a value of FFFFh, then all protection information checks are disabled regardless of the value of the PRCHK field. For Type 3 protection, if the Application Tag and Reference Tag both have all bits set to 1, then all protection information checks are disabled regardless of the value of the PRCHK field.
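The interaction between the PRCHK bits, the protection Type, and the sentinel values can be sketched as follows. All names here are hypothetical. Two details are assumptions based on our reading of the specification rather than statements from the text above: the order of the bits within the 3-bit PRCHK field, and the convention that a mask bit set to 1 in LBATM means the corresponding Application Tag bit is compared. A 4-byte Reference Tag is assumed.

```python
# Assumed bit assignment within the 3-bit PRCHK field.
PRCHK_REF = 0x1
PRCHK_APP = 0x2
PRCHK_GUARD = 0x4


def checks_disabled_by_sentinel(pi_type, app_tag, ref_tag, ref_tag_bytes=4):
    """Return True if the stored PI carries the sentinel that disables checks."""
    all_ones_ref = (1 << (8 * ref_tag_bytes)) - 1
    if pi_type in (1, 2):
        return app_tag == 0xFFFF
    if pi_type == 3:
        return app_tag == 0xFFFF and ref_tag == all_ones_ref
    raise ValueError("pi_type must be 1, 2, or 3")


def app_tag_matches(stored, lbat, lbatm):
    """Compare only the Application Tag bits selected by the mask (assumed 1 = compare)."""
    return (stored & lbatm) == (lbat & lbatm)


def active_checks(pi_type, prchk, app_tag, ref_tag):
    """Return the set of checks the controller would perform for one LBA."""
    if checks_disabled_by_sentinel(pi_type, app_tag, ref_tag):
        return set()
    checks = set()
    if prchk & PRCHK_GUARD:
        checks.add("guard")
    if prchk & PRCHK_APP:
        checks.add("app_tag")
    # The Reference Tag check bit has no effect for Type 3 protection.
    if prchk & PRCHK_REF and pi_type in (1, 2):
        checks.add("ref_tag")
    return checks


print(sorted(active_checks(1, 0x7, app_tag=0x1234, ref_tag=5)))
print(sorted(active_checks(1, 0x7, app_tag=0xFFFF, ref_tag=5)))
print(sorted(active_checks(3, 0x7, app_tag=0xFFFF, ref_tag=5)))
```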
To make the above more concrete, let us describe the behavior of a write command with E2EDP enabled. The command's dword 12 includes a single bit for protection information action (PRACT) and 3 bits for protection information check (PRCHK). The bits of PRCHK enable guard, application tag and reference tag checking.
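As a minimal sketch of how these fields sit in command dword 12, the snippet below encodes the number of logical blocks together with PRACT and the three PRCHK bits. The bit positions are assumptions taken from our reading of the NVM Command Set Specification and should be confirmed there; the function name is hypothetical and all other fields of dword 12 are left at zero.

```python
# Assumed bit positions within command dword 12.
NLB_MASK = 0xFFFF       # bits 15:0, number of logical blocks (zero-based)
PRCHK_REF_BIT = 26      # reference tag check
PRCHK_APP_BIT = 27      # application tag check
PRCHK_GUARD_BIT = 28    # guard check
PRACT_BIT = 29          # protection information action


def encode_cdw12(nlb, pract, chk_guard, chk_app, chk_ref):
    """Build dword 12 for a read or write of nlb logical blocks."""
    if not 1 <= nlb <= 0x10000:
        raise ValueError("nlb must be between 1 and 65536")
    cdw12 = (nlb - 1) & NLB_MASK
    cdw12 |= int(bool(pract)) << PRACT_BIT
    cdw12 |= int(bool(chk_guard)) << PRCHK_GUARD_BIT
    cdw12 |= int(bool(chk_app)) << PRCHK_APP_BIT
    cdw12 |= int(bool(chk_ref)) << PRCHK_REF_BIT
    return cdw12


cdw12 = encode_cdw12(nlb=8, pract=True, chk_guard=True, chk_app=False, chk_ref=True)
print(f"{cdw12:#010x}")  # 0x34000007
```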
Let us begin with the two scenarios where the PRACT bit is set.
- Figure 3a below depicts the case where the namespace is formatted with metadata size equal to protection information size. A write command in this scenario results in the transfer of only the logical block data from the host buffer to the controller. As the logical block data passes through the controller, the controller generates and appends protection information to the end of the logical block data, and the logical block data and protection information are written to NVM. In other words, the metadata is not resident within the host buffer.
- Figure 3b below depicts the case where the namespace is formatted with metadata size greater than protection information size. The host must send logical block data as well as the metadata buffer. As the metadata passes through the controller, the controller overwrites the protection information portion of the metadata regardless of PRCHK settings.
The final scenario has the PRACT bit not set and is depicted in Figure 3c. Data and metadata are transferred from the host to the controller. Protection information is checked as data and metadata pass through the controller.
Figure 3: Write Command Example
Source: NVM Express NVM Command Set Specification, Revision 1.0c
Let us now describe in more detail the behavior of a read command. Command dword 12 of the submission queue entry contains a single bit for PRACT and 3 bits for PRCHK. Read and write commands are the same in this respect.
Let us begin with the cases where the PRACT bit is set:
- Figure 4a below depicts the case where the namespace is formatted with metadata size equal to protection information size. The logical block data and metadata are read from NVM and protection information is checked as data passes through the controller. The controller only returns the logical block data to the host.
- Figure 4b below depicts the case where the namespace is formatted with metadata size greater than protection information size. The logical block data and metadata are read from NVM and protection information is checked as data passes through the controller. The controller returns the unchanged logical block data and metadata to the host.
The final case has the PRACT bit not set and is depicted in Figure 4c. Data and metadata are transferred from NVM to the host buffer and protection information is checked as data and metadata pass through the controller. The controller returns the logical block data and metadata to the host.
Figure 4: Read Command Example
Source: NVM Express NVM Command Set Specification, Revision 1.0c
This blog post has provided an introduction to E2EDP for NVMe devices. Part 2 of this series will describe the recent contribution we made to fio that enables testing E2EDP.
- The preceding is certainly not exhaustive in its coverage of E2EDP. For example, Storage Tag support was not included in fio's support for E2EDP and the discussion above does not include Storage Tags. We also omit discussion of protection information for compare, copy, and zone append commands. These commands were not included when support for E2EDP was added to fio.
- The NVMe specification documents are available in PDF format. Had they also been available in HTML format at https://nvmexpress.org, it would have been very helpful to link to the relevant portions of the standard as they were discussed.