RFC: Variable length descriptors

Alex Forencich edited this page Nov 11, 2020 · 4 revisions

Request for comments: variable-length descriptors for Corundum

I am investigating support for variable-length descriptors in Corundum, and I would like to solicit comments on the scheme before I start writing too much code.

This document has several goals. The first goal is to lay out the variable-length descriptor framing format. The descriptor framing format covers the size, layout, and alignment of descriptors in host memory as well as information within each descriptor to facilitate transferring it from the host to the NIC. The second goal is to define a basic descriptor format to facilitate low-overhead transfers of packet data. Additional, more complex descriptor formats can then be added later on as necessary to support additional protocols and offloads.

Variable-length descriptors have the potential to be expressive, extensible, and efficient, enabling support for a significant amount of per-packet metadata, descriptor-inline headers, command and control information for hardware protocol offloading and packet processing components, and other features while minimizing transfer overheads. Implementing variable-length descriptors requires more complex parsing logic on the FPGA. However, this more complex logic can also enable features that improve performance, PCIe link utilization, and latency, including descriptor block reads, prefetching, and caching.

Current Implementation: Fixed-size descriptor blocks

Currently, the descriptor format used by Corundum is very simple:

struct mqnic_desc {
    __u8 rsvd0[2];
    __u16 tx_csum_cmd;
    __u32 len;
    __u64 addr;
};

In other words, the descriptor holds one 64-bit pointer into host memory, one 32-bit length, one 16-bit field for the transmit checksum offload, and 16 bits that are currently unused: 16 bytes in total, with all fields aligned in host memory.

Current scatter/gather support in Corundum is a relatively simple "hack" on top of this: the current implementation adds a per-queue block size setting, which results in handling descriptors in blocks of 1, 2, 4, or 8 of these 16-byte descriptors. This simple scheme with fixed-size descriptors means that the queue management logic can keep track of all of the descriptors in all of the queues without having to actually look at any of the descriptors. The disadvantage is that it is inflexible and wasteful in a couple of different ways. First, all of the descriptors are read individually, which results in inefficient use of the PCIe link due to TLP overheads associated with each descriptor read, especially for small block sizes. Second, complete blocks are always transferred, even if not all of the pointers are required. This makes scatter/gather less efficient, especially when handling small packets in a high-MTU configuration. Inside the FPGA, each 16-byte segment is transferred in a separate clock cycle, so transferring and parsing unused pointers limits the packet rate for larger block sizes. The result is a trade-off in transfer efficiency between large packets (where more pointers and larger descriptors are required to prevent the host from spending CPU cycles to linearize packets) and small packets (where unused pointers in large descriptors add extra overhead).
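To put numbers on the second point, the following is a minimal sketch of how many descriptor bytes are read per packet under the fixed block scheme, and how many of them carry no useful pointer. The function names are made up for illustration and are not part of the Corundum driver.

```c
#include <stddef.h>

#define MQNIC_DESC_SIZE 16u  /* bytes per struct mqnic_desc, as described above */

/* Descriptor bytes read per packet for a per-queue block size of 1, 2, 4, or 8. */
size_t fixed_block_bytes(unsigned block_size)
{
    return (size_t)block_size * MQNIC_DESC_SIZE;
}

/* Bytes of that read that hold unused pointers when the packet
 * only needs used_segs of the block_size available entries. */
size_t fixed_block_wasted_bytes(unsigned block_size, unsigned used_segs)
{
    if (used_segs > block_size)
        used_segs = block_size;  /* packet needs linearizing; no entry is left unused */
    return (size_t)(block_size - used_segs) * MQNIC_DESC_SIZE;
}
```

With a block size of 8, a small packet in a single linear buffer still costs a 128-byte descriptor read, of which 112 bytes are unused.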

Variable-length descriptors

Variable-length descriptors permit the size of each descriptor to vary with the requirements for transferring each packet. The idea would be to only transfer as many pointers as are needed to describe each packet: packets in a single linear segment can use a single pointer, while packets broken up across several segments can use several pointers. It is also possible to inline packet headers or entire packets into the descriptor queue, which can eliminate the need for a separate DMA operation to fetch the packet data. Additionally, variable-length descriptors can be constructed with blocks of various types, including not only pointers to packet data, but also metadata and commands to control per-packet processing and other hardware components.

There are a number of possible methods for implementing variable-length descriptors. One is to have a descriptor length field near the beginning of the descriptor that indicates the size of the descriptor. Another is to use a “last block” flag on each of the blocks that make up the descriptor. Another consideration is the descriptor size and alignment. Supporting descriptors of arbitrary length in bytes is by far the most flexible option, but it introduces alignment issues both when accessing descriptor fields from software and when interpreting the descriptor in hardware. Building descriptors from fixed-size blocks means the descriptors can be consistently aligned in system memory for ease of access from software, and less re-alignment may be required while reading and parsing the descriptors in hardware.

Using a “last block” flag means that descriptors can potentially be unlimited in size, but requires the descriptor read logic to parse the entire descriptor. This also places limitations on the overall descriptor format and/or requires more complex parsing logic to correctly parse the flags. Additionally, it is not possible to determine the size of a given descriptor before it is fully read and parsed.

Using a length field limits the size of the descriptor based on the range of the length field, but it enables a simple separation of responsibility between the descriptor fetch logic and the rest of the descriptor parsing logic – if the length field is in a consistent location, the descriptor fetch logic does not need to parse any other portion of the descriptor, and the descriptor fetch logic can check if it has the complete descriptor in its cache as soon as it has read the length field. Aside from the length field, the descriptor format can be completely decoupled from the descriptor parsing logic, permitting the descriptor format to be modified to support new functionality without changing the descriptor fetch logic. The mlx4, mlx5, and ixgbe kernel modules all use length fields in the descriptor headers.

Therefore, using a descriptor length field at a fixed location at the beginning of the descriptor provides the most reasonable balance between simplicity of implementation and descriptor flexibility.

Proposed descriptor framing format

Using a 16-byte block size and an 8-bit length field permits descriptors of up to 256 blocks or 4096 bytes in size. In this case, 16-byte data records containing an aligned 64-bit pointer and a 16-bit or 32-bit length field permit up to 256 pointers per descriptor. Adding a 1-byte type field after the length field permits the implementation of multiple types of descriptors, allowing for future expansion.
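The text does not say how the 8-bit length field is encoded. An 8-bit field holds the values 0 through 255, so one way to reach the stated maximum of 256 blocks is to store the block count minus one. The sketch below assumes that encoding; it is an illustration, not a settled part of the format, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_BLOCK_SIZE 16u  /* block size proposed above */

/* ASSUMPTION: the len field stores (block count - 1), so the
 * values 0..255 map to 1..256 blocks. Caller passes 1..256. */
uint8_t desc_len_encode(unsigned blocks)
{
    return (uint8_t)(blocks - 1u);
}

/* Number of 16-byte blocks described by a len field value. */
unsigned desc_len_to_blocks(uint8_t len)
{
    return (unsigned)len + 1u;
}

/* Descriptor size in bytes described by a len field value. */
size_t desc_len_to_bytes(uint8_t len)
{
    return (size_t)desc_len_to_blocks(len) * DESC_BLOCK_SIZE;
}
```

Under this assumption a len value of 255 corresponds to 256 blocks, or 4096 bytes, and a zero-length descriptor cannot be expressed.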

In this case, the overall descriptor format would look something like this, with a descriptor header followed by one or more 16-byte blocks:

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_block {
    __u8 rsvd[16];
};

struct desc {
    struct desc_hdr desc_hdr;
    __u8 rsvd[14];
    struct desc_block blocks[];
};

Note that the overall descriptor does not necessarily have to be a multiple of 16 bytes in length. All entries in the descriptor ring in host memory must be 16-byte aligned and all descriptor reads will be in multiples of 16 bytes, and as such all descriptors will be effectively padded to a multiple of 16 bytes, which can be discarded when parsing the descriptor. However, maintaining a 16-byte alignment within the descriptor can simplify parsing the descriptor in hardware as it will result in fields appearing in consistent positions on the descriptor data bus.
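As a rough sketch of the separation of responsibility that the length field allows, the code below decides whether a complete descriptor is available, and steps from one descriptor to the next, by reading only the first byte of each descriptor. It assumes that byte 0 is the len field and that it stores the block count minus one; both the encoding and the function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_BLOCK_SIZE 16u

/* Total descriptor size in bytes, taken from the len field in byte 0.
 * ASSUMPTION: len stores (block count - 1). */
size_t desc_size_bytes(const uint8_t *desc)
{
    return ((size_t)desc[0] + 1u) * DESC_BLOCK_SIZE;
}

/* True once the avail bytes starting at buf cover the whole descriptor. */
bool desc_is_complete(const uint8_t *buf, size_t avail)
{
    if (avail < DESC_BLOCK_SIZE)
        return false;  /* the header block itself has not arrived yet */
    return avail >= desc_size_bytes(buf);
}

/* Count the complete descriptors in a linear buffer by hopping from
 * header to header; no field other than len is inspected. */
size_t desc_count(const uint8_t *buf, size_t bytes)
{
    size_t off = 0, n = 0;

    while (off < bytes && desc_is_complete(buf + off, bytes - off)) {
        off += desc_size_bytes(buf + off);
        n++;
    }
    return n;
}
```

Because only the first byte is examined, the rest of the descriptor layout can change without touching this logic.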

Proposed descriptor formats

The current 16-byte descriptor format is very efficient in terms of transferring length/pointer pairs with little additional overhead, especially in the receive direction. Additionally, the two reserved bytes can be replaced with the length and type fields without changing the rest of the fields, like so:

struct desc_data_seg {
    struct desc_hdr desc_hdr;
    __u16 tx_csum_cmd;
    __u32 len;
    __u64 addr;
};

In this case, each descriptor of this type would simply be an array of these desc_data_seg structures, with only the length and type fields in the first element being used and the rest either ignored or available for other uses.
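A minimal sketch of how a driver might fill such a descriptor from a scatter list follows. The stdint types stand in for the kernel `__u8`/`__u16`/`__u32`/`__u64` types, and `struct sg_entry`, `DESC_TYPE_DATA`, `desc_build_data`, and the count-minus-one length encoding are all assumptions introduced for illustration. Byte order conversion is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct desc_hdr {
    uint8_t len;
    uint8_t type;
};

struct desc_data_seg {
    struct desc_hdr desc_hdr;
    uint16_t tx_csum_cmd;
    uint32_t len;
    uint64_t addr;
};

static_assert(sizeof(struct desc_data_seg) == 16,
              "a data segment must fill exactly one 16-byte block");

#define DESC_TYPE_DATA 0u  /* hypothetical type code */

/* Hypothetical scatter list entry: one contiguous buffer in host memory. */
struct sg_entry {
    uint64_t addr;
    uint32_t len;
};

/* Fill segs[0..n-1] from sg[0..n-1]. Only the header in segs[0] is
 * meaningful; the headers of the other segments are zeroed.
 * Returns the number of blocks written, or 0 if n is out of range. */
size_t desc_build_data(struct desc_data_seg *segs,
                       const struct sg_entry *sg, size_t n)
{
    if (n == 0 || n > 256)
        return 0;

    for (size_t i = 0; i < n; i++) {
        segs[i].desc_hdr.len = 0;
        segs[i].desc_hdr.type = 0;
        segs[i].tx_csum_cmd = 0;
        segs[i].len = sg[i].len;
        segs[i].addr = sg[i].addr;
    }

    /* ASSUMPTION: len stores (block count - 1). */
    segs[0].desc_hdr.len = (uint8_t)(n - 1);
    segs[0].desc_hdr.type = DESC_TYPE_DATA;
    return n;
}
```

A packet in a single linear buffer produces a 16-byte descriptor, and a packet in three fragments produces a 48-byte descriptor, with no padding to a fixed block size.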

A possible modification would be to switch to a 16 bit length field, which would open up 16 bits for other uses, such as a flags field:

struct desc_data_seg {
    struct desc_hdr desc_hdr;
    __u16 flags;
    __u16 tx_csum_cmd;
    __u16 len;
    __u64 addr;
};

Additional descriptor formats that use a variable number of data segments can be defined like so:

struct new_desc {
    struct desc_hdr desc_hdr;
    // more header fields...
    struct desc_data_seg data_segs[] __attribute__ ((aligned (16)));
};

In this case, arbitrary header fields can be added and the descriptor fetch logic will be able to read the descriptor and hand it off to the transmit or receive logic for parsing without modification.

Descriptor-inline data can also be supported with this format. In this case, an offset and length in the descriptor header would point to the inline data. A descriptor with this format might look something like this:

struct desc_with_inline_data {
    __u8 desc_len;
    __u8 desc_type;
    __u8 data_seg_count;
    __u8 rsvd0;
    __u16 inl_data_len;
    // etc.
    union {
        struct desc_data_seg data;
        char inl_data[16];
    } segs[] __attribute__ ((aligned (16)));
};

In this case, data_seg_count would indicate the number of data segments following the header, then the inline data would start immediately after the last data segment.
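The position of the inline data and the total descriptor size then follow from `data_seg_count` and `inl_data_len` alone. The sketch below assumes that the header fields occupy exactly one 16-byte block, which the 16-byte alignment of `segs` suggests but the text does not state outright. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_BLOCK_SIZE 16u

/* Byte offset of the inline data from the start of the descriptor.
 * ASSUMPTION: the header occupies exactly one 16-byte block, followed
 * by data_seg_count data segments of 16 bytes each. */
size_t inl_data_offset(uint8_t data_seg_count)
{
    return DESC_BLOCK_SIZE * (1u + (size_t)data_seg_count);
}

/* Total descriptor size in blocks: header, data segments, and the
 * inline data rounded up to a whole number of blocks. */
size_t inl_desc_blocks(uint8_t data_seg_count, uint16_t inl_data_len)
{
    size_t inl_blocks =
        ((size_t)inl_data_len + DESC_BLOCK_SIZE - 1u) / DESC_BLOCK_SIZE;

    return 1u + (size_t)data_seg_count + inl_blocks;
}
```

For example, two data segments plus 54 bytes of inline headers place the inline data at byte offset 48 and give a descriptor of 7 blocks (112 bytes).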

Proposed overall descriptor format:

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr;
        __u8 rsvd0[2];
    };
    __u16 flags;
    __u16 tx_csum_cmd;
    __u16 len;
    __u64 addr;
};

struct desc_with_inline_data {
    struct desc_hdr desc_hdr;
    __u16 flags;
    __u32 opcode;
    __u8 data_seg_count;
    __u8 rsvd0;
    __u16 inl_data_len;
    __u8 rsvd1[4];
    union {
        struct desc_data_seg data_seg;
        char inl_data[16];
    } segs[];
};
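The layout above can be checked at compile time. The following sketch restates the structures with stdint types standing in for the kernel `__u8`/`__u16`/`__u32`/`__u64` types and asserts that each data segment is one 16-byte block and that the header of `desc_with_inline_data` fills exactly one block before `segs`. It needs C11 for the anonymous union and `static_assert`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct desc_hdr {
    uint8_t len;
    uint8_t type;
};

struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr;
        uint8_t rsvd0[2];
    };
    uint16_t flags;
    uint16_t tx_csum_cmd;
    uint16_t len;
    uint64_t addr;
};

struct desc_with_inline_data {
    struct desc_hdr desc_hdr;
    uint16_t flags;
    uint32_t opcode;
    uint8_t data_seg_count;
    uint8_t rsvd0;
    uint16_t inl_data_len;
    uint8_t rsvd1[4];
    union {
        struct desc_data_seg data_seg;
        char inl_data[16];
    } segs[];
};

/* Each data segment is one block, with the pointer naturally aligned. */
static_assert(sizeof(struct desc_data_seg) == 16, "data segment is not 16 bytes");
static_assert(offsetof(struct desc_data_seg, addr) == 8, "addr is not at offset 8");

/* The header fields fill the first block, so segs[] starts at offset 16. */
static_assert(offsetof(struct desc_with_inline_data, segs) == 16,
              "segments do not start on the second block");
```

These checks only cover size and field offsets; byte order on the wire is a separate question that the format would still need to pin down.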

With this format, it is also possible to embed additional free-form metadata by using a bit in the flags field to treat the first data segment as inline metadata to be passed as sideband data alongside the packet data, for use by packet processing components on the FPGA.
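As an illustration of the metadata flag, the sketch below picks an arbitrary bit for it and computes which entry of `segs` is the first one that carries a packet data pointer. The flag name and bit position are hypothetical; the text does not assign any bits in the flags field.

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_FLAG_INLINE_META (1u << 0)  /* hypothetical bit position */

/* True if the first segment holds sideband metadata rather than a pointer. */
bool desc_has_inline_meta(uint16_t flags)
{
    return (flags & DESC_FLAG_INLINE_META) != 0;
}

/* Index in segs[] of the first segment that describes packet data. */
unsigned desc_first_data_seg(uint16_t flags)
{
    return desc_has_inline_meta(flags) ? 1u : 0u;
}
```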

Remaining questions

Is the descriptor framing format sufficiently efficient and flexible (16-byte blocks, 1-byte length, 1-byte type, maximum descriptor size of 256 blocks or 4096 bytes)? Is there any compelling use case for descriptors larger than 4096 bytes?

Is there any reason to keep a 32-bit length field in the data segments? I'm leaning towards switching to 16 bits; if the transfer is large enough to overflow a 16-bit length field, then the overhead of a larger descriptor should not be an issue.
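To gauge that overhead, the following sketch splits one contiguous buffer into segments that each fit a 16-bit length field. It assumes the field is a plain byte count with a maximum of 65535; `struct seg` and `split_buffer` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: the 16-bit len field is a plain byte count, so one
 * segment can describe at most 65535 bytes. */
#define SEG_MAX_LEN 65535u

struct seg {
    uint64_t addr;
    uint16_t len;
};

/* Split a contiguous buffer into segments of at most SEG_MAX_LEN bytes.
 * Writes up to max_segs entries to out and returns the number of
 * segments the buffer needs, which may exceed max_segs. */
size_t split_buffer(uint64_t addr, size_t len, struct seg *out, size_t max_segs)
{
    size_t n = 0;

    while (len > 0) {
        size_t chunk = len > SEG_MAX_LEN ? SEG_MAX_LEN : len;

        if (n < max_segs) {
            out[n].addr = addr;
            out[n].len = (uint16_t)chunk;
        }
        addr += chunk;
        len -= chunk;
        n++;
    }
    return n;
}
```

A 128 KiB buffer needs three segments under this assumption, so 48 bytes of descriptor for 131072 bytes of payload.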

For data-only or very simple offloads (e.g. receive path, transmit IP checksum), using data segments directly can be sufficient, but for more complex offloads some sort of standardized descriptor header may be useful. What fields would make sense to include in this header? It is certainly possible to add additional segments attached to the standard header or to develop new descriptor headers for supporting different protocols or offload features.

Existing driver descriptor formats

Here are the descriptor formats for the mlx4, mlx5, and ixgbe drivers, for reference purposes.

mlx4

// TX descriptor, 32 bytes
struct mlx4_en_tx_desc {
    struct mlx4_wqe_ctrl_seg ctrl;
    union {
	    struct mlx4_wqe_data_seg data; /* at least one data segment */
	    struct mlx4_wqe_lso_seg lso;
	    struct mlx4_wqe_inline_seg inl;
    };
};

// RX descriptor
struct mlx4_en_rx_desc {
    /* actual number of entries depends on rx ring stride */
    struct mlx4_wqe_data_seg data[0];
};

union mlx4_wqe_qpn_vlan {
    struct {
	    __be16	vlan_tag;
	    u8	ins_vlan;
	    u8	fence_size;
    };
    __be32		bf_qpn;
};

struct mlx4_wqe_lso_seg {
    __be32			mss_hdr_size;
    __be32			header[];
};

struct mlx4_wqe_inline_seg {
    __be32			byte_count;
};

// control segment, 16 bytes
struct mlx4_wqe_ctrl_seg {
    __be32			owner_opcode;
    union mlx4_wqe_qpn_vlan	qpn_vlan;
    /*
     * High 24 bits are SRC remote buffer; low 8 bits are flags:
     * [7]   SO (strong ordering)
     * [5]   TCP/UDP checksum
     * [4]   IP checksum
     * [3:2] C (generate completion queue entry)
     * [1]   SE (solicited event)
     * [0]   FL (force loopback)
     */
    union {
	    __be32			srcrb_flags;
	    __be16			srcrb_flags16[2];
    };
    /*
     * imm is immediate data for send/RDMA write w/ immediate;
     * also invalidation key for send with invalidate; input
     * modifier for WQEs on CCQs.
     */
    __be32			imm;
};

// data segment, 16 bytes
struct mlx4_wqe_data_seg {
    __be32			byte_count;
    __be32			lkey;
    __be64			addr;
};

mlx5

// TX descriptor
struct mlx5e_tx_wqe {
    struct mlx5_wqe_ctrl_seg ctrl;
    struct mlx5_wqe_eth_seg  eth;
    struct mlx5_wqe_data_seg data[0];
};

// RX descriptor
struct mlx5e_rx_wqe_ll {
    struct mlx5_wqe_srq_next_seg  next;
    struct mlx5_wqe_data_seg      data[];
};

// control segment, 16 bytes
struct mlx5_wqe_ctrl_seg {
    __be32			opmod_idx_opcode;
    __be32			qpn_ds;
    u8			signature;
    u8			rsvd[2];
    u8			fm_ce_se;
    union {
	    __be32		general_id;
	    __be32		imm;
	    __be32		umr_mkey;
	    __be32		tis_tir_num;
    };
};

// ethernet segment, 16 bytes
struct mlx5_wqe_eth_seg {
    u8              swp_outer_l4_offset;
    u8              swp_outer_l3_offset;
    u8              swp_inner_l4_offset;
    u8              swp_inner_l3_offset;
    u8              cs_flags;
    u8              swp_flags;
    __be16          mss;
    __be32          rsvd2;
    union {
	    struct {
		    __be16 sz;
		    u8     start[2];
	    } inline_hdr;
	    struct {
		    __be16 type;
		    __be16 vlan_tci;
	    } insert;
	    __be32 trailer;
    };
};

// RX descriptor header, 16 bytes
struct mlx5_wqe_srq_next_seg {
    u8			rsvd0[2];
    __be16			next_wqe_index;
    u8			signature;
    u8			rsvd1[11];
};

// data segment, 16 bytes
struct mlx5_wqe_data_seg {
    __be32			byte_count;
    __be32			lkey;
    __be64			addr;
};

ixgbe

/* Transmit Descriptor - Advanced */
union ixgbe_adv_tx_desc {
    struct {
	    __le64 buffer_addr;      /* Address of descriptor's data buf */
	    __le32 cmd_type_len;
	    __le32 olinfo_status;
    } read;
    struct {
	    __le64 rsvd;       /* Reserved */
	    __le32 nxtseq_seed;
	    __le32 status;
    } wb;
};

/* Receive Descriptor - Advanced */
union ixgbe_adv_rx_desc {
    struct {
	    __le64 pkt_addr; /* Packet buffer address */
	    __le64 hdr_addr; /* Header buffer address */
    } read;
    struct {
	    struct {
		    union {
			    __le32 data;
			    struct {
				    __le16 pkt_info; /* RSS, Pkt type */
				    __le16 hdr_info; /* Splithdr, hdrlen */
			    } hs_rss;
		    } lo_dword;
		    union {
			    __le32 rss; /* RSS Hash */
			    struct {
				    __le16 ip_id; /* IP id */
				    __le16 csum; /* Packet Checksum */
			    } csum_ip;
		    } hi_dword;
	    } lower;
	    struct {
		    __le32 status_error; /* ext status/error */
		    __le16 length; /* Packet length */
		    __le16 vlan; /* VLAN tag */
	    } upper;
    } wb;  /* writeback */
};