
Future of Corundum


Corundum was first publicly announced at the OptSys workshop at Sigcomm in August 2019 and published at FCCM in May 2020. It is now an appropriate time to consider the future direction of Corundum both as a project and as a community. In its current incarnation, Corundum is a high-performance NIC with a handful of specialized features. One potential path forward is to transform Corundum from a NIC into a powerful platform for in-network computing, which can integrate custom application logic alongside a high-performance datapath and unified address space.

Recently, there has been growing interest in in-network computing, where some amount of application-level processing takes place in networking hardware, including devices like programmable switches and smart NICs, instead of on general-purpose CPUs. Additionally, the use of RDMA for low-latency, low-overhead communication within datacenters has become increasingly common. RDMA is not only used to communicate between servers; it can also be used to communicate with GPUs (GPUDirect) and with NVMe devices (NVMe-oF). The ultimate goal is to have Corundum act as a flexible, high-performance bridge between traditional compute and networking functions that can balance work between hardware and software. The new features of Corundum will enable users to optimally allocate network and compute resources across software and hardware for application-specific needs.

To support in-network computing in Corundum, a number of architectural changes will be required. The first is to implement RDMA as a first-class feature that is accessible from both the host and on-FPGA application logic. The second is to provide a unified DMA address space that includes both host memory and on-board DRAM or on-package HBM. The third is to implement a user application region with various hooks into the DMA infrastructure, datapath, and other components. The goal is to make these changes without sacrificing the core set of features and devices currently supported by Corundum.

A block diagram of a starting point for the future architecture of Corundum is shown below, and some of the key features are detailed in the following sections. Implementing all of these features, especially RDMA, is going to require a team effort. If you are interested in getting involved in some way, please see the new contributor guide.

RDMA

A full-featured implementation of RDMA on Corundum in the form of RoCEv2 will require a significant overhaul of the NIC datapath. Components to handle RoCEv2 packet headers, queue pair state information, and protection domains will need to be implemented. Changes to the descriptor format will be necessary to include additional metadata fields. New checksum computation modules will also be required to compute IP, UDP, and TCP checksums for packets generated on the FPGA, independent of the host network stack.
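
As a concrete starting point, the sketch below shows how the RoCEv2 base transport header (BTH) fields needed for queue pair lookup might be pulled out of the datapath. It is a simplified illustration, assuming a fixed untagged Ethernet + IPv4 + UDP encapsulation with the whole header in the first stream word; the module and signal names are not part of the current Corundum code.

```verilog
// Minimal sketch of RoCEv2 BTH field extraction, assuming the 12-byte BTH
// starts at byte offset 42 (14 Ethernet + 20 IPv4 + 8 UDP), the full header
// fits in the first 512-bit stream word, and byte 0 of the frame is carried
// in tdata[7:0]. Names and structure are illustrative only.
module roce_bth_extract #(
    parameter DATA_WIDTH = 512
) (
    input  wire [DATA_WIDTH-1:0] s_axis_tdata,
    output wire [7:0]            bth_opcode,
    output wire                  bth_se,        // solicited event
    output wire [1:0]            bth_pad_count,
    output wire [15:0]           bth_pkey,      // partition key
    output wire [23:0]           bth_dest_qp,   // destination queue pair
    output wire                  bth_ack_req,
    output wire [23:0]           bth_psn        // packet sequence number
);

localparam BTH_OFFSET = 42*8;

// 12-byte base transport header; fields are in network byte order
wire [95:0] bth = s_axis_tdata[BTH_OFFSET +: 96];

assign bth_opcode    = bth[7:0];
assign bth_se        = bth[15];
assign bth_pad_count = bth[13:12];
assign bth_pkey      = {bth[23:16], bth[31:24]};
assign bth_dest_qp   = {bth[47:40], bth[55:48], bth[63:56]};
assign bth_ack_req   = bth[71];
assign bth_psn       = {bth[79:72], bth[87:80], bth[95:88]};

endmodule
```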

The primary goal is to make the RDMA stack visible to both software on the host and application logic on the FPGA, such that either one can issue and terminate RDMA operations. Additionally, the unified DMA address space will enable RDMA operations on buffers located in both host memory and on-card memory.

The RDMA stack will have to be designed in such a way that the transmit schedulers can control all outgoing packets on a per-queue-pair basis, potentially enabling hardware flow control and congestion control techniques for RDMA traffic. Additionally, the hardware RDMA functionality will be controlled via Verilog parameters so it can be disabled to save logic resources, which may be especially important for smaller devices.
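
A minimal sketch of that compile-time gating pattern is shown below: a hypothetical RDMA_ENABLE parameter, combined with a generate block, removes the RoCEv2-specific logic entirely when set to 0. For brevity, the enabled branch only flags RoCEv2 packets by UDP port; the names are placeholders rather than existing Corundum parameters.

```verilog
// Sketch of gating RDMA-related logic behind a Verilog parameter so it can be
// compiled out to save resources on smaller devices. RDMA_ENABLE and the
// port names are placeholders.
module roce_flag #(
    parameter RDMA_ENABLE = 1,
    parameter DATA_WIDTH = 512
) (
    input  wire                  clk,
    input  wire [DATA_WIDTH-1:0] s_axis_tdata,
    input  wire                  s_axis_tvalid,
    output wire                  pkt_is_roce
);

localparam [15:0] ROCE_UDP_PORT = 16'd4791;  // RoCEv2 UDP destination port

generate

if (RDMA_ENABLE) begin : rdma

    // UDP destination port at byte offset 36 for untagged Ethernet + IPv4,
    // assuming byte 0 of the frame is carried in tdata[7:0]
    wire [15:0] udp_dport = {s_axis_tdata[36*8 +: 8], s_axis_tdata[37*8 +: 8]};

    reg pkt_is_roce_reg;

    always @(posedge clk) begin
        if (s_axis_tvalid) begin
            pkt_is_roce_reg <= udp_dport == ROCE_UDP_PORT;
        end
    end

    assign pkt_is_roce = pkt_is_roce_reg;

end else begin : no_rdma

    // RDMA support compiled out; no detection logic is generated
    assign pkt_is_roce = 1'b0;

end

endgenerate

endmodule
```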

The RDMA implementation in hardware will be complemented with a software stack. Ideally, the software stack will be compatible with MPI and other higher-level interface libraries to support existing applications that use RDMA directly or MPI over RDMA.

Status: TODO

Variable-length descriptors

The second major modification is the implementation of variable-length descriptor support. The current fixed-size descriptor format used by Corundum is compact but limited. RDMA support will require additional metadata, including memory keys and large length fields. Efficient scatter/gather operation on buffers of varying complexity requires support for handling descriptors of varying size. The logic required to handle variable-size descriptors will also enable other performance improvements, including descriptor block reads, prefetching, and caching, ultimately improving latency and PCIe link utilization. On the completion and event side, support for completion block writes can also improve PCIe link utilization.

Variable-length descriptor support can also be used to provide arbitrary metadata to application logic implemented on the FPGA as well as for transferring inline packet headers for segmentation offloading.
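
As an illustration of the idea (the actual format is still to be defined), the hypothetical layout below uses a fixed header that carries flags, metadata, and a scatter/gather segment count, so the total descriptor size can be computed directly from the first word.

```verilog
// Hypothetical variable-length descriptor: a 16-byte header followed by a
// variable number of 16-byte address/length scatter/gather segments. The
// field layout is illustrative only, not the planned Corundum format.
module desc_length #(
    parameter DESC_HDR_SIZE = 16,
    parameter DESC_SEG_SIZE = 16
) (
    input  wire [127:0] desc_hdr,
    output wire [3:0]   seg_count,
    output wire [8:0]   desc_len    // total descriptor size in bytes
);

// hypothetical header fields
// [15:0]   flags
// [19:16]  scatter/gather segment count
// [63:32]  metadata (e.g. RDMA memory key)
// [127:64] inline data or first segment
assign seg_count = desc_hdr[19:16];
assign desc_len  = DESC_HDR_SIZE + seg_count * DESC_SEG_SIZE;

endmodule
```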

Status: TODO

Unified DMA address space

A key aspect of the new architecture is a unified address space. A unified address space provides the same address layout to the host, the RDMA stack, and any on-FPGA application logic. With a unified address space, physical addresses for host buffers and on-card buffers can be directly passed to application logic and mixed in DMA scatter lists without any additional translation or other special treatment.

This powerful feature works by first exposing the on-board DRAM/on-package HBM in a single large PCIe BAR, and then routing DMA transfer requests internally to the appropriate DMA interface module based on whether the requested address falls inside of the BAR window.
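
A sketch of that routing decision is shown below; the window base and size are arbitrary values chosen for illustration, not the actual address layout.

```verilog
// Sketch of address-based routing in a unified DMA address space: requests
// whose address falls inside the on-card memory window (mirroring the large
// PCIe BAR) are steered to the DRAM/HBM DMA interface, everything else goes
// to the PCIe (host memory) DMA interface. Window placement is illustrative.
module dma_addr_decode #(
    parameter ADDR_WIDTH = 64,
    parameter [ADDR_WIDTH-1:0] CARD_MEM_BASE = 64'h8000_0000_0000_0000,
    parameter [ADDR_WIDTH-1:0] CARD_MEM_SIZE = 64'h0000_0004_0000_0000  // 16 GB
) (
    input  wire [ADDR_WIDTH-1:0] req_addr,
    output wire                  sel_card_mem,  // route to on-card memory DMA interface
    output wire                  sel_host_mem   // route to PCIe DMA interface
);

assign sel_card_mem = (req_addr >= CARD_MEM_BASE) &&
                      (req_addr < CARD_MEM_BASE + CARD_MEM_SIZE);
assign sel_host_mem = !sel_card_mem;

endmodule
```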

Implementing a unified address space requires construction of a multiplexer module that can connect multiple DMA interface modules to the same DMA RAM, performing arbitration on a per-segment basis for maximum throughput. Some modifications to the DMA interface modules may help to improve efficiency. Addition of a loopback RAM permits the DMA subsystem to copy data within the unified address space, including card-to-host, host-to-card, card-to-card, host-to-host, and peer-to-peer.
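
The toy sketch below shows the per-segment arbitration idea for two DMA interface modules writing into one segmented DMA RAM: each segment gets its own arbiter, so transfers touching different segments proceed in parallel. A fixed priority is used for brevity where the real module would need round-robin arbitration and matching read paths; all names are illustrative.

```verilog
// Toy per-segment write mux between two DMA interfaces sharing one segmented
// DMA RAM. Fixed priority (port A wins) for brevity; illustrative only.
module seg_arb_2x1 #(
    parameter SEG_COUNT = 4,
    parameter SEG_ADDR_WIDTH = 12,
    parameter SEG_DATA_WIDTH = 128
) (
    // port A (e.g. PCIe DMA interface)
    input  wire [SEG_COUNT*SEG_ADDR_WIDTH-1:0] a_wr_addr,
    input  wire [SEG_COUNT*SEG_DATA_WIDTH-1:0] a_wr_data,
    input  wire [SEG_COUNT-1:0]                a_wr_valid,
    output wire [SEG_COUNT-1:0]                a_wr_ready,
    // port B (e.g. on-card memory DMA interface)
    input  wire [SEG_COUNT*SEG_ADDR_WIDTH-1:0] b_wr_addr,
    input  wire [SEG_COUNT*SEG_DATA_WIDTH-1:0] b_wr_data,
    input  wire [SEG_COUNT-1:0]                b_wr_valid,
    output wire [SEG_COUNT-1:0]                b_wr_ready,
    // muxed output to the segmented DMA RAM
    output wire [SEG_COUNT*SEG_ADDR_WIDTH-1:0] ram_wr_addr,
    output wire [SEG_COUNT*SEG_DATA_WIDTH-1:0] ram_wr_data,
    output wire [SEG_COUNT-1:0]                ram_wr_valid,
    input  wire [SEG_COUNT-1:0]                ram_wr_ready
);

genvar s;
generate
for (s = 0; s < SEG_COUNT; s = s + 1) begin : seg
    // fixed priority: port A wins when both request the same segment
    wire grant_a = a_wr_valid[s];
    assign ram_wr_addr[s*SEG_ADDR_WIDTH +: SEG_ADDR_WIDTH] =
        grant_a ? a_wr_addr[s*SEG_ADDR_WIDTH +: SEG_ADDR_WIDTH]
                : b_wr_addr[s*SEG_ADDR_WIDTH +: SEG_ADDR_WIDTH];
    assign ram_wr_data[s*SEG_DATA_WIDTH +: SEG_DATA_WIDTH] =
        grant_a ? a_wr_data[s*SEG_DATA_WIDTH +: SEG_DATA_WIDTH]
                : b_wr_data[s*SEG_DATA_WIDTH +: SEG_DATA_WIDTH];
    assign ram_wr_valid[s] = a_wr_valid[s] | b_wr_valid[s];
    assign a_wr_ready[s]   = ram_wr_ready[s] & grant_a;
    assign b_wr_ready[s]   = ram_wr_ready[s] & ~grant_a;
end
endgenerate

endmodule
```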

Status: TODO

Application logic

The proposed new architecture will have a Verilog module for application logic. The module will provide a number of interfaces to the RDMA stack, datapath, and DMA subsystem. Various types of application logic, including HDL, HLS, and P4, can make use of these interfaces to send and receive packets, modify incoming and outgoing packets, issue and terminate RDMA operations, access on-card memory, use the DMA subsystem to reach on-card memory or host memory, and interact with application software on the host system. Specific interfaces provided to this module are listed below, followed by a rough sketch of a possible module skeleton:

  • AXI-lite slave interface to access NIC registers
  • AXI-lite master interface for configuration of application logic (separate BAR or shared BAR)
  • AXI slave interface to access on-card memory
  • DMA subsystem port to perform DMA operations on unified address space (transparently target both host and on-card memory)
  • AXI stream ports to access packet data on-the-wire
  • Current PTP time from hardware clock
  • Additional interfaces for flow control and packet drop reporting on TX (TBD)
  • Additional interfaces to datapath and RDMA stack (interrupts, doorbells, etc.) (TBD)
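
For orientation, a heavily abridged skeleton showing one representative port from several of these interface classes is sketched below; the names and widths are placeholders, not the actual port list of the Corundum application block.

```verilog
// Illustrative application block skeleton; names, widths, and the selection
// of ports are placeholders for the interface classes listed above.
module app_block #(
    parameter AXIL_ADDR_WIDTH = 24,
    parameter AXIS_DATA_WIDTH = 512,
    parameter PTP_TS_WIDTH = 96
) (
    input  wire clk,
    input  wire rst,

    // AXI-lite slave: control registers exposed to the host (shared or separate BAR)
    input  wire [AXIL_ADDR_WIDTH-1:0] s_axil_awaddr,
    input  wire                       s_axil_awvalid,
    output wire                       s_axil_awready,
    // ... remaining AXI-lite write/read channels omitted for brevity ...

    // AXI stream taps into the datapath (packet data on the wire)
    input  wire [AXIS_DATA_WIDTH-1:0] s_axis_rx_tdata,
    input  wire                       s_axis_rx_tvalid,
    output wire                       s_axis_rx_tready,
    output wire [AXIS_DATA_WIDTH-1:0] m_axis_tx_tdata,
    output wire                       m_axis_tx_tvalid,
    input  wire                       m_axis_tx_tready,

    // current PTP time from the hardware clock
    input  wire [PTP_TS_WIDTH-1:0]    ptp_ts_96,

    // DMA subsystem port into the unified address space (read request side only)
    output wire [63:0]                m_dma_read_desc_dma_addr,
    output wire [15:0]                m_dma_read_desc_len,
    output wire                       m_dma_read_desc_valid,
    input  wire                       m_dma_read_desc_ready
);

// application logic (HDL, HLS, or P4-generated) goes here

endmodule
```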

Status: an initial version of the application block has been integrated into Corundum as of September 2021.

Shared interface datapath

The current architecture of Corundum associates the majority of the datapath with each physical port, even when multiple ports are associated with a single interface. The advantage of this is that it reduces interference and head-of-line blocking between ports. However, this arrangement is not compatible with the shared state required by the RDMA stack. This means that the current Corundum datapath must be moved into the interface and shared between the ports, which will require internal flow control to prevent head-of-line blocking between ports. However, this change has the potential to reduce resource consumption by sharing control logic, as well as potentially supporting run-time switching between a single 100G link and multiple 10G/25G links.

Status: an initial version of the shared interface datapath has been integrated into Corundum as of January 2022.