rdma - thawk/wiki GitHub Wiki

OFA training materials

part1-OFA_Training_Sept_2016.pdf

Terminology

  • RDMA: Remote Direct Memory Access

    • application-to-application communication
    • Remote: communication at a distance
    • Direct: does not require a ‘higher authority’ for each access
    • Memory: virtual-to-virtual transfers...even across a network
  • Verbs

    • An API used by an application to control and conduct an RDMA operation.
  • Channel Adapter (CA)

    • An I/O device that allows an application to conduct RDMA operations directly
    • Consists of both hardware and software
    • Comes in three types
      • iWARP
      • InfiniBand
      • RoCE(RDMA over Converged Ethernet)
        • RoCE v1: ethertype 0x8915
        • RoCE v2: UDP port 4791
  • Requester

    • initiates and controls an RDMA operation
    • A request flows from the requester to the responder
    • A requester may request a READ or a WRITE operation
    • A WRITE operation transfers data from the requester to the responder
    • A READ operation transfers data from the responder to the requester
  • Responder

    • responds to requests
  • Channel Interface

    • A thin layer between the application and the Channel Adapter
    • Runs in user space
  • I/O

    • Usually falls into three categories: storage, networking, and IPC
    • Usually requires the operating system to be involved
  • Channel I/O

    • An I/O channel is a conduit between applications
    • The OS establishes the channel, thus a channel is isolated and protected
    • But the OS is not itself part of the channel!

    Channel I/O allows applications to communicate very efficiently

  • QP: Queue Pair

    • QP = send queue + receive queue
    • An application can create many QPs
    • Each QP is associated with exactly one application
  • SQ: Send Queue

  • RQ: Receive Queue

  • CQ: Completion Queue

    • A CQ can be associated with more than one QP
  • SRQ: Shared Receive Queue

    • A queue which holds WQEs for incoming messages from any RC/UC/UD QP which is associated with it.
    • More than one QP can be associated with one SRQ
  • WR: Work Request

    • a data structure that describes a piece of work to be completed:
      • a message to be sent
      • a message to be received
    • An application posts a WR to a queue.
  • RWR: Receive Work Request

  • SWR: Send Work Request

  • SQE

  • WC: Work Completion

  • WQE (pronounced wookie): Work Queue Element

    • Once posted to a work queue, a WR becomes an element of that queue, called a WQE
  • CQE (pronounced cookie): Completion Queue Element

  • HCA (Host Channel Adapter)

  • AH (Address Handle)

    • An object which describes the path to the remote side used in UD QP
  • VA (Virtual Address)

  • L-KEY

    • The L-KEY is supplied in local work requests so that the adapter can access the application's registered memory
  • R-KEY

    • The R-KEY is passed to a responder during RDMA READ or WRITE operations
  • Q-KEY – (RD, UD only). Since datagrams are unconnected, Q-Keys are used to govern access to a remote QP.

    • Normally, this access would be governed by comparing the SLID/DLID against QP context.
  • ULP (Upper Layer Protocol)

  • OFED

  • RC (Reliable Connection)

    • 1:1
    • Reliable delivery
    • In-order delivery
    • Similar to TCP
  • UC (Unreliable Connection)

    • 1:1
    • Delivery is not guaranteed; messages may be lost
  • UD (Unreliable Datagram)

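    • A single QP can send to and receive from any other UD QP
    • Delivery is not guaranteed; messages may be lost
    • Ordering is not guaranteed
    • Supports multicast (1:N)
    • Similar to UDP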
  • GRH (Global Routing Header)

  • PD (Protection Domain)

    • Object whose components can interact with only each other. These components can be AH, QP, MR, MW and SRQ.
    • A protection domain is used to associate Queue Pairs with Memory Regions and Memory Windows, as a means for enabling and controlling network adapter access to Host System memory.
    • PDs are also used to associate Unreliable Datagram queue pairs with Address Handles, as a means of controlling access to UD destinations.
    • struct ibv_pd is used to implement protection domains.
  • MW (Memory Window)

  • MR (Memory Region), created by memory registration

    • The registration process pins the memory pages (to prevent the pages from being swapped out and to keep physical <-> virtual mapping).
    • During the registration, the OS checks the permissions of the registered block.
    • The registration process writes the virtual to physical address table to the network adapter.
    • When registering memory, permissions are set for the region. Permissions are local write, remote read, remote write, atomic, and bind.
  • CM (Communication Manager)

    • An entity responsible for establishing, maintaining, and releasing communication for the RC and UC QP service types
    • The Service ID Resolution Protocol enables users of UD service to locate QPs supporting their desired service.
    • There is a CM in every IB port of the end nodes
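To make the queue vocabulary above more concrete, here is a toy in-memory model of how a WR becomes a WQE and later yields a CQE. It is only a sketch for illustration, not the verbs API: every name in it (`toy_wr`, `toy_cqe`, `toy_cq`, `toy_qp`, `toy_post_send`, `toy_process`, `toy_poll_cq`, `TOY_DEPTH`) is hypothetical, and the "adapter" is just a function that drains the send queue in order.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_DEPTH 8  /* hypothetical queue depth; a power of two */

/* A work request: describes one message to be sent. */
struct toy_wr {
    uint64_t wr_id;    /* caller-chosen id, echoed back in the completion */
    const void *buf;
    size_t len;
};

/* A completion queue element: reports the outcome of one WQE. */
struct toy_cqe {
    uint64_t wr_id;
    int status;        /* 0 means success in this toy model */
};

struct toy_cq {
    struct toy_cqe ring[TOY_DEPTH];
    unsigned head, tail;   /* head: next to poll, tail: next free slot */
};

struct toy_qp {
    struct toy_wr sq[TOY_DEPTH];   /* send queue; its entries are the WQEs */
    unsigned sq_head, sq_tail;
    struct toy_cq *cq;             /* several QPs may point at the same CQ */
};

/* Posting copies the WR into the send queue, where it becomes a WQE. */
static int toy_post_send(struct toy_qp *qp, const struct toy_wr *wr)
{
    if (qp->sq_tail - qp->sq_head == TOY_DEPTH)
        return -1;  /* send queue is full */
    qp->sq[qp->sq_tail % TOY_DEPTH] = *wr;
    qp->sq_tail++;
    return 0;
}

/* Stand-in for the adapter: consume WQEs in order, emit one CQE each. */
static int toy_process(struct toy_qp *qp)
{
    int n = 0;
    while (qp->sq_head != qp->sq_tail &&
           qp->cq->tail - qp->cq->head < TOY_DEPTH) {
        const struct toy_wr *wqe = &qp->sq[qp->sq_head % TOY_DEPTH];
        struct toy_cqe *cqe = &qp->cq->ring[qp->cq->tail % TOY_DEPTH];
        cqe->wr_id = wqe->wr_id;
        cqe->status = 0;
        qp->cq->tail++;
        qp->sq_head++;
        n++;
    }
    return n;
}

/* Polling removes up to max CQEs from the CQ and returns how many. */
static int toy_poll_cq(struct toy_cq *cq, int max, struct toy_cqe *out)
{
    int n = 0;
    while (n < max && cq->head != cq->tail) {
        out[n++] = cq->ring[cq->head % TOY_DEPTH];
        cq->head++;
    }
    return n;
}
```

Posting two WRs, calling `toy_process()` once, and then polling returns two CQEs whose `wr_id` values appear in posting order, which mirrors the in-order completion of a single send queue.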

Key points

  • Ordering

    • Ordering is guaranteed for all WRs submitted to a given send queue

    • Ordering is guaranteed for all WRs submitted to a given receive queue

    • There are no ordering guarantees between send and receive queues

    • There are no ordering guarantees between QPs

    • Ordered in appearance: WRs submitted to a single SQ must be initiated, sent and completed in the order they are submitted

    • Parallel in implementation: However, processing of the data transfers from multiple WRs submitted to the same SQ can be done in parallel – in particular, data may be placed into target memory in any order

    • For an RDMA_WRITE, the contents of the target buffer are indeterminate until a subsequent Send message is completed by consuming a WC at the target (e.g., an ACK is sent back and completes at the target)

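  • RDMA operations

    • Channel semantics
      • Send/Receive
      • A connection must be established first
      • Neither side needs to know the peer's memory addresses
      • Each Send is matched with exactly one Receive
      • The Receive must be posted before the matching Send
      • An RWR always produces a WC; the WC for an SWR is optional
    • Memory semantics
      • RDMA Read/RDMA Write
      • The initiator performs all of the operations
      • The passive side does not perform any operation
        • Before the transfer starts, it must tell the initiator its virtual memory address and rkey
      • The passive side gets no feedback
        • No CPU is consumed
        • No events and no completion notifications
      • The initiator needs to know the passive side's virtual memory address
      • The memory address and rkey must be exchanged separately via Send/Receive, and the passive side must be notified after the transfer completes
  • RDMA goals

    • High bandwidth utilization
    • Low latency
    • Low CPU utilization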

  • Main RDMA features

    • Zero-copy data transfers
      • Data moves directly from user memory on one side to user memory on the other side
      • No CPU intervention
      • No intermediate buffering
    • Kernel by-pass
      • User has direct access to the Channel Adapter
      • No kernel-level protocol handling
      • All protocol handling done on the Channel Adapter
  • Client-Server Model

    • Server
      • Listener
        • Waits for connections from clients
      • Agent
        • Transfers data with one client
    • Client
      • Connects to a server's listener
      • Transfers data with a server's agent
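The memory-semantics notes above say that the passive side has to tell the initiator its virtual address and rkey over Send/Receive before any RDMA Read or Write can start. A minimal sketch of such an advertisement message follows. The wire layout (8-byte address, 4-byte rkey, 4-byte length, all big-endian) and the names `mem_advert`, `mem_advert_pack`, `mem_advert_unpack` and `MEM_ADVERT_WIRE_SIZE` are assumptions made for this example, not a format defined by the training material or by any library.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

/* Hypothetical wire size: 8-byte addr + 4-byte rkey + 4-byte length. */
#define MEM_ADVERT_WIRE_SIZE 16

/* What the passive side advertises about one registered buffer. */
struct mem_advert {
    uint64_t addr;     /* virtual address of the registered buffer */
    uint32_t rkey;     /* remote key returned by registration */
    uint32_t length;   /* number of bytes the peer may access */
};

/* Serialize in network byte order so both ends agree regardless of host. */
static void mem_advert_pack(const struct mem_advert *m,
                            unsigned char out[MEM_ADVERT_WIRE_SIZE])
{
    uint32_t hi = htonl((uint32_t)(m->addr >> 32));
    uint32_t lo = htonl((uint32_t)(m->addr & 0xffffffffu));
    uint32_t rkey = htonl(m->rkey);
    uint32_t len = htonl(m->length);

    memcpy(out, &hi, 4);
    memcpy(out + 4, &lo, 4);
    memcpy(out + 8, &rkey, 4);
    memcpy(out + 12, &len, 4);
}

/* Inverse of mem_advert_pack(). */
static void mem_advert_unpack(const unsigned char in[MEM_ADVERT_WIRE_SIZE],
                              struct mem_advert *m)
{
    uint32_t hi, lo, rkey, len;

    memcpy(&hi, in, 4);
    memcpy(&lo, in + 4, 4);
    memcpy(&rkey, in + 8, 4);
    memcpy(&len, in + 12, 4);

    m->addr = ((uint64_t)ntohl(hi) << 32) | ntohl(lo);
    m->rkey = ntohl(rkey);
    m->length = ntohl(len);
}
```

In a verbs program the values would typically come from the `addr`, `rkey` and `length` fields of the `struct ibv_mr` returned by registration, and the packed bytes would travel in an ordinary Send that the initiator has already posted a Receive for.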

Typical application structure

  1. Get the device list;

    • Every device in this list contains both a name and a GUID.
    • For example the device names can be: mthca0, mlx4_1.
  2. Open the requested device;

    • Iterate over the device list, choose a device according to its GUID or name and open it.
  3. Query the device capabilities;

  4. Allocate a Protection Domain to contain your resources;

  5. Register a memory region;

    • VPI only works with registered memory.
    • Any memory buffer which is valid in the process’s virtual space can be registered.
  6. Create a Completion Queue (CQ);

    • A CQ contains completed work requests (WR).
    • Each WR will generate a completion queue entry (CQE) that is placed on the CQ.
    • The CQE will specify if the WR was completed successfully or not.
  7. Create a Queue Pair (QP);

    • Creating a QP will also create an associated send queue and receive queue.
  8. Bring up a QP;

    • A created QP still cannot be used until it is transitioned through several states, eventually getting to Ready To Send (RTS).
  9. Post work requests and poll for completion;

    • Use the created QP for communication operations.
  10. Cleanup;

    • Destroy objects in the reverse order you created them:
      • Delete QP
      • Delete CQ
      • Deregister MR
      • Deallocate PD
      • Close device

Programming

Header files

#include <infiniband/verbs.h>   // IB_VERBS base header
#include <rdma/rdma_cma.h>      // RDMA_CM CMA header, used for CM connection setup
#include <rdma/rdma_verbs.h>    // RDMA_CM VERBS header, used for the CM-based Verbs interfaces

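Software packages

  • rdma-core

    The open-source RDMA user-space software stack. It contains the user-space framework, the vendors' user-space drivers, API manual pages, and development self-test tools.

    rdma-core is maintained on GitHub; the user-space Verbs API we use is in fact implemented by it.

  • kernel RDMA subsystem

    The RDMA subsystem of the open-source Linux kernel. It contains the kernel RDMA framework and the vendors' drivers.

    The RDMA subsystem is maintained along with Linux and is part of the kernel. It provides the kernel-space Verbs API and also backs the user-space interfaces.

  • OFED

    Short for OpenFabrics Enterprise Distribution: an open-source collection of software packages containing the kernel framework and drivers, the user-space framework and drivers, plus various middleware, test tools, and API documentation.

    The open-source OFED is developed, released, and maintained by the OFA. It periodically pulls software versions from rdma-core and the kernel RDMA subsystem and adapts them to the commercial OS distributions. Besides the protocol stack and drivers, it also includes test tools such as perftest.

    OFED mainly contains the following: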

    • rdma-core
      • User-space framework / Verbs API
        • libibverbs(IB_VERBS)
          • ibv_create_qp()
          • ibv_post_send()
          • ...
        • librdmacm(RDMA_CM)
          • CMA
            • rdma_listen()
            • rdma_connect()
            • ...
          • CM VERBS
            • rdma_post_read()
            • rdma_post_ud_send()
            • ...
        • libumad
        • ...
    • kernel RDMA subsystem
      • Kernel framework / Verbs API
        • IB_VERBS
          • ib_create_qp()
          • ib_post_send()
          • ...
        • RDMA_CM
          • rdma_listen()
          • rdma_connect()
        • IB_MAD
        • ...
      • Vendor drivers
        • mlx5_ib.ko
        • hns-roce-hw-v2.ko
    • Test tools
      • perftest
      • qperf
      • ...

API

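In the broad sense, the Verbs API consists of two main parts:

IB_VERBS

These interfaces are prefixed with ibv_xx (user space) or ib_xx (kernel space). They are the most basic programming interface; IB_VERBS alone is enough to write an RDMA application.

For example:

  • ibv_create_qp() creates a QP
  • ibv_post_send() posts a Send WR
  • ibv_poll_cq() polls a CQ for CQEs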

RDMA_CM

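Prefixed with rdma_, it serves two main functions:

CMA(Connection Management Abstraction)

A set of interfaces, built on top of sockets and the Verbs API, for establishing CM connections and exchanging information. CM connection setup wraps a socket-like interface around QPs; from the user's point of view, it uses QPs to exchange the information (QPN, keys, and so on) that the later data transfers need.

For example:

  • rdma_listen() listens for incoming CM connection requests.
  • rdma_connect() initiates a CM connection request to the remote side.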

CM VERBS

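RDMA_CM can also be used for data exchange; it amounts to another set of data-exchange interfaces wrapped around the Verbs API.

For example:

  • rdma_post_read() posts a WR for an RDMA READ directly, unlike ibv_post_send(), where the operation type must be set to READ in the parameters.
  • rdma_post_ud_send() triggers a UD SEND directly from the remote QPN, the AH pointing to the remote side, a local buffer pointer, and similar arguments.

These interfaces are convenient, but they must be used with connections managed by CMA and cannot be used with the Verbs API on its own.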

Besides IB_VERBS and RDMA_CM, the Verbs API also includes the MAD (Management Datagram) interface, among others.

Configuration

  • A subnet manager is required. For a direct (back-to-back) connection, opensm can be used
    sudo opensm
    
  • ibping
    • server
      sudo ibping -S
      
    • client
      sudo ibping -G <peer port GUID>
      

References

Online resources

Websites

Code and man pages

Tuning

Commands

  • ibv_devinfo
  • ibdump
  • ibdev2netdev