rdma - thawk/wiki GitHub Wiki
OFA training materials
part1-OFA_Training_Sept_2016.pdf
Terminology
- RDMA: Remote Direct Memory Access
  - Application-to-application communication
  - Remote: communication at a distance
  - Direct: does not require a "higher authority" for each access
  - Memory: virtual-to-virtual transfers, even across a network
- Verbs
  - An API used by an application to control and conduct an RDMA operation.
- Channel Adapter (CA)
  - An I/O device that allows an application to conduct RDMA operations directly
  - Consists of hardware and software together
  - Three types:
    - iWARP
    - InfiniBand
    - RoCE (RDMA over Converged Ethernet)
      - RoCE v1: EtherType 0x8915
      - RoCE v2: UDP port 4791
- Requester
  - Initiates and controls an RDMA operation
  - A request flows from the requester to the responder
  - A requester may request a READ or a WRITE operation
  - A WRITE operation transfers data from the requester to the responder
  - A READ operation transfers data from the responder to the requester
- Responder
  - Responds to requests
- Channel Interface
  - A thin layer between the application and the Channel Adapter
  - Runs in user space
- I/O
  - Usually falls into three classes: storage, networking, and IPC
  - Usually requires the operating system to be involved
- Channel I/O
  - An I/O channel is a conduit between applications
  - The OS establishes the channel, thus a channel is isolated and protected
  - But the OS is not itself part of the channel!
  - Channel I/O allows applications to communicate very efficiently
- QP: Queue Pair
  - QP = send queue + receive queue
  - An application can create many QPs
  - Each QP is associated with exactly one application
- SQ: Send Queue
- RQ: Receive Queue
- CQ: Completion Queue
  - A CQ can be associated with more than one QP
- SRQ: Shared Receive Queue
  - A queue that holds WQEs for incoming messages from any RC/UC/UD QP associated with it.
  - More than one QP can be associated with one SRQ
- WR: Work Request
  - A data structure that describes a piece of work to be completed:
    - a message to be sent
    - a message to be received
  - An application posts a WR to a queue.
- RWR: Receive Work Request
- SWR: Send Work Request
- SQE
- WC: Work Completion
- WQE (pronounced "wookie"): Work Queue Element
  - Once posted to a work queue, a WR becomes an element of that queue, called a WQE
- CQE (pronounced "cookie"): Completion Queue Element
- HCA (Host Channel Adapter)
- AH (Address Handle)
  - An object that describes the path to the remote side, used with UD QPs
- VA (Virtual Address)
- L-KEY
  - The L-KEY is used when the application accesses the memory
- R-KEY
  - The R-KEY is passed to a responder during RDMA READ or WRITE operations
- Q-KEY (RD, UD only)
  - Since datagrams are unconnected, Q-KEYs are used to govern access to a remote QP.
  - Normally, this access would be governed by comparing the SLID/DLID against the QP context.
- ULP (Upper Layer Protocol)
- OFED
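- RC (Reliable Connection)
  - 1:1
  - Reliable delivery
  - In-order delivery
  - Similar to TCP
- UC (Unreliable Connection)
  - 1:1
  - Delivery is not guaranteed; messages may be lost
- UD (Unreliable Datagram)
  - A single QP can send to and receive from any other UD QP
  - Delivery is not guaranteed; messages may be lost
  - Ordering is not guaranteed
  - Supports multicast (1:N)
  - Similar to UDP
- GRH (Global Routing Header)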
- PD (Protection Domain)
  - An object whose components can interact only with each other. These components can be AH, QP, MR, MW, and SRQ.
  - A protection domain is used to associate Queue Pairs with Memory Regions and Memory Windows, as a means of enabling and controlling network adapter access to host system memory.
  - PDs are also used to associate Unreliable Datagram queue pairs with Address Handles, as a means of controlling access to UD destinations.
  - struct ibv_pd is used to implement protection domains.
- MW (Memory Window)
- MR (Memory Registration)
  - The registration process pins the memory pages (to prevent the pages from being swapped out and to keep the physical <-> virtual mapping).
  - During registration, the OS checks the permissions of the registered block.
  - The registration process writes the virtual-to-physical address table to the network adapter.
  - When registering memory, permissions are set for the region. Permissions are local write, remote read, remote write, atomic, and bind.
- CM (Communication Manager)
  - An entity responsible for establishing, maintaining, and releasing communication for the RC and UC QP service types
  - The Service ID Resolution Protocol enables users of the UD service to locate QPs supporting their desired service.
  - There is a CM in every IB port of the end nodes
Key points
- Ordering
  - Ordering is guaranteed for all WRs submitted to a given send queue
  - Ordering is guaranteed for all WRs submitted to a given receive queue
  - There are no ordering guarantees between send and receive queues
  - There are no ordering guarantees between QPs
  - Appears ordered: WRs submitted to a single SQ must be initiated, sent and completed in the order they are submitted
  - Processed in parallel: However, processing of the data transfers from multiple WRs submitted to the same SQ can be done in parallel. In particular, data may be placed into target memory in any order
  - For an RDMA_WRITE, the contents of the target buffer are indeterminate until a subsequent Send message is completed by consuming a WC at the target (e.g., an ACK is sent back and completes at the target)
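- RDMA operations
  - Channel semantics
    - Send/Receive
    - A connection must be established first
    - Neither side needs to know the peer's memory address
    - Each Send is matched with exactly one Receive
    - The Receive must be posted before the Send
    - An RWR always produces a WC; the WC for an SWR is optional
  - Memory semantics
    - RDMA Read/RDMA Write
    - The initiator performs the entire operation
    - The passive side does not have to perform any operation
    - Before the transfer starts, the passive side must tell the initiator its virtual memory address and rkey
    - The passive side gets no feedback
    - No CPU is consumed on the passive side
    - No events and no completion notifications
    - The initiator needs to know the passive side's virtual memory address
    - The memory address and rkey have to be exchanged separately via Send/Receive, and the passive side has to be notified after the transfer completes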
- RDMA goals
  - High bandwidth utilization
  - Low latency
  - Low CPU utilization
- Main RDMA features
  - Zero-copy data transfers
    - Data moves directly from user memory on one side to user memory on the other side
    - No CPU intervention
    - No intermediate buffering
  - Kernel bypass
    - User has direct access to the Channel Adapter
    - No kernel-level protocol handling
    - All protocol handling done on the Channel Adapter
- Client-Server Model
  - Server
    - Listener
      - Waits for connections from clients
    - Agent
      - Transfers data with one client
  - Client
    - Connects to a server's listener
    - Transfers data with a server's agent
Typical application structure
- Get the device list;
  - Every device in this list contains both a name and a GUID.
  - For example, the device names can be: mthca0, mlx4_1.
- Open the requested device;
  - Iterate over the device list, choose a device according to its GUID or name and open it.
- Query the device capabilities;
- Allocate a Protection Domain to contain your resources;
- Register a memory region;
  - VPI only works with registered memory.
  - Any memory buffer which is valid in the process's virtual space can be registered.
- Create a Completion Queue (CQ);
  - A CQ contains completed work requests (WR).
  - Each WR will generate a completion queue entry (CQE) that is placed on the CQ.
  - The CQE will specify if the WR was completed successfully or not.
- Create a Queue Pair (QP);
  - Creating a QP will also create an associated send queue and receive queue.
- Bring up a QP;
  - A created QP still cannot be used until it is transitioned through several states, eventually getting to Ready To Send (RTS).
- Post work requests and poll for completion;
  - Use the created QP for communication operations.
- Cleanup;
  - Destroy objects in the reverse order you created them:
    - Delete QP
    - Delete CQ
    - Deregister MR
    - Deallocate PD
    - Close device
Programming
Header files
#include <infiniband/verbs.h> // IB_VERBS base header
#include <rdma/rdma_cma.h> // RDMA_CM CMA header, used for CM connection setup
#include <rdma/rdma_verbs.h> // RDMA_CM VERBS header, used for the CM-based Verbs interface
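Software packages
- rdma-core
  The open-source RDMA user-space software stack. It contains the user-space framework, the vendors' user-space drivers, API manual pages, and development self-test tools.
  rdma-core is maintained on GitHub, and it is what actually implements the user-space Verbs API.
- kernel RDMA subsystem
  The RDMA subsystem of the open-source Linux kernel, containing the kernel RDMA framework and the vendors' drivers.
  The RDMA subsystem is maintained along with Linux as part of the kernel. It provides the kernel-space Verbs API and also backs the user-space interface.
- OFED
  Short for OpenFabrics Enterprise Distribution. It is an open-source collection of software packages that contains the kernel framework and drivers, the user-space framework and drivers, and various middleware, test tools, and API documentation.
  The open-source OFED is developed, released, and maintained by the OFA. It periodically pulls software versions from rdma-core and the kernel RDMA subsystem and adapts them to the commercial OS distributions. Besides the protocol stack and drivers, it also includes test tools such as perftest.
  OFED mainly contains: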
  - rdma-core
    - User-space framework / Verbs API
      - libibverbs (IB_VERBS)
        - ibv_create_qp()
        - ibv_post_send()
        - ...
      - librdmacm (RDMA_CM)
        - CMA
          - rdma_listen()
          - rdma_connect()
          - ...
        - CM VERBS
          - rdma_post_read()
          - rdma_post_ud_send()
          - ...
      - libibumad
      - ...
  - kernel RDMA subsystem
    - Kernel framework / Verbs API
      - IB_VERBS
        - ib_create_qp()
        - ib_post_send()
        - ...
      - RDMA_CM
        - rdma_listen()
        - rdma_connect()
      - IB_MAD
      - ...
    - Vendor drivers
      - mlx5_ib.ko
      - hns-roce-hw-v2.ko
  - Test tools
    - perftest
    - qperf
    - ...
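API
The Verbs API in the broad sense consists of two major parts:
IB_VERBS
These interfaces are prefixed with ibv_xx (user space) or ib_xx (kernel space). They are the most basic programming interface; IB_VERBS alone is enough to write an RDMA application.
For example:
- ibv_create_qp() creates a QP
- ibv_post_send() posts a Send WR
- ibv_poll_cq() polls CQEs from a CQ
RDMA_CM
These interfaces are prefixed with rdma_ and serve two main purposes:
CMA (Connection Management Abstraction)
A set of interfaces built on top of sockets and the Verbs API, used to set up connections through the CM and to exchange information. CM connection setup is implemented by wrapping QPs in a socket-like interface; from the user's point of view, a QP is used to exchange the information needed for the subsequent data exchange, such as the QPN and keys.
For example:
- rdma_listen() listens for CM connection requests on the link.
- rdma_connect() initiates a CM connection request to a listening peer.
CM VERBS
RDMA_CM can also be used for data exchange; it effectively wraps another set of data-exchange interfaces on top of the Verbs API.
For example:
- rdma_post_read() directly posts a WR for an RDMA READ operation, unlike ibv_post_send(), which requires the operation type to be set to READ in its arguments.
- rdma_post_ud_send() takes the remote QPN, an AH pointing to the remote side, a local buffer pointer, and so on, and triggers a UD SEND operation.
These interfaces are convenient, but they must be used with connections managed by CMA; they cannot be used with connections set up directly through the Verbs API.
Besides IB_VERBS and RDMA_CM, the Verbs API also includes the MAD (Management Datagram) interface, among others.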
Configuration
- A subnet manager is required. For a direct (back-to-back) link, opensm can be used:
  sudo opensm
- ibping
  - server
    sudo ibping -S
  - client
    sudo ibping -G <GID of the peer>
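References
Remote resources
- Server bus architecture diagram
- NIC model descriptions
- Introduction to RDMA
- RHEL7 configuration
- H3C's introduction to RDMA
- Lossless networks and PFC
- Reliable transport for RDMA in the data center
- Minimizing the Hidden Cost of RDMA
- Part III. InfiniBand and RDMA Networking Red Hat Enterprise Linux 7 | Red Hat Customer Portal
- 0. 《RDMA杂谈》专栏索引 - 知乎 (index of the "RDMA杂谈" column on Zhihu)
  - A series of articles introducing RDMA
Websites
Code and man pages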
- linux-rdma/rdma-core: RDMA core userspace libraries and daemons
- https://linux.die.net/man/7/rdma_cm
- https://linux.die.net/man/3/ibv_create_qp
Tuning
Commands
- ibv_devinfo
- ibdump
- ibvdev2netdev