# RoCE
RoCE (RDMA over Converged Ethernet) is a network protocol that allows RDMA (remote direct memory access) over an Ethernet network. The advantages of RDMA over other network APIs such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.
QLogic was the first in the industry to demonstrate end-to-end interoperability for 25Gb and 100Gb Ethernet solutions[1]. Both QLogic 25Gb and 100Gb Ethernet leverage a common 25Gbps per-lane signaling scheme, with 25GbE utilizing one lane and 100GbE utilizing four lanes. The QLogic converged Ethernet adapters support RoCE v1 and v2, as well as multiple protocol offload and concurrent LAN (TCP/IP) and SAN (FCoE, iSCSI) processing, over a single Ethernet connection[2]. We'll explore RoCE on QLogic 25GbE adapters.
The tests are performed on two Dell PowerEdge R730 servers, each equipped with:
- two (2x) 18-core Intel Haswell Xeon E5-2699 v3 processors at 2.30 GHz
- 128 GB memory
- an 8-lane PCIe QLogic 25GbE adapter
- a 16-lane PCIe QLogic 100GbE adapter
Note that with Turbo Boost active, the cores can run at up to 2.80 GHz, above the nominal 2.30 GHz; this CPU frequency discrepancy will matter for the perftest runs later:

```
# grep -E '^model name|^cpu MHz' /proc/cpuinfo
model name  : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
cpu MHz     : 2799.980
# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 2.80 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 2.80 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 2.80 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
```
The 2 test servers run CentOS 7. QLogic provides the following kernel modules/drivers for the converged NICs (network interface cards):
- qed - QLogic Everest Driver core module
- qede - QLogic Everest Driver for Ethernet protocol
- qedi - QLogic Everest Driver for iSCSI protocol
- qedf - QLogic Everest Driver for FCoE protocol
- qedr - QLogic Everest Driver for RoCE protocol
```
# modinfo qedr
filename:       /lib/modules/3.10.0-123.el7.x86_64/updates/qedr.ko
version:        8.3.6.0
license:        GPL
author:         QLogic Corporation
description:    QLogic 40G/100G ROCE Driver
srcversion:     FC346F5749A39A75FEB4610
depends:        compat,ib_core,qede,qed
vermagic:       3.10.0-123.el7.x86_64 SMP mod_unload modversions
parm:           debug: Default debug msglevel (uint)
parm:           roce_100g: Run RoCE over 100G link - will cause performance degradation for regular L2 traffic over 100g (uint)
```
Additionally, libqedr is the userspace driver for QLogic RoCE adapters. The package installs libqedr.a and libqedr.so in /usr/lib64, and adds the file /etc/libibverbs.d/qedr.driver:
```
driver qedr
```
RDMA is started with the init script /etc/init.d/openibd. Once it is started, we can query the RDMA devices:
```
# ibv_devinfo
hca_id: qedr0
        transport:                      InfiniBand (0)
        fw_ver:                         8.3.9.0
        node_guid:                      020e:1eff:fec4:05f1
        sys_image_guid:                 020e:1eff:fec4:05f1
        vendor_id:                      0x1077
        vendor_part_id:                 5718
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
```

Note that RoCE[3] doesn't require an SM (InfiniBand subnet manager); it operates using standard Ethernet network management practices for L2 address assignments, L2 topology discovery, and switch filtering database (FDB) configuration[4].
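The same information can also be queried programmatically through libibverbs. Below is a minimal sketch (the file name query_roce.c and the build command are my own assumptions, not part of the QLogic software; it needs the libibverbs development headers and links with -libverbs). It enumerates the RDMA devices and reports whether port 1 runs over an Ethernet link layer, i.e. RoCE, much like the ibv_devinfo output above:

```c
/* query_roce.c -- a sketch, not part of the QLogic tools:
 * list the RDMA devices libibverbs sees and report whether port 1
 * uses an Ethernet link layer (RoCE) or native InfiniBand.
 * Assumed build: gcc query_roce.c -o query_roce -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num, i;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_port_attr port;

        if (ctx && !ibv_query_port(ctx, 1, &port))
            printf("%s: port 1 state %d, link_layer %s\n",
                   ibv_get_device_name(list[i]), port.state,
                   port.link_layer == IBV_LINK_LAYER_ETHERNET
                       ? "Ethernet (RoCE)" : "InfiniBand");
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```

On these servers one would expect it to report qedr0 with an Ethernet link layer, consistent with the ibv_devinfo output.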
However, ibnodes fails:
```
# ibnodes
src/query_smp.c:235; Failed to register SMI agent on ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
```

The manpage of ibnodes says ibnodes "walks the IB subnet topology or uses an already saved topology file and extracts the IB nodes", so it requires an SM (subnet manager) to work properly; but there is no SM in RoCE! Other utilities that are hard-coded to use an SM will likewise fail.
The QLogic 25GbE adapters are in PCIe slot 4 in each server. Since biosdevname is used by default on Dell systems[5], they are given the consistent interface name p4p1. The 25GbE interfaces on the 2 servers can be directly connected using an SFP+ direct attach copper cable; or they can both connect to a 100GbE switch (e.g., Dell Z9100-ON) using a QSFP to 4x SFP+ breakout cable. I've performed tests in both cases, and haven't seen any noticeable difference between the two in either bandwidth or latency.
The IPv4 addresses of the 25GbE interfaces (p4p1) on the 2 servers are 192.168.25.91 & 192.168.25.96, respectively.
First, make sure the firewall does not block traffic on the QLogic 25GbE (p4p1) and 100GbE (p6p1) interfaces, by adding the following two lines to /etc/sysconfig/iptables:
```
-A INPUT -i p4p1 -j ACCEPT
-A INPUT -i p6p1 -j ACCEPT
```

then restart the iptables service:
```
# systemctl restart iptables
```
We can test connectivity through the RDMA communication manager (RDMA_CM) using the rping ping-pong test. On the server:
```
# rping -s -S 35 -C 5 -a 192.168.25.96 -p 100
```
Then on the client:
```
# rping -c -v -S 35 -C 5 -a 192.168.25.96 -p 100
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTU
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUV
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVW
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWX
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXY
client DISCONNECT EVENT...
```
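Under the hood, rping first resolves the destination IPv4 address through librdmacm before exchanging ping data over RDMA. The following is a minimal sketch of that first step (it is not the rping source; the file name and the build assumption of linking with -lrdmacm and -libverbs are mine). On these servers one would expect it to report that 192.168.25.96 maps to the qedr0 device:

```c
/* resolve_roce.c -- a sketch of RDMA_CM address resolution, the first
 * step rping performs; not the actual rping implementation.
 * Assumed build: gcc resolve_roce.c -o resolve_roce -lrdmacm -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;
    struct sockaddr_in dst;

    if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_cm setup");
        return 1;
    }

    /* Destination follows the rping example above: 192.168.25.96, port 100. */
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(100);
    inet_pton(AF_INET, "192.168.25.96", &dst.sin_addr);

    /* Map the IP address to a local RDMA device (qedr0 on these servers). */
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000) ||
        rdma_get_cm_event(ch, &ev) ||
        ev->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
        fprintf(stderr, "address resolution failed\n");
        return 1;
    }
    printf("resolved to device %s\n",
           ibv_get_device_name(id->verbs->device));

    rdma_ack_cm_event(ev);
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    return 0;
}
```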
Next, we measure the RDMA bandwidth with ib_send_bw. On the server:
```
# ib_send_bw -q 8 --port=12500 -x 0 -a -F 2> /dev/null
```
On the client (note here we use the IPv4 address of p4p1 on the server):
```
# ib_send_bw 192.168.25.96 -q 8 --port=12500 -x 0 -a -F 2> /dev/null
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : qedr0
 Number of qps   : 8            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 Gid index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 #bytes   #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
 2        1000          8.36              8.33                 4.367733
 4        1000          16.71             16.71                4.379616
 8        1000          33.49             33.47                4.387179
 16       1000          66.98             66.91                4.385292
 32       1000          133.95            133.28               4.367329
 64       1000          267.90            267.75               4.386754
 128      1000          536.83            535.91               4.390151
 256      1000          1071.60           1069.66              4.381322
 512      1000          2139.16           2135.22              4.372941
 1024     1000          2696.39           2695.85              2.760550
 2048     1000          2834.19           2833.39              1.450694
 4096     1000          2904.73           2904.41              0.743529
 8192     1000          2909.46           2909.04              0.372357
 16384    1000          2911.10           2910.90              0.186297
 32768    1000          2911.92           2911.83              0.093178
 65536    1000          2912.28           2912.27              0.046596
 131072   1000          2912.48           2912.46              0.023300
 262144   1000          2912.59           2912.59              0.011650
 524288   1000          2912.69           2912.68              0.005825
 1048576  1000          2912.64           2912.63              0.002913
 2097152  1000          2912.69           2912.68              0.001456
 4194304  1000          2912.67           2912.67              0.000728
 8388608  1000          2912.66           2912.66              0.000364
---------------------------------------------------------------------------------------
```

Note that because of the CPU frequency discrepancy mentioned above, we have to use the -F option to keep the test from failing. We can see that the RDMA bandwidth is close to line speed (2912.66 MB/sec ≈ 23.3 Gbps).
Now measure the RDMA latency with ib_send_lat. On the server:
```
# ib_send_lat --port=12500 -x 0 -a -F 2> /dev/null
```
On the client:
```
# ib_send_lat 192.168.25.96 --port=12500 -x 0 -a -F 2> /dev/null
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : qedr0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 Gid index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0xff0000 PSN 0xf84231
 GID: 254:128:00:00:00:00:00:00:02:14:30:255:254:196:05:241
 remote address: LID 0000 QPN 0xff0000 PSN 0xdea55a
 GID: 254:128:00:00:00:00:00:00:02:14:30:255:254:196:05:190
---------------------------------------------------------------------------------------
 #bytes   #iterations   t_min[usec]   t_max[usec]   t_typical[usec]
 2        1000          6.67          11.03         6.82
 4        1000          6.70          8.45          6.82
 8        1000          6.71          8.41          6.83
 16       1000          6.71          8.62          6.86
 32       1000          6.72          13.27         6.87
 64       1000          6.77          8.45          6.89
 128      1000          6.84          8.70          6.96
 256      1000          6.86          8.69          7.01
 512      1000          6.97          8.83          7.10
 1024     1000          7.24          9.06          7.37
 2048     1000          8.15          9.90          8.30
 4096     1000          10.11         12.03         10.26
 8192     1000          11.48         25.28         11.66
 16384    1000          14.14         625.12        14.30
 32768    1000          19.52         32.89         19.67
 65536    1000          30.25         43.76         30.40
 131072   1000          51.73         64.93         51.86
 262144   1000          94.61         107.97        94.78
 524288   1000          180.42        194.77        180.61
 1048576  1000          352.09        365.93        352.27
 2097152  1000          695.43        708.81        695.62
 4194304  1000          1383.92       1396.75       1384.16
 8388608  1000          2757.24       2770.96       2757.44
---------------------------------------------------------------------------------------
```

However, the RDMA latency is nowhere near the claimed 1 usec!
Open MPI can run on top of RoCE[6], in which case it uses the OFED Verbs-based openib BTL for traffic and its internal rdmacm CPC (Connection Pseudo-Component) for establishing connections for MPI traffic. Let's find out if we can get Open MPI to run on top of the QLogic RoCE implementation.
Install the openmpi package:
```
# yum install openmpi.x86_64 openmpi-devel.x86_64
```

which installs openmpi-1.6.4-5.el7.x86_64, along with environment-modules-3.2.10-8.el7.x86_64 as a dependency.
Set up passwordless SSH between the 2 servers for user qlogic:
```
[qlogic@apps4191 ~]$ cd .ssh/
[qlogic@apps4191 .ssh]$ ssh-keygen
[qlogic@apps4191 .ssh]$ cat id_rsa.pub >> authorized_keys
[qlogic@apps4191 .ssh]$ scp authorized_keys id_rsa id_rsa.pub 192.168.25.96:.ssh/
```
To automatically load the module for Open MPI, append the following line to ~qlogic/.bashrc:
```
module load mpi/openmpi-x86_64
```
Test mpirun:
```
[qlogic@apps4191 ~]$ module load mpi/openmpi-x86_64
[qlogic@apps4191 ~]$ mpirun --host 192.168.25.91,192.168.25.96 hostname
apps4191
apps4196
```
Download and compile a simple MPI program on the 2 servers:
```
$ mpicc -showme
gcc -I/usr/include/openmpi-x86_64 -pthread -m64 -L/usr/lib64/openmpi/lib -lmpi
$ wget http://pleiades.ucsc.edu/codes/mpi_hostname.c
$ mpicc mpi_hostname.c -o mpi_hostname.x
```
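For reference, a minimal MPI program along these lines looks roughly as follows (this is only a sketch of what mpi_hostname.c might contain; the actual file downloaded from pleiades.ucsc.edu may differ), producing the "Hello, I am <rank> of <size> running on <host>" output seen below:

```c
/* A sketch of a minimal mpi_hostname.c: each rank reports its rank,
 * the total number of ranks, and the host it runs on. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello, I am %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```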
Test running the simple MPI program on top of TCP:
```
[qlogic@apps4191 ~]$ mpirun --mca btl tcp,self -n 8 --cpus-per-proc 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
Hello, I am 0 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 4 of 8 running on apps4191
Hello, I am 6 of 8 running on apps4191
Hello, I am 1 of 8 running on apps4196
Hello, I am 3 of 8 running on apps4196
Hello, I am 5 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
[qlogic@apps4191 ~]$ mpirun --mca btl tcp,self -n 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
Hello, I am 0 of 8 running on apps4191
Hello, I am 1 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 3 of 8 running on apps4191
Hello, I am 4 of 8 running on apps4196
Hello, I am 5 of 8 running on apps4196
Hello, I am 6 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
```
However, running the MPI program on top of RoCE will result in a warning and an error:
```
[qlogic@apps4191 ~]$ mpirun --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm -n 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            apps4191
  Device name:           qedr0
  Device vendor ID:      0x1077
  Device vendor part ID: 5718

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured here:

  Local host:    apps4191
  OMPI source:   btl_openib_component.c:1200
  Function:      ompi_free_list_init_ex_new()
  Device:        qedr0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
```
We'll ignore the warning for now. The error arises because the maximum locked memory limit is only 64 kbytes by default in CentOS 7:

```
[qlogic@apps4196 ~]$ ulimit -l
64
```
Set the maximum locked memory to unlimited, by creating /etc/security/limits.d/openfabrics.conf with the following content:
```
* soft memlock unlimited
* hard memlock unlimited
```
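Incidentally, the limit that the openib BTL complains about can also be checked from C with the standard getrlimit() call; the sketch below (the file name checklimit.c is hypothetical) simply prints RLIMIT_MEMLOCK, which is essentially what ulimit -l reports and what Open MPI inspects before registering memory:

```c
/* checklimit.c -- a sketch: print the max-locked-memory limit.
 * getrlimit() reports bytes; RLIM_INFINITY corresponds to the
 * "unlimited" value shown by ulimit -l (which reports kbytes). */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl)) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("max locked memory: unlimited\n");
    else
        printf("max locked memory: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}
```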
This may not be necessary; but I've also added the following line to /etc/pam.d/sshd
```
session    required    pam_limits.so
```

then restarted the sshd service on the 2 servers:
```
# systemctl restart sshd
```
Confirm the maximum locked memory is now unlimited:
```
[qlogic@apps4191 ~]$ ulimit -l
unlimited
[qlogic@apps4191 ~]$ ssh 192.168.25.96 ulimit -l
unlimited
[qlogic@apps4191 ~]$ mpirun -np 2 --host 192.168.25.91,192.168.25.96 bash -c ulimit -l
unlimited
unlimited
```
However, even after following the instructions in the Open MPI FAQ, running the MPI program on top of RoCE still produces a warning about registered memory:
```
[qlogic@apps4191 ~]$ mpirun --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm -n 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              apps4196
  Registerable memory:     32768 MiB
  Total memory:            130975 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
```
which states that the registerable memory is only 32 GB (out of a total memory size of 128 GB). Without delving deeper into the issue, I assume this is a bug/feature of the QLogic drivers and will ignore the warning for now.
The above mpirun command also produces a misleading error (see below):
```
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue.  This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).  The failure occured here:

  Local host:  apps4196
  OMPI source: btl_openib.c:308
  Function:    ibv_create_srq()
  Error:       Success (errno=0)
  Device:      qedr0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
```
More seriously, this mpirun command results in a fatal error!
```
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[56905,1],4]) is on host: apps4196
  Process 2 ([[56905,1],0]) is on host: apps4191
  BTLs attempted: openib self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[apps4196:5083] *** An error occurred in MPI_Init
[apps4196:5083] *** on a NULL communicator
[apps4196:5083] *** Unknown error
[apps4196:5083] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: apps4196
  PID:        5083
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 5083 on
node 192.168.25.96 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[apps4191:05001] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[apps4191:05001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[apps4191:05001] 1 more process has sent help message help-mpi-btl-openib.txt / reg mem limit low
[apps4191:05001] 1 more process has sent help message help-mpi-btl-openib.txt / init-fail-create-q
[apps4191:05001] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[apps4191:05001] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[apps4191:05001] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[apps4191:05001] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
```
After some trials, however, I found that running mpirun without the --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm options works, albeit with some misleading errors/warnings:
```
[qlogic@apps4191 ~]$ mpirun -np 8 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue.  This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).  The failure occured here:

  Local host:  apps4191
  OMPI source: btl_openib.c:308
  Function:    ibv_create_srq()
  Error:       Success (errno=0)
  Device:      qedr0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
Hello, I am 0 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 4 of 8 running on apps4191
Hello, I am 6 of 8 running on apps4191
Hello, I am 5 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
Hello, I am 1 of 8 running on apps4196
Hello, I am 3 of 8 running on apps4196
[apps4191:17465] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[apps4191:17465] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[apps4191:17465] 7 more processes have sent help message help-mpi-btl-openib.txt / reg mem limit low
[apps4191:17465] 7 more processes have sent help message help-mpi-btl-openib.txt / init-fail-create-q
```
The supposed openib BTL failure is clearly misleading: the message reports the ibv_create_srq() call as Success (errno=0), yet erroneously labels it a failure!
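Since the error text hints that the device may not support shared receive queues, one way to probe this (a sketch, under the assumption that SRQ support is reflected in the max_srq device attribute, which is how libibverbs advertises it; the file name srq_check.c is hypothetical) is to query the device attributes directly:

```c
/* srq_check.c -- a sketch: report whether each RDMA device advertises
 * shared receive queue (SRQ) support; max_srq of 0 would mean no SRQs,
 * which would explain ibv_create_srq() failing on qedr0.
 * Assumed build: gcc srq_check.c -o srq_check -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num, i;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list)
        return 1;
    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_device_attr attr;

        if (ctx && !ibv_query_device(ctx, &attr))
            printf("%s: max_srq = %d, max_srq_wr = %d\n",
                   ibv_get_device_name(list[i]),
                   attr.max_srq, attr.max_srq_wr);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```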
The following works too:
```
[qlogic@apps4191 ~]$ mpirun -np 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue.  This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).  The failure occured here:

  Local host:  apps4191
  OMPI source: btl_openib.c:308
  Function:    ibv_create_srq()
  Error:       Success (errno=0)
  Device:      qedr0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
Hello, I am 0 of 8 running on apps4191
Hello, I am 1 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 3 of 8 running on apps4191
Hello, I am 6 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
Hello, I am 4 of 8 running on apps4196
Hello, I am 5 of 8 running on apps4196
[apps4191:17554] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[apps4191:17554] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[apps4191:17554] 7 more processes have sent help message help-mpi-btl-openib.txt / reg mem limit low
[apps4191:17554] 7 more processes have sent help message help-mpi-btl-openib.txt / init-fail-create-q
```
In summary, there are some bugs in the QLogic RoCE driver and in Open MPI that prevent Open MPI on top of QLogic RoCE from fully working; but those bugs appear to be minor, and the QLogic Everest Driver for the RoCE protocol is still under active development. I expect Open MPI on top of QLogic RoCE will work flawlessly by the time the QLogic 25GbE & 100GbE adapters are officially released later in 2015.
1. QLogic Demonstrates End-to-End 100Gb Ethernet Networks With Dell
2. QLogic to Demonstrate End-to-End 25Gb, 50Gb and 100Gb Networking With HP at Interop Las Vegas
3. RDMA over Converged Ethernet (RoCE)
4. ConnectX-2 EN with RoCE
5. Consistent Network Device Naming in RHEL 7
6. How do I run Open MPI over RoCE?