# RoCE
RoCE (RDMA over Converged Ethernet) is a network protocol that allows RDMA (remote direct memory access) over an Ethernet network. The advantages of RDMA over other network APIs such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.
QLogic was the first in the industry to demonstrate end-to-end interoperability for 25Gb and 100Gb Ethernet solutions[1]. Both QLogic 25Gb and 100Gb Ethernet leverage a common 25Gbps per-lane signaling scheme, with 25GbE utilizing one lane and 100GbE utilizing four lanes. The QLogic converged Ethernet adapters support RoCE v1 and v2, as well as multiple protocol offload and concurrent LAN (TCP/IP) and SAN (FCoE, iSCSI) processing, over a single Ethernet connection[2]. We'll explore RoCE on QLogic 25GbE adapters.
The tests are performed on two Dell PowerEdge R730 servers, each equipped with:
- two (2x) 18-core Intel Haswell Xeon E5-2699 v3 processors at 2.30 GHz
- 128 GB memory
- an 8-lane PCIe QLogic 25GbE adapter
- a 16-lane PCIe QLogic 100GbE adapter
Note that with Turbo Boost active, the cores can run at up to 2.80 GHz, above the nominal 2.30 GHz; this CPU frequency discrepancy will matter for the perftest runs later:

```
# grep -E '^model name|^cpu MHz' /proc/cpuinfo
model name  : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
cpu MHz     : 2799.980
# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 1.20 GHz - 2.80 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 1.20 GHz and 2.80 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 2.80 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
```
The 2 test servers run CentOS 7. QLogic provides the following kernel modules/drivers for the converged NICs (network interface cards):
- qed - QLogic Everest Driver core module
- qede - QLogic Everest Driver for Ethernet protocol
- qedi - QLogic Everest Driver for iSCSI protocol
- qedf - QLogic Everest Driver for FCoE protocol
- qedr - QLogic Everest Driver for RoCE protocol
```
# modinfo qedr
filename:       /lib/modules/3.10.0-123.el7.x86_64/updates/qedr.ko
version:        8.3.6.0
license:        GPL
author:         QLogic Corporation
description:    QLogic 40G/100G ROCE Driver
srcversion:     FC346F5749A39A75FEB4610
depends:        compat,ib_core,qede,qed
vermagic:       3.10.0-123.el7.x86_64 SMP mod_unload modversions
parm:           debug: Default debug msglevel (uint)
parm:           roce_100g: Run RoCE over 100G link - will cause performance degradation for regular L2 traffic over 100g (uint)
```
Additionally, libqedr is the userspace driver for QLogic RoCE adapters. The package installs libqedr.a and libqedr.so in /usr/lib64, and adds the file /etc/libibverbs.d/qedr.driver:
```
driver qedr
```
RDMA is started with the init script /etc/init.d/openibd. Once it is started, we can query the RDMA devices:
```
# ibv_devinfo
hca_id: qedr0
        transport:                      InfiniBand (0)
        fw_ver:                         8.3.9.0
        node_guid:                      020e:1eff:fec4:05f1
        sys_image_guid:                 020e:1eff:fec4:05f1
        vendor_id:                      0x1077
        vendor_part_id:                 5718
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
```

Note that RoCE[3] doesn't require an SM (InfiniBand subnet manager); it operates using standard Ethernet network management practices for L2 address assignments, L2 topology discovery, and switch filtering database (FDB) configuration[4].
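The same information can also be queried programmatically through libibverbs. Below is a minimal sketch (the file name query_roce.c and the build command are my own assumptions, not part of the QLogic software; it needs the libibverbs development headers and links with -libverbs). It enumerates the RDMA devices and reports whether port 1 runs over an Ethernet link layer, i.e. RoCE, much like the ibv_devinfo output above:

```c
/* query_roce.c -- a sketch, not part of the QLogic tools:
 * list the RDMA devices libibverbs sees and report whether port 1
 * uses an Ethernet link layer (RoCE) or native InfiniBand.
 * Assumed build: gcc query_roce.c -o query_roce -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num, i;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_port_attr port;

        if (ctx && !ibv_query_port(ctx, 1, &port))
            printf("%s: port 1 state %d, link_layer %s\n",
                   ibv_get_device_name(list[i]), port.state,
                   port.link_layer == IBV_LINK_LAYER_ETHERNET
                       ? "Ethernet (RoCE)" : "InfiniBand");
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```

On these servers one would expect it to report qedr0 with an Ethernet link layer, consistent with the ibv_devinfo output.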
However, ibnodes fails:
```
# ibnodes
src/query_smp.c:235; Failed to register SMI agent on ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
```

The manpage of ibnodes says ibnodes "walks the IB subnet topology or uses an already saved topology file and extracts the IB nodes", so it requires an SM (subnet manager) to work properly; but there is no SM in RoCE! Other utilities that are hard-coded to use an SM will likewise fail.
The QLogic 25GbE adapters are in PCIe slot 4 in each server. Since biosdevname is used by default on Dell systems[5], they are given the consistent interface name p4p1. The 25GbE interfaces on the 2 servers can be directly connected using an SFP+ direct attach copper cable; or they can both connect to a 100GbE switch (e.g., Dell Z9100-ON) using a QSFP to 4x SFP+ breakout cable. I've performed tests in both cases, and haven't seen any noticeable difference between the two in either bandwidth or latency.
The IPv4 addresses of the 25GbE interfaces (p4p1) on the 2 servers are 192.168.25.91 & 192.168.25.96, respectively.
First, make sure the firewall does not block traffic on the QLogic 25GbE (p4p1) and 100GbE (p6p1) interfaces, by adding the following two lines to /etc/sysconfig/iptables:
```
-A INPUT -i p4p1 -j ACCEPT
-A INPUT -i p6p1 -j ACCEPT
```

then restart the iptables service:
```
# systemctl restart iptables
```
We can test connectivity through the RDMA communication manager (RDMA_CM) using the rping ping-pong test. On the server:
```
# rping -s -S 35 -C 5 -a 192.168.25.96 -p 100
```
Then on the client:
```
# rping -c -v -S 35 -C 5 -a 192.168.25.96 -p 100
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTU
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUV
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVW
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWX
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXY
client DISCONNECT EVENT...
```
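Under the hood, rping first resolves the destination IPv4 address through librdmacm before exchanging ping data over RDMA. The following is a minimal sketch of that first step (it is not the rping source; the file name and the build assumption of linking with -lrdmacm and -libverbs are mine). On these servers one would expect it to report that 192.168.25.96 maps to the qedr0 device:

```c
/* resolve_roce.c -- a sketch of RDMA_CM address resolution, the first
 * step rping performs; not the actual rping implementation.
 * Assumed build: gcc resolve_roce.c -o resolve_roce -lrdmacm -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;
    struct sockaddr_in dst;

    if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_cm setup");
        return 1;
    }

    /* Destination follows the rping example above: 192.168.25.96, port 100. */
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(100);
    inet_pton(AF_INET, "192.168.25.96", &dst.sin_addr);

    /* Map the IP address to a local RDMA device (qedr0 on these servers). */
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000) ||
        rdma_get_cm_event(ch, &ev) ||
        ev->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
        fprintf(stderr, "address resolution failed\n");
        return 1;
    }
    printf("resolved to device %s\n",
           ibv_get_device_name(id->verbs->device));

    rdma_ack_cm_event(ev);
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    return 0;
}
```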
Next, we measure the RDMA bandwidth with ib_send_bw. On the server:
```
# ib_send_bw -q 8 --port=12500 -x 0 -a -F 2> /dev/null
```
On the client (note here we use the IPv4 address of p4p1 on the server):
```
# ib_send_bw 192.168.25.96 -q 8 --port=12500 -x 0 -a -F 2> /dev/null
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : qedr0
 Number of qps   : 8            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 Gid index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 #bytes   #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
 2        1000          8.36              8.33                 4.367733
 4        1000          16.71             16.71                4.379616
 8        1000          33.49             33.47                4.387179
 16       1000          66.98             66.91                4.385292
 32       1000          133.95            133.28               4.367329
 64       1000          267.90            267.75               4.386754
 128      1000          536.83            535.91               4.390151
 256      1000          1071.60           1069.66              4.381322
 512      1000          2139.16           2135.22              4.372941
 1024     1000          2696.39           2695.85              2.760550
 2048     1000          2834.19           2833.39              1.450694
 4096     1000          2904.73           2904.41              0.743529
 8192     1000          2909.46           2909.04              0.372357
 16384    1000          2911.10           2910.90              0.186297
 32768    1000          2911.92           2911.83              0.093178
 65536    1000          2912.28           2912.27              0.046596
 131072   1000          2912.48           2912.46              0.023300
 262144   1000          2912.59           2912.59              0.011650
 524288   1000          2912.69           2912.68              0.005825
 1048576  1000          2912.64           2912.63              0.002913
 2097152  1000          2912.69           2912.68              0.001456
 4194304  1000          2912.67           2912.67              0.000728
 8388608  1000          2912.66           2912.66              0.000364
---------------------------------------------------------------------------------------
```

Note that because of the CPU frequency discrepancy mentioned above, we have to use the -F option to keep the test from failing. We can see that the RDMA bandwidth is close to line speed (2912.66 MB/sec ≈ 23.3 Gbps).
Now measure the RDMA latency with ib_send_lat. On the server:
```
# ib_send_lat --port=12500 -x 0 -a -F 2> /dev/null
```
On the client:
```
# ib_send_lat 192.168.25.96 --port=12500 -x 0 -a -F 2> /dev/null
---------------------------------------------------------------------------------------
                    Send Latency Test
 Dual-port       : OFF          Device         : qedr0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 Gid index       : 0
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0xff0000 PSN 0xf84231
 GID: 254:128:00:00:00:00:00:00:02:14:30:255:254:196:05:241
 remote address: LID 0000 QPN 0xff0000 PSN 0xdea55a
 GID: 254:128:00:00:00:00:00:00:02:14:30:255:254:196:05:190
---------------------------------------------------------------------------------------
 #bytes   #iterations   t_min[usec]   t_max[usec]   t_typical[usec]
 2        1000          6.67          11.03         6.82
 4        1000          6.70          8.45          6.82
 8        1000          6.71          8.41          6.83
 16       1000          6.71          8.62          6.86
 32       1000          6.72          13.27         6.87
 64       1000          6.77          8.45          6.89
 128      1000          6.84          8.70          6.96
 256      1000          6.86          8.69          7.01
 512      1000          6.97          8.83          7.10
 1024     1000          7.24          9.06          7.37
 2048     1000          8.15          9.90          8.30
 4096     1000          10.11         12.03         10.26
 8192     1000          11.48         25.28         11.66
 16384    1000          14.14         625.12        14.30
 32768    1000          19.52         32.89         19.67
 65536    1000          30.25         43.76         30.40
 131072   1000          51.73         64.93         51.86
 262144   1000          94.61         107.97        94.78
 524288   1000          180.42        194.77        180.61
 1048576  1000          352.09        365.93        352.27
 2097152  1000          695.43        708.81        695.62
 4194304  1000          1383.92       1396.75       1384.16
 8388608  1000          2757.24       2770.96       2757.44
---------------------------------------------------------------------------------------
```

However, the RDMA latency is nowhere near the claimed 1 usec!
Open MPI can run on top of RoCE[6], in which case it uses the OFED Verbs-based openib BTL for traffic and its internal rdmacm CPC (Connection Pseudo-Component) for establishing connections for MPI traffic. Let's find out if we can get Open MPI to run on top of the QLogic RoCE implementation.
Install the openmpi package:
```
# yum install openmpi.x86_64 openmpi-devel.x86_64
```

which installs openmpi-1.6.4-5.el7.x86_64, along with environment-modules-3.2.10-8.el7.x86_64 as a dependency.
Set up passwordless SSH between the 2 servers for user qlogic:
```
[qlogic@apps4191 ~]$ cd .ssh/
[qlogic@apps4191 .ssh]$ ssh-keygen
[qlogic@apps4191 .ssh]$ cat id_rsa.pub >> authorized_keys
[qlogic@apps4191 .ssh]$ scp authorized_keys id_rsa id_rsa.pub 192.168.25.96:.ssh/
```
To automatically load the module for Open MPI, append the following line to ~qlogic/.bashrc:
```
module load mpi/openmpi-x86_64
```
Test mpirun:
```
[qlogic@apps4191 ~]$ module load mpi/openmpi-x86_64
[qlogic@apps4191 ~]$ mpirun --host 192.168.25.91,192.168.25.96 hostname
apps4191
apps4196
```
Download and compile a simple MPI program on the 2 servers:
```
$ mpicc -showme
gcc -I/usr/include/openmpi-x86_64 -pthread -m64 -L/usr/lib64/openmpi/lib -lmpi
$ wget http://pleiades.ucsc.edu/codes/mpi_hostname.c
$ mpicc mpi_hostname.c -o mpi_hostname.x
```
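For reference, a minimal MPI program along these lines looks roughly as follows (this is only a sketch of what mpi_hostname.c might contain; the actual file downloaded from pleiades.ucsc.edu may differ), producing the "Hello, I am <rank> of <size> running on <host>" output seen below:

```c
/* A sketch of a minimal mpi_hostname.c: each rank reports its rank,
 * the total number of ranks, and the host it runs on. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello, I am %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```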
Test running the simple MPI program on top of TCP:
```
[qlogic@apps4191 ~]$ mpirun --mca btl tcp,self -n 8 --cpus-per-proc 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
Hello, I am 0 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 4 of 8 running on apps4191
Hello, I am 6 of 8 running on apps4191
Hello, I am 1 of 8 running on apps4196
Hello, I am 3 of 8 running on apps4196
Hello, I am 5 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
[qlogic@apps4191 ~]$ mpirun --mca btl tcp,self -n 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
Hello, I am 0 of 8 running on apps4191
Hello, I am 1 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 3 of 8 running on apps4191
Hello, I am 4 of 8 running on apps4196
Hello, I am 5 of 8 running on apps4196
Hello, I am 6 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
```
However, running the MPI program on top of RoCE will result in a warning and an error:
```
[qlogic@apps4191 ~]$ mpirun --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm -n 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            apps4191
  Device name:           qedr0
  Device vendor ID:      0x1077
  Device vendor part ID: 5718

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured here:

  Local host:    apps4191
  OMPI source:   btl_openib_component.c:1200
  Function:      ompi_free_list_init_ex_new()
  Device:        qedr0
  Memlock limit: 65536

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
```
We'll ignore the warning for now. The error arises because the maximum locked memory limit is only 64 kbytes by default in CentOS 7:

```
[qlogic@apps4196 ~]$ ulimit -l
64
```
Set the maximum locked memory to unlimited, by creating /etc/security/limits.d/openfabrics.conf with the following content:
```
* soft memlock unlimited
* hard memlock unlimited
```
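Incidentally, the limit that the openib BTL complains about can also be checked from C with the standard getrlimit() call; the sketch below (the file name checklimit.c is hypothetical) simply prints RLIMIT_MEMLOCK, which is essentially what ulimit -l reports and what Open MPI inspects before registering memory:

```c
/* checklimit.c -- a sketch: print the max-locked-memory limit.
 * getrlimit() reports bytes; RLIM_INFINITY corresponds to the
 * "unlimited" value shown by ulimit -l (which reports kbytes). */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl)) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("max locked memory: unlimited\n");
    else
        printf("max locked memory: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    return 0;
}
```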
This may not be necessary; but I've also added the following line to /etc/pam.d/sshd
```
session    required    pam_limits.so
```

then restarted the sshd service on the 2 servers:
```
# systemctl restart sshd
```
Confirm the maximum locked memory is now unlimited:
```
[qlogic@apps4191 ~]$ ulimit -l
unlimited
[qlogic@apps4191 ~]$ ssh 192.168.25.96 ulimit -l
unlimited
[qlogic@apps4191 ~]$ mpirun -np 2 --host 192.168.25.91,192.168.25.96 bash -c ulimit -l
unlimited
unlimited
```
However, even after following the instructions in the Open MPI FAQ, running the MPI program on top of RoCE still produces a warning about registered memory:
```
[qlogic@apps4191 ~]$ mpirun --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm -n 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              apps4196
  Registerable memory:     32768 MiB
  Total memory:            130975 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
```
which states that the registerable memory is only 32 GB (out of a total memory size of 128 GB). Without delving deeper into the issue, I assume this is a bug/feature of the QLogic drivers and will ignore the warning for now.
The above mpirun command also produces a misleading error (see below):
```
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue.  This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).  The failure occured here:

  Local host:  apps4196
  OMPI source: btl_openib.c:308
  Function:    ibv_create_srq()
  Error:       Success (errno=0)
  Device:      qedr0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
```
More seriously, this mpirun command results in a fatal error!
```
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[56905,1],4]) is on host: apps4196
  Process 2 ([[56905,1],0]) is on host: apps4191
  BTLs attempted: openib self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[apps4196:5083] *** An error occurred in MPI_Init
[apps4196:5083] *** on a NULL communicator
[apps4196:5083] *** Unknown error
[apps4196:5083] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     Before MPI_INIT completed
  Local host: apps4196
  PID:        5083
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 5083 on
node 192.168.25.96 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[apps4191:05001] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[apps4191:05001] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[apps4191:05001] 1 more process has sent help message help-mpi-btl-openib.txt / reg mem limit low
[apps4191:05001] 1 more process has sent help message help-mpi-btl-openib.txt / init-fail-create-q
[apps4191:05001] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[apps4191:05001] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[apps4191:05001] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[apps4191:05001] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
```
After some trials, however, I found that running mpirun without the --mca btl openib,self,sm --mca btl_openib_cpc_include rdmacm options works, albeit with some misleading errors/warnings:
```
[qlogic@apps4191 ~]$ mpirun -np 8 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue.  This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).  The failure occured here:

  Local host:  apps4191
  OMPI source: btl_openib.c:308
  Function:    ibv_create_srq()
  Error:       Success (errno=0)
  Device:      qedr0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
Hello, I am 0 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 4 of 8 running on apps4191
Hello, I am 6 of 8 running on apps4191
Hello, I am 5 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
Hello, I am 1 of 8 running on apps4196
Hello, I am 3 of 8 running on apps4196
[apps4191:17465] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[apps4191:17465] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[apps4191:17465] 7 more processes have sent help message help-mpi-btl-openib.txt / reg mem limit low
[apps4191:17465] 7 more processes have sent help message help-mpi-btl-openib.txt / init-fail-create-q
```
The supposed openib BTL failure is clearly misleading: the message reports the ibv_create_srq() call as Success (errno=0), yet erroneously labels it a failure!
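Since the error text hints that the device may not support shared receive queues, one way to probe this (a sketch, under the assumption that SRQ support is reflected in the max_srq device attribute, which is how libibverbs advertises it; the file name srq_check.c is hypothetical) is to query the device attributes directly:

```c
/* srq_check.c -- a sketch: report whether each RDMA device advertises
 * shared receive queue (SRQ) support; max_srq of 0 would mean no SRQs,
 * which would explain ibv_create_srq() failing on qedr0.
 * Assumed build: gcc srq_check.c -o srq_check -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num, i;
    struct ibv_device **list = ibv_get_device_list(&num);

    if (!list)
        return 1;
    for (i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        struct ibv_device_attr attr;

        if (ctx && !ibv_query_device(ctx, &attr))
            printf("%s: max_srq = %d, max_srq_wr = %d\n",
                   ibv_get_device_name(list[i]),
                   attr.max_srq, attr.max_srq_wr);
        if (ctx)
            ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```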
The following works too:
```
[qlogic@apps4191 ~]$ mpirun -np 8 --npernode 4 --host 192.168.25.91,192.168.25.96 ./mpi_hostname.x
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue.  This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).  The failure occured here:

  Local host:  apps4191
  OMPI source: btl_openib.c:308
  Function:    ibv_create_srq()
  Error:       Success (errno=0)
  Device:      qedr0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
Hello, I am 0 of 8 running on apps4191
Hello, I am 1 of 8 running on apps4191
Hello, I am 2 of 8 running on apps4191
Hello, I am 3 of 8 running on apps4191
Hello, I am 6 of 8 running on apps4196
Hello, I am 7 of 8 running on apps4196
Hello, I am 4 of 8 running on apps4196
Hello, I am 5 of 8 running on apps4196
[apps4191:17554] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[apps4191:17554] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[apps4191:17554] 7 more processes have sent help message help-mpi-btl-openib.txt / reg mem limit low
[apps4191:17554] 7 more processes have sent help message help-mpi-btl-openib.txt / init-fail-create-q
```
In summary, there are some bugs in the QLogic RoCE driver and in Open MPI that prevent Open MPI on top of QLogic RoCE from fully working; but those bugs appear to be minor, and the QLogic Everest Driver for the RoCE protocol is still under active development. I expect Open MPI on top of QLogic RoCE will work flawlessly by the time the QLogic 25GbE & 100GbE adapters are officially released later in 2015.
1. QLogic Demonstrates End-to-End 100Gb Ethernet Networks With Dell
2. QLogic to Demonstrate End-to-End 25Gb, 50Gb and 100Gb Networking With HP at Interop Las Vegas
3. RDMA over Converged Ethernet (RoCE)
4. ConnectX-2 EN with RoCE
5. Consistent Network Device Naming in RHEL 7
6. How do I run Open MPI over RoCE?