Eureka Installation: InfiniBand - calab-ntu/gpu-cluster GitHub Wiki
kernel version > 3.10.0-1126.el7.x86_64
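To decide which of the two procedures on this page applies, the running kernel can be compared against the threshold used in the section headings. The following is a minimal sketch, assuming a `sort` that supports `-V` (version sort, as in GNU coreutils). The helper name `version_gt` is made up for illustration, and this page does not say which procedure to use when the versions are exactly equal.

```shell
#!/bin/sh
# Threshold taken from the two section headings on this page.
THRESHOLD="3.10.0-1126.el7.x86_64"

# version_gt A B: succeed if version string A sorts strictly after B.
# Hypothetical helper; relies on "sort -V" (version sort).
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

running="$(uname -r)"
if version_gt "$running" "$THRESHOLD"; then
    echo "$running is newer than $THRESHOLD: use the 'kernel version >' procedure"
elif version_gt "$THRESHOLD" "$running"; then
    echo "$running is older than $THRESHOLD: use the 'kernel version <' procedure"
else
    echo "$running equals $THRESHOLD: this case is not covered by the page"
fi
```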
- [Optional] If no original driver is installed
  - Check hardware
    lspci -v | grep Mellanox
    # 09:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
    # Subsystem: Mellanox Technologies Device 0003
  - Uninstall conflicting Intel Omni-Path packages first
    yum -y remove opa-libopamgt opa-address-resolution
- Install the new driver and uninstall the original one.
  cd /work1/xuanshan/caler_packages/MLNX_OFED_LINUX-5.7-1.0.2.0-rhel7.9-x86_64/
  ./mlnxofedinstall --add-kernel-support
  Result:
  Device #1:
  ----------
    Device Type:      ConnectX5
    Part Number:      MCX555A-ECA_Ax
    Description:      ConnectX-5 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
    PSID:             MT_0000000010
    PCI Device Name:  09:00.0
    Base GUID:        1c34da03005bd158
    Versions:         Current        Available
       FW             16.27.1016     16.34.1002
       PXE            3.5.0901       3.6.0700
       UEFI           14.20.0019     14.27.0014
    Status:           Update required
  ---------
  Found 1 device(s) requiring firmware update...
  Device #1: Updating FW ...
  FSMST_INITIALIZE - OK
  Writing Boot image component - OK
  Done
- Update your initramfs
dracut -f
- Load the new driver
/etc/init.d/openibd restart
- Check before reboot
ibstatus
default gid: fe80:0000:0000:0000:1c34:da03:005b:d1ac    # may differ from machine to machine
base lid: 0x3
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)    # must be 100 Gb/sec
link_layer: InfiniBand
- Reboot
- Check `ibstatus` again after reboot
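The `ibstatus` check before and after the reboot can be scripted. The following is a minimal sketch, assuming the output layout shown above. `check_ibstatus` is a hypothetical helper name, and it only looks at the three fields called out in this section (state, phys state, rate).

```shell
#!/bin/sh
# check_ibstatus: read "ibstatus" output on stdin and succeed only if it
# contains an ACTIVE state, a LinkUp physical state, and a 100 Gb/sec rate.
# Note: with several devices or ports, this passes as soon as each pattern
# appears somewhere in the output; it does not check every port separately.
check_ibstatus() {
    out="$(cat)"
    printf '%s\n' "$out" | grep -q 'state:[[:space:]]*4: ACTIVE'      || return 1
    printf '%s\n' "$out" | grep -q 'phys state:[[:space:]]*5: LinkUp' || return 1
    printf '%s\n' "$out" | grep -q 'rate:[[:space:]]*100 Gb/sec'      || return 1
}

# Example (run on a node with the driver loaded):
#   ibstatus | check_ibstatus && echo "IB link OK" || echo "IB link NOT OK"
```

Running it once before the reboot and once after gives a quick yes/no answer instead of reading the fields by eye.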
kernel version < 3.10.0-1126.el7.x86_64
- Check hardware
  lspci -v | grep Mellanox
  # 09:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
  # Subsystem: Mellanox Technologies Device 0003
- Install driver: https://docs.mellanox.com/display/OFEDv501000/Installation
- Uninstall conflicting Intel Omni-Path packages first
  yum -y remove opa-libopamgt opa-address-resolution
  cd /work1/shared/eureka/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.7-x86_64
  ./mlnxofedinstall -h    # Print usage
- Choose one of the following three commands
  --> Currently used firmware: /work1/shared/eureka/package/ib/adaptor/firmware/fw-ConnectX5-rel-16_27_1016-MCX555A-ECA_Ax-UEFI-14.20.19-FlexBoot-3.5.901.bin.zip
  --> For unattended installation, add `--force` after `./mlnxofedinstall` (suitable for installing multiple nodes in parallel using pdsh)
  ./mlnxofedinstall                                        # update firmware automatically [use this until new firmware release]
  ./mlnxofedinstall --fw-image-dir /tmp/my_fw_bin_files    # specify the firmware (see above for the currently used version)
  ./mlnxofedinstall --without-fw-update                    # no firmware update
  Example results
  ...
  ...
  Device #1:
  ----------
    Device Type:      ConnectX5
    Part Number:      MCX555A-ECA_Ax
    Description:      ConnectX-5 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
    PSID:             MT_0000000010
    PCI Device Name:  09:00.0
    Base GUID:        1c34da03005bd1b8
    Versions:         Current        Available
       FW             16.25.1020     16.27.1016
       PXE            3.5.0701       3.5.0901
       UEFI           14.18.0019     14.20.0019
    Status:           Update required
  ---------
  Found 1 device(s) requiring firmware update...
  Device #1: Updating FW ...
  Initializing image partition - OK
  Writing Boot image component - OK
  Restart needed for updates to take effect.
  Log File: /tmp/MLNX_OFED_LINUX.18682.logs/fw_update.log
  To load the new driver, run:
  /etc/init.d/openibd restart
- Reboot
- If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the "mlnx_add_kernel_support.sh" script located inside the MLNX_OFED package.
  Ref: https://docs.mellanox.com/display/OFEDv501000/Installing+Mellanox+OFED
  --> Search "mlnx_add_kernel_support.sh"
  https://community.mellanox.com/s/article/howto-compile-mlnx-ofed-for-different-linux-kernel-distribution--160---160-x
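For the unattended multi-node case mentioned above (`--force` combined with pdsh), something along these lines may work. This is a sketch only: the node list `eureka[01-02]` is a placeholder, `print_install_cmd` is a hypothetical helper, and it only prints the pdsh command (using pdsh's `-w` option to select target hosts) so that it can be reviewed before anyone runs it.

```shell
#!/bin/sh
# Placeholder node list (assumption); replace with the real target hosts.
NODES="eureka[01-02]"
# Driver directory as used earlier in this section.
OFED_DIR="/work1/shared/eureka/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.7-x86_64"

# print_install_cmd NODES DIR: print (do not execute) a pdsh command that
# runs the unattended installer in DIR on every node in NODES.
print_install_cmd() {
    printf "pdsh -w '%s' 'cd %s && ./mlnxofedinstall --force'\n" "$1" "$2"
}

print_install_cmd "$NODES" "$OFED_DIR"
```

Printing rather than executing is deliberate here: the installer replaces drivers and may update firmware, so the exact command line is worth a second look before it goes to many nodes at once.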
Be aware that the IB driver will install the following packages under /opt, which may overwrite previously installed packages:
mlnx-ethtool mlnx-iproute2 mlnx-fw-updater ibutils knem hcoll mxm sharp
`rpm -ql package_name` can list the files installed by these RPMs --> Some files are installed under /etc and /usr as well
ibutils
  /etc/ld.so.conf.d/ibutils.conf
  /etc/profile.d/ibutils.csh
  /etc/profile.d/ibutils.sh
knem
  /etc/udev/rules.d/10-knem.rules
  /usr/lib64/pkgconfig/knem.pc
hcoll
  /etc/ld.so.conf.d/hcoll.conf
  /usr/lib64/pkgconfig/hcoll.pc
mxm
  /etc/ld.so.conf.d/mxm.conf
  /usr/lib64/pkgconfig/mxm.pc
sharp
  /etc/ld.so.conf.d/sharp.conf
  /usr/lib64/pkgconfig/sharp.pc
In principle, we can use `./mlnxofedinstall --without-<package>` to exclude these packages. However, dependencies among the various packages can be tricky, so currently we just reinstall all packages. To avoid this potential issue, we install all cluster-wide software in `/software` instead of `/opt`.
--> The `/opt` folder is no longer shared by NFS; each node has its own `/opt`
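To see which files a driver package places outside `/opt` (the `/etc` and `/usr` entries listed above), the `rpm -ql` output can be filtered. The following is a minimal sketch; `outside_opt` is a hypothetical helper name, and the package names in the commented loop are the ones listed above.

```shell
#!/bin/sh
# outside_opt: read a file list on stdin (one path per line, as printed by
# "rpm -ql") and print only the paths that are not under /opt.
outside_opt() {
    grep -v '^/opt/'
}

# Example (run on a node where the packages are installed):
#   for pkg in ibutils knem hcoll mxm sharp; do
#       echo "== $pkg"
#       rpm -ql "$pkg" | outside_opt
#   done
```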
- Check
  - Basic
    cat /etc/security/limits.conf
    # * soft memlock unlimited
    # * hard memlock unlimited
    systemctl status openibd        # Active: active (exited)
    systemctl is-enabled openibd    # enabled
    systemctl status opensmd        # Active: inactive (dead)
    systemctl is-enabled opensmd    # disabled
    hca_self_test.ofed    # Ref: /work1/shared/eureka/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.7-x86_64/docs/readme_and_user_manual/hca_self_test.readme
    ---- Performing Adapter Device Self Test ----
    Number of CAs Detected ................. 1
    PCI Device Check ....................... PASS
    Kernel Arch ............................ x86_64
    Host Driver Version .................... MLNX_OFED_LINUX-5.0-1.0.0.0 (OFED-5.0-1.0.0.0): 3.10.0-1062.el7.x86_64
    Host Driver RPM Check .................. PASS
    Firmware on CA #0 HCA .................. v16.27.1016
    Host Driver Initialization ............. PASS
    Number of CA Ports Active .............. 1
    Port State of Port #1 on CA #0 (HCA)..... UP 4X EDR (InfiniBand)
    Error Counter Check on CA #0 (HCA)...... PASS
    Kernel Syslog Check .................... PASS
    Node GUID on CA #0 (HCA) ............... 1c:34:da:03:00:5b:d1:b8
    ------------------ DONE ---------------------
    ibdev2netdev -v | grep -i MCX
    0000:09:00.0 mlx5_0 (MT4119 - MCX555A-ECAT) CX555A - ConnectX-5 QSFP28 fw 16.27.1016 port 1 (ACTIVE) ==> ib0 (Up)
    # check PCIe speed: both link capability (LnkCap) and status (LnkSta) should show "Speed 8GT/s, Width x16" --> raw bandwidth = 8GT/s*16 = 128 Gb/s
    lspci -s 09:00.0 -vvv | grep Width
    LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
    LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
  - Bandwidth
    ib_write_bw -aF             # on eureka00
    ib_write_bw -aF eureka00    # on eurekaXX
    ---------------------------------------------------------------------------------------
                        RDMA_Write BW Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x03 QPN 0x010c PSN 0x7e8beb RKey 0x00146c VAddr 0x007f5c84bbd000
     remote address: LID 0x02 QPN 0x010c PSN 0xa3df05 RKey 0x001452 VAddr 0x007fe666513000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     2          5000           8.01               7.85                 4.118260
     4          5000           20.25              19.82                5.196505
     8          5000           40.57              40.27                5.278838
     16         5000           81.13              80.27                5.260790
     32         5000           162.27             161.04               5.276990
     64         5000           324.04             323.05               5.292797
     128        5000           648.09             636.96               5.217944
     256        5000           1300.13            1288.37              5.277154
     512        5000           2596.31            2571.03              5.265473
     1024       5000           5192.61            5142.59              5.266016
     2048       5000           10338.05           10234.47             5.240046
     4096       5000           11723.33           11649.54             2.982283
     8192       5000           11763.95           11756.10             1.504781
     16384      5000           11776.66           11775.13             0.753608
     32768      5000           11798.25           11797.31             0.377514
     65536      5000           11809.13           11808.82             0.188941
     131072     5000           11813.66           11813.56             0.094508
     262144     5000           11816.62           11816.50             0.047266
     524288     5000           11817.90           11817.90             0.023636
     1048576    5000           11817.55           11817.52             0.011818
     2097152    5000           11817.93           11817.91             0.005909
     4194304    5000           11818.06           11818.05             0.002955
     8388608    5000           11819.12           11819.12             0.001477
    ---------------------------------------------------------------------------------------
    ib_read_bw -aF             # on eureka00
    ib_read_bw -aF eureka00    # on eurekaXX
    ---------------------------------------------------------------------------------------
                        RDMA_Read BW Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Outstand reads  : 16
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x03 QPN 0x010d PSN 0xdf80e6 OUT 0x10 RKey 0x00146d VAddr 0x007f65ed58b000
     remote address: LID 0x02 QPN 0x010d PSN 0x44d7d7 OUT 0x10 RKey 0x001453 VAddr 0x007fada14db000
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     2          1000           7.94               7.49                 3.929325
     4          1000           20.22              20.20                5.294230
     8          1000           40.69              40.67                5.330739
     16         1000           81.38              81.24                5.323896
     32         1000           162.77             162.54               5.326176
     64         1000           325.53             305.15               4.999600
     128        1000           650.07             609.05               4.989373
     256        1000           1296.18            1176.64              4.819505
     512        1000           2596.30            2592.49              5.309415
     1024       1000           5153.42            5062.49              5.183991
     2048       1000           8954.27            8940.89              4.577735
     4096       1000           10890.85           10885.68             2.786733
     8192       1000           11693.39           11690.36             1.496366
     16384      1000           11738.75           11736.56             0.751140
     32768      1000           11777.96           11777.50             0.376880
     65536      1000           11796.46           11795.85             0.188734
     131072     1000           11810.18           11810.12             0.094481
     262144     1000           11813.58           11813.42             0.047254
     524288     1000           11816.50           11816.46             0.023633
     1048576    1000           11817.64           11817.60             0.011818
     2097152    1000           11818.87           11818.85             0.005909
     4194304    1000           11818.21           11818.21             0.002955
     8388608    1000           11817.71           11817.71             0.001477
    ---------------------------------------------------------------------------------------
  - Other useful commands
    - Single node
      ibstat    # Displays IB operational information pertaining to a specified Host Channel Adapter (HCA)
                # --> "Physical state" should be "LinkUp"
      CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.27.1016
        Hardware version: 0
        Node GUID: 0x1c34da03005bd1b8
        System image GUID: 0x1c34da03005bd1b8
        Port 1:
          State: Active
          Physical state: LinkUp
          Rate: 100
          Base lid: 4
          LMC: 0
          SM lid: 1
          Capability mask: 0x2651e848
          Port GUID: 0x1c34da03005bd1b8
          Link layer: InfiniBand
      ibstatus    # Query basic status of IB device(s)
      Infiniband device 'mlx5_0' port 1 status:
        default gid: fe80:0000:0000:0000:1c34:da03:0057:b0a8
        base lid: 0x2
        sm lid: 0x1
        state: 4: ACTIVE
        phys state: 5: LinkUp
        rate: 100 Gb/sec (4X EDR)
        link_layer: InfiniBand
    - All nodes
      ibnetdiscover    # Show the lid <--> hostname mapping
      ibhosts          # Show all available nodes
      ibnodes          # Show all available nodes plus switch
      ibswitches       # Show the switch
      iblinkinfo       # Show the status and speed of all links
      ibdiagnet        # Show diagnostic information for entire subnet
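Two of the Basic checks above have explicit pass criteria (the memlock limits, and the PCIe `LnkCap`/`LnkSta` lines), so they can be automated. The following is a minimal sketch, assuming the text layouts shown on this page. `check_memlock` and `check_pcie_link` are hypothetical helper names, and other `lspci` versions may print the link fields differently, in which case the pattern would need adjusting.

```shell
#!/bin/sh
# check_memlock: read limits.conf content on stdin; succeed only if both
# "* soft memlock unlimited" and "* hard memlock unlimited" are present.
check_memlock() {
    conf="$(cat)"
    for kind in soft hard; do
        printf '%s\n' "$conf" |
            grep -Eq "^[[:space:]]*\*[[:space:]]+$kind[[:space:]]+memlock[[:space:]]+unlimited" ||
            return 1
    done
}

# check_pcie_link: read "lspci -vvv | grep Width" output on stdin; succeed
# only if both LnkCap and LnkSta report "Speed 8GT/s, Width x16".
check_pcie_link() {
    link="$(cat)"
    for field in LnkCap LnkSta; do
        printf '%s\n' "$link" |
            grep -q "$field:.*Speed 8GT/s, Width x16" ||
            return 1
    done
}

# Example (run on a compute node; 09:00.0 is the PCI address seen above):
#   check_memlock < /etc/security/limits.conf && echo "memlock OK"
#   lspci -s 09:00.0 -vvv | grep Width | check_pcie_link && echo "PCIe OK"
```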
- [Optional] Update firmware
  https://docs.mellanox.com/display/OFEDv501000/Updating+Firmware+After+Installation
  https://www.mellanox.com/support/firmware/firmware-downloads
- [Optional] Performance tuning
  https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters    # check AMD NUMA, especially
  https://community.mellanox.com/s/article/how-to-tune-an-amd-server--eypc-cpu--for-maximum-performance
  https://community.mellanox.com/s/article/How-to-Tune-Your-Linux-Server-for-Best-Performance-Using-the-mlnx-tune-Tool
  https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
  https://community.mellanox.com/s/article/howto-install-iperf-and-test-mellanox-adapters-performance
- Additional info
  - Installation logs
    /tmp/MLNX_OFED_LINUX.xxxxx.logs
  - Query installation info
    /etc/infiniband/info
    prefix=/usr
    Kernel=3.10.0-1062.el7.x86_64
    Configure options: --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mlxfw-mod \
                       --with-mlx4-mod --with-mlx4_en-mod --with-mlx5-mod --with-ipoib-mod --with-innova-flex --with-mdev-mod \
                       --with-srp-mod --with-iser-mod --with-isert-mod
  - OpenIB configuration file
    /etc/infiniband/openib.conf
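The `Kernel=` entry shown above for `/etc/infiniband/info` records the kernel the driver was built for, which is useful to compare with the running kernel after a kernel update. The following is a minimal sketch; `built_for_kernel` is a hypothetical helper name, and it assumes only that a `Kernel=<version>` token appears somewhere in the text fed to it.

```shell
#!/bin/sh
# built_for_kernel: read the installation info text on stdin and print the
# value of the first "Kernel=<version>" token found.
built_for_kernel() {
    tr ' ' '\n' | sed -n 's/^Kernel=//p' | head -n 1
}

# Example (run on a node with the driver installed):
#   built="$(built_for_kernel < /etc/infiniband/info)"
#   [ "$built" = "$(uname -r)" ] || echo "driver built for $built, running $(uname -r)"
```

Splitting on spaces first means the helper works whether the entries arrive one per line or all on a single line.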
-
- General:
  https://www.recitalsoftware.com/blogs/32-howto-getting-started-with-infiniband-on-redhat-centos-linux
  https://wiki.archlinux.org/index.php/InfiniBand#For_Mellanox
  /work1/shared/eureka/package/ib/adaptor/driver/MLNX_OFED\ Documentation\ Rev\ 5.0-1.0.0.0__03_06_2020.pdf
  /work1/shared/eureka/package/ib/adaptor/firmware/ConnectX-5\ Adapter\ Cards\ Firmware\ Release\ Notes__03_07_2020.pdf
- Increase memory size used by HCA:
  https://community.mellanox.com/s/article/howto-increase-memory-size-used-by-mellanox-adapters
  https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
- OpenSM:
  https://docs.mellanox.com/display/OFEDv501000/OpenSM
- IPoIB:
  https://docs.mellanox.com/pages/viewpage.action?pageId=25138271
  https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-configuring_ipoib
- HCA tool:
  Download: https://www.mellanox.com/products/adapter-software/firmware-tools
  Manual: https://docs.mellanox.com/display/MFTv4140/Introduction
  Example: https://access.redhat.com/articles/3082811
- Configure IB interface (ib0)
  https://judexzhu.github.io/Quick-config-for-Infiniband/
  https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/ch-Configuring_IP_Networking#sec-Selecting_Network_Configuration_Methods
- Troubleshooting
  https://hasanmansur.com/2012/10/15/infiniband-troubleshooting/
- Install hardware:
  Ref: https://www.mellanox.com/sites/default/files/related-docs/user_manuals/SB77X0_SB78X0_SB67X0_1U_IB_Systems_Quick_Installation_Guide.pdf
  https://docs.mellanox.com/display/SB77X0EDR/Static+Rail+Kit
- Power on: Ref: https://docs.mellanox.com/display/SB77X0EDR/Initial+Power+On
  --> All LEDs should be green
- Configuration:
  0. General ref: https://docs.mellanox.com/display/MLNXOSv382204/Getting+Started#GettingStarted-ConfSw41TConfiguringtheSwitchfortheFirstTime-
  Connect a host PC (e.g., my Mac) to the console (RJ-45) port of the switch using the supplied RJ-45-to-DB9 cable + DB9-to-USB cable
  - Login (using my Mac): https://software.intel.com/en-us/setting-up-serial-terminal-on-system-with-mac-os-x
    Get the USB device name by
    ls /dev/cu.usbserial-*    # on my Mac it shows "/dev/cu.usbserial-14140"
    Log in by
    screen /dev/cu.usbserial-14140 115200 -L    # press "Enter" twice
    Username: admin
    Password: admin
    To end the session, type "Ctrl+A" and then "Ctrl+K"
- Configure: https://docs.mellanox.com/display/SB77X0EDR/System+Bring-Up+of+Managed+Systems
  Please answer 'yes' or 'no'.
  Do you want to use the wizard for initial configuration? yes
  Step 1: Hostname? [switch-d33330] switch-sb7800
  Step 2: Use DHCP on mgmt0 interface? [yes] no
  Step 3: Use zeroconf on mgmt0 interface? [no]
  Step 4: Primary IPv4 address and masklen? [0.0.0.0/0] 192.168.0.10/24
  Step 5: Default gateway? 192.168.0.1
  Step 6: Primary DNS server? 140.112.254.4
  Step 7: Domain name?
  Step 8: Enable IPv6? [yes]
  Step 9: Enable IPv6 autoconfig (SLAAC) on mgmt0 interface? [no]
  Step 10: Enable DHCPv6 on mgmt0 interface? [yes] no
  Step 11: Admin password (Enter to leave unchanged)?    # same as eureka
  Step 11: Confirm admin password?
- Check
  - check software version
    show version
    switch-sb7800 [standalone: master] > show version
    Product name:      MLNX-OS
    Product release:   3.8.1054
    Build ID:          #1-dev
    Build date:        2019-05-23 14:17:14
    Target arch:       x86_64
    Target hw:         x86_64
    Built by:          jenkins@336bea182aae
    Version summary:   X86_64 3.8.1054 2019-05-23 14:17:14 x86_64
    Product model:     x86onie
    Host ID:           B8599FD33330
    System serial num: MT1942J01109
    System UUID:       33456328-ee72-11e9-8000-b8599fa613e0
    Uptime:            1h 5m 42.640s
    CPU load averages: 3.08 / 3.06 / 3.01
    Number of CPUs:    2
    System memory:     414 MB used / 3349 MB free / 3763 MB total
    Swap:              0 MB used / 0 MB free / 0 MB total
  - check the mgmt0 interface
    enable    # to enter the "Enable" mode
    show interfaces mgmt0
    -----------------------------------
    Interface mgmt0 status:
      Comment         :
      Admin up        : yes
      Link up         : no
      DHCP running    : no
      IP address      : 192.168.0.10
      Netmask         : 255.255.255.0
      IPv6 enabled    : yes
      Autoconf enabled: no
      Autoconf route  : yes
      Autoconf privacy: no
      DHCPv6 running  : no
      Speed           : UNKNOWN
      Duplex          : UNKNOWN
      Interface type  : ethernet
      Interface source: bridge
      MTU             : 1500
      HW address      : B8:59:9F:D3:33:30
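The wizard takes the management address with a mask length (`192.168.0.10/24`), while `show interfaces mgmt0` reports a dotted netmask (`255.255.255.0`). When cross-checking the two, the conversion can be done with shell arithmetic. The following is a minimal sketch to run on a host PC, not on the switch; `prefix_to_netmask` is a hypothetical helper name.

```shell
#!/bin/sh
# prefix_to_netmask N: print the dotted-decimal IPv4 netmask for mask
# length N (0..32), e.g. the "/24" entered in Step 4 of the wizard.
prefix_to_netmask() {
    bits=$(( (0xffffffff << (32 - $1)) & 0xffffffff ))
    echo "$(( (bits >> 24) & 255 )).$(( (bits >> 16) & 255 )).$(( (bits >> 8) & 255 )).$(( bits & 255 ))"
}

prefix_to_netmask 24    # prints 255.255.255.0
```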
-
enable
reload halt
Configuration has been modified; save first? [yes] yes
Configuration changes saved.
Halting system...
switch-sb7800 [standalone: master] #
System shutdown initiated -- logging off.
Connection to 192.168.0.10 closed.
After printing the above messages, unplug both power cords.
- Serial terminal
  Repeat steps 2-1 and 2-2 in "Switch Initialization"
  Username: admin
  Password:    # same as eureka
- SSH
```
enable # to enter the "Enable" mode
configure terminal # to enter the "Config" mode
ib smnode switch-sb7800 enable # enable OpenSM
show ib sm # check --> should show "enabled"
no configure # to exit the "Config" mode
```
- General:
  https://docs.mellanox.com/display/SB77X0EDR
  https://www.mellanox.com/related-docs/user_manuals/1U_HW_QSG_SB7700.pdf
  https://community.mellanox.com/s/article/howto-get-started-with-mellanox-switches
- OS:
  https://docs.mellanox.com/display/MLNXOSv382204
  /work1/shared/eureka/package/ib/switch/MLNX-OS\ User\ Manual\ v3.8.2204__03_04_2020.pdf
- Install:
  https://www.mellanox.com/sites/default/files/related-docs/user_manuals/SB77X0_SB78X0_SB67X0_1U_IB_Systems_Quick_Installation_Guide.pdf
  https://docs.mellanox.com/display/SB77X0EDR/Installation