Eureka Installation: InfiniBand - calab-ntu/gpu-cluster GitHub Wiki

Installation Procedure

ADAPTOR

Update or install Driver

kernel version > 3.10.0-1126.el7.x86_64
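
Which of the two procedures on this page applies depends on how the running kernel compares with 3.10.0-1126.el7.x86_64. The following is a minimal sketch of that comparison, assuming GNU sort with version-sort support (sort -V) is available. The function name compare_kernel is made up for this example, and the "equal" case is reported separately because this page only covers "greater than" and "less than".

```shell
# Hypothetical helper: compare two kernel release strings.
# Prints "newer", "older", or "equal" for $1 relative to $2.
compare_kernel() (
    current="$1"
    threshold="$2"
    if [ "$current" = "$threshold" ]; then
        echo equal
        exit 0
    fi
    # sort -V orders version strings; the first line is the lower version.
    lowest=$(printf '%s\n%s\n' "$current" "$threshold" | sort -V | head -n 1)
    if [ "$lowest" = "$threshold" ]; then
        echo newer
    else
        echo older
    fi
)

# Compare the running kernel with the threshold used on this page.
compare_kernel "$(uname -r)" "3.10.0-1126.el7.x86_64"
```

"newer" points to the "Update or install Driver" procedure, "older" to the "Install Driver" procedure further down.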

  1. [Optional] If no original driver is installed
    1. Check hardware
      lspci -v | grep Mellanox
      
         # 09:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
         # Subsystem: Mellanox Technologies Device 0003
      
    2. Uninstall conflicting Intel Omni-Path packages first
      yum -y remove opa-libopamgt opa-address-resolution
  2. Install the new driver and uninstall the original one.
    cd /work1/xuanshan/caler_packages/MLNX_OFED_LINUX-5.7-1.0.2.0-rhel7.9-x86_64/
    ./mlnxofedinstall --add-kernel-support
    
    Result:
    Device #1:
    ----------
      Device Type:      ConnectX5
      Part Number:      MCX555A-ECA_Ax
      Description:      ConnectX-5 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
      PSID:             MT_0000000010
      PCI Device Name:  09:00.0
      Base GUID:        1c34da03005bd158
      Versions:         Current        Available
      FW             16.27.1016     16.34.1002
      PXE            3.5.0901       3.6.0700
      UEFI           14.20.0019     14.27.0014
      Status:           Update required
    ---------
    Found 1 device(s) requiring firmware update...
    Device #1: Updating FW ...
    FSMST_INITIALIZE -   OK
    Writing Boot image component -   OK
    Done
    
  3. Update your initramfs
    dracut -f
    
  4. Load the new driver
    /etc/init.d/openibd restart
    
  5. Check before reboot
    ibstatus
    
    default gid:     fe80:0000:0000:0000:1c34:da03:005b:d1ac (differs from machine to machine)
    base lid:        0x3
    sm lid:          0x1
    state:           4: ACTIVE
    phys state:      5: LinkUp
    rate:            100 Gb/sec (4X EDR)                     (must be 100 Gb/sec)
    link_layer:      InfiniBand
    
  6. Reboot
  7. Check ibstatus again after reboot
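
The ibstatus checks in steps 5 and 7 (state ACTIVE, phys state LinkUp, rate 100 Gb/sec) can be automated. The following is a rough sketch, not part of MLNX_OFED: check_ibstatus is a hypothetical helper that reads ibstatus output on stdin. It assumes a single device with a single port; with several ports only the last one seen is evaluated.

```shell
# Hypothetical helper: validate "ibstatus" output read from stdin.
# Prints OK (exit 0) when the port is ACTIVE, LinkUp, and at 100 Gb/sec,
# otherwise prints FAIL (exit 1).
check_ibstatus() {
    awk '
        /^[[:space:]]*state:/      { state = ($0 ~ /ACTIVE/) }
        /^[[:space:]]*phys state:/ { phys  = ($0 ~ /LinkUp/) }
        /^[[:space:]]*rate:/       { rate  = ($0 ~ /rate:[[:space:]]+100 Gb\/sec/) }
        END {
            if (state && phys && rate) { print "OK"; exit 0 }
            print "FAIL"; exit 1
        }
    '
}

# Run against the live adapter only if ibstatus is installed.
if command -v ibstatus >/dev/null 2>&1; then
    ibstatus | check_ibstatus || echo "InfiniBand link is not ready" >&2
fi
```

Running the same check before and after the reboot makes it easy to spot a node whose link came up at a lower rate.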

Install Driver

kernel version < 3.10.0-1126.el7.x86_64

  1. Ref: https://docs.mellanox.com/display/OFEDv501000

  2. Check hardware

    lspci -v | grep Mellanox
    
       # 09:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
       # Subsystem: Mellanox Technologies Device 0003
    
  3. Install driver: https://docs.mellanox.com/display/OFEDv501000/Installation

    1. Uninstall conflicting Intel Omni-Path packages first

      yum -y remove opa-libopamgt opa-address-resolution
      
      cd /work1/shared/eureka/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.7-x86_64
      ./mlnxofedinstall -h # Print usage
    2. Choose one of the following three commands
      --> Currently used firmware: /work1/shared/eureka/package/ib/adaptor/firmware/fw-ConnectX5-rel-16_27_1016-MCX555A-ECA_Ax-UEFI-14.20.19-FlexBoot-3.5.901.bin.zip
      --> For unattended installation, add --force after ./mlnxofedinstall (suitable for installing on multiple nodes in parallel using pdsh)

      ./mlnxofedinstall                                     # update firmware automatically [use this until a new firmware is released]
      ./mlnxofedinstall --fw-image-dir /tmp/my_fw_bin_files # specify the firmware (see above for the currently used version)
      ./mlnxofedinstall --without-fw-update                 # no firmware update
      

      Example results

         ...
         ...
         Device #1:
         ----------
      
           Device Type:      ConnectX5
           Part Number:      MCX555A-ECA_Ax
           Description:      ConnectX-5 VPI adapter card; EDR IB (100Gb/s) and 100GbE; single-port QSFP28; PCIe3.0 x16; tall bracket; ROHS R6
           PSID:             MT_0000000010
           PCI Device Name:  09:00.0
           Base GUID:        1c34da03005bd1b8
           Versions:         Current        Available
              FW             16.25.1020     16.27.1016
              PXE            3.5.0701       3.5.0901
              UEFI           14.18.0019     14.20.0019
      
           Status:           Update required
      
         ---------
         Found 1 device(s) requiring firmware update...
      
         Device #1: Updating FW ...
         Initializing image partition -   OK
         Writing Boot image component -   OK
      
         Restart needed for updates to take effect.
         Log File: /tmp/MLNX_OFED_LINUX.18682.logs/fw_update.log
         To load the new driver, run:
         /etc/init.d/openibd restart
      
    3. Reboot
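
The three firmware choices in step 2, combined with --force for unattended runs, can be assembled by a small helper before handing the command to pdsh. This is only a sketch: build_install_cmd and its mode names (auto, dir, none) are invented for this example, the node list is an assumed placeholder, and the script prints the pdsh command line instead of executing it.

```shell
# Hypothetical helper: print the installer command for one firmware mode.
#   auto -> let the installer update the firmware automatically
#   dir  -> take firmware images from the directory given as $2
#   none -> skip the firmware update
build_install_cmd() (
    mode="$1"
    fw_dir="$2"
    cmd="./mlnxofedinstall --force"   # --force: unattended installation
    case "$mode" in
        auto) ;;
        dir)  cmd="$cmd --fw-image-dir $fw_dir" ;;
        none) cmd="$cmd --without-fw-update" ;;
        *)    echo "unknown mode: $mode" >&2; exit 1 ;;
    esac
    echo "$cmd"
)

driver_dir=/work1/shared/eureka/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.7-x86_64
nodes='eureka[01-02]'   # assumed host list, adjust to the real node names

# Dry run: print the pdsh command line instead of executing it.
echo "pdsh -w '$nodes' 'cd $driver_dir && $(build_install_cmd auto)'"
```

Printing the command first gives a chance to review the node list and flags before anything is installed.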

Optional

If your kernel version does not match any of the offered pre-built RPMs, you can add support for your kernel version
by using the "mlnx_add_kernel_support.sh" script located inside the MLNX_OFED package.
Ref: https://docs.mellanox.com/display/OFEDv501000/Installing+Mellanox+OFED
--> Search "mlnx_add_kernel_support.sh"
https://community.mellanox.com/s/article/howto-compile-mlnx-ofed-for-different-linux-kernel-distribution--160---160-x

Remark

Be aware that the IB driver installs the following packages under /opt, which may overwrite previously installed packages:

mlnx-ethtool mlnx-iproute2 mlnx-fw-updater ibutils knem hcoll mxm sharp

rpm -ql package_name lists the files installed by each of these RPMs --> some files are installed under /etc and /usr as well

  • ibutils
    /etc/ld.so.conf.d/ibutils.conf
    /etc/profile.d/ibutils.csh
    /etc/profile.d/ibutils.sh

  • knem
    /etc/udev/rules.d/10-knem.rules
    /usr/lib64/pkgconfig/knem.pc

  • hcoll
    /etc/ld.so.conf.d/hcoll.conf
    /usr/lib64/pkgconfig/hcoll.pc

  • mxm
    /etc/ld.so.conf.d/mxm.conf
    /usr/lib64/pkgconfig/mxm.pc

  • sharp
    /etc/ld.so.conf.d/sharp.conf
    /usr/lib64/pkgconfig/sharp.pc
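
The per-package file lists above can be regenerated on any node by filtering the rpm -ql output. A minimal sketch follows; outside_opt is a hypothetical helper name, and the loop runs only where rpm and the package are actually present.

```shell
# Hypothetical filter: keep only paths that are NOT under /opt.
outside_opt() {
    grep -v '^/opt/' || true   # an empty result is not an error here
}

# On a node with rpm, list the non-/opt files of each package named above.
if command -v rpm >/dev/null 2>&1; then
    for pkg in ibutils knem hcoll mxm sharp; do
        if rpm -q "$pkg" >/dev/null 2>&1; then
            echo "== $pkg"
            rpm -ql "$pkg" | outside_opt
        fi
    done
fi
```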

In principle, we can use ./mlnxofedinstall --without-<package> to exclude these packages. However, dependencies among the various packages can be tricky. Therefore, we currently just reinstall all packages.

To get rid of this potential issue, we install all cluster-wide software in /software instead of /opt
--> The /opt folder is no longer shared by NFS; each node has its own /opt

  1. Check

    1. Basic
      cat /etc/security/limits.conf # * soft memlock unlimited
                                    # * hard memlock unlimited
      
      systemctl status openibd     # Active: active (exited)
      systemctl is-enabled openibd # enabled
      
      systemctl status opensmd     # Active: inactive (dead)
      systemctl is-enabled opensmd # disabled
      
      hca_self_test.ofed # Ref: /work1/shared/eureka/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.0-1.0.0.0-rhel7.7-x86_64/docs/readme_and_user_manual/hca_self_test.readme
         ---- Performing Adapter Device Self Test ----
         Number of CAs Detected ................. 1
         PCI Device Check ....................... PASS
         Kernel Arch ............................ x86_64
         Host Driver Version .................... MLNX_OFED_LINUX-5.0-1.0.0.0 (OFED-5.0-1.0.0.0): 3.10.0-1062.el7.x86_64
         Host Driver RPM Check .................. PASS
         Firmware on CA #0 HCA .................. v16.27.1016
         Host Driver Initialization ............. PASS
         Number of CA Ports Active .............. 1
         Port State of Port #1 on CA #0 (HCA)..... UP 4X EDR (InfiniBand)
         Error Counter Check on CA #0 (HCA)...... PASS
         Kernel Syslog Check .................... PASS
         Node GUID on CA #0 (HCA) ............... 1c:34:da:03:00:5b:d1:b8
         ------------------ DONE ---------------------
      
      ibdev2netdev -v | grep -i MCX
      
         0000:09:00.0 mlx5_0 (MT4119 - MCX555A-ECAT) CX555A - ConnectX-5 QSFP28 fw 16.27.1016 port 1 (ACTIVE) ==> ib0 (Up)
      
      # check PCIe speed: both link capability (LnkCap) and status (LnkSta) should show "Speed 8GT/s, Width x16" --> raw bandwidth = 8 GT/s * 16 lanes = 128 Gb/s
      lspci -s 09:00.0 -vvv |grep Width
      
         LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
         LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
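
The PCIe arithmetic in the comment above (transfer rate times lane count) can be scripted. The following sketch uses a made-up helper, pcie_raw_gbps, that reads an lspci LnkCap or LnkSta line on stdin and prints the raw link bandwidth in Gb/s.

```shell
# Hypothetical helper: read an lspci LnkCap/LnkSta line on stdin and
# print the raw link bandwidth in Gb/s (transfer rate x lane count).
pcie_raw_gbps() {
    sed -n 's/.*Speed \([0-9.]*\)GT\/s.*Width x\([0-9]*\).*/\1 \2/p' |
        awk '{ printf "%g\n", $1 * $2 }'
}

# Sample line copied from the lspci output above.
echo 'LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-' |
    pcie_raw_gbps
# On the node itself:
#   lspci -s 09:00.0 -vvv | grep 'LnkSta:' | pcie_raw_gbps
```

The result is the raw rate. PCIe 3.0 uses 128b/130b encoding, so the usable rate is slightly lower (about 126 Gb/s), which still leaves headroom above the 100 Gb/s EDR link.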
      
    2. Bandwidth
         ib_write_bw -aF          # on eureka00
         ib_write_bw -aF eureka00 # on eurekaXX
      
            ---------------------------------------------------------------------------------------
                                RDMA_Write BW Test
             Dual-port       : OFF		Device         : mlx5_0
             Number of qps   : 1		Transport type : IB
             Connection type : RC		Using SRQ      : OFF
             PCIe relax order: ON
             TX depth        : 128
             CQ Moderation   : 100
             Mtu             : 4096[B]
             Link type       : IB
             Max inline data : 0[B]
             rdma_cm QPs	 : OFF
             Data ex. method : Ethernet
            ---------------------------------------------------------------------------------------
             local address: LID 0x03 QPN 0x010c PSN 0x7e8beb RKey 0x00146c VAddr 0x007f5c84bbd000
             remote address: LID 0x02 QPN 0x010c PSN 0xa3df05 RKey 0x001452 VAddr 0x007fe666513000
            ---------------------------------------------------------------------------------------
             #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
             2          5000             8.01               7.85   		   4.118260
             4          5000             20.25              19.82  		   5.196505
             8          5000             40.57              40.27  		   5.278838
             16         5000             81.13              80.27  		   5.260790
             32         5000             162.27             161.04 		   5.276990
             64         5000             324.04             323.05 		   5.292797
             128        5000             648.09             636.96 		   5.217944
             256        5000             1300.13            1288.37		   5.277154
             512        5000             2596.31            2571.03		   5.265473
             1024       5000             5192.61            5142.59		   5.266016
             2048       5000             10338.05           10234.47		   5.240046
             4096       5000             11723.33           11649.54		   2.982283
             8192       5000             11763.95           11756.10		   1.504781
             16384      5000             11776.66           11775.13		   0.753608
             32768      5000             11798.25           11797.31		   0.377514
             65536      5000             11809.13           11808.82		   0.188941
             131072     5000             11813.66           11813.56		   0.094508
             262144     5000             11816.62           11816.50		   0.047266
             524288     5000             11817.90           11817.90		   0.023636
             1048576    5000             11817.55           11817.52		   0.011818
             2097152    5000             11817.93           11817.91		   0.005909
             4194304    5000             11818.06           11818.05		   0.002955
             8388608    5000             11819.12           11819.12		   0.001477
            ---------------------------------------------------------------------------------------
      
         ib_read_bw -aF          # on eureka00
         ib_read_bw -aF eureka00 # on eurekaXX
      
         ---------------------------------------------------------------------------------------
                             RDMA_Read BW Test
          Dual-port       : OFF		Device         : mlx5_0
          Number of qps   : 1		Transport type : IB
          Connection type : RC		Using SRQ      : OFF
          PCIe relax order: ON
          TX depth        : 128
          CQ Moderation   : 100
          Mtu             : 4096[B]
          Link type       : IB
          Outstand reads  : 16
          rdma_cm QPs	 : OFF
          Data ex. method : Ethernet
         ---------------------------------------------------------------------------------------
          local address: LID 0x03 QPN 0x010d PSN 0xdf80e6 OUT 0x10 RKey 0x00146d VAddr 0x007f65ed58b000
          remote address: LID 0x02 QPN 0x010d PSN 0x44d7d7 OUT 0x10 RKey 0x001453 VAddr 0x007fada14db000
         ---------------------------------------------------------------------------------------
          #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
          2          1000             7.94               7.49   		   3.929325
          4          1000             20.22              20.20  		   5.294230
          8          1000             40.69              40.67  		   5.330739
          16         1000             81.38              81.24  		   5.323896
          32         1000             162.77             162.54 		   5.326176
          64         1000             325.53             305.15 		   4.999600
          128        1000             650.07             609.05 		   4.989373
          256        1000             1296.18            1176.64		   4.819505
          512        1000             2596.30            2592.49		   5.309415
          1024       1000             5153.42            5062.49		   5.183991
          2048       1000             8954.27            8940.89		   4.577735
          4096       1000             10890.85           10885.68		   2.786733
          8192       1000             11693.39           11690.36		   1.496366
          16384      1000             11738.75           11736.56		   0.751140
          32768      1000             11777.96           11777.50		   0.376880
          65536      1000             11796.46           11795.85		   0.188734
          131072     1000             11810.18           11810.12		   0.094481
          262144     1000             11813.58           11813.42		   0.047254
          524288     1000             11816.50           11816.46		   0.023633
          1048576    1000             11817.64           11817.60		   0.011818
          2097152    1000             11818.87           11818.85		   0.005909
          4194304    1000             11818.21           11818.21		   0.002955
          8388608    1000             11817.71           11817.71		   0.001477
         ---------------------------------------------------------------------------------------
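
When many node pairs are tested, it helps to reduce each ib_write_bw or ib_read_bw table to a single number. A minimal sketch follows, assuming the five-column table layout shown above; max_avg_bw is a hypothetical helper that prints the largest "BW average" value in the unit the tool reports.

```shell
# Hypothetical helper: print the largest "BW average" value found in
# ib_write_bw / ib_read_bw output read from stdin.
max_avg_bw() {
    awk '
        # Data rows have 5 fields and start with two integers
        # (#bytes and #iterations).
        NF == 5 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            if ($4 + 0 > max) max = $4 + 0
        }
        END { printf "%.2f\n", max }
    '
}

# Three rows copied from the RDMA_Write table above.
printf '%s\n' \
    ' 2048       5000             10338.05           10234.47   5.240046' \
    ' 4194304    5000             11818.06           11818.05   0.002955' \
    ' 8388608    5000             11819.12           11819.12   0.001477' |
    max_avg_bw
# On a client node: ib_write_bw -aF eureka00 | max_avg_bw
```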
      
    3. Other useful commands
      1. Single node
        ibstat # Displays IB operational information pertaining to a specified Host Channel Adapter (HCA)
               # --> "Physical state" should be "LinkUp"
        
        CA 'mlx5_0'
           CA type: MT4119
           Number of ports: 1
           Firmware version: 16.27.1016
           Hardware version: 0
           Node GUID: 0x1c34da03005bd1b8
           System image GUID: 0x1c34da03005bd1b8
           Port 1:
                   State: Active
                   Physical state: LinkUp
                   Rate: 100
                   Base lid: 4
                   LMC: 0
                   SM lid: 1
                   Capability mask: 0x2651e848
                   Port GUID: 0x1c34da03005bd1b8
                   Link layer: InfiniBand
        
        ibstatus # Query basic status of IB device(s)
        
        Infiniband device 'mlx5_0' port 1 status:
                default gid:     fe80:0000:0000:0000:1c34:da03:0057:b0a8
                base lid:        0x2
                sm lid:          0x1
                state:           4: ACTIVE
                phys state:      5: LinkUp
                rate:            100 Gb/sec (4X EDR)
                link_layer:      InfiniBand
        
      2. All nodes
        ibnetdiscover # Show the lid <--> hostname mapping
        ibhosts       # Show all available nodes
        ibnodes       # Show all available nodes plus switch
        ibswitches    # Show the switch
        iblinkinfo    # Show the status and speed of all links
        ibdiagnet     # Show diagnostic information for entire subnet 
        
  2. [Optional] Update firmware
    https://docs.mellanox.com/display/OFEDv501000/Updating+Firmware+After+Installation
    https://www.mellanox.com/support/firmware/firmware-downloads

  3. [Optional] Performance tuning
    https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters # check AMD NUMA, especially
    https://community.mellanox.com/s/article/how-to-tune-an-amd-server--eypc-cpu--for-maximum-performance
    https://community.mellanox.com/s/article/How-to-Tune-Your-Linux-Server-for-Best-Performance-Using-the-mlnx-tune-Tool
    https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
    https://community.mellanox.com/s/article/howto-install-iperf-and-test-mellanox-adapters-performance

  4. Additional info

    1. Installation logs

      /tmp/MLNX_OFED_LINUX.xxxxx.logs

    2. Query installation info

      /etc/infiniband/info

      prefix=/usr
      Kernel=3.10.0-1062.el7.x86_64
      
      Configure options: --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mlxfw-mod \
                         --with-mlx4-mod --with-mlx4_en-mod --with-mlx5-mod --with-ipoib-mod --with-innova-flex --with-mdev-mod \
                         --with-srp-mod --with-iser-mod --with-isert-mod
      
    3. OpenIB configuration file

      /etc/infiniband/openib.conf

Reference


SWITCH

Switch Initialization

  1. Install hardware:
    Ref: https://www.mellanox.com/sites/default/files/related-docs/user_manuals/SB77X0_SB78X0_SB67X0_1U_IB_Systems_Quick_Installation_Guide.pdf
    https://docs.mellanox.com/display/SB77X0EDR/Static+Rail+Kit

  2. Power on: Ref: https://docs.mellanox.com/display/SB77X0EDR/Initial+Power+On
    --> All LEDs should be green

  3. Configuration:

    0. General ref:
      https://docs.mellanox.com/display/MLNXOSv382204/Getting+Started#GettingStarted-ConfSw41TConfiguringtheSwitchfortheFirstTime

    1. Connect a host PC (e.g., my Mac) to the console (RJ-45) port of the switch
      using the supplied RJ-45-to-DB9 cable + DB9-to-USB cable

    2. Login (using my Mac): https://software.intel.com/en-us/setting-up-serial-terminal-on-system-with-mac-os-x
      Get the usb device name by

      ls /dev/cu.usbserial-* # on my Mac it shows "/dev/cu.usbserial-14140"
      

      Log in by

      screen /dev/cu.usbserial-14140 115200 -L # press "Enter" twice
      
      Username: admin
      Password: admin
      

      To end the session, type "Ctrl+A" and then "Ctrl+K"

    3. Configure: https://docs.mellanox.com/display/SB77X0EDR/System+Bring-Up+of+Managed+Systems

      Please answer 'yes' or 'no'.
      Do you want to use the wizard for initial configuration? yes
      
      Step 1: Hostname? [switch-d33330] switch-sb7800
      Step 2: Use DHCP on mgmt0 interface? [yes] no
      Step 3: Use zeroconf on mgmt0 interface? [no]
      Step 4: Primary IPv4 address and masklen? [0.0.0.0/0] 192.168.0.10/24
      Step 5: Default gateway? 192.168.0.1
      Step 6: Primary DNS server? 140.112.254.4
      Step 7: Domain name?
      Step 8: Enable IPv6? [yes]
      Step 9: Enable IPv6 autoconfig (SLAAC) on mgmt0 interface? [no]
      Step 10: Enable DHCPv6 on mgmt0 interface? [yes] no
      Step 11: Admin password (Enter to leave unchanged)? # same as eureka
      Step 11: Confirm admin password?
      
    4. Check

      1. check software version
        show version
        
        switch-sb7800 [standalone: master] > show version
           Product name:      MLNX-OS
           Product release:   3.8.1054
           Build ID:          #1-dev
           Build date:        2019-05-23 14:17:14
           Target arch:       x86_64
           Target hw:         x86_64
           Built by:          jenkins@336bea182aae
           Version summary:   X86_64 3.8.1054 2019-05-23 14:17:14 x86_64
        
           Product model:     x86onie
           Host ID:           B8599FD33330
           System serial num: MT1942J01109
           System UUID:       33456328-ee72-11e9-8000-b8599fa613e0
        
           Uptime:            1h 5m 42.640s
           CPU load averages: 3.08 / 3.06 / 3.01
           Number of CPUs:    2
           System memory:     414 MB used / 3349 MB free / 3763 MB total
           Swap:              0 MB used / 0 MB free / 0 MB total
        
      2. check the mgmt0 interface
        enable # to enter the "Enable" mode
        show interfaces mgmt0
        -----------------------------------
        Interface mgmt0 status:
          Comment         :
          Admin up        : yes
          Link up         : no
          DHCP running    : no
          IP address      : 192.168.0.10
          Netmask         : 255.255.255.0
          IPv6 enabled    : yes
          Autoconf enabled: no
          Autoconf route  : yes
          Autoconf privacy: no
          DHCPv6 running  : no
          Speed           : UNKNOWN
          Duplex          : UNKNOWN
          Interface type  : ethernet
          Interface source: bridge
          MTU             : 1500
          HW address      : B8:59:9F:D3:33:30
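
The wizard takes the management address in prefix notation (192.168.0.10/24), while show interfaces mgmt0 reports a dotted netmask (255.255.255.0). The conversion between the two can be sketched as follows; prefix_to_netmask is a made-up helper name, and it expects a prefix length between 0 and 32.

```shell
# Hypothetical helper: convert an IPv4 prefix length (0-32) to a dotted netmask.
prefix_to_netmask() (
    bits="$1"
    mask=""
    for _ in 1 2 3 4; do
        if [ "$bits" -ge 8 ]; then
            octet=255
            bits=$((bits - 8))
        else
            # Set the top "bits" bits of this octet, clear the rest.
            octet=$((256 - (1 << (8 - bits))))
            bits=0
        fi
        mask="${mask}${mask:+.}${octet}"
    done
    echo "$mask"
)

prefix_to_netmask 24   # prefix length used in Step 4 of the wizard
```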
        

Switch Shut Down

   enable
   reload halt
  
Configuration has been modified; save first? [yes] yes
Configuration changes saved.
Halting system...
switch-sb7800 [standalone: master] # 

System shutdown initiated -- logging off.

Connection to 192.168.0.10 closed.

After the above messages are printed, unplug both power cords.

Switch Login

  1. Serial terminal
    Repeat steps 3-1 and 3-2 in "Switch Initialization"
    
    Username: admin
    Password: # same as eureka
    
  2. SSH
    ssh admin@192.168.0.10 # mgmt0 address set in "Switch Initialization"; password: same as eureka

Enable OpenSM (Subnet Manager)

  ```
  enable                           # to enter the "Enable" mode
  configure terminal               # to enter the "Config" mode
  ib smnode switch-sb7800 enable   # enable OpenSM
  show ib sm                       # check --> should show "enabled"
  no configure                     # to exit the "Config" mode
  ```

Reference

