System Reboot - calab-ntu/gpu-cluster GitHub Wiki

Shutdown Procedure

  1. After su root, run who to check whether any users are still logged in. If so, use
    pkill -u USER_NAME
    
    to log that user out.
  2. Shut down the computing nodes: pdsh -w eureka[01-33] poweroff
  3. [Optional] Log in to the switch, shut it down manually, and then unplug the two power cords. See https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-InfiniBand for details on Switch Shut Down.
    1. Connect to the InfiniBand switch. From eureka: ssh [email protected]; from spock: ssh [email protected]
    2. enable
    3. reload halt
    Configuration has been modified; save first? [yes] yes
    Configuration changes saved.
    Halting system...
    switch-sb7800 [standalone: master] # 
    
    System shutdown initiated -- logging off.
    
    Connection to 192.168.0.10 closed.
    
    After the above messages are printed, unplug both power cords.
  4. Disable SSH access for all users except root. Check Account Management SSH access.
  5. Shut down the login node
  6. Shut down eater & ironman
    1. Log in to the DSM of eater and ironman
    2. Shut down in DSM
  7. Shut down Tumaz
  8. [Optional] Turn off the big cooling fans.
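
Step 1 of the shutdown procedure can be sketched as a small shell helper. This is only a minimal sketch and not part of the original procedure: the function names list_non_root_users and print_logout_commands are hypothetical. The helper reads who-style output on standard input and prints the pkill commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: list users other than root who still have a login session.
# Reads `who`-style output on stdin (user name in the first column).
list_non_root_users() {
    awk '$1 != "root" { print $1 }' | sort -u
}

# Hypothetical helper: print one "pkill -u USER_NAME" line per remaining
# user instead of executing it, so the operator can review before acting.
print_logout_commands() {
    list_non_root_users | while read -r user; do
        printf 'pkill -u %s\n' "$user"
    done
}

# Example on the login node (as root):
#   who | print_logout_commands
```

If the helper prints nothing, no user other than root is logged in and the procedure can continue with step 2.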

Checklist Before Booting Up System

  1. Power lines
    1. IB Switch power cables (2)
    2. Ethernet switch power cables (1)
    3. Power cables of each node (34)
    4. NAS (4)
    5. tumaz (1)
  2. Cables of each node
    1. CPU (8+8 pins)
    2. MB (24 pins)
    3. Water pump (USB Mini-B & 4-pin fan power)
    4. GPU (8+6 or 8+8 pins)
    5. IB card and cable
    6. Ethernet cable

Boot Procedure (the boot order is important)

  1. [Optional] Turn on the big cooling fans.
  2. [Optional] Plug the two power cords of the switch into the socket. See https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-InfiniBand for Switch Login and Enable OpenSM (Subnet Manager).
  3. Boot Tumaz
  4. Boot eater & ironman
  5. Boot login node
    • Hold the SET UP (PC-Link) button on the thermometer for 3 seconds to enable data transmission from the thermometer to the PC.
  6. Log in to the switch and enable the subnet manager. From eureka:
    ssh [email protected]
    
    From spock:
    ssh [email protected]
    
    On the switch:
    enable                           # to enter the "Enable" mode
    configure terminal               # to enter the "Config" mode
    ib smnode switch-sb7800 enable   # enable OpenSM
    show ib sm                       # check --> should show "enabled"
    no configure                     # to exit the "Config" mode
    
  7. Boot the computing nodes
    1. On each power strip, boot only one node at a time so that the surge current does not exceed the limit of the strip.
  8. Enable ssh. Check Account Management SSH access.
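
While the computing nodes are being booted one at a time, it can help to see which of them are not reachable yet. The following is a minimal sketch under stated assumptions: the hostnames eureka01 to eureka33 (the same range used with pdsh in the shutdown procedure) resolve from the login node, the Linux iputils ping is installed, and the function names node_names and unreachable_nodes are hypothetical.

```shell
#!/bin/sh
# Print eureka01 ... eureka33, the same range as pdsh -w eureka[01-33].
node_names() {
    i=1
    while [ "$i" -le 33 ]; do
        printf 'eureka%02d\n' "$i"
        i=$((i + 1))
    done
}

# Print every node that does not answer a single ping within 1 second.
# Assumes iputils ping options: -c (count) and -W (timeout in seconds).
unreachable_nodes() {
    node_names | while read -r node; do
        if ! ping -c 1 -W 1 "$node" </dev/null >/dev/null 2>&1; then
            printf '%s\n' "$node"
        fi
    done
}

# Example on the login node:
#   unreachable_nodes
```

An empty result only means that every node answers ping; the checks in the next section are still needed to confirm that the nodes are healthy.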

Checklist After Booting Up System

  1. Run pbsnodes -l all to confirm that all nodes are online
  2. Check the temperature of each node, and check that the fans behind the rack are turned on
  3. For each node, ensure that /var/log/messages contains no error like [system] Failed to activate service 'org.freedesktop.login1': timed out
  4. For each node, check that the IB rate returned by ibstatus | grep rate is 100 Gb/sec (4X EDR).
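
Checks 3 and 4 can be partly automated by filtering command output. The sketch below is an illustration and not the cluster's own tooling: the function names bad_ib_rates and count_login1_errors are hypothetical, and running the commands on all nodes through pdsh (already used in the shutdown procedure) is an assumption about how the checks would be collected.

```shell
#!/bin/sh
# Print every input line that does NOT report the expected IB rate.
# Expects the output of "ibstatus | grep rate" on stdin.
bad_ib_rates() {
    grep -v -F '100 Gb/sec (4X EDR)' || true
}

# Count log lines containing the login1 activation timeout.
# Expects the contents of /var/log/messages on stdin.
count_login1_errors() {
    grep -c -F "Failed to activate service 'org.freedesktop.login1': timed out" || true
}

# Examples (run as root on the login node):
#   pdsh -w 'eureka[01-33]' 'ibstatus | grep rate' | bad_ib_rates
#   count_login1_errors < /var/log/messages
```

No output from bad_ib_rates and a count of 0 from count_login1_errors correspond to checks 4 and 3 passing.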
