System Reboot - calab-ntu/gpu-cluster GitHub Wiki
Shutdown Procedure
- After `su root`, run `who` to check whether any user is still logged in. If so, use `pkill -u USER_NAME` to log that user out.
- Shut down the computing nodes
  `pdsh -w eureka[01-33] poweroff`
- [Optional] Log in to the switch, shut it down manually, and then unplug the two power cords. See the Switch Shut Down section of https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-InfiniBand.
  - Connect to the InfiniBand switch
    - eureka: `ssh [email protected]`
    - spock: `ssh [email protected]`
  - On the switch, run `enable` and then `reload halt`
  - After the following messages are printed, unplug both power cords.

        Configuration has been modified; save first? [yes] yes
        Configuration changes saved.
        Halting system...
        switch-sb7800 [standalone: master] #
        System shutdown initiated -- logging off.
        Connection to 192.168.0.10 closed.
- Disable `ssh` for all users except root. See SSH access in Account Management.
- Shut down the login node
- Shut down eater & ironman
  - Log in to the DSM of eater and ironman
  - Shut down in DSM
- Shut down Tumaz
- [Optional] Turn off the big cooling fans.
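
The first step of the shutdown procedure can be sketched as a small helper. This is only an illustration: `remaining_users` is a hypothetical name, not part of the cluster tooling, and it assumes the standard `who` layout where the first field is the user name.

```shell
#!/usr/bin/env bash
# Sketch: list users (other than root) who still have a login session.
# Reads `who`-style output on stdin so it can be tried on sample text.
# `remaining_users` is a hypothetical helper name.
remaining_users() {
    # Field 1 of each `who` line is the user name; drop root and duplicates.
    awk 'NF > 0 && $1 != "root" { print $1 }' | sort -u
}

# Typical use on the login node, as root:
#   who | remaining_users
#   pkill -u USER_NAME                 # for each name printed
#   pdsh -w eureka[01-33] poweroff     # once nobody is left
```

An empty result means only root is logged in, so the computing nodes can be powered off.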
Checklist Before Booting Up System
- Power lines
  - IB switch power cables (2)
  - Ethernet switch power cables (1)
  - Power cables of each node (34)
  - NAS (4)
  - tumaz (1)
- Cables of each node
  - CPU (8+8 pins)
  - MB (24 pins)
  - Water pump (USB mini-B & 4-pin fan power)
  - GPU (8+6 or 8+8 pins)
  - IB card and cable
  - Ethernet cable
Boot Procedure (the boot order is important)
- [Optional] Turn on the big cooling fans.
- [Optional] Plug the two power cords of the switch into the socket. See the Switch Login and Enable OpenSM (Subnet Manager) sections of https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-InfiniBand.
- Boot Tumaz
- Boot eater & ironman
- Boot login node
  - Hold the SET UP (PC-Link) button on the thermometer for 3 seconds to enable data transmission from the thermometer to the PC.
- Log in to the switch and enable the subnet manager.
  - On eureka: `ssh [email protected]`
  - On spock: `ssh [email protected]`
  - On the switch:

        enable                            # to enter the "Enable" mode
        configure terminal                # to enter the "Config" mode
        ib smnode switch-sb7800 enable    # enable OpenSM
        show ib sm                        # check --> should show "enabled"
        no configure                      # to exit the "Config" mode
- Boot the computing nodes
  - Boot one node per power strip at a time so that the surge current stays within the strip's limit.
- Enable `ssh`. See SSH access in Account Management.
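
While the computing nodes are being powered on one at a time, it can help to see which ones do not answer yet. The sketch below is an assumption-laden illustration: `unreachable_nodes` and the `PROBE` variable are hypothetical names, and it assumes the node names `eureka01` to `eureka33` (taken from the `pdsh` command above) resolve from the login node and answer ICMP ping.

```shell
#!/usr/bin/env bash
# Sketch: print the computing nodes that do not respond yet.
# PROBE is a hypothetical hook; by default it sends a single ping
# with a 2-second timeout (Linux iputils options).
PROBE=${PROBE:-"ping -c 1 -W 2"}

unreachable_nodes() {
    # Arguments: first and last node number, as plain decimals (1 33, not 01 33).
    local first=$1 last=$2 i node
    for ((i = first; i <= last; i++)); do
        node=$(printf 'eureka%02d' "$i")
        # PROBE is left unquoted on purpose so it can carry its own options.
        $PROBE "$node" >/dev/null 2>&1 || echo "$node"
    done
}

# Usage on the login node:
#   unreachable_nodes 1 33
```

An empty result means every node in the range answered the probe. It does not replace the checks in the next section.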
Checklist After Booting Up System
- Run `pbsnodes -l all` to confirm that all nodes are online
- Check the temperature of each node, and check that the fans behind the rack are turned on
- For each node, ensure there is no error about `[system] Failed to activate service 'org.freedesktop.login1': timed out` in `/var/log/messages`
- For each node, check that the IB bandwidth returned by `ibstatus |grep rate` is `100 Gb/sec (4X EDR)`.
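
The last two checks can be sketched as shell helpers that work on text, so they can be tried on sample input first. `ib_rate_ok` and `login1_timeouts` are hypothetical names; the expected rate string and the error message are copied from the checklist above, and the `ibstatus` line layout (`rate:` followed by the value) is assumed.

```shell
#!/usr/bin/env bash
# Sketch: helpers for the per-node checks. Function names are hypothetical.
EXPECTED_RATE='100 Gb/sec (4X EDR)'

ib_rate_ok() {
    # Reads `ibstatus` output on stdin. Succeeds only if at least one
    # "rate:" line exists and every such line shows the expected rate.
    awk -v want="$EXPECTED_RATE" '
        /rate:/ {
            n++
            sub(/^.*rate:[ \t]*/, "")   # drop everything up to the value
            sub(/[ \t]+$/, "")          # drop trailing whitespace
            if ($0 != want) bad++
        }
        END { exit !(n > 0 && bad == 0) }'
}

login1_timeouts() {
    # Prints how many lines of the log (default /var/log/messages)
    # contain the login1 activation timeout message.
    grep -cF "Failed to activate service 'org.freedesktop.login1': timed out" \
        "${1:-/var/log/messages}"
}

# Usage on a node:
#   ibstatus | ib_rate_ok && echo "IB rate OK"
#   login1_timeouts            # should print 0
```

Because `ib_rate_ok` strips everything before `rate:`, it should also accept lines carrying a host prefix such as the one `pdsh` adds, though it then reports one verdict for all nodes together rather than per node.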