System Reboot - calab-ntu/gpu-cluster GitHub Wiki
Shutdown Procedure
- After `su root`, run `who` to check whether any user is still logged in. If so, use `pkill -u USER_NAME` to log that user out.
- Shut down computing nodes
  `pdsh -w eureka[01-33] poweroff`
- [Optional] Log in to the switch, shut down the switch manually, and then unplug the two power cords. See the Switch Shut Down section of https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-InfiniBand.
  - Connect to the InfiniBand switch
    eureka: `ssh [email protected]`
    spock: `ssh [email protected]`
  - Run `enable`, then `reload halt`. The switch should print:
    Configuration has been modified; save first? [yes] yes
    Configuration changes saved.
    Halting system...
    switch-sb7800 [standalone: master] #
    System shutdown initiated -- logging off.
    Connection to 192.168.0.10 closed.
  - After these messages are printed, unplug both power cords.
- Disable `ssh` for all users except root. Check Account Management, SSH access.
- Shut down login node
- Shut down eater & ironman
  - Log in to DSM of eater and ironman
  - Shut down in DSM
- Shut down Tumaz
- [Optional] Turn off the big cooling fans.
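
The first step of the shutdown procedure (finding users who are still logged in) can be sketched as a small filter over the output of `who`. This is only an illustration: the function name `other_users` is hypothetical and not part of the cluster's tooling, and it assumes the user name is the first column of each `who` line.

```shell
#!/usr/bin/env bash
# Sketch only: other_users is a hypothetical helper, not an existing cluster script.

# Read `who` output on stdin and print each logged-in user except root, once.
other_users() {
    awk '$1 != "root" { print $1 }' | sort -u
}
```

A possible use is `who | other_users`, followed by `pkill -u USER_NAME` for each name it prints.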
Checklist Before Booting Up System
- Power lines
  - IB switch power cables (2)
  - Ethernet switch power cables (1)
  - Power cables of each node (34)
  - NAS (4)
  - tumaz (1)
- Cables of each node
  - CPU (8+8 pins)
  - MB (24 pins)
  - Water pump (USB mini-b & 4-pin fan power)
  - GPU (8+6 or 8+8 pins)
  - IB card and cable
  - Ethernet cable
Boot Procedure (the boot order is important)
- [Optional] Turn on the big cooling fans.
- [Optional] Plug the two power cords of the switch into the socket. See the Switch Login and Enable OpenSM (Subnet Manager) sections of https://github.com/calab-ntu/eureka/wiki/System-Installation%3A-InfiniBand.
- Boot Tumaz
- Boot eater & ironman
- Boot login node
  - Hold the `SET UP` (PC-Link) button on the thermometer for 3 seconds to enable data transmission from the thermometer to the PC.
- Log in to the switch and enable the subnet manager.
  In eureka: `ssh [email protected]`
  In spock: `ssh [email protected]`
  In switch:
    enable                          # to enter the "Enable" mode
    configure terminal              # to enter the "Config" mode
    ib smnode switch-sb7800 enable  # enable OpenSM
    show ib sm                      # check --> should show "enabled"
    no configure                    # to exit the "Config" mode
- Boot computing nodes
  - Boot one node on each power strip at a time so that the surge current stays below the strip's limit.
- Enable `ssh`. Check Account Management, SSH access.
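
Before re-enabling `ssh` for users, it may help to confirm that every computing node has actually come up. The sketch below is an assumption, not part of the documented procedure: the helper names `node_names` and `unreachable_nodes` and the `PROBE` variable are hypothetical, and the hostnames `eureka01` to `eureka33` are taken from the `pdsh` command in the shutdown procedure.

```shell
#!/usr/bin/env bash
# Sketch only: PROBE, node_names and unreachable_nodes are hypothetical names.

# Command used to test one host; set PROBE to substitute a different check.
PROBE="${PROBE:-ping -c 1 -W 1}"

# Print the computing-node hostnames eureka01 .. eureka33, one per line.
node_names() {
    local i
    for i in $(seq 1 33); do
        printf 'eureka%02d\n' "$i"
    done
}

# Print every node for which the probe command fails.
unreachable_nodes() {
    local node
    for node in $(node_names); do
        # $PROBE is left unquoted on purpose so its options split into words.
        $PROBE "$node" >/dev/null 2>&1 || echo "$node"
    done
}
```

Running `unreachable_nodes` on the login node prints the nodes that did not answer; empty output suggests all 33 nodes responded to the probe.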
Checklist After Booting Up System
- Run `pbsnodes -l all` to confirm that all nodes are online
- Check the temperature of each node, and check that the fans behind the rack are turned on
- For each node, ensure there is no error about `[system] Failed to activate service 'org.freedesktop.login1': timed out` in `/var/log/messages`
- For each node, check that the IB bandwidth returned by `ibstatus | grep rate` is `100 Gb/sec (4X EDR)`.
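
The checks above could be sketched as three small filters that read command output on stdin. The function names are hypothetical, and the sketch assumes that `pbsnodes -l all` prints one `<node> <state>` pair per line and that `ibstatus` prints a `rate:` line per port; adjust the patterns if the local output differs.

```shell
#!/usr/bin/env bash
# Sketch only: all function names below are hypothetical helpers.

# Read `pbsnodes -l all` output on stdin (assumed "<node> <state>" per line)
# and print the nodes whose state contains down, offline, or unknown.
bad_pbs_nodes() {
    awk '$2 ~ /down|offline|unknown/ { print $1 }'
}

# Read `ibstatus` output on stdin. Succeed only if at least one "rate:" line
# is present and every such line reports 100 Gb/sec (4X EDR).
ib_rate_ok() {
    awk '/rate:/ { seen = 1; if (index($0, "100 Gb/sec (4X EDR)") == 0) bad = 1 }
         END { if (seen && !bad) exit 0; exit 1 }'
}

# Read a log (such as /var/log/messages) on stdin. Succeed only if the
# login1 activation timeout error does not appear in it.
no_login1_timeout() {
    ! grep -qF "Failed to activate service 'org.freedesktop.login1': timed out"
}
```

For example, `pbsnodes -l all | bad_pbs_nodes` should print nothing when all nodes are online, while `ibstatus | ib_rate_ok` and `no_login1_timeout < /var/log/messages` can be run on each node and checked through their exit status.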