System Maintenance - calab-ntu/gpu-cluster GitHub Wiki
Maintenance Roles
- MOVING NODES
- Two or more people are necessary for moving nodes in and out of the rack.
- NAS
- System and NAS must be powered off before moving nodes or NAS.
- INDUSTRIAL FANS
- Keep away from working fans.
- Before turning off industrial fans of Spock, all Spock nodes should be turned off.
Check power and connection cables
Maintenance routine (maintenance is held on the first Friday of each month)
- Every two months:
- Every half year:
- RAM test.
A specific set of nodes is tested in each maintenance session.
- Feb: eureka: 01 ~ 11; spock: 01 ~ 09
- Apr: eureka: 12 ~ 22; spock: 10 ~ 18
- Jun: eureka: 23 ~ 33; spock: 19 ~ 28
- Aug: eureka: 01 ~ 11; spock: 01 ~ 09
- Oct: eureka: 12 ~ 22; spock: 10 ~ 18
- Dec: eureka: 23 ~ 33; spock: 19 ~ 28
- Water cooler pump check.
- Every year:
- Replace thermal paste on high-temperature nodes.
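
The routine above can be sketched as a small Python helper that finds the maintenance date (first Friday of a month) and lists the nodes due for the RAM test in that month. This is only an illustration: the function names `first_friday` and `ram_test_nodes` are hypothetical, and the hostname format (cluster name followed by a two-digit index, e.g. `eureka01`) is an assumption, not something this page specifies.

```python
import calendar
from datetime import date

# RAM-test ranges copied from the schedule above: month -> {cluster: (first, last)}.
RAM_TEST_SCHEDULE = {
    2: {"eureka": (1, 11), "spock": (1, 9)},
    4: {"eureka": (12, 22), "spock": (10, 18)},
    6: {"eureka": (23, 33), "spock": (19, 28)},
}
# Aug, Oct and Dec repeat the Feb, Apr and Jun ranges.
for _month in (2, 4, 6):
    RAM_TEST_SCHEDULE[_month + 6] = RAM_TEST_SCHEDULE[_month]


def first_friday(year: int, month: int) -> date:
    """Return the date of the first Friday of the given month."""
    cal = calendar.Calendar(firstweekday=calendar.MONDAY)
    # Each week is a list of 7 day numbers (Monday first); 0 marks days outside the month.
    for week in cal.monthdayscalendar(year, month):
        day = week[calendar.FRIDAY]
        if day:
            return date(year, month, day)
    raise ValueError("month has no Friday")  # unreachable for valid months


def ram_test_nodes(month: int) -> list[str]:
    """Return the nodes scheduled for the RAM test in the given month.

    Hostnames like 'eureka01' are an assumed naming convention.
    Months without a RAM test yield an empty list.
    """
    ranges = RAM_TEST_SCHEDULE.get(month, {})
    return [
        f"{cluster}{index:02d}"
        for cluster, (first, last) in ranges.items()
        for index in range(first, last + 1)
    ]


if __name__ == "__main__":
    today = date.today()
    print("Maintenance day:", first_friday(today.year, today.month))
    print("RAM test nodes:", ram_test_nodes(today.month) or "none this month")
```

Keeping the schedule in one dictionary means a change to the node ranges only has to be made in one place.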