System Maintenance - calab-ntu/gpu-cluster GitHub Wiki

Maintenance Rules

  1. MOVING NODES
    • At least two people are required to move a node into or out of the rack.
  2. NAS
    • Power off the system and the NAS before moving any node or the NAS.
  3. INDUSTRIAL FANS
    • Keep away from running fans.
    • Turn off all Spock nodes before turning off Spock's industrial fans.

Check power and connection cables

  1. Cable plugs on the PSU
  2. Network cable
  3. IB cable
  4. RAM modules
  5. Power cables on the CPU, GPU, and motherboard (MB)

Maintenance routine (maintenance is held on the first Friday of each month)

  • Every two months:
    • Cooler fan check.
  • Every half year:
    • RAM test. Each maintenance session tests a specific range of nodes:
      • Feb: eureka: 01 ~ 11; spock: 01 ~ 09
      • Apr: eureka: 12 ~ 22; spock: 10 ~ 18
      • Jun: eureka: 23 ~ 33; spock: 19 ~ 28
      • Aug: eureka: 01 ~ 11; spock: 01 ~ 09
      • Oct: eureka: 12 ~ 22; spock: 10 ~ 18
      • Dec: eureka: 23 ~ 33; spock: 19 ~ 28
    • Water cooler pump check.
  • Every year:
    • Replace thermal paste on high-temperature nodes.
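
The schedule above can be turned into a small lookup, for example to print a reminder before each maintenance day. The following Python sketch is only an illustration: the rotation table is copied from the RAM test list, while the function names and the node label format (cluster name followed by a two-digit number, such as eureka01) are assumptions and may not match the real hostnames.

```python
import calendar
from datetime import date

# RAM test rotation copied from the list above:
# month number -> {cluster: (first node, last node)}
RAM_TEST_ROTATION = {
    2: {"eureka": (1, 11), "spock": (1, 9)},
    4: {"eureka": (12, 22), "spock": (10, 18)},
    6: {"eureka": (23, 33), "spock": (19, 28)},
    8: {"eureka": (1, 11), "spock": (1, 9)},
    10: {"eureka": (12, 22), "spock": (10, 18)},
    12: {"eureka": (23, 33), "spock": (19, 28)},
}


def first_friday(year, month):
    """Return the date of the first Friday of the given month."""
    first_weekday = date(year, month, 1).weekday()  # Monday == 0
    offset = (calendar.FRIDAY - first_weekday) % 7
    return date(year, month, 1 + offset)


def ram_test_nodes(month):
    """Return the node labels due for a RAM test in `month`.

    The label format (e.g. "eureka01") is a hypothetical convention.
    Months without a scheduled RAM test yield an empty list.
    """
    ranges = RAM_TEST_ROTATION.get(month, {})
    return [
        f"{cluster}{number:02d}"
        for cluster, (first, last) in ranges.items()
        for number in range(first, last + 1)
    ]


if __name__ == "__main__":
    day = first_friday(2024, 2)
    print(day.isoformat(), ram_test_nodes(day.month))
```

With this rotation every node is covered twice a year, which matches the "every half year" interval stated for the RAM test.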