System Monitoring - calab-ntu/gpu-cluster GitHub Wiki

Temperature

CPU

GPU

nvidia-smi -a | grep 'GPU Current Temp'
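For scripted checks, the query interface of nvidia-smi may be easier to parse than grepping the full report. The sketch below is only an illustration: the helper name hot_gpus and the 80 degree threshold are assumptions, not part of this wiki. It lists the GPUs at or above a given temperature.

```shell
#!/usr/bin/env bash
# hot_gpus: hypothetical helper, not part of the cluster tooling.
# Reads CSV lines of the form "index, name, temperature" on stdin and
# prints the lines whose last field is at or above the threshold (deg C).
hot_gpus() {
    local limit="${1:-80}"   # assumed default threshold, adjust as needed
    awk -F', ' -v limit="$limit" '$NF + 0 >= limit + 0 { print }'
}

# Query the driver only when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,temperature.gpu \
               --format=csv,noheader,nounits | hot_gpus 80
fi
```

Running the query without the hot_gpus filter prints one line per GPU, which is handy for a quick overview.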

IB adapter

mget_temp -d /dev/mst/*

This command requires root privileges.
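If a node has more than one MST device, the glob in the command above expands to several arguments after -d. A minimal sketch that queries each device separately is given below; the function name ib_temps is hypothetical, and it assumes that the /dev/mst entries already exist (on typical Mellanox Firmware Tools installations they are created by running mst start as root).

```shell
#!/usr/bin/env bash
# ib_temps: hypothetical helper, not part of the cluster tooling.
# Prints one "device: temperature" line for each device path given.
ib_temps() {
    local dev
    for dev in "$@"; do
        # Skip a pattern that did not expand because no device exists.
        [ -e "$dev" ] || continue
        printf '%s: %s\n' "$dev" "$(mget_temp -d "$dev")"
    done
}

# Usage (as root):
#   ib_temps /dev/mst/*
```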

Network Usage

  • Command: iftop

    This command requires root privileges.

Node Status

  • Node occupation status: node
  • Monitor the state of all nodes: cat /projectW/job_log/node_state_early

    This file is refreshed every 10 minutes.
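Because the state file is only regenerated periodically, it can be useful to check how old it is before trusting its contents. The following is a sketch under stated assumptions: the helper file_age_seconds is hypothetical, it relies on GNU stat (as found on Linux), and the 600 second limit simply mirrors the 10-minute interval mentioned above.

```shell
#!/usr/bin/env bash
# file_age_seconds: hypothetical helper, not part of the cluster tooling.
# Prints the number of seconds since the file was last modified.
file_age_seconds() {
    local now mtime
    now="$(date +%s)"
    mtime="$(stat -c %Y "$1")" || return 1   # GNU stat syntax
    echo $(( now - mtime ))
}

state_file=/projectW/job_log/node_state_early
if [ -r "$state_file" ]; then
    age="$(file_age_seconds "$state_file")"
    # 600 s corresponds to the 10-minute refresh interval.
    if [ "$age" -gt 600 ]; then
        echo "warning: $state_file is ${age}s old" >&2
    fi
    cat "$state_file"
fi
```

For a continuously updating view, watch -n 600 cat /projectW/job_log/node_state_early re-reads the file at the same interval.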

Room temperature

  • Eureka: T_room
  • Spock: T_room