System Monitoring - calab-ntu/gpu-cluster GitHub Wiki
Temperature
CPU
Eureka
:sensors
On Threadripper 2000-series CPUs, the k10temp driver reports a temperature 27 °C higher than the actual value. Ref.: https://www.phoronix.com/news/Linux-4.18.6-k10temp-Correct
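As a rough illustration of the offset described above, the sketch below subtracts 27 from the k10temp reading in `sensors` output. The helper name `correct_k10temp` and the label names it matches (`Tctl:` and `temp1:`) are assumptions for this example; the labels on a given node may differ. Per the linked article, newer kernels appear to apply this correction in the driver, so the helper would only make sense on a kernel without that fix.

```shell
# Hypothetical helper (not part of the cluster tooling).
# Offset taken from the note above: k10temp reads 27 degrees too high.
K10TEMP_OFFSET=27

correct_k10temp() {
    # Reads `sensors` output on stdin and prints corrected values
    # for lines labelled Tctl: or temp1: (assumed label names).
    awk -v off="$K10TEMP_OFFSET" '
        $1 == "Tctl:" || $1 == "temp1:" {
            raw = $2
            gsub(/[^0-9.+-]/, "", raw)   # strip the degree sign and unit
            printf "%s %.1f\n", $1, raw - off
        }'
}

# Usage: sensors | correct_k10temp
```

Lines without a matching label (adapter names, other sensors) are ignored.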
Spock
:ipmitool sensor get 'CPU Temp.'
This command requires root privileges.
GPU
nvidia-smi -a | grep 'GPU Current Temp'
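For scripted checks, `nvidia-smi` can also print the temperature as CSV through its `--query-gpu` option, which is easier to parse than the `-a` report. The sketch below flags GPUs above a threshold; the helper name `hot_gpus` and the default of 80 °C are arbitrary choices for illustration, not cluster policy.

```shell
# Hypothetical helper: list GPUs hotter than a threshold (degrees C).
# Reads "index, temperature" CSV lines on stdin.
hot_gpus() {
    threshold="${1:-80}"   # 80 is an illustrative default only
    awk -F', *' -v t="$threshold" '
        $2 + 0 > t { print "GPU " $1 ": " $2 " C" }'
}

# Usage:
# nvidia-smi --query-gpu=index,temperature.gpu \
#            --format=csv,noheader,nounits | hot_gpus 80
```

The function prints nothing when every GPU is at or below the threshold.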
IB adaptor
mget_temp -d /dev/mst/*
This command requires root privileges.
Network Usage
- Command:
iftop
This command requires root privileges.
Node Status
- Node occupation status:
node
- Monitor the state of all nodes:
cat /projectW/job_log/node_state_early
This file is refreshed every 10 minutes.
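Because the state file is only as current as its last refresh, it can help to check its age before trusting it. The following is a minimal sketch that assumes the file's modification time reflects the refresh; the helper name `is_stale` and the 15-minute allowance (the 10-minute interval plus some slack) are assumptions for this example.

```shell
# Hypothetical helper: succeed if a file was last modified more than
# N minutes ago. Returns 2 if the file does not exist.
is_stale() {
    file="$1"
    max_min="${2:-15}"   # default allows slack over the 10-minute refresh
    [ -e "$file" ] || return 2
    # find prints the path only when its mtime is older than max_min minutes
    [ -n "$(find "$file" -mmin +"$max_min")" ]
}

# Usage:
# if is_stale /projectW/job_log/node_state_early 15; then
#     echo "node_state_early looks out of date" >&2
# fi
```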
Room temperature
Eureka
:T_room
Spock
:T_room