System Testing - calab-ntu/gpu-cluster GitHub Wiki
Temperature Test
WIP
RAM Test
- Insert the boot USB and the target RAM.
- Boot to BIOS.
  - Check the memory frequency: 2933 MHz.
  - Set the boot USB as the first-priority boot device.
- Start the memory test.
  - CentOS 7 install USB: Troubleshooting -> Run a memory test
  - Memtest86 USB: config -> CPU -> parallel, then start test
- After the test:
  - CentOS 7
    - Press Esc to end the test.
    - Record the number of errors.
    - Reboot to BIOS and check the memory frequency: 2933 MHz.
  - Memtest86 USB
    - Save the test report and record the result.
    - Reboot to BIOS and check the memory frequency: 2933 MHz.
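
The memory frequency check above is done in the BIOS. As an optional cross-check from a booted Linux system, the following is a minimal sketch that parses the output of `dmidecode --type memory`. It is not part of the original procedure: the helper name `check_mem_speed` is hypothetical, and the field names it matches ("Configured Memory Speed" or "Configured Clock Speed") are assumptions that may vary with the dmidecode version.

```shell
# check_mem_speed EXPECTED
# Hypothetical helper: reads `dmidecode --type memory` output on stdin and
# returns 0 only if every populated DIMM reports the expected configured speed.
# Assumption: the field is named "Configured Memory Speed" (newer dmidecode)
# or "Configured Clock Speed" (older dmidecode); empty slots report no number.
check_mem_speed() {
    expected="$1"
    # Extract the numeric speed from each matching line; empty slots are skipped.
    speeds=$(sed -n -E 's/.*Configured (Memory|Clock) Speed: ([0-9]+).*/\2/p')
    # No populated DIMM found: treat as a failure rather than a pass.
    [ -n "$speeds" ] || return 1
    for s in $speeds; do
        [ "$s" -eq "$expected" ] || return 1
    done
    return 0
}
```

Possible usage (root is needed to read the DMI tables): `sudo dmidecode --type memory | check_mem_speed 2933 && echo "memory speed OK"`.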
GPU Burn-in Test
- Reference: https://github.com/Microway/gpu-burn
- Clone the source from GitHub:
    git clone https://github.com/Microway/gpu-burn.git
- Make gpu_burn:
  - Edit Makefile to fit the software path and GPU spec:
      CUDA_PATH=/software/cuda/default
      NVCCFLAGS=-gencode=arch=compute_75,code=sm_75 -I${CUDA_PATH}/include --fatbin
  - Run make:
      make
- Execute the burn-in test:
    ./gpu_burn [run_time] > log.file  # run_time is in seconds
- After the test, the tail of the log file shows whether the test passed, e.g.:
    100.0% proc: 38K err: 0 tmp: 34C
    Summary at: Wed May 13 15:38:30 CST 2020
    100.0% proc: 41K err: 0 tmp: 37C
    Killing processes.. done
    Tested 1 GPUs:
    GPU 0: OK
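
Reading the tail of the log by hand can be automated. The sketch below is one possible check, based only on the sample output above: it assumes the log contains a "Tested N GPUs:" summary followed by one "GPU <id>: OK" entry per healthy GPU, and treats anything else as a failure. The function name `check_gpu_burn_log` is hypothetical and not part of gpu-burn.

```shell
# check_gpu_burn_log LOGFILE
# Hypothetical helper: returns 0 if the summary reports every tested GPU as OK.
# Assumption: the log format matches the sample above ("Tested N GPUs:" and
# one "GPU <id>: OK" entry per passing GPU).
check_gpu_burn_log() {
    log="$1"
    # Number of GPUs taken from the last "Tested N GPUs" summary line.
    tested=$(sed -n 's/.*Tested \([0-9][0-9]*\) GPUs.*/\1/p' "$log" | tail -n 1)
    # Count every "GPU <id>: OK" occurrence, even if several share one line.
    ok=$(grep -o 'GPU [0-9][0-9]*: OK' "$log" | wc -l | tr -d ' ')
    # Pass only if a summary exists and all tested GPUs are reported OK.
    [ -n "$tested" ] && [ "$tested" -gt 0 ] && [ "$ok" -eq "$tested" ]
}
```

Possible usage after a run: `check_gpu_burn_log log.file && echo PASS || echo FAIL`.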
Burn-in Test
WIP
MPI Bandwidth Test
WIP
HD Test
WIP
GAMER Performance Test
WIP