System Testing - calab-ntu/gpu-cluster GitHub Wiki

Temperature Test

WIP

RAM Test

  1. Insert the boot USB and the RAM to be tested.

  2. Boot into the BIOS.

    1. Check that the memory frequency is 2933 MHz.
    2. Set the boot USB as the first-priority boot device.
  3. Start the memory test.

    • CentOS 7 install USB: Troubleshooting -> Run a memory test
    • Memtest86 USB
      1. config -> CPU -> parallel
      2. Start the test.
  4. After the test

    • CentOS 7
      1. Press Esc to end the test.
      2. Record the number of errors.
      3. Reboot into the BIOS and confirm the memory frequency is still 2933 MHz.
    • Memtest86 USB
      1. Save the test report and record the result.
      2. Reboot into the BIOS and confirm the memory frequency is still 2933 MHz.
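
The BIOS check above can optionally be cross-checked from the running OS. The sketch below is not part of the original procedure: `check_mem_speed` is a hypothetical helper that reads `dmidecode --type 17` output on stdin and compares the configured speed of every populated DIMM with an expected value. It assumes the field is labelled `Configured Memory Speed` or `Configured Clock Speed` (the label may differ between dmidecode versions), so adjust the pattern if your output looks different.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original procedure).
# Reads `dmidecode --type 17` output on stdin and checks that every
# populated DIMM reports the expected configured speed.
check_mem_speed() {
    expected="$1"
    awk -v want="$expected" '
        # Assumed field names; the label depends on the dmidecode version.
        /Configured (Clock|Memory) Speed:/ {
            # $4 is the numeric value; empty slots report "Unknown" and are skipped.
            if ($4 ~ /^[0-9]+$/) {
                n++
                if ($4 != want) bad++
            }
        }
        END {
            printf "%d DIMM(s) checked, %d mismatch(es)\n", n, bad
            # Fail if nothing was checked or any DIMM differs.
            exit ((n == 0 || bad > 0) ? 1 : 0)
        }
    '
}
```

Possible usage on a booted node: `sudo dmidecode --type 17 | check_mem_speed 2933`. The function returns a non-zero status when any populated DIMM reports a different speed or when no speed field is found at all.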

GPU Burn-in Test

  1. Reference: https://github.com/Microway/gpu-burn
  2. Clone the source from GitHub:
    git clone https://github.com/Microway/gpu-burn.git
    
  3. Build gpu_burn
    1. Edit the Makefile to match the CUDA install path and the GPU's compute capability.
      CUDA_PATH=/software/cuda/default
      
      NVCCFLAGS=-gencode=arch=compute_75,code=sm_75 -I${CUDA_PATH}/include --fatbin
      
    2. make
  4. Execute burn-in test
    ./gpu_burn [run_time] > log.file  # run_time is in seconds
    
  5. After the test, the tail of the log file shows whether the test passed, e.g.:
    100.0%  proc: 38K err: 0 tmp: 34C
            Summary at:   Wed May 13 15:38:30 CST 2020
    
    100.0%  proc: 41K err: 0 tmp: 37C
            Killing processes.. done
    
    Tested 1 GPUs:
            GPU 0: OK
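
Reading the tail of the log by eye can be replaced with a small check. The following is a minimal sketch, assuming the summary has the shape shown above: a `Tested N GPUs:` line followed by one `GPU <index>: <status>` line per device, with `OK` as the passing status. `check_burn_log` is a hypothetical helper, not something gpu-burn provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of gpu-burn).
# Returns 0 only if the summary lists at least one GPU and every GPU is OK.
check_burn_log() {
    log="$1"
    awk '
        # Drop a trailing carriage return, if any.
        { sub(/\r$/, "") }
        # Only lines after the "Tested N GPUs:" header belong to the summary.
        /^[[:space:]]*Tested [0-9]+ GPUs?:/ { in_summary = 1; next }
        in_summary && /^[[:space:]]*GPU [0-9]+: / {
            total++
            if ($3 == "OK") ok++
        }
        END {
            printf "GPUs OK: %d / %d\n", ok, total
            exit ((total > 0 && ok == total) ? 0 : 1)
        }
    ' "$log"
}
```

Possible usage after a run: `./gpu_burn 3600 > log.file; check_burn_log log.file && echo PASS || echo FAIL`. Any status other than `OK`, or a log with no summary at all, is treated as a failure.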
    

Burn-in Test

WIP

MPI Bandwidth Test

WIP

HD Test

WIP

GAMER Performance Test

WIP

Links