Policy - chunxxc/GPU-Server-Handbook GitHub Wiki

The servers operate under an honour-based policy:

  1. There are no hard-coded limits on your use of the servers, but claiming more than half of the GPUs or CPUs for yourself is strictly forbidden.
  2. You are forbidden to use files from other users' directories without permission.

To protect your own private files (such as a private key), change their mode with chmod so that they are readable/writable only by you (most files are readable by everyone by default).
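For example, owner-only permissions look like this (the snippet uses a throwaway file created with mktemp so it is safe to run anywhere; substitute your real key or directory path):

```shell
#!/bin/sh
# Demonstration with a throwaway file; replace with your real key path.
f=$(mktemp)
chmod 600 "$f"   # owner read/write only; no access for group or others
ls -l "$f"       # permissions column shows -rw-------
rm -f "$f"

# For a whole directory, 700 gives the owner full access and everyone else none:
#   chmod -R 700 ~/my_private_dir
```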

  1. You can use any port for web servers (such as Jupyter Notebook or TensorBoard) except port 1024, which is the ssh access port. Please remember to close the server when it is no longer needed. Streaming the screen is forbidden.
  2. DGX users: Put your large data files (>10 GB) in ~/raid_storage (5.2 TB). Keep your scripts outside ~/raid_storage so they are covered by the automatic backup. GPU1 users: There is no separate space for large files, but if you need fast IO, put data in fast_data (800 GB). (Hardware)
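Before starting Jupyter or TensorBoard, it can help to confirm that the port you picked is not already in use. The port number 8888 and the use of ss below are illustrative assumptions, not part of the policy:

```shell
#!/bin/sh
# Check whether a TCP port is already listening before binding a web server.
# 8888 is an arbitrary example; any free port except 1024 (ssh) is fine.
PORT=8888
if ss -tln 2>/dev/null | grep -q ":${PORT}\b"; then
  echo "port ${PORT} is busy, pick another"
else
  echo "port ${PORT} looks free"
  # e.g. jupyter notebook --no-browser --port="${PORT}"
fi
```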

There are two things to check before you start a job:

  • Use export CUDA_VISIBLE_DEVICES=N to select a single GPU with index N = 0, 1, 2, or 3.
  • Check GPU status with nvidia-smi. This prints a one-second snapshot of GPU status: temperature, power usage, memory usage, utilization, and the list of processes on each GPU. In principle, avoid using an occupied GPU. Otherwise,
    • Check that the maximum power usage over any consecutive 5 seconds stays below 250 W (>300 W will cause a system shutdown),
    • Check that the free memory is enough for your process, to avoid out-of-memory (OOM) errors.

If you see a temperature above 80 °C, please report it to the admin.
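The same numbers can be pulled in a script-friendly form with nvidia-smi's query mode; the fields below are standard nvidia-smi options, and the limits in the comment are the ones from this policy:

```shell
#!/bin/sh
# Script-friendly GPU check: power should stay under 250 W and temperature
# under 80 C (policy limits above). Falls back gracefully off the server.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,power.draw,memory.free,temperature.gpu \
             --format=csv,noheader
else
  echo "nvidia-smi not found - are you on the GPU server?"
fi
```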

  • Check CPU status with 'htop' before you launch a multi-threaded process. This shows a dynamic window with the full list of processes; quit it by pressing 'q'.
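Once the checks pass, a job can be pinned to the free GPU you chose. The index 2 here is an arbitrary example:

```shell
#!/bin/sh
# Pin this shell (and any job launched from it) to GPU 2 only.
export CUDA_VISIBLE_DEVICES=2
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# Frameworks such as PyTorch or TensorFlow will now see a single GPU,
# reported as device 0 regardless of its physical index.
```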

A policy violation will cost you your access. If you encounter any problem or want to report a violation, please contact the administrator group directly.