Admin Troubleshooting - chunxxc/GPU-Server-Handbook GitHub Wiki
The full manual of the DGX Station is available (https://docs.nvidia.com/dgx/dgx-station-user-guide/). It contains information for most troubleshooting on hardware and kernels.
Driver/Kernel mismatch after reboot
A temporary solution is to re-install the driver (e.g., sudo sh NVIDIA-...*.run).
What should be done instead: run sudo nvidia-bug-report.sh
to get nvidia-bug-report.log.gz. You can decompress it with gzip -d nvidia-bug-report.log.gz
and use vim to read through it. It is a very long log file, so I recommend just searching for keywords like 'mismatch'. You can also send the file directly to NVIDIA support for further instructions.
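Since the log is long, the keyword search can also be done without unpacking the file. The following is a minimal sketch: search_bug_report is a hypothetical helper name, not part of the NVIDIA tooling, and it assumes gzip and grep are installed.

```shell
# Hypothetical helper: search a gzip-compressed log for a keyword
# (case-insensitive, with line numbers) without unpacking it to disk.
search_bug_report() {
    log_file="$1"
    keyword="${2:-mismatch}"   # default keyword, as suggested above
    gzip -dc "$log_file" | grep -n -i -e "$keyword"
}

# Usage, in the directory where nvidia-bug-report.sh wrote its output:
#   search_bug_report nvidia-bug-report.log.gz mismatch | less
```

The line numbers printed by grep -n make it easier to jump to the matching spot later in vim.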
When the disk is full
As an admin, it is easy to blame the users for using up a lot of disk space. But sadly that is not always the case. Use sudo du -h -d1 /
to check whether any system directory is holding a lot of space. In one case, /var/lib/docker alone occupied 728.1GB.
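To make the biggest directories stand out, the du output can be sorted by size. This is a rough sketch assuming GNU coreutils (du -d, sort -h); largest_dirs is a hypothetical helper name.

```shell
# Hypothetical helper: show the largest entries one level below a directory.
largest_dirs() {
    target="${1:-/}"
    count="${2:-10}"
    # -x keeps du on a single filesystem, so /proc and network mounts are skipped.
    # Permission errors are silenced; run with sudo to see every directory.
    du -x -h -d1 "$target" 2>/dev/null | sort -rh | head -n "$count"
}

# Usage:
#   sudo sh -c "$(declare -f largest_dirs 2>/dev/null); largest_dirs / 10"   # bash only
#   largest_dirs /var/lib 5
# If /var/lib/docker turns out to be the culprit, "docker system df"
# gives a breakdown by images, containers, and volumes.
```

The output includes the total for the target directory itself, so the interesting entries are the ones listed next to it.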
HELP, there is a power outage!
Well, wait until Ellevio fixes its business. After the power outage, the fuse for the server room on floor 4 will be temporarily disabled. Go to the server room, look left on the wall for the electricity control box, and turn the "manuell Förbikopplare B3" button on again. (Or just try every button on the wall! (NO))
How to list all users on the server?
To see the users who are currently logged in:
who
To list the existing users:
cat /etc/passwd
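/etc/passwd also contains many system accounts. The sketch below filters it down to regular accounts, assuming the Debian/Ubuntu convention that human users get UIDs from 1000 upward and that 65534 is the "nobody" account; list_human_users is a hypothetical helper name, and the UID range may need adjusting on other systems.

```shell
# Hypothetical helper: print login names of regular (non-system) accounts.
# Assumption: regular users have 1000 <= UID < 65534 (Debian/Ubuntu default).
list_human_users() {
    passwd_file="${1:-/etc/passwd}"
    # Fields in /etc/passwd are colon-separated; field 1 is the name, field 3 the UID.
    awk -F: '$3 >= 1000 && $3 < 65534 { print $1 }' "$passwd_file"
}

# Usage:
#   list_human_users
```

Accounts that come from a directory service such as LDAP do not appear in /etc/passwd; getent passwd lists those as well.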
If root is mounted as read-only
sudo mount -o remount,rw /
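Before and after remounting, it can help to confirm the actual mount state. The following sketch reads /proc/mounts (Linux only); mount_is_readonly is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (exit 0) if the given mount point is mounted read-only.
# The second argument is only there so the mount table can be swapped out for testing.
mount_is_readonly() {
    awk -v target="$1" '
        $2 == target { opts = $4 }      # keep the options of the last matching entry
        END {
            n = split(opts, o, ",")
            # Exact match on "ro", so options like errors=remount-ro do not count.
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }' "${2:-/proc/mounts}"
}

# Usage:
#   if mount_is_readonly /; then echo "root is read-only"; fi
```

If root keeps coming back read-only after a remount, the kernel log (dmesg) is worth checking for filesystem errors.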
Why is there a "Couldn't get size: 0x8000000000000e"?
Because once upon a time the admin went into the BIOS and the settings were reset to default. So the machine thinks of itself as a dual-boot system and always wants to find Windows first. To fix it, go into the BIOS again and set "Boot Device control" to UEFI and Legacy OPROM, and "Boot from Storage Devices" to UEFI only.