Update the System
Before deciding on a software update, you need to check the compatibility between components.
The Nvidia products (driver, CUDA, cuDNN) usually go hand-in-hand:
- Nvidia driver + CUDA + cuDNN (https://www.nvidia.com/Download/index.aspx and https://docs.nvidia.com/deploy/cuda-compatibility/index.html#default-to-minor-version). There can only be one version of the Nvidia kernel driver, but multiple CUDA toolkits may be needed side by side.
- Nvidia - Python - Tensorflow - Pytorch (https://www.tensorflow.org/install/source#linux and https://pytorch.org/get-started/locally/). Tensorflow imposes the real restrictions, since Pytorch usually just works with whatever it is given.
- Ubuntu - DGX OS - Python (a lower Python version can be set as the OS default).
Bazel did not support Ubuntu 20 as of 2022. Please recheck Bazel and Ubuntu compatibility.
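A quick way to see what is currently installed before deciding (a minimal sketch; the Python one-liners assume Tensorflow and Pytorch are already installed):
$nvidia-smi # driver version and the highest CUDA version the driver supports
$nvcc --version # the CUDA toolkit currently on the PATH
$python3 --version
$python3 -c "import tensorflow as tf; print(tf.__version__)"
$python3 -c "import torch; print(torch.__version__, torch.version.cuda)"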
Update system package manager (around 10 min)
$sudo apt update # refresh the package lists
$sudo apt full-upgrade -s # dry run: prints all possible upgrades; upgrade when the list includes security or CUDA-related packages
$sudo apt full-upgrade
$sudo apt autoremove
Update Nvidia-driver
- Check which CUDA versions Tensorflow and Pytorch support before deciding on a driver version (also check the Python version). These two are always behind Nvidia's development. Then search for the available driver versions and look up on Nvidia's site what your card supports: https://www.nvidia.com/Download/index.aspx?lang=en-us For a non-interactive install:
$sudo add-apt-repository ppa:graphics-drivers # only needed if you cannot see the newest driver
$sudo apt-get update
$apt search nvidia-driver
$sudo apt-get install nvidia-driver-XXX # replace XXX with the chosen version
For an interactive install (*.run file) (https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html):
$sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
Optionally, use the purge command below if you want a clean install or if there are more than four CUDA versions on board.
$sudo apt purge 'nvidia*' 'libnvidia*' # quote the globs so the shell does not expand them against local files
- after updating the system, a reboot is needed so that the driver and the card can greet each other and nvidia-smi works again
The DGX Station always needs a manual press on the power button despite using the 'reboot' command.
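After the reboot, verify that the new driver is loaded:
$nvidia-smi # should list all GPUs and the new driver version
$cat /proc/driver/nvidia/version # kernel module version, should match nvidia-smi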
Sudo apt upgrade again
- perform another system software upgrade so that the DGX OS packages are also updated to match the new GPU driver.
$sudo apt-get update # refresh the package manager's view of available updates
$sudo apt full-upgrade
$sudo apt autoremove # remove the leftovers
Install the new CUDA Toolkit (note: the CUDA version printed by nvidia-smi is only the highest version the driver supports, not the installed toolkit)
Follow this (better to go with the runfile): https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal
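A minimal sketch of a toolkit-only runfile install (the filename is a placeholder; skip the bundled driver if you keep the apt-installed one):
$sudo sh cuda_XX.Y.Z_linux.run --silent --toolkit # install only the toolkit, leave the existing driver alone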
- remember to change the ENV paths in /etc/bash.bashrc (replace *.* with the actual version number of the new toolkit):
export LD_LIBRARY_PATH=/usr/local/cuda-*.*/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PATH=/usr/local/cuda-*.*/bin${PATH:+:${PATH}}
export CUDA_HOME=/usr/local/cuda-*.*
export TF_FORCE_GPU_ALLOW_GROWTH=true # dynamic GPU memory usage, a must-have when sharing one GPU
export TF_CPP_MIN_LOG_LEVEL=3
then do:
source /etc/bash.bashrc
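Then confirm that the shell picks up the new toolkit (assuming the paths above point at the intended version):
$nvcc --version # should report the newly installed CUDA toolkit
$echo $CUDA_HOME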
for cuDNN: https://developer.nvidia.com/rdp/cudnn-download (requires registering an account; use the generic Linux tar file) and follow: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html (no need to run the verification samples, it will get verified by tensorflow anyway)
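As a rough sketch, the tar-file install boils down to copying the headers and libraries into the CUDA directory (the archive name and layout depend on the cuDNN version, so adapt the paths to what the install guide shows for your download):
$tar -xvf cudnn-linux-x86_64-*-archive.tar.xz
$sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
$sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
$sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*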
sometimes it is also good to check whether the DGX OS system has any valuable updates: https://docs.nvidia.com/dgx/dgx-os-desktop-release-notes/index.html (but be careful with major updates since they usually come tied to an Ubuntu upgrade, e.g. DGX OS 5 <-> Ubuntu 20; DGX OS 4 <-> Ubuntu 18)
Update python
$sudo add-apt-repository ppa:deadsnakes/ppa # third-party PPA that carries newer Python versions
$sudo apt-get install pythonX.Y pythonX.Y-dev pythonX.Y-venv # the old versions still exist, they are all under /usr/bin/python*
$sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/pythonX.Y N # register this Python as an alternative with priority N (use /usr/local/bin/pythonX.Y for a source-built Python)
- However, the new Python often cannot be found through apt. In that case just install it from source; it is straightforward (a minimal sketch is given at the end of this section).
$sudo update-alternatives --config python3 # choose which Python is used when "python3" is invoked
For Ubuntu 18 we should use Python 3.7 as the OS default.
- if the above leaves the apt_pkg module missing (apt starts to error out), do
$sudo cp /usr/lib/python3/dist-packages/apt_pkg.cpython-36m-x86_64-linux-gnu.so /usr/lib/python3/dist-packages/apt_pkg.so
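A minimal sketch of a Python source install (the version number is only an example; adjust it to the release you need, and install the usual build dependencies such as build-essential, libssl-dev and zlib1g-dev first):
$wget https://www.python.org/ftp/python/3.9.16/Python-3.9.16.tgz
$tar -xf Python-3.9.16.tgz && cd Python-3.9.16
$./configure --enable-optimizations
$make -j "$(nproc)"
$sudo make altinstall # lands in /usr/local/bin/python3.9 without replacing the system python3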
Now tensorflow
:exclamation: Build from source to enable customized C++ ops with the correct GCC. Also, just forget about TF 1.X.
Check here for version support https://www.tensorflow.org/install/source#gpu
- Download the Bazel dist zip and bootstrap it locally: https://docs.bazel.build/versions/4.1.0/install-compile-source.html#bootstrap-bazel Do not use the Bazel installer.
Clone a clean tensorflow git repo and then follow https://www.tensorflow.org/install/source (but be aware: all commands should be run with 'sudo' except Bazel and the pip-installed build deps, which are only needed locally to create the package). For ./configure:
- OpenCL: No
- ComputeCpp: No
- ROCm: No
- TensorRT: No
- use clang as CUDA compiler: No
if you want to build against another CUDA, set the local env variables LD_LIBRARY_PATH, PATH and TF_CUDA_PATHS to point at the desired version, as in the example below.
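For example (the version number is hypothetical; point these at whichever /usr/local/cuda-X.Y you want to build against):
$export TF_CUDA_PATHS=/usr/local/cuda-11.2
$export PATH=/usr/local/cuda-11.2/bin:$PATH
$export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH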
when running bazel build, you can skip the optional [--config=option] flags (they do nothing useful here), but add --verbose_failures to expand the error info:
$.../output/bazel build --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package
If you have conda-installed CUDA-related packages locally (cudatoolkit/magma-cuda), which confuse the PATH, you will probably need to comment out the conda PATH entries in your ~/.bashrc; then it is best to log out, log back in, and check your path with
$echo "$PATH"
The build will take some hours; leave it overnight for Bazel to work.
Then build the pip package and install the wheel (paths as in the official build guide), and things should be good to go:
$sudo ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$sudo pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl
Check Tensorflow with a non-admin user account
check if Tensorflow is using the GPU with
- tf 2
import tensorflow as tf
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
mind the two settings TF_FORCE_GPU_ALLOW_GROWTH=true and TF_CPP_MIN_LOG_LEVEL=3, and check that they still work
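A more direct check of GPU visibility, assuming the TF 2.x API, can be run straight from the shell:
$python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # should print one PhysicalDevice per GPU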
Update PyTorch
It is pretty straightforward, it just takes a LOOONG time.
- Download the source package and build from source (since we are using the newest CUDA), following https://pytorch.org/
:exclamation: all steps in the github installation guide can be done without sudo except for the final setup.py install
- verify Pytorch (do it outside the Pytorch source folder):
import torch
print(torch.rand(2,3).cuda())
print(torch.version.cuda)
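A quick one-liner from the shell does the same check (assuming the build landed in the default site-packages):
$python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"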
Finally, Matlab
you will need to ask the IT group to issue an admin license so that it supports multiple users.