Update the System
Before deciding on a software update, you need to check the compatibility between components.
The Nvidia products (driver, CUDA, cuDNN) usually go hand-in-hand:
- Nvidia driver + CUDA + cuDNN (https://www.nvidia.com/Download/index.aspx and https://docs.nvidia.com/deploy/cuda-compatibility/index.html#default-to-minor-version). There can only be one version of the Nvidia kernel driver, but multiple CUDA toolkits may be needed side by side.
- Nvidia - Python - Tensorflow - Pytorch (https://www.tensorflow.org/install/source#linux and https://pytorch.org/get-started/locally/). Tensorflow imposes the real restrictions, since Pytorch usually just works with whatever it is given.
- Ubuntu - DGX OS - Python (a lower Python version can be set as the OS default).
Bazel did not support Ubuntu 20 as of 2022. Please recheck Bazel and Ubuntu compatibility.
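A quick way to see what is currently installed before deciding (a minimal sketch; the Python one-liners assume Tensorflow and Pytorch are already installed):
$nvidia-smi # driver version and the highest CUDA version the driver supports
$nvcc --version # the CUDA toolkit currently on the PATH
$python3 --version
$python3 -c "import tensorflow as tf; print(tf.__version__)"
$python3 -c "import torch; print(torch.__version__, torch.version.cuda)"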
Update system package manager (around 10 min)
$sudo apt update # refresh the package lists
$sudo apt full-upgrade -s # dry run: prints all possible upgrades; upgrade when the list includes security or CUDA-related packages
$sudo apt full-upgrade
$sudo apt autoremove
Update Nvidia-driver
- Check which CUDA versions Tensorflow and Pytorch support before deciding on a driver version (also check the Python version). These two are always behind Nvidia's development. Then search for the available driver versions and look up on Nvidia's site what your card supports: https://www.nvidia.com/Download/index.aspx?lang=en-us For a non-interactive install:
$sudo add-apt-repository ppa:graphics-drivers # only needed if you cannot see the newest driver
$sudo apt-get update
$apt search nvidia-driver
$sudo apt-get install nvidia-driver-XXX # replace XXX with the chosen version
For an interactive install (*.run file) (https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html):
$sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
Optionally, use the purge command below if you want a clean install or if there are more than four CUDA versions on board.
$sudo apt purge 'nvidia*' 'libnvidia*' # quote the globs so the shell does not expand them against local files
- after updating the system, a reboot is needed so that the driver and the card can greet each other and nvidia-smi works again
The DGX Station always needs a manual press on the power button despite using the 'reboot' command.
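After the reboot, verify that the new driver is loaded:
$nvidia-smi # should list all GPUs and the new driver version
$cat /proc/driver/nvidia/version # kernel module version, should match nvidia-smi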
Sudo apt upgrade again
- perform another system software upgrade so that the DGX OS packages are also updated to match the new GPU driver.
$sudo apt-get update # refresh the package manager's view of available updates
$sudo apt full-upgrade
$sudo apt autoremove # remove the leftovers
Install the new CUDA Toolkit (note: the CUDA version printed by nvidia-smi is only the highest version the driver supports, not the installed toolkit)
Follow this (better to go with the runfile): https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal
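A minimal sketch of a toolkit-only runfile install (the filename is a placeholder; skip the bundled driver if you keep the apt-installed one):
$sudo sh cuda_XX.Y.Z_linux.run --silent --toolkit # install only the toolkit, leave the existing driver alone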
- remember to change the ENV paths in /etc/bash.bashrc (replace *.* with the actual version number of the new toolkit):
export LD_LIBRARY_PATH=/usr/local/cuda-*.*/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PATH=/usr/local/cuda-*.*/bin${PATH:+:${PATH}}
export CUDA_HOME=/usr/local/cuda-*.*
export TF_FORCE_GPU_ALLOW_GROWTH=true # dynamic GPU memory usage, a must-have when sharing one GPU
export TF_CPP_MIN_LOG_LEVEL=3
then do:
source /etc/bash.bashrc
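Then confirm that the shell picks up the new toolkit (assuming the paths above point at the intended version):
$nvcc --version # should report the newly installed CUDA toolkit
$echo $CUDA_HOME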
for cuDNN: https://developer.nvidia.com/rdp/cudnn-download (requires registering an account; use the generic Linux tar file) and follow: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html (no need to run the verification samples, it will get verified by tensorflow anyway)
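As a rough sketch, the tar-file install boils down to copying the headers and libraries into the CUDA directory (the archive name and layout depend on the cuDNN version, so adapt the paths to what the install guide shows for your download):
$tar -xvf cudnn-linux-x86_64-*-archive.tar.xz
$sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include
$sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64
$sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*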
sometimes it is also good to check whether the DGX OS system has any valuable updates: https://docs.nvidia.com/dgx/dgx-os-desktop-release-notes/index.html (but be careful with major updates since they usually come tied to an Ubuntu upgrade, e.g. DGX OS 5 <-> Ubuntu 20; DGX OS 4 <-> Ubuntu 18)
Update python
$sudo add-apt-repository ppa:deadsnakes/ppa # third-party PPA that carries newer Python versions
$sudo apt-get install pythonX.Y pythonX.Y-dev pythonX.Y-venv # the old versions still exist, they are all under /usr/bin/python*
$sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/pythonX.Y N # register this Python as an alternative with priority N (use /usr/local/bin/pythonX.Y for a source-built Python)
- However, the new Python often cannot be found through apt. In that case just install it from source; it is straightforward (a minimal sketch is given at the end of this section).
$sudo update-alternatives --config python3 # choose which Python is used when "python3" is invoked
For Ubuntu 18 we should use Python 3.7 as the OS default.
- if the above leaves the apt_pkg module missing (apt starts to error out), do
$sudo cp /usr/lib/python3/dist-packages/apt_pkg.cpython-36m-x86_64-linux-gnu.so /usr/lib/python3/dist-packages/apt_pkg.so
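A minimal sketch of a Python source install (the version number is only an example; adjust it to the release you need, and install the usual build dependencies such as build-essential, libssl-dev and zlib1g-dev first):
$wget https://www.python.org/ftp/python/3.9.16/Python-3.9.16.tgz
$tar -xf Python-3.9.16.tgz && cd Python-3.9.16
$./configure --enable-optimizations
$make -j "$(nproc)"
$sudo make altinstall # lands in /usr/local/bin/python3.9 without replacing the system python3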
Now tensorflow
:exclamation: Build from source to enable customized C++ ops with the correct GCC. Also, just forget about TF 1.X.
Check here for version support https://www.tensorflow.org/install/source#gpu
- Download the Bazel dist zip and bootstrap it locally: https://docs.bazel.build/versions/4.1.0/install-compile-source.html#bootstrap-bazel Do not use the Bazel installer.
Clone a clean tensorflow git repo and then follow https://www.tensorflow.org/install/source (but be aware: all commands should be run with 'sudo' except Bazel and the pip-installed build deps, which are only needed locally to create the package). For ./configure:
- OpenCL: No
- ComputeCpp: No
- ROCm: No
- TensorRT: No
- use clang as CUDA compiler: No
if you want to build against another CUDA, set the local env variables LD_LIBRARY_PATH, PATH and TF_CUDA_PATHS to point at the desired version, as in the example below.
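For example (the version number is hypothetical; point these at whichever /usr/local/cuda-X.Y you want to build against):
$export TF_CUDA_PATHS=/usr/local/cuda-11.2
$export PATH=/usr/local/cuda-11.2/bin:$PATH
$export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH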
when running bazel build, you can skip the optional [--config=option] flags (they do nothing useful here), but add --verbose_failures to expand the error info:
$.../output/bazel build --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package
If you have conda-installed CUDA-related packages locally (cudatoolkit/magma-cuda), which confuse the PATH, you will probably need to comment out the conda PATH entries in your ~/.bashrc; then it is best to log out, log back in, and check your path with
$echo "$PATH"
The build will take some hours; leave it overnight for Bazel to work.
Then build the pip package and install the wheel (paths as in the official build guide), and things should be good to go:
$sudo ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$sudo pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl
Check Tensorflow with a non-admin user account
check if Tensorflow is using the GPU with
- tf 2
import tensorflow as tf
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)
mind the two settings TF_FORCE_GPU_ALLOW_GROWTH=true and TF_CPP_MIN_LOG_LEVEL=3, and check that they still work
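A more direct check of GPU visibility, assuming the TF 2.x API, can be run straight from the shell:
$python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # should print one PhysicalDevice per GPU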
Update PyTorch
It is pretty straightforward, it just takes a LOOONG time.
- Download the source package and build from source (since we are using the newest CUDA), following https://pytorch.org/
:exclamation: all steps in the github installation guide can be done without sudo except for the final setup.py install
- verify Pytorch (do it outside the Pytorch source folder):
import torch
print(torch.rand(2,3).cuda())
print(torch.version.cuda)
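A quick one-liner from the shell does the same check (assuming the build landed in the default site-packages):
$python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"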
Finally, Matlab
you will need to ask the IT group to issue an admin license so that it supports multiple users.