Installing Horovod - jinwooklim/my-exp GitHub Wiki
References
http://solarisailab.com/archives/2627
https://github.com/horovod/horovod
https://samishappy.tistory.com/25
http://lsi.ugr.es/jmantas/pdp/ayuda/datos/instalaciones/Install_OpenMPI_en.pdf
https://lambdalabs.com/blog/horovod-keras-for-multi-gpu-training/
https://github.com/horovod/horovod/blob/master/examples/tensorflow_mnist.py
Installing OpenMPI
https://www.open-mpi.org/software/ompi/v4.0/
- Download and extract 'openmpi-4.0.2.tar.gz'
tar -xzf openmpi-4.0.2.tar.gz
cd ./openmpi-4.0.2
- Configure
./configure --prefix=/home/$USER/.openmpi
The --prefix option sets the directory OpenMPI will be installed into. A common choice is "/home/$USER/.openmpi".
- Install
NPROCS=`grep -c processor /proc/cpuinfo`
make -j $NPROCS all
# -j $NPROCS compiles in parallel, one job per processor
make install
- Environment setting
echo 'export PATH=$PATH:/home/$USER/.openmpi/bin' >> /home/$USER/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/' >> /home/$USER/.bashrc
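Running the two echo commands more than once appends duplicate lines to .bashrc. The following is a minimal sketch of an idempotent variant; `append_once` is a hypothetical helper name introduced here for illustration, not part of OpenMPI.

```shell
#!/usr/bin/env bash
# append_once LINE FILE: append LINE to FILE only if FILE does not
# already contain exactly that line. (Hypothetical helper name.)
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x: match the whole line, -F: treat LINE as a literal string
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage with the OpenMPI paths from the steps above (uncomment to apply):
# append_once 'export PATH=$PATH:/home/$USER/.openmpi/bin' "$HOME/.bashrc"
# append_once 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/' "$HOME/.bashrc"
```

The new settings only take effect in a new shell or after running source ~/.bashrc.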
Verifying the OpenMPI installation
$ mpirun
--------------------------------------------------------------------------
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
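The message above shows that some mpirun runs, but not which one. Because the install directory is appended to the end of PATH, a system-wide mpirun (if one exists) would be found first. A small sketch to check which binary the shell resolves; `resolve_cmd` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the full path the shell resolves for a command; fail if it is absent.
resolve_cmd() {
    command -v "$1"
}

if path=$(resolve_cmd mpirun); then
    echo "mpirun found at: $path"
    # Expected to point into /home/$USER/.openmpi/bin if that copy is the one in use.
    mpirun --version
else
    echo "mpirun is not on PATH; open a new shell or run: source ~/.bashrc"
fi
```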
Prerequisites before integrating TensorFlow with Horovod
(optional) conda install gxx_linux-64
# required in a conda environment
gcc -v
# gcc version > 4.9
pip install tensorflow
# version 1.13 is assumed
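The gcc version requirement can be checked automatically instead of reading the output of gcc -v. This is a rough sketch that assumes `sort -V` (version sort, as in GNU coreutils) is available; `version_gt` is a hypothetical helper name, and 4.9 is the threshold stated above.

```shell
#!/usr/bin/env bash
# version_gt A B: succeed if version A is strictly greater than version B.
# Relies on `sort -V` (version sort), assumed to be available.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

if command -v gcc >/dev/null; then
    gcc_version=$(gcc -dumpversion)
    if version_gt "$gcc_version" 4.9; then
        echo "gcc $gcc_version is newer than 4.9"
    else
        echo "gcc $gcc_version is too old; a version above 4.9 is expected"
    fi
else
    echo "gcc not found"
fi
```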
Installing NCCL
Reference: https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
Note: If you are using the network repository, the following command will upgrade CUDA to the latest version.
sudo apt install libnccl2 libnccl-dev
Or, if you prefer to keep an older version of CUDA, pin a specific package version, for example:
sudo apt install libnccl2=2.5.6-1+cuda10.0 libnccl-dev=2.5.6-1+cuda10.0
# pinned to CUDA 10.0
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu' >> ~/.bashrc
source ~/.bashrc
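The pinned version string has the form NCCL version, package revision, then CUDA version. A small sketch that assembles it; `nccl_pin` is a hypothetical helper name, and the "-1" revision is copied from the example above, so it is an assumption for any other version.

```shell
#!/usr/bin/env bash
# nccl_pin NCCL_VERSION CUDA_VERSION
# Build the apt version string in the form used above,
# e.g. 2.5.6 and 10.0 -> 2.5.6-1+cuda10.0
# The "-1" package revision is taken from that example (assumption);
# `apt-cache madison libnccl2` lists the versions the repository offers.
nccl_pin() {
    printf '%s-1+cuda%s\n' "$1" "$2"
}

pin=$(nccl_pin 2.5.6 10.0)
# Only prints the command; nothing is installed by this sketch.
echo "sudo apt install libnccl2=$pin libnccl-dev=$pin"
```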
Installing Horovod
sudo dpkg-query -L libnccl-dev
# check where the package files were installed
/usr/lib/x86_64-linux-gnu/
/usr/include/
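The library directory can also be extracted from the dpkg-query listing rather than read by eye. This sketch assumes the libnccl-dev package lists a file named exactly libnccl.so; `nccl_lib_dir` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read a file listing (one path per line) on stdin and print the directory
# that holds libnccl.so. Fails if no such path is listed.
nccl_lib_dir() {
    local so
    so=$(grep -m 1 -E '/libnccl\.so$') || return 1
    dirname "$so"
}

if command -v dpkg-query >/dev/null; then
    dpkg-query -L libnccl-dev 2>/dev/null | nccl_lib_dir \
        || echo "libnccl-dev is not installed or libnccl.so was not listed"
fi
```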
HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod
2-1. The command below, which omits HOROVOD_GPU_ALLREDUCE=NCCL, worked.
HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod
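A quick way to confirm that the build produced a usable TensorFlow extension is to import the module. This is a minimal sketch; `check_horovod_import` and the `PYTHON` override variable are hypothetical names introduced here, and the default interpreter is assumed to be the same python that pip installed into.

```shell
#!/usr/bin/env bash
# Try to import Horovod's TensorFlow module with the chosen interpreter.
# PYTHON is a hypothetical override for this sketch; the default is `python`.
check_horovod_import() {
    "${PYTHON:-python}" -c 'import horovod.tensorflow as hvd; print("horovod.tensorflow import OK")'
}

check_horovod_import 2>/dev/null \
    || echo "horovod.tensorflow could not be imported; check the pip build log"
```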
TensorFlow with Horovod
horovodrun -np 4 -H localhost:4 python tensorflow_mnist.py
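In the command above, -np is the total number of processes and -H localhost:4 gives the local machine four slots. The sketch below builds the same command for however many GPUs are present, assuming one process per GPU and that `nvidia-smi -L` prints one line per GPU; `horovodrun_cmd` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# horovodrun_cmd NUM_PROCS SCRIPT
# Print a single-host horovodrun command with one slot per process.
horovodrun_cmd() {
    printf 'horovodrun -np %s -H localhost:%s python %s\n' "$1" "$1" "$2"
}

# Count GPUs if nvidia-smi is present; otherwise fall back to 1 (assumption).
if command -v nvidia-smi >/dev/null; then
    num_gpus=$(nvidia-smi -L | wc -l | tr -d ' ')
else
    num_gpus=1
fi

# Only prints the command; copy and run it once it looks right.
horovodrun_cmd "$num_gpus" tensorflow_mnist.py
```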