Horovod Installation - jinwooklim/my-exp GitHub Wiki

References
http://solarisailab.com/archives/2627
https://github.com/horovod/horovod
https://samishappy.tistory.com/25
http://lsi.ugr.es/jmantas/pdp/ayuda/datos/instalaciones/Install_OpenMPI_en.pdf
https://lambdalabs.com/blog/horovod-keras-for-multi-gpu-training/
https://github.com/horovod/horovod/blob/master/examples/tensorflow_mnist.py

Installing OpenMPI

  1. Go to https://www.open-mpi.org/software/ompi/v4.0/
  2. Download and extract 'openmpi-4.0.2.tar.gz'
  3. cd ./openmpi-4.0.2 # the extracted directory, not the .tar.gz archive
  4. ./configure --prefix=/home/$USER/.openmpi
    The --prefix option sets the directory where OpenMPI will be installed.
    A common choice is a directory under your home, such as /home/$USER/.openmpi.
  5. Install
    5-1. NPROCS=$(grep -c ^processor /proc/cpuinfo) # number of logical CPUs
    5-2. make -j $NPROCS all # compile in parallel
    5-3. make install
  6. Environment setting
    6-1. echo 'export PATH=$PATH:/home/$USER/.openmpi/bin' >> /home/$USER/.bashrc
    6-2. echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/.openmpi/lib/' >> /home/$USER/.bashrc
    6-3. source /home/$USER/.bashrc # apply the new variables to the current shell
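
Running step 6 more than once appends duplicate export lines to .bashrc. A minimal sketch of an idempotent variant follows; `append_once` is a hypothetical helper name introduced here for illustration, not part of OpenMPI or bash.

```shell
# append_once FILE LINE (hypothetical helper):
# append LINE to FILE only if FILE does not already contain that exact line.
append_once() {
    touch "$1"
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}
```

For example, `append_once "$HOME/.bashrc" 'export PATH=$PATH:/home/$USER/.openmpi/bin'` could stand in for step 6-1, and the same call with the LD_LIBRARY_PATH line for step 6-2.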

Verifying the OpenMPI installation

  1. $ mpirun
--------------------------------------------------------------------------
mpirun could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
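
The message above only shows that some mpirun binary starts. To also see which binary is picked up from PATH and which version it reports, a small check along these lines may help. `report_tool` is a hypothetical helper name; `mpirun --version` is a standard Open MPI option.

```shell
# report_tool NAME (hypothetical helper):
# print where NAME resolves on PATH, or a hint if it is missing.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s found at %s\n' "$1" "$(command -v "$1")"
    else
        printf '%s not found on PATH; check the export lines in ~/.bashrc\n' "$1"
        return 1
    fi
}

# Report mpirun, and print its version only when it is available.
if report_tool mpirun; then
    mpirun --version || echo "mpirun is present but did not report a version"
fi
```

If the reported path is not under /home/$USER/.openmpi/bin, another MPI installation is probably earlier on PATH.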

Prerequisites before using Horovod with TensorFlow

  1. (optional) conda install gxx_linux-64 # required when working in a conda environment
  2. gcc -v # the gcc version must be higher than 4.9
  3. pip install tensorflow # this guide assumes version 1.13
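
Reading the version off `gcc -v` by eye is easy to get wrong, so the comparison in step 2 can be scripted. The sketch below is one way to do it, assuming a gcc that accepts `-dumpfullversion -dumpversion` (a common idiom for getting the full version string on both older and newer releases). `version_gt` is a hypothetical helper name.

```shell
# version_gt A B (hypothetical helper):
# succeed only when dotted version A is strictly greater than B.
version_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0   # missing fields count as 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 1                              # equal versions are not greater
    }'
}

# Compare the installed gcc (if any) against the 4.9 threshold from step 2.
if command -v gcc >/dev/null 2>&1; then
    gcc_version=$(gcc -dumpfullversion -dumpversion)
    if version_gt "$gcc_version" 4.9; then
        echo "gcc $gcc_version is newer than 4.9"
    else
        echo "gcc $gcc_version is too old"
    fi
fi
```

The comparison is numeric per field, so 10.2 is correctly treated as newer than 4.9, which a plain string comparison would get wrong.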

Installing NCCL

Reference: https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
Note: If you are using the network repository, the following command will upgrade CUDA to the latest version.

  1. sudo apt install libnccl2 libnccl-dev
    Or, if you prefer to keep an older version of CUDA, pin a specific version, for example:
    sudo apt install libnccl2=2.5.6-1+cuda10.0 libnccl-dev=2.5.6-1+cuda10.0 # pinned to CUDA 10.0
  2. echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu' >> ~/.bashrc
  3. source ~/.bashrc
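
The pinned package names in step 1 follow the pattern `<package>=<nccl version>+cuda<cuda version>`. The sketch below builds that pair from two arguments so the versions cannot drift apart between libnccl2 and libnccl-dev. `nccl_pkg_spec` is a hypothetical helper name, and the apt commands in the comments are suggestions that are not executed by the block.

```shell
# nccl_pkg_spec NCCL_VERSION CUDA_VERSION (hypothetical helper):
# print the two pinned package names in the form used in step 1.
nccl_pkg_spec() {
    printf 'libnccl2=%s+cuda%s libnccl-dev=%s+cuda%s\n' "$1" "$2" "$1" "$2"
}

nccl_pkg_spec 2.5.6-1 10.0
# prints: libnccl2=2.5.6-1+cuda10.0 libnccl-dev=2.5.6-1+cuda10.0

# Commands one might run around it (shown as comments, not executed here):
#   apt-cache policy libnccl2                       # list the versions apt can see
#   sudo apt install $(nccl_pkg_spec 2.5.6-1 10.0)  # install the pinned pair
#   sudo apt-mark hold libnccl2 libnccl-dev         # keep later upgrades from replacing them
```

Whether a given version string exists depends on the repository in use, so checking `apt-cache policy` first is the safer order.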

Installing Horovod

  1. sudo dpkg-query -L libnccl-dev # check where the NCCL files are installed
/usr/lib/x86_64-linux-gnu/
/usr/include/
  2. HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod
    2-1. The following variant, without HOROVOD_GPU_ALLREDUCE=NCCL, is the one that worked: HOROVOD_NCCL_HOME=/usr/lib/x86_64-linux-gnu HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir horovod
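
Because the two install commands differ only in HOROVOD_GPU_ALLREDUCE=NCCL, it can be worth checking afterwards whether the resulting build has NCCL enabled. Recent Horovod releases provide `horovodrun --check-build` for this. The sketch below assumes its output marks enabled features with lines of the form `[X] NAME`; `has_feature` is a hypothetical helper name.

```shell
# has_feature NAME (hypothetical helper):
# read build-summary text on stdin and succeed if NAME is marked "[X]".
has_feature() {
    grep -q "\[X\] $1[[:space:]]*\$"
}

# Only query Horovod when horovodrun is actually installed.
if command -v horovodrun >/dev/null 2>&1; then
    if horovodrun --check-build | has_feature NCCL; then
        echo "NCCL tensor operations are enabled"
    else
        echo "NCCL is not enabled in this build"
    fi
fi
```

If NCCL is reported as not enabled, allreduce most likely falls back to MPI, which still works but is usually slower on multi-GPU machines.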

TensorFlow with Horovod

  1. horovodrun -np 4 -H localhost:4 python tensorflow_mnist.py
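
In the command above, `-np 4` and `localhost:4` both have to match the number of processes, which is normally one per GPU. A minimal sketch that derives both from a single number follows. `horovod_cmd` is a hypothetical helper name, and the fallback of 4 simply mirrors the example above. The block only prints the command and does not launch training.

```shell
# horovod_cmd N SCRIPT (hypothetical helper):
# print a single-host horovodrun command line that starts N processes.
horovod_cmd() {
    printf 'horovodrun -np %s -H localhost:%s python %s\n' "$1" "$1" "$2"
}

# Count GPUs with nvidia-smi when it is available; otherwise fall back to 4.
if command -v nvidia-smi >/dev/null 2>&1; then
    num_gpus=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)
else
    num_gpus=4
fi

horovod_cmd "$num_gpus" tensorflow_mnist.py
```

To run the result directly instead of printing it, the output can be passed to `sh -c`, for example `sh -c "$(horovod_cmd 4 tensorflow_mnist.py)"`.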