Eureka Installation: Login Node - calab-ntu/gpu-cluster GitHub Wiki

Installation Procedure

  1. Follow everything in Eureka Installation: Computing Node with the following changes.

    1. Install CentOS

      1. Installation summary:
        • Additional Development
        • Compatibility Libraries
        • Development Tools
        • Emacs
        • File and Storage Server
        • Hardware Monitoring Utilities
        • Infiniband Support
        • Legacy X Window System Compatibility
        • Network File System Client
        • Platform Development
        • Python
        • Technical Writing
        • Email Server
        • Identity Management Server
        • KDE
        • Security Tools
        • System Administration Tools
    2. Set up the network

      Edit /etc/sysconfig/network-scripts/ifcfg-XXX, where XXX is usually "enpXXX":

      BOOTPROTO=static
      ONBOOT=yes
      IPADDR=192.168.0.100
      GATEWAY=192.168.0.1
      NETMASK=255.255.255.0
      DNS1=140.112.254.4

    3. NFS client

      • Mount all remote folders except /software (the login node exports /software itself; see the NFS server step below)
        mount /home; mount /work1; mount /project; mount /projectX; mount /projectY; mount /projectZ
        
  2. Install the CUDA driver

    1. Copy initialization scripts
      cp /work1/shared/eureka/init_script/*.sh  /etc/profile.d/
      cp /work1/shared/eureka/init_script/*.csh /etc/profile.d/
      
      cp /work1/shared/eureka/etc/rc.local /etc/rc.d/
      chmod +x /etc/rc.d/rc.local
      
      Edit /etc/rc.d/rc.local as follows
      1. Comment out the line /usr/bin/nvidia-persistenced --verbose
      2. Comment out the line nvidia-cuda-mps-control -d
      3. Replace the line nvidia-smi -i 0 -c EXCLUSIVE_PROCESS with nvidia-smi -i 0 -c PROHIBITED
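      The three edits above can be scripted; a sketch using sed, assuming the lines appear exactly as listed (patch_rc_local is a hypothetical helper — run it on a copy and inspect the diff before installing):

```shell
# patch_rc_local: apply the three rc.local edits in place (sketch; verify
# the resulting file by hand). The matched lines come from the list above.
patch_rc_local() {
  sed -i \
    -e 's|^/usr/bin/nvidia-persistenced --verbose|#&|' \
    -e 's|^nvidia-cuda-mps-control -d|#&|' \
    -e 's|nvidia-smi -i 0 -c EXCLUSIVE_PROCESS|nvidia-smi -i 0 -c PROHIBITED|' \
    "$1"
}
# Usage: cp /etc/rc.d/rc.local /tmp/rc.local.new
#        patch_rc_local /tmp/rc.local.new
#        diff /etc/rc.d/rc.local /tmp/rc.local.new
```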
  3. TORQUE

    cd /work1/shared/eureka/package/torque/src/torque-3.0.6
    

    Login node only: edit Fish_Install.sh to add the --enable-server configure option

    • WARNING: do NOT run Fish_Install.sh in parallel (i.e., install one node at a time)
    sh Fish_Install.sh >& log.eurekaXX
    cd ../../etc
    cp pbs /etc/init.d/
    

    Edit pbs.conf to set start_server=1 and start_mom=0

    cp pbs.conf /etc/
    cp nodes /var/spool/TORQUE/server_priv/nodes
    systemctl enable pbs
    source /etc/profile.d/torque.sh
    cd ../src/torque-3.0.6/
    ./torque.setup root
    killall pbs_server
    systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
    systemctl status pbs
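    The pbs.conf edit in this step can also be done with sed; a sketch assuming the stock file already contains start_server= and start_mom= lines (set_pbs_roles is a hypothetical helper):

```shell
# set_pbs_roles: make this node run pbs_server but not pbs_mom,
# matching the "start_server=1, start_mom=0" edit described above.
set_pbs_roles() {
  sed -i \
    -e 's|^start_server=.*|start_server=1|' \
    -e 's|^start_mom=.*|start_mom=0|' \
    "$1"
}
# Usage: set_pbs_roles pbs.conf && cp pbs.conf /etc/
```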
    
  4. [Optional] Create the SSH key of root

    ssh-keygen -t rsa
    cd ~/.ssh
    cp id_rsa.pub authorized_keys
    cp id_rsa* authorized_keys /work1/shared/eureka/ssh_root/
    
  5. NFS server

    systemctl enable nfs
    systemctl start nfs
    cp /work1/shared/eureka/etc/exports /etc/
    exportfs -ra
    showmount -e eureka00 # /software 192.168.0.0/24
    
    

    Comment out eureka00:/software in /etc/fstab

  6. CUDA

    1. Driver

      It should have been installed when following Eureka Installation: Computing Node

    2. Libraries and samples

      mkdir /software/cuda /software/cuda/10.0 /software/cuda/10.1 /software/cuda/10.2 /software/cuda/11.3
      ln -s /software/cuda/11.3 /software/cuda/default
      

      Install 11.3

      cd /work1/shared/eureka/package/cuda/
      sh cuda_11.3.1_465.19.01_linux.run  --silent --toolkit --installpath=/software/cuda/11.3
      sh cuda_11.3.1_465.19.01_linux.run  --silent --samples --installpath=/software/cuda/11.3
      

      Install 10.2

      The toolkit and samples must be installed in two separate runs here; this is likely an installer bug, since the older version works.

      cd /work1/shared/eureka/package/cuda/
      sh cuda_10.2.89_440.33.01_linux.run  --silent --toolkit --installpath=/software/cuda/10.2
      sh cuda_10.2.89_440.33.01_linux.run  --silent --samples --installpath=/software/cuda/10.2
      rm -f /usr/local/cuda
      cd /software/cuda/10.2/NVIDIA_CUDA-10.2_Samples
      

      Edit Makefile: append CUDA_PATH=/software/cuda/10.2 to the line +@$(MAKE) -C $(dir $*) $(MAKECMDGOALS)

      make -j 16 -k # Add "-k" to skip some unsupported samples
                    # (https://devtalk.nvidia.com/default/topic/1067000/where-is-quot-nvscibuf-h-quot-/?offset=22)
      
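      The same Makefile edit recurs for 10.1 and 10.0 below; a sketch that appends CUDA_PATH to the recursive make line with sed (add_cuda_path is a hypothetical helper — it touches every line mentioning $(MAKECMDGOALS), so review the diff before building):

```shell
# add_cuda_path <Makefile> <cuda-path>: append CUDA_PATH=<cuda-path> to each
# line containing $(MAKECMDGOALS), so the samples build against that toolkit.
add_cuda_path() {
  sed -i '/$(MAKECMDGOALS)/ s|$| CUDA_PATH='"$2"'|' "$1"
}
# Usage: add_cuda_path Makefile /software/cuda/10.2
```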

      0_Simple/simpleMPI will not be compiled if MPI has not been installed. Ignore errors related to Vulkan and the gl31.h header.

      Install 10.1 and 10.0 for Po-Hsun, since his SR code crashes with 10.2

      cd /work1/shared/eureka/package/cuda/
      sh cuda_10.1.243_418.87.00_linux.run --silent --toolkit --installpath=/software/cuda/10.1
      sh cuda_10.1.243_418.87.00_linux.run --silent --samples --installpath=/software/cuda/10.1
      rm -f /usr/local/cuda
      cd /software/cuda/10.1/NVIDIA_CUDA-10.1_Samples
      

      Edit Makefile: append CUDA_PATH=/software/cuda/10.1 to the line +@$(MAKE) -C $(dir $*) $(MAKECMDGOALS)

      make -j 16
      
      cd /work1/shared/eureka/package/cuda/
      sh cuda_10.0.130_410.48_linux --silent --toolkit --toolkitpath=/software/cuda/10.0
      sh cuda_10.0.130_410.48_linux --silent --samples --samplespath=/software/cuda/10.0 --toolkitpath=/software/cuda/10.0
      rm -f /usr/local/cuda
      cd /software/cuda/10.0/NVIDIA_CUDA-10.0_Samples
      

      Edit Makefile: append CUDA_PATH=/software/cuda/10.0 to the line +@$(MAKE) -C $(dir $*) $(MAKECMDGOALS)

      make -j 16
      
    3. cuDNN

      [Optional] Download the latest version from https://developer.nvidia.com/cudnn --> Download cuDNN

      1. Installation (the following example adopts CUDA 10.1)
        CUDNN_TMP=/work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn-10.1-linux-x64-v7.6.5.32
        CUDA_TMP=/software/cuda/10.1
        cp ${CUDNN_TMP}/include/cudnn.h ${CUDA_TMP}/include
        cp ${CUDNN_TMP}/lib64/libcudnn* ${CUDA_TMP}/lib64
        chmod a+r ${CUDA_TMP}/include/cudnn.h ${CUDA_TMP}/lib64/libcudnn*
        
      2. Test

        Log in to a computing node first. Do NOT use a GNU compiler newer than 8.

        cd /tmp
        export CUDA_PATH=/software/cuda/10.1
        export PATH=$CUDA_PATH/bin:$PATH
        export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
        cp -r /work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn_samples_v7 .
        cd cudnn_samples_v7/mnistCUDNN
        make clean && make -j 16
        ./mnistCUDNN # Test passed!
        cd ../../
        rm -rf cudnn_samples_v7
        
    4. Test

      cd /software/cuda/10.2/NVIDIA_CUDA-10.2_Samples
      
      ./1_Utilities/deviceQuery/deviceQuery
      

      Verify that

      1. 1st line: "Detected 1 CUDA Capable device(s)"
      2. 2nd line: Device 0: "GeForce RTX 2080 Super"
      3. Last line: "Result = PASS"
      ./1_Utilities/bandwidthTest/bandwidthTest
      

      Verify that

      1. Host <-> Device bandwidth ~13 GB/s.
      2. "Result = PASS"
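      The two checks above can be automated; a minimal sketch that scans tool output piped on stdin for the expected final line (check_cuda_test is a hypothetical helper; the matched string comes from the verification lists above):

```shell
# check_cuda_test: read deviceQuery/bandwidthTest output on stdin and
# report OK only if the "Result = PASS" line is present.
check_cuda_test() {
  if grep -q 'Result = PASS'; then
    echo OK
  else
    echo FAIL
    return 1
  fi
}
# Usage: ./1_Utilities/deviceQuery/deviceQuery | check_cuda_test
#        ./1_Utilities/bandwidthTest/bandwidthTest | check_cuda_test
```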
    5. [Optional] Reset the default mode to graphical.target (a.k.a. runlevel 5)

      systemctl set-default graphical.target
      systemctl get-default # graphical.target
      
  7. Intel compiler: oneAPI Base & HPC toolkits

    1. Install the Base toolkit
    wget https://registrationcenter-download.intel.com/akdlm/irc_nas/17431/l_BaseKit_p_2021.1.0.2659.sh
    
    sudo bash l_BaseKit_p_2021.1.0.2659.sh
    
    2. Install the HPC toolkit
    wget https://registrationcenter-download.intel.com/akdlm/irc_nas/17427/l_HPCKit_p_2021.1.0.2684.sh
    
    sudo bash l_HPCKit_p_2021.1.0.2684.sh
    

    Parallel Studio XE 2020 Cluster Edition (license expired)

    mkdir -p /software/intel/
    cd /work1/shared/eureka/package/intel/parallel_studio_xe_2020_cluster_edition_online
    ./install.sh
    

    Follow the instructions and use default unless specified below

    • "I do NOT consent to the collection of my Information"
    • "Activate with serial number": S477-MWZ2BF48
    • "Customize installation" -> "Change install Directory" -> "/software/intel/2020"

    After installation

    mv /opt/intel/licenses /opt/intel/.pset /software/intel
    rm -rf /opt/intel
    cd /opt
    ln -s /software/intel
    cd /software/intel
    ln -s /work1/shared/eureka/package/intel src
    ln -s 2020 default
    

    Check /etc/profile.d/intel.sh and, if necessary, replace /opt with /software
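    The path fix in intel.sh can be applied mechanically; a sketch (fix_intel_prefix is a hypothetical helper — run it on a copy first and diff):

```shell
# fix_intel_prefix: rewrite /opt/intel paths to /software/intel, as the
# check above describes. Operates in place.
fix_intel_prefix() {
  sed -i 's|/opt/intel|/software/intel|g' "$1"
}
# Usage: fix_intel_prefix /etc/profile.d/intel.sh
```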

  8. Valgrind

    mkdir -p /software/valgrind
    cd /work1/shared/eureka/package/valgrind/valgrind-3.15.0
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
    

    After installation

    cd /software/valgrind
    ln -s /work1/shared/eureka/package/valgrind src
    ln -s 3.15.0 default
    

    Check /etc/profile.d/valgrind.sh

  9. OpenMPI

    mkdir -p /software/openmpi
    cd /work1/shared/eureka/package/openmpi/openmpi-4.1.1   # the /software/openmpi/src link is created below
    # [Optional] Edit Fish_Install_with_UCX.sh (remember to un-comment the configuration flags)
    sh Fish_Install_with_UCX.sh >& log.eureka-intel
    

    After installation

    1. check linking to UCX libraries
    cd /software/openmpi/4.1.1-intel-oneapi/bin
    objdump -p mpicxx | grep PATH    # see whether /software/openucx/ucx-1.12.0/lib is in RPATH
    ldd mpicxx | grep ucx            # see whether dynamic linker can find UCX libraries
    
    2. make soft links
    cd /software/openmpi
    ln -s /work1/shared/eureka/package/openmpi src
    unlink default                   # only needed if default already exists
    ln -s /software/openmpi/4.1.1-intel-oneapi default
    
    3. Check /etc/profile.d/openmpi.sh
    4. Check the debugger
    ompi_info | grep "MCA memchecker" # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
    
    5. [Optional] Set the MCA parameters
      1. As root, edit the configuration file /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf (to query the configuration-file path, run: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
      pml=ucx                                       # include only ucx for pml
      osc=ucx                                       # include only ucx for osc
      btl=^openib                                   # exclude openib from btl
      
      This works for OpenMPI 4.1.1 and UCX 1.12.0 without emitting the warning message:
      [eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
      
      1. Reference for excluding openib if the UCX library is installed.
      2. Reference for setting MCA parameters via a configuration file (10. How do I set the value of MCA parameters?, 4. Files)
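    The MCA parameter edit above can be scripted; a sketch that appends the three lines to the conf file (set_mca_params is a hypothetical helper; run as root on the path given above):

```shell
# set_mca_params: append the UCX-related MCA parameters from the step above.
set_mca_params() {
  cat >> "$1" <<'EOF'
pml=ucx          # include only ucx for pml
osc=ucx          # include only ucx for osc
btl=^openib      # exclude openib from btl
EOF
}
# Usage:
# set_mca_params /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf
```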
  10. Maui

    cd /work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/maui/maui-3.3.1
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
    cd etc/
    cp maui.d /etc/init.d/
    cp maui.sh maui.csh /etc/profile.d/
    systemctl enable maui.d
    
    cd /usr/local/maui
    

    Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)

          RMPOLLINTERVAL 00:00:15
          #BACKFILLPOLICY         FIRSTFIT
          #RESERVATIONPOLICY      CURRENTHIGHEST
          #NODEALLOCATIONPOLICY   MINRESOURCE
    
          # <==== Add by Nelson ====>
          JOBAGGREGATIONTIME      00:00:04
    
          # Backfill
    
          BACKFILLPOLICY          FIRSTFIT
          RESERVATIONPOLICY       NEVER
    
          # Node Allocation
    
          NODEALLOCATIONPOLICY    FIRSTAVAILABLE
    
          # Set Job Flags
          JOBACTIONONNODEFAILURE  CANCEL
          JOBNODEMATCHPOLICY      EXACTNODE
    
    systemctl start maui.d
    source /etc/profile.d/maui.sh
    
  11. Other packages

    1. screen, pdsh

      yum -y install screen
      yum -y install pdsh
      
    2. FFTW

      • fftw2
        mkdir -p /software/fftw
        cd /work1/shared/eureka/package/fftw/fftw-2.1.5-revised
        # [Optional] Edit Fish_Install.sh
        sh Fish_Install.sh >& log.eureka-intel
        
        After installation
        cd /software/fftw
        ln -s /work1/shared/eureka/package/fftw src
        ln -s 2.1.5-intel default
        
      • fftw3
        cd /work1/shared/eureka/package/fftw/fftw-3.3.10
        # [Optional] Edit Fish_Install.sh
        sh Fish_Install.sh >& log.eureka-intel
        
        After installation
        cd /software/fftw
        ln -s 3.3.10-intel-2022.0.1-openmpi-4.1.1-ucx_mt default3
        
    3. HDF5

      mkdir -p /software/hdf5
      cd /work1/shared/eureka/package/hdf5/hdf5-1.10.6
      # [Optional] Edit Fish_Install.sh
      sh Fish_Install.sh >& log.eureka
      

      After installation

      cd /software/hdf5
      ln -s /work1/shared/eureka/package/hdf5 src
      ln -s 1.10.6 default
      
    4. GSL

      mkdir -p /software/gsl
      cd /work1/shared/eureka/package/gsl/gsl-2.6
      # [Optional] Edit Fish_Install.sh
      sh Fish_Install.sh >& log.eureka
      

      After installation

      cd /software/gsl
      ln -s /work1/shared/eureka/package/gsl src
      ln -s 2.6 default
      
    5. gnuplot

      mkdir -p /software/gnuplot
      cd /work1/shared/eureka/package/gnuplot/gnuplot-5.2.8
      # [Optional] Edit Fish_Install.sh
      sh Fish_Install.sh >& log.eureka
      

      After installation

      cd /software/gnuplot
      ln -s /work1/shared/eureka/package/gnuplot src
      ln -s 5.2.8 default
      
    6. Latest GNU compiler

      • [Optional] Download the latest version
        GCC_VERSION=9.3.0
        wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.gz
        
      1. Installation
        mkdir -p /software/gcc
        cd /work1/shared/eureka/package/gcc/gcc-9.3.0
        # [Optional] Edit Fish_Install.sh
        sh Fish_Install.sh >& log.eureka
        

        Ref: https://gcc.gnu.org/install/configure.html
        Ref: https://gcc.gnu.org/install/
        Ref: https://gist.github.com/nchaigne/ad06bc867f911a3c0d32939f1e930a11

        After installation

        cd /software/gcc
        ln -s /work1/shared/eureka/package/gcc src
        ln -s 9.3.0 default
        
    7. UCX library

      • [Optional] Download the latest version
        mkdir -p /software/openucx
        cd /software/openucx/
        git clone https://github.com/openucx/ucx.git ucx
        
      1. Installation
        cd /software/openucx/ucx
        ./autogen.sh
        mkdir build
        cd build
        ../contrib/configure-release --prefix=/install_path_for_ucx --enable-mt  # enable MPI_THREAD_MULTIPLE
        make && make install
        

        Ref: https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#%20UCX%20installation
        Ref: https://github.com/openucx/ucx/issues/5284
        Enabling MPI_THREAD_MULTIPLE allows using parallel yt with UCX, without the need to exclude ucx from pml in the script.

  12. Miscellaneous

  13. Language

    Add the following lines to /etc/environment:

    LANG=en_US.utf-8
    LC_ALL=en_US.utf-8 
    
  14. Disable root SSH login with a password

    Add PermitRootLogin prohibit-password to /etc/ssh/sshd_config

  15. Block a user for 5 minutes after 3 failed password attempts

    Add auth required pam_tally2.so deny=3 unlock_time=300 file=/var/log/tallylog after the line auth required pam_sepermit.so in /etc/pam.d/sshd

    Ref: https://www.netadmin.com.tw/netadmin/zh-tw/technology/B0EF7FD662BA40A8995DE16FB2BB9041
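    The PAM edit above can be done with sed's append command; a sketch (add_pam_tally2 is a hypothetical helper — back up /etc/pam.d/sshd before editing in place):

```shell
# add_pam_tally2: insert the pam_tally2 rule right after the pam_sepermit
# line, as described above (deny=3, 5-minute lockout).
add_pam_tally2() {
  sed -i '/^auth[[:space:]]\+required[[:space:]]\+pam_sepermit\.so/a auth required pam_tally2.so deny=3 unlock_time=300 file=/var/log/tallylog' "$1"
}
# Usage: cp /etc/pam.d/sshd /etc/pam.d/sshd.bak && add_pam_tally2 /etc/pam.d/sshd
```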

  16. Fail2ban

    1. yum install fail2ban
    2. Create file /etc/fail2ban/jail.d/sshd.local with following content:
      [sshd]
      enabled = true
      
      bantime = 10m
      bantime.increment = true
      bantime.maxtime = 1d
      
      findtime = 10m
      
      maxretry = 5
      
      destemail = [Admin email] #school email
      action = %(action_mwl)s
      
    3. systemctl enable fail2ban.service
    4. systemctl start fail2ban.service
    5. Give a warning message for user at the login interface
      1. Add content: Banner /etc/issue.net in file /etc/ssh/sshd_config
      2. Back up the original /etc/issue.net, then run: echo "Info: Your account will be blocked for 5 minutes after 3 failed password attempts." > /etc/issue.net
  17. Keep log files for 6 months

    1. Edit file /etc/logrotate.conf:
      # keep 4 weeks worth of backlogs
      rotate 26 #<- change this number from 4 to 26
      
      ...
      
      # no packages own wtmp and btmp -- we'll rotate them here
      /var/log/wtmp {
          monthly
          create 0664 root utmp
              minsize 1M
          rotate 6 #<- change this number from 1 to 6
      }
      
    2. systemctl restart rsyslog.service
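    Both logrotate changes can be applied with sed; a sketch assuming the stock CentOS logrotate.conf quoted above (global "rotate 4", wtmp "rotate 1"; tune_logrotate is a hypothetical helper):

```shell
# tune_logrotate: keep 26 weekly rotations globally and 6 monthly wtmp
# rotations, matching the edits described above.
tune_logrotate() {
  sed -i '0,/^rotate 4/ s//rotate 26/' "$1"                      # global: 4 -> 26
  sed -i '/\/var\/log\/wtmp {/,/^}/ s/rotate 1$/rotate 6/' "$1"  # wtmp: 1 -> 6
}
# Usage: cp /etc/logrotate.conf /tmp/ && tune_logrotate /tmp/logrotate.conf
#        diff /etc/logrotate.conf /tmp/logrotate.conf
```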
