Eureka Installation: Login Node - calab-ntu/gpu-cluster GitHub Wiki
Installation Procedure
-
Follow everything in Eureka Installation: Computing Node, with the following changes.
-
Install CentOS
- Installation summary:
- Additional Development
- Compatibility Libraries
- Development Tools
- Emacs
- File and Storage Server
- Hardware Monitoring Utilities
- Infiniband Support
- Legacy X Window System Compatibility
- Network File System Client
- Platform Development
- Python
- Technical Writing
- Email Server
- Identity Management Server
- KDE
- Security Tools
- System Administration Tools
-
Set up network
Edit /etc/sysconfig/network-scripts/ifcfg-XXX, where XXX is usually "enpXXX"
    BOOTPROTO=static
    ONBOOT=yes
    IPADDR=192.168.0.100
    GATEWAY=192.168.0.1
    NETMASK=255.255.255.0
    DNS1=140.112.254.4
-
NFS client
- Mount all remote folders
  Do not mount /software, which this node serves itself (see NFS server)
    mount /home; mount /work1; mount /project; mount /projectX; mount /projectY; mount /projectZ
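The mount step can also be written as a loop. This is a minimal sketch, assuming each folder already has an /etc/fstab entry (as the bare `mount /home` form implies). The function name `mount_remote_folders` and the `DRY_RUN` variable are illustrative and not part of the cluster scripts. /software is left out on purpose, following the note above.

```shell
# Hypothetical helper: mount each remote folder unless it is already mounted.
# Set DRY_RUN=1 to print the mount commands instead of running them.
mount_remote_folders() {
    for dir in /home /work1 /project /projectX /projectY /projectZ; do
        if mountpoint -q "$dir" 2>/dev/null; then
            echo "skip $dir (already mounted)"
        elif [ "${DRY_RUN:-0}" = "1" ]; then
            echo "mount $dir"
        else
            mount "$dir"
        fi
    done
}
```

Running `DRY_RUN=1 mount_remote_folders` first shows what would be mounted without changing anything.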
-
-
Install the CUDA driver
- Copy initialization scripts
    cp /work1/shared/eureka/init_script/*.sh /etc/profile.d/
    cp /work1/shared/eureka/init_script/*.csh /etc/profile.d/
    cp /work1/shared/eureka/etc/rc.local /etc/rc.d/
    chmod +x /etc/rc.d/rc.local
  Edit /etc/rc.d/rc.local as follows
  - Comment out the line /usr/bin/nvidia-persistenced --verbose
  - Comment out the line nvidia-cuda-mps-control -d
  - Replace nvidia-smi -i 0 -c EXCLUSIVE_PROCESS by nvidia-smi -i 0 -c PROHIBITED
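The three rc.local edits can be applied with sed instead of by hand. This is a minimal sketch, assuming GNU sed (the CentOS default) and assuming the three lines appear in rc.local exactly as quoted above. `patch_rc_local` is an illustrative name, not an existing script.

```shell
# Hypothetical helper: apply the three login-node edits to an rc.local file.
# A copy of the original is kept next to it as <file>.bak.
patch_rc_local() {
    file=$1
    cp -p "$file" "$file.bak"
    sed -i \
        -e 's|^\([[:space:]]*\)\(/usr/bin/nvidia-persistenced --verbose\)|\1#\2|' \
        -e 's|^\([[:space:]]*\)\(nvidia-cuda-mps-control -d\)|\1#\2|' \
        -e 's|\(nvidia-smi -i 0 -c \)EXCLUSIVE_PROCESS|\1PROHIBITED|' \
        "$file"
}
```

Usage would be `patch_rc_local /etc/rc.d/rc.local`, followed by a `diff` against the .bak copy to confirm that only the intended lines changed.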
-
TORQUE
    cd /work1/shared/eureka/package/torque/src/torque-3.0.6
Login node only: edit Fish_Install.sh to enable --enable-server
- WARNING: do NOT run Fish_Install.sh in parallel (i.e., install one node at a time)
    sh Fish_Install.sh >& log.eurekaXX
    cd ../../etc
    cp pbs /etc/init.d/
Edit pbs.conf to set start_server=1 and start_mom=0
    cp pbs.conf /etc/
    cp nodes /var/spool/TORQUE/server_priv/nodes
    systemctl enable pbs
    source /etc/profile.d/torque.sh
    cd ../src/torque-3.0.6/
    ./torque.setup root
    killall pbs_server
    systemctl start pbs
    # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
    systemctl status pbs
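The pbs.conf edit can be scripted as well. This is a minimal sketch, assuming pbs.conf holds plain key=value lines (as the start_server=1 and start_mom=0 settings suggest) and assuming GNU sed. `set_conf_key` is an illustrative name, not part of TORQUE, and it is only meant for simple values without `|` or `&`.

```shell
# Hypothetical helper: set KEY=VALUE in a key=value config file,
# replacing an existing assignment or appending one if the key is absent.
set_conf_key() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Login node: run the server, not the MOM daemon.
#   set_conf_key pbs.conf start_server 1
#   set_conf_key pbs.conf start_mom 0
```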
-
[Optional] Create the SSH key of root
    ssh-keygen -t rsa
    cd ~/.ssh
    cp id_rsa.pub authorized_keys
    cp id_rsa* authorized_keys /work1/shared/eureka/ssh_root/
-
NFS server
    systemctl enable nfs
    systemctl start nfs
    cp /work1/shared/eureka/etc/exports /etc/
    exportfs -ra
    showmount -e eureka00
    # /software 192.168.0.0/24
Comment out eureka00:/software in /etc/fstab
-
CUDA
-
Driver
It should have been installed when following Eureka Installation: Computing Node
-
Libraries and samples
    mkdir /software/cuda /software/cuda/10.0 /software/cuda/10.1 /software/cuda/10.2 /software/cuda/11.3
    ln -s /software/cuda/11.3 /software/cuda/default
Install 11.3
    cd /work1/shared/eureka/package/cuda/
    sh cuda_11.3.1_465.19.01_linux.run --silent --toolkit --installpath=/software/cuda/11.3
    sh cuda_11.3.1_465.19.01_linux.run --silent --samples --installpath=/software/cuda/11.3
Install 10.2
Somehow toolkit and samples need to be installed separately --> Likely a bug in the installer since the older version works
    cd /work1/shared/eureka/package/cuda/
    sh cuda_10.2.89_440.33.01_linux.run --silent --toolkit --installpath=/software/cuda/10.2
    sh cuda_10.2.89_440.33.01_linux.run --silent --samples --installpath=/software/cuda/10.2
    rm -f /usr/local/cuda
    cd /software/cuda/10.2/NVIDIA_CUDA-10.2_Samples
Edit Makefile to add CUDA_PATH=/software/cuda/10.2 to the end of
    #+@$(MAKE) -C $(dir $*) $(MAKECMDGOALS)
    make -j 16 -k
    # Add "-k" to skip some unsupported samples
    # (https://devtalk.nvidia.com/default/topic/1067000/where-is-quot-nvscibuf-h-quot-/?offset=22)
0_Simple/simpleMPI will not be compiled if MPI has not been installed
Ignore errors related to the vulkan GPU and gl31.h header
Install 10.1 and 10.0 for Po-Hsun since his SR code somehow crashes with 10.2
    cd /work1/shared/eureka/package/cuda/
    sh cuda_10.1.243_418.87.00_linux.run --silent --toolkit --installpath=/software/cuda/10.1
    sh cuda_10.1.243_418.87.00_linux.run --silent --samples --installpath=/software/cuda/10.1
    rm -f /usr/local/cuda
    cd /software/cuda/10.1/NVIDIA_CUDA-10.1_Samples
Edit Makefile to add CUDA_PATH=/software/cuda/10.1 to the end of
    #+@$(MAKE) -C $(dir $*) $(MAKECMDGOALS)
    make -j 16
    cd /work1/shared/eureka/package/cuda/
    sh cuda_10.0.130_410.48_linux --silent --toolkit --toolkitpath=/software/cuda/10.0
    sh cuda_10.0.130_410.48_linux --silent --samples --samplespath=/software/cuda/10.0 --toolkitpath=/software/cuda/10.0
    rm -f /usr/local/cuda
    cd /software/cuda/10.0/NVIDIA_CUDA-10.0_Samples
Edit Makefile to add CUDA_PATH=/software/cuda/10.0 to the end of
    #+@$(MAKE) -C $(dir $*) $(MAKECMDGOALS)
    make -j 16
-
cuDNN
[Optional] Download the latest version
https://developer.nvidia.com/cudnn --> Download cuDNN
- Installation (the following example adopts CUDA 10.1)
    CUDNN_TMP=/work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn-10.1-linux-x64-v7.6.5.32
    CUDA_TMP=/software/cuda/10.1
    cp ${CUDNN_TMP}/include/cudnn.h ${CUDA_TMP}/include
    cp ${CUDNN_TMP}/lib64/libcudnn* ${CUDA_TMP}/lib64
    chmod a+r ${CUDA_TMP}/include/cudnn.h ${CUDA_TMP}/lib64/libcudnn*
- Test
  Log in to a computing node first
  Do NOT use a GNU compiler later than version 8
    cd /tmp
    export CUDA_PATH=/software/cuda/10.1
    export PATH=$CUDA_PATH/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH
    cp -r /work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/cudnn/cudnn_samples_v7 .
    cd cudnn_samples_v7/mnistCUDNN
    make clean && make -j 16
    ./mnistCUDNN
    # Test passed!
    cd ../../
    rm -rf cudnn_samples_v7
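Before running the sample, it can help to confirm which cuDNN header was actually copied. This is a minimal sketch, assuming the cuDNN 7 layout in which cudnn.h itself defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL (newer releases keep these in a separate header). `cudnn_version` is an illustrative name.

```shell
# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL read from a cuDNN 7 header.
cudnn_version() {
    header=$1
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$header"
}
```

For the installation above, `cudnn_version /software/cuda/10.1/include/cudnn.h` should report the version of the v7.6.5.32 package.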
-
Test
    cd /software/cuda/10.2/NVIDIA_CUDA-10.2_Samples
    ./1_Utilities/deviceQuery/deviceQuery
Verify that
- 1st line: "Detected 1 CUDA Capable device(s)"
- 2nd line: Device 0: "GeForce RTX 2080 Super"
- Last line: "Result = PASS"
    ./1_Utilities/bandwidthTest/bandwidthTest
Verify that
- Host <-> Device bandwidth ~13 GB/s.
- "Result = PASS"
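The "Result = PASS" check can be automated so that a failing sample is not overlooked. This is a minimal sketch; `check_sample` is an illustrative name, and it only looks for the PASS line, not for the device name or the bandwidth figure.

```shell
# Hypothetical helper: run a command, show its output, and succeed only
# if the output contains the line fragment "Result = PASS".
check_sample() {
    output=$("$@" 2>&1)
    printf '%s\n' "$output"
    printf '%s\n' "$output" | grep -q "Result = PASS"
}

# Example (from the samples directory):
#   check_sample ./1_Utilities/deviceQuery/deviceQuery && echo "deviceQuery OK"
#   check_sample ./1_Utilities/bandwidthTest/bandwidthTest && echo "bandwidthTest OK"
```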
-
[Optional] Reset the default mode to graphical.target (a.k.a. runlevel 5)
    systemctl set-default graphical.target
    systemctl get-default
    # graphical.target
-
-
Intel compiler oneAPI Base & HPC toolkit
- Install Base toolkit
    wget https://registrationcenter-download.intel.com/akdlm/irc_nas/17431/l_BaseKit_p_2021.1.0.2659.sh
    sudo bash l_BaseKit_p_2021.1.0.2659.sh
- Install HPC toolkit
    wget https://registrationcenter-download.intel.com/akdlm/irc_nas/17427/l_HPCKit_p_2021.1.0.2684.sh
    sudo bash l_HPCKit_p_2021.1.0.2684.sh
Parallel Studio XE 2020 Cluster Edition (license expired)
    mkdir -p /software/intel/
    cd /work1/shared/eureka/package/intel/parallel_studio_xe_2020_cluster_edition_online
    ./install.sh
Follow the instructions and use default unless specified below
    "I do NOT consent to the collection of my Information"
    "Activate with serial number": S477-MWZ2BF48
    "Customize installation" -> "Change install Directory" -> "/software/intel/2020"
After installation
    mv /opt/intel/licenses /opt/intel/.pset /software/intel
    rm -rf /opt/intel
    cd /opt
    ln -s /software/intel
    cd /software/intel
    ln -s /work1/shared/eureka/package/intel src
    ln -s 2020 default
Check /etc/profile.d/intel.sh and, if necessary, replace opt by software
-
Valgrind
    mkdir -p /software/valgrind
    cd /work1/shared/eureka/package/valgrind/valgrind-3.15.0
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
After installation
    cd /software/valgrind
    ln -s /work1/shared/eureka/package/valgrind src
    ln -s 3.15.0 default
Check /etc/profile.d/valgrind.sh
-
OpenMPI
    mkdir -p /software/openmpi
    cd /software/openmpi/src/openmpi-4.1.1
    # [Optional] Edit Fish_Install_with_UCX.sh (remember to un-comment the configuration flags)
    sh Fish_Install_with_UCX.sh >& log.eureka-intel
After installation
- Check linking to UCX libraries
    cd /software/openmpi/4.1.1-intel-oneapi/bin
    objdump -p mpicxx | grep PATH    # see whether /software/openucx/ucx-1.12.0/lib is in RPATH
    ldd mpicxx | grep ucx            # see whether dynamic linker can find UCX libraries
- Make soft link
    cd /software/openmpi
    ln -s /work1/shared/eureka/package/openmpi src
    unlink default    # optional, if default already existed
    ln -s /software/openmpi/4.1.1-intel-oneapi default
- Check /etc/profile.d/openmpi.sh
- Check debugger
    ompi_info | grep "MCA memchecker"
    # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
- Set the MCA parameters [Optional]
  - Edit the configuration file /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to query the path of the configuration file, one can use: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
        pml=ucx       # include only ucx for pml
        osc=ucx       # include only ucx for osc
        btl=^openib   # exclude openib from btl
    which works for OpenMPI 4.1.1 and UCX 1.12.0, without giving the warning message:
        [eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
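When the MCA parameters are added from a script rather than an editor, it is worth making the step safe to rerun. This is a minimal sketch; `append_once` is an illustrative name, and the settings are written without the trailing comments shown above.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so rerunning the setup does not duplicate it.
append_once() {
    file=$1 line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root):
#   CONF=/software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf
#   append_once "$CONF" "pml=ucx"
#   append_once "$CONF" "osc=ucx"
#   append_once "$CONF" "btl=^openib"
```

The `-F` option makes grep treat `^` in `btl=^openib` as a literal character rather than a regex anchor.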
-
Maui
    cd /work1/fish/Code_Backup/GPU_Cluster_Setup/eureka/package/maui/maui-3.3.1
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
    cd etc/
    cp maui.d /etc/init.d/
    cp maui.sh maui.csh /etc/profile.d/
    systemctl enable maui.d
    cd /usr/local/maui
Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)
    RMPOLLINTERVAL 00:00:15
    #BACKFILLPOLICY FIRSTFIT
    #RESERVATIONPOLICY CURRENTHIGHEST
    #NODEALLOCATIONPOLICY MINRESOURCE
    # <==== Add by Nelson ====>
    JOBAGGREGATIONTIME 00:00:04
    # Backfill
    BACKFILLPOLICY FIRSTFIT
    RESERVATIONPOLICY NEVER
    # Node Allocation
    NODEALLOCATIONPOLICY FIRSTAVAILABLE
    # Set Job Flags
    JOBACTIONONNODEFAILURE CANCEL
    JOBNODEMATCHPOLICY EXACTNODE

    systemctl start maui.d
    source /etc/profile.d/maui.sh
-
Other packages
-
screen, pdsh
    yum -y install screen
    yum -y install pdsh
-
FFTW
fftw2
    mkdir -p /software/fftw
    cd /work1/shared/eureka/package/fftw/fftw-2.1.5-revised
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka-intel
After installation
    cd /software/fftw
    ln -s /work1/shared/eureka/package/fftw src
    ln -s 2.1.5-intel default
fftw3
    cd /work1/shared/eureka/package/fftw/fftw-3.3.10
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka-intel
After installation
    cd /software/fftw
    ln -s 3.3.10-intel-2022.0.1-openmpi-4.1.1-ucx_mt default3
-
HDF5
    mkdir -p /software/hdf5
    cd /work1/shared/eureka/package/hdf5/hdf5-1.10.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
After installation
    cd /software/hdf5
    ln -s /work1/shared/eureka/package/hdf5 src
    ln -s 1.10.6 default
-
GSL
    mkdir -p /software/gsl
    cd /work1/shared/eureka/package/gsl/gsl-2.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
After installation
    cd /software/gsl
    ln -s /work1/shared/eureka/package/gsl src
    ln -s 2.6 default
-
gnuplot
    mkdir -p /software/gnuplot
    cd /work1/shared/eureka/package/gnuplot/gnuplot-5.2.8
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
After installation
    cd /software/gnuplot
    ln -s /work1/shared/eureka/package/gnuplot src
    ln -s 5.2.8 default
-
Latest GNU compiler
- [Optional] Download the latest version
    GCC_VERSION=9.3.0
    wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.gz
- Installation
    mkdir -p /software/gcc
    cd /work1/shared/eureka/package/gcc/gcc-9.3.0
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.eureka
  Ref:
  https://gcc.gnu.org/install/configure.html
  https://gcc.gnu.org/install/
  https://gist.github.com/nchaigne/ad06bc867f911a3c0d32939f1e930a11
  After installation
    cd /software/gcc
    ln -s /work1/shared/eureka/package/gcc src
    ln -s 9.3.0 default
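Every package on this page ends with the same step: a `default` link pointing at one version directory. When a newer version is installed later, the existing link has to be replaced rather than created. This is a minimal sketch, assuming GNU coreutils `ln`; `set_default_version` is an illustrative name, not an existing script.

```shell
# Hypothetical helper: point <base>/default at <base>/<version>,
# replacing an existing default link if there is one.
set_default_version() {
    base=$1 version=$2
    [ -d "$base/$version" ] || { echo "missing $base/$version" >&2; return 1; }
    # -n keeps ln from descending into the directory the old link points to.
    ln -sfn "$version" "$base/default"
}

# Example:
#   set_default_version /software/gcc 9.3.0
```

The directory check guards against leaving `default` pointing at a version that was never installed.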
-
UCX library
- [Optional] Download the latest version
    mkdir -p /software/openucx
    cd /software/openucx/
    git clone https://github.com/openucx/ucx.git ucx
- Installation
    cd /software/openucx/ucx
    ./autogen.sh
    mkdir build
    cd build
    ../contrib/configure-release --prefix=/install_path_for_ucx --enable-mt   # enable MPI_THREAD_MULTIPLE
    make && make install
  Ref: https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX#%20UCX%20installation
  Ref: https://github.com/openucx/ucx/issues/5284
  Enabling MPI_THREAD_MULTIPLE allows using parallel yt with UCX, without the need to exclude ucx from pml in the script.
-
-
Miscellaneous
-
Language
Add the following lines to /etc/environment
    LANG=en_US.utf-8
    LC_ALL=en_US.utf-8
-
Disable password-based ssh login for root (key-based login remains allowed)
Add PermitRootLogin prohibit-password to /etc/ssh/sshd_config
-
Block a user for 5 minutes after 3 failed password attempts.
Add auth required pam_tally2.so deny=3 unlock_time=300 file=/var/log/tallylog after the line auth required pam_sepermit.so in the file /etc/pam.d/sshd
ref. https://www.netadmin.com.tw/netadmin/zh-tw/technology/B0EF7FD662BA40A8995DE16FB2BB9041
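The PAM edit depends on line order, so a scripted version has to insert the rule directly after the pam_sepermit.so line instead of appending it. This is a minimal sketch, assuming GNU sed (its one-line `a` command) and a pam_sepermit.so line of the usual `auth required pam_sepermit.so` form with any amount of whitespace. `add_tally2_rule` is an illustrative name.

```shell
# Hypothetical helper: insert the pam_tally2 rule after the pam_sepermit.so
# line of a PAM file; does nothing if a pam_tally2 rule is already there.
add_tally2_rule() {
    file=$1
    rule='auth required pam_tally2.so deny=3 unlock_time=300 file=/var/log/tallylog'
    grep -q 'pam_tally2\.so' "$file" && return 0
    cp -p "$file" "$file.bak"
    sed -i "/^auth[[:space:]]\{1,\}required[[:space:]]\{1,\}pam_sepermit\.so/a $rule" "$file"
}

# Example (as root, with a second root session kept open in case of mistakes):
#   add_tally2_rule /etc/pam.d/sshd
```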
-
Fail2ban
    yum install fail2ban
- Create file /etc/fail2ban/jail.d/sshd.local with the following content:
    [sshd]
    enabled = true
    bantime = 10m
    bantime.increment = true
    bantime.maxtime = 1d
    findtime = 10m
    maxretry = 5
    destemail = [Admin email] #school email
    action = %(action_mwl)s

    systemctl enable fail2ban.service
    systemctl start fail2ban.service
- Show a warning message to users at the login prompt
  - Add content: Banner /etc/issue.net in file /etc/ssh/sshd_config
  - Back up the original file /etc/issue.net and do
      echo "Info: Note that your account will be blocked for 5 minutes if you fail the password 3 times." > /etc/issue.net
-
Keep log files for 6 months
- Edit file /etc/logrotate.conf:
    # keep 4 weeks worth of backlogs
    rotate 26 #<- change this number from 4 to 26
    ...
    # no packages own wtmp and btmp -- we'll rotate them here
    /var/log/wtmp {
        monthly
        create 0664 root utmp
        minsize 1M
        rotate 6 #<- change this number from 1 to 6
    }

    systemctl restart rsyslog.service