Spock Installation: Login Node - calab-ntu/gpu-cluster GitHub Wiki
Install nodes (Ubuntu 22.04 server)
1. Check switch settings on MB
- VGA switch -> off
- IPMI switch -> left (default)
- PSU (PHANTEKS) hybrid -> press down
- Change the cooling fan header from CPU_OPT to CHA_FAN1
- Unplug the micro USB plug from the CPU pump.
2. Set up BIOS
- Boot up the machine with the BIOS flash disk plugged in.
  If the machine is booting for the first time, it will ask whether you want to initialize the CPU configuration. Press Y to confirm.
- Check the BIOS version and update
  - Enter the BIOS by pressing delete or F2 during boot.
  - Check the BIOS version: Main -> BIOS Information -> Version
    If Version = 1106 x64, skip the steps in "Update BIOS".
  - Update BIOS
    - Plug the USB disk labeled "BIOS" into the USB socket labeled "BIOS".
    - Keep pressing delete or F2 during boot to enter the BIOS.
    - Tool -> ASUS EZ Flash 3 Utility
    - Find the folder PRO_WS_WRX80E-SAGE_SE_WIFI-ASUS-1106
    - Find the file PRO-WS-WRX80E-SAGE-SE-WIFI-ASUS-1106.CAP -> Yes
    - Reboot with "save changes and exit" or press F10.
    - Check the version again.
- DRAM overclock setting
  - Ai Tweaker -> Ai overclock Tuner -> Choose D.O.C.P
  - D.O.C.P -> Choose D.O.C.P DDR4-3200 16-18-18-38-1.35V
  - F10 to reboot.
  - Check: Main -> Total Memory: 262144 MB -> Speed: 3200 MHz
- Enable NUMA
  - Advanced -> AMD CBS -> DF Common Option -> Memory Addressing
  - NUMA nodes per socket -> Choose NPS2
- F10 to reboot
3. Install Ubuntu Server 22.04
- Download Ubuntu 22.04 from https://www.ubuntu-tw.org/modules/tinyd0/ and make a bootable USB disk with Rufus.
- Set up the boot disk in BIOS
  - Boot with the bootable USB disk plugged in.
  - Enter the BIOS by pressing delete or F2 during boot.
  - Boot -> Choose USB to boot.
  - F10 to reboot.
- Install Ubuntu 22.04
  0. Choose "Try or install ubuntu server"
  - Select language: English -> Done
  - Keyboard configuration
    - Layout: English (US)
    - Variant: English (US) -> Done
  - Choose type of install
    - Ubuntu Server
  - Search for third party drivers -> Done
  - Network connections -> Continue without network
  - Configure proxy -> Done (leave the field empty)
  - Configure Ubuntu archive mirror -> Done (don't change the URL)
  - Guided storage configuration
    - Custom storage layout -> Done
    - Select the disks and reformat all of them.
    - Install in RAID1 (redundancy)
      - Select "Use As Boot Device" on both disks.
      - /boot
        - Choose free space and select "Add GPT partition"
          - Size: 1G
          - Format: Leave unformatted -> Create
        - Create software RAID:
          - RAID name: md0
          - RAID type: RAID 1
          - Format: ext4
          - Mount: /boot -> Create
      - /: Same steps as /boot with these changes
        - Size: 914G
      - swap: Choose the rest of the free space and format it as swap
  - Profile setup
    - Your name: spock** (** is the node number)
    - Your server's name: spock**
    - Pick a username: tmp_account
    - Choose a password: ************
    - Confirm your password: ************
  - Upgrade to Ubuntu Pro -> Skip Ubuntu Pro setup for now -> Done
  - SSH Setup -> Done (don't check the option)
  - Third-party drivers
    - Do not install third-party drivers now -> Done
  - Reboot Now -> Unplug the install medium and press enter to reboot.
- Check
  - Kernel: uname -r
      5.15.0-60-generic
  - CPU: lscpu | grep Model
      AMD Ryzen Threadripper PRO 5975WX 32-Cores
  - RAM: sudo dmidecode -t memory | grep Speed
      Configured Speed: 3200 MT/s
      Speed: 2667 MT/s
  - NUMA: lscpu | grep NUMA
      NUMA node(s) : 2
      NUMA node0 CPU(s) : 0-15, 32-47
      NUMA node1 CPU(s) : 16-31, 48-63
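The NUMA check above can also be scripted. The following is a minimal sketch, assuming `lscpu` prints a line of the form `NUMA node(s): 2`; the helper name `numa_node_count` is hypothetical and not part of the cluster scripts.

```bash
# Sketch: extract the NUMA node count from `lscpu` output read on stdin.
# numa_node_count is a hypothetical helper name used only for illustration.
numa_node_count() {
    # The matching line looks like "NUMA node(s):   2"; the last field is the count.
    awk '/^NUMA node\(s\)/ { print $NF }'
}
```

For example, `[ "$(lscpu | numa_node_count)" = "2" ] && echo "NPS2 is active"` would confirm the BIOS setting from step 2.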
4. Set up settings
- Network settings
  - Edit netplan: sudo vim /etc/netplan/00-installer-config.yaml
      # This is the network config written by 'subiquity'
      network:
        ethernets:
          enp*****0:
            dhcp4: true
          enp*****1:
            dhcp4: false
            addresses: [192.168.0.2**/22]   # ** is replaced by the node number.
            nameservers:
              addresses: [140.112.254.4]
            routes:
              - to: default
                via: 192.168.0.1
        version: 2
  - Apply netplan: sudo netplan apply
  - Power off the machine and move it to the machine room.
  - Plug the ethernet cable into the upper ethernet port.
  - Check
    - IP settings: ip addr show dev enp*****1
        inet 192.168.0.2**/22
    - DNS server: resolvectl status
        Link 3 (enp*****1)
        Current Scopes: DNS
        Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
        Current DNS Server: 140.112.254.4
        DNS Servers: 140.112.254.4
    - Connectivity: ping 192.168.0.150
  - Get the system network information.
      sudo -i
      scp [your_account]@192.168.0.150:/work1/shared/spock/etc/hosts /etc/hosts
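The address pattern `192.168.0.2**/22` used in the netplan file can be derived from the node number. This is a small sketch under that assumption only; `node_ip` is a hypothetical helper, and whether every node (including the login node) follows the pattern should be checked against /etc/hosts.

```bash
# Sketch: build the static address for a node from its two-digit number,
# following the 192.168.0.2**/22 pattern. node_ip is a hypothetical helper.
node_ip() {
    # Strip leading zeros so that "08" is not parsed as an invalid octal number.
    n=$(printf '%s' "$1" | sed 's/^0*//')
    printf '192.168.0.2%02d/22\n' "${n:-0}"
}
```

For example, `node_ip 03` prints the value to put in the `addresses` field for spock03.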
- Update system
  0. Operate with sudo privilege: sudo -i
  - apt update
  - apt-get install -y linux-image-5.15.0-78-generic
    Press enter twice when the kernel update UI appears.
  - reboot
  - sudo -i
  - Check: uname -r
      5.15.0-78-generic # or above
  - Change the group name of ID 1000: groupmod --new-name calab tmp_account
  - Set the root password: passwd
  - Delete /home/tmp_account: rm -r /home/tmp_account
  - Change the sh link from dash to bash: sudo dpkg-reconfigure dash
      # The configuration UI will ask whether to set /usr/bin/sh to dash.
      # Press "No" to set /usr/bin/sh to bash.
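The "5.15.0-78-generic or above" check can be automated with a version-aware sort. This is a minimal sketch that assumes GNU `sort -V` is available (it is on Ubuntu); `kernel_at_least` is a hypothetical helper name.

```bash
# Sketch: succeed if the current kernel release is at least the required one.
# kernel_at_least is a hypothetical helper; it relies on GNU `sort -V`.
kernel_at_least() {
    # $1 = required release, $2 = current release (e.g. from `uname -r`).
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    # The requirement is met when the required release sorts first (or is equal).
    [ "$lowest" = "$1" ]
}
```

A possible use after the reboot is `kernel_at_least 5.15.0-78-generic "$(uname -r)" || echo "kernel too old"`.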
- Time stamp of command history
  - su
  - Add export HISTTIMEFORMAT='%d/%m/%y %T ' to the end of the file /etc/profile
  - source /etc/profile
  - Check with history
- Set timezone
  - su
  - timedatectl set-timezone Asia/Taipei
  - Check: timedatectl show
- NFS settings
  - Client
    - sudo -i
    - Install the NFS client.
        apt -y install nfs-common
    - Get the auto-mount settings from work1.
        ssh [your_account]@eureka00 cat /work1/shared/spock/etc/fstab >> /etc/fstab
      [Login node only] Comment out the line starting with spock00:/software
    - Create directories.
        mkdir /software /work1 /projectV /projectW /projectX /projectY /projectZ
    - Check the accessibility of the target NFS servers.
        showmount -e spock00   # /software 192.168.0.0/24 [Skip on login node]
        showmount -e tumaz     # /home 192.168.0.0/24
        showmount -e ironman   # /volume1/gpucluster1 192.168.0.0/24
                               # /volume3/gpucluster3 192.168.0.0/24
        showmount -e eater     # /volume1/gpucluster3 192.168.0.0/24
                               # /volume2/gpucluster4 192.168.0.0/24
                               # /volume3/gpucluster6 192.168.0.0/24
        showmount -e pacific   # /volume1/gpucluster1 192.168.0.0/24
    - Mount all remote directories.
        mount /software;   # Skip on the login node
        mount /home;
        mount /work1;
        mount /projectW;
        mount /projectX;
        mount /projectY;
        mount /projectZ;
        mount /projectV
    - Check: df -h
        tumaz:/home                   208G   22G  176G  12% /home
        ironman:/volume1/gpucluster1   70T   47T   24T  67% /work1
        ironman:/volume3/gpucluster3   70T   70T  643G 100% /projectX
        eater:/volume1/gpucluster3     70T   67T  3.6T  95% /projectY
        eater:/volume2/gpucluster4     88T   77T   12T  88% /projectZ
        eater:/volume3/gpucluster6     88T   75T   13T  86% /projectW
        pacific:/volume1/gpucluster1  140T   20T  120T  15% /projectV
  - Server [Login node only]
    - Install the NFS server: sudo apt -y install nfs-kernel-server
    - Check the directory to be exported and create it if it does not exist:
        ll /software
        mkdir /software   # only if /software does not exist
    - Copy the NFS export settings to /etc/exports:
        cp /work1/shared/spock/etc/exports /etc/exports
    - Start and enable the NFS server:
        systemctl restart nfs-kernel-server.service
        systemctl enable nfs-kernel-server.service
    - Check the NFS server status and result.
        systemctl status nfs-kernel-server.service   # Active: active (exited)
        showmount -e spock00                         # /software 192.168.0.0/24
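The `df -h` comparison above can be checked mechanically. The sketch below reads `df` output on stdin and prints every expected mount point that is absent; `missing_mounts` is a hypothetical helper, and it assumes the mount point is the last column of each `df` line.

```bash
# Sketch: print each expected mount point that does not appear in `df` output.
# missing_mounts is a hypothetical helper; `df` output is read from stdin and
# the expected mount points are passed as arguments.
missing_mounts() (
    df_output=$(cat)
    for mp in "$@"; do
        # awk exits 0 only if some line has $mp as its last column.
        printf '%s\n' "$df_output" |
            awk -v m="$mp" '$NF == m { found = 1 } END { exit !found }' ||
            echo "$mp"
    done
)
```

For example, `df -h | missing_mounts /home /work1 /projectV /projectW /projectX /projectY /projectZ` prints nothing when all client mounts are present.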
- NIS settings
  - Install the NIS client.
      sudo apt -y install nis
  - Configure as a NIS client.
    - vim /etc/yp.conf and add the following text at the end.
        domain tumaz.gpucluster.calab server tumaz
    - vim /etc/nsswitch.conf
        passwd: files systemd nis
        group: files systemd nis
        shadow: files nis
        hosts: files dns nis
  - Set the NIS domain name: vim /etc/defaultdomain
      tumaz.gpucluster.calab
  - Start and enable NIS.
      systemctl restart ypbind
      systemctl enable ypbind
  - Check:
    - ll /home
    - yptest: 1 test fail
    - ypwhich: tumaz
  - Log out and log in with your own account
    - su
    - Delete tmp_account: userdel --remove tmp_account
      It's okay to receive these error messages:
        userdel: tmp_account mail spool (/var/mail/tmp_account) not found
        userdel: tmp_account home directory (/home/tmp_account) not found
- Install GPU driver
  - Set text mode as the default (the NVIDIA driver cannot be installed while the X window system is running)
      systemctl set-default multi-user.target
  - Reboot.
  - su
  - Install dkms: apt -y install dkms
  - Disable nouveau: create the file /etc/modprobe.d/blacklist-nouveau.conf with content:
      blacklist nouveau
      options nouveau modeset=0
  - Apply system changes
      update-initramfs -u
  - Reboot.
  - su
  - Check that nouveau is disabled: lsmod | grep nouveau
    This should print nothing.
  - Install the NVIDIA driver
    - Install:
        su
        sh /work1/shared/spock/package/cuda/cuda_12.1.0_530.30.02_linux.run --silent --driver
    - Validate with cat /proc/driver/nvidia/version:
        NVRM version: NVIDIA UNIX x86_64 Kernel Module 530.30.02
        GCC version: gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
    - Copy the default profile files.
        cp /work1/shared/spock/init_script/*.sh /etc/profile.d/
        cp /work1/shared/spock/init_script/*.csh /etc/profile.d/
        cp /work1/shared/spock/etc/rc.local /etc/
        chmod +x /etc/rc.local
    - Change GPU settings [login node only]
      Edit /etc/rc.local as follows
      - Comment out the line /usr/bin/nvidia-persistenced --verbose
      - Comment out the line nvidia-cuda-mps-control -d
      - Replace nvidia-smi -i 0 -c EXCLUSIVE_PROCESS by nvidia-smi -i 0 -c PROHIBITED
    - Reboot
  - Check: nvidia-smi
      NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1
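The three login-node edits to /etc/rc.local could be expressed as a `sed` filter. This is only a sketch: it assumes the three lines appear in rc.local exactly as quoted above (possibly indented), and `patch_rc_local_for_login_node` is a hypothetical helper name. Review the result before overwriting the real file.

```bash
# Sketch: apply the login-node changes to rc.local content read on stdin.
# patch_rc_local_for_login_node is a hypothetical helper; it assumes the
# target lines match the ones quoted in the wiki.
patch_rc_local_for_login_node() {
    sed -e '/^[[:space:]]*\/usr\/bin\/nvidia-persistenced --verbose/ s/^/#/' \
        -e '/^[[:space:]]*nvidia-cuda-mps-control -d/ s/^/#/' \
        -e 's/-c EXCLUSIVE_PROCESS/-c PROHIBITED/'
}
```

One cautious way to use it is `patch_rc_local_for_login_node < /etc/rc.local > /tmp/rc.local.new`, then compare the two files with `diff` before copying the new one into place.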
- NTP client
  0. su
  - apt -y install ntp ntpdate
  - Edit /etc/ntp.conf
    - Add pool time.google.com iburst
    - Comment out the other pool servers.
  - systemctl restart ntp
  - systemctl status ntp
  - systemctl enable ntp
- TORQUE
  - Install the required packages
      apt -y install libnuma-dev
      apt -y install tcl-dev tk-dev
      apt -y install libntirpc-dev
      sh /work1/shared/spock/package/torque/src/torque-3.0.6/spock_library_set.sh
  - Compile and install from source code.
      cd /work1/shared/spock/package/torque/src/torque-3.0.6
      # WARNING: do NOT run "spock_Install.sh" in parallel (i.e., install one node at a time)
      # [Login node ] uncomment "--enable-server"
      # [Computing nodes] comment "--enable-server"
      sh spock_Install.sh >& log.spockXX
      cd ../../etc
      cp pbs_spock /etc/init.d/pbs
      ln -s /etc/init.d/pbs /etc/systemd/system/
      cp pbs.conf /etc/
      # [Login node only]: edit "pbs.conf" to set "start_server=1" and "start_mom=0"
      cp nodes_spock /var/spool/TORQUE/server_priv/nodes
      systemctl enable pbs
      source /etc/profile.d/torque.sh
      cd ../src/torque-3.0.6/
      ./torque.setup root
      killall pbs_server
      systemctl start pbs
      # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
      systemctl status pbs
  - Check cat /var/spool/TORQUE/pbs_environment:
      LANG=en_US.utf-8
  - Set up overcommit-ratio and disable overcommit-memory in crontab
    - cp /work1/shared/spock/helper_script/disable_memory_overcommit.sh /root/
    - Edit crontab with crontab -e and add a new line:
        @reboot /usr/bin/sh /root/disable_memory_overcommit.sh 1> /tmp/disable_memory_overcommit.log 2>&1
- [Optional] [Login node only] Create the SSH key of root [Testing]
    ssh-keygen -t rsa
    cd ~/.ssh
    cp id_rsa.pub authorized_keys
    cp id_rsa* authorized_keys /work1/shared/spock/ssh_root/
- InfiniBand
  ref. https://docs.nvidia.com/networking/display/MLNXOSv3105002/Getting+Started#heading-RerunningtheWizard
  - Check hardware
      lspci | grep Mellanox
      01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
  - Install the necessary packages
      apt -y install libsasl2-dev libldap2-dev libssl-dev
  - Install driver
      su
      cd /work1/shared/spock/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.9-0.5.6.0-ubuntu22.04-x86_64
      ./mlnxofedinstall
      Device #1:
      ----------
      Device Type: ConnectX6
      Part Number: MCX653105A-HDA_Ax
      Description: ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
      PSID: MT_0000000223
      PCI Device Name: 01:00.0
      Base GUID: 0c42a10300ef2a1a
      Versions:   Current      Available
        FW        20.34.1002   20.36.1010
        PXE       3.6.0700     3.6.0901
        UEFI      14.27.0014   14.29.0014
      Status: Up to date
      ---------
      /etc/init.d/openibd restart
      reboot
  - Check
    0. su
    - ibstatus
        Infiniband device 'mlx5_0' port 1 status:
          default gid: fe80:0000:0000:0000:0c42:a103:00ef:2a1a
          base lid: 0xffff
          sm lid: 0x0
          state: 4: ACTIVE
          phys state: 5: LinkUp
          rate: 200 Gb/sec (4X HDR)
          link_layer: InfiniBand
    - cat /etc/security/limits.conf
        * soft memlock unlimited
        * hard memlock unlimited
    - systemctl status openibd
        Active: active (exited)
    - systemctl is-enabled openibd
        enabled
    - systemctl status opensmd
        Active: inactive (dead)
    - systemctl is-enabled opensmd
        disabled
    - hca_self_test.ofed
        ---- Performing Adapter Device Self Test ----
        Number of CAs Detected ................. 1
        PCI Device Check ....................... PASS
        Kernel Arch ............................ x86_64
        Host Driver Version .................... MLNX_OFED_LINUX-5.9-0.5.6.0 (OFED-5.9-0.5.6): 5.15.0-69-generic
        Host Driver RPM Check .................. PASS
        Firmware on CA #0 HCA .................. v20.36.1010
        Host Driver Initialization ............. PASS
        Number of CA Ports Active .............. 1
        Port State of Port #1 on CA #0 (HCA)..... UP 4X HDR (InfiniBand)
        Error Counter Check on CA #0 (HCA)...... PASS
        Kernel Syslog Check .................... PASS
        Node GUID on CA #0 (HCA) ............... 0c:42:a1:03:00:ef:2a:1a
        ------------------ DONE ---------------------
    - ibdev2netdev -v | grep -i MCX
        0000:01:00.0 mlx5_0 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56 fw 20.36.1010 port 1 (ACTIVE) ==> ibp1s0 (Down)
  - IB connection and bandwidth test.
    - Computing nodes -> Login node
      On spock00:
        ib_write_bw -aF
      On spockXX:
        ib_write_bw -aF spock00
      Output:
        ************************************
        * Waiting for client to connect... *
        ************************************
        ---------------------------------------------------------------------------------------
        RDMA_Write BW Test
        Dual-port : OFF          Device : mlx5_0
        Number of qps : 1        Transport type : IB
        Connection type : RC     Using SRQ : OFF
        PCIe relax order: ON
        ibv_wr* API : ON
        CQ Moderation : 100
        Mtu : 4096[B]
        Link type : IB
        Max inline data : 0[B]
        rdma_cm QPs : OFF
        Data ex. method : Ethernet
        ---------------------------------------------------------------------------------------
        local address: LID 0x02 QPN 0x0027 PSN 0xcb8c4 RKey 0x1fffbe VAddr 0x007f9c96aaa000
        remote address: LID 0x01 QPN 0x0027 PSN 0x560b74 RKey 0x1fffbe VAddr 0x007f0894517000
        ---------------------------------------------------------------------------------------
        #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
        8388608 5000 23452.55 23452.55 0.002932
        ---------------------------------------------------------------------------------------
    - Computing nodes <- Login node
      On spock00:
        ib_read_bw -aF
      On spockXX:
        ib_read_bw -aF spock00
      Output:
        ************************************
        * Waiting for client to connect... *
        ************************************
        ---------------------------------------------------------------------------------------
        RDMA_Read BW Test
        Dual-port : OFF          Device : mlx5_0
        Number of qps : 1        Transport type : IB
        Connection type : RC     Using SRQ : OFF
        PCIe relax order: ON
        ibv_wr* API : ON
        CQ Moderation : 100
        Mtu : 4096[B]
        Link type : IB
        Outstand reads : 16
        rdma_cm QPs : OFF
        Data ex. method : Ethernet
        ---------------------------------------------------------------------------------------
        local address: LID 0x02 QPN 0x0028 PSN 0x593c01 OUT 0x10 RKey 0x1fffbf VAddr 0x007efc3f67f000
        remote address: LID 0x01 QPN 0x0028 PSN 0xbaa0aa OUT 0x10 RKey 0x1fffbf VAddr 0x007f6fd2a85000
        ---------------------------------------------------------------------------------------
        #bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
        8388608 1000 23517.75 23517.73 0.002940
        ---------------------------------------------------------------------------------------
  - Start mst to enable monitoring of the IB adapter
      systemctl enable mst
      systemctl start mst
      mst status
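When the bandwidth test is repeated on many nodes, the average bandwidth can be pulled out of the saved output. This is a minimal sketch, assuming the result rows look like the ones shown above (five numeric columns, average bandwidth in the fourth); `ib_bw_average` is a hypothetical helper name.

```bash
# Sketch: print the "BW average[MB/sec]" value of the last result row in
# ib_write_bw / ib_read_bw output read on stdin.
# ib_bw_average is a hypothetical helper name.
ib_bw_average() {
    # Result rows have exactly five columns and start with the message size.
    awk '$1 ~ /^[0-9]+$/ && NF == 5 { bw = $4 } END { print bw }'
}
```

For example, `ib_write_bw -aF spock00 | ib_bw_average` on a computing node prints the average for the largest message size, which can then be compared between nodes.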
- ssh without password for root
    cd /work1/shared/spock/ssh_root/
    cp authorized_keys id_rsa* /root/.ssh/
    # Verification
    ssh spock00 # "yes" to "continue connecting"
    ssh spockXX # "yes" to "continue connecting"
    exit
    exit
- RAID1 monitor
  Edit /etc/mdadm/mdadm.conf as follows:
    #MAILADDR root # Comment out this line
    MAILADDR [NTU account]@ntu.edu.tw # Add this line
5. Install compilers [Login node only]
- Intel compiler
  - [Login node] Install
      su
      mkdir /software/intel
      ln -s /software/intel /opt
      cd /work1/shared/spock/package/intel
      sh l_BaseKit_p_2023.1.0.46401.sh -a --cli
      # Follow and accept the installation process.
      sh l_HPCKit_p_2023.1.0.46346.sh -a --cli
      # Follow and accept the installation process.
  - [Computing nodes] Link
      su
      cd /opt
      ln -s /software/intel
- gcc compiler [skip]
  - [Login node] Install latest version
      su
      mkdir /software/gcc
      cd /work1/shared/spock/package/gcc/gcc-12.2.0
      sh ./spock_Install.sh >& log.spock
      cd /software/gcc
      ln -s /work1/shared/spock/package/gcc ./src
      ln -s 12.2.0 default
6. Install packages
- [Login node only] CUDA
    cd /work1/shared/spock/package/cuda
    mkdir /software/cuda
    sh cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=/software/cuda/12.1
  - Create the default link:
      ln -s /software/cuda/12.1 /software/cuda/default
- [Login node only] Valgrind
    mkdir /software/valgrind
    cd /work1/shared/eureka/package/valgrind/valgrind-3.15.0
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
  After installation
    cd /software/valgrind
    ln -s /work1/shared/spock/package/valgrind src
    ln -s 3.15.0 default
- [Login node only] UCX Library
  - Download latest version [optional]
      mkdir /software/openucx
      mkdir /software/openucx/src
      cd /software/openucx/src
      git clone https://github.com/openucx/ucx.git ucx
  - Installation
      cd /software/openucx/src/ucx
      ./autogen.sh
      mkdir build
      cd build
      ../contrib/configure-release --prefix=/software/openucx/ucx-1.15.0_with_mt --enable-mt # enable MPI_THREAD_MULTIPLE
      make && make install
- [Login node only] OpenMPI
    source /etc/profile.d/intel.sh
    mkdir /software/openmpi
    ln -s /work1/shared/spock/package/openmpi /software/openmpi/src
    cd /software/openmpi/src/openmpi-4.1.5
    # [Optional] Edit spock_Install_with_UCX.sh (remember to un-comment the configuration flags)
    sh spock_Install_with_ucx.sh >& log.spock-all
  After installation
  - Check ucx
      cd /software/openmpi/4.1.5-ucx_mt-intel-2023.1.0/bin
      objdump -p mpicxx | grep PATH # see whether /software/openucx/ucx-1.15.0_with_mt/lib is in RPATH
      ldd mpicxx | grep ucx         # see whether the dynamic linker can find the UCX libraries
  - Check debugger
      source /etc/profile.d/openmpi.sh
      ompi_info | grep "MCA memchecker"
      # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
  - Set the MCA parameters
    - Edit the configuration file /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to query the path of the configuration file, one can use: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
        pml=ucx
        osc=ucx
        btl=^openib
      These lines include only ucx for pml, include only ucx for osc, and exclude openib from btl. This works for OpenMPI 4.1.1 and UCX 1.12.0 without giving the warning message:
        [eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
- [Login node only] Maui [testing] [Problematic on sed and gcc version]
  Install sed 4.2.2
    cd /work1/shared/spock/package/sed/sed-4.2.2
    sh spock_Install.sh
  Install maui
    cd /work1/shared/spock/package/maui/maui-3.3.1/
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
    cd etc/
    cp spock_maui.d /etc/init.d/maui.d
    cp maui.sh maui.csh /etc/profile.d/
    systemctl enable maui.d
    cd /usr/local/maui
  Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)
    RMPOLLINTERVAL 00:00:15
    #BACKFILLPOLICY FIRSTFIT
    #RESERVATIONPOLICY CURRENTHIGHEST
    #NODEALLOCATIONPOLICY MINRESOURCE
    # <==== Add by Nelson ====>
    JOBAGGREGATIONTIME 00:00:04
    # Backfill
    BACKFILLPOLICY FIRSTFIT
    RESERVATIONPOLICY NEVER
    # Node Allocation
    NODEALLOCATIONPOLICY FIRSTAVAILABLE
    # Set Job Flags
    JOBACTIONONNODEFAILURE CANCEL
    JOBNODEMATCHPOLICY EXACTNODE
  Start maui
    systemctl start maui.d
    source /etc/profile.d/maui.sh
- [Login node only] FFTW
  - FFTW-2
      mkdir /software/fftw
      cd /work1/shared/spock/package/fftw/fftw-2.1.5-revised
      # [Optional] Edit Fish_Install.sh to install in intel or gcc
      sh Fish_Install.sh >& log.spock-intel
    After installation
      cd /software/fftw
      ln -s /work1/shared/spock/package/fftw src
  - FFTW-3
      cd /work1/shared/eureka/package/fftw/fftw-3.3.10
      # [Optional] Edit spock_Install.sh
      sh spock_Install.sh >& log.spock-intel
    After installation
      cd /software/fftw
- [Login node only] HDF5
    mkdir -p /software/hdf5
    cd /work1/shared/spock/package/hdf5/hdf5-1.10.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
  After installation
    cd /software/hdf5
    ln -s /work1/shared/spock/package/hdf5 src
- [Login node only] GSL
    mkdir -p /software/gsl
    cd /work1/shared/spock/package/gsl/gsl-2.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
  After installation
    cd /software/gsl
    ln -s /work1/shared/spock/package/gsl src
- python2
    source /etc/profile.d/openmpi.sh; source /etc/profile.d/intel.sh; source /etc/profile.d/hdf5.sh
    apt -y install python2 python2-dev
    apt -y install python-tk
    cd /work1/shared/spock/package/python2
    python2 get-pip.py
    sh install-python-packages.sh
- python3
    apt -y install python3 python3-dev
    apt -y install python3-tk
    apt -y install python3-pip
    cd /work1/shared/spock/package/python3
    sh install-python-packages.sh
  Add /usr/local/bin to PATH by adding a line at the end of /etc/profile
    export PATH=/usr/local/bin:$PATH
- Module
    cd /work1/shared/spock/package/module/modules-5.1.1
    make clean
    ./configure
    make
    make install
  After installation
    cp init/profile.sh /etc/profile.d/10-modules.sh
    cp init/profile.csh /etc/profile.d/modules.csh
    source init/bash
  Add /software/intel/oneapi/modulefiles to the default module directories by adding this line to the file /usr/local/Modules/etc/initrc
    module use /software/intel/oneapi/modulefiles
  Set up the preload module
    ln -s /software/modulefiles/default_modules.sh /etc/profile.d/default_modules.sh
7. Miscellaneous setup
- IPMI tool
  - Install the IPMI driver and tool: apt -y install openipmi ipmitool
  - Check: ipmitool sensor get "CPU Temp."
- ffmpeg
    apt -y install ffmpeg
- gnuplot
    apt -y install gnuplot-x11
- screen
    apt -y install screen
- pdsh
    apt -y install pdsh
- locate
    apt -y install plocate
- ClamAV
    apt -y install clamav clamav-daemon
    systemctl stop clamav-freshclam
    freshclam
    systemctl start clamav-freshclam
    systemctl enable clamav-freshclam
- X11 server
    apt -y install xorg openbox
- CPU usage monitor
    apt -y install sysstat
- Image display: feh
    apt -y install feh
- Disable auto update.
  - Edit the apt config file at /etc/apt/apt.conf.d/20auto-upgrades as follows.
      APT::Periodic::Update-Package-Lists "0";
      APT::Periodic::Unattended-Upgrade "0";
  - Check the applied config
      apt-config dump APT::Periodic::Update-Package-Lists
      apt-config dump APT::Periodic::Unattended-Upgrade
- Set up login limitations
  - Edit /etc/security/faillock.conf:
      dir = /var/run/faillock
      audit
      silent
      deny = 3
      fail_interval = 900
      unlock_time = 300
  - Edit /etc/pam.d/common-auth as follows:
      auth required pam_faillock.so preauth
      auth [success=1 default=ignore] pam_unix.so nullok
      auth [default=die] pam_faillock.so authfail
      auth sufficient pam_faillock.so authsucc
      auth requisite pam_deny.so
      auth required pam_permit.so
      auth optional pam_cap.so
  - Edit /etc/pam.d/common-account as follows:
      account [success=1 new_authtok_reqd=done default=ignore] pam_unix.so
      account requisite pam_deny.so
      account required pam_permit.so
      account required pam_faillock.so
- Fail2ban
  - apt install fail2ban
  - Create the file /etc/fail2ban/jail.d/sshd.local with the following content:
      [sshd]
      enabled = true
      bantime = 10m
      bantime.increment = true
      bantime.maxtime = 1d
      findtime = 10m
      maxretry = 5
      destemail = [Admin username]
      action = %(action_mwl)s
  - systemctl enable fail2ban.service
  - systemctl start fail2ban.service
  - Give users a warning message at the login interface
    - Add the line Banner /etc/issue.net to the file /etc/ssh/sshd_config
    - Back up the original file /etc/issue.net and run
        echo Info: Noted that your account will be blocked for 5 minutes if you fail password for 3 times. > /etc/issue.net
- Keep log files for 6 months
  - Edit the file /etc/logrotate.conf:
      # keep 4 weeks worth of backlogs
      rotate 26 #<- change this number from 4 to 26
  - Edit the file /etc/logrotate.d/wtmp:
      # no packages own wtmp -- we'll rotate it here
      /var/log/wtmp {
          missingok
          monthly
          create 0664 root utmp
          minsize 1M
          rotate 6 #<- change the number from 1 to 6
      }
- Login node policy setting
  Add the following lines at the end of the file /etc/security/limits.conf
    @calab   soft cpu     1
    @calab   hard cpu     1.2
    @calab   soft memlock 41943040
    @calab   hard memlock 41943040
    @calab   soft data    41943040
    @calab   hard data    41943040
    @partner soft cpu     1
    @partner hard cpu     1.2
    @partner soft memlock 41943040
    @partner hard memlock 41943040
    @partner soft data    41943040
    @partner hard data    41943040
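To read the limits above: in limits.conf the `memlock` and `data` items are given in KB, and `cpu` is given in minutes. The sketch below converts a KB value to GiB for a quick sanity check; `kb_to_gib` is a hypothetical helper name.

```bash
# Sketch: convert a limits.conf value in KB to whole GiB.
# kb_to_gib is a hypothetical helper used only for illustration.
kb_to_gib() {
    # 1 GiB = 1024 * 1024 KB; shell arithmetic truncates to an integer.
    echo $(( $1 / 1024 / 1024 ))
}
```

With this, `kb_to_gib 41943040` shows that the memlock and data limits for the calab and partner groups correspond to 40 GiB.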
8. Check
- CPU burn-in test
  - Install the CPU test program
      apt -y install stress-ng
  - Run the CPU test
      stress-ng --cpu 0 --timeout 30m &
  - Record the CPU temperature every minute during the test
      for i in {1..40}; do ipmitool sensor | grep "CPU Temp."; sleep 1m; done
    AMD Threadripper allows temperatures up to 95 degrees, and the non-critical upper limit is 85 degrees. For spock02, the highest temperature is 82 degrees.
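If the output of the temperature loop is saved to a file, the peak value can be extracted afterwards. This is a minimal sketch that assumes each logged line is a pipe-separated `ipmitool sensor` row whose second column holds the reading (for example `CPU Temp. | 82.000 | degrees C | ok`); `max_cpu_temp` is a hypothetical helper name.

```bash
# Sketch: print the highest "CPU Temp." reading found on stdin.
# max_cpu_temp is a hypothetical helper; it assumes pipe-separated
# `ipmitool sensor` rows with the reading in the second column.
max_cpu_temp() {
    awk -F'|' '/CPU Temp\./ { t = $2 + 0; if (t > max) max = t }
               END { printf "%.1f\n", max }'
}
```

For example, redirect the loop output to a file such as `cpu_temp.log` (a name chosen here for illustration) and run `max_cpu_temp < cpu_temp.log`, then compare the result with the 85 degree non-critical limit.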
- GPU burn-in test
    cd /work1/shared/spock/tests/gpu_burn-in/gpu-burn
    ./gpu_burn 1800 # run for 30 minutes
  During the test, watch the GPU temperature shown on screen. For the RTX 3080 Ti, the highest allowed temperature is 93 degrees Celsius, and the non-critical upper limit is 90 degrees. For spock02, the highest temperature is 81 degrees.
- MPI suite test [Run as regular user]
  - Download @spock00
      git clone https://github.com/open-mpi/mpi-test-suite.git
  - Compile @spock00
      cd mpi-test-suite
      ./autogen.sh
      ./configure CC=mpicc
      make
  - Run tests
      cp /work1/shared/tests/mpi_test_suite/run_test.sh ./
      qsub -I -lnodes=spockXX:ppn=32
      cd {directory of mpi_test_suite}
      sh run_test.sh >& spockXX.log
  - Check the test result
      tail spockXX.log
      # Number of failed tests: 0
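When several nodes are tested, the final check can be scripted instead of reading each log by eye. This is a minimal sketch, assuming the log contains a line of the form `Number of failed tests: N` as shown above; `failed_test_count` is a hypothetical helper name.

```bash
# Sketch: print the number reported on the "Number of failed tests" line
# of an mpi-test-suite log read on stdin.
# failed_test_count is a hypothetical helper name.
failed_test_count() {
    # Split on the colon and any following spaces; keep the last match.
    awk -F': *' '/Number of failed tests/ { n = $2 } END { print n }'
}
```

For example, `[ "$(failed_test_count < spockXX.log)" = "0" ] && echo "spockXX passed"` reports a node only when no test failed.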