Spock Installation: Login Node - calab-ntu/gpu-cluster GitHub Wiki
Install nodes (Ubuntu 22.04 server)
1. Check switch settings on the motherboard (MB)
- VGA switch -> off
- IPMI switch -> left (default)
- PSU (Phanteks) hybrid -> press down
- Change the cooling fan header from CPU_OPT to CHA_FAN1
- Unplug the micro USB plug from the CPU pump.
2. Set up BIOS
- Boot up the machine with the BIOS flash disk plugged in.
  If the machine is booting for the first time, it asks whether to initialize the CPU configuration. Press Y to confirm.
- Check the BIOS version and update
  - Enter the BIOS by pressing delete or F2 during boot.
  - Check the BIOS version: Main -> BIOS Information -> Version
    If Version = 1106 x64, skip the "Update BIOS" steps.
  - Update BIOS
    - Plug the USB disk labeled "BIOS" into the USB socket labeled "BIOS".
    - Keep pressing delete or F2 during boot to enter the BIOS.
    - Tool -> ASUS EZ Flash 3 Utility
    - Find the folder PRO_WS_WRX80E-SAGE_SE_WIFI-ASUS-1106
    - Find the file PRO-WS-WRX80E-SAGE-SE-WIFI-ASUS-1106.CAP -> Yes
    - Reboot with "save changes and exit" or press F10.
    - Check the BIOS version again.
- DRAM overclock setting
  - Ai Tweaker -> Ai overclock Tuner -> Choose D.O.C.P
  - D.O.C.P -> Choose D.O.C.P DDR4-3200 16-18-18-38-1.35V
  - Press F10 to reboot.
  - Check: Main -> Total Memory : 262144 MB -> Speed : 3200 MHz
- Enable NUMA
  - Advanced -> AMD CBS -> DF Common Option -> Memory Addressing
  - NUMA nodes per socket -> Choose NPS2
- Press F10 to reboot.
3. Install Ubuntu Server 22.04
- Download Ubuntu 22.04 from https://www.ubuntu-tw.org/modules/tinyd0/
  Make a bootable USB disk with Rufus.
- Set up the boot disk in BIOS
  - Boot with the bootable USB disk plugged in.
  - Enter the BIOS by pressing delete or F2 during boot.
  - Boot > Choose USB to boot.
  - Press F10 to reboot.
- Install Ubuntu 22.04
  - Choose "Try or install ubuntu server"
  - Select language: English -> Done
  - Keyboard configuration
    - Layout: English (US)
    - Variant: English (US) -> Done
  - Choose type of install
    - Ubuntu Server
    - Search for third party drivers -> Done
  - Network connections -> Continue without network
  - Configure proxy -> Done
    Leave the field empty.
  - Configure Ubuntu archive mirror -> Done
    Do not change the URL.
  - Guided storage configuration: Custom storage layout -> Done
    - Select the disks and reformat all of them.
    - Install in RAID1 (redundancy)
      - Select "Use As Boot Device" on both disks
      - /boot
        - Choose free space and select "Add GPT partition"
          - Size: 1G
          - Format: Leave unformatted -> Create
        - Create software RAID:
          - RAID name: md0
          - RAID type: RAID 1
          - Format: ext4
          - Mount: /boot -> Create
      - /: Same steps as /boot, with the following change
        - Size: 914G
      - swap: Choose the rest of the free space and format it as swap
  - Profile setup
    - Your name: spock**
      ** is the node number
    - Your server's name: spock**
    - Pick a username: tmp_account
    - Choose a password: ************
    - Confirm your password: ************
  - Upgrade to Ubuntu Pro -> Skip Ubuntu Pro setup for now -> Done
  - SSH Setup -> Done
    Do not check the option.
  - Third-party drivers
    - Do not install third-party drivers now -> Done
  - Reboot Now -> Unplug the installation medium and press enter to reboot.
- Check
  - Kernel:
    uname -r
    -> 5.15.0-60-generic
  - CPU:
    lscpu | grep Model
    -> AMD Ryzen Threadripper PRO 5975WX 32-Cores
  - RAM:
    sudo dmidecode --type memory | grep Speed
    Configured Speed: 3200 MT/s
    Speed: 2667 MT/s
  - NUMA:
    lscpu | grep NUMA
    NUMA node(s) : 2
    NUMA node0 CPU(s) : 0-15, 32-47
    NUMA node1 CPU(s) : 16-31, 48-63
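The NUMA check above can also be scripted. The following is a minimal sketch, assuming that lscpu prints a line starting with "NUMA node(s)" as in the output shown; the function name numa_node_count is hypothetical and not part of the cluster scripts.

```shell
# numa_node_count: read `lscpu` output on stdin and print the NUMA node count.
# Hypothetical helper; assumes a line that starts with "NUMA node(s)".
numa_node_count() {
    awk -F: '/^NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage on a node (2 is expected after choosing NPS2 in the BIOS):
#   [ "$(lscpu | numa_node_count)" = "2" ] && echo "NUMA OK"
```

Reading from stdin keeps the helper independent of the machine it runs on, so it can be tried on saved lscpu output first.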
4. System settings
- Network settings
  - Edit netplan:
    sudo vim /etc/netplan/00-installer-config.yaml
      # This is the network config written by 'subiquity'
      network:
        ethernets:
          enp*****0:
            dhcp4: true
          enp*****1:
            dhcp4: false
            addresses: [192.168.0.2**/22]  # ** is replaced by the node number
            nameservers:
              addresses: [140.112.254.4]
            routes:
              - to: default
                via: 192.168.0.1
        version: 2
  - Apply netplan:
    sudo netplan apply
  - Power off the machine and move it to the machine room.
  - Plug the ethernet cable into the upper ethernet port.
  - Check
    - IP settings:
      ip addr show dev enp*****1
      inet 192.168.0.2**/22
    - DNS server:
      resolvectl status
      Link 3 (enp*****1)
      Current Scopes: DNS
      Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
      Current DNS Server: 140.112.254.4
      DNS Servers: 140.112.254.4
    - ping 192.168.0.150
  - Get the system network information.
    sudo -i
    scp [your_account]@192.168.0.150:/work1/shared/spock/etc/hosts /etc/hosts
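Since only the node number and the interface names differ between nodes, the netplan file can be generated instead of edited by hand. This is a sketch under stated assumptions: the function name render_netplan is hypothetical, the node number is assumed to be two digits (00 to 55, so that 2** stays a valid last octet), and the interface names are passed in because they are masked in this page.

```shell
# render_netplan NODE DHCP_IF STATIC_IF
# Print a netplan config with the static address 192.168.0.2NODE/22.
# Hypothetical helper; NODE must be two digits between 00 and 55.
render_netplan() (
    node=$1; dhcp_if=$2; static_if=$3
    case $node in
        [0-9][0-9]) ;;
        *) echo "node number must be two digits, got: $node" >&2; exit 1 ;;
    esac
    if [ "$node" -gt 55 ]; then
        echo "192.168.0.2$node is not a valid address" >&2
        exit 1
    fi
    cat <<EOF
# This is the network config written by 'subiquity'
network:
  ethernets:
    ${dhcp_if}:
      dhcp4: true
    ${static_if}:
      dhcp4: false
      addresses: [192.168.0.2${node}/22]
      nameservers:
        addresses: [140.112.254.4]
      routes:
        - to: default
          via: 192.168.0.1
  version: 2
EOF
)

# Usage (interface names are placeholders):
#   render_netplan 02 IFACE0 IFACE1 > /etc/netplan/00-installer-config.yaml
```

The function body runs in a subshell, so the exit on invalid input only ends the function and not the calling shell.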
- Update system
  - Operate with root privileges:
    sudo -i
  - apt update
  - apt-get install -y linux-image-5.15.0-78-generic
    Press enter twice when the kernel update UI appears.
  - reboot
  - sudo -i
  - Check:
    uname -r
    5.15.0-78-generic # or above
  - Change the group name of GID 1000:
    groupmod --new-name calab tmp_account
  - Set the root password:
    passwd
  - Delete /home/tmp_account:
    rm -r /home/tmp_account
  - Change the sh link from dash to bash:
    sudo dpkg-reconfigure dash
    # The configuration UI asks whether to set /usr/bin/sh to dash
    # Choose "No" to set /usr/bin/sh to bash
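The "5.15.0-78-generic or above" kernel check can be automated with a version comparison. The sketch below assumes GNU sort with the -V (version sort) option, which Ubuntu 22.04 provides; the function name version_ge is hypothetical.

```shell
# version_ge A B: succeed when version string A is the same as or newer than B.
# Hypothetical helper; relies on the version sort of GNU sort (-V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage after the reboot:
#   version_ge "$(uname -r)" 5.15.0-78-generic || echo "kernel too old"
```

A plain string comparison would rank 5.15.0-101 below 5.15.0-78, which is why a version-aware sort is used here.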
- Time stamp of command history
  - su
  - Add the following line to the end of the file /etc/profile:
    export HISTTIMEFORMAT='%d/%m/%y %T '
  - source /etc/profile
  - Check with:
    history
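When this page is followed more than once on the same node, appending to /etc/profile by hand can leave duplicate lines. A small idempotent helper avoids that; this is only a sketch, and the name append_once is hypothetical.

```shell
# append_once LINE FILE: append LINE to FILE unless an identical line exists.
# Hypothetical helper; FILE is created if it does not exist yet.
append_once() {
    grep -Fqx -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Usage (as root):
#   append_once "export HISTTIMEFORMAT='%d/%m/%y %T '" /etc/profile
```

The same helper would also fit the later step that adds /usr/local/bin to PATH in /etc/profile.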
- Set timezone
  - su
  - timedatectl set-timezone Asia/Taipei
  - Check:
    timedatectl show
- NFS settings
  - Client
    - sudo -i
    - Install the NFS client.
      apt -y install nfs-common
    - Get the auto mount settings from work1.
      ssh [your_account]@eureka00 cat /work1/shared/spock/etc/fstab >> /etc/fstab
      [Login node only] Comment out the line starting with spock00:/software
    - Create directories.
      mkdir /software /work1 /projectV /projectW /projectX /projectY /projectZ
    - Check the accessibility of the target NFS servers.
      showmount -e spock00 # [Skip on login node]
      # /software 192.168.0.0/24
      showmount -e tumaz
      # /home 192.168.0.0/24
      showmount -e ironman
      # /volume1/gpucluster1 192.168.0.0/24
      # /volume3/gpucluster3 192.168.0.0/24
      showmount -e eater
      # /volume1/gpucluster3 192.168.0.0/24
      # /volume2/gpucluster4 192.168.0.0/24
      # /volume3/gpucluster6 192.168.0.0/24
      showmount -e pacific
      # /volume1/gpucluster1 192.168.0.0/24
    - Mount all remote directories.
      mount /software # Skip on login node
      mount /home
      mount /work1
      mount /projectW
      mount /projectX
      mount /projectY
      mount /projectZ
      mount /projectV
    - Check:
      df -h
      tumaz:/home                    208G   22G  176G  12% /home
      ironman:/volume1/gpucluster1    70T   47T   24T  67% /work1
      ironman:/volume3/gpucluster3    70T   70T  643G 100% /projectX
      eater:/volume1/gpucluster3      70T   67T  3.6T  95% /projectY
      eater:/volume2/gpucluster4      88T   77T   12T  88% /projectZ
      eater:/volume3/gpucluster6      88T   75T   13T  86% /projectW
      pacific:/volume1/gpucluster1   140T   20T  120T  15% /projectV
  - Server [Login node only]
    - Install the NFS server:
      sudo apt -y install nfs-kernel-server
    - Check the directory to be exported, and create it if it does not exist:
      ll /software
      mkdir /software # only if /software does not exist
    - Copy the NFS export settings to /etc/exports:
      cp /work1/shared/spock/etc/exports /etc/exports
    - Start and enable the NFS server:
      systemctl restart nfs-kernel-server.service
      systemctl enable nfs-kernel-server.service
    - Check the NFS server status and result.
      systemctl status nfs-kernel-server.service
      # Active: active (exited)
      showmount -e spock00
      # /software 192.168.0.0/24
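For the client-side df check in the NFS steps, comparing the output against the expected mount points by eye is error-prone with eight entries. The following sketch reports any expected mount point that is absent; the function name missing_mounts is hypothetical, and it assumes the mount point is the last column of each df line, as in the output shown.

```shell
# missing_mounts MOUNTPOINT...: read `df` output on stdin and print every
# expected mount point that does not appear in the last column.
# Hypothetical helper; prints nothing when all mount points are present.
missing_mounts() (
    mounted=$(awk '{ print $NF }')
    for mp in "$@"; do
        printf '%s\n' "$mounted" | grep -Fqx -- "$mp" || printf '%s\n' "$mp"
    done
)

# Usage on a computing node:
#   df -h | missing_mounts /software /home /work1 /projectV /projectW /projectX /projectY /projectZ
# On the login node, leave /software out of the list.
```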
- NIS settings
  - Install the NIS client.
    sudo apt -y install nis
  - Configure as a NIS client.
    - vim /etc/yp.conf, and add the following text at the end:
      domain tumaz.gpucluster.calab server tumaz
    - vim /etc/nsswitch.conf
      passwd: files systemd nis
      group: files systemd nis
      shadow: files nis
      hosts: files dns nis
  - Set the NIS domain name:
    vim /etc/defaultdomain
    tumaz.gpucluster.calab
  - Start and enable NIS.
    systemctl restart ypbind
    systemctl enable ypbind
  - Check:
    - ll /home
    - yptest -> 1 test fails
    - ypwhich -> tumaz
  - Log out and log in with your own account.
    su
  - Delete tmp_account:
    userdel --remove tmp_account
    It is okay to receive these error messages:
    userdel: tmp_account mail spool (/var/mail/tmp_account) not found
    userdel: tmp_account home directory (/home/tmp_account) not found
- Install GPU driver
  - Set text mode as the default (the NVIDIA driver cannot be installed while the X window system is running).
    systemctl set-default multi-user.target
  - Reboot.
    su
  - Install dkms:
    apt -y install dkms
  - Disable nouveau: create the file /etc/modprobe.d/blacklist-nouveau.conf with the content:
    blacklist nouveau
    options nouveau modeset=0
  - Apply system changes.
    update-initramfs -u
  - Reboot.
    su
  - Check that nouveau is disabled:
    lsmod | grep nouveau
    This should print nothing.
  - Install the NVIDIA driver
    - Install:
      su
      sh /work1/shared/spock/package/cuda/cuda_12.1.0_530.30.02_linux.run --silent --driver
    - Validate with cat /proc/driver/nvidia/version:
      NVRM version: NVIDIA UNIX x86_64 Kernel Module 530.30.02
      GCC version: gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
    - Copy the default profile files.
      cp /work1/shared/spock/init_script/*.sh /etc/profile.d/
      cp /work1/shared/spock/init_script/*.csh /etc/profile.d/
      cp /work1/shared/spock/etc/rc.local /etc/
      chmod +x /etc/rc.local
    - Change GPU settings [Login node only]
      Edit /etc/rc.local as follows:
      - Comment out the line /usr/bin/nvidia-persistenced --verbose
      - Comment out the line nvidia-cuda-mps-control -d
      - Replace nvidia-smi -i 0 -c EXCLUSIVE_PROCESS by nvidia-smi -i 0 -c PROHIBITED
    - Reboot
  - Check
    nvidia-smi
    NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1
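The three login-node edits to /etc/rc.local can be expressed as a sed filter. This is a sketch only: the actual contents of the shared rc.local are not shown on this page, so it assumes that the three commands appear exactly as quoted above, each at the start of a line (optionally indented). The function name login_node_rc_local is hypothetical.

```shell
# login_node_rc_local: read rc.local on stdin and write the login-node variant
# to stdout. Hypothetical helper; assumes the commands appear as quoted in the
# steps above.
login_node_rc_local() {
    sed -e 's|^\([[:space:]]*\)\(/usr/bin/nvidia-persistenced --verbose\)|\1#\2|' \
        -e 's|^\([[:space:]]*\)\(nvidia-cuda-mps-control -d\)|\1#\2|' \
        -e 's|nvidia-smi -i 0 -c EXCLUSIVE_PROCESS|nvidia-smi -i 0 -c PROHIBITED|'
}

# Usage (review the diff before replacing the file):
#   login_node_rc_local < /etc/rc.local > /tmp/rc.local.login
#   diff /etc/rc.local /tmp/rc.local.login
```

Writing to a separate file first keeps the original intact until the result has been checked.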
- NTP client
  - su
  - apt -y install ntp ntpdate
  - Edit /etc/ntp.conf
    - Add
      pool time.google.com iburst
    - Comment out the other pool servers.
  - systemctl restart ntp
  - systemctl status ntp
  - systemctl enable ntp
- TORQUE
  - Install the required packages.
    apt -y install libnuma-dev
    apt -y install tcl-dev tk-dev
    apt -y install libntirpc-dev
    sh /work1/shared/spock/package/torque/src/torque-3.0.6/spock_library_set.sh
  - Compile and install from source code.
    cd /work1/shared/spock/package/torque/src/torque-3.0.6
    # WARNING: do NOT run "spock_Install.sh" in parallel (i.e., install one node at a time)
    # [Login node     ] uncomment "--enable-server"
    # [Computing nodes] comment "--enable-server"
    sh spock_Install.sh >& log.spockXX
    cd ../../etc
    cp pbs_spock /etc/init.d/pbs
    ln -s /etc/init.d/pbs /etc/systemd/system/
    cp pbs.conf /etc/
    # [Login node only]: edit "pbs.conf" to set "start_server=1" and "start_mom=0"
    cp nodes_spock /var/spool/TORQUE/server_priv/nodes
    systemctl enable pbs
    source /etc/profile.d/torque.sh
    cd ../src/torque-3.0.6/
    ./torque.setup root
    killall pbs_server
    systemctl start pbs
    # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
    systemctl status pbs
  - Check cat /var/spool/TORQUE/pbs_environment:
    LANG=en_US.utf-8
  - Set up overcommit-ratio and disable overcommit-memory in crontab
    cp /work1/shared/spock/helper_script/disable_memory_overcommit.sh /root/
    - Edit crontab with crontab -e and add a new line:
      @reboot /usr/bin/sh /root/disable_memory_overcommit.sh 1> /tmp/disable_memory_overcommit.log 2>&1
- [Optional] [Login node only] Create the SSH key of root [Testing]
  ssh-keygen -t rsa
  cd ~/.ssh
  cp id_rsa.pub authorized_keys
  cp id_rsa* authorized_keys /work1/shared/spock/ssh_root/
- InfiniBand
  ref. https://docs.nvidia.com/networking/display/MLNXOSv3105002/Getting+Started#heading-RerunningtheWizard
  - Check hardware
    lspci | grep Mellanox
    01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
  - Install the necessary packages
    apt -y install libsasl2-dev libldap2-dev libssl-dev
  - Install driver
    su
    cd /work1/shared/spock/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.9-0.5.6.0-ubuntu22.04-x86_64
    ./mlnxofedinstall
    Device #1:
    ----------
      Device Type:      ConnectX6
      Part Number:      MCX653105A-HDA_Ax
      Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
      PSID:             MT_0000000223
      PCI Device Name:  01:00.0
      Base GUID:        0c42a10300ef2a1a
      Versions:         Current        Available
         FW             20.34.1002     20.36.1010
         PXE            3.6.0700       3.6.0901
         UEFI           14.27.0014     14.29.0014
      Status:           Up to date
    ---------
    /etc/init.d/openibd restart
    reboot
  - Check
    su
    ibstatus
    Infiniband device 'mlx5_0' port 1 status:
      default gid: fe80:0000:0000:0000:0c42:a103:00ef:2a1a
      base lid: 0xffff
      sm lid: 0x0
      state: 4: ACTIVE
      phys state: 5: LinkUp
      rate: 200 Gb/sec (4X HDR)
      link_layer: InfiniBand
    cat /etc/security/limits.conf
    * soft memlock unlimited
    * hard memlock unlimited
    systemctl status openibd
    Active: active (exited)
    systemctl is-enabled openibd
    enabled
    systemctl status opensmd
    Active: inactive (dead)
    systemctl is-enabled opensmd
    disabled
    hca_self_test.ofed
    ---- Performing Adapter Device Self Test ----
    Number of CAs Detected ................. 1
    PCI Device Check ....................... PASS
    Kernel Arch ............................ x86_64
    Host Driver Version .................... MLNX_OFED_LINUX-5.9-0.5.6.0 (OFED-5.9-0.5.6): 5.15.0-69-generic
    Host Driver RPM Check .................. PASS
    Firmware on CA #0 HCA .................. v20.36.1010
    Host Driver Initialization ............. PASS
    Number of CA Ports Active .............. 1
    Port State of Port #1 on CA #0 (HCA)..... UP 4X HDR (InfiniBand)
    Error Counter Check on CA #0 (HCA)...... PASS
    Kernel Syslog Check .................... PASS
    Node GUID on CA #0 (HCA) ............... 0c:42:a1:03:00:ef:2a:1a
    ------------------ DONE ---------------------
    ibdev2netdev -v | grep -i MCX
    0000:01:00.0 mlx5_0 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56 fw 20.36.1010 port 1 (ACTIVE) ==> ibp1s0 (Down)
  - IB connection and bandwidth test
    - Computing nodes -> Login node
      On spock00:
        ib_write_bw -aF
      On spockXX:
        ib_write_bw -aF spock00
      ************************************
      * Waiting for client to connect... *
      ************************************
      ---------------------------------------------------------------------------------------
                          RDMA_Write BW Test
       Dual-port       : OFF          Device         : mlx5_0
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       CQ Moderation   : 100
       Mtu             : 4096[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x02 QPN 0x0027 PSN 0xcb8c4 RKey 0x1fffbe VAddr 0x007f9c96aaa000
       remote address: LID 0x01 QPN 0x0027 PSN 0x560b74 RKey 0x1fffbe VAddr 0x007f0894517000
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       8388608    5000           23452.55           23452.55             0.002932
      ---------------------------------------------------------------------------------------
    - Computing nodes <- Login node
      On spock00:
        ib_read_bw -aF
      On spockXX:
        ib_read_bw -aF spock00
      ************************************
      * Waiting for client to connect... *
      ************************************
      ---------------------------------------------------------------------------------------
                          RDMA_Read BW Test
       Dual-port       : OFF          Device         : mlx5_0
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       CQ Moderation   : 100
       Mtu             : 4096[B]
       Link type       : IB
       Outstand reads  : 16
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x02 QPN 0x0028 PSN 0x593c01 OUT 0x10 RKey 0x1fffbf VAddr 0x007efc3f67f000
       remote address: LID 0x01 QPN 0x0028 PSN 0xbaa0aa OUT 0x10 RKey 0x1fffbf VAddr 0x007f6fd2a85000
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       8388608    1000           23517.75           23517.73             0.002940
      ---------------------------------------------------------------------------------------
  - Start mst to enable monitoring of the IB adapter
    systemctl enable mst
    systemctl start mst
    mst status
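To judge the bandwidth test results against the 200 Gb/sec link rate reported by ibstatus, the "BW average" value can be converted to Gb/s. The sketch below assumes that the MB/sec unit in the ib_write_bw and ib_read_bw output means 2^20 bytes per second; if the tool counts 10^6 bytes instead, the converted values are about 5% lower. Both function names are hypothetical.

```shell
# mib_per_sec_to_gbps VALUE: convert MiB/s to Gb/s (10^9 bits per second).
# Assumption: 1 MB in the test output equals 2^20 bytes.
mib_per_sec_to_gbps() {
    awk -v v="$1" 'BEGIN { printf "%.1f\n", v * 1048576 * 8 / 1e9 }'
}

# last_bw_average: read bandwidth test output on stdin and print the
# "BW average" column (4th field) of the last numeric result row.
last_bw_average() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { avg = $4 } END { if (avg != "") print avg }'
}

# Usage on spockXX, with the server side already running on spock00:
#   mib_per_sec_to_gbps "$(ib_write_bw -aF spock00 | last_bw_average)"
```

Under that assumption, the write result shown above (23452.55) corresponds to roughly 196.7 Gb/s, close to the HDR line rate.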
- ssh without password for root
  cd /work1/shared/spock/ssh_root/
  cp authorized_keys id_rsa* /root/.ssh/
  # Verification
  ssh spock00 # "yes" to "continue connecting"
  ssh spockXX # "yes" to "continue connecting"
  exit
  exit
5. Install compilers [Login node only]
- Intel compiler
  - [Login node] Install
    su
    mkdir /software/intel
    ln -s /software/intel /opt
    cd /work1/shared/spock/package/intel
    sh l_BaseKit_p_2023.1.0.46401.sh -a --cli
    Follow and accept the installation process.
    sh l_HPCKit_p_2023.1.0.46346.sh -a --cli
    Follow and accept the installation process.
  - [Computing nodes] Link
    su
    cd /opt
    ln -s /software/intel
- gcc compiler [skip]
  - [Login node] Install the latest version
    su
    mkdir /software/gcc
    cd /work1/shared/spock/package/gcc/gcc-12.2.0
    sh ./spock_Install.sh >& log.spock
    cd /software/gcc
    ln -s /work1/shared/spock/package/gcc ./src
    ln -s 12.2.0 default
6. Install packages
- [Login node only] CUDA
  cd /work1/shared/spock/package/cuda
  mkdir /software/cuda
  sh cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=/software/cuda/12.1
  - Create the default link:
    ln -s /software/cuda/12.1 /software/cuda/default
- [Login node only] Valgrind
  mkdir /software/valgrind
  cd /work1/shared/eureka/package/valgrind/valgrind-3.15.0
  # [Optional] Edit Fish_Install.sh
  sh Fish_Install.sh >& log.spock
  After installation
  cd /software/valgrind
  ln -s /work1/shared/spock/package/valgrind src
  ln -s 3.15.0 default
- [Login node only] UCX Library
  - Download the latest version [optional]
    mkdir /software/openucx
    mkdir /software/openucx/src
    cd /software/openucx/src
    git clone https://github.com/openucx/ucx.git ucx
  - Installation
    cd /software/openucx/src/ucx
    ./autogen.sh
    mkdir build
    cd build
    ../contrib/configure-release --prefix=/software/openucx/ucx-1.15.0_with_mt --enable-mt # enable MPI_THREAD_MULTIPLE
    make && make install
- [Login node only] OpenMPI
  source /etc/profile.d/intel.sh
  mkdir /software/openmpi
  ln -s /work1/shared/spock/package/openmpi /software/openmpi/src
  cd /software/openmpi/src/openmpi-4.1.5
  # [Optional] Edit spock_Install_with_UCX.sh (remember to un-comment the configuration flags)
  sh spock_Install_with_ucx.sh >& log.spock-all
  After installation
  - Check ucx
    cd /software/openmpi/4.1.5-ucx_mt-intel-2023.1.0/bin
    objdump -p mpicxx | grep PATH # see whether /software/openucx/ucx-1.15.0_with_mt/lib is in RPATH
    ldd mpicxx | grep ucx # see whether the dynamic linker can find the UCX libraries
  - Check debugger
    source /etc/profile.d/openmpi.sh
    ompi_info | grep "MCA memchecker"
    # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
  - Set the MCA parameters
    - Edit the configuration file /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to query the path of the configuration file, use: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
      pml=ucx
      osc=ucx
      btl=^openib
      These lines include only ucx for pml, include only ucx for osc, and exclude openib from btl. This works for OpenMPI 4.1.1 and UCX 1.12.0 without giving the warning message:
      [eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
- [Login node only] Maui [testing] [Problematic on sed and gcc version]
  Install sed 4.2.2
  cd /work1/shared/spock/package/sed/sed-4.2.2
  sh spock_Install.sh
  Install maui
  cd /work1/shared/spock/package/maui/maui-3.3.1/
  # [Optional] Edit Fish_Install.sh
  sh Fish_Install.sh >& log.spock
  cd etc/
  cp spock_maui.d /etc/init.d/maui.d
  cp maui.sh maui.csh /etc/profile.d/
  systemctl enable maui.d
  cd /usr/local/maui
  Edit maui.cfg as follows (an example is at maui-3.3.1/maui.cfg.eureka)
  RMPOLLINTERVAL 00:00:15
  #BACKFILLPOLICY FIRSTFIT
  #RESERVATIONPOLICY CURRENTHIGHEST
  #NODEALLOCATIONPOLICY MINRESOURCE
  # <==== Add by Nelson ====>
  JOBAGGREGATIONTIME 00:00:04
  # Backfill
  BACKFILLPOLICY FIRSTFIT
  RESERVATIONPOLICY NEVER
  # Node Allocation
  NODEALLOCATIONPOLICY FIRSTAVAILABLE
  # Set Job Flags
  JOBACTIONONNODEFAILURE CANCEL
  JOBNODEMATCHPOLICY EXACTNODE
  Then start maui
  systemctl start maui.d
  source /etc/profile.d/maui.sh
- [Login node only] FFTW
  - FFTW-2
    mkdir /software/fftw
    cd /work1/shared/spock/package/fftw/fftw-2.1.5-revised
    # [Optional] Edit Fish_Install.sh to install with intel or gcc
    sh Fish_Install.sh >& log.spock-intel
    After installation
    cd /software/fftw
    ln -s /work1/shared/spock/package/fftw src
  - FFTW-3
    cd /work1/shared/eureka/package/fftw/fftw-3.3.10
    # [Optional] Edit spock_Install.sh
    sh spock_Install.sh >& log.spock-intel
    After installation
    cd /software/fftw
- [Login node only] HDF5
  mkdir -p /software/hdf5
  cd /work1/shared/spock/package/hdf5/hdf5-1.10.6
  # [Optional] Edit Fish_Install.sh
  sh Fish_Install.sh >& log.spock
  After installation
  cd /software/hdf5
  ln -s /work1/shared/spock/package/hdf5 src
- [Login node only] GSL
  mkdir -p /software/gsl
  cd /work1/shared/spock/package/gsl/gsl-2.6
  # [Optional] Edit Fish_Install.sh
  sh Fish_Install.sh >& log.spock
  After installation
  cd /software/gsl
  ln -s /work1/shared/spock/package/gsl src
- python2
  source /etc/profile.d/openmpi.sh; source /etc/profile.d/intel.sh; source /etc/profile.d/hdf5.sh
  apt -y install python2 python2-dev
  apt -y install python-tk
  cd /work1/shared/spock/package/python2
  python2 get-pip.py
  sh install-python-packages.sh
- python3
  apt -y install python3 python3-dev
  apt -y install python3-tk
  apt -y install python3-pip
  cd /work1/shared/spock/package/python3
  sh install-python-packages.sh
  Add /usr/local/bin to PATH by adding a line at the end of /etc/profile:
  export PATH=/usr/local/bin:$PATH
- Module
  cd /work1/shared/spock/package/module/modules-5.1.1
  make clean
  ./configure
  make
  make install
  After installation
  cp init/profile.sh /etc/profile.d/10-modules.sh
  cp init/profile.csh /etc/profile.d/modules.csh
  source init/bash
  Add /software/intel/oneapi/modulefiles to the default module directories by adding this line to the file /usr/local/Modules/etc/initrc:
  module use /software/intel/oneapi/modulefiles
  Set up the preload module
  ln -s /software/modulefiles/default_modules.sh /etc/profile.d/default_modules.sh
7. Miscellaneous setup
- IPMI tool
  - Install the IPMI driver and tool:
    apt -y install openipmi ipmitool
  - Check:
    ipmitool sensor get "CPU Temp."
- ffmpeg
  apt -y install ffmpeg
- gnuplot
  apt -y install gnuplot-x11
- screen
  apt -y install screen
- pdsh
  apt -y install pdsh
- locate
  apt -y install plocate
- ClamAV
  apt -y install clamav clamav-daemon
  systemctl stop clamav-freshclam
  freshclam
  systemctl start clamav-freshclam
  systemctl enable clamav-freshclam
- X11 server
  apt -y install xorg openbox
- CPU usage monitor
  apt -y install sysstat
- Image display (feh)
  apt -y install feh
- Disable auto update.
  - Edit the apt config file at /etc/apt/apt.conf.d/20auto-upgrades as follows.
    APT::Periodic::Update-Package-Lists "0";
    APT::Periodic::Unattended-Upgrade "0";
  - Check the config
    apt-config dump APT::Periodic::Update-Package-Lists
    apt-config dump APT::Periodic::Unattended-Upgrade
- Set up login limitations
  - Edit /etc/security/faillock.conf:
    dir = /var/run/faillock
    audit
    silent
    deny = 3
    fail_interval = 900
    unlock_time = 300
  - Edit /etc/pam.d/common-auth as follows:
    auth required pam_faillock.so preauth
    auth [success=1 default=ignore] pam_unix.so nullok
    auth [default=die] pam_faillock.so authfail
    auth sufficient pam_faillock.so authsucc
    auth requisite pam_deny.so
    auth required pam_permit.so
    auth optional pam_cap.so
  - Edit /etc/pam.d/common-account as follows:
    account [success=1 new_authtok_reqd=done default=ignore] pam_unix.so
    account requisite pam_deny.so
    account required pam_permit.so
    account required pam_faillock.so
- Fail2ban
  apt install fail2ban
  - Create the file /etc/fail2ban/jail.d/sshd.local with the following content:
    [sshd]
    enabled = true
    bantime = 10m
    bantime.increment = true
    bantime.maxtime = 1d
    findtime = 10m
    maxretry = 5
    destemail = [Admin username]
    action = %(action_mwl)s
  systemctl enable fail2ban.service
  systemctl start fail2ban.service
  - Show a warning message to users at the login interface
    - Add the line Banner /etc/issue.net to the file /etc/ssh/sshd_config
    - Back up the original file /etc/issue.net and run
      echo "Info: Note that your account will be blocked for 5 minutes if you enter a wrong password 3 times." > /etc/issue.net
- Keep log files for 6 months
  - Edit the file /etc/logrotate.conf:
    # keep 4 weeks worth of backlogs
    rotate 26 #<- change this number from 4 to 26
  - Edit the file /etc/logrotate.d/wtmp:
    # no packages own wtmp -- we'll rotate it here
    /var/log/wtmp {
        missingok
        monthly
        create 0664 root utmp
        minsize 1M
        rotate 6 #<- change the number from 1 to 6
    }
8. Check
- CPU burn-in test
  - Install the CPU test program
    apt -y install stress-ng
  - Run the CPU test
    stress-ng --cpu 0 --timeout 30m &
  - Check the CPU temperature every minute during the test
    for i in {1..40}; do ipmitool sensor | grep "CPU Temp."; sleep 1m; done
    AMD Threadripper allows temperatures up to 95 degrees, and the non-critical upper limit is 85 degrees. For spock02, the highest temperature is 82 degrees.
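The monitoring loop in the CPU burn-in test can also flag readings above the non-critical limit automatically. The following is a sketch, assuming that ipmitool sensor prints pipe-separated columns with the sensor name first and the reading second; the function name cpu_temp_status is hypothetical.

```shell
# cpu_temp_status LIMIT: read `ipmitool sensor` output on stdin, print the
# "CPU Temp." reading, and return 1 when the reading exceeds LIMIT.
# Hypothetical helper; assumes "name | reading | unit | ..." columns.
cpu_temp_status() {
    awk -F'|' -v limit="$1" '
        $1 ~ /^CPU Temp\./ {
            temp = $2 + 0            # numeric value of the reading column
            print temp
            if (temp > limit) hot = 1
        }
        END { exit (hot ? 1 : 0) }
    '
}

# Usage during the burn-in test (85 is the non-critical upper limit):
#   ipmitool sensor | cpu_temp_status 85 || echo "CPU temperature above limit"
```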
- GPU burn-in test
  cd /work1/shared/spock/tests/gpu_burn-in/gpu-burn
  ./gpu_burn 1800 # run for 30 minutes
  During the test, watch the GPU temperature shown on screen. For the RTX 3080 Ti, the maximum temperature is 93 degrees Celsius, and the non-critical upper limit is 90 degrees. For spock02, the highest temperature is 81 degrees.
- MPI suite test [Run as regular user]
  - Download @spock00
    git clone https://github.com/open-mpi/mpi-test-suite.git
  - Compile @spock00
    cd mpi-test-suite
    ./autogen.sh
    ./configure CC=mpicc
    make
  - Run tests
    cp /work1/shared/tests/mpi_test_suite/run_test.sh ./
    qsub -I -lnodes=spockXX:ppn=32
    cd {directory of mpi_test_suite}
    sh run_test.sh >& spockXX.log
  - Check the test result
    tail spockXX.log
    # Number of failed tests: 0