Spock Installation: Login Node - calab-ntu/gpu-cluster GitHub Wiki

Install nodes (Ubuntu 22.04 server)

1. Check switch settings on the motherboard

  1. VGA switch -> off

  2. IPMI switch -> left (default)

  3. PSU (PHANTEKS) hybrid -> press down

  4. Change the cooling fan header from CPU_OPT to CHA_FAN1

  5. Unplug the micro-USB plug from the CPU pump.

2. Set up BIOS

  1. Boot up the machine with the BIOS flash disk plugged in.

    If the machine is booting for the first time, it asks whether to initialize the CPU configuration. Press Y to confirm.

  2. Check the BIOS version and update

    • Enter the BIOS by pressing Delete or F2 during boot.
    1. Check BIOS version : Main -> BIOS Information -> Version

      If Version = 1106 x64, skip the update steps below.

    2. Plug the USB disk labeled "BIOS" into the USB socket labeled "BIOS".
    3. Keep pressing Delete or F2 during boot to enter the BIOS.
    4. Tool -> ASUS EZ Flash 3 Utility
    5. Find the folder PRO_WS_WRX80E-SAGE_SE_WIFI-ASUS-1106
    6. Find the file PRO-WS-WRX80E-SAGE-SE-WIFI-ASUS-1106.CAP
    7. Confirm with Yes.
    8. Save changes and exit (or press F10) to reboot.
    9. Check the BIOS version again.
  3. DRAM overclock setting

    1. Ai Tweaker -> Ai overclock Tuner -> Choose D.O.C.P
    2. D.O.C.P -> Choose D.O.C.P DDR4-3200 16-18-18-38-1.35V
    3. F10 reboot.
    4. Check: Main -> Total Memory : 262144 MB -> Speed : 3200 MHz
  4. Enable NUMA

    1. Advanced -> AMD CBS -> DF Common Option -> Memory Addressing
    2. NUMA nodes per socket -> Choose NPS2
  5. F10 Reboot

3. Install Ubuntu Server 22.04

  1. Download Ubuntu 22.04 from https://www.ubuntu-tw.org/modules/tinyd0/ and make a bootable USB disk with Rufus.

  2. Set up boot disk in BIOS

    1. Boot with bootable USB disk plugged in.
    2. Enter the BIOS by pressing Delete or F2 during boot.
    3. Boot > Choose USB to boot.
    4. F10 to reboot.
  3. Install Ubuntu 22.04

    0. Choose Try or Install Ubuntu Server

    1. Select language : English -> Done
    2. Keyboard configuration.
      • Layout : English (US)
      • Variant : English (US) -> Done
    3. Choose type of install
      • Ubuntu Server
      • Search for third party drivers -> Done
    4. Network connections -> Continue without network
    5. Configure proxy -> Done

      Leave the field empty.

    6. Configure Ubuntu archive mirror -> Done

      Don't change the url

    7. Guided storage configuration
      1. Custom storage layout -> Done
      2. Select disks and reformat all of them.
      • Install in RAID1 (Redundancy)
        1. Select the Use As Boot Device on both disks
        2. /boot
          1. Choose free space and select Add GPT partition
            • Size : 1G
            • Format : Leave unformatted -> Create
          2. Create software RAID :
            • RAID name : md0
            • RAID type : RAID 1
            • Format : ext4
            • Mount : /boot -> Create
        3. / : Same steps as /boot, with the following change
          • Size : 914G
        4. swap : Choose the rest of the free space and format it as swap
    8. Profile setup
      • Your name: spock**

        ** is the node number.

      • Your server's name: spock**
      • Pick a username: tmp_account
      • Choose a password: ************
      • Confirm your password: ***********
    9. Upgrade to Ubuntu Pro -> Skip Ubuntu Pro setup for now -> Done
    10. SSH Setup -> Done

      Don't check the option.

    11. Third-party drivers
      • Do not install third-party drivers now -> Done
    12. Reboot Now -> Unplug the install medium and press enter to reboot.
  4. Check.

    1. Kernel : uname -r -> 5.15.0-60-generic
    2. CPU : lscpu | grep Model -> AMD Ryzen Threadripper PRO 5975WX 32-Cores
    3. RAM : sudo dmidecode --type memory | grep Speed
      Configured Speed: 3200 MT/s
      Speed: 2667 MT/s
      
    4. NUMA : lscpu | grep NUMA
      NUMA node(s)           : 2
      NUMA node0 CPU(s) : 0-15, 32-47
      NUMA node1 CPU(s) : 16-31, 48-63
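
The NUMA check above can be scripted. The following is a minimal sketch, not part of the original procedure: the helper name `numa_node_count` is hypothetical, and it only parses `lscpu` output passed on standard input.

```shell
# Hypothetical helper: print the NUMA node count found in `lscpu` output (stdin).
numa_node_count() {
    # Match only the "NUMA node(s)" summary line, not the per-node CPU lists.
    awk -F: '/^NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Usage on a node (with NPS2 the expected value is 2):
#   lscpu | numa_node_count
```

Comparing the printed value with 2 gives a quick pass/fail for the NPS2 BIOS setting.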
      

4. Set up settings

  1. Network settings.

    1. Edit netplan : sudo vim /etc/netplan/00-installer-config.yaml
      # This is the network config written by 'subiquity'
      network:
        ethernets:
          enp*****0:
            dhcp4: true
          enp*****1:
            dhcp4: false
            addresses: [192.168.0.2**/22] # Replace ** with the node number.
            nameservers:
              addresses: [140.112.254.4]
            routes:
              - to: default
                via: 192.168.0.1
        version: 2
      
    2. Apply netplan: sudo netplan apply
    3. Power off the machine and move it to the machine room.
      • Plug the Ethernet cable into the upper Ethernet port.
    4. Check
      1. ip settings : ip addr show dev enp*****1
        inet 192.168.0.2**/22
        
      2. DNS server : resolvectl status
        Link 3 (enp*****1)
            Current Scopes: DNS
                Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
        Current DNS Server: 140.112.254.4
               DNS Servers: 140.112.254.4
        
      3. ping 192.168.0.150
    5. Get the system network information.
      sudo -i
      scp [your_account]@192.168.0.150:/work1/shared/spock/etc/hosts /etc/hosts
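
The address pattern 192.168.0.2**/22 used in the netplan file can be generated from the node number. This is an illustrative sketch only: the function name `node_address` is hypothetical, and the upper bound of 54 is an assumption made here so that the last octet stays within 254.

```shell
# Hypothetical helper: build the static address for node NN (two digits).
# Assumption: node NN maps to 192.168.0.2NN/22, following the netplan example.
node_address() {
    case "$1" in
        [0-4][0-9]|5[0-4]) printf '192.168.0.2%s/22\n' "$1" ;;
        *) echo "usage: node_address NN (two digits, 00-54)" >&2; return 1 ;;
    esac
}

# Usage:
#   node_address 02
```

The pattern match on the argument avoids arithmetic on zero-padded numbers such as 08, which the shell would otherwise read as invalid octal.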
      
  2. Update system

    0. Operate with root privileges : sudo -i

    1. apt update
    2. apt-get install -y linux-image-5.15.0-78-generic (press Enter twice when the kernel update UI appears)
    3. reboot
    4. sudo -i
    5. Check : uname -r -> 5.15.0-78-generic (or above)
    6. Change group name of ID 1000 : groupmod --new-name calab tmp_account
    7. Set root password : passwd
    8. Delete /home/tmp_account : rm -r /home/tmp_account
    9. Change sh link from dash to bash:
      sudo dpkg-reconfigure dash
      # The configuration UI then asks whether to use dash as /usr/bin/sh
      # Choose "No" to set /usr/bin/sh to bash
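
The "5.15.0-78-generic or above" kernel check in this section can be automated with a version-aware comparison. A minimal sketch, assuming GNU `sort -V` is available (it is on Ubuntu); the function name `kernel_at_least` is hypothetical.

```shell
# Hypothetical helper: succeed if kernel release $1 is at least release $2.
kernel_at_least() {
    # sort -V orders version strings; if the smallest of the two is the
    # required release, the running release is equal or newer.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage after the reboot:
#   kernel_at_least "$(uname -r)" 5.15.0-78-generic && echo "kernel OK"
```

A plain string comparison would rank 5.15.0-101 below 5.15.0-78, which is why the sketch relies on version sorting.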
      
  3. Time stamp of command history

    1. su
    2. Add export HISTTIMEFORMAT='%d/%m/%y %T ' to the end of file /etc/profile
    3. source /etc/profile
    4. Check by history
  4. Set timezone

    1. su
    2. timedatectl set-timezone Asia/Taipei
    3. Check : timedatectl show
  5. NFS settings

    1. Client
      1. sudo -i
      2. Install NFS client. apt -y install nfs-common
      3. Get auto mount settings from work1.
        ssh [your_account]@eureka00 cat /work1/shared/spock/etc/fstab >> /etc/fstab
        

        [Login node only] Comment out the line starting with spock00:/software

      4. Create directories. mkdir /software /work1 /projectV /projectW /projectX /projectY /projectZ
      5. Check the accessibility of the target NFS servers
        showmount -e spock00     # /software 192.168.0.0/24 **[Skip on login node]**
        showmount -e tumaz        # /home 192.168.0.0/24
        showmount -e ironman    # /volume1/gpucluster1 192.168.0.0/24
                                                  # /volume3/gpucluster3 192.168.0.0/24
        showmount -e eater         # /volume1/gpucluster3 192.168.0.0/24
                                                  # /volume2/gpucluster4 192.168.0.0/24
                                                  # /volume3/gpucluster6 192.168.0.0/24
        showmount -e pacific       # /volume1/gpucluster1 192.168.0.0/24
        
      6. Mount all remote directories.
        mount /software  # Skip on the login node
        mount /home; mount /work1; mount /projectW; mount /projectX; mount /projectY; mount /projectZ; mount /projectV
        
      7. Check : df -h
        tumaz:/home                   208G   22G  176G  12% /home
        
        ironman:/volume1/gpucluster1   70T   47T   24T  67% /work1
        ironman:/volume3/gpucluster3   70T   70T  643G 100% /projectX
        
        eater:/volume1/gpucluster3     70T   67T  3.6T  95% /projectY
        eater:/volume2/gpucluster4     88T   77T   12T  88% /projectZ
        eater:/volume3/gpucluster6     88T   75T   13T  86% /projectW
        
        pacific:/volume1/gpucluster1  140T   20T  120T  15% /projectV
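
The `df -h` check above can be turned into a quick test for missing mounts. The following sketch is an illustration under the assumption that the mount point is the last column of each `df` line; the function name `missing_mounts` is hypothetical.

```shell
# Hypothetical helper: read `df` output on stdin and print every expected
# mount point (given as arguments) that does not appear in it.
missing_mounts() {
    df_output=$(cat)
    for dir in "$@"; do
        printf '%s\n' "$df_output" \
            | awk -v d="$dir" '$NF == d { found = 1 } END { exit !found }' \
            || printf '%s\n' "$dir"
    done
}

# Usage (omit /software on the login node):
#   df -h | missing_mounts /home /work1 /projectV /projectW /projectX /projectY /projectZ
```

No output means every listed directory is mounted.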
        
    2. Server [Login node only]
      1. Install NFS server : sudo apt -y install nfs-kernel-server
      2. Check the directory to be exported and create it if needed : ll /software -> (if /software does not exist) -> mkdir /software
      3. Copy NFS export settings to /etc/exports : cp /work1/shared/spock/etc/exports /etc/exports
      4. Start and enable NFS server :
        systemctl restart nfs-kernel-server.service
        systemctl enable nfs-kernel-server.service
        
      5. Check NFS server status and result.
        systemctl status nfs-kernel-server.service
        # Active: active (exited)
        showmount -e spock00
        # /software 192.168.0.0/24
        
  6. NIS settings

    1. Install NIS client. sudo apt -y install nis
    2. Configure as a NIS Client.
      1. vim /etc/yp.conf , add the following text at the end.
        domain tumaz.gpucluster.calab server tumaz
        
      2. vim /etc/nsswitch.conf
        passwd:         files systemd nis
        group:          files systemd nis
        shadow:         files nis
        hosts:          files dns nis 
        
      3. Set NIS domain name, vim /etc/defaultdomain
        tumaz.gpucluster.calab
        
      4. Start and enable nis.
        systemctl restart ypbind
        systemctl enable ypbind
        
    3. Check :
      1. ll /home
      2. yptest : 1 test fails
      3. ypwhich : tumaz
    4. Log out and log in with your own account, run su, then delete tmp_account : userdel --remove tmp_account
      It is okay to receive the following error messages:
      userdel: tmp_account mail spool (/var/mail/tmp_account) not found 
      userdel: tmp_account home directory (/home/tmp_account) not found
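
The /etc/nsswitch.conf edit in this section is easy to get partly wrong. As a hedged sketch, the hypothetical helper `nsswitch_missing_nis` below reads the file on standard input and lists which of the four databases named above still lack the `nis` source.

```shell
# Hypothetical helper: print the databases (passwd, group, shadow, hosts)
# whose nsswitch.conf line (stdin) does not list "nis" as a source.
nsswitch_missing_nis() {
    awk '
        /^(passwd|group|shadow|hosts):/ {
            has = 0
            for (i = 2; i <= NF; i++) if ($i == "nis") has = 1
            if (!has) { sub(/:$/, "", $1); print $1 }
        }'
}

# Usage:
#   nsswitch_missing_nis < /etc/nsswitch.conf
```

Empty output means all four databases include `nis`.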
      
  7. Install GPU driver

    1. Set text mode as the default (the NVIDIA driver cannot be installed while the X window system is running) : systemctl set-default multi-user.target
    2. Reboot.
    3. su
    4. Install dkms : apt -y install dkms
    5. Disable nouveau : Create file /etc/modprobe.d/blacklist-nouveau.conf with content:
      blacklist nouveau
      options nouveau modeset=0
      
    6. Apply system changes : update-initramfs -u
    7. Reboot.
    8. su
    9. Check nouveau is disabled : lsmod | grep nouveau

      This should print nothing.

    10. Install NVIDIA driver
      1. Install :
        su
        sh /work1/shared/spock/package/cuda/cuda_12.1.0_530.30.02_linux.run --silent --driver
        
      2. Validate with cat /proc/driver/nvidia/version:
        NVRM version: NVIDIA UNIX x86_64 Kernel Module  530.30.02
        GCC version:  gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
        
      3. Copy the default profile files.
        cp /work1/shared/spock/init_script/*.sh  /etc/profile.d/
        cp /work1/shared/spock/init_script/*.csh /etc/profile.d/
        
        cp /work1/shared/spock/etc/rc.local /etc/
        chmod +x /etc/rc.local
        
      4. Change GPU settings [login node only] : edit /etc/rc.local as follows
        1. Comment out the line /usr/bin/nvidia-persistenced --verbose
        2. Comment out the line nvidia-cuda-mps-control -d
        3. Replace nvidia-smi -i 0 -c EXCLUSIVE_PROCESS by nvidia-smi -i 0 -c PROHIBITED
      5. Reboot
    11. Check nvidia-smi
      NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1
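
The driver validation above compares the installed version with 530.30.02 by eye. A minimal sketch of the same check as a parser follows; the function name `nvrm_version` is hypothetical, and it assumes the version number directly follows the word "Module" as in the sample output shown in this section.

```shell
# Hypothetical helper: print the NVIDIA kernel-module version from the text
# of /proc/driver/nvidia/version (stdin).
nvrm_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i < NF; i++) if ($i == "Module") { print $(i + 1); exit }
    }'
}

# Usage:
#   [ "$(nvrm_version < /proc/driver/nvidia/version)" = "530.30.02" ] && echo "driver OK"
```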
      
  8. NTP client

    0. su

    1. apt -y install ntp ntpdate
    2. Edit /etc/ntp.conf
      • Add pool time.google.com iburst
      • Comment out other pool servers.
    3. systemctl restart ntp
    4. systemctl status ntp
    5. systemctl enable ntp
  9. TORQUE

    1. Install the required packages
      apt -y install libnuma-dev
      apt -y install tcl-dev tk-dev
      apt -y install libntirpc-dev
      sh /work1/shared/spock/package/torque/src/torque-3.0.6/spock_library_set.sh
      
    2. Compile and install from source code.
      cd /work1/shared/spock/package/torque/src/torque-3.0.6
      # WARNING: do NOT run "spock_Install.sh" in parallel (i.e., install one node at a time)
      # [Login node ] uncomment "--enable-server"
      # [Computing nodes] comment "--enable-server"
      sh spock_Install.sh >& log.spockXX
      cd ../../etc
      cp pbs_spock /etc/init.d/pbs
      ln -s /etc/init.d/pbs /etc/systemd/system/
      
      cp pbs.conf /etc/
      # [Login node only]: edit "pbs.conf" to set "start_server=1" and "start_mom=0"
      cp nodes_spock /var/spool/TORQUE/server_priv/nodes
      systemctl enable pbs 
      
      source /etc/profile.d/torque.sh
      cd ../src/torque-3.0.6/
      ./torque.setup root
      killall pbs_server
      systemctl start pbs # This error message is fine: "LOG_ERROR::No such file or directory (2) in read_config, fstat: config"
      systemctl status pbs
      
    3. Check : cat /var/spool/TORQUE/pbs_environment -> LANG=en_US.utf-8
    4. Setup overcommit-ratio and Disable overcommit-memory in crontab
      1. cp /work1/shared/spock/helper_script/disable_memory_overcommit.sh /root/
      2. Edit crontab with crontab -e and add a new line:
        @reboot /usr/bin/sh /root/disable_memory_overcommit.sh 1> /tmp/disable_memory_overcommit.log 2>&1
        
  10. [Optional] [Login node only] Create the SSH key of root [Testing]

    ssh-keygen -t rsa
    cd ~/.ssh
    cp id_rsa.pub authorized_keys
    cp id_rsa* authorized_keys /work1/shared/spock/ssh_root/
    
  11. InfiniBand

    ref. https://docs.nvidia.com/networking/display/MLNXOSv3105002/Getting+Started#heading-RerunningtheWizard

    1. Check hardware lspci | grep Mellanox
      01:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
      
    2. Install necessary package
      apt -y install libsasl2-dev  libldap2-dev libssl-dev 
      
    3. Install driver
      1. su
      2. cd /work1/shared/spock/package/ib/adaptor/driver/MLNX_OFED_LINUX-5.9-0.5.6.0-ubuntu22.04-x86_64
      3. ./mlnxofedinstall
        Device #1:
        ----------
        Device Type:      ConnectX6
        Part Number:      MCX653105A-HDA_Ax
        Description:      ConnectX-6 VPI adapter card; HDR IB (200Gb/s) and 200GbE; single-port QSFP56; PCIe4.0 x16; tall bracket; ROHS R6
        PSID:             MT_0000000223
        PCI Device Name:  01:00.0
        Base GUID:        0c42a10300ef2a1a
        Versions:         Current        Available
        FW             20.34.1002     20.36.1010
        PXE            3.6.0700       3.6.0901
        UEFI           14.27.0014     14.29.0014
        Status:           Up to date
        ---------
        
      4. /etc/init.d/openibd restart
      5. reboot
    4. Check
      0. su
      1. ibstatus
        Infiniband device 'mlx5_0' port 1 status:
                default gid:     fe80:0000:0000:0000:0c42:a103:00ef:2a1a
                base lid:        0xffff
                sm lid:          0x0
                state:           4: ACTIVE
                phys state:      5: LinkUp
                rate:            200 Gb/sec (4X HDR)
                link_layer:      InfiniBand
        
      2. cat /etc/security/limits.conf
        * soft memlock unlimited
        * hard memlock unlimited
        
      3. systemctl status openibd
        Active: active (exited)
        
      4. systemctl is-enabled openibd
        enabled
        
      5. systemctl status opensmd
        Active: inactive (dead)
        
      6. systemctl is-enabled opensmd
        disabled
        
      7. hca_self_test.ofed
        ---- Performing Adapter Device Self Test ----
        Number of CAs Detected ................. 1
        PCI Device Check ....................... PASS
        Kernel Arch ............................ x86_64
        Host Driver Version .................... MLNX_OFED_LINUX-5.9-0.5.6.0 (OFED-5.9-0.5.6): 5.15.0-69-generic
        Host Driver RPM Check .................. PASS
        Firmware on CA #0 HCA .................. v20.36.1010
        Host Driver Initialization ............. PASS
        Number of CA Ports Active .............. 1
        Port State of Port #1 on CA #0 (HCA).....  UP 4X HDR (InfiniBand)
        Error Counter Check on CA #0 (HCA)...... PASS
        Kernel Syslog Check .................... PASS
        Node GUID on CA #0 (HCA) ............... 0c:42:a1:03:00:ef:2a:1a
        ------------------ DONE ---------------------
        
      8. ibdev2netdev -v | grep -i MCX
        0000:01:00.0 mlx5_0 (MT4123 - MCX653105A-HDAT) ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56
                                                                                                                   fw 20.36.1010 port 1 (ACTIVE) ==> ibp1s0 (Down)
        
      9. IB connection and bandwidth test.
        1. Computing nodes -> Login node
          On spock00
          ib_write_bw -aF
          
          On spockXX
          ib_write_bw -aF spock00
          
          ************************************
          * Waiting for client to connect... *
          ************************************
          ---------------------------------------------------------------------------------------
          RDMA_Write BW Test
          Dual-port       : OFF          Device         : mlx5_0
          Number of qps   : 1            Transport type : IB
          Connection type : RC           Using SRQ      : OFF
          PCIe relax order: ON
          ibv_wr* API     : ON
          CQ Moderation   : 100
          Mtu             : 4096[B]
          Link type       : IB
          Max inline data : 0[B]
          rdma_cm QPs     : OFF
          Data ex. method : Ethernet
          ---------------------------------------------------------------------------------------
          local address: LID 0x02 QPN 0x0027 PSN 0xcb8c4 RKey 0x1fffbe VAddr 0x007f9c96aaa000
          remote address: LID 0x01 QPN 0x0027 PSN 0x560b74 RKey 0x1fffbe VAddr 0x007f0894517000
          ---------------------------------------------------------------------------------------
          #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
          8388608    5000             23452.55            23452.55                  0.002932
          ---------------------------------------------------------------------------------------
          
        2. Computing nodes <- Login node
          On spock00
          ib_read_bw -aF
          
          On spockXX
          ib_read_bw -aF spock00
          
          ************************************
          * Waiting for client to connect... *
          ************************************
          ---------------------------------------------------------------------------------------
          RDMA_Read BW Test
          Dual-port       : OFF          Device         : mlx5_0
          Number of qps   : 1            Transport type : IB
          Connection type : RC           Using SRQ      : OFF
          PCIe relax order: ON
          ibv_wr* API     : ON
          CQ Moderation   : 100
          Mtu             : 4096[B]
          Link type       : IB
          Outstand reads  : 16
          rdma_cm QPs     : OFF
          Data ex. method : Ethernet
          ---------------------------------------------------------------------------------------
          local address: LID 0x02 QPN 0x0028 PSN 0x593c01 OUT 0x10 RKey 0x1fffbf VAddr 0x007efc3f67f000
          remote address: LID 0x01 QPN 0x0028 PSN 0xbaa0aa OUT 0x10 RKey 0x1fffbf VAddr 0x007f6fd2a85000
          ---------------------------------------------------------------------------------------
          #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
          8388608    1000             23517.75            23517.73                  0.002940
          ---------------------------------------------------------------------------------------
          
    5. Start mst to enable monitoring of the IB adapter
      systemctl enable mst
      systemctl start mst
      mst status
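
For the ib_write_bw and ib_read_bw tests in this section, the average bandwidth can be pulled out of the result table and compared with a minimum. This is only a sketch: the helper names are hypothetical, and the 20000 MB/sec threshold in the usage line is an assumed example value, not a figure from this page.

```shell
# Hypothetical helper: print the "BW average[MB/sec]" column of the last
# result row in ib_write_bw / ib_read_bw output (stdin).
ib_bw_average() {
    # Result rows start with a byte count; with -a the last row is the
    # largest message size.
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { avg = $4 }
         END { if (avg != "") print avg; else exit 1 }'
}

# Hypothetical helper: succeed if bandwidth $1 is at least $2 (floating point).
bw_at_least() {
    awk -v bw="$1" -v min="$2" 'BEGIN { exit !(bw + 0 >= min + 0) }'
}

# Usage on spockXX (threshold is an assumed example):
#   bw=$(ib_write_bw -aF spock00 | ib_bw_average) && bw_at_least "$bw" 20000 && echo "IB bandwidth OK"
```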
      
  12. Passwordless SSH for root

    cd /work1/shared/spock/ssh_root/
    cp authorized_keys id_rsa* /root/.ssh/
    
    # Verification
    ssh spock00   # "yes" to "continue connecting"
    ssh spockXX   # "yes" to "continue connecting"
    exit
    exit
    

5. Install compilers [Login node only]

  1. Intel compiler

    • [Login node] Install
      1. su
      2. mkdir /software/intel
      3. ln -s /software/intel /opt
      4. cd /work1/shared/spock/package/intel
      5. sh l_BaseKit_p_2023.1.0.46401.sh -a --cli (follow and accept the installation process)
      6. sh l_HPCKit_p_2023.1.0.46346.sh -a --cli (follow and accept the installation process)
    • [Computing nodes] Link
      1. su
      2. cd /opt
      3. ln -s /software/intel
  2. gcc compiler [skip]

    • [Login node] Install latest version
      1. su
      2. mkdir /software/gcc
      3. cd /work1/shared/spock/package/gcc/gcc-12.2.0
      4. sh ./spock_Install.sh >& log.spock
      5. cd /software/gcc
      6. ln -s /work1/shared/spock/package/gcc ./src
      7. ln -s 12.2.0 default

6. Install packages

  1. [Login node only] CUDA

    1. cd /work1/shared/spock/package/cuda
    2. mkdir /software/cuda
    3. sh cuda_12.1.0_530.30.02_linux.run --silent --toolkit --installpath=/software/cuda/12.1
    4. Create default link: ln -s /software/cuda/12.1 /software/cuda/default
  2. [Login node only] Valgrind

    mkdir /software/valgrind
    cd /work1/shared/eureka/package/valgrind/valgrind-3.15.0
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
    

    After installation

    cd /software/valgrind
    ln -s /work1/shared/spock/package/valgrind src
    ln -s 3.15.0 default
    
  3. [Login node only] UCX Library

    • Download latest version [optional]
      mkdir -p /software/openucx/src
      cd /software/openucx/src
      git clone https://github.com/openucx/ucx.git ucx
      
    1. Installation
      cd /software/openucx/src/ucx
      ./autogen.sh
      mkdir build
      cd build
      ../contrib/configure-release --prefix=/software/openucx/ucx-1.15.0_with_mt --enable-mt  #enable  MPI_THREAD_MULTIPLE
      make && make install
      
  4. [Login node only] OpenMPI

    source /etc/profile.d/intel.sh
    mkdir /software/openmpi
    ln -s /work1/shared/spock/package/openmpi /software/openmpi/src
    cd /software/openmpi/src/openmpi-4.1.5
    # [Optional] Edit spock_Install_with_ucx.sh (remember to uncomment the configuration flags)
    sh spock_Install_with_ucx.sh >& log.spock-all
    

    After installation

    1. Check ucx

      cd /software/openmpi/4.1.5-ucx_mt-intel-2023.1.0/bin
      objdump -p mpicxx | grep PATH    # see whether /software/openucx/ucx-1.15.0_with_mt/lib is in RPATH
      ldd mpicxx | grep ucx            # see whether dynamic linker can find UCX libraries
      
    2. Check debugger

      source /etc/profile.d/openmpi.sh
      ompi_info | grep "MCA memchecker" # MCA memchecker: valgrind (MCA v2.1.0, API v2.0.0, Component v3.1.5)
      
    3. Set the MCA parameters

      1. Edit the configuration file /software/openmpi/4.1.1-intel-oneapi/etc/openmpi-mca-params.conf as root (to find the path of the configuration file, one can use: ompi_info --params mca all --level 9 | grep mca_param_files). Add the lines below (2021/07/24):
        pml=ucx
        osc=ucx
        btl=^openib
        

        These settings include only ucx for pml, include only ucx for osc, and exclude openib from btl. This works for OpenMPI 4.1.1 and UCX 1.12.0 without giving the warning message:

        [eureka01:45901] common_ucx.c:364 Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
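
To confirm that the three MCA settings above are present in openmpi-mca-params.conf, a small checker can help. This is a sketch under the assumption that the file uses `name = value` lines with `#` comments; the function name `missing_mca_params` is hypothetical.

```shell
# Hypothetical helper: read an MCA params file on stdin and print each of the
# three required settings that is not present.
missing_mca_params() {
    # Drop comments and all whitespace so "pml = ucx" and "pml=ucx" compare equal.
    conf=$(sed -e 's/#.*//' -e 's/[[:space:]]//g')
    for want in 'pml=ucx' 'osc=ucx' 'btl=^openib'; do
        printf '%s\n' "$conf" | grep -qxF -- "$want" || printf '%s\n' "$want"
    done
}

# Usage (path as reported by ompi_info):
#   missing_mca_params < /path/to/openmpi-mca-params.conf
```

Empty output means all three settings are active; a commented-out line is reported as missing.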
        
  5. [Login node only] Maui [testing] [Problematic on sed and gcc version]

    Install sed 4.2.2

    cd /work1/shared/spock/package/sed/sed-4.2.2
    sh spock_Install.sh
    

    Install maui

    cd /work1/shared/spock/package/maui/maui-3.3.1/
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
    cd etc/
    cp spock_maui.d /etc/init.d/maui.d
    cp maui.sh maui.csh /etc/profile.d/
    systemctl enable maui.d
    
    cd /usr/local/maui
    

    Edit maui.cfg as follows (an example is put at maui-3.3.1/maui.cfg.eureka)

         RMPOLLINTERVAL 00:00:15
         #BACKFILLPOLICY         FIRSTFIT
         #RESERVATIONPOLICY      CURRENTHIGHEST
         #NODEALLOCATIONPOLICY   MINRESOURCE
    
         # <==== Add by Nelson ====>
         JOBAGGREGATIONTIME      00:00:04
    
         # Backfill
    
         BACKFILLPOLICY          FIRSTFIT
         RESERVATIONPOLICY       NEVER
    
         # Node Allocation
    
         NODEALLOCATIONPOLICY    FIRSTAVAILABLE
    
         # Set Job Flags
         JOBACTIONONNODEFAILURE  CANCEL
         JOBNODEMATCHPOLICY      EXACTNODE
    systemctl start maui.d
    source /etc/profile.d/maui.sh
    
  6. [Login node only] FFTW

    • FFTW-2
      mkdir /software/fftw
      cd /work1/shared/spock/package/fftw/fftw-2.1.5-revised
      # [Optional] Edit Fish_Install.sh to install in intel or gcc
      sh Fish_Install.sh >& log.spock-intel
      
      After installation
      cd /software/fftw
      ln -s /work1/shared/spock/package/fftw src
      
    • FFTW-3
      cd /work1/shared/eureka/package/fftw/fftw-3.3.10
      # [Optional] Edit spock_Install.sh
      sh spock_Install.sh >& log.spock-intel
      
      After installation
      cd /software/fftw
      
  7. [Login node only] HDF5

    mkdir -p /software/hdf5
    cd /work1/shared/spock/package/hdf5/hdf5-1.10.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
    

    After installation

    cd /software/hdf5
    ln -s /work1/shared/spock/package/hdf5 src
    
  8. [Login node only] GSL

    mkdir -p /software/gsl
    cd /work1/shared/spock/package/gsl/gsl-2.6
    # [Optional] Edit Fish_Install.sh
    sh Fish_Install.sh >& log.spock
    

    After installation

    cd /software/gsl
    ln -s /work1/shared/spock/package/gsl src
    
  9. python2

    source /etc/profile.d/openmpi.sh; source /etc/profile.d/intel.sh; source /etc/profile.d/hdf5.sh
    apt -y install python2 python2-dev
    apt -y install python-tk
    cd /work1/shared/spock/package/python2
    python2 get-pip.py
    sh install-python-packages.sh
    
  10. python3

    apt -y install python3 python3-dev
    apt -y install python3-tk
    apt -y install python3-pip
    cd /work1/shared/spock/package/python3
    sh install-python-packages.sh
    

    Add /usr/local/bin to PATH by adding a line at the end of /etc/profile

    export PATH=/usr/local/bin:$PATH
    
  11. Module

    cd /work1/shared/spock/package/module/modules-5.1.1
    make clean
    ./configure
    make
    make install
    

    After installation

    cp init/profile.sh /etc/profile.d/10-modules.sh
    cp init/profile.csh /etc/profile.d/modules.csh
    source init/bash
    

    Add /software/intel/oneapi/modulefiles to default module directories by adding the line to the file /usr/local/Modules/etc/initrc

    module use /software/intel/oneapi/modulefiles
    

    Set up preload module

    ln -s /software/modulefiles/default_modules.sh /etc/profile.d/default_modules.sh
    

7. Miscellaneous setup

  1. IPMI tool

    1. Install IPMI driver and tool : apt -y install openipmi ipmitool
    2. Check : ipmitool sensor get "CPU Temp."
  2. ffmpeg : apt -y install ffmpeg

  3. gnuplot : apt -y install gnuplot-x11

  4. screen : apt -y install screen

  5. pdsh : apt -y install pdsh

  6. locate : apt -y install plocate

  7. ClamAV

    apt -y install clamav clamav-daemon
    systemctl stop clamav-freshclam
    freshclam
    systemctl start clamav-freshclam
    systemctl enable clamav-freshclam
    
  8. X11 server

    apt -y install xorg openbox
    
  9. CPU usage monitor

    apt -y install sysstat
    
  10. Image display feh

    apt -y install feh
    
  11. Disable auto update.

    1. Edit the apt config file at /etc/apt/apt.conf.d/20auto-upgrades as follows.
      APT::Periodic::Update-Package-Lists "0";
      APT::Periodic::Unattended-Upgrade "0";
      
    2. Check config
      apt-config dump APT::Periodic::Update-Package-Lists
      apt-config dump APT::Periodic::Unattended-Upgrade
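
The two `apt-config dump` commands above print the effective values. As a sketch, the hypothetical helper `auto_update_disabled` below succeeds only when both settings are reported as "0"; it assumes the `Name "value";` output format shown in the config file above.

```shell
# Hypothetical helper: read `apt-config dump` output on stdin and succeed only
# if both periodic settings are "0".
auto_update_disabled() {
    awk '
        $1 == "APT::Periodic::Update-Package-Lists" && $2 == "\"0\";" { lists = 1 }
        $1 == "APT::Periodic::Unattended-Upgrade"   && $2 == "\"0\";" { upgrade = 1 }
        END { exit !(lists && upgrade) }'
}

# Usage:
#   apt-config dump | auto_update_disabled && echo "auto update disabled"
```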
      
  12. Setup login limitations

    ref. https://officeguide.cc/linux-pam-tally2-lock-user-accounts-after-failed-login-attempts-tutorial-examples/

    1. Edit /etc/security/faillock.conf:
      dir = /var/run/faillock
      audit
      silent
      deny = 3
      fail_interval = 900
      unlock_time = 300
      
    2. Edit /etc/pam.d/common-auth as follows:
      auth required                   pam_faillock.so preauth
      auth [success=1 default=ignore] pam_unix.so nullok
      auth [default=die]              pam_faillock.so authfail
      auth sufficient                 pam_faillock.so authsucc
      
      auth requisite                  pam_deny.so
      auth required                   pam_permit.so
      auth optional                   pam_cap.so
      
    3. Edit /etc/pam.d/common-account as follows:
      account [success=1 new_authtok_reqd=done default=ignore] pam_unix.so
      account requisite                pam_deny.so
      account required                 pam_permit.so
      account required                 pam_faillock.so
      
  13. Fail2ban

    1. apt install fail2ban
    2. Create file /etc/fail2ban/jail.d/sshd.local with the following content:
      [sshd]
      enabled = true
      
      bantime = 10m
      bantime.increment = true
      bantime.maxtime = 1d
      
      findtime = 10m
      
      maxretry = 5
      
      destemail = [Admin username]
      action = %(action_mwl)s
      
    3. systemctl enable fail2ban.service
    4. systemctl start fail2ban.service
    5. Show a warning message to users at the login interface
      1. Add the line Banner /etc/issue.net to the file /etc/ssh/sshd_config
      2. Back up the original file /etc/issue.net, then run : echo "Info: Note that your account will be locked for 5 minutes after 3 failed password attempts." > /etc/issue.net
  14. Keep log files for 6 months

    1. Edit file /etc/logrotate.conf:
      # keep 4 weeks worth of backlogs
      rotate 26 #<- change this number from 4 to 26
      
    2. Edit file /etc/logrotate.d/wtmp:
      # no packages own wtmp -- we'll rotate it here
      /var/log/wtmp {
          missingok
          monthly
          create 0664 root utmp
          minsize 1M
          rotate 6 #<- change the number from 1 to 6
      }
      

8. Check

  1. CPU burn-in test

    1. Install CPU test program
      apt -y install stress-ng
      
    2. Run CPU test
      stress-ng --cpu 0 --timeout 30m &
      
    3. Detect CPU temperature every minute during test
      for i in {1..40}; do ipmitool sensor | grep "CPU Temp."; sleep 1m; done
      
      AMD Threadripper allows temperatures up to 95 degrees Celsius, and the non-critical upper limit is 85 degrees. For spock02, the highest temperature observed was 82 degrees.
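
The temperature readings from the loop above can be checked against the 85 degree non-critical limit automatically. A minimal sketch, assuming `ipmitool sensor` prints pipe-separated columns with the sensor name first and the reading second; both helper names are hypothetical.

```shell
# Hypothetical helper: print the reading of the "CPU Temp." sensor from
# `ipmitool sensor` output (stdin).
cpu_temp_from_sensor() {
    awk -F'|' '$1 ~ /^CPU Temp\./ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Hypothetical helper: succeed if temperature $1 is below limit $2.
temp_below() {
    awk -v t="$1" -v lim="$2" 'BEGIN { exit !(t + 0 < lim + 0) }'
}

# Usage during the burn-in test:
#   t=$(ipmitool sensor | cpu_temp_from_sensor)
#   temp_below "$t" 85 || echo "WARNING: CPU at $t degrees"
```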
  2. GPU burn-in test

    cd /work1/shared/spock/tests/gpu_burn-in/gpu-burn
    ./gpu_burn 1800 # run for 30 minutes
    

    During the test, watch the GPU temperature shown on screen. For the RTX 3080 Ti, the highest temperature is 93 degrees Celsius, and the non-critical upper limit is 90 degrees. For spock02, the highest temperature observed was 81 degrees.

  3. MPI test suite [Run as regular user]

    1. Download @spock00
      git clone https://github.com/open-mpi/mpi-test-suite.git
    2. Compile @spock00
      cd mpi-test-suite
      ./autogen.sh
      ./configure CC=mpicc
      make
    3. Run tests
      cp /work1/shared/tests/mpi_test_suite/run_test.sh ./
      qsub -I -lnodes=spockXX:ppn=32
      cd {directory of mpi_test_suite}
      sh run_test.sh >& spockXX.log
    4. Check test result
      tail spockXX.log # Number of failed tests: 0
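
The final check of the MPI test log can be scripted so it works across many nodes. This sketch assumes the log contains a line of the form "Number of failed tests: N", as noted above; the function name `failed_test_count` is hypothetical.

```shell
# Hypothetical helper: print the number from the last
# "Number of failed tests: N" line in a log (stdin).
failed_test_count() {
    sed -n 's/.*Number of failed tests:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Usage:
#   [ "$(failed_test_count < spockXX.log)" = "0" ] && echo "MPI tests passed"
```

An empty result means the summary line was not found, which usually indicates that the run did not finish.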