Disaster Recovery Plan

Login Node Failed

Reboots unexpectedly

  1. Replace the RAM modules on the login node.

System disk failure

  1. Replace the failed disk with the backup disk.

Computing Node Failed

Node not responding

  • Reboot the failed node.
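  • If the node does not respond to SSH at all and has a BMC reachable over IPMI (an assumption about the hardware), it can be power-cycled remotely; a sketch with hypothetical BMC address and credentials:

    ipmitool -I lanplus -H <node-bmc-address> -U <bmc-user> -P <bmc-password> chassis power cycle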

Storage Server Failed

Disk failure

Ref. https://kb.synology.com/zh-tw/DSM/help/DSM/StorageManager/storage_pool_repair?version=7

  1. Log in to DSM

    See /work1/xuanshan/eureka/NFS_INFO

  2. Replace broken disk

    • ironman & pacific (hot-swapping supported)
      1. Storage Manager > HDD/SSD > Choose the broken disk > Action > Deactivate
      2. Pull out the broken disk and replace it with a new one.
    • eater (hot-swapping not supported)
      • If the broken disk has already dropped out of the RAID and is not in use by the host
        1. Pull out the broken disk and replace it with a new one
      • If the disk is still in use by the host or has not dropped out of the RAID
        1. Turn off the NAS
        2. Pull out the broken disk and replace it with a new one
  3. Log in to DSM and repair the RAID.

    1. Storage Manager > Storage Management > Storage Pool
    2. Choose the degraded storage pool and select Repair.
    3. Follow the wizard's instructions.

Directory not accessible

  • Reboot the NAS.
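  • Before rebooting, it can help to confirm from a client whether the export is still visible; a minimal check, where <nas> stands for the affected server (ironman, pacific, or eater):

    showmount -e <nas>   # list exports currently published by the NAS
    mount | grep <nas>   # check whether the share is still mounted on this client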

NIS Server Failed

Boot up failure

Set up a temporary NIS server on another machine by repurposing a computing node as the NIS server.

SERVER

ref: https://blog.csdn.net/weixin_54099969/article/details/124800282 https://linux.vbird.org/linux_server/centos6/0430nis.php https://blog.csdn.net/dacming/article/details/121064665

  1. Install ypserv

    yum install ypserv.x86_64
    
  2. Append the tumaz backup files to /etc/passwd, /etc/group, and /etc/shadow

    If the NAS server is not working, use the files on the backup USB drive.

    cat /work1/xuanshan/tumaz_backup/tumaz_passwd.backup >> /etc/passwd
    cat /work1/xuanshan/tumaz_backup/tumaz_group.backup >> /etc/group
    cat /work1/xuanshan/tumaz_backup/tumaz_shadow.backup >> /etc/shadow
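
    After appending, it is worth checking the merged files for duplicates or syntax problems; a read-only check, assuming the standard shadow-utils tools:

    pwck -r   # verify /etc/passwd and /etc/shadow consistency (report only)
    grpck -r  # verify /etc/group consistency (report only)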
    
  3. Change the NIS settings on the target node

    nisdomainname eurekaXX.gpucluster.calab
    
  4. Set up NIS server

    1. Edit /etc/sysconfig/network
      #NISDOMAIN=tumaz.gpucluster.calab
      NISDOMAIN=eurekaXX.gpucluster.calab
      YPSERV_ARGS="-p 1011"
      
    2. Edit /etc/sysconfig/yppasswdd
      YPPASSWDD_ARGS="--port  1012" 
      
    3. Start NIS server
      systemctl start ypserv.service
      /usr/lib64/yp/ypinit -m # press Ctrl-D to finish the host list, then confirm with y
      
    4. Check
      systemctl status ypserv.service
      # Active: active (running)
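
      A further check that the maps are actually served; a sketch assuming the yp-tools package is installed, using the domain from step 3:

      ypcat -h localhost -d eurekaXX.gpucluster.calab passwd | head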
      
  5. Set up LDAP server

    ref: https://linux.vbird.org/linux_server/rocky9/0240ldap.php https://blog.tomy168.com/2019/07/centos76-openldap.html

    1. Install the required packages: yum install ipa-server.x86_64 migrationtools.noarch
    2. Setup LDAP server:
      1. Start the LDAP server: systemctl start slapd (the following sub-steps assume the OpenLDAP slapd service)

      2. Copy backup files as template. cp -r /work1/xuanshan/tumaz_backup/slapd /work1/xuanshan/tumaz_backup/slapd_eurekaXX

      3. Edit /work1/xuanshan/tumaz_backup/slapd_eurekaXX/basedn.ldif as follows.

        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcSuffix
        olcSuffix: dc=eurekaXX,dc=gpucluster,dc=calab
        
        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcRootDN
        olcRootDN: cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab
        
        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcRootPW
        olcRootPW: ${password ssha}
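
        The ${password ssha} placeholder is the SSHA hash of the LDAP admin password; it can be generated with slappasswd and pasted in:

        slappasswd -h {SSHA}   # prompts for the password and prints the {SSHA}... hash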
        
      4. Edit /work1/xuanshan/tumaz_backup/slapd_eurekaXX/basedn1.ldif as follows.

        dn: olcDatabase={1}monitor,cn=config
        changetype: modify
        replace: olcAccess
        olcAccess: {0}to * by dn.base="gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth" read by dn.base="cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" read by * none
        
      5. Add necessary schema to LDAP server

        ldapadd -Y EXTERNAL -H ldapi:/// -f basedn.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f basedn1.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/cosine.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/nis.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/inetorgperson.ldif
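
        To confirm the schemas and changes were accepted, the config tree can be listed (an optional check):

        ldapsearch -Y EXTERNAL -H ldapi:/// -b cn=config dn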
        
      6. Copy example and change ownership:

        cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
        chown ldap:ldap /var/lib/ldap/*
        
      7. Edit ou.ldif and add the following lines at the top.

        dn: dc=eurekaXX,dc=gpucluster,dc=calab
        dc: eurekaXX
        objectClass: top
        objectClass: domain
        
      8. Import the OU, groups, and users into the server

        /usr/share/migrationtools/migrate_passwd.pl ./users > user.ldif
        sed -i 's/tumaz/eurekaXX/g' ou.ldif group.ldif user.ldif #change tumaz to eurekaXX in the files
        sed -i 's/dc=padl,dc=com/dc=eurekaXX,dc=gpucluster,dc=calab/g' user.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./ou.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./group.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./user.ldif
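
        To verify the import, query the new directory; it should return the imported accounts:

        ldapsearch -x -b "dc=eurekaXX,dc=gpucluster,dc=calab" "(objectClass=posixAccount)" uid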
        
    3. Change the LDAP client settings.

CLIENT

NIS slaves
  1. Change the NIS target: run setup -> [Authentication configuration] -> [Use NIS] and set:
    Domain: eurekaXX.gpucluster.calab
    IP: eurekaXX
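
    On CentOS 7-era clients the same change can be made non-interactively; a sketch assuming authconfig is available:

    authconfig --enablenis --nisdomain=eurekaXX.gpucluster.calab --nisserver=eurekaXX --update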
    
LDAP clients
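
  1. A minimal sketch for pointing a client at the new LDAP server, assuming authconfig (CentOS 7 and earlier) and the base DN used above:

    authconfig --enableldap --enableldapauth --ldapserver=ldap://eurekaXX --ldapbasedn="dc=eurekaXX,dc=gpucluster,dc=calab" --update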

System RAID degraded

  1. Check the RAID status.
    mdadm --detail /dev/md* #with root privilege
    
    If the RAID is degraded, the State line shows:
    State : clean, degraded
    
  2. Remove the broken SSD from the RAID.
    mdadm --manage /dev/md0 --remove /dev/sd? #? is the broken SSD index.
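    # if the disk is not yet marked as failed, mark it first: mdadm --manage /dev/md0 --fail /dev/sd?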
    
  3. Shut down system and replace broken SSD.
  4. Bootup system and check new SSD status.
    lsblk
    
  5. Add the new SSD to the RAID.
    mdadm --manage /dev/md0 --add /dev/sdb?
    
  6. Monitor the rebuild progress.
    watch cat /proc/mdstat
    
  7. Save the new RAID configuration to the system.
    mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf
    update-initramfs -u
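
    The two commands above assume a Debian/Ubuntu-style layout; on CentOS/RHEL nodes (as the yum-based steps elsewhere in this plan suggest), the equivalents would be:

    mdadm --detail --scan >> /etc/mdadm.conf
    dracut -f   # regenerate the initramfs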
    

WIP