Disaster Recovery Plan

Login Node Failed

Reboots unexpectedly

  1. Replace the RAM modules on the login node.

System disk failure

  1. Replace the failed disk with the backup disk.

Computing Node Failed

Node not responding

  • Reboot the failed node.
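  • If the node does not respond to SSH at all and has a BMC reachable over IPMI (an assumption about the hardware), it can be power-cycled remotely; a sketch with hypothetical BMC address and credentials:

    ipmitool -I lanplus -H <node-bmc-address> -U <bmc-user> -P <bmc-password> chassis power cycle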

Storage Server Failed

Disk failure

Ref. https://kb.synology.com/zh-tw/DSM/help/DSM/StorageManager/storage_pool_repair?version=7

  1. Log in to DSM

    See /work1/xuanshan/eureka/NFS_INFO

  2. Replace broken disk

    • ironman & pacific (hot-swapping supported)
      1. Storage Manager > HDD/SSD > Choose the broken disk > Action > Deactivate
      2. Pull out the broken disk and replace it with a new one.
    • eater (hot-swapping not supported)
      • If the broken disk has already dropped out of the RAID and is not in use by the host
        1. Pull out the broken disk and replace it with a new one
      • If the disk is still in use by the host or has not dropped out of the RAID
        1. Turn off the NAS
        2. Pull out the broken disk and replace it with a new one
  3. Log in to DSM and repair the RAID.

    1. Storage Manager > Storage Management > Storage Pool
    2. Choose the degraded storage pool and select Repair.
    3. Follow the wizard's instructions.

Directory not accessible

  • Reboot the NAS.
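  • Before rebooting, it can help to confirm from a client whether the export is still visible; a minimal check, where <nas> stands for the affected server (ironman, pacific, or eater):

    showmount -e <nas>   # list exports currently published by the NAS
    mount | grep <nas>   # check whether the share is still mounted on this client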

NIS Server Failed

Boot up failure

Set up a temporary NIS server on another machine by repurposing a computing node as the NIS server.

SERVER

ref: https://blog.csdn.net/weixin_54099969/article/details/124800282 https://linux.vbird.org/linux_server/centos6/0430nis.php https://blog.csdn.net/dacming/article/details/121064665

  1. Install ypserv

    yum install ypserv.x86_64
    
  2. Append the tumaz backup files to /etc/passwd, /etc/group, and /etc/shadow

    If the NAS server is not working, use the files on the backup USB drive.

    cat /work1/xuanshan/tumaz_backup/tumaz_passwd.backup >> /etc/passwd
    cat /work1/xuanshan/tumaz_backup/tumaz_group.backup >> /etc/group
    cat /work1/xuanshan/tumaz_backup/tumaz_shadow.backup >> /etc/shadow
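
    After appending, it is worth checking the merged files for duplicates or syntax problems; a read-only check, assuming the standard shadow-utils tools:

    pwck -r   # verify /etc/passwd and /etc/shadow consistency (report only)
    grpck -r  # verify /etc/group consistency (report only)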
    
  3. Change the NIS settings on the target node

    nisdomainname eurekaXX.gpucluster.calab
    
  4. Set up NIS server

    1. Edit /etc/sysconfig/network
      #NISDOMAIN=tumaz.gpucluster.calab
      NISDOMAIN=eurekaXX.gpucluster.calab
      YPSERV_ARGS="-p 1011"
      
    2. Edit /etc/sysconfig/yppasswdd
      YPPASSWDD_ARGS="--port  1012" 
      
    3. Start NIS server
      systemctl start ypserv.service
      /usr/lib64/yp/ypinit -m # press Ctrl-D to finish the host list, then confirm with y
      
    4. Check
      systemctl status ypserv.service
      # Active: active (running)
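
      A further check that the maps are actually served; a sketch assuming the yp-tools package is installed, using the domain from step 3:

      ypcat -h localhost -d eurekaXX.gpucluster.calab passwd | head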
      
  5. Set up LDAP server

    ref: https://linux.vbird.org/linux_server/rocky9/0240ldap.php https://blog.tomy168.com/2019/07/centos76-openldap.html

    1. Install the required packages: yum install ipa-server.x86_64 migrationtools.noarch
    2. Setup LDAP server:
      1. Start the LDAP server: systemctl start slapd (the following sub-steps assume the OpenLDAP slapd service)

      2. Copy backup files as template. cp -r /work1/xuanshan/tumaz_backup/slapd /work1/xuanshan/tumaz_backup/slapd_eurekaXX

      3. Edit /work1/xuanshan/tumaz_backup/slapd_eurekaXX/basedn.ldif as follows.

        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcSuffix
        olcSuffix: dc=eurekaXX,dc=gpucluster,dc=calab
        
        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcRootDN
        olcRootDN: cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab
        
        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcRootPW
        olcRootPW: ${password ssha}
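
        The ${password ssha} placeholder is the SSHA hash of the LDAP admin password; it can be generated with slappasswd and pasted in:

        slappasswd -h {SSHA}   # prompts for the password and prints the {SSHA}... hash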
        
      4. Edit /work1/xuanshan/tumaz_backup/slapd_eurekaXX/basedn1.ldif as follows.

        dn: olcDatabase={1}monitor,cn=config
        changetype: modify
        replace: olcAccess
        olcAccess: {0}to * by dn.base="gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth" read by dn.base="cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" read by * none
        
      5. Add necessary schema to LDAP server

        ldapadd -Y EXTERNAL -H ldapi:/// -f basedn.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f basedn1.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/cosine.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/nis.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/inetorgperson.ldif
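
        To confirm the schemas and changes were accepted, the config tree can be listed (an optional check):

        ldapsearch -Y EXTERNAL -H ldapi:/// -b cn=config dn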
        
      6. Copy example and change ownership:

        cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
        chown ldap:ldap /var/lib/ldap/*
        
      7. Edit ou.ldif and add the following lines at the top.

        dn: dc=eurekaXX,dc=gpucluster,dc=calab
        dc: eurekaXX
        objectClass: top
        objectClass: domain
        
      8. Import the OU, groups, and users into the server

        /usr/share/migrationtools/migrate_passwd.pl ./users > user.ldif
        sed -i 's/tumaz/eurekaXX/g' ou.ldif group.ldif user.ldif #change tumaz to eurekaXX in the files
        sed -i 's/dc=padl,dc=com/dc=eurekaXX,dc=gpucluster,dc=calab/g' user.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./ou.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./group.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./user.ldif
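
        To verify the import, query the new directory; it should return the imported accounts:

        ldapsearch -x -b "dc=eurekaXX,dc=gpucluster,dc=calab" "(objectClass=posixAccount)" uid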
        
    3. Change the LDAP client settings.

CLIENT

NIS slaves
  1. Change the NIS target: run setup -> [Authentication configuration] -> [Use NIS] and set:
    Domain: eurekaXX.gpucluster.calab
    IP: eurekaXX
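
    On CentOS 7-era clients the same change can be made non-interactively; a sketch assuming authconfig is available:

    authconfig --enablenis --nisdomain=eurekaXX.gpucluster.calab --nisserver=eurekaXX --update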
    
LDAP clients
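
  1. A minimal sketch for pointing a client at the new LDAP server, assuming authconfig (CentOS 7 and earlier) and the base DN used above:

    authconfig --enableldap --enableldapauth --ldapserver=ldap://eurekaXX --ldapbasedn="dc=eurekaXX,dc=gpucluster,dc=calab" --update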

System RAID degraded

  1. Check the RAID status.
    mdadm --detail /dev/md* #with root privilege
    
    If the RAID is degraded, the State line shows:
    State : clean, degraded
    
  2. Remove the broken SSD from the RAID.
    mdadm --manage /dev/md0 --remove /dev/sd? #? is the broken SSD index.
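    # if the disk is not yet marked as failed, mark it first: mdadm --manage /dev/md0 --fail /dev/sd?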
    
  3. Shut down system and replace broken SSD.
  4. Bootup system and check new SSD status.
    lsblk
    
  5. Add the new SSD to the RAID.
    mdadm --manage /dev/md0 --add /dev/sdb?
    
  6. Monitor the rebuild progress.
    watch cat /proc/mdstat
    
  7. Save the new RAID configuration to the system.
    mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf
    update-initramfs -u
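
    The two commands above assume a Debian/Ubuntu-style layout; on CentOS/RHEL nodes (as the yum-based steps elsewhere in this plan suggest), the equivalents would be:

    mdadm --detail --scan >> /etc/mdadm.conf
    dracut -f   # regenerate the initramfs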
    

WIP