# Disaster Recovery Plan

## Login Node Failed

### Reboots unexpectedly
- Replace the RAM on the login node (see the check below before swapping hardware).
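
Before pulling DIMMs, it can help to confirm that the reboots are really memory-related. A minimal sketch, assuming a systemd-based login node with the optional edac-utils package installed:

```bash
# Look for machine-check / ECC events in the kernel log (run as root).
journalctl -k --no-pager | grep -iE 'mce|edac|memory error'

# If edac-utils is installed, summarize corrected/uncorrected error counts
# per memory controller, which helps identify the DIMM slot to replace.
edac-util -v
```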

### System disk failure
- Replace the system disk with the backup disk.

## Computing Node Failed

### Node not responding
- Reboot the failed node (remotely via the BMC if possible; see the sketch below).
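
If the node has a BMC reachable on the management network, it can be power-cycled remotely before anyone walks to the machine room. A hedged sketch; the address and credentials below are placeholders, not the cluster's real BMC settings:

```bash
# Hypothetical example: power-cycle a hung node through its BMC.
ipmitool -I lanplus -H 192.168.0.101 -U admin -P 'password' chassis power status
ipmitool -I lanplus -H 192.168.0.101 -U admin -P 'password' chassis power cycle
```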

## Storage Server Failed

### Disk failure
Ref. https://kb.synology.com/zh-tw/DSM/help/DSM/StorageManager/storage_pool_repair?version=7

- Log in to DSM (see /work1/xuanshan/eureka/NFS_INFO).
- Replace the broken disk.
  - ironman & pacific (hot-swapping supported):
    - In Storage Manager > HDD/SSD, choose the broken disk, then Action > Deactivate.
    - Pull out the broken disk and replace it with a new one.
  - eater (hot-swapping not supported):
    - If the disk is not on the host and has dropped out of the RAID:
      - Pull out the broken disk and replace it with a new one.
    - If the disk is on the host or has not dropped out of the RAID:
      - Turn off the NAS.
      - Pull out the broken disk and replace it with a new one.
- Log in to DSM and repair the RAID.
  - Go to Storage Manager > Storage Management > Storage Pool.
  - Choose the degraded storage pool, select Repair, and follow the wizard instructions.

### Directory not accessible
- Reboot the NAS.
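
Before rebooting, it is worth checking from a client whether the NAS is still exporting the directory or only the client-side mount has gone stale. A minimal sketch; the server name and mount point are placeholders, use the values recorded in /work1/xuanshan/eureka/NFS_INFO:

```bash
# Run on a login/computing node. "ironman" and /work1 are placeholders.
showmount -e ironman     # list the exports the NAS still advertises
mount | grep /work1      # confirm how the directory is currently mounted
# If only the client mount is stale, a remount may avoid a full NAS reboot:
umount -l /work1 && mount /work1
```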

## NIS Server Failed

### Boot-up failure
Make a temporary NIS server on another machine, i.e. turn one of the computing nodes into the NIS server.

#### SERVER
ref:
- https://blog.csdn.net/weixin_54099969/article/details/124800282
- https://linux.vbird.org/linux_server/centos6/0430nis.php
- https://blog.csdn.net/dacming/article/details/121064665

- Install ypserv:

      yum install ypserv.x86_64

- Append the tumaz backup files to /etc/passwd, /etc/group, and /etc/shadow. If the NAS server is not working, use the files on the backup USB instead.

      cat /work1/xuanshan/tumaz_backup/tumaz_passwd.backup >> /etc/passwd
      cat /work1/xuanshan/tumaz_backup/tumaz_group.backup >> /etc/group
      cat /work1/xuanshan/tumaz_backup/tumaz_shadow.backup >> /etc/shadow
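
  After appending, a quick consistency check catches truncated or duplicated entries before NIS maps are built from these files. A minimal sketch; the username is a placeholder:

  ```bash
  pwck -r                 # read-only check of /etc/passwd and /etc/shadow
  grpck -r                # read-only check of /etc/group
  getent passwd someuser  # placeholder: confirm a known cluster account resolves
  ```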

- Change the NIS settings on the target node:

      nisdomainname eurekaXX.gpucluster.calab

- Set up the NIS server.
  - Edit /etc/sysconfig/network:

        #NISDOMAIN=tumaz.gpucluster.calab
        NISDOMAIN=eurekaXX.gpucluster.calab
        YPSERV_ARGS="-p 1011"

  - Edit /etc/sysconfig/yppasswdd:

        YPPASSWDD_ARGS="--port 1012"

  - Start the NIS server:

        systemctl start ypserv.service
        /usr/lib64/yp/ypinit -m   # ctrl-D -> y/Y

  - Check:

        systemctl status ypserv.service   # Active: active (running)
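
    Once ypserv is active, a functional check confirms the maps are actually being served. A minimal sketch that queries the server directly by domain and host, so it works even before any client has been re-bound:

    ```bash
    # Run on the temporary NIS server; adjust eurekaXX to the node actually used.
    ypcat -d eurekaXX.gpucluster.calab -h localhost passwd | head
    ypcat -d eurekaXX.gpucluster.calab -h localhost group | head
    ```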

- Set up the LDAP server.
  ref: https://linux.vbird.org/linux_server/rocky9/0240ldap.php
  ref: https://blog.tomy168.com/2019/07/centos76-openldap.html
  - Install the ipa package:

        yum install ipa-server.x86_64 migrationtools.noarch

  - Start the LDAP server:

        systemctl start slapd.service

  - Copy the backup files as a template:

        cp -r /work1/xuanshan/tumaz_backup/slapd /work1/xuanshan/tumaz_backup/slapd_eurekaXX

  - Edit /work1/xuanshan/tumaz_backup/slapd_eurekaXX/basedn.ldif as follows:

        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcSuffix
        olcSuffix: dc=eurekaXX,dc=gpucluster,dc=calab

        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcRootDN
        olcRootDN: cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab

        dn: olcDatabase={2}hdb,cn=config
        changetype: modify
        replace: olcRootPW
        olcRootPW: ${password ssha}
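
    The ${password ssha} placeholder is an SSHA hash of the admin password, not the plain-text password. A minimal sketch for generating it (slappasswd ships with the OpenLDAP server packages):

    ```bash
    # Paste the output in place of ${password ssha}.
    slappasswd -h '{SSHA}'
    # Output looks like {SSHA}... (illustrative, not a real hash).
    ```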

  - Edit /work1/xuanshan/tumaz_backup/slapd_eurekaXX/basedn1.ldif as follows:

        dn: olcDatabase={1}monitor,cn=config
        changetype: modify
        replace: olcAccess
        olcAccess: {0}to * by dn.base="gidNumber=0+uidNumber=0,cn=peercred,cn=external,cn=auth" read by dn.base="cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" read by * none

  - Add the necessary schemas to the LDAP server:

        ldapadd -Y EXTERNAL -H ldapi:/// -f basedn.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f basedn1.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/cosine.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/nis.ldif
        ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/inetorgperson.ldif
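
    To confirm the suffix and rootDN changes from basedn.ldif were applied, the dynamic configuration can be queried directly. A minimal sketch:

    ```bash
    ldapsearch -Y EXTERNAL -H ldapi:/// -b 'olcDatabase={2}hdb,cn=config' olcSuffix olcRootDN
    ```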

  - Copy the example DB config and change ownership:

        cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG
        chown ldap:ldap /var/lib/ldap/*

  - Edit ou.ldif and add the following lines at the head:

        dn: dc=eurekaXX,dc=gpucluster,dc=calab
        dc: eurekaXX
        objectClass: top
        objectClass: domain

  - Import ou, group, and users to the server:

        sed -i 's/tumaz/eurekaXX/g' ou.ldif group.ldif user.ldif   # change tumaz to eurekaXX in the files
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./ou.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./group.ldif
        /usr/share/migrationtools/migrate_passwd.pl ./users > user.ldif
        sed -i 's/dc=padl,dc=com/dc=eurekaXX,dc=gpucluster,dc=calab/g' user.ldif
        ldapadd -x -W -D "cn=admincalab,dc=eurekaXX,dc=gpucluster,dc=calab" -f ./user.ldif
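
    After the import, a quick search verifies that accounts and groups are visible under the new suffix. A minimal sketch:

    ```bash
    ldapsearch -x -b 'dc=eurekaXX,dc=gpucluster,dc=calab' '(objectClass=posixAccount)' uid | head
    ldapsearch -x -b 'dc=eurekaXX,dc=gpucluster,dc=calab' '(objectClass=posixGroup)' cn | head
    ```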

#### CLIENT

##### NIS slaves
- Change the NIS target with setup -> [Authentication configuration] -> [Use NIS]:

      Domain: eurekaXX.gpucluster.calab
      IP: eurekaXX
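
  If the interactive setup tool is unavailable, the same change can be made from the command line. A hedged sketch using authconfig (present on older CentOS releases; newer ones use authselect instead):

  ```bash
  authconfig --enablenis \
             --nisdomain=eurekaXX.gpucluster.calab \
             --nisserver=eurekaXX \
             --update
  systemctl restart ypbind.service
  ypwhich   # should print eurekaXX once the client is bound
  ```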

##### LDAP clients
- Change the LDAP client settings.
- Install the ipa package.

### System RAID degraded
- Check the RAID status:

      mdadm --detail /dev/md*   # with root privilege

  Check the State line; if the RAID is degraded it reads:

      State : clean, degraded
- Remove the broken SSD from the RAID:

      mdadm --manage /dev/md0 --remove /dev/sd?   # ? is the broken SSD index
- Shut down the system and replace the broken SSD.
- Boot up the system and check the new SSD status:

      lsblk
- Add the new SSD to the RAID:

      mdadm --manage /dev/md0 --add /dev/sdb?
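
  If the RAID members are partitions (e.g. /dev/sda2) rather than whole disks, the new SSD needs a matching partition table before it can be added. A hedged sketch, assuming /dev/sda is the surviving SSD and /dev/sdb is the replacement:

  ```bash
  sfdisk -d /dev/sda | sfdisk /dev/sdb   # copy the partition layout to the new SSD
  lsblk /dev/sdb                         # confirm the partitions exist before --add
  ```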
- Watch the recovery progress:

      watch cat /proc/mdstat
- Save the new RAID configuration in the system:

      mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf
      update-initramfs -u
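
  The two commands above use Debian/Ubuntu paths. On the CentOS-style nodes used elsewhere in this plan, the usual equivalents are the following (an assumption; verify which config file mdadm reads on the node before appending to it):

  ```bash
  mdadm --detail --scan | tee -a /etc/mdadm.conf
  dracut -f   # rebuild the initramfs so the array is assembled at boot
  ```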
WIP