Eureka Installation: Storage Server - calab-ntu/gpu-cluster GitHub Wiki

Installation Procedure

Server

For tumaz

```
apt-get install nfs-kernel-server nfs-common
vi /etc/exports   # list the directories to export
rpcbind start

#GUI: enable NFS
```

For Synology NAS

1. Log in to DSM
2. Control Panel > File Services > SMB / AFP / NFS
3. Tick Enable NFS

Reference: https://kb.synology.com/zh-tw/DSM/help/DSM/AdminCenter/file_winmacnfs_nfs?version=7

Client

Restart the NFS service (run on the server)
```
/etc/init.d/nfs-kernel-server restart
```
Check the export list
```
showmount -e <NFS IP>
```
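The check above can also be scripted. The following is a minimal sketch, assuming the usual `showmount -e` layout (one header line, then one export per line with the path in the first column); the function name `export_is_listed` is hypothetical.

```shell
# Return success if the path given as $1 appears in `showmount -e` output read from stdin.
export_is_listed() {
    # Skip the "Export list for ..." header, then compare the first column exactly.
    awk -v path="$1" 'NR > 1 && $1 == path { found = 1 } END { exit !found }'
}

# Example (on a client):
#   showmount -e <NFS IP> | export_is_listed /volume1/gpucluster3 && echo "exported"
```

The comparison is exact, so a path that is only a prefix of an exported directory does not match.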

Mount

mount -t nfs 192.168.0.253:/volume1/gpucluster3 /projectY/

or add an entry to /etc/fstab

tumaz:/home   /home    nfs    defaults     0 0

Mount all file systems listed in /etc/fstab
```
mount -a
```
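After mounting, it can be useful to confirm that the mount point is really served over NFS. This is a sketch, assuming a Linux client where `/proc/mounts` lists one mount per line as `device mountpoint fstype ...`; the function name `is_nfs_mounted` is hypothetical.

```shell
# Return success if mount point $1 is listed with an NFS file system type
# in the mounts table $2 (defaults to /proc/mounts).
is_nfs_mounted() {
    awk -v mp="$1" '$2 == mp && ($3 == "nfs" || $3 == "nfs4") { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Example (no trailing slash, because /proc/mounts lists mount points without one):
#   is_nfs_mounted /projectY && echo "NFS mount is active"
```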

Miscellaneous

Create a space on eater for an existing eureka user
1. Register the target user's info in /etc/passwd and /etc/shadow on tumaz
2. ssh gamer04
3. ssh admincalab@eater
4. sudo -i
5. cd /volume1/gpucluster3
6. mkdir <user_name>
7. chown <uid>:<gid> <user_name>
where <uid> and <gid> are the values recorded in /etc/passwd on spartan
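The uid and gid lookup for step 7 can be done with a small helper instead of reading the file by eye. This is a sketch, assuming the standard passwd format (`name:password:uid:gid:...`); the function name `uid_gid_for`, the user `alice`, and the path of the copied passwd file are all hypothetical.

```shell
# Print "uid:gid" for user $1 from a passwd-format file $2 (defaults to /etc/passwd).
# Exits with a non-zero status if the user is not found.
uid_gid_for() {
    awk -F: -v user="$1" '$1 == user { print $3 ":" $4; found = 1 } END { exit !found }' \
        "${2:-/etc/passwd}"
}

# Example for steps 5-7 (as root on the NAS, using a copy of the passwd file):
#   owner=$(uid_gid_for alice /path/to/passwd.copy) &&
#   mkdir /volume1/gpucluster3/alice &&
#   chown "$owner" /volume1/gpucluster3/alice
```

Chaining the commands with `&&` means the directory is not created when the user is missing from the passwd file.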
Expand storage volume
1. Open DSM
2. Storage Manager
3. Storage Pool
4. Action
5. Add drive
6. Drag the HDDs from the left side to the right
7. Click Next
8. Click Apply
Add new storage volume
1. Open DSM
2. Storage Manager
3. Volume
4. Create
5. Select RAID 6
6. Maximize "Modify allocated size" by clicking Max
7. Choose Btrfs instead of ext4
8. Apply
  
  Setting up the new storage volume takes a few days; /volume? is created automatically.
9. Control Panel > Shared Folder > Create > Create
  
  General: set the name to gpucluster?
  Advanced: select
  Enable data checksum for advanced data integrity
  Enable file compression
10. @server: chmod 755 gpucluster?/
11. @server: vi /etc/exports
  Append the line below to the end of /etc/exports
```
/volume?/gpucluster?    192.168.0.0/24(rw,async,no_wdelay,no_root_squash,insecure_locks,sec=sys)
```
12. @server: exportfs -arv
13. Check that the newly added volume is exported to clients: showmount -e
14. @eureka00: mkdir /project?
15. @eureka00: chmod 755 /project?
16. @eureka00: vi /etc/fstab
  Append the line below to the end of /etc/fstab
```
eater:/volume?/gpucluster?                  /project?   nfs      auto,bg,hard,intr            0 0
```
  Repeat the above step on every computing node. Do NOT copy eureka00:/etc/fstab directly to the other computing nodes.
17. Check that the NFS clients can see the newest volume:
```
@eureka00: showmount -e  eater
@eureka02: showmount -e  eater
```
18. Mount the newest volume on all computing nodes:
  pdsh -w eureka[00-33] mount -a
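Steps 11 and 16 both append one line to a system file, and step 16 is repeated on every node. A small idempotent helper avoids duplicate entries if a step is rerun. This is a sketch only; the function name `append_once` is hypothetical, and it assumes the target file already ends with a newline.

```shell
# Append line $2 to file $1 unless an identical line is already present.
append_once() {
    # -x matches whole lines only, -F treats the text as a fixed string, -q suppresses output.
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example for step 16 (as root on each node; "?" stays a placeholder as in the steps above):
#   append_once /etc/fstab 'eater:/volume?/gpucluster? /project? nfs auto,bg,hard,intr 0 0'
```

Because only the one line is added, each node keeps its own existing /etc/fstab entries, which is the reason the steps warn against copying the file between nodes.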
Change from an ext4 volume to a Btrfs volume
https://www.synology.com/en-us/knowledgebase/DSM/tutorial/Storage/How_to_change_from_ext4_volume_to_btrfs_volume
Synchronize with an NTP server
1. Control Panel → System → Regional Options → Time
2. Choose Synchronize with an NTP server and pick pool.ntp.org → Update Now → Apply
Set up network link aggregation
1. Control Panel > Network > Network Interface > Create > Create Bond
2. Follow the creation wizard with the default settings and select all four network interfaces for the bond.
3. Check and test: write four 2G files to the NAS from four individual nodes at the same time: for i in 01 02 03 04; do ssh eureka$i dd if=/dev/zero of=[target file name] bs=2G count=1 & done
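The parallel write test in step 3 can be generated as a dry run first, so the commands can be reviewed before anything is written. This is a sketch under a few assumptions: the function name `print_write_test` and the `ddtest_NN` file names are hypothetical, and `bs=1M count=2048` is used in place of `bs=2G count=1` because on Linux a single read may return slightly less than 2 GiB, which can leave the test file short.

```shell
# Print one dd command per node; pipe the output to `sh` to actually run the test.
# $1 = target directory on the NAS mount; remaining arguments = node suffixes.
print_write_test() {
    target_dir=$1
    shift
    for i in "$@"; do
        # ssh -n keeps ssh from reading stdin; each dd writes 2 GiB in 1 MiB blocks.
        printf 'ssh -n eureka%s dd if=/dev/zero of=%s/ddtest_%s bs=1M count=2048 &\n' \
            "$i" "$target_dir" "$i"
    done
    # Wait for all background writes so the whole run can be timed.
    printf 'wait\n'
}

# Example:
#   print_write_test /projectY 01 02 03 04          # review the commands
#   print_write_test /projectY 01 02 03 04 | sh     # run them
```

Each dd reports its own throughput; the sum across the nodes should exceed what a single link can carry if the bond is working.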

Maintenance

If both the power and alert indicator lights flash constantly ...

Turn the NAS off
Pull all the hard disks halfway out, then turn the NAS on again to see whether the power light still flashes constantly
-> Do not pull disks out while the machine is on; otherwise, data will be lost
-> If it does not shut down after holding the power button for 10 seconds, unplug the power directly
If the power and alert lights keep flashing, the motherboard has a problem.
If drive reconnection errors happen frequently, run an extended S.M.A.R.T. test:
1. Open DSM
2. Open Storage Manager
3. Click HDD/SSD in the left column
4. Choose the HDD to be tested
5. Click Health Info
6. Click the S.M.A.R.T. tab
7. Click Extended test and Start

If the warranty has expired, send it to 虹谷資訊 for maintenance
https://www.hongku.com.tw/
5F, No. 1, Sec. 1, Chongqing N. Rd., Datong Dist., Taipei City (台北市大同區重慶北路一段1號5樓)

If the warranty has not expired yet, send it to the store where we originally bought it.
--> Label every disk with its index before pulling it out, as shown below. (very important!!!)

     -----------------------------------
     |           [power light]         |
     -----------------------------------
     |      1        |        2        |
     -----------------------------------
     |      3        |        4        |
     -----------------------------------
     |      5        |        6        |
     -----------------------------------
     |      7        |        8        |
     -----------------------------------
     |      9        |       10        |
     -----------------------------------
     |     11        |       12        |
     -----------------------------------

After the repair is completed:

1. Insert the hard drives back in their original order (very important!!!)
2. Boot
3. Finish (eureka will mount automatically)

Rebuild disk array

Master machine

The operating system is spread across all disks in the master machine, so disks can only be replaced one at a time.

1. Back up the data in the disk volume.
2. Delete the target directory.
  Control Panel > Shared Folder > Delete
3. Delete the volume.
  - ironman: Storage Manager > Volume > Delete
  - eater: Storage Manager > Volume > Delete
4. [Optional] Delete or modify the disk group.
  - ironman: Storage Manager > Disk Group > Delete
  - eater: Storage Manager > Storage Pool > Delete
5. [Optional] Rebuild the disk group.
  - ironman: Storage Manager > Disk Group > Create
  - eater: Storage Manager > Storage Pool > Create, choose RAID 6
6. Rebuild the volume.
  - ironman: Storage Manager > Volume > Create
    Choose Btrfs as the file system
  - eater: Storage Manager > Volume > Create
    Choose Btrfs as the file system
7. Once the new disk volume is built, pull out one disk and replace it.
8. Repair the disk array.
  - ironman: Storage Manager > Disk Group > Manage > Repair
  - eater: Storage Manager > Storage Pool > Action > Repair
Repeat steps 7 and 8 until all disks are replaced.
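Since disks can only be replaced one at a time, it helps to confirm that a repair has finished before pulling the next disk. DSM shows the repair progress in Storage Manager. The sketch below is an alternative check from a shell, and rests on an assumption not stated above: that the NAS uses Linux software RAID and exposes its state in `/proc/mdstat` over SSH. The function name `array_is_rebuilding` is hypothetical.

```shell
# Return success if the mdstat text on stdin shows a rebuild still in progress.
array_is_rebuilding() {
    # While an array is being rebuilt, mdstat prints a progress line such as
    # "recovery = 12.6% ..." (or "resync" / "reshape").
    grep -Eq 'recovery|resync|reshape'
}

# Example (on the NAS): poll once a minute until the repair is done.
#   while array_is_rebuilding < /proc/mdstat; do sleep 60; done
```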

Extension

1. Back up the data in the disk volume.
2. Delete the target directory.
  Control Panel > Shared Folder > Delete
3. Delete the volume.
  - ironman: Storage Manager > Volume > Delete
  - eater: Storage Manager > Volume > Delete
4. Delete or modify the disk group.
  - ironman: Storage Manager > Disk Group > Delete
  - eater: Storage Manager > Storage Pool > Delete
5. Replace all disks in the NAS extension.
6. Rebuild the disk group.
  - ironman: Storage Manager > Disk Group > Create
    Choose RAID 6
  - eater: Storage Manager > Storage Pool > Create
    Choose RAID 6
7. Rebuild the volume.
  - ironman: Storage Manager > Volume > Create
    Choose Btrfs as the file system
  - eater: Storage Manager > Volume > Create
    Choose Btrfs as the file system

Add extension

ironman

Connect the extension to the master machine and insert the disks.
Create a new disk group. Storage Manager > Disk Group/Storage Pool > Create
1. Choose the disks in the new extension.
2. Choose RAID 6 as the array type.
Create a new disk volume. Storage Manager > Volume > Create
Choose Btrfs as the file system

eater

Connect the extension to the master machine and insert the disks.
Storage Manager > Volume/Storage Pool > Create
1. Choose the disks in the new extension.
2. Choose RAID 6.
Create a new disk volume. Storage Manager > Volume > Create
Choose Btrfs as the file system

NAS Data scrubbing

Log in to DSM
Open Storage Manager
Storage Pool > Data Scrubbing > Action > Run Manually
