Eureka Installation: Storage Server - calab-ntu/gpu-cluster GitHub Wiki

Installation Procedure

Server

For tumaz

  1. apt-get install nfs-kernel-server nfs-common
  2. vi /etc/exports
  3. rpcbind start
    
    #GUI: enable NFS

For Synology NAS

  1. Login to DSM
  2. Control Panel > File Services > SMB / AFP / NFS
  3. Activate Enable NFS

Reference: https://kb.synology.com/zh-tw/DSM/help/DSM/AdminCenter/file_winmacnfs_nfs?version=7

Client

  1. Restart the NFS service (run this on the server)
    /etc/init.d/nfs-kernel-server restart
  2. Check
    showmount -e <NFS IP>
  3. Mount
    mount -t nfs 192.168.0.253:/volume1/gpucluster3 /projectY/
    or, to make the mount persistent, add a line like the following to /etc/fstab:
    tumaz:/home   /home    nfs    defaults     0 0
    
  4. Mount all devices
    mount -a
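
To confirm that a mount from the steps above actually succeeded, the following minimal sketch checks the kernel mount table for an NFS entry. The function name `is_nfs_mounted` is hypothetical, and it assumes a Linux client where /proc/mounts is available.

```shell
#!/bin/sh
# Report (via exit status) whether a mount point is currently backed by NFS.
# $1: mount point, e.g. /projectY/
# $2: mount table to inspect (optional, defaults to /proc/mounts)
is_nfs_mounted() {
    mp="${1%/}"                 # drop one trailing slash so /projectY/ matches /projectY
    [ -n "$mp" ] || mp="/"      # "/" itself would otherwise become empty
    table="${2:-/proc/mounts}"
    # Field 2 is the mount point, field 3 the file system type (nfs, nfs4, ...)
    awk -v mp="$mp" '$2 == mp && $3 ~ /^nfs/ { found = 1 } END { exit !found }' "$table"
}

# Example:
#   is_nfs_mounted /projectY/ && echo "mounted" || echo "NOT mounted"
```

The check only reads the mount table, so it is safe to run on any node at any time.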

Miscellaneous

  1. Create a space on eater for existing user on eureka

    1. Register target user info in /etc/passwd and /etc/shadow on tumaz
    2. ssh gamer04
    3. ssh admincalab@eater
    4. sudo -i
    5. cd /volume1/gpucluster3
    6. mkdir <user_name>
    7. chown <uid>:<gid> <user_name>

    where <uid> and <gid> are recorded in /etc/passwd on spartan
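
The uid and gid needed for the chown step can be read from a passwd-format file. This is a minimal sketch: the function name `uid_gid_of` and the user `alice` are hypothetical, and it assumes you run it on the machine whose /etc/passwd holds the user's entry.

```shell
#!/bin/sh
# Print "uid:gid" for a user, taken from a passwd-format file.
# $1: user name
# $2: passwd file (optional, defaults to /etc/passwd)
# Exit status is non-zero when the user is not found.
uid_gid_of() {
    awk -F: -v u="$1" '
        $1 == u { print $3 ":" $4; found = 1; exit }
        END     { exit !found }
    ' "${2:-/etc/passwd}"
}

# Example (alice is a placeholder user name):
#   uid_gid_of alice
# The printed uid:gid pair is what the chown step on eater expects.
```

Because the directory lives on eater while the account is registered elsewhere, use the numeric ids rather than the user name in chown.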

  2. Expand storage volume

    1. Open DSM
    2. Open Storage Manager
    3. Select Storage Pool
    4. Click Action
    5. Click Add Drive
    6. Drag the HDDs from the left side to the right
    7. Click Next
    8. Click Apply
  3. Add new storage volume

    1. Open DSM
    2. Open Storage Manager
    3. Select Volume
    4. Click Create
    5. Select RAID 6
    6. Under "Modify allocated size", click Max to allocate the maximum size
    7. Choose Btrfs instead of ext4
    8. Click Apply

      Setting up the new storage volume takes a few days; /volume? is created automatically.

    9. Control Panel > Shared Folder > Create > Create

      General: set the name to gpucluster?
      Advanced: enable both of the following options:
      Enable data checksum for advanced data integrity
      Enable file compression

    10. @server: chmod 755 gpucluster?/
    11. @server: vi /etc/exports

      Append the following line to the end of /etc/exports:

      /volume?/gpucluster?    192.168.0.0/24(rw,async,no_wdelay,no_root_squash,insecure_locks,sec=sys)
      
    12. @server: exportfs -arv
    13. Check that the newly added volume is exported to clients: showmount -e
    14. @eureka00: mkdir /project?
    15. @eureka00: chmod 755 /project?
    16. @eureka00: vi /etc/fstab

      Append the following line to the end of /etc/fstab:

      eater:/volume?/gpucluster?                  /project?   nfs      auto,bg,hard,intr            0 0
      

      Repeat the above step on every computing node. Do NOT copy eureka00:/etc/fstab directly to the other computing nodes.

    17. Check NFS client can see the newest volume:
      @eureka00: showmount -e  eater
      @eureka02: showmount -e  eater
    18. Mount the newest volume on all computing nodes:
      pdsh -w eureka[00-33] mount -a
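
Since /etc/fstab must be edited on each node rather than copied between nodes, an idempotent append helps avoid duplicate entries when the step is repeated. This is a minimal sketch: `append_once` is a hypothetical helper, and the `?` placeholders must be replaced with the real volume and project names before use.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# $1: the line to add
# $2: the file to modify
append_once() {
    line="$1"
    file="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (run as root on each computing node; substitute the ? placeholders first):
#   append_once 'eater:/volume?/gpucluster?  /project?  nfs  auto,bg,hard,intr  0 0' /etc/fstab
```

Running the helper twice with the same line leaves the file unchanged the second time, so it is safe to re-run on nodes that were already configured.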
  4. Change from an ext4 volume to a Btrfs volume
    https://www.synology.com/en-us/knowledgebase/DSM/tutorial/Storage/How_to_change_from_ext4_volume_to_btrfs_volume

  5. Synchronize with an NTP server

    1. Control Panel > System > Regional Options > Time
    2. Choose Synchronize with an NTP server, pick pool.ntp.org, then click Update Now > Apply
  6. Set up network link aggregation

    1. Control Panel > Network > Network Interface > Create > Create Bond
    2. Follow the creation wizard with the default settings, selecting all four network interfaces for the bond.
    3. Check and test: write four 2G files to the NAS from four individual nodes at the same time:
      for i in 01 02 03 04; do ssh eureka$i dd if=/dev/zero of=[target file name] bs=2G count=1 & done
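
The parallel write test for the bond can be sketched as a small generator that prints one command per node, so the commands can be reviewed before they run. The function name `bond_test_cmds` and the output file name `bondtest.<node>` are hypothetical. The sketch writes 2 GiB as 2048 blocks of 1 MiB instead of one 2G block, because on Linux a single read is capped just under 2 GiB, so `bs=2G count=1` may write slightly less than intended. `conv=fsync` assumes GNU dd.

```shell
#!/bin/sh
# Print one ssh+dd command per node, followed by `wait`.
# $1: target directory on the NAS mount
# remaining arguments: node suffixes (01 02 ...)
bond_test_cmds() {
    dir="$1"
    shift
    for i in "$@"; do
        # Each command runs in the background (&) so that all nodes write at once.
        # conv=fsync makes dd flush the data to the NAS before it exits.
        printf 'ssh eureka%s dd if=/dev/zero of=%s/bondtest.%s bs=1M count=2048 conv=fsync &\n' \
            "$i" "$dir" "$i"
    done
    printf 'wait\n'
}

# Example: review the commands first, then pipe them to sh to run them.
#   bond_test_cmds /path/on/nas 01 02 03 04
#   bond_test_cmds /path/on/nas 01 02 03 04 | sh
```

Each dd reports its own throughput when it finishes; the sum across the four nodes indicates whether the bonded link exceeds what a single interface can deliver. Remember to delete the test files afterwards.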

Maintenance

If both the power and alert signal lights flash constantly:

  1. Turn the NAS off.

  2. Pull all the hard disks halfway out, then turn the NAS on again to see whether the power light still flashes constantly.
        -> Do not pull disks out while the machine is on; otherwise data will be lost.
        -> If the machine does not shut down after pressing the power button for 10 seconds, unplug the power directly.

  3. If the power and alert lights keep flashing, the motherboard has a problem.

  4. If drive reconnection errors happen frequently, run extended S.M.A.R.T. test:

    1. Open DSM
    2. Open storage manager
    3. Click HDD/SSD in the left column
    4. Choose one HDD to be tested.
    5. Click Health Info
    6. Click S.M.A.R.T. tab
    7. Click Extended test and Start
  5. If the warranty has expired, send the unit to 虹谷資訊 for maintenance:
    https://www.hongku.com.tw/
    5F., No. 1, Sec. 1, Chongqing N. Rd., Datong Dist., Taipei City (台北市大同區重慶北路一段1號5樓)

    If the warranty has not expired, send it to the store where we originally bought it.
    --> Label every disk with its slot index before pulling it out, as shown below. (very important!!!)

         -----------------------------------
         |           [power light]         |
         -----------------------------------
         |      1        |        2        |
         -----------------------------------
         |      3        |        4        |
         -----------------------------------
         |      5        |        6        |
         -----------------------------------
         |      7        |        8        |
         -----------------------------------
         |      9        |       10        |
         -----------------------------------
         |     11        |       12        |
         -----------------------------------
    

After the repair is completed:

  1. Insert the hard drives back in their original order (very important!!!)
  2. Boot
  3. Finish (eureka will mount automatically)

Rebuild disks array

Master machine

The operating system is spread across all disks in the master machine, so disks can only be replaced one at a time.

  1. Back up the data on the disk volume.
  2. Delete the target directory.
    Control Panel > Shared Folder > Delete
  3. Delete the volume.
    • ironman: Storage Manager > Volume > Delete
    • eater: Storage Manager > Volume > Delete
  4. [Optional] Delete or modify the disk group.
    • ironman: Storage Manager > Disk Group > Delete
    • eater: Storage Manager > Storage Pool > Delete
  5. [Optional] Rebuild the disk group.
    • ironman: Storage Manager > Disk Group > Create
    • eater: Storage Manager > Storage Pool > Create
      Choose RAID 6
  6. Rebuild the volume.
    • ironman: Storage Manager > Volume > Create
      Choose Btrfs as the file system
    • eater: Storage Manager > Volume > Create
      Choose Btrfs as the file system
  7. Once the new disk volume has been built, pull out one disk and replace it.
  8. Repair the disk array.
    • ironman: Storage Manager > Disk Group > Manage > Repair
    • eater: Storage Manager > Storage Pool > Action > Repair

    Repeat steps 7 and 8 until all disks are replaced.

Extension

  1. Back up the data on the disk volume.
  2. Delete the target directory.
    Control Panel > Shared Folder > Delete
  3. Delete the volume.
    • ironman: Storage Manager > Volume > Delete
    • eater: Storage Manager > Volume > Delete
  4. Delete or modify the disk group.
    • ironman: Storage Manager > Disk Group > Delete
    • eater: Storage Manager > Storage Pool > Delete
  5. Replace all disks in the NAS extension.
  6. Rebuild the disk group.
    • ironman: Storage Manager > Disk Group > Create
      Choose RAID 6
    • eater: Storage Manager > Storage Pool > Create
      Choose RAID 6
  7. Rebuild the volume.
    • ironman: Storage Manager > Volume > Create
      Choose Btrfs as the file system
    • eater: Storage Manager > Volume > Create
      Choose Btrfs as the file system

Add extension

ironman

  1. Connect the extension to the master and insert the disks.
  2. Create a new disk group: Storage Manager > Disk Group/Storage Pool > Create
    1. Choose the disks in the new extension.
    2. Choose RAID 6 as the array type.
  3. Create a new disk volume: Storage Manager > Volume > Create
    Choose Btrfs as the file system

eater

  1. Connect the extension to the master and insert the disks.
  2. Storage Manager > Volume/Storage Pool > Create
    1. Choose the disks in the new extension.
    2. Choose RAID 6.
  3. Create a new disk volume: Storage Manager > Volume > Create
    Choose Btrfs as the file system

NAS Data scrubbing

  1. Log in to DSM
  2. Open Storage Manager
  3. Storage Pool > Data Scrubbing > Action > Manual run
