Terascala Object Storage Servers

In a Lustre file system, the Object Storage Servers (OSS) manage the Object Storage Targets (OST) that store the file data objects. In the Terascala Lustre storage system, there are four (4x) Object Storage Servers, making up two (2x) active/active highly available pairs. Each Object Storage Server is a Dell PowerEdge R710 server, equipped with one quad-core Intel Westmere-EP Xeon processor E5620 at 2.40GHz and 24GB (12 x 4GB) memory.

Network

On each Object Storage Server, there are four (4x) Gigabit Ethernet interfaces, two (2x) 10-Gigabit Ethernet interfaces and one QDR InfiniBand HCA (host channel adapter):

01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
05:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
07:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
| node   | eth4 (10GbE) | ib0        | eth0 (Management #1) | eth0:1 (Management #2) | eth1 (Heartbeat) | IPMI         |
|--------|--------------|------------|----------------------|------------------------|------------------|--------------|
| oss-01 | 10.7.8.144   | 10.8.8.144 | 192.168.3.144        | 192.168.4.144          | 192.168.199.3    | 192.168.3.13 |
| oss-02 | 10.7.8.145   | 10.8.8.145 | 192.168.3.145        | 192.168.4.145          | 192.168.199.4    | 192.168.4.14 |
| oss-03 | 10.7.8.146   | 10.8.8.146 | 192.168.3.146        | 192.168.4.146          | 192.168.199.5    | 192.168.3.15 |
| oss-04 | 10.7.8.147   | 10.8.8.147 | 192.168.3.147        | 192.168.4.147          | 192.168.199.6    | 192.168.4.16 |
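
Lustre reaches these interfaces through LNET. Below is a minimal sketch of what the corresponding LNET module options could look like, assuming the OSS nodes serve Lustre over both the QDR InfiniBand fabric (ib0, o2ib) and the 10GbE network (eth4, tcp); this is an illustration, not the actual Terascala configuration:

# /etc/modprobe.d/lustre.conf (illustrative sketch; the real file was not inspected)
# Expose LNET on the InfiniBand fabric and, as a second NID, on 10GbE
options lnet networks="o2ib0(ib0),tcp0(eth4)"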

Storage

On each Object Storage Server, there are two (2x) LSI SAS 2008 SAS controllers and one LSI MegaRAID SAS 2108 RAID controller (rebranded as the Dell PERC H700):

03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
04:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
06:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

Each pair of Object Storage Servers is redundantly attached to a pair of Dell MD3220 storage arrays, and each MD3220 storage array is in turn attached to a Dell MD1200 storage array for capacity expansion.

There are in total 48x 2TB SAS drives in the 2x MD3220 storage arrays and 2x MD1200 storage arrays attached to each pair of Object Storage Servers. A RAID-5 virtual disk (volume) is created on every 6 physical disks, yielding 8 virtual disks per pair, and both Object Storage Servers can access all 8 of them. Here is a list of the block devices on oss-01:

# lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                        8:0    0 465.3G  0 disk  
|-sda1                     8:1    0  62.7M  0 part  /bootcom
|-sda2                     8:2    0  62.8M  0 part  
|-sda3                     8:3    0  62.8M  0 part  /boot
`-sda4                     8:4    0   465G  0 part  
  |-rootVG-A (dm-8)      253:8    0 226.5G  0 lvm   
  |-rootVG-B (dm-9)      253:9    0 224.2G  0 lvm   /
  |-rootVG-crash (dm-10) 253:10   0    10G  0 lvm   
  `-rootVG-swap (dm-11)  253:11   0     2G  0 lvm   [SWAP]
sdc                        8:32   0   9.1T  0 disk  
`-map01 (dm-7)           253:7    0   9.1T  0 mpath 
sdd                        8:48   0   9.1T  0 disk  
`-map02 (dm-4)           253:4    0   9.1T  0 mpath 
sde                        8:64   0   9.1T  0 disk  
`-map03 (dm-6)           253:6    0   9.1T  0 mpath 
sdb                        8:16   0   9.1T  0 disk  
`-map00 (dm-3)           253:3    0   9.1T  0 mpath 
sdg                        8:96   0   9.1T  0 disk  
`-map05 (dm-0)           253:0    0   9.1T  0 mpath 
sdi                        8:128  0   9.1T  0 disk  
`-map07 (dm-5)           253:5    0   9.1T  0 mpath 
sdh                        8:112  0   9.1T  0 disk  
`-map06 (dm-2)           253:2    0   9.1T  0 mpath 
sdf                        8:80   0   9.1T  0 disk  
`-map04 (dm-1)           253:1    0   9.1T  0 mpath 
sdk                        8:160  0   9.1T  0 disk  
`-map01 (dm-7)           253:7    0   9.1T  0 mpath 
sdl                        8:176  0   9.1T  0 disk  
`-map02 (dm-4)           253:4    0   9.1T  0 mpath 
sdm                        8:192  0   9.1T  0 disk  
`-map03 (dm-6)           253:6    0   9.1T  0 mpath 
sdn                        8:208  0   9.1T  0 disk  
`-map04 (dm-1)           253:1    0   9.1T  0 mpath 
sdj                        8:144  0   9.1T  0 disk  
`-map00 (dm-3)           253:3    0   9.1T  0 mpath 
sdp                        8:240  0   9.1T  0 disk  
`-map06 (dm-2)           253:2    0   9.1T  0 mpath 
sdq                       65:0    0   9.1T  0 disk  
`-map07 (dm-5)           253:5    0   9.1T  0 mpath 
sdo                        8:224  0   9.1T  0 disk  
`-map05 (dm-0)           253:0    0   9.1T  0 mpath
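
As a quick sanity check on those sizes (my own arithmetic, not part of the original listing): a 6-disk RAID-5 set of 2TB drives leaves 5 data disks, i.e. 5 x 2TB = 10TB, which is roughly 9.1 TiB and matches the 9.1T reported for every mpath device above.

# 5 data disks x 2 TB (decimal), expressed in TiB
echo 'scale=2; 5*2*10^12/2^40' | bc     # prints 9.09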

Among these, sdb & sdj are the same virtual disk seen along two different paths; likewise sdc & sdk, sdd & sdl, sde & sdm, sdf & sdn, sdg & sdo, sdh & sdp, and sdi & sdq. DM Multipath is used to coalesce the two redundant paths to each virtual disk, in order to provide continuous availability and higher-bandwidth connectivity[1] (a sketch of matching multipath.conf settings follows the output below):

# multipath -l
map03 (3690b11c00029d7be0000074451239c1c) dm-7 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:1:3 sdq 65:0  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:1:3 sdi 8:128 active undef running
map02 (3690b11c00029d6a50000093351239325) dm-1 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:1:2 sdh 8:112 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:1:2 sdp 8:240 active undef running
map01 (3690b11c00029d7be0000074051239bf4) dm-4 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:1:1 sdo 8:224 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:1:1 sdg 8:96  active undef running
map00 (3690b11c00029d6a50000092f51239301) dm-2 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:1:0 sdf 8:80  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:1:0 sdn 8:208 active undef running
map07 (3690b11c00029dad70000071c3f0f2b9a) dm-6 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:0:3 sdm 8:192 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:0:3 sde 8:64  active undef running
map06 (3690b11c00029d82b0000074551239361) dm-3 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:0:2 sdd 8:48  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:0:2 sdl 8:176 active undef running
map05 (3690b11c00029dad7000007183f0f2b72) dm-5 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:0:1 sdk 8:160 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:0:1 sdc 8:32  active undef running
map04 (3690b11c00029d82b000007415123933d) dm-0 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:0:0 sdb 8:16  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:0:0 sdj 8:144 active undef running
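
The output above is consistent with the stock RDAC handling that device-mapper-multipath applies to Dell MD32xx arrays. As a minimal sketch (based only on the output; the actual file on the OSS nodes was not examined), the relevant /etc/multipath.conf stanzas could look like this, with the map00 .. map07 names defined as aliases:

# /etc/multipath.conf (illustrative sketch)
devices {
        device {
                vendor                  "DELL"
                product                 "MD32xx"
                hardware_handler        "1 rdac"
                path_checker            rdac
                prio                    rdac
                path_grouping_policy    group_by_prio
                failback                immediate
                features                "2 pg_init_retries 50"
                no_path_retry           30
        }
}
multipaths {
        multipath {
                # WWID copied from the map04 entry in the output above
                wwid    3690b11c00029d82b000007415123933d
                alias   map04
        }
        # ... one such block per volume, map00 through map07
}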

There are in total 8x multipath volumes in each OSS pair. During normal operation, only half (4 volumes) are mounted on each OSS. On oss-01 (/proc/mounts):

/dev/mapper/map01 /mnt/ost1 lustre ro 0 0
/dev/mapper/map04 /mnt/ost4 lustre ro 0 0
/dev/mapper/map05 /mnt/ost5 lustre ro 0 0
/dev/mapper/map00 /mnt/ost0 lustre ro 0 0

And on oss-02:

/dev/mapper/map02 /mnt/ost2 lustre ro 0 0
/dev/mapper/map03 /mnt/ost3 lustre ro 0 0
/dev/mapper/map06 /mnt/ost6 lustre ro 0 0
/dev/mapper/map07 /mnt/ost7 lustre ro 0 0

When one OSS in a pair fails, the surviving OSS mounts all 8 multipath volumes (active/active high availability).
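
For reference, each OST is simply a Lustre-formatted multipath volume, so bringing one online by hand (rather than through Terascala's tooling) is a plain Lustre mount, and failover amounts to mounting the partner's volumes on the surviving node. A sketch:

# Normal operation on oss-01: mount its own OSTs (two of the four shown)
mount -t lustre /dev/mapper/map00 /mnt/ost0
mount -t lustre /dev/mapper/map01 /mnt/ost1

# Failover: if oss-02 fails, oss-01 additionally mounts oss-02's OSTs
mount -t lustre /dev/mapper/map02 /mnt/ost2
mount -t lustre /dev/mapper/map03 /mnt/ost3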

Active/Active High-Availability

Presumably, each of the two OSS pairs forms an active/active highly available cluster[2]. However, although Heartbeat 1.2.4 is installed on all the Object Storage Servers, it doesn't appear to be running!

# /etc/init.d/heartbeat status
heartbeat is stopped. No process

Perhaps Terascala uses /etc/init.d/tsdisk to mount and fail over the Lustre OSTs (Object Storage Targets)?
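
One quick way to see which of the two mechanisms is actually wired into the boot sequence (a diagnostic suggestion, not output captured from the original system):

# Which init scripts are enabled at boot? (RHEL/CentOS chkconfig syntax)
chkconfig --list heartbeat
chkconfig --list tsdisk

# And which one is running right now?
service heartbeat status
service tsdisk status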

NRPE

Each Object Storage Server runs NRPE, an agent for the Nagios server running on the Terascala Management Server. NRPE (Nagios Remote Plugin Executor) is a Nagios agent that allows remote system monitoring using scripts hosted on the remote systems. It allows for monitoring of resources such as disk usage, system load or the number of users currently logged in. The Nagios server periodically polls the agents on the remote systems using the check_nrpe plugin[3][4].

The configuration file for NRPE is /usr/local/nagios/etc/nrpe.cfg.
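
The checks themselves are defined as command[...] entries in nrpe.cfg and invoked from the Nagios server with check_nrpe. A hedged example using the stock check_load check (the host name and the specific checks Terascala defines are assumptions, not taken from the actual configuration):

# On the OSS, in /usr/local/nagios/etc/nrpe.cfg (stock-style entry):
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

# From the Nagios server on the Terascala Management Server:
/usr/local/nagios/libexec/check_nrpe -H oss-01 -c check_load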

gmond

Each Object Storage Server also runs gmond, the Ganglia Monitoring Daemon, which reports to the Ganglia Meta Daemon (gmetad) running on the Terascala Management Server. The configuration file for gmond is /usr/local/terascala/etc/ganglia/gmond.conf.
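
In a typical unicast Ganglia setup, gmond on each OSS simply sends its metrics to the management server, where gmetad polls them. A minimal sketch of the relevant gmond.conf sections follows; the cluster name and the tms host name are assumptions for illustration:

/* excerpt from gmond.conf (illustrative sketch) */
cluster {
  name = "Terascala OSS"          /* assumed cluster name */
}
udp_send_channel {
  host = tms                      /* hypothetical name for the Terascala Management Server */
  port = 8649
}
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {
  port = 8649
}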

References

  1. Red Hat Enterprise Linux 6 DM Multipath
  2. Dell Lustre Storage System Whitepaper
  3. NRPE Documentation
  4. How to Add Linux Host to Nagios Monitoring Server Using NRPE Plugin