Terascala Object Storage Servers
In a Lustre file system, the Object Storage Servers (OSS) manage the Object Storage Targets (OSTs) that store the file data objects. The Terascala Lustre storage has four (4x) Object Storage Servers, which form two (2x) active/active highly available pairs. Each Object Storage Server is a Dell PowerEdge R710 server, equipped with one quad-core Intel Xeon E5620 (Westmere-EP) processor at 2.40 GHz and 24GB (12 x 4GB) of memory.
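For reference, the OSTs managed by these servers appear as individual storage targets on the Lustre clients. A minimal sketch of inspecting them from a client, assuming the file system is already mounted (the /pfs mount point below is only a placeholder):

```
# Per-OST capacity and usage, as seen from a Lustre client (placeholder mount point).
lfs df -h /pfs

# Lustre device list (MGC/MDC/OSC entries) known to this node.
lctl dl
```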
On each Object Storage Server, there are four (4x) Gigabit Ethernet interfaces, two (2x) 10-Gigabit Ethernet interfaces and one QDR InfiniBand HCA (host channel adapter):
```
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
05:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
07:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
```
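The listing above is just the network-related subset of lspci; it can be reproduced with a simple filter:

```
# Show only the Ethernet and InfiniBand devices on the PCI bus.
lspci | grep -Ei 'ethernet|infiniband'
```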
node | eth4 (10GbE) | ib0 | eth0 (Management #1) | eth0:1 (Management #2) | eth1 (Heartbeat) | IPMI |
---|---|---|---|---|---|---|
oss-01 | 10.7.8.144 | 10.8.8.144 | 192.168.3.144 | 192.168.4.144 | 192.168.199.3 | 192.168.3.13 |
oss-02 | 10.7.8.145 | 10.8.8.145 | 192.168.3.145 | 192.168.4.145 | 192.168.199.4 | 192.168.4.14 |
oss-03 | 10.7.8.146 | 10.8.8.146 | 192.168.3.146 | 192.168.4.146 | 192.168.199.5 | 192.168.3.15 |
oss-04 | 10.7.8.147 | 10.8.8.147 | 192.168.3.147 | 192.168.4.147 | 192.168.199.6 | 192.168.4.16 |
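The IPMI addresses in the last column are reachable out-of-band from the management network. A quick sketch of querying a BMC with ipmitool, assuming IPMI-over-LAN is enabled (the user name and password below are placeholders for the real credentials):

```
# Query the chassis power state of oss-01's BMC (IPMI address from the table above).
ipmitool -I lanplus -H 192.168.3.13 -U ADMIN -P PASSWORD chassis status
```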
On each Object Storage Server, there are two (2x) LSI SAS 2008 SAS controllers and one LSI MegaRAID SAS 2108 RAID controller (rebranded as the Dell PERC H700):
```
03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
04:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
06:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
```
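To see which kernel drivers are bound to these controllers, lspci can be queried per device; a quick check using the PCI addresses from the listing above:

```
# Show the kernel driver in use for the H700 (MegaRAID) and the two SAS2008 HBAs.
for dev in 03:00.0 04:00.0 06:00.0; do
    lspci -k -s "$dev"
done
```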
Each pair of Object Storage Servers redundantly attaches to a pair of Dell MD3220 Storage Arrays, and each MD3220 Storage Array in turn attaches to a Dell MD1200 Storage Array for storage expansion.
There are a total of 48x 2TB SAS drives in the 2x MD3220 and 2x MD1200 Storage Arrays attached to each pair of Object Storage Servers. A RAID-5 virtual disk (volume) is created on every 6 physical disks, giving 8 virtual disks per pair. Both Object Storage Servers in a pair can access all 8 virtual disks. Here is a list of the block devices on oss-01:
```
# lsblk
NAME                    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                       8:0    0 465.3G  0 disk
|-sda1                    8:1    0  62.7M  0 part  /bootcom
|-sda2                    8:2    0  62.8M  0 part
|-sda3                    8:3    0  62.8M  0 part  /boot
`-sda4                    8:4    0   465G  0 part
  |-rootVG-A (dm-8)     253:8    0 226.5G  0 lvm
  |-rootVG-B (dm-9)     253:9    0 224.2G  0 lvm   /
  |-rootVG-crash (dm-10) 253:10  0    10G  0 lvm
  `-rootVG-swap (dm-11)  253:11  0     2G  0 lvm   [SWAP]
sdc                       8:32   0   9.1T  0 disk
`-map01 (dm-7)          253:7    0   9.1T  0 mpath
sdd                       8:48   0   9.1T  0 disk
`-map02 (dm-4)          253:4    0   9.1T  0 mpath
sde                       8:64   0   9.1T  0 disk
`-map03 (dm-6)          253:6    0   9.1T  0 mpath
sdb                       8:16   0   9.1T  0 disk
`-map00 (dm-3)          253:3    0   9.1T  0 mpath
sdg                       8:96   0   9.1T  0 disk
`-map05 (dm-0)          253:0    0   9.1T  0 mpath
sdi                       8:128  0   9.1T  0 disk
`-map07 (dm-5)          253:5    0   9.1T  0 mpath
sdh                       8:112  0   9.1T  0 disk
`-map06 (dm-2)          253:2    0   9.1T  0 mpath
sdf                       8:80   0   9.1T  0 disk
`-map04 (dm-1)          253:1    0   9.1T  0 mpath
sdk                       8:160  0   9.1T  0 disk
`-map01 (dm-7)          253:7    0   9.1T  0 mpath
sdl                       8:176  0   9.1T  0 disk
`-map02 (dm-4)          253:4    0   9.1T  0 mpath
sdm                       8:192  0   9.1T  0 disk
`-map03 (dm-6)          253:6    0   9.1T  0 mpath
sdn                       8:208  0   9.1T  0 disk
`-map04 (dm-1)          253:1    0   9.1T  0 mpath
sdj                       8:144  0   9.1T  0 disk
`-map00 (dm-3)          253:3    0   9.1T  0 mpath
sdp                       8:240  0   9.1T  0 disk
`-map06 (dm-2)          253:2    0   9.1T  0 mpath
sdq                      65:0    0   9.1T  0 disk
`-map07 (dm-5)          253:5    0   9.1T  0 mpath
sdo                       8:224  0   9.1T  0 disk
`-map05 (dm-0)          253:0    0   9.1T  0 mpath
```
Among those, sdb & sdj are the same virtual disk, reached along 2 different paths; so are sdc & sdk, sdd & sdl, sde & sdm, sdf & sdn, sdg & sdo, sdh & sdp, and sdi & sdq, respectively. DM Multipath is used to combine the 2 redundant paths to each virtual disk, in order to provide continuous availability and higher-bandwidth connectivity[1].
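Incidentally, the 9.1T size reported by lsblk is consistent with the RAID-5 layout described above: each virtual disk spans 6 drives, of which 5 hold data, and 5 x 2 TB (decimal) is roughly 9.1 TiB (binary). A quick sanity check with bc:

```
# 5 data drives x 2 TB (10^12 bytes), expressed in TiB (2^40 bytes).
echo 'scale=2; 5 * 2 * 10^12 / 2^40' | bc
# 9.09
```

The multipath topology on oss-01 shows how the two paths to each virtual disk are grouped and prioritized: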
```
# multipath -l
map03 (3690b11c00029d7be0000074451239c1c) dm-7 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:1:3 sdq 65:0   active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:1:3 sdi 8:128  active undef running
map02 (3690b11c00029d6a50000093351239325) dm-1 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:1:2 sdh 8:112  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:1:2 sdp 8:240  active undef running
map01 (3690b11c00029d7be0000074051239bf4) dm-4 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:1:1 sdo 8:224  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:1:1 sdg 8:96   active undef running
map00 (3690b11c00029d6a50000092f51239301) dm-2 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:1:0 sdf 8:80   active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:1:0 sdn 8:208  active undef running
map07 (3690b11c00029dad70000071c3f0f2b9a) dm-6 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:0:3 sdm 8:192  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:0:3 sde 8:64   active undef running
map06 (3690b11c00029d82b0000074551239361) dm-3 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:0:2 sdd 8:48   active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:0:2 sdl 8:176  active undef running
map05 (3690b11c00029dad7000007183f0f2b72) dm-5 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:0:1 sdk 8:160  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:0:1 sdc 8:32   active undef running
map04 (3690b11c00029d82b000007415123933d) dm-0 DELL,MD32xx
size=9.1T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:0:0 sdb 8:16   active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 2:0:0:0 sdj 8:144  active undef running
```
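The 'rdac' hardware handler and the two prioritized path groups shown above are normally declared in /etc/multipath.conf. Below is a minimal sketch of a device stanza for the MD32xx arrays, pieced together from the attributes visible in the output; the exact settings shipped by Terascala may differ.

```
# Hypothetical /etc/multipath.conf excerpt for the DELL MD32xx arrays.
devices {
    device {
        vendor                 "DELL"
        product                "MD32xx"
        hardware_handler       "1 rdac"
        path_grouping_policy   group_by_prio
        prio                   rdac
        path_checker           rdac
        path_selector          "round-robin 0"
        features               "2 pg_init_retries 50"
        failback               immediate
    }
}
```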
There are a total of 8 multipath volumes in each OSS pair. During normal operation, only half of them (4 volumes) are mounted on each OSS. On oss-01 (/proc/mounts):
```
/dev/mapper/map01 /mnt/ost1 lustre ro 0 0
/dev/mapper/map04 /mnt/ost4 lustre ro 0 0
/dev/mapper/map05 /mnt/ost5 lustre ro 0 0
/dev/mapper/map00 /mnt/ost0 lustre ro 0 0
```
And on oss-02:
```
/dev/mapper/map02 /mnt/ost2 lustre ro 0 0
/dev/mapper/map03 /mnt/ost3 lustre ro 0 0
/dev/mapper/map06 /mnt/ost6 lustre ro 0 0
/dev/mapper/map07 /mnt/ost7 lustre ro 0 0
```
When one OSS in a pair fails, the other will mount all 8 multipath volumes (active/active high availability).
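Such a failover amounts to mounting the partner's OSTs on the surviving server. A hand-driven sketch of the idea, assuming oss-02 has failed and its four OSTs are to be taken over by oss-01 (on the appliance this is expected to be handled by Terascala's own scripts rather than run by hand):

```
# On oss-01: mount the OSTs normally served by oss-02 (illustration only).
for n in 2 3 6 7; do
    mkdir -p /mnt/ost${n}
    mount -t lustre /dev/mapper/map0${n} /mnt/ost${n}
done
```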
Presumably, each of the two OSS pairs forms an active/active highly available cluster[2]. However, although Heartbeat 1.2.4 is installed on all the Object Storage Servers, it does not appear to be running!
```
# /etc/init.d/heartbeat status
heartbeat is stopped. No process
```
Perhaps Terascala uses /etc/init.d/tsdisk to mount and fail over the Lustre OSTs (Object Storage Targets)?
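One way to tell which mechanism is actually wired in is to check which init scripts are enabled at boot; a quick sketch, assuming the usual RHEL/CentOS-style SysV init setup:

```
# List the boot-time (chkconfig) settings for heartbeat and the Terascala tsdisk script.
chkconfig --list | grep -E 'heartbeat|tsdisk'
```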
Each Object Storage Server runs NRPE, an agent for the Nagios server running on the Terascala Management Server. NRPE (Nagios Remote Plugin Executor) is a Nagios agent that allows remote system monitoring using scripts hosted on the remote systems. It allows for monitoring of resources such as disk usage, system load, or the number of users currently logged in. The Nagios server periodically polls the agent on each remote system using the check_nrpe plugin[3][4].
The configuration file for NRPE is /usr/local/nagios/etc/nrpe.cfg.
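From the Nagios server, the commands defined in nrpe.cfg are invoked remotely via the check_nrpe plugin. A sketch, assuming the standard Nagios plugin directory and a check_load command defined in the agent's nrpe.cfg (both are assumptions about this particular setup):

```
# Ask the NRPE agent on oss-01 to run its locally defined check_load command.
/usr/local/nagios/libexec/check_nrpe -H oss-01 -c check_load
```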
Each Object Storage Server also runs gmond, the Ganglia Monitoring Daemon; the Ganglia Meta Daemon (gmetad) runs on the Terascala Management Server. The configuration file for gmond is /usr/local/terascala/etc/ganglia/gmond.conf.
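By default, gmond publishes its metrics as an XML stream on TCP port 8649, which gmetad on the management server polls. A quick way to inspect that stream by hand, assuming the default port has not been changed in gmond.conf:

```
# Dump the beginning of the XML metric stream published by gmond on oss-01.
nc oss-01 8649 | head -n 20
```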