Terascala Metadata Servers

Each server in the Terascala Metadata Server pair is actually a combined Lustre Management Server (MGS) and Metadata Server (MDS). The Management Server (MGS) stores configuration information for all the Lustre file systems in a cluster and provides this information to other Lustre components; each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information. The Metadata Server (MDS) provides metadata services for a file system and manages a Metadata Target (MDT) that stores the file metadata[1].
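
For example, a Lustre client mounts the file system by contacting the MGS at its NID(s). A hedged sketch of the client-side mount, assuming the ib0 addresses listed in the table below, an o2ib LNet network, and a mount point of /mnt/pfs (none of which are taken verbatim from the actual system):

# mount -t lustre 10.8.8.142@o2ib:10.8.8.143@o2ib:/pfs /mnt/pfs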

In the Terascala Lustre Storage, the Metadata Server pair consists of two Dell PowerEdge R710 servers configured as an active/passive highly available cluster. Each Metadata Server is equipped with two (2x) quad-core Intel Xeon E5620 (Westmere-EP) processors at 2.40 GHz and 48 GB (12 x 4 GB) of memory.

Network

On each Metadata Server, there are four (4x) Gigabit Ethernet interfaces, two (2x) 10-Gigabit Ethernet interfaces and one QDR InfiniBand HCA (host channel adapter):

01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
05:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
07:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

| node   | eth4 (10GbE) | ib0        | eth0 (Management #1) | eth0:1 (Management #2) | eth1 (Heartbeat) | IPMI         |
|--------|--------------|------------|----------------------|------------------------|------------------|--------------|
| mds-01 | 10.7.8.142   | 10.8.8.142 | 192.168.3.142        | 192.168.4.142          | 192.168.199.1    | 192.168.3.11 |
| mds-02 | 10.7.8.143   | 10.8.8.143 | 192.168.3.143        | 192.168.4.143          | 192.168.199.2    | 192.168.4.12 |
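
The InfiniBand and 10-Gigabit Ethernet interfaces above are typically exposed to Lustre through LNet. Below is a hedged sketch of the corresponding module configuration; the file path and the LNet network names (o2ib0, tcp1) are assumptions for illustration, not values read off the actual system:

# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib0(ib0),tcp1(eth4)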

Storage

On each Metadata Server, there is an LSI SAS2008 SAS controller and an LSI MegaRAID SAS 2108 RAID controller (rebranded as the Dell PERC H700):

03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
06:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

Here is a list of the block devices:

# lsblk
NAME                    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                       8:0    0 465.3G  0 disk  
|-sda1                    8:1    0  62.7M  0 part  /bootcom
|-sda2                    8:2    0  62.8M  0 part  
|-sda3                    8:3    0  62.8M  0 part  /boot
`-sda4                    8:4    0   465G  0 part  
  |-rootVG-A (dm-1)     253:1    0 226.5G  0 lvm   
  |-rootVG-B (dm-2)     253:2    0 224.2G  0 lvm   /
  |-rootVG-crash (dm-3) 253:3    0    10G  0 lvm   
  `-rootVG-swap (dm-4)  253:4    0     2G  0 lvm   [SWAP]
sdc                       8:32   0   5.5T  0 disk  
`-map00 (dm-0)          253:0    0   5.5T  0 mpath 
sdb                       8:16   0   5.5T  0 disk  
`-map00 (dm-0)          253:0    0   5.5T  0 mpath 

On each Metadata Server, there are two (2x) 500 GB SATA hard drives attached to the PERC H700 in a RAID-1 (mirror) configuration; the RAID-1 virtual disk appears as sda on the Metadata Server. As on the Terascala Management Server, there are 2 root partitions (/dev/mapper/rootVG-A and /dev/mapper/rootVG-B) and 2 boot partitions (/boot:A and /boot:B). This clever partitioning scheme facilitates A/B upgrades of the Lustre servers.
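
A quick way to confirm which root image is currently active is to check what is mounted at / (per the lsblk listing above, rootVG-B is the active root on this server):

# mount | grep ' on / '        # expected to show /dev/mapper/rootVG-B on /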

Both Metadata Servers connect to a shared Dell MD3220 Storage Array, but only the active Metadata Server mounts the storage array.

On each Metadata Server, there are 2 external SAS ports, both connected to the same MD3220 Storage Array, so there are 2 paths to the same block device (a RAID-10 virtual disk on the MD3220 Storage Array), which shows up as both sdb and sdc on the Metadata Servers. DM Multipath is used to combine the 2 paths, providing continuous availability and higher-bandwidth connectivity between the active Metadata Server and the MD3220 Storage Array[2].

# multipath -l
map00 (36d4ae520009f050a000005255123a2ea) dm-0 DELL,MD32xx
size=5.5T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 1:0:0:0 sdb 8:16 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:1:0 sdc 8:32 active undef running
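
A minimal /etc/multipath.conf device stanza consistent with the output above might look like the following. This is an illustrative sketch based on the stock RDAC settings for the MD32xx family, not a copy of the actual Terascala configuration:

devices {
    device {
        vendor                  "DELL"
        product                 "MD32xx"
        path_grouping_policy    group_by_prio
        prio                    rdac
        path_checker            rdac
        hardware_handler        "1 rdac"
        features                "2 pg_init_retries 50"
        failback                immediate
        no_path_retry           30
    }
}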

A combined MGS/MDT file system is created on the multipath block device /dev/mapper/map00, possibly with:

# mkfs.lustre --fsname=pfs --mgs --mdt --index=0 /dev/mapper/map00

and mounted at /mnt/mdt1:

# mount -t lustre /dev/mapper/map00 /mnt/mdt1
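
Since the MDT fails over between the two servers, the target was presumably also formatted with (or later tuned, via tunefs.lustre, to include) the partner's NID as a failover node. A sketch, with the NID assumed from the ib0 addresses above (newer Lustre releases would use --servicenode instead of --failnode):

# mkfs.lustre --fsname=pfs --mgs --mdt --index=0 --failnode=10.8.8.143@o2ib /dev/mapper/map00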

List all Lustre targets registered with the MGS:

# cat /proc/fs/lustre/mgs/MGS/live/pfs 
fsname: pfs
flags: 0x30     gen: 84
pfs-MDT0000
pfs-OST0000
pfs-OST0001
pfs-OST0002
pfs-OST0003
pfs-OST0004
pfs-OST0005
pfs-OST0006
pfs-OST0007
pfs-OST0008
pfs-OST0009
pfs-OST000a
pfs-OST000b
pfs-OST000c
pfs-OST000d
pfs-OST000e
pfs-OST000f

Secure RPC Config Rules:

Active/Passive High-Availability

The two Metadata Servers form an active/passive highly available cluster. They use a shared Dell MD3220 Storage Array as the Terascala Metadata Target, rather than DRBD; the latter approach would introduce roughly a 10% performance overhead[3], because DRBD (network-based RAID-1) mirrors block devices between the two servers over a dedicated network.

The HA software used on the Terascala Metadata Servers is heartbeat 1.2.4, installed as RPM packages.

This is an old version, which bundles the messaging layer, the Local Resource Manager, the plumbing infrastructure and STONITH, the Resource Agents, and the Cluster Resource Manager in a single package[4][5]. Modern HA stacks (later than Heartbeat 2.1.4) split those components into separate packages, e.g.[6]:
  • Pacemaker - Resource manager
  • Heartbeat - Messaging layer
  • Corosync - another Messaging layer
  • Resource Agents - Scripts that know how to control various services
The configuration files for heartbeat 1.2.4 are located in /etc/ha.d. Notably, the resource line in /etc/ha.d/haresources,

ts-pfs-01 ts_do_lustre_mds

uses the script /etc/ha.d/resource.d/ts_do_lustre_mds to perform MDS failover.
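
For reference, a Heartbeat 1.x cluster of this shape would also be defined by /etc/ha.d/ha.cf and /etc/ha.d/authkeys. The following ha.cf is an illustrative sketch only, using eth1 as the dedicated heartbeat link; the second node name (ts-pfs-02) and all timing values are assumptions, not copied from the actual file:

# cat /etc/ha.d/ha.cf
logfacility     local0
keepalive       2              # heartbeat interval, in seconds
deadtime        30             # declare a node dead after 30s of silence
initdead        60             # allow extra time at boot
udpport         694
bcast           eth1           # dedicated heartbeat network (192.168.199.x)
auto_failback   off
node            ts-pfs-01
node            ts-pfs-02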

NRPE

Each Metadata Server runs NRPE, an agent for the Nagios server running on the Terascala Management Server. NRPE (Nagios Remote Plugin Executor) is a Nagios agent that allows remote systems to be monitored using scripts hosted on those systems, covering resources such as disk usage, system load, or the number of users currently logged in. The Nagios server periodically polls the agent on each remote system using the check_nrpe plugin[7][8].

The configuration file for NRPE is /usr/local/nagios/etc/nrpe.cfg.
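
As an illustration of how this fits together (the specific check name and thresholds below are assumptions, not copied from the actual nrpe.cfg), a command definition in /usr/local/nagios/etc/nrpe.cfg might look like:

command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20

and the Nagios server on the Terascala Management Server would poll it with something like:

# /usr/local/nagios/libexec/check_nrpe -H 192.168.3.142 -c check_load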

gmond

Each Metadata Server also runs gmond, the Ganglia Monitoring Daemon, which reports to the Ganglia Meta Daemon (gmetad) running on the Terascala Management Server. The configuration file for gmond is /usr/local/terascala/etc/ganglia/gmond.conf.
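
A hedged sketch of the relevant gmond.conf sections; the cluster name and the gmetad host address are placeholders for illustration, and the actual values live in the file named above:

cluster {
  name = "pfs"                 /* assumed cluster name */
}
udp_send_channel {
  host = 192.168.3.2           /* assumed address of the Terascala Management Server */
  port = 8649
}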

References

  1. ^ Lustre 2.x Operations Manual
  2. ^ Red Hat Enterprise Linux 6 DM Multipath
  3. ^ Dell Lustre Storage System Whitepaper
  4. ^ Heartbeat
  5. ^ Configuring A High Availability Cluster (Heartbeat) On CentOS
  6. ^ Pacemaker, Heartbeat, Corosync, WTF?
  7. ^ NRPE Documentation
  8. ^ How to Add Linux Host to Nagios Monitoring Server Using NRPE Plugin