Terascala Metadata Servers - shawfdong/hyades GitHub Wiki
Each server in the Terascala Metadata Server pair is actually a combined Lustre Management Server (MGS) and Metadata Server (MDS). The MGS stores configuration information for all the Lustre file systems in a cluster and provides this information to other Lustre components: each Lustre target contacts the MGS to provide information, and Lustre clients contact the MGS to retrieve information. The MDS provides metadata services for a file system and manages a Metadata Target (MDT) that stores the file metadata[1].
In the Terascala Lustre Storage, the Metadata Server pair consists of two Dell PowerEdge R710 servers configured as an active/passive highly available cluster. Each Metadata Server is equipped with two (2x) quad-core Intel Xeon E5620 (Westmere-EP) processors at 2.40GHz and 48GB (12 x 4GB) memory.
On each Metadata Server, there are four (4x) Gigabit Ethernet interfaces, two (2x) 10-Gigabit Ethernet interfaces and one QDR InfiniBand HCA (host channel adapter):

    01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    05:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
    05:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
    07:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

| node | eth4 (10GbE) | ib0 | eth0 (Management #1) | eth0:1 (Management #2) | eth1 (Heartbeat) | IPMI | 
|---|---|---|---|---|---|---|
| mds-01 | 10.7.8.142 | 10.8.8.142 | 192.168.3.142 | 192.168.4.142 | 192.168.199.1 | 192.168.3.11 | 
| mds-02 | 10.7.8.143 | 10.8.8.143 | 192.168.3.143 | 192.168.4.143 | 192.168.199.2 | 192.168.4.12 | 
On each Metadata Server, there is an LSI SAS2008 SAS controller and an LSI MegaRAID SAS 2108 RAID controller (rebranded as the Dell PERC H700):

    03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)
    06:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

Here is a list of the block devices:

    # lsblk
    NAME                    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    sda                       8:0    0 465.3G  0 disk
    |-sda1                    8:1    0  62.7M  0 part  /bootcom
    |-sda2                    8:2    0  62.8M  0 part
    |-sda3                    8:3    0  62.8M  0 part  /boot
    `-sda4                    8:4    0   465G  0 part
      |-rootVG-A (dm-1)     253:1    0 226.5G  0 lvm
      |-rootVG-B (dm-2)     253:2    0 224.2G  0 lvm   /
      |-rootVG-crash (dm-3) 253:3    0    10G  0 lvm
      `-rootVG-swap (dm-4)  253:4    0     2G  0 lvm   [SWAP]
    sdc                       8:32   0   5.5T  0 disk
    `-map00 (dm-0)          253:0    0   5.5T  0 mpath
    sdb                       8:16   0   5.5T  0 disk
    `-map00 (dm-0)          253:0    0   5.5T  0 mpath

On each Metadata Server, two (2x) 500GB SATA hard drives are attached to the PERC H700 in a RAID-1 (mirror) configuration; the mirror appears as sda. As on the Terascala Management Server, there are 2 root partitions (/dev/mapper/rootVG-A and /dev/mapper/rootVG-B) and 2 boot partitions (/boot:A and /boot:B). This clever partition scheme facilitates A/B upgrades of the Lustre servers.
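The idea behind the A/B layout can be illustrated with a small sketch: an upgrade would be installed on whichever root volume is not currently mounted at /, leaving the running system untouched. The helper name `other_root` is hypothetical, and how Terascala's own upgrade tooling selects the target volume is not documented here; only the two volume names are taken from the listing above.

```shell
#!/bin/sh
# Hypothetical helper: given the device currently mounted at /,
# print the other (inactive) root volume of the A/B pair.
other_root() {
    case "$1" in
        /dev/mapper/rootVG-A) echo /dev/mapper/rootVG-B ;;
        /dev/mapper/rootVG-B) echo /dev/mapper/rootVG-A ;;
        *) echo "unexpected root device: $1" >&2; return 1 ;;
    esac
}

# In the lsblk listing above, / is on rootVG-B, so the inactive root is rootVG-A:
other_root /dev/mapper/rootVG-B
```

The last line prints `/dev/mapper/rootVG-A`.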
Both Metadata Servers connect to a shared Dell MD3220 Storage Array, but only the active Metadata Server mounts the storage array.
On each Metadata Server, there are 2 external SAS ports, both connected to the same MD3220 Storage Array. There are therefore 2 paths to the same block device (a RAID-10 virtual disk on the MD3220 Storage Array), which shows up as both sdb and sdc on the Metadata Servers. DM Multipath combines the 2 paths into a single device to provide continuous availability and higher-bandwidth connectivity between the active Metadata Server and the MD3220 Storage Array[2].

    # multipath -l
    map00 (36d4ae520009f050a000005255123a2ea) dm-0 DELL,MD32xx
    size=5.5T features='2 pg_init_retries 50' hwhandler='1 rdac' wp=rw
    |-+- policy='round-robin 0' prio=0 status=active
    | `- 1:0:0:0 sdb 8:16 active undef running
    `-+- policy='round-robin 0' prio=0 status=enabled
      `- 1:0:1:0 sdc 8:32 active undef running

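To see at a glance which path belongs to which path group, the `multipath -l` output can be summarised with a small filter. The following is a minimal sketch. The function name `summarize_paths` is hypothetical, and it assumes the layout shown above: a `status=` field on each path-group line, followed by path lines of the form `H:C:T:L device major:minor ...`.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <path-group status>" for every path
# found in `multipath -l` output read from stdin.
summarize_paths() {
    awk '
        # Path-group line: remember the group status (active, enabled, ...).
        /status=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^status=/) { split($i, kv, "="); group = kv[2] }
            next
        }
        # Path line: the field after the H:C:T:L address is the device name.
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) print $(i + 1), group
        }
    '
}

# On a Metadata Server:
#   multipath -l | summarize_paths
```

For the output above, this would report sdb in the `active` path group and sdc in the `enabled` (standby) group.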
A combined MGS/MDT file system is created on the multipath block device /dev/mapper/map00, possibly with:

    # mkfs.lustre --fsname=pfs --mgs --mdt --index=0 /dev/mapper/map00

and mounted at /mnt/mdt1:

    # mount -t lustre /dev/mapper/map00 /mnt/mdt1

List all Lustre targets registered with the MGS for the pfs file system:

    # cat /proc/fs/lustre/mgs/MGS/live/pfs
    fsname: pfs
    flags: 0x30     gen: 84
    pfs-MDT0000
    pfs-OST0000
    pfs-OST0001
    pfs-OST0002
    pfs-OST0003
    pfs-OST0004
    pfs-OST0005
    pfs-OST0006
    pfs-OST0007
    pfs-OST0008
    pfs-OST0009
    pfs-OST000a
    pfs-OST000b
    pfs-OST000c
    pfs-OST000d
    pfs-OST000e
    pfs-OST000f

    Secure RPC Config Rules:

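As a quick sanity check, the targets in this listing can be counted rather than read by eye. This is a minimal sketch: the function name `count_targets` is hypothetical, and it assumes target names of the form `<fsname>-MDTxxxx` and `<fsname>-OSTxxxx` with a hexadecimal index, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: count the MDT and OST entries in the MGS "live"
# listing read from stdin.
count_targets() {
    awk '
        /-MDT[0-9a-f]+$/ { mdt++ }
        /-OST[0-9a-f]+$/ { ost++ }
        END { printf "MDT=%d OST=%d\n", mdt, ost }
    '
}

# On the active Metadata Server:
#   count_targets < /proc/fs/lustre/mgs/MGS/live/pfs
```

For the listing above, the result would be `MDT=1 OST=16` (OST indices 0000 through 000f).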
The two Metadata Servers form an active/passive highly available cluster. They use a shared Dell MD3220 Storage Array as the Terascala Metadata Target, rather than DRBD. The latter approach would introduce an overhead of about 10%[3], because DRBD (network-based RAID-1) mirrors the block devices of the two servers over a dedicated network.
The HA software used on the Terascala Metadata Servers is heartbeat 1.2.4, including the following RPM packages:
- heartbeat-1.2.4
- heartbeat-pils-1.2.4: a general plugin and interface loading library for heartbeat
- heartbeat-stonith-1.2.4: STONITH (Shoot The Other Node In The Head) module for heartbeat

More generally, a Linux-HA stack is built from the following components:

- Pacemaker - Resource manager
- Heartbeat - Messaging layer
- Corosync - another Messaging layer
- Resource Agents - Scripts that know how to control various services

    ts-pfs-01 ts_do_lustre_mds

Heartbeat uses the script /etc/ha.d/resource.d/ts_do_lustre_mds to perform MDS failover.
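Heartbeat 1.x resource scripts are conventionally invoked with `start`, `stop` or `status` as their argument. The sketch below shows what such a script could look like for the MDT. It is purely illustrative and is not the Terascala ts_do_lustre_mds script, whose contents are not shown on this page; the function names `run` and `mds_resource` and the `DRY_RUN` switch are hypothetical. Only the device and mount point are taken from the text above.

```shell
#!/bin/sh
# Hypothetical sketch of a Heartbeat v1 style resource script for the MDT.
# NOT the actual ts_do_lustre_mds script.
DEVICE=/dev/mapper/map00   # multipath device from this page
MOUNTPOINT=/mnt/mdt1       # mount point from this page

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

mds_resource() {
    case "$1" in
        start)  run mount -t lustre "$DEVICE" "$MOUNTPOINT" ;;
        stop)   run umount "$MOUNTPOINT" ;;
        status)
            # Report whether the MDT is currently mounted on this node.
            if grep -q " $MOUNTPOINT lustre " /proc/mounts; then
                echo "running"
            else
                echo "stopped"
            fi ;;
        *)      echo "usage: {start|stop|status}" >&2; return 1 ;;
    esac
}

# Preview what a takeover would do, without touching the system.
DRY_RUN=1 mds_resource start
```

Run as written, the last line only prints the mount command that a takeover would issue. A real failover script would also need to make sure the peer has released the shared storage (for example via STONITH) before mounting.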
Each Metadata Server runs NRPE, an agent for the Nagios server running on the Terascala Management Server. NRPE (Nagios Remote Plugin Executor) is a Nagios agent that allows remote system monitoring using scripts that are hosted on the remote systems. It allows for monitoring of resources such as disk usage, system load or the number of users currently logged in. The Nagios server periodically polls the agent on the remote system using the check_nrpe plugin[7][8].
The configuration file for NRPE is /usr/local/nagios/etc/nrpe.cfg.
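The scripts that NRPE executes follow the Nagios plugin convention: a one-line status message on standard output and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The following is a minimal sketch of such a plugin. The name `check_mdt_mounted` is hypothetical and it is not one of the checks actually defined in nrpe.cfg on these servers. Note also that on the passive Metadata Server the MDT is expected to be unmounted, so a real check would need to account for the node's role.

```shell
#!/bin/sh
# Hypothetical Nagios-style plugin: is a Lustre target mounted at the given
# mount point? Exit codes: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
check_mdt_mounted() {
    local mountpoint=$1
    local mounts_file=${2:-/proc/mounts}   # overridable for testing

    if [ ! -r "$mounts_file" ]; then
        echo "UNKNOWN - cannot read $mounts_file"
        return 3
    fi
    # Fields in /proc/mounts: device, mount point, file system type, options, ...
    if awk -v mp="$mountpoint" \
        '$2 == mp && $3 == "lustre" { found = 1 } END { exit !found }' \
        "$mounts_file"; then
        echo "OK - $mountpoint is mounted"
        return 0
    fi
    echo "CRITICAL - $mountpoint is not mounted"
    return 2
}

# Example:
#   check_mdt_mounted /mnt/mdt1; echo "exit code: $?"
```

A plugin like this would be made available to the Nagios server through a `command[...]` entry in nrpe.cfg and queried with check_nrpe.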
Each Metadata Server also runs gmond, the Ganglia Monitoring Daemon, which reports to the Ganglia Meta Daemon (gmetad) running on the Terascala Management Server. The configuration file for gmond is /usr/local/terascala/etc/ganglia/gmond.conf.
1. Lustre 2.x Operations Manual
2. Red Hat Enterprise Linux 6 DM Multipath
3. Dell Lustre Storage System Whitepaper
4. Heartbeat
5. Configuring A High Availability Cluster (Heartbeat) On CentOS
6. Pacemaker, Heartbeat, Corosync, WTF?
7. NRPE Documentation
8. How to Add Linux Host to Nagios Monitoring Server Using NRPE Plugin