
High Availability (HA) / Clustering via the Pacemaker + Corosync Stack

A full-featured cluster stack is included with ESOS. The stack consists of Pacemaker, Corosync, and crmsh. The resource-agents and fence-agents packages are also included, along with other supporting tools/utilities (eg, Python, ipmitool, etc.).

The setup and configuration of the cluster stack is well beyond the scope of this ESOS documentation; however, there is a wealth of information available on Pacemaker/Corosync. We suggest starting with the following guides for configuration (obviously, all of the cluster components are already installed in ESOS):

  • Clusters from Scratch
  • Pacemaker Explained

The Pacemaker and Corosync rc scripts (rc.pacemaker & rc.corosync) are both disabled by default in ESOS. To enable them, edit the '/etc/rc.conf' file and set 'rc.corosync_enable' and 'rc.pacemaker_enable' to "YES". You'll then need to start both services:

  • /etc/rc.d/rc.corosync start
  • /etc/rc.d/rc.pacemaker start
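
For reference, the corresponding lines in '/etc/rc.conf' might look like the following (a sketch; double-check the exact format/quoting used in your ESOS release):

rc.corosync_enable="YES"
rc.pacemaker_enable="YES"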

A SCST resource agent ('ocf:esos:scst') is included with ESOS. It can be configured either as a normal resource (start/stop) or a multi-state resource (master/slave); the MS resource mode relies on the implicit Asymmetric Logical Unit Assignment (ALUA) functionality in SCST, so this must also be configured when using 'ocf:esos:scst' as master/slave.

When using 'ocf:esos:scst' as a "normal" resource, the supporting user-land daemons and SCST modules are loaded when the resource is started, and when it is stopped, all daemons and modules are unloaded. This is important to know because when SCST loads, it expects whatever devices you have defined in /etc/scst.conf to be available (eg, a /dev/drbd0 block device, or a virtual disk file on a file system). SCST will remove these from the configuration if they are not available when started; please keep this in mind when designing your cluster.
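
To illustrate, a vdisk_blockio entry in '/etc/scst.conf' references a backing block device that must exist before SCST starts (a sketch only; the device name 'disk01' and the DRBD backing device are hypothetical):

HANDLER vdisk_blockio {
        DEVICE disk01 {
                filename /dev/drbd0
        }
}

If /dev/drbd0 is not present when SCST loads (eg, the DRBD resource has not been brought up yet), the 'disk01' device will be dropped from the running configuration.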

You will also want to make sure SCST start-up via init is disabled since the cluster stack will be managing it; set rc.scst_enable to 'NO' in /etc/rc.conf and stop the SCST service: /etc/rc.d/rc.scst stop
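
In other words, set the following in '/etc/rc.conf' (a sketch; verify the exact quoting on your system), then stop the service:

rc.scst_enable="NO"
/etc/rc.d/rc.scst stop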

An example of the SCST resource configuration for two ESOS nodes (one started, one stopped) might look something like this:

crm
cib new scst
configure primitive p_scst ocf:esos:scst
configure show
cib commit scst
quit

An example of the SCST resource configuration for two ESOS nodes (both started) might look something like this:

crm
cib new scst
configure primitive p_scst ocf:esos:scst
configure clone clone_scst p_scst \
meta clone-max="2" clone-node-max="1" notify="true" interleave="true"
configure show
cib commit scst
quit

If you want SCST to be loaded and running on the cluster nodes, but not necessarily "available" (path preference; see the notes below), you can use 'ocf:esos:scst' as a multi-state resource. The resource agent (RA) relies on ALUA in SCST to function. Below are example ALUA configurations in SCST for a two-node cluster (SCST must be running in order to configure this; use '/etc/rc.d/rc.scst' to start it and then stop it when finished). The examples illustrate the relation of the device groups, target groups, and targets between the two nodes. You can also use the TUI "ALUA" menu functions to configure ALUA on both nodes.

On the first host, using the shell (Interface -> Exit to Shell):

/etc/rc.d/rc.scst start
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 21:00:00:e0:8b:9d:74:49 -dev_group esos -tgt_group local
scstadmin -set_ttgt_attr 21:00:00:e0:8b:9d:74:49 -dev_group esos -tgt_group local -attributes rel_tgt_id=1
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 21:00:00:1b:32:01:6b:11 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 21:00:00:1b:32:01:6b:11 -dev_group esos -tgt_group remote -attributes rel_tgt_id=2
/etc/rc.d/rc.scst stop

On the second host, using the shell (Interface -> Exit to Shell):

/etc/rc.d/rc.scst start
scstadmin -add_dgrp esos
scstadmin -add_tgrp local -dev_group esos
scstadmin -set_tgrp_attr local -dev_group esos -attributes group_id=2
scstadmin -add_tgrp_tgt 21:00:00:1b:32:01:6b:11 -dev_group esos -tgt_group local
scstadmin -set_ttgt_attr 21:00:00:1b:32:01:6b:11 -dev_group esos -tgt_group local -attributes rel_tgt_id=2
scstadmin -add_tgrp remote -dev_group esos
scstadmin -set_tgrp_attr remote -dev_group esos -attributes group_id=1
scstadmin -add_tgrp_tgt 21:00:00:e0:8b:9d:74:49 -dev_group esos -tgt_group remote
scstadmin -set_ttgt_attr 21:00:00:e0:8b:9d:74:49 -dev_group esos -tgt_group remote -attributes rel_tgt_id=1
/etc/rc.d/rc.scst stop

Now that ALUA is configured, you can configure the SCST resource. From this point forward, when you want your SCST devices to be used in the ALUA setup, you must run the following command on each node for each device (DEVICE_NAME is the name of the SCST device):

scstadmin -add_dgrp_dev DEVICE_NAME -dev_group esos

The SCST RA uses two ALUA states for Master/Slave; the 'active' state for Master, and the 'nonoptimized' state for Slave. The reasoning behind using 'nonoptimized' is that we wanted the initiators to be able to pick a path themselves if something failed between it and the target (eg, switch, cable, HBA, etc.); this way it isn't required of the target (ESOS) to change ALUA state information based on an external path failure. This ALUA state seemed to work best in accomplishing this for most initiators.
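
For reference, the ALUA state the RA manages corresponds to the SCST target group 'state' attribute; it could be set manually with scstadmin like this (a sketch only, assuming the standard SCST target group 'state' attribute; normally the cluster manages these states for you):

scstadmin -set_tgrp_attr local -dev_group esos -attributes state=active
scstadmin -set_tgrp_attr remote -dev_group esos -attributes state=nonoptimized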

SCST multi-state resource example; two nodes (one master, one slave):

crm
cib new scst
configure primitive p_scst ocf:esos:scst \
params alua="true" \
device_group="esos" \
local_tgt_grp="local" \
remote_tgt_grp="remote" \
op monitor interval="10" role="Master" \
op monitor interval="20" role="Slave" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
meta master-max="1" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true"
configure show
cib commit scst
quit

SCST multi-state resource example; two nodes (both master):

crm
cib new scst
configure primitive p_scst ocf:esos:scst \
params alua="true" device_group="esos" \
local_tgt_grp="local" remote_tgt_grp="remote" \
m_alua_state="active" s_alua_state="nonoptimized" \
op monitor interval="10" role="Master" \
op monitor interval="20" role="Slave" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="60"
configure ms ms_scst p_scst \
meta master-max="2" master-node-max="1" \
clone-max="2" clone-node-max="1" \
notify="true"
configure show
cib commit scst
quit

Warning: Care needs to be taken when using an active/active SCST configuration in MPIO and clustered initiator environments. SCST itself is not cluster aware, and when using something like DRBD, it only replicates the blocks of data between hosts. There is more to SCSI than just reads/writes (locks, etc.) and these are NOT communicated to other hosts with SCST. If you are using some type of clustered/HA back-end storage and using ESOS/SCST as a different target type, these other SCSI items may be passed through the layers, but you should double-check. Even when using the SCST ALUA multi-state resource agent, it is implicit ALUA and the targets are only "suggesting" what path to take (assuming the initiator/application supports implicit ALUA). Be sure not to use some type of round-robin algorithm for MPIO on your initiators! Check your initiator configuration, and use a fixed pathing policy or something similar.
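
As an illustration for Linux initiators running dm-multipath, a device section that groups paths by the ALUA priority reported by the target (instead of round-robining across every path) might look roughly like the following. This is a sketch only; the vendor string is an assumption based on SCST vdisk defaults, and in a true active/active (both-master) setup you may still want to pin a preferred path per LUN:

devices {
        device {
                vendor "SCST_BIO"
                product ".*"
                path_grouping_policy group_by_prio
                prio alua
                hardware_handler "1 alua"
                failback immediate
                no_path_retry 12
        }
}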

Since ESOS makes use of email as a communication method, an email-helper script (external agent) was developed for the crm_mon utility. This is typically used with the 'ocf:pacemaker:ClusterMon' resource agent in ESOS for cluster status change notifications. Below is an example of configuring it using crmsh; it's recommended not to enable this resource until after you have completed setup and testing of your ESOS cluster, as it can generate a lot of email messages.

crm
cib new notify
configure primitive p_notify ocf:pacemaker:ClusterMon \
params user="root" update="30" \
extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
op monitor on-fail="restart" interval="10"
configure clone clone_notify p_notify \
meta target-role="Started"
configure show
cib commit notify
quit

Synchronize your configuration:

conf_sync.sh

The ocf:esos:scst RA

In addition to the ocf:esos:scst resource agent being used as described above, where it manages the SCST service (modules/daemons) itself and the ALUA target group states, it can also be used to only start/stop the SCST subsystem. This is useful if you want to run two sets (multi-state) of the ocf:esos:alua RA (see below). With this type of setup, you'd have one resource managing SCST on each node, and then two ALUA resources, each with its preferred node set to a specific ESOS server. You could then achieve an "active/active" setup where you place some devices in one SCST device group, and some devices in another.

SCST "clone" resource configuration (start/stop SCST only) example:

crm
cib new scst
configure primitive p_scst ocf:esos:scst \
params alua=false \
op start interval=0 timeout=120 \
op stop interval=0 timeout=60 \
op monitor interval=30 timeout=60
configure clone clone_scst p_scst \
meta interleave=true target-role=Started
configure show
cib commit scst
quit

The ocf:esos:alua RA

The ocf:esos:alua resource agent is special in that it doesn't actually start or stop the SCST services (modules/daemons) on the cluster nodes. Instead, it expects the SCST service to already be running, and this RA simply modifies or tests the running SCST ALUA configuration (checking target group states). You can run more than one of these resource sets (multi-state) in a cluster configuration, typically just two. This allows you to create one SCST device group for node "A" and one for node "B". You can then use cluster constraints to keep one SCST device group "active" on one node, and the other group "active" on the opposing node. You would then manually map devices between these two device groups, creating an active/active configuration; on fail-over, all of the devices would be accessed on the surviving node. Sketches of the device mapping and constraints are given below.
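
For example, mapping devices into the two per-node device groups might look like this (a sketch; the device names and the 'node_b_devs' group are hypothetical, following the naming used in the example below):

scstadmin -add_dgrp_dev disk01 -dev_group node_a_devs
scstadmin -add_dgrp_dev disk02 -dev_group node_b_devs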

Here is an example of adding the ocf:esos:alua resource:

crm
cib new alua
configure primitive p_alua_node_a_devs ocf:esos:alua \
params device_group=node_a_devs local_tgt_grp=node_a_local_grp remote_tgt_grp=node_a_remote_grp \
m_alua_state=active s_alua_state=nonoptimized use_trans_state=true \
op monitor interval=10 role=Master \
op monitor interval=20 role=Slave \
op start interval=0 timeout=120 \
op stop interval=0 timeout=60
configure ms ms_alua_node_a_devs p_alua_node_a_devs \
meta master-max=1 master-node-max=1 clone-max=2 \
clone-node-max=1 notify=true interleave=true
configure show
cib commit alua
quit
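
To set the preferred node and tie the ALUA resource to the running SCST clone, constraints along these lines could be added (a sketch only; the node name "node-a" is hypothetical, and the exact crmsh rule syntax can vary slightly between versions):

crm
cib new constraints
configure location loc_alua_node_a_devs ms_alua_node_a_devs \
rule $role=Master 100: #uname eq node-a
configure colocation col_alua_node_a_with_scst inf: ms_alua_node_a_devs clone_scst
configure order o_scst_before_alua_node_a inf: clone_scst ms_alua_node_a_devs
configure show
cib commit constraints
quit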

The ocf:esos:syncro RA

This is the resource agent for managing Broadcom/Avago/LSI "Syncro" virtual drive (VD) ownership. The RA takes an SCST device group as a parameter, and any raw SCSI disk devices in this group (eg, using the vdisk_blockio handler, where the block devices are the actual Syncro VDs) should be "owned" by the node. This RA was inspired by the 'syncrovd' agent from author Felix Zachlod.

After much testing with this RA, we do not recommend using it! The RA attempts to "own" a VD by using SCSI persistent reservations (PRs), and it works to a limited degree; however, many quirks are visible with the Syncro VDs after ownership is attempted. It is only given here for reference purposes, and for possibly expanding on it in the future.


The ocf:brick:btier RA

Special thanks to Riccardo Bicelli for creating a BTIER resource agent (RA) for use with Pacemaker. This RA is included in ESOS; here is his original post for the BTIER RA: http://think-brick.blogspot.it/2014/09/btier-resource-agents-for-pacemaker.html

Example usage in ESOS:

crm
cib new btier
configure primitive p_btier ocf:esos:btier \
params tier_devices="/dev/sda:/dev/sdb" \
device_name="mybtierdev01" \
op monitor interval="10s"
configure show
cib commit btier
quit

The ocf:onesty:scst_qla2xtgt RA

This resource agent comes from author Felix Zachlod; it's a special RA that some ESOS users may be interested in experimenting with. It works quite differently from the standard ocf:esos:scst and ocf:esos:alua RAs that were discussed above. It's used to start drivers and enable QLogic Fibre Channel ports, as well as to instantiate initiator groups and I/O grouping. It is specifically made for HA Fibre Channel (FC) ALUA targets, and so it expects every target port to have a unique target port ID.

Use this to initialize your SCST instance before starting devices and device groups above it, so all of your configuration can be held in the cluster manager instead of manually configuring each node. SCST start-up on system boot must be disabled.

For additional information, please visit the GitHub repo for this RA: https://github.com/FZachlod/scst_ocf_ra


The ocf:onesty:scst_aluadg RA

This is another cluster RA that comes to us from Felix Zachlod. This resource agent manages SCST ALUA device groups. It's used to create device handlers, and it exports/manages the ALUA states between targets. It is also provided for ESOS users who would like to experiment with an alternative RA for certain SCST configurations.

For additional information, please visit the GitHub repo for this RA: https://github.com/FZachlod/scst_ocf_ra