Resource Agent for PostgreSQL 9.1 streaming replication - t-matsuo/resource-agents GitHub Wiki

This patch was merged into Clusterlabs repository and released as resource-agnets-3.9.3. (recommend 3.9.5 or higher)

Bug report : mailling list (http://lists.linux-ha.org/mailman/listinfo/) or Issues (https://github.com/ClusterLabs/resource-agents/issues)

(note) This RA is developped with CentOS 5 and Pacemaker 1.0.11.

Outline

A synchronous replication was supported as a new feature in PostgreSQL 9.1. Pacemeker Resource agent (RA) for synchronous/asynchronous replication of PostgreSQL 9.1 is developed based on the pgsql resource agent of the Heartbeat community.

Features

Failover of Master
- If Master breaks down, RA detects this fault and makes Master to stop, and Slave is promoted to new Master(promote).
Switching between asynchronous and synchronous * If Slave breaks down or LAN have some trouble, the transaction including Write operation will be stopped during the setting of synchronous replication. This means the stop of service. Therefore, RA switches dynamically from the synchtonous to the asynchronization replication for prevented from stopping.
Automated discrimination of data old and new when initial starts
- When Pacemaker of two or more nodes is started at the same time in the initial starts, RA compare the data of each node using last xlog replay location to check which node has the newest data. The node which has the newest data will be Master. Of cource the node becomes Master when Pacemaker is started only one node or it starts for the first time too. RA judges it based on the state of the data when having stopped before.
Load-balancing of read
- Because Slave can process the Read only-transaction, the load-balancing of Read is possible by applying another virtual IP for the Read operation.

Term

The term was used in this document.

RA
- Resource agent
Master,Slave
- the state of Master/Slave resource of Pacemaker
PRI
- PostgreSQL works as Primary(Master). The request of Read/Write can be processed in PRI as well as usual PostgreSQL, and data for the replication can be transmitted. Master of Pacemaker is basically corresponding to this state. At a synchronous replication, The transaction is stopped when the response from Standy PostgreSQL(HS) is lost.
HS
- PostgreSQL works as Hot Standby. Only the request of Read is available. The data of the replication can be received from PRI. Because PostgreSQL cannot change the state from PRI to HS directory, the state may not be constistent to Slave of Pacemaker though the state changes to Pacemaker with Master-> Slave.
Asynchronous mode
- Make the asynchronization replication of PostgreSQL HA cluster with RA. Only when HS works normally by the asynchronization replication, the failover of Master is possible.
Synchronous mode
- Make the synchronous replication of PostgreSQL HA cluster with RA. HS works as the synchronous mode as a normal, However, HS's breaking down or removing LAN HS move to asynchronization mode (automatic switch). Only when HS works as a synchronous mode, the failover of Master is possible.
Replication mode
- The asynchronous mode and the synchronous mode are generically called a replication mode.
Replication mode switch
- At the synchronous mode, The Replication mode switches the synchronous to the asynchronous replication.
D-LAN
- LAN that throws packet of data replication. Please use bonding for the fault tolerance.
S-LAN
- LAN to provide service. In the current Act-Standby composition, it is equal to LAN that gives virtual IP. Please use bonding for the fault tolerance.
IC-LAN
- LAN (inter connection LAN) that throws packet of communication of heartbeat of Pacemaker. Two or more is recommended.
STONITH-LAN
- STONITH is not used in this document. But I recommend to use it.

Limitations

If you want to connect new Master(PRI) after fail-over occured automatically, it is necessary to share the WAL archive of PRI to HS. For instance, at the "archive_command" setting of postgresql.conf, you can use scp or rsync command for sending the WAL archive to HS. Another methods are the using the share-disk and NFS and so on.. RA don't care the method. I recommend not to share WAL archives because it's difficult to keep it consistency.
- A specification of PostgreSQL9.1 fails switching the Master. Because the shutdown of PRI can't send all WAL to HS. I hope someone will improve it.
Without shareing the WAL archive, it is impossible to reconnect to PRI from HS when PostgreSQL is stopped with demote and stop operations. In this case you must copy the WAL archive of PRI to HS by manually.
It is necessary to add the virtual IP to not only S-LAN but also D-LAN in Master.

Parameter

The following parameters are added to the parameter of the original pgsql RA. the "monitor_sql" of original pgsql RA parameter cannot be used in the replication mode.

rep_mode
- choice from none/async/sync. "none" is default, and the same operation as original pgsql RA. "async" is an asynchronous mode, and "sync" is a synchronous mode. The following parameter node_list master_ip, and restore_command is necessary at async or sync modes(*).
node_list(*)
- The list of PRI and all HS nodes. Specifies a space-separated list of all node name (result of the uname -n command).
master_ip(*)
- Virtual IP used for D-LAN is specified. This Virtual IP is added to the Master.
restore_command(*)
- restore_command specified in recovery.conf file when starting with HS.
repuser
- The user of replication which HS connects to PRI. Default is "postgres".
primary_conninfo_opt
- RA generates recovery.conf file for HS. host,port,user and application name of primary_conninfo are automatically set by RA. If you want to set an additional parameter, you can specifies it here.
  - ex)ssl setting
tmpdir
- the rep_mode_conf and xlog_note.* and PGSQL.lock files are created in this directory. Default is /var/lib/pgsql/tmp directory. If the directory dosen't exist, RA makes it automatically.
xlog_check_count
- The count of last_xlog_replay_location's check is speciefied. Default is 3 (times). It is counted at moniter interval. The last_xlog_replay location is used for which node is the latest one at the initial starting PostgreSQL. If you set small values, the wrong PRI node is set because Pacemaker of other nodes is not started.
stop_escalate_in_slave
- Number of shutdown retries (using -m fast) before resorting to -m immediate in Slave state. In Master sate, you can use "stop_escalate".

Installation

A peculiar setting to this RA is mainly described here. Please refer to other documents for an installation and a basic operation of PostgreSQL and Pacemaker.

The assumption composition in this document is made the following.

The replication mode is assumed to be a synchronous mode
The WAL archive is not shared.
- You need to copy WAL archive from new Master to Slave.
The node name is assumed to be pm01 and pm02
S-LAN IP are 192.168.0.1 and 192.168.0.2.
Virtual IP(Master) is 192.168.0.201.
Virtual IP(Slave) is 192.168.0.202.
IC-LAN are 192.168.1.1, 192.168.1.2, 192.168.2.1, and 192.168.2.2.
D-LAN are 192.168.3.1 and 192.168.3.2.
Virtual (Master of D-LAN) IP is 192.168.3.200.
/usr/local/pgsql/ is the installation destination of PostgreSQL.
/var/lib/pgsql/9.1/data is the PostgreSQL database cluster and an archival storage is the /var/lib/pgsql/9.1/data/pg_archive.
If you use this RA for business use, STONITH setting is strongly recommends. The explanation is ommited here.

figure

the installation method is as follows.

Setting of PostgreSQL ( Only an important point..)

Point of postgresql.conf setting
- You need to start PostgreSQL manually before settiing up replication.
- when there is "Synchronous_standby_names" parameter in postgresql.conf, please delete it.
- Fixed IP cannot be written in listen_address.
- Replication_timeout is detection time for the replication cuts it, and wal_receiver_status_interval is an interval when HS tries connecting to PRI. To shorten detection, you should set thease value to small.
- The WAL archive is not shared with PRI with HS. So you need to copy WAL archive from new Master to Slave to connect new Master normally.
- RA adds "include" into postgresql.conf when using synchronous mode. If you want to switch to asynchronous mode, you need to delete it.

The main set part as follows. Please refer to the manual of PostgreSQL for other parameter. Check the starting with the PostgreSQL unit, and the replication is possible.

postgresql.conf

listen_addresses = '*'
wal_level = hot_standby
synchronous_commit = on
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.1/data/pg_archive/%f'
max_wal_senders=5
wal_keep_segments = 32
hot_standby = on
restart_after_crash = off
replication_timeout = 5000         # mseconds
wal_receiver_status_interval = 2   # seconds
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
synchronous_commit = on
restart_after_crash = off
hot_standby_feedback = on

pg_hba.conf of each node
- Be careful. this explanation is not considered about the security.

host    all             all     127.0.0.1/32        trust
host    replication     all     192.168.3.0/24      trust

recovery.conf
- It is not necessary. RA makes it automaticaly and delete it if exists.
- If you want to confirm the replication without Pacemaker, Please make the following file at /vat/lib/pgsql/9.1/data on pm02.

standby_mode = 'on'
primary_conninfo = 'host=192.168.3.1 port=5432 user=postgres application_name=pm02'
restore_command = 'cp /var/lib/pgsql/9.1/data/pg_archive/%f %p'
recovery_target_timeline='latest'

Please confirm PostgreSQL replication success.

Pacemaker

pacemaker 1.0.11
- Only /usr/lib/ocf/resource.d/heartbeat/pgsql is replaced with this file (https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql). The file can be downloaded by clicking *raw.
- Setting CRM file of the sample described at the end of this document is additionally changed to your environment after Pacemaker starts.

About the state

This RA defines the following states as a node attribute value of Pacemaker.

attribute can be seen in "crm_mon -A".

pgsql-status

A present state of PostgreSQL is shown by the attribute value to which PRI or either HS node is displayed.

STOP
- PostgreSQL has stopped.
HS:alone
- It works as HS but doesn't connect it to PRI
HS:connected
- it works as HS, and connected with PRI but, Not normal state of the replication (Data has not caught up with PRI). "HS:connected" is displayed except for the result of "select state and sync_state from pg_stat_replication" in PRI: STREAMING|SYNC, STREAMING|ASYNC,STREAMING|STREAMING STREAMING|POTENTIAL
- in the synchronous mode and STREAMING|ASYNC in the asynchronous mode.
HS:async
- It works as HS, and the state of the asynchronous replication. When RA is used in the asynchronous mode, it is possible to be promoted to Master.
HS:sync
- It works as HS, and the state of a synchronous replication. It is possible to be promoted to Master for the synchronous mode.
HS:potential
- It is displayed when moving it in the synchronousmode more than three nodes. (PRI x1, HS x N) The specification of PostgreSQL as a synchronous replication is only one node, and the other HS nodes become this POTENTIAL.
PRI
- It operates by PRI.

pgsql-data-status

The transitional state of data is displayed. When being start next time, this state is used to judge whether which node are the latest.

DISCONNECT
- It changes when it becomes impossible to HS from PRI due to the breakdown and the D-LAN failure etc.
{state} | {sync_state}
- It changes when HS connects it with PRI.
- The result of "select state and sync_state from pg_stat_replication" on PRI is displayed as it is as an attribute value.
- INIT, CATCHUP, and STREAMING are displayed to {state} and ASYNC, POTENTIAL, and SYNC are displayed to {sync_state} (All in this ? )
LATEST
- It changes to Master(PRI).

These states are the transitional state of final data, and it may be not consistent with the state of actual data. For instance, During PRI, the state is "LATEST", but the node stop or down, this state "LATEST" is maintained. It never changes to "DISCONNECT" for oneself. When other node newly promote Master, this new Master changes the state of old Master to "DISCONNECT". When any node can not become Master, this "LATEST" will be keepped.

pgsql-master-baseline

It is displayed when promote is called. This attribute is used when starting PosgreSQL as a HS to prohibit it from starting if data is inconsistent.

pgsql-xlog-replay-loc

There is no Master node, it is displayed. RA decide to promote one node to Master comparing with the value of the last xlog replay location among other node.

crm_mon

============
Last updated: Wed Jul 11 11:11:11 2011
Stack: Heartbeat
Current DC: pm02 (11111111-1111-1111-1111-111111111111) - partition with quorum
Version: 1.0.12-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
2 Nodes configured, unknown expected votes
5 Resources configured.
============

Online: [ pm01 pm02 ]

vip-master      (ocf::heartbeat:IPaddr2):       Started pm01
vip-rep (ocf::heartbeat:IPaddr2):       Started pm01
vip-slave       (ocf::heartbeat:IPaddr2):       Started pm02
 Master/Slave Set: msPostgresql
     Masters: [ pm01 ]
     Slaves: [ pm02 ]
 Clone Set: clnPingCheck
     Started: [ pm01 pm02 ]

Node Attributes:
* Node pm01:
    + default_ping_set                  : 100
    + master-pgsql:0                    : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-status                      : PRI
* Node pm02:
    + default_ping_set                  : 100
    + master-pgsql:1                    : 100
    + pgsql-data-status                 : STREAMING|SYNC
    + pgsql-status                      : HS:sync

Migration summary:
* Node pm02:
* Node pm01:

sample setting of crm (using pacemaker 1.0.x)

property \
	no-quorum-policy="ignore" \
	stonith-enabled="false" \
    crmd-transition-delay="0s"

rsc_defaults \
	resource-stickiness="INFINITY" \
	migration-threshold="1"

ms msPostgresql pgsql \
	meta \
		master-max="1" \
		master-node-max="1" \
		clone-max="2" \
		clone-node-max="1" \
		notify="true"

clone clnPingCheck pingCheck
group master-group \
      vip-master \
      vip-rep \
      meta \
          ordered="false"

primitive vip-master ocf:heartbeat:IPaddr2 \
	params \
		ip="192.168.0.201" \
		nic="eth0" \
		cidr_netmask="24" \
	op start   timeout="60s" interval="0s"  on-fail="stop" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="block"

primitive vip-rep ocf:heartbeat:IPaddr2 \
	params \
		ip="192.168.3.200" \
		nic="eth3" \
		cidr_netmask="24" \
    meta \
            migration-threshold="0" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="block"

primitive vip-slave ocf:heartbeat:IPaddr2 \
	params \
		ip="192.168.0.202" \
		nic="eth0" \
		cidr_netmask="24" \
    meta \
        resource-stickiness="1" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="block"

primitive pgsql ocf:heartbeat:pgsql \
	params \
		pgctl="/usr/local/pgsql/bin/pg_ctl" \
		psql="/usr/local/pgsql/bin/psql" \
		pgdata="/var/lib/pgsql/9.1/data/" \
		start_opt="-p 5432" \
		rep_mode="sync" \
		node_list="pm01 pm02" \
		restore_command="cp /var/lib/pgsql/9.1/data/pg_archive/%f %p" \
		primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
		master_ip="192.168.3.200" \
		stop_escalate="0" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="7s" on-fail="restart" \
	op monitor timeout="60s" interval="2s"  on-fail="restart" role="Master" \
	op promote timeout="60s" interval="0s"  on-fail="restart" \
	op demote  timeout="60s" interval="0s"  on-fail="stop" \
	op stop    timeout="60s" interval="0s"  on-fail="block" \
	op notify  timeout="60s" interval="0s"

primitive pingCheck ocf:pacemaker:pingd \
	params \
		name="default_ping_set" \
		host_list="192.168.0.254" \
		multiplier="100" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="ignore"

location rsc_location-1 vip-slave \
	rule  200: pgsql-status eq "HS:sync" \
	rule  100: pgsql-status eq "PRI" \
	rule  -inf: not_defined pgsql-status \
	rule  -inf: pgsql-status ne "HS:sync" and pgsql-status ne "PRI"

location rsc_location-2 msPostgresql \
	rule -inf: not_defined default_ping_set or default_ping_set lt 100

colocation rsc_colocation-1 inf: msPostgresql        clnPingCheck
colocation rsc_colocation-2 inf: master-group        msPostgresql:Master

order rsc_order-1 0: clnPingCheck          msPostgresql
order rsc_order-2 0: msPostgresql:promote  master-group:start   symmetrical=false
order rsc_order-3 0: msPostgresql:demote   master-group:stop    symmetrical=false

Q&A

How do I force start Master although pgsql-data-status is "DISCONNECT" ?
- crm_attribute -l forever -N {Node Name} -n "pgsql-data-status" -v "LATEST"
How can I rsync data manually ?
- You need delete option.
- (ex) rsync -avr --delete 192.168.3.2:/var/lib/pgsql/9.1/data/ /var/lib/pgsql/9.1/data/
What's PGSQL.lock file ?
- The file is created on promote. And it's deleted on demote only if Slave dosen't exist. If this file remains in a node, maybe the node's data is inconsistent.
How do we stop all servers ?
- First, stop Slave. After that stop Master. If you stop Master in first, PGSQL.lock file remains.