Resource Agent for PostgreSQL 9.1 streaming replication (for pg rex branch) - t-matsuo/resource-agents GitHub Wiki

(Thanks for translation, Mr. Yoshiharu Mori)

Bug report : https://github.com/t-matsuo/resource-agents/issues

This document is written for "pg-rex" branch. "pg-rex" branch is Maintenance Version. "pgsql91" branch is Development Version.

Outline

A synchronous replication was supported as a new feature in PostgreSQL 9.1. Pacemeker Resource agent (RA) for synchronous/asynchronous replication of PostgreSQL 9.1 is developed based on the pgsql resource agent of the Heartbeat community.

Term

The term was used in this document.

RA
- Resource agent
Master,Slave
- the state of Master/Slave resource of Pacemaker
PRI
- PostgreSQL works as Primary. The request of Read/Write can be processed in PRI as well as usual PostgreSQL, and data for the replication can be transmitted. Master of Pacemaker is basically corresponding to this state. At a synchronous replication, The transaction is stopped when the response from Standy PostgreSQL(HS) is lost.
HS
- PostgreSQL works as Hot Standby. Only the request of Read is available. The data of the replication can be received from PRI. Because PostgreSQL cannot change the state from PRI to HS directory, the state may not be constistent to Slave of Pacemaker though the state changes to Pacemaker with Master-> Slave.
Asynchronous mode
- Make the asynchronization replication of PostgreSQL HA cluster with RA. Only when HS works normally by the asynchronization replication, the failover of Master is possible.
Synchronous mode
- Make the synchronous replication of PostgreSQL HA cluster with RA. HS works as the synchronous mode as a normal, However, HS's breaking down or removing LAN HS move to asynchronization mode (automatic switch). Only when HS works as a synchronous mode, the failover of Master is possible.
Replication mode
- The asynchronous mode and the synchronous mode are generically called a replication mode.
Replication mode switch
- At the synchronous mode, The Replication mode switches the synchronous to the asynchronous replication.
D-LAN
- LAN that throws packet of data replication. Please use bonding for the fault tolerance.
S-LAN
- LAN to provide service. In the current Act-Standby composition, it is equal to LAN that gives virtual IP. Please use bonding for the fault tolerance.
IC-LAN
- LAN (inter connection LAN) that throws packet of communication of heartbeat of Pacemaker. Two or more is recommended.

Function

Failover of Master

RA manages the Master when PostgreSQL is PRI, the Slave when PostgreSQL is HS. If PRI breaks down, RA detects this fault and makes PRI to stop, and HS is promoted to new PRI (promote).

Automated discrimination of data old and new when initial starts

When Pacemaker of two or more nodes is started at the same time in the initial starts, RA compare the data of each node using last xlog replay location to check which node has the newest data. The node which has the newest data will be PRI. Of cource the node becomes PRI when Pacemaker is started only one node or it starts for the first time too. RA judges it based on the state of the data when having stopped before.

Replication mode switch when synchronous replication is used

If HS breaks down or D-LAN have some trouble,the transaction including Write operation will be stopped during the setting of synchronous replication. This means the stop of service. Therefore, RA switches dynamically from the synchtonous to the asynchronization replication for prevented from stopping.

Load-balancing of Read

Because HS can process the Read only-transaction, the load-balancing of Read is possible by applying another virtual IP for the Read operation.

Limitations

If you want to switch the Master(PRI) and restart PostgreSQL, it is necessary to share the WAL archive of PRI to HS. For instance, at the "archive_command" setting of postgresql.conf, you can use scp or rsync command for sending the WAL archive to HS. Another methods are the using the share-disk and NFS and so on.. RA don't care the method.
- A bug of PostgreSQL9.1.1 fails switching the Master.
Without shareing the WAL archive, it is impossible to reconnect to PRI from HS when PostgreSQL is stopped with demote and stop operations. In this case you must copy the WAL archive of PRI to HS by manually.
It is necessary to add the virtual IP to not only S-LAN but also D-LAN in Master.

Parameter

The following parameters are added to the parameter of the original pgsql RA. the "monitor_sql" of original pgsql RA parameter cannot be used in the replication mode.

rep_mode
- choice from none/async/sync. "none" is default, and the same operation as original pgsql RA. "async" is an asynchronous mode, and "sync" is a synchronous mode. The following parameter node_list master_ip, and restore_command is necessary at async or sync modes(*).
node_list(*)
- The list of PRI and all HS nodes. Specifies a space-separated list of all node name (result of the uname -n command).
master_ip(*)
- Virtual IP used for D-LAN is specified. This Virtual IP is added to the Master.
restore_command(*)
- restore_command specified in recovery.conf file when starting with HS.
repuser
- The user of replication which HS connects to PRI. Default is "postgres".
stop_on_demote
- PostgreSQL is restarting by demote operation, Because PRI cannot be changed directly to HS. This means the failover cannot be done until PostgreSQL does the restarting completion. It takes time to the recovery when restarting, and the time of failover is required. It doesn't become the problem so much for a small data base but a big one. If this option is set to "yes", not restarting with demote operation but the stop is executed and the failover time becomes shortly. It is more effective to set the "stop_escalate" paramaeter to more small value, that is originally is designed for shortening failover time(Maybe...)
primary_conninfo_opt
- RA generates recovery.conf file for HS. host,port,user and application name of primary_conninfo are automatically setted by RA. If you want to set an additional parameter, you can specifies it here.
  - ex)ssl setting
tmpdir
- the rep_mode_conf and xlog_note are creating in this directory. Default is /var/lib/pgsql directory. The root and DB administorator user should be able to read and write it.
pgctldata
- Command path to pg_controldata is specified. Default is /usr/bin/pg_controldata.
xlog_check_count
- The count of last_xlog_replay_location's check is speciefied. Default is 3 (times). It is counted at moniter interval. The last_xlog_replay location is used for which node is the latest one at the initial starting PostgreSQL. If you set small values, the wrong PRI node is setted because Pacemaker of other nodes is not started.

Installation

A peculiar setting to this RA is mainly described here. Please refer to other documents for an installation and a basic operation of PostgreSQL and Pacemaker.

The assumption composition in this document is made the following.

The replication mode is assumed to be a synchronous mode
scp command is used for sharing the WAL archive.
The node name is assumed to be pm01 and pm02
S-LAN IP are 192.168.0.1 and 192.168.0.2.
Virtual IP(Master) is 192.168.0.201.
Virtual IP(Slave) is 192.168.0.202.
IC-LAN are 192.168.1.1, 192.168.1.2, 192.168.2.1, and 192.168.2.2.
D-LAN are 192.168.3.1 and 192.168.3.2.
Virtual (Master of D-LAN) IP is 192.168.3.200.
/usr/local/pgsql/ is the installation destination of PostgreSQL.
/var/lib/pgsql/9.1/data is the PostgreSQL database cluster and an archival storage is the /var/lib/pgsql/9.1/data/pg_archive.
If you use this RA for business use, STONITH setting is strongly recommends. The explanation is ommited here.

figure

the installation method is as follows.

Setting of PostgreSQL ( Only an important point..)

Point of postgresql.conf setting
- when there is "Synchronous_standby_names" parameter in postgresql.conf, please delete it.
- Fixed IP cannot be written in listen_address.
- Replication_timeout is detection time for the replication cuts it, and wal_receiver_status_interval is an interval when HS tries connecting to PRI. To shorten detection, you should set thease value to small. Default is waited for until tcp does the time-out(?)
Include /var/lib/pgsql/rep_mode.conf to postgresql.conf. RA dynamically writes synchronous_standby_names in this file.
The WAL archive is shared with PRI with HS. Here, the example of doing the WAL archiving on PRI and transmitting it to the HS side used by scp command at the same time written in archive_command. Please DB administorater user (postgres) must be exchanging the ssh key with pm02 for pm01 and opposite are also the same. Check login each node without the password.

The main set part as follows. Please refer to the manual of PostgreSQL for other parameter. Check the starting with the PostgreSQL unit, and the replication is possible.

postgresql.conf of pm01

listen_addresses = '*'
wal_level = hot_standby
synchronous_commit = on
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.1/data/pg_archive/%f && scp -q %p [email protected]:/var/lib/pgsql/9.1/data/pg_archive/%f'
max_wal_senders=5
wal_keep_segments = 32
hot_standby = on
include = '/var/lib/pgsql/rep_mode.conf'
restart_after_crash = off
replication_timeout = 5000         # mseconds
wal_receiver_status_interval = 2   # seconds

postgresql.conf of pm02 (Only IP of scp destination is different. )

listen_addresses = '*'
wal_level = hot_standby
synchronous_commit = on
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.1/data/pg_archive/%f && scp -q %p [email protected]:/var/lib/pgsql/9.1/data/pg_archive/%f'
max_wal_senders=5
wal_keep_segments = 32
hot_standby = on
include = '/var/lib/pgsql/rep_mode.conf'
restart_after_crash = off
replication_timeout = 5000         # mseconds
wal_receiver_status_interval = 2   # seconds

pg_hba.conf of each node * Be careful. this explanation is not considered about the security.

host    all             all     127.0.0.1/32        trust
host    replication     all     192.168.3.0/24      trust

recovery.conf
- It is not necessary. RA makes it automaticaly and delete it if exists.
- if you want to confirm the replication without Pacemaker, Please make the following file at /vat/lib/pgsql/9.1/data on pm02.

standby_mode = 'on'
primary_conninfo = 'host=192.168.3.1 port=5432 user=postgres application_name=pm02'
restore_command = 'cp /var/lib/pgsql/9.1/data/pg_archive/%f %p'
recovery_target_timeline='latest'

Please confirm PostgreSQL replication success.

Pacemaker

pacemaker 1.0.11
- Only/usr/lib/ocf/resource.d/heartbeat/pgsql is replaced with this file. The file can be downloaded by clicking *raw.
- Setting CRM file of the sample described at the end of this document is additionally changed to your environment after Pacemaker starts.

About the state

This RA defines the following states as a node attribute value of Pacemaker.

attribute can be seen in "crm_mon -A".

pgsql-status

A present state of PostgreSQL is shown by the attribute value to which PRI or either HS node is displayed.

STOP
- PostgreSQL has stopped.
HS:alone
- It works as HS but doesn't connect it to PRI
HS:connected
- it works as HS, and connected with PRI but, Not normal state of the replication (Data has not caught up with PRI). "HS:connected" is displayed except for the result of "select state and sync_state from pg_stat_replication" in PRI: STREAMING|SYNC, STREAMING|ASYNC,STREAMING|STREAMING STREAMING|POTENTIAL
- in the synchronous mode and STREAMING|ASYNC in the asynchronous mode.
HS:async
- It works as HS, and the state of the asynchronous replication. When RA is used in the asynchronous mode, it is possible to be promoted to Master.
HS:sync
- It works as HS, and the state of a synchronous replication. It is possible to be promoted to Master for the synchronous mode.
HS:potential
- It is displayed when moving it in the synchronousmode more than three nodes. (PRI x1, HS x N) The specification of PostgreSQL as a synchronous replication is only one node, and the other HS nodes become this POTENTIAL.
PRI
- It operates by PRI.

pgsql-data-status

The transitional state of data is displayed. When being start next time, this state is used to judge whether which node are the latest.

DISCONNECT
- It changes when it becomes impossible to HS from PRI due to the breakdown and the D-LAN failure etc.
{state} | {sync_state}
- It changes when HS connects it with PRI.
- The result of "select state and sync_state from pg_stat_replication" on PRI is displayed as it is as an attribute value.
- INIT, CATCHUP, and STREAMING are displayed to {state} and ASYNC, POTENTIAL, and SYNC are displayed to {sync_state} (All in this ? )
LATEST
- It changes to Master(PRI).

These states are the transitional state of final data, and it may be not consistent with the state of actual data. For instance, During PRI, the state is "LATEST", but the node stop or down, this state "LATEST" is maintained. It never changes to "DISCONNECT" for oneself. When other node newly promote Master, this new Master changes the state of old Master to "DISCONNECT". When any node can not become Master, this "LATEST" will be keepped.

pgsql-master-baseline

It is displayed when promote is called. This attribute is used when starting PosgreSQL as a HS to prohibit it from starting if data is inconsistent.

pgsql-xlog-replay-loc

There is no Master node, it is displayed. RA decide to promote one node to Master comparing with the value of the last xlog replay location among other node.

sample setting of crm

property \
	no-quorum-policy="ignore" \
	stonith-enabled="false" \
    crmd-transition-delay="2s"

rsc_defaults \
	resource-stickiness="INFINITY" \
	migration-threshold="1"

ms msPostgresql postgresql \
	meta \
		master-max="1" \
		master-node-max="1" \
		clone-max="2" \
		clone-node-max="1" \
		notify="true"

clone clnPingCheck pingCheck
group master-group vip-master vip-rep

primitive vip-master ocf:heartbeat:IPaddr2 \
	params \
		ip="192.168.0.200" \
		nic="eth0" \
		cidr_netmask="24" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="block"

primitive vip-rep ocf:heartbeat:IPaddr2 \
	params \
		ip="192.168.3.200" \
		nic="eth3" \
		cidr_netmask="24" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="block"

primitive vip-slave ocf:heartbeat:IPaddr2 \
	params \
		ip="192.168.0.202" \
		nic="eth0" \
		cidr_netmask="24" \
    meta \
        resource-stickiness="1" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="block"

primitive postgresql ocf:heartbeat:pgsql \
	params \
		pgctl="/usr/local/pgsql/bin/pg_ctl" \
		psql="/usr/local/pgsql/bin/psql" \
		pgctldata="/usr/local/pgsql/bin/pg_controldata" \
		pgdata="/var/lib/pgsql/9.1/data/" \
		start_opt="-p 5432" \
		rep_mode="sync" \
        node_list="pm01 pm02" \
        restore_command="cp /var/lib/pgsql/9.1/data/pg_archive/%f %p" \
        master_ip="192.168.3.200" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="7s" on-fail="restart" \
	op monitor timeout="60s" interval="2s"  on-fail="restart" role="Master" \
	op promote timeout="60s" interval="0s"  on-fail="restart" \
	op demote  timeout="60s" interval="0s"  on-fail="block" \
	op stop    timeout="60s" interval="0s"  on-fail="block" \
	op notify  timeout="60s" interval="0s"

primitive pingCheck ocf:pacemaker:pingd \
	params \
		name="default_ping_set" \
		host_list="192.168.0.254" \
		multiplier="100" \
	op start   timeout="60s" interval="0s"  on-fail="restart" \
	op monitor timeout="60s" interval="10s" on-fail="restart" \
	op stop    timeout="60s" interval="0s"  on-fail="ignore"

location rsc_location-1 vip-slave \
	rule  200: pgsql-status eq "HS:sync" \
	rule  100: pgsql-status eq "PRI" \
	rule  -inf: not_defined pgsql-status \
	rule  -inf: pgsql-status ne "HS:sync" and pgsql-status ne "PRI"

location rsc_location-2 msPostgresql \
	rule role=master  200: #uname eq pm01 \
	rule role=master  100: #uname eq pm02 \
	rule role=master -inf: defined fail-count-vip-master \
	rule role=master -inf: defined fail-count-vip-rep \
	rule -inf: not_defined default_ping_set or default_ping_set lt 100

colocation rsc_colocation-1 inf: msPostgresql        clnPingCheck
colocation rsc_colocation-2 inf: master-group        msPostgresql:Master

order rsc_order-1 0: clnPingCheck          msPostgresql
order rsc_order-2 0: msPostgresql:promote  master-group:start   symmetrical=false
order rsc_order-3 0: msPostgresql:demote   master-group:stop    symmetrical=false