Testing Pacemaker qDevice Quorum with Db2 HADR
In this article I will show the tests I did with Db2 HADR and Pacemaker using qDevice Quorum in the following scenarios:
1- Turning off one specific machine of the HADR cluster
2- Turning off the network of one specific machine of the HADR cluster
To configure the qDevice Quorum topology and/or the Two-Node Quorum topology, you can refer to my article here:
Automating Db2 HADR with Pacemaker
https://github.com/fsmegale/Db2-luw/wiki/Automating-Db2-HADR-with-Pacemaker
The test with Two-Node Quorum is described in the link above as well.
To learn more about Quorum in Pacemaker, check the following documentation:
Quorum devices support on Pacemaker
https://www.ibm.com/docs/en/db2/11.5?topic=component-quorum-devices-support-pacemaker
In a Db2 HADR cluster automated by Pacemaker, each HADR node receives an ID.
In case of a network or host failure, the quorum device will pick the node with the lowest node ID to survive.
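To see which node ID each host received, the standard Pacemaker/corosync tools can be used (a sketch; run as root on either HADR host):
# List cluster members with their corosync node IDs
crm_node -l
# Show the quorum status, including the qdevice vote
corosync-quorumtool -s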
All hosts are running SUSE 15 SP3.
The HADR hosts are running Db2 v11.5.7.0.
Let's see the outputs that show the configuration:
db2cm -list
hostHADR1:/home/db2inst1/sqllib/bin # ./db2cm -list
Cluster Status
Domain information:
Error: Could not get Pacemaker version. - 679
Domain name = HadrPaceDomain
Pacemaker version =
Corosync version = 3.0.4
Current domain leader = hostHADR2
Number of nodes = 2
Number of resources = 6
Node information:
Name name State
---------------- --------
hostHADR1 Online
hostHADR2 Online
Resource Information:
Resource Name = db2_db2inst1_db2inst1_HADRPACE
Resource Type = HADR
DB Name = HADRPACE
Managed = true
HADR Primary Instance = db2inst1
HADR Primary Node = hostHADR2
HADR Primary State = Online
HADR Standby Instance = db2inst1
HADR Standby Node = hostHADR1
HADR Standby State = Online
Resource Name = db2_hostHADR1_db2inst1_0
State = Online
Managed = true
Resource Type = Instance
Node = hostHADR1
Instance Name = db2inst1
Resource Name = db2_hostHADR1_eth0
State = Online
Managed = true
Resource Type = Network Interface
Node = hostHADR1
Interface Name = eth0
Resource Name = db2_hostHADR2_db2inst1_0
State = Online
Managed = true
Resource Type = Instance
Node = hostHADR2
Instance Name = db2inst1
Resource Name = db2_hostHADR2_eth0
State = Online
Managed = true
Resource Type = Network Interface
Node = hostHADR2
Interface Name = eth0
Fencing Information:
Not configured
Quorum Information:
Qdevice
Qdevice information
-------------------
Model: Net
Node ID: 1
Configured node list:
0 Node ID = 1
1 Node ID = 2
Membership node list: 1, 2
Qdevice-net information
----------------------
Cluster name: HadrPaceDomain
QNetd host: qDeviceHost:5403
Algorithm: LMS
Tie-breaker: Node with lowest node ID
State: Connected
crm status
hostHADR2:~ # crm status
Cluster Summary:
* Stack: corosync
* Current DC: hostHADR2 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Fri Jan 21 10:44:39 2022
* Last change: Fri Jan 21 10:26:13 2022 by db2inst1 via crm_resource on hostHADR2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ hostHADR1 hostHADR2 ]
Full List of Resources:
* db2_hostHADR1_eth0 (ocf::heartbeat:db2ethmon): Started hostHADR1
* db2_hostHADR2_eth0 (ocf::heartbeat:db2ethmon): Started hostHADR2
* db2_hostHADR1_db2inst1_0 (ocf::heartbeat:db2inst): Started hostHADR1
* db2_hostHADR2_db2inst1_0 (ocf::heartbeat:db2inst): Started hostHADR2
* Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
* Masters: [ hostHADR2 ]
* Slaves: [ hostHADR1 ]
corosync-qdevice-tool -s
hostHADR1:~ # corosync-qdevice-tool -s
Qdevice information
-------------------
Model: Net
Node ID: 1
Configured node list:
0 Node ID = 1
1 Node ID = 2
Membership node list: 1, 2
Qdevice-net information
----------------------
Cluster name: HadrPaceDomain
QNetd host: qDeviceHost:5403
Algorithm: LMS
Tie-breaker: Node with lowest node ID
State: Connected
corosync-qnetd-tool -l (from the qDeviceHost machine)
qDeviceHost:~ # corosync-qnetd-tool -l
Cluster "HadrPaceDomain":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 1:
Client address: ::ffff:192.168.145.134:55276
Configured node list: 1, 2
Membership node list: 1, 2
Vote: ACK (ACK)
Node ID 2:
Client address: ::ffff:192.168.145.136:39638
Configured node list: 1, 2
Membership node list: 1, 2
Vote: ACK (ACK)
db2pd -hadr -d hadrdb
db2inst1@hostHADR2:~> db2pd -hadr -d hadrdb
Database HADRDB not activated on database member 0 or this database name cannot be found in the local database directory.
Option -hadr requires -db <database> or -alldbs option and active database.
db2inst1@hostHADR2:~> db2 list db direcotry
SQL0104N An unexpected token "direcotry" was found following "DB". Expected
tokens may include: "DIRECTORY". SQLSTATE=42601
db2inst1@hostHADR2:~> db2 list db directory
System Database Directory
Number of entries in the directory = 1
Database 1 entry:
Database alias = HADRPACE
Database name = HADRPACE
Local database directory = /home/db2inst1
Database release level = 15.00
Comment =
Directory entry type = Indirect
Catalog database partition number = 0
Alternate server hostname = 192.168.145.134
Alternate server port number = 25010
db2inst1@hostHADR2:~> db2pd -hadr -d hadrpace
Database Member 0 -- Database HADRPACE -- Active -- Up 1 days 19:40:44 -- Date 2022-01-21-11.13.46.829912
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
HADR_FLAGS = TCP_PROTOCOL
PRIMARY_MEMBER_HOST = 192.168.145.136
PRIMARY_INSTANCE = db2inst1
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = 192.168.145.134
STANDBY_INSTANCE = db2inst1
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 01/21/2022 11:09:38.515655 (1642774178)
HEARTBEAT_INTERVAL(seconds) = 1
HEARTBEAT_MISSED = 0
HEARTBEAT_EXPECTED = 7427
HADR_TIMEOUT(seconds) = 3
TIME_SINCE_LAST_RECV(seconds) = 0
PEER_WAIT_LIMIT(seconds) = 0
LOG_HADR_WAIT_CUR(seconds) = 0.000
LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.307444
LOG_HADR_WAIT_ACCUMULATED(seconds) = 8.608
LOG_HADR_WAIT_COUNT = 28
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
PRIMARY_LOG_FILE,PAGE,POS = S0000007.LOG, 386, 75628632
STANDBY_LOG_FILE,PAGE,POS = S0000007.LOG, 386, 75628632
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000007.LOG, 386, 75628632
STANDBY_RECV_REPLAY_GAP(bytes) = 0
PRIMARY_LOG_TIME = 01/21/2022 10:55:49.000000 (1642773349)
STANDBY_LOG_TIME = 01/21/2022 10:55:49.000000 (1642773349)
STANDBY_REPLAY_LOG_TIME = 01/21/2022 10:55:49.000000 (1642773349)
STANDBY_RECV_BUF_SIZE(pages) = 3646
STANDBY_RECV_BUF_PERCENT = 0
STANDBY_SPOOL_LIMIT(pages) = 25600
STANDBY_SPOOL_PERCENT = 0
STANDBY_ERROR_TIME = NULL
PEER_WINDOW(seconds) = 120
PEER_WINDOW_END = 01/21/2022 11:15:46.000000 (1642774546)
READS_ON_STANDBY_ENABLED = N
HADR_LAST_TAKEOVER_TIME = 01/10/2022 18:43:54.000000 (1641851034)
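Throughout the tests, a quick way to watch the fields that matter here (role, state, connection status, peer window) is to filter the db2pd output (a sketch):
# Show only the key HADR fields
db2pd -hadr -d hadrpace | grep -E "HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS|PEER_WINDOW"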
Both tests make hostHADR2 unavailable in some way.
As we saw in the topology:
- hostHADR1 has the Standby database and the lowest Pacemaker node ID (ID 1).
- hostHADR2 has the Primary database and the highest Pacemaker node ID (ID 2).
Remember the note above?
"Quorum device will pick the node with the lowest node id to survive in case of network or host failure."
So in both tests the expected behavior is that hostHADR1 survives and gets promoted to Primary:
- Test 1: turning off the hostHADR2 machine
- Test 2: turning off the hostHADR2 network
Starting with Test 1, I shut down the hostHADR2 machine:
hostHADR2:~ # shutdown now
As soon as hostHADR2 was turned off, hostHADR1 became the Primary:
db2inst1@hostHADR1:~> db2pd -d hadrpace -hadr
Database Member 0 -- Database HADRPACE -- Active -- Up 1 days 20:09:06 -- Date 2022-01-21-11.42.03.887202
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = DISCONNECTED
HADR_FLAGS =
PRIMARY_MEMBER_HOST = 192.168.145.134
PRIMARY_INSTANCE = db2inst1
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = 192.168.145.136
STANDBY_INSTANCE = db2inst1
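I was also watching the cluster state from hostHADR1 with a simple loop (a sketch; the interval is arbitrary):
# Refresh the cluster view every 2 seconds
watch -n 2 crm status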
The "crm status" changed while I was monitoring it. Look:
hostHADR1:/home/db2inst1/sqllib/bin # crm status
Cluster Summary:
* Stack: corosync
* Current DC: hostHADR2 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Fri Jan 21 11:43:08 2022
* Last change: Fri Jan 21 11:42:02 2022 by root via crm_attribute on hostHADR1
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ hostHADR1 ]
* OFFLINE: [ hostHADR2 ]
Full List of Resources:
* db2_hostHADR1_eth0 (ocf::heartbeat:db2ethmon): Stopped
* db2_hostHADR2_eth0 (ocf::heartbeat:db2ethmon): Stopped
* db2_hostHADR1_db2inst1_0 (ocf::heartbeat:db2inst): Stopped
* db2_hostHADR2_db2inst1_0 (ocf::heartbeat:db2inst): Stopped
* Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
* Stopped: [ hostHADR1 hostHADR2 ]
then...
hostHADR1:/home/db2inst1/sqllib/bin # crm status
Cluster Summary:
* Stack: corosync
* Current DC: hostHADR1 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Fri Jan 21 11:44:25 2022
* Last change: Fri Jan 21 11:42:02 2022 by root via crm_attribute on hostHADR1
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ hostHADR1 ]
* OFFLINE: [ hostHADR2 ]
Full List of Resources:
* db2_hostHADR1_eth0 (ocf::heartbeat:db2ethmon): Started hostHADR1
* db2_hostHADR2_eth0 (ocf::heartbeat:db2ethmon): Stopped
* db2_hostHADR1_db2inst1_0 (ocf::heartbeat:db2inst): Started hostHADR1
* db2_hostHADR2_db2inst1_0 (ocf::heartbeat:db2inst): Stopped
* Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
* Masters: [ hostHADR1 ]
* Stopped: [ hostHADR2 ]
hostHADR1:/home/db2inst1/sqllib/bin # ./db2cm -list
Cluster Status
Domain information:
Error: Could not get Pacemaker version. - 679
Domain name = HadrPaceDomain
Pacemaker version =
Corosync version = 3.0.4
Current domain leader = hostHADR2
Number of nodes = 2
Number of resources = 6
Node information:
Name name State
---------------- --------
hostHADR1 Online
hostHADR2 Offline
Resource Information:
Resource Name = db2_db2inst1_db2inst1_HADRPACE
Resource Type = HADR
DB Name = HADRPACE
Managed = true
HADR Primary Instance =
HADR Primary Node =
HADR Primary State = Offline
HADR Standby Instance =
HADR Standby Node =
HADR Standby State = Offline
Resource Name = db2_hostHADR1_db2inst1_0
State = Offline
Managed = true
Resource Type = Instance
Node = hostHADR1
Instance Name = db2inst1
Resource Name = db2_hostHADR1_eth0
State = Offline
Managed = true
Resource Type = Network Interface
Node = hostHADR1
Interface Name = eth0
Resource Name = db2_hostHADR2_db2inst1_0
State = Offline
Managed = true
Resource Type = Instance
Node = hostHADR2
Instance Name = db2inst1
Resource Name = db2_hostHADR2_eth0
State = Offline
Managed = true
Resource Type = Network Interface
Node = hostHADR2
Interface Name = eth0
Fencing Information:
Not configured
Quorum Information:
Qdevice
Qdevice information
-------------------
Model: Net
Node ID: 1
Configured node list:
0 Node ID = 1
1 Node ID = 2
Membership node list: 1
Qdevice-net information
----------------------
Cluster name: HadrPaceDomain
QNetd host: qDeviceHost:5403
Algorithm: LMS
Tie-breaker: Node with lowest node ID
State: Connected
- db2diag.log messages
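To pull HADR-related records out of a busy db2diag.log, a filter like the following can help (a sketch; the ":=" operator of db2diag's -g option means "contains"):
# Show only records from HADR functions
db2diag -g "FUNCTION:=High Availability Disaster Recovery" db2diag.log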
I got some key messages from db2diag.log to understand what happened:
2022-01-21-11.41.56.585733-180 I14052321E440 LEVEL: Error
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrRecvMsg, probe:11810
MESSAGE : Zero bytes received. Remote end may have closed connection
2022-01-21-11.41.56.586375-180 I14052762E503 LEVEL: Error
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrRecvMsgS, probe:30080
MESSAGE : HADR standby recv error:
DATA #1 : Hexdump, 4 bytes
0x00007FAEC27F26BC : 0100 0000 ....
2022-01-21-11.41.56.586439-180 I14053266E488 LEVEL: Error
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20215
MESSAGE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
"Communication with HADR partner was lost"
2022-01-21-11.41.56.589613-180 E14053755E450 LEVEL: Event
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to HDR_S_DISCONN_PEER (was HDR_S_PEER), connId=17
2022-01-21-11.41.56.590047-180 I14054206E434 LEVEL: Warning
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrCloseConn, probe:30595
DATA #1 : <preformatted>
Peer window end time : 1642776236
2022-01-21-11.41.56.591586-180 I14054641E463 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32201
DATA #1 : <preformatted>
Info: HADR Socket send buffer size, SO_SNDBUF: 16384 bytes
2022-01-21-11.41.56.591654-180 I14055105E467 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32251
DATA #1 : <preformatted>
Info: HADR Socket receive buffer size, SO_RCVBUF: 131072 bytes
2022-01-21-11.41.56.592061-180 I14055573E509 LEVEL: Error
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
MESSAGE : ZRC=0x810F0019=-2129723367=SQLO_CONN_REFUSED "Connection refused"
DATA #1 : <preformatted>
Failed to connect to primary.
2022-01-21-11.41.59.201524-180 E14056083E510 LEVEL: Event
PID : 46656 TID : 140388539033344 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
APPHDL : 0-102 APPID: *LOCAL.db2inst1.220121144159
AUTHID : DB2INST1 HOSTNAME: hostHADR1
EDUID : 39 EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, base sys utilities, sqeDBMgr::StartUsingLocalDatabase, probe:13
START : Received TAKEOVER HADR command.
2022-01-21-11.41.59.324320-180 I14056594E459 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleInitialRequest, probe:46000
MESSAGE : Standby has initiated a takeover by force peer window only.
2022-01-21-11.41.59.324422-180 I14057054E487 LEVEL: Error
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSendTakeoverMsg, probe:210
MESSAGE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
"Communication with HADR partner was lost"
2022-01-21-11.41.59.324465-180 I14057542E502 LEVEL: Warning
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleInitialRequest, probe:46010
MESSAGE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
"Communication with HADR partner was lost"
2022-01-21-11.41.59.324497-180 I14058045E431 LEVEL: Warning
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSDoTakeover, probe:47265
DATA #1 : <preformatted>
Skip posting hdrTakeoverDone
2022-01-21-11.41.59.331478-180 I14058477E557 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrBuildAndSendTkResponse, probe:32000
DATA #1 : String, 46 bytes
Enquing takeover request for stage and event:
DATA #2 : String, 23 bytes
HDR_TK_P_DONE_DRAIN_LOG
DATA #3 : String, 19 bytes
HDR_TKEV_NEXT_STAGE
2022-01-21-11.41.59.435784-180 I14059035E430 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleDoneDrain, probe:46800
MESSAGE : Standby switching roles to primary.
(...)
2022-01-21-11.42.01.610768-180 I14089625E496 LEVEL: Info
PID : 46656 TID : 140388560004864 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
APPHDL : 0-8 APPID: *LOCAL.DB2.220119183259
HOSTNAME: hostHADR1
EDUID : 34 EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:4820
DATA #1 : <preformatted>
HADR Takeover: Processing database configuration change.
2022-01-21-11.42.01.613753-180 E14090122E488 LEVEL: Warning
PID : 46656 TID : 140388560004864 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
APPHDL : 0-8 APPID: *LOCAL.DB2.220119183259
HOSTNAME: hostHADR1
EDUID : 34 EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:6600
MESSAGE : ADM1611W The rollforward recovery phase has been completed.
2022-01-21-11.42.01.616327-180 I14090611E490 LEVEL: Info
PID : 46656 TID : 140388560004864 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
APPHDL : 0-8 APPID: *LOCAL.DB2.220119183259
HOSTNAME: hostHADR1
EDUID : 34 EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:9500
MESSAGE : Stopping replay master on standby
DATA #1 : Boolean, 1 bytes
true
2022-01-21-11.42.01.660160-180 E14091689E1245 LEVEL: Event
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetDbRoleAndDbType, probe:10020
CHANGE : HADR DATABASE ROLE/TYPE -
HADR database role set to PRIMARY (was STANDBY).
HADR database type set to PHYSICAL (was PHYSICAL).
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x00007FAED53114C9 _Z21hdrSetDbRoleAndDbTypeP9SQLP_DBCBjj + 0x6F9
[1] 0x00007FAED5380127 _Z24hdrStbyTkHandleDoneDrainP6hdrEduP8HDR_DBCB + 0x22C7
[2] 0x00007FAED5392E83 _ZN6hdrEdu14hdrSDoTakeoverEP20HDR_MESSAGE_TAKEOVERP17HDR_RQST_TAKEOVER + 0xA33
[3] 0x00007FAED534D93F _ZN6hdrEdu7hdrEduSEPm + 0x1D7F
[4] 0x00007FAED534F352 _ZN6hdrEdu11hdrEduEntryEv + 0x552
[5] 0x00007FAED5350D07 _ZN6hdrEdu6RunEDUEv + 0x27
[6] 0x00007FAEDF7FCFC4 _ZN9sqzEDUObj9EDUDriverEv + 0x1A4
[7] 0x00007FAEDDA8A1D9 sqloEDUEntry + 0x2A9
[8] 0x00007FAEE46CC94A /lib64/libpthread.so.0 + 0x894A
[9] 0x00007FAED1CF3D0F clone + 0x3F
2022-01-21-11.42.01.680170-180 E14092935E465 LEVEL: Event
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to HDR_P_REM_CATCHUP_PENDING (was HDR_S_DISCONN_PEER), connId=17
(...)
2022-01-21-11.42.01.687074-180 I14096856E411 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21152
MESSAGE : HADR Startup has completed.
2022-01-21-11.42.01.687516-180 I14097268E440 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleDoneDrain, probe:46840
MESSAGE : Standby has completed takeover (now primary).
(...)
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEdu::hdrEduS, probe:21900
MESSAGE : Standby Finished.
2022-01-21-11.42.01.700128-180 I14099553E402 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEdu::hdrEduP, probe:20301
MESSAGE : Primary Started.
In this case, Pacemaker ran a "TAKEOVER BY FORCE PEER WINDOW ONLY".
More information about it:
"
BY FORCE
Specifies that the database is not to wait for confirmation that the original HADR primary database is shut down. Unless you are using SUPERASYNC synchronization mode, this option is needed if the HADR pair is not in peer state.
PEER WINDOW ONLY
When this option is specified, no committed transaction loss occurs if the command succeeds and the primary database is brought down before the end of the peer window period. You set the peer window period by assigning a positive integer value to the hadr_peer_window database configuration parameter. Not bringing down the primary database before the peer window expires results in a split brain. If the TAKEOVER BY FORCE PEER WINDOW ONLY command is run when the HADR pair is not in a peer state, or is in a disconnected peer state, an error is returned.
You cannot use the PEER WINDOW ONLY option when the synchronization mode is set to ASYNC or SUPERASYNC.
"
TAKEOVER HADR command
https://www.ibm.com/docs/en/db2/11.5?topic=commands-takeover-hadr
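For reference, the manual equivalent of what Pacemaker issued on the standby looks like this (a sketch; do not run it yourself while the resource is managed by Pacemaker):
# Manual equivalent of the automated takeover (for reference only)
db2 takeover hadr on db hadrpace by force peer window only
# The 120-second peer window seen in the db2pd output is a database configuration parameter:
db2 update db cfg for hadrpace using HADR_PEER_WINDOW 120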
Checking the "crm status" after hostHADR2 was turned on. We can see all online and started.
hostHADR1:/home/db2inst1/sqllib/bin # crm status
Cluster Summary:
* Stack: corosync
* Current DC: hostHADR1 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Fri Jan 21 12:27:45 2022
* Last change: Fri Jan 21 12:23:44 2022 by root via crm_attribute on hostHADR2
* 2 nodes configured
* 6 resource instances configured
Node List:
* Online: [ hostHADR1 hostHADR2 ]
Full List of Resources:
* db2_hostHADR1_eth0 (ocf::heartbeat:db2ethmon): Started hostHADR1
* db2_hostHADR2_eth0 (ocf::heartbeat:db2ethmon): Started hostHADR2
* db2_hostHADR1_db2inst1_0 (ocf::heartbeat:db2inst): Started hostHADR1
* db2_hostHADR2_db2inst1_0 (ocf::heartbeat:db2inst): Started hostHADR2
* Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
* Masters: [ hostHADR1 ]
* Slaves: [ hostHADR2 ]
hostHADR1:/home/db2inst1/sqllib/bin # ./db2cm -list
Cluster Status
Domain information:
Error: Could not get Pacemaker version. - 679
Domain name = HadrPaceDomain
Pacemaker version =
Corosync version = 3.0.4
Current domain leader = hostHADR1
Number of nodes = 2
Number of resources = 6
Node information:
Name name State
---------------- --------
hostHADR1 Online
hostHADR2 Online
Resource Information:
Resource Name = db2_db2inst1_db2inst1_HADRPACE
Resource Type = HADR
DB Name = HADRPACE
Managed = true
HADR Primary Instance = db2inst1
HADR Primary Node = hostHADR1
HADR Primary State = Online
HADR Standby Instance = db2inst1
HADR Standby Node = hostHADR2
HADR Standby State = Online
Resource Name = db2_hostHADR1_db2inst1_0
State = Online
Managed = true
Resource Type = Instance
Node = hostHADR1
Instance Name = db2inst1
Resource Name = db2_hostHADR1_eth0
State = Online
Managed = true
Resource Type = Network Interface
Node = hostHADR1
Interface Name = eth0
Resource Name = db2_hostHADR2_db2inst1_0
State = Online
Managed = true
Resource Type = Instance
Node = hostHADR2
Instance Name = db2inst1
Resource Name = db2_hostHADR2_eth0
State = Online
Managed = true
Resource Type = Network Interface
Node = hostHADR2
Interface Name = eth0
Fencing Information:
Not configured
Quorum Information:
Qdevice
Qdevice information
-------------------
Model: Net
Node ID: 1
Configured node list:
0 Node ID = 1
1 Node ID = 2
Membership node list: 1, 2
Qdevice-net information
----------------------
Cluster name: HadrPaceDomain
QNetd host: qDeviceHost:5403
Algorithm: LMS
Tie-breaker: Node with lowest node ID
State: Connected
db2inst1@hostHADR1:~/sqllib/db2dump/DIAG0000> db2pd -d hadrpace -hadr
Database Member 0 -- Database HADRPACE -- Active -- Up 1 days 20:56:35 -- Date 2022-01-21-12.29.32.830386
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
HADR_FLAGS = TCP_PROTOCOL
PRIMARY_MEMBER_HOST = 192.168.145.134
PRIMARY_INSTANCE = db2inst1
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = 192.168.145.136
STANDBY_INSTANCE = db2inst1
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 01/21/2022 12:23:45.098044 (1642778625)
HEARTBEAT_INTERVAL(seconds) = 1
HEARTBEAT_MISSED = 0
HEARTBEAT_EXPECTED = 9480
HADR_TIMEOUT(seconds) = 3
TIME_SINCE_LAST_RECV(seconds) = 0
PEER_WAIT_LIMIT(seconds) = 0
LOG_HADR_WAIT_CUR(seconds) = 0.000
LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
LOG_HADR_WAIT_COUNT = 0
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
PRIMARY_LOG_FILE,PAGE,POS = S0000008.LOG, 8, 78261115
STANDBY_LOG_FILE,PAGE,POS = S0000008.LOG, 8, 78261115
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000008.LOG, 8, 78261115
STANDBY_RECV_REPLAY_GAP(bytes) = 0
PRIMARY_LOG_TIME = 01/21/2022 11:59:13.000000 (1642777153)
STANDBY_LOG_TIME = 01/21/2022 11:59:13.000000 (1642777153)
STANDBY_REPLAY_LOG_TIME = 01/21/2022 11:59:13.000000 (1642777153)
STANDBY_RECV_BUF_SIZE(pages) = 3646
STANDBY_RECV_BUF_PERCENT = 0
STANDBY_SPOOL_LIMIT(pages) = 25600
STANDBY_SPOOL_PERCENT = 0
STANDBY_ERROR_TIME = NULL
PEER_WINDOW(seconds) = 120
PEER_WINDOW_END = 01/21/2022 12:31:32.000000 (1642779092)
READS_ON_STANDBY_ENABLED = N
HADR_LAST_TAKEOVER_TIME = 01/21/2022 11:41:59.000000 (1642776119)
Back on hostHADR1, db2diag.log shows the old primary reconnecting and rejoining the HADR pair as the new standby:
2022-01-21-12.23.44.987880-180 I9379E480 LEVEL: Warning
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleRemoteConn, probe:30160
DATA #1 : <preformatted>
TCP socket connection accepted. Remote Host: 192.168.145.136 Port: 52951
2022-01-21-12.23.45.095050-180 I9860E494 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleHsAck, probe:43900
DATA #1 : <preformatted>
Handshake HDR_MSG_HDRHS message is received from 192.168.145.136:40002 (192.168.145.136:40002)
2022-01-21-12.23.45.097942-180 I10355E437 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrAfterVerifySystem, probe:30440
DATA #1 : <preformatted>
Connection succeeded, connId=18
(...)
2022-01-21-12.23.45.201494-180 I11751E465 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrPerformPairValidation, probe:21100
DATA #1 : <preformatted>
Old primary requesting rejoining HADR pair as a standby
2022-01-21-12.23.45.304517-180 E12217E464 LEVEL: Event
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to HDR_P_REM_CATCHUP (was HDR_P_REM_CATCHUP_PENDING), connId=18
2022-01-21-12.23.45.317266-180 I12682E470 LEVEL: Info
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEdu::hdrEduP, probe:20445
DATA #1 : <preformatted>
Remote catchup starts at lso 75821614.
Using page start lso 75817677.
(...)
2022-01-21-12.23.47.566859-180 E21873E450 LEVEL: Event
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to HDR_P_NPEER (was HDR_P_REM_CATCHUP), connId=18
2022-01-21-12.23.47.567289-180 I22324E586 LEVEL: Info
PID : 46656 TID : 140388597753600 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 25 EDUNAME: db2loggw (HADRPACE) 0
FUNCTION: DB2 UDB, data protection services, sqpLoggwEdu::sqlpLoggwMain, probe:4399
DATA #1 : <preformatted>
Database log writing has been suspended at lso 78261116. Waiting to be resumed.
Current peer epoch: 0
Primary peer epoch: 0
Standby peer epoch: 0
Peer epoch begin: 0
Peer epoch end: 0
2022-01-21-12.23.47.571293-180 E22911E443 LEVEL: Event
PID : 46656 TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 33 EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to HDR_P_PEER (was HDR_P_NPEER), connId=18
(...)
2022-01-21-12.23.48.585512-180 I27587E560 LEVEL: Info
PID : 46656 TID : 140388597753600 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : HADRPACE
HOSTNAME: hostHADR1
EDUID : 25 EDUNAME: db2loggw (HADRPACE) 0
FUNCTION: DB2 UDB, data protection services, sqpLoggwEdu::sqlpLoggwMain, probe:4427
DATA #1 : <preformatted>
Database log writing has been resumed at lso 78261116.
Current peer epoch: 0
Primary peer epoch: 0
Standby peer epoch: 0
Peer epoch begin: 0
Peer epoch end: 0
In this second test I will turn off the hostHADR2 network.
The point of this test is to:
- check whether Pacemaker will really stop the Db2 instance on hostHADR2 automatically
- check whether any Split Brain occurs after that
For more information about Split Brain:
How a TSAMP cluster gets into Db2 Split-brain
https://www.ibm.com/support/pages/how-tsamp-cluster-gets-db2-split-brain
Once again, the starting point of this test is the same as the previous one:
- hostHADR1 has the Standby Database and the lowest Pacemaker ID (ID 1).
- hostHADR2 has the Primary Database and the highest Pacemaker ID (ID 2).
The topology is described in the "Topology" section above.
The initial outputs from "db2pd", "db2cm -list", etc. can be seen in the "Configuration" section above.
The network adapter of the hostHADR2 machine was turned off with the following command:
ifdown eth0
PS: On SUSE Linux 15 you can check the network adapters and their IPs with the following command:
ip a
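To confirm the adapter is really down before checking the cluster's reaction, something like this can be used (a sketch):
# Link state should show DOWN after ifdown
ip link show eth0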
I verified that Pacemaker stopped the Db2 instance on hostHADR2.
Pacemaker then performed a "takeover by force" (with "peer window only"), and hostHADR1 became the Primary:
db2inst1@hostHADR1:~/sqllib/db2dump/DIAG0000> db2pd -hadr -d hadrpace
Database Member 0 -- Database HADRPACE -- Active -- Up 2 days 00:13:03 -- Date 2022-01-21-15.46.00.097250
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = DISCONNECTED
HADR_FLAGS =
PRIMARY_MEMBER_HOST = 192.168.145.134
PRIMARY_INSTANCE = db2inst1
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = 192.168.145.136
STANDBY_INSTANCE = db2inst1
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = DISCONNECTED
HADR_CONNECT_STATUS_TIME = 01/21/2022 15:24:41.902454 (1642789481)
HEARTBEAT_INTERVAL(seconds) = 1
HEARTBEAT_MISSED = 0
HEARTBEAT_EXPECTED = 18755
HADR_TIMEOUT(seconds) = 3
TIME_SINCE_LAST_RECV(seconds) = 0
PEER_WAIT_LIMIT(seconds) = 0
LOG_HADR_WAIT_CUR(seconds) = 0.000
LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000452
LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.101
LOG_HADR_WAIT_COUNT = 227
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 16384
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
PRIMARY_LOG_FILE,PAGE,POS = S0000010.LOG, 71, 86866515
STANDBY_LOG_FILE,PAGE,POS = S0000000.LOG, 0, 0
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000000.LOG, 0, 0
STANDBY_RECV_REPLAY_GAP(bytes) = 0
PRIMARY_LOG_TIME = 01/21/2022 15:27:59.000000 (1642789679)
STANDBY_LOG_TIME = NULL
STANDBY_REPLAY_LOG_TIME = NULL
PEER_WINDOW(seconds) = 120
HADR_LAST_TAKEOVER_TIME = 01/21/2022 15:26:16.000000 (1642789576)
Just to test, I created a table in the database to make sure some transaction log records would be generated on the new primary.
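A minimal sketch of what I mean (the table name is hypothetical):
# Force a few log records on the new primary
db2 connect to hadrpace
db2 "create table split_test (id int)"
db2 "insert into split_test values (1)"
db2 terminate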
Then I started the network adapter on hostHADR2 again:
ifup eth0
I checked that HADR still showed "DISCONNECTED" status in "db2pd -hadr".
- Two "PRIMARY" databases in "get db cfg"
The "get db cfg" showed that the database from both hosts (hostHADR1 and hostHADR2) was set to PRIMARY:
HADR database role = PRIMARY
HADR local host name (HADR_LOCAL_HOST) = 192.168.145.136
HADR local service name (HADR_LOCAL_SVC) = 40002
HADR remote host name (HADR_REMOTE_HOST) = 192.168.145.134
HADR remote service name (HADR_REMOTE_SVC) = 40001
HADR instance name of remote server (HADR_REMOTE_INST) = db2inst1
HADR timeout value (HADR_TIMEOUT) = 3
HADR target list (HADR_TARGET_LIST) =
HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
HADR spool log data limit (4KB) (HADR_SPOOL_LIMIT) = AUTOMATIC(25600)
HADR log replay delay (seconds) (HADR_REPLAY_DELAY) = 0
HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 120
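A quick way to compare the role on the two hosts is to filter the configuration output (a sketch; run on each host):
# Compare on hostHADR1 and hostHADR2
db2 get db cfg for hadrpace | grep -i "HADR database role"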
While trying to activate the database on hostHADR2, I got an error:
db2inst1@hostHADR2:~> db2 activate db hadrpace
SQL1776N The command cannot be issued on an HADR database. Reason code = "6".
(...)
6
This database is an old primary database. It cannot be started
because the standby has become the new primary through forced
takeover.
User Response:
6
Consider the following scenarios when deciding on a suitable
response:
6.1) If the HADR primary database is running on a different server,
then the database on this server should be started as an HADR
standby database. Perform the following steps:
* Issue the START HADR AS STANDBY command on the database on this
server.
* Verify that this new HADR standby database has established a
connection with the HADR primary database by examining the
HADR_CONNECT_STATUS field of the db2pd -hadr command or the
MON_GET_HADR() table function on the HADR primary database. Note,
an ADM12513E message will also be reported in the administration
notification log on the HADR primary database if a connection
cannot be established.
* If this new HADR standby database cannot establish a connection
to the HADR primary database, then this new HADR standby database
will need to be fully reinitialized based on a backup image of
the HADR primary database. Please see IBM Db2 Knowledge Center
article titled "Initializing high availability disaster recovery
(HADR)".
6.2) If the HADR primary database is not running on a different
server (or cannot be made to run), then the database on this server
should be started as an HADR primary database. Perform the following
steps:
* Issue the STOP HADR command on the database on this server.
* If you wish to establish an HADR standby database on a different
server before starting the database on this server as an HADR
primary database, then please see IBM Db2 Knowledge Center
article titled "Initializing high availability disaster recovery
(HADR)".
* Issue the START HADR AS PRIMARY command on the database on this
server (or the START HADR AS PRIMARY BY FORCE command if there is
no HADR standby database running yet).
Based on the "User Response" from error above (SQL1776N RC=6), I did:
db2inst1@hostHADR2:~> db2 start hadr on db hadrpace as standby
DB20000I The START HADR ON DATABASE command completed successfully.
And HADR became CONNECTED and reached PEER state:
db2inst1@hostHADR1:~/sqllib/db2dump/DIAG0000> db2pd -hadr -d hadrpace
Database Member 0 -- Database HADRPACE -- Active -- Up 2 days 00:31:59 -- Date 2022-01-21-16.04.56.714816
HADR_ROLE = PRIMARY
REPLAY_TYPE = PHYSICAL
HADR_SYNCMODE = NEARSYNC
STANDBY_ID = 1
LOG_STREAM_ID = 0
HADR_STATE = PEER
HADR_FLAGS = TCP_PROTOCOL
PRIMARY_MEMBER_HOST = 192.168.145.134
PRIMARY_INSTANCE = db2inst1
PRIMARY_MEMBER = 0
STANDBY_MEMBER_HOST = 192.168.145.136
STANDBY_INSTANCE = db2inst1
STANDBY_MEMBER = 0
HADR_CONNECT_STATUS = CONNECTED
HADR_CONNECT_STATUS_TIME = 01/21/2022 15:51:37.597068 (1642791097)
HEARTBEAT_INTERVAL(seconds) = 1
HEARTBEAT_MISSED = 0
HEARTBEAT_EXPECTED = 19554
HADR_TIMEOUT(seconds) = 3
TIME_SINCE_LAST_RECV(seconds) = 0
PEER_WAIT_LIMIT(seconds) = 0
LOG_HADR_WAIT_CUR(seconds) = 0.000
LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000452
LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.101
LOG_HADR_WAIT_COUNT = 227
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
PRIMARY_LOG_FILE,PAGE,POS = S0000010.LOG, 74, 86878440
STANDBY_LOG_FILE,PAGE,POS = S0000010.LOG, 74, 86878440
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000010.LOG, 74, 86878440
STANDBY_RECV_REPLAY_GAP(bytes) = 0
PRIMARY_LOG_TIME = 01/21/2022 15:50:49.000000 (1642791049)
STANDBY_LOG_TIME = 01/21/2022 15:50:49.000000 (1642791049)
STANDBY_REPLAY_LOG_TIME = 01/21/2022 15:50:49.000000 (1642791049)
STANDBY_RECV_BUF_SIZE(pages) = 3646
STANDBY_RECV_BUF_PERCENT = 0
STANDBY_SPOOL_LIMIT(pages) = 25600
STANDBY_SPOOL_PERCENT = 0
STANDBY_ERROR_TIME = NULL
PEER_WINDOW(seconds) = 120
PEER_WINDOW_END = 01/21/2022 16:06:56.000000 (1642792016)
READS_ON_STANDBY_ENABLED = N
HADR_LAST_TAKEOVER_TIME = 01/21/2022 15:26:16.000000 (1642789576)
Let's recall the topology:
- hostHADR1 has the Standby Database and the lowest Pacemaker ID (ID 1).
- hostHADR2 has the Primary Database and the highest Pacemaker ID (ID 2).
And now let's see the conclusion of the tests.
Test 1 - turning off the hostHADR2 machine:
- The hostHADR2 machine was turned off.
- Pacemaker performed a "takeover by force peer window only" successfully.
- After the hostHADR2 machine was turned back on:
- Pacemaker started the Db2 instance on hostHADR2
- HADR became CONNECTED and reached PEER state with no problems
- the resources showed Started/Online in the "crm status" output
- no Split Brain occurred
Test 2 - turning off the hostHADR2 network:
- The network adapter of hostHADR2 was turned off.
- Pacemaker performed a "takeover by force" successfully.
- Pacemaker stopped the Db2 instance on hostHADR2.
- At this point, to make sure some transaction log records would be created, I created a table in the database.
- After the network adapter of hostHADR2 was brought back up:
- Pacemaker started the Db2 instance on hostHADR2
- HADR showed DISCONNECTED status
- the database remained deactivated on hostHADR2
- the "HADR database role" (get db cfg) was shown as PRIMARY on both HADR hosts
- the "activate database" command failed with SQL1776N RC=6 on the old Primary (hostHADR2)
- the "start hadr on db hadrpace as standby" command then completed successfully on the old Primary (hostHADR2)
- HADR became CONNECTED and reached PEER state, working properly
- no Split Brain occurred
Pacemaker stopped the Db2 instance as soon as the network adapter went down, so no transaction log was generated on the old primary.
Because of that, the "start hadr on db hadrpace as standby" command worked fine on the old primary and no Split Brain happened.
In this test, Links 1 and 3 (see the topology image) went down because the network adapter was disabled.
Another great thing is that the qDevice host isn't actually part of the cluster, so a single qDevice host can be configured as the arbitrator node for N (several) Db2 HADR clusters automated by Pacemaker.
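This is easy to verify on the arbitrator host: the qnetd daemon is a single service, and its client list shows one block per connected cluster (a sketch, assuming the corosync-qnetd systemd unit is in use):
# On the qDeviceHost machine
systemctl status corosync-qnetd
corosync-qnetd-tool -l   # one "Cluster" block per connected cluster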
Automating Db2 HADR with Pacemaker
https://github.com/fsmegale/Db2-luw/wiki/Automating-Db2-HADR-with-Pacemaker
Quorum devices support on Pacemaker
https://www.ibm.com/docs/en/db2/11.5?topic=component-quorum-devices-support-pacemaker
TAKEOVER HADR command
https://www.ibm.com/docs/en/db2/11.5?topic=commands-takeover-hadr
How a TSAMP cluster gets into Db2 Split-brain
https://www.ibm.com/support/pages/how-tsamp-cluster-gets-db2-split-brain