Testing Pacemaker qDevice Quorum with Db2 HADR - fsmegale/Db2-luw GitHub Wiki

Introduction

In this article I will show the tests I did with Db2 HADR and Pacemaker using qDevice Quorum in the following scenarios:

1- Turning off one of the machines of the HADR cluster

2- Turning off the network of one of the machines of the HADR cluster

To configure the qDevice Quorum topology and/or the Two-Node Quorum topology, you can refer to my article here:

Automating Db2 HADR with Pacemaker
https://github.com/fsmegale/Db2-luw/wiki/Automating-Db2-HADR-with-Pacemaker

The test with Two-Node Quorum is described in the link above as well.

To know more about Quorum in Pacemaker, check the following documentation:

Quorum devices support on Pacemaker
https://www.ibm.com/docs/en/db2/11.5?topic=component-quorum-devices-support-pacemaker

Brief about Pacemaker with Db2 HADR

In Db2 HADR automated by Pacemaker, each HADR node receives an ID.

The quorum device will pick the node with the lowest node ID to survive in case of a network or host failure.
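
A quick way to see which ID each node got is to query corosync's runtime database. This is a minimal sketch, assuming the db2cm defaults; the key names come from corosync-cmapctl, not from db2cm itself:

# Show the node IDs known to corosync (run on any cluster node as root):
corosync-cmapctl | grep nodeid
# Example output for this environment:
#   nodelist.node.0.nodeid (u32) = 1
#   nodelist.node.1.nodeid (u32) = 2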

Test Environment Topology

Db2-HADR-Pacemaker-Topology

Environment

All hosts are running SUSE 15 SP3.
The HADR hosts are running Db2 v11.5.7.0.

Configuration

Let's see the outputs that show the configuration:

          db2cm -list
hostHADR1:/home/db2inst1/sqllib/bin # ./db2cm -list
      Cluster Status

Domain information:
Error: Could not get Pacemaker version. - 679
Domain name               = HadrPaceDomain
Pacemaker version         =
Corosync version          = 3.0.4
Current domain leader     = hostHADR2
Number of nodes           = 2
Number of resources       = 6

Node information:
Name name           State
----------------    --------
hostHADR1           Online
hostHADR2           Online

Resource Information:

Resource Name             = db2_db2inst1_db2inst1_HADRPACE
  Resource Type                 = HADR
    DB Name                     = HADRPACE
    Managed                     = true
    HADR Primary Instance       = db2inst1
    HADR Primary Node           = hostHADR2
    HADR Primary State          = Online
    HADR Standby Instance       = db2inst1
    HADR Standby Node           = hostHADR1
    HADR Standby State          = Online

Resource Name             = db2_hostHADR1_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = hostHADR1
    Instance Name               = db2inst1

Resource Name             = db2_hostHADR1_eth0
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = hostHADR1
    Interface Name              = eth0

Resource Name             = db2_hostHADR2_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = hostHADR2
    Instance Name               = db2inst1

Resource Name             = db2_hostHADR2_eth0
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = hostHADR2
    Interface Name              = eth0

Fencing Information:
  Not configured
Quorum Information:
  Qdevice

Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Membership node list:   1, 2

Qdevice-net information
----------------------
Cluster name:           HadrPaceDomain
QNetd host:             qDeviceHost:5403
Algorithm:              LMS
Tie-breaker:            Node with lowest node ID
State:                  Connected
                 crm status
hostHADR2:~ # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: hostHADR2 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Fri Jan 21 10:44:39 2022
  * Last change:  Fri Jan 21 10:26:13 2022 by db2inst1 via crm_resource on hostHADR2
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ hostHADR1 hostHADR2 ]

Full List of Resources:
  * db2_hostHADR1_eth0  (ocf::heartbeat:db2ethmon):      Started hostHADR1
  * db2_hostHADR2_eth0  (ocf::heartbeat:db2ethmon):      Started hostHADR2
  * db2_hostHADR1_db2inst1_0    (ocf::heartbeat:db2inst):        Started hostHADR1
  * db2_hostHADR2_db2inst1_0    (ocf::heartbeat:db2inst):        Started hostHADR2
  * Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
    * Masters: [ hostHADR2 ]
    * Slaves: [ hostHADR1 ]
             corosync-qdevice-tool -s
hostHADR1:~ # corosync-qdevice-tool -s
Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Membership node list:   1, 2

Qdevice-net information
----------------------
Cluster name:           HadrPaceDomain
QNetd host:             qDeviceHost:5403
Algorithm:              LMS
Tie-breaker:            Node with lowest node ID
State:                  Connected
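
For reference, these qdevice settings come from the quorum section of /etc/corosync/corosync.conf. Below is a hedged sketch of what db2cm generates for this topology; the exact contents on your system may differ:

# Excerpt (sketch) of /etc/corosync/corosync.conf:
quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            host: qDeviceHost     # the QNetd arbitrator host
            port: 5403            # default corosync-qnetd port
            algorithm: lms        # Last-Man-Standing, as reported above
            tie_breaker: lowest   # node with the lowest node ID survives
        }
    }
}
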
      corosync-qnetd-tool -l  //from qDeviceHost machine
qDeviceHost:~ # corosync-qnetd-tool -l
Cluster "HadrPaceDomain":
    Algorithm:          LMS
    Tie-breaker:        Node with lowest node ID
    Node ID 1:
        Client address:         ::ffff:192.168.145.134:55276
        Configured node list:   1, 2
        Membership node list:   1, 2
        Vote:                   ACK (ACK)
    Node ID 2:
        Client address:         ::ffff:192.168.145.136:39638
        Configured node list:   1, 2
        Membership node list:   1, 2
        Vote:                   ACK (ACK)
             db2pd -hadr -d hadrdb
db2inst1@hostHADR2:~> db2pd -hadr -d hadrdb

Database HADRDB not activated on database member 0 or this database name cannot be found in the local database directory.

Option -hadr requires -db <database> or -alldbs option and active database.
db2inst1@hostHADR2:~> db2 list db direcotry
SQL0104N  An unexpected token "direcotry" was found following "DB".  Expected
tokens may include:  "DIRECTORY".  SQLSTATE=42601
db2inst1@hostHADR2:~> db2 list db directory

 System Database Directory

 Number of entries in the directory = 1

Database 1 entry:

 Database alias                       = HADRPACE
 Database name                        = HADRPACE
 Local database directory             = /home/db2inst1
 Database release level               = 15.00
 Comment                              =
 Directory entry type                 = Indirect
 Catalog database partition number    = 0
 Alternate server hostname            = 192.168.145.134
 Alternate server port number         = 25010

db2inst1@hostHADR2:~> db2pd -hadr -d hadrpace

Database Member 0 -- Database HADRPACE -- Active -- Up 1 days 19:40:44 -- Date 2022-01-21-11.13.46.829912

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = PEER
                           HADR_FLAGS = TCP_PROTOCOL
                  PRIMARY_MEMBER_HOST = 192.168.145.136
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = 192.168.145.134
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 01/21/2022 11:09:38.515655 (1642774178)
          HEARTBEAT_INTERVAL(seconds) = 1
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 7427
                HADR_TIMEOUT(seconds) = 3
        TIME_SINCE_LAST_RECV(seconds) = 0
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.307444
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 8.608
                  LOG_HADR_WAIT_COUNT = 28
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
            PRIMARY_LOG_FILE,PAGE,POS = S0000007.LOG, 386, 75628632
            STANDBY_LOG_FILE,PAGE,POS = S0000007.LOG, 386, 75628632
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000007.LOG, 386, 75628632
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 01/21/2022 10:55:49.000000 (1642773349)
                     STANDBY_LOG_TIME = 01/21/2022 10:55:49.000000 (1642773349)
              STANDBY_REPLAY_LOG_TIME = 01/21/2022 10:55:49.000000 (1642773349)
         STANDBY_RECV_BUF_SIZE(pages) = 3646
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 25600
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 120
                      PEER_WINDOW_END = 01/21/2022 11:15:46.000000 (1642774546)
             READS_ON_STANDBY_ENABLED = N
              HADR_LAST_TAKEOVER_TIME = 01/10/2022 18:43:54.000000 (1641851034)
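
Besides db2pd, the same HADR status can be pulled with SQL through the MON_GET_HADR table function. A minimal sketch, run as the instance owner on the primary (column list trimmed for readability):

db2 connect to hadrpace
db2 "SELECT HADR_ROLE, HADR_STATE, HADR_SYNCMODE, HADR_CONNECT_STATUS
     FROM TABLE(MON_GET_HADR(NULL))"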

Test Scopes

Both tests involve making hostHADR2 unavailable in some way.

As shown in the topology:

  • hostHADR1 has the Standby Database and the lowest Pacemaker ID (ID 1).
  • hostHADR2 has the Primary Database and the highest Pacemaker ID (ID 2).

Remember the brief above?

" The quorum device will pick the node with the lowest node ID to survive in case of a network or host failure. "

Also, the expected behavior is:

Db2-HADR-Pacemaker-Behavior

Test Scope 1

  • Turn off hostHADR2 machine

Test Scope 2

  • Turn off hostHADR2 network

Testing Scope 1 - Turning hostHADR2 off

1) Shutting down hostHADR2

hostHADR2:~ # shutdown now

2) Checking "db2pd -hadr"

As soon as hostHADR2 was turned off, hostHADR1 became the Primary:

db2inst1@hostHADR1:~> db2pd -d hadrpace -hadr

Database Member 0 -- Database HADRPACE -- Active -- Up 1 days 20:09:06 -- Date 2022-01-21-11.42.03.887202

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = DISCONNECTED
                           HADR_FLAGS =
                  PRIMARY_MEMBER_HOST = 192.168.145.134
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = 192.168.145.136
                     STANDBY_INSTANCE = db2inst1
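
To catch the role change as it happens, a simple polling loop around db2pd can be used while running the test. A minimal sketch (interval and fields are arbitrary choices):

# Poll the HADR role/state every 5 seconds:
while true; do
    date
    db2pd -d hadrpace -hadr | grep -E "HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS"
    sleep 5
done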

3) Checking the crm status

The "crm status" changed while I was monitoring it. Look:

hostHADR1:/home/db2inst1/sqllib/bin # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: hostHADR2 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Fri Jan 21 11:43:08 2022
  * Last change:  Fri Jan 21 11:42:02 2022 by root via crm_attribute on hostHADR1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ hostHADR1 ]
  * OFFLINE: [ hostHADR2 ]

Full List of Resources:
  * db2_hostHADR1_eth0  (ocf::heartbeat:db2ethmon):      Stopped
  * db2_hostHADR2_eth0  (ocf::heartbeat:db2ethmon):      Stopped
  * db2_hostHADR1_db2inst1_0    (ocf::heartbeat:db2inst):        Stopped
  * db2_hostHADR2_db2inst1_0    (ocf::heartbeat:db2inst):        Stopped
  * Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
    * Stopped: [ hostHADR1 hostHADR2 ]

then...

hostHADR1:/home/db2inst1/sqllib/bin # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: hostHADR1 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Fri Jan 21 11:44:25 2022
  * Last change:  Fri Jan 21 11:42:02 2022 by root via crm_attribute on hostHADR1
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ hostHADR1 ]
  * OFFLINE: [ hostHADR2 ]

Full List of Resources:
  * db2_hostHADR1_eth0  (ocf::heartbeat:db2ethmon):      Started hostHADR1
  * db2_hostHADR2_eth0  (ocf::heartbeat:db2ethmon):      Stopped
  * db2_hostHADR1_db2inst1_0    (ocf::heartbeat:db2inst):        Started hostHADR1
  * db2_hostHADR2_db2inst1_0    (ocf::heartbeat:db2inst):        Stopped
  * Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
    * Masters: [ hostHADR1 ]
    * Stopped: [ hostHADR2 ]
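
Instead of re-running "crm status" by hand, the transition can also be followed with crm_mon, which ships with Pacemaker. A sketch:

# Console mode, refreshes automatically; -r also lists inactive resources:
crm_mon -r
# One-shot snapshot (handy in scripts):
crm_mon -r -1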

4) Checking db2cm -list

hostHADR1:/home/db2inst1/sqllib/bin # ./db2cm -list
      Cluster Status

Domain information:
Error: Could not get Pacemaker version. - 679
Domain name               = HadrPaceDomain
Pacemaker version         =
Corosync version          = 3.0.4
Current domain leader     = hostHADR2
Number of nodes           = 2
Number of resources       = 6

Node information:
Name name           State
----------------    --------
hostHADR1           Online
hostHADR2           Offline

Resource Information:

Resource Name             = db2_db2inst1_db2inst1_HADRPACE
  Resource Type                 = HADR
    DB Name                     = HADRPACE
    Managed                     = true
    HADR Primary Instance       =
    HADR Primary Node           =
    HADR Primary State          = Offline
    HADR Standby Instance       =
    HADR Standby Node           =
    HADR Standby State          = Offline

Resource Name             = db2_hostHADR1_db2inst1_0
  State                         = Offline
  Managed                       = true
  Resource Type                 = Instance
    Node                        = hostHADR1
    Instance Name               = db2inst1

Resource Name             = db2_hostHADR1_eth0
  State                         = Offline
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = hostHADR1
    Interface Name              = eth0

Resource Name             = db2_hostHADR2_db2inst1_0
  State                         = Offline
  Managed                       = true
  Resource Type                 = Instance
    Node                        = hostHADR2
    Instance Name               = db2inst1

Resource Name             = db2_hostHADR2_eth0
  State                         = Offline
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = hostHADR2
    Interface Name              = eth0

Fencing Information:
  Not configured
Quorum Information:
  Qdevice

Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Membership node list:   1

Qdevice-net information
----------------------
Cluster name:           HadrPaceDomain
QNetd host:             qDeviceHost:5403
Algorithm:              LMS
Tie-breaker:            Node with lowest node ID
State:                  Connected
5) db2diag.log messages

I got some key messages from db2diag.log to understand what happened:

2022-01-21-11.41.56.585733-180 I14052321E440         LEVEL: Error
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrRecvMsg, probe:11810
MESSAGE : Zero bytes received. Remote end may have closed connection

2022-01-21-11.41.56.586375-180 I14052762E503         LEVEL: Error
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrRecvMsgS, probe:30080
MESSAGE : HADR standby recv error:
DATA #1 : Hexdump, 4 bytes
0x00007FAEC27F26BC : 0100 0000                                  ....

2022-01-21-11.41.56.586439-180 I14053266E488         LEVEL: Error
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20215
MESSAGE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
          "Communication with HADR partner was lost"

2022-01-21-11.41.56.589613-180 E14053755E450         LEVEL: Event
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to HDR_S_DISCONN_PEER (was HDR_S_PEER), connId=17

2022-01-21-11.41.56.590047-180 I14054206E434         LEVEL: Warning
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrCloseConn, probe:30595
DATA #1 : <preformatted>
Peer window end time : 1642776236

2022-01-21-11.41.56.591586-180 I14054641E463         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32201
DATA #1 : <preformatted>
Info: HADR Socket send buffer size, SO_SNDBUF: 16384 bytes

2022-01-21-11.41.56.591654-180 I14055105E467         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetTcpWindowSize, probe:32251
DATA #1 : <preformatted>
Info: HADR Socket receive buffer size, SO_RCVBUF: 131072 bytes

2022-01-21-11.41.56.592061-180 I14055573E509         LEVEL: Error
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20280
MESSAGE : ZRC=0x810F0019=-2129723367=SQLO_CONN_REFUSED "Connection refused"
DATA #1 : <preformatted>
Failed to connect to primary.

2022-01-21-11.41.59.201524-180 E14056083E510         LEVEL: Event
PID     : 46656                TID : 140388539033344 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
APPHDL  : 0-102                APPID: *LOCAL.db2inst1.220121144159
AUTHID  : DB2INST1             HOSTNAME: hostHADR1
EDUID   : 39                   EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, base sys utilities, sqeDBMgr::StartUsingLocalDatabase, probe:13
START   : Received TAKEOVER HADR command.

2022-01-21-11.41.59.324320-180 I14056594E459         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleInitialRequest, probe:46000
MESSAGE : Standby has initiated a takeover by force peer window only.

2022-01-21-11.41.59.324422-180 I14057054E487         LEVEL: Error
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSendTakeoverMsg, probe:210
MESSAGE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
          "Communication with HADR partner was lost"

2022-01-21-11.41.59.324465-180 I14057542E502         LEVEL: Warning
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleInitialRequest, probe:46010
MESSAGE : ZRC=0x8280001B=-2105540581=HDR_ZRC_COMM_CLOSED
          "Communication with HADR partner was lost"

2022-01-21-11.41.59.324497-180 I14058045E431         LEVEL: Warning
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSDoTakeover, probe:47265
DATA #1 : <preformatted>
Skip posting hdrTakeoverDone

2022-01-21-11.41.59.331478-180 I14058477E557         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrBuildAndSendTkResponse, probe:32000
DATA #1 : String, 46 bytes
Enquing takeover request for stage and event:
DATA #2 : String, 23 bytes
HDR_TK_P_DONE_DRAIN_LOG
DATA #3 : String, 19 bytes
HDR_TKEV_NEXT_STAGE

2022-01-21-11.41.59.435784-180 I14059035E430         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleDoneDrain, probe:46800
MESSAGE : Standby switching roles to primary.



(...)



2022-01-21-11.42.01.610768-180 I14089625E496         LEVEL: Info
PID     : 46656                TID : 140388560004864 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
APPHDL  : 0-8                  APPID: *LOCAL.DB2.220119183259
HOSTNAME: hostHADR1
EDUID   : 34                   EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:4820
DATA #1 : <preformatted>
HADR Takeover: Processing database configuration change.

2022-01-21-11.42.01.613753-180 E14090122E488         LEVEL: Warning
PID     : 46656                TID : 140388560004864 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
APPHDL  : 0-8                  APPID: *LOCAL.DB2.220119183259
HOSTNAME: hostHADR1
EDUID   : 34                   EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:6600
MESSAGE : ADM1611W  The rollforward recovery phase has been completed.

2022-01-21-11.42.01.616327-180 I14090611E490         LEVEL: Info
PID     : 46656                TID : 140388560004864 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
APPHDL  : 0-8                  APPID: *LOCAL.DB2.220119183259
HOSTNAME: hostHADR1
EDUID   : 34                   EDUNAME: db2agent (HADRPACE) 0
FUNCTION: DB2 UDB, recovery manager, sqlpReplayMaster, probe:9500
MESSAGE : Stopping replay master on standby
DATA #1 : Boolean, 1 bytes
true

2022-01-21-11.42.01.660160-180 E14091689E1245        LEVEL: Event
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetDbRoleAndDbType, probe:10020
CHANGE  : HADR DATABASE ROLE/TYPE -
HADR database role set to PRIMARY (was STANDBY).
HADR database type set to PHYSICAL (was PHYSICAL).
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
  [0] 0x00007FAED53114C9 _Z21hdrSetDbRoleAndDbTypeP9SQLP_DBCBjj + 0x6F9
  [1] 0x00007FAED5380127 _Z24hdrStbyTkHandleDoneDrainP6hdrEduP8HDR_DBCB + 0x22C7
  [2] 0x00007FAED5392E83 _ZN6hdrEdu14hdrSDoTakeoverEP20HDR_MESSAGE_TAKEOVERP17HDR_RQST_TAKEOVER + 0xA33
  [3] 0x00007FAED534D93F _ZN6hdrEdu7hdrEduSEPm + 0x1D7F
  [4] 0x00007FAED534F352 _ZN6hdrEdu11hdrEduEntryEv + 0x552
  [5] 0x00007FAED5350D07 _ZN6hdrEdu6RunEDUEv + 0x27
  [6] 0x00007FAEDF7FCFC4 _ZN9sqzEDUObj9EDUDriverEv + 0x1A4
  [7] 0x00007FAEDDA8A1D9 sqloEDUEntry + 0x2A9
  [8] 0x00007FAEE46CC94A /lib64/libpthread.so.0 + 0x894A
  [9] 0x00007FAED1CF3D0F clone + 0x3F


2022-01-21-11.42.01.680170-180 E14092935E465         LEVEL: Event
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrs.0.0 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to HDR_P_REM_CATCHUP_PENDING (was HDR_S_DISCONN_PEER), connId=17


(...)


2022-01-21-11.42.01.687074-180 I14096856E411         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduStartup, probe:21152
MESSAGE : HADR Startup has completed.

2022-01-21-11.42.01.687516-180 I14097268E440         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrStbyTkHandleDoneDrain, probe:46840
MESSAGE : Standby has completed takeover (now primary).


(...)

UID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEdu::hdrEduS, probe:21900
MESSAGE : Standby Finished.

2022-01-21-11.42.01.700128-180 I14099553E402         LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEdu::hdrEduP, probe:20301
MESSAGE : Primary Started.
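
Since db2diag.log is verbose, a plain grep over the key HADR function names is a quick way to isolate records like the ones above. A sketch; the path is this environment's default diagpath and may differ on your system:

grep -Ei "hdrSetHdrState|takeover|switching roles" \
    ~/sqllib/db2dump/DIAG0000/db2diag.log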

6) First conclusion

In that case, Pacemaker ran a "TAKEOVER BY FORCE PEER WINDOW ONLY".

More information about it:

"
BY FORCE

Specifies that the database is not to wait for confirmation that the original HADR primary database is shut down. Unless you are using SUPERASYNC synchronization mode, this option is needed if the HADR pair is not in peer state.

PEER WINDOW ONLY

When this option is specified, no committed transaction loss occurs if the command succeeds and the primary database is brought down before the end of the peer window period. You set the peer window period by assigning a positive integer value to the hadr_peer_window database configuration parameter. Not bringing down the primary database before the peer window expires results in a split brain. If the TAKEOVER BY FORCE PEER WINDOW ONLY command is run when the HADR pair is not in a peer state, or is in a disconnected peer state, an error is returned.

You cannot use the PEER WINDOW ONLY option when the synchronization mode is set to ASYNC or SUPERASYNC.
"


TAKEOVER HADR command
https://www.ibm.com/docs/en/db2/11.5?topic=commands-takeover-hadr
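
For reference, the manual equivalent of what Pacemaker issued would look like the command below. This is only illustrative: on a Pacemaker-managed cluster you should let Pacemaker drive the takeover rather than run it yourself.

db2 takeover hadr on db hadrpace by force peer window only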

7) Turning hostHADR2 on again

8) Checking "crm status" again

Checking the "crm status" after hostHADR2 was turned on. We can see all online and started.

hostHADR1:/home/db2inst1/sqllib/bin # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: hostHADR1 (version 2.0.5+20201202.ba59be712-2.30.db2pcmk-2.0.5+20201202.ba59be712) - partition with quorum
  * Last updated: Fri Jan 21 12:27:45 2022
  * Last change:  Fri Jan 21 12:23:44 2022 by root via crm_attribute on hostHADR2
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ hostHADR1 hostHADR2 ]

Full List of Resources:
  * db2_hostHADR1_eth0  (ocf::heartbeat:db2ethmon):      Started hostHADR1
  * db2_hostHADR2_eth0  (ocf::heartbeat:db2ethmon):      Started hostHADR2
  * db2_hostHADR1_db2inst1_0    (ocf::heartbeat:db2inst):        Started hostHADR1
  * db2_hostHADR2_db2inst1_0    (ocf::heartbeat:db2inst):        Started hostHADR2
  * Clone Set: db2_db2inst1_db2inst1_HADRPACE-clone [db2_db2inst1_db2inst1_HADRPACE] (promotable):
    * Masters: [ hostHADR1 ]
    * Slaves: [ hostHADR2 ]

9) Checking the "db2cm -list" after hostHADR2 was turned on

hostHADR1:/home/db2inst1/sqllib/bin # ./db2cm -list
      Cluster Status

Domain information:
Error: Could not get Pacemaker version. - 679
Domain name               = HadrPaceDomain
Pacemaker version         =
Corosync version          = 3.0.4
Current domain leader     = hostHADR1
Number of nodes           = 2
Number of resources       = 6

Node information:
Name name           State
----------------    --------
hostHADR1           Online
hostHADR2           Online

Resource Information:

Resource Name             = db2_db2inst1_db2inst1_HADRPACE
  Resource Type                 = HADR
    DB Name                     = HADRPACE
    Managed                     = true
    HADR Primary Instance       = db2inst1
    HADR Primary Node           = hostHADR1
    HADR Primary State          = Online
    HADR Standby Instance       = db2inst1
    HADR Standby Node           = hostHADR2
    HADR Standby State          = Online

Resource Name             = db2_hostHADR1_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = hostHADR1
    Instance Name               = db2inst1

Resource Name             = db2_hostHADR1_eth0
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = hostHADR1
    Interface Name              = eth0

Resource Name             = db2_hostHADR2_db2inst1_0
  State                         = Online
  Managed                       = true
  Resource Type                 = Instance
    Node                        = hostHADR2
    Instance Name               = db2inst1

Resource Name             = db2_hostHADR2_eth0
  State                         = Online
  Managed                       = true
  Resource Type                 = Network Interface
    Node                        = hostHADR2
    Interface Name              = eth0

Fencing Information:
  Not configured
Quorum Information:
  Qdevice

Qdevice information
-------------------
Model:                  Net
Node ID:                1
Configured node list:
    0   Node ID = 1
    1   Node ID = 2
Membership node list:   1, 2

Qdevice-net information
----------------------
Cluster name:           HadrPaceDomain
QNetd host:             qDeviceHost:5403
Algorithm:              LMS
Tie-breaker:            Node with lowest node ID
State:                  Connected

10) Checking the "db2pd -d hadrpace -hadr" after hostHADR2 was turned on

db2inst1@hostHADR1:~/sqllib/db2dump/DIAG0000> db2pd -d hadrpace -hadr

Database Member 0 -- Database HADRPACE -- Active -- Up 1 days 20:56:35 -- Date 2022-01-21-12.29.32.830386

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = PEER
                           HADR_FLAGS = TCP_PROTOCOL
                  PRIMARY_MEMBER_HOST = 192.168.145.134
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = 192.168.145.136
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 01/21/2022 12:23:45.098044 (1642778625)
          HEARTBEAT_INTERVAL(seconds) = 1
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 9480
                HADR_TIMEOUT(seconds) = 3
        TIME_SINCE_LAST_RECV(seconds) = 0
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000000
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.000
                  LOG_HADR_WAIT_COUNT = 0
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
            PRIMARY_LOG_FILE,PAGE,POS = S0000008.LOG, 8, 78261115
            STANDBY_LOG_FILE,PAGE,POS = S0000008.LOG, 8, 78261115
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000008.LOG, 8, 78261115
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 01/21/2022 11:59:13.000000 (1642777153)
                     STANDBY_LOG_TIME = 01/21/2022 11:59:13.000000 (1642777153)
              STANDBY_REPLAY_LOG_TIME = 01/21/2022 11:59:13.000000 (1642777153)
         STANDBY_RECV_BUF_SIZE(pages) = 3646
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 25600
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 120
                      PEER_WINDOW_END = 01/21/2022 12:31:32.000000 (1642779092)
             READS_ON_STANDBY_ENABLED = N
              HADR_LAST_TAKEOVER_TIME = 01/21/2022 11:41:59.000000 (1642776119)

11) Checking the "db2diag.log" after hostHADR2 was turned on

2022-01-21-12.23.44.987880-180 I9379E480             LEVEL: Warning
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleRemoteConn, probe:30160
DATA #1 : <preformatted>
TCP socket connection accepted. Remote Host: 192.168.145.136 Port: 52951

2022-01-21-12.23.45.095050-180 I9860E494             LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleHsAck, probe:43900
DATA #1 : <preformatted>
Handshake HDR_MSG_HDRHS message is received from 192.168.145.136:40002 (192.168.145.136:40002)

2022-01-21-12.23.45.097942-180 I10355E437            LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrAfterVerifySystem, probe:30440
DATA #1 : <preformatted>
Connection succeeded, connId=18

(...)


2022-01-21-12.23.45.201494-180 I11751E465            LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrPerformPairValidation, probe:21100
DATA #1 : <preformatted>
Old primary requesting rejoining HADR pair as a standby

2022-01-21-12.23.45.304517-180 E12217E464            LEVEL: Event
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to HDR_P_REM_CATCHUP (was HDR_P_REM_CATCHUP_PENDING), connId=18

2022-01-21-12.23.45.317266-180 I12682E470            LEVEL: Info
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEdu::hdrEduP, probe:20445
DATA #1 : <preformatted>
Remote catchup starts at lso 75821614.
Using page start lso 75817677.


(...)


2022-01-21-12.23.47.566859-180 E21873E450            LEVEL: Event
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to HDR_P_NPEER (was HDR_P_REM_CATCHUP), connId=18

2022-01-21-12.23.47.567289-180 I22324E586            LEVEL: Info
PID     : 46656                TID : 140388597753600 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 25                   EDUNAME: db2loggw (HADRPACE) 0
FUNCTION: DB2 UDB, data protection services, sqpLoggwEdu::sqlpLoggwMain, probe:4399
DATA #1 : <preformatted>
Database log writing has been suspended at lso 78261116.  Waiting to be resumed.
Current peer epoch: 0
Primary peer epoch: 0
Standby peer epoch: 0
Peer epoch begin:   0
Peer epoch end:     0

2022-01-21-12.23.47.571293-180 E22911E443            LEVEL: Event
PID     : 46656                TID : 140388564199168 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 33                   EDUNAME: db2hadrp.0.1 (HADRPACE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to HDR_P_PEER (was HDR_P_NPEER), connId=18

(...)

2022-01-21-12.23.48.585512-180 I27587E560            LEVEL: Info
PID     : 46656                TID : 140388597753600 PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000            DB   : HADRPACE
HOSTNAME: hostHADR1
EDUID   : 25                   EDUNAME: db2loggw (HADRPACE) 0
FUNCTION: DB2 UDB, data protection services, sqpLoggwEdu::sqlpLoggwMain, probe:4427
DATA #1 : <preformatted>
Database log writing has been resumed at lso 78261116.
Current peer epoch: 0
Primary peer epoch: 0
Standby peer epoch: 0
Peer epoch begin:   0
Peer epoch end:     0

Testing Scope 2 - Turning off the network of hostHADR2

In this test I will turn off hostHADR2's network.

The point of this test is to:

  • check whether Pacemaker will really stop the Db2 instance on hostHADR2 automatically
  • check whether any Split Brain occurs after that

For more information about Split Brain:

How a TSAMP cluster gets into Db2 Split-brain
https://www.ibm.com/support/pages/how-tsamp-cluster-gets-db2-split-brain

Once again, the starting point of this test is the same as in the previous test.

I mean:

  • hostHADR1 has the Standby Database and the lowest Pacemaker ID (ID 1).
  • hostHADR2 has the Primary Database and the highest Pacemaker ID (ID 2).

The Topology is described in the "Topology" section above.

The initial outputs from "db2pd", "db2cm -list", etc. can be seen in the "Configuration" section above.

1) Turning the network off

The network adapter of the hostHADR2 machine was turned off with the following command:

ifdown eth0

PS.: On SUSE Linux 15 you can check the network adapter and its IPs with the following:

ip r
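
A couple of assumed-equivalent ways to inspect the interface directly (a sketch; eth0 is the adapter from this topology):

ip addr show eth0    # addresses and UP/DOWN state of the adapter
ip link show eth0    # link-layer state only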

2) Instance went down on hostHADR2

I verified that Pacemaker brought the Db2 instance down on hostHADR2.

3) "Takeover by force" made by Pacemaker

Pacemaker made a "takeover by force" (the "peer window only") and hostHADR1 became the Primary:

db2inst1@hostHADR1:~/sqllib/db2dump/DIAG0000> db2pd -hadr -d hadrpace

Database Member 0 -- Database HADRPACE -- Active -- Up 2 days 00:13:03 -- Date 2022-01-21-15.46.00.097250

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = DISCONNECTED
                           HADR_FLAGS =
                  PRIMARY_MEMBER_HOST = 192.168.145.134
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = 192.168.145.136
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = DISCONNECTED
             HADR_CONNECT_STATUS_TIME = 01/21/2022 15:24:41.902454 (1642789481)
          HEARTBEAT_INTERVAL(seconds) = 1
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 18755
                HADR_TIMEOUT(seconds) = 3
        TIME_SINCE_LAST_RECV(seconds) = 0
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000452
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.101
                  LOG_HADR_WAIT_COUNT = 227
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 16384
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
            PRIMARY_LOG_FILE,PAGE,POS = S0000010.LOG, 71, 86866515
            STANDBY_LOG_FILE,PAGE,POS = S0000000.LOG, 0, 0
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000000.LOG, 0, 0
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 01/21/2022 15:27:59.000000 (1642789679)
                     STANDBY_LOG_TIME = NULL
              STANDBY_REPLAY_LOG_TIME = NULL
                 PEER_WINDOW(seconds) = 120
              HADR_LAST_TAKEOVER_TIME = 01/21/2022 15:26:16.000000 (1642789576)

4) Created transaction log

Just to test, I created a table in the database to make sure some transaction log would be created.

5) Made network available again

Starting the network adapter on hostHADR2:

ifup eth0

6) HADR still DISCONNECTED

I checked that HADR still had "DISCONNECTED" status in "db2pd -hadr".

  1. Two "PRIMARY" databases in "get db cfg"

The "get db cfg" showed that the database from both hosts (hostHADR1 and hostHADR2) was set to PRIMARY:

 HADR database role                                      = PRIMARY
 HADR local host name                  (HADR_LOCAL_HOST) = 192.168.145.136
 HADR local service name                (HADR_LOCAL_SVC) = 40002
 HADR remote host name                (HADR_REMOTE_HOST) = 192.168.145.134
 HADR remote service name              (HADR_REMOTE_SVC) = 40001
 HADR instance name of remote server  (HADR_REMOTE_INST) = db2inst1
 HADR timeout value                       (HADR_TIMEOUT) = 3
 HADR target list                     (HADR_TARGET_LIST) =
 HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC
 HADR spool log data limit (4KB)      (HADR_SPOOL_LIMIT) = AUTOMATIC(25600)
 HADR log replay delay (seconds)     (HADR_REPLAY_DELAY) = 0
 HADR peer window duration (seconds)  (HADR_PEER_WINDOW) = 120
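
A quick way to compare the roles side by side is to filter the HADR parameters out of the full configuration on each host. A minimal sketch, run as the instance owner on both machines:

db2 get db cfg for hadrpace | grep -i hadr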

8) ERROR while trying to activate the database

While trying to activate the database in hostHADR2, I got an error:

db2inst1@hostHADR2:~> db2 activate db hadrpace
SQL1776N  The command cannot be issued on an HADR database. Reason code = "6".

(...)

6

         This database is an old primary database. It cannot be started
         because the standby has become the new primary through forced
         takeover.



User Response:


6

         Consider the following scenarios when deciding on a suitable
         response:

         6.1) If the HADR primary database is running on a different server,
         then the database on this server should be started as an HADR
         standby database. Perform the following steps:

         *  Issue the START HADR AS STANDBY command on the database on this
            server.

         *  Verify that this new HADR standby database has established a
            connection with the HADR primary database by examining the
            HADR_CONNECT_STATUS field of the db2pd -hadr command or the
            MON_GET_HADR() table function on the HADR primary database. Note,
            an ADM12513E message will also be reported in the administration
            notification log on the HADR primary database if a connection
            cannot be established.

         *  If this new HADR standby database cannot establish a connection
            to the HADR primary database, then this new HADR standby database
            will need to be fully reinitialized based on a backup image of
            the HADR primary database. Please see IBM Db2 Knowledge Center
            article titled "Initializing high availability disaster recovery
            (HADR)".

         6.2) If the HADR primary database is not running on a different
         server (or cannot be made to run), then the database on this server
         should be started as an HADR primary database. Perform the following
         steps:

         *  Issue the STOP HADR command on the database on this server.

         *  If you wish to establish an HADR standby database on a different
            server before starting the database on this server as an HADR
            primary database, then please see IBM Db2 Knowledge Center
            article titled "Initializing high availability disaster recovery
            (HADR)".

         *  Issue the START HADR AS PRIMARY command on the database on this
            server (or the START HADR AS PRIMARY BY FORCE command if there is
            no HADR standby database running yet).
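
In my case the first scenario (6.1) applied and the rejoin as standby worked, as shown in the next step. If the standby had not been able to reconnect, the reinitialization path mentioned in 6.1 would look roughly like the sketch below; the paths and the timestamp are illustrative placeholders:

# On the new primary (hostHADR1):
db2 "BACKUP DB HADRPACE ONLINE TO /db2backup"
scp /db2backup/HADRPACE.* hostHADR2:/db2backup/
# On the old primary (hostHADR2):
db2 "RESTORE DB HADRPACE FROM /db2backup TAKEN AT <timestamp>"
db2 "START HADR ON DB HADRPACE AS STANDBY"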

9) Start HADR

Based on the "User Response" from error above (SQL1776N RC=6), I did:

db2inst1@hostHADR2:~> db2 start hadr on db hadrpace as standby
DB20000I  The START HADR ON DATABASE command completed successfully.

And HADR became CONNECTED and PEER:

db2inst1@hostHADR1:~/sqllib/db2dump/DIAG0000> db2pd -hadr -d hadrpace

Database Member 0 -- Database HADRPACE -- Active -- Up 2 days 00:31:59 -- Date 2022-01-21-16.04.56.714816

                            HADR_ROLE = PRIMARY
                          REPLAY_TYPE = PHYSICAL
                        HADR_SYNCMODE = NEARSYNC
                           STANDBY_ID = 1
                        LOG_STREAM_ID = 0
                           HADR_STATE = PEER
                           HADR_FLAGS = TCP_PROTOCOL
                  PRIMARY_MEMBER_HOST = 192.168.145.134
                     PRIMARY_INSTANCE = db2inst1
                       PRIMARY_MEMBER = 0
                  STANDBY_MEMBER_HOST = 192.168.145.136
                     STANDBY_INSTANCE = db2inst1
                       STANDBY_MEMBER = 0
                  HADR_CONNECT_STATUS = CONNECTED
             HADR_CONNECT_STATUS_TIME = 01/21/2022 15:51:37.597068 (1642791097)
          HEARTBEAT_INTERVAL(seconds) = 1
                     HEARTBEAT_MISSED = 0
                   HEARTBEAT_EXPECTED = 19554
                HADR_TIMEOUT(seconds) = 3
        TIME_SINCE_LAST_RECV(seconds) = 0
             PEER_WAIT_LIMIT(seconds) = 0
           LOG_HADR_WAIT_CUR(seconds) = 0.000
    LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.000452
   LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.101
                  LOG_HADR_WAIT_COUNT = 227
SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 87040
SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 131072
            PRIMARY_LOG_FILE,PAGE,POS = S0000010.LOG, 74, 86878440
            STANDBY_LOG_FILE,PAGE,POS = S0000010.LOG, 74, 86878440
                  HADR_LOG_GAP(bytes) = 0
     STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000010.LOG, 74, 86878440
       STANDBY_RECV_REPLAY_GAP(bytes) = 0
                     PRIMARY_LOG_TIME = 01/21/2022 15:50:49.000000 (1642791049)
                     STANDBY_LOG_TIME = 01/21/2022 15:50:49.000000 (1642791049)
              STANDBY_REPLAY_LOG_TIME = 01/21/2022 15:50:49.000000 (1642791049)
         STANDBY_RECV_BUF_SIZE(pages) = 3646
             STANDBY_RECV_BUF_PERCENT = 0
           STANDBY_SPOOL_LIMIT(pages) = 25600
                STANDBY_SPOOL_PERCENT = 0
                   STANDBY_ERROR_TIME = NULL
                 PEER_WINDOW(seconds) = 120
                      PEER_WINDOW_END = 01/21/2022 16:06:56.000000 (1642792016)
             READS_ON_STANDBY_ENABLED = N
              HADR_LAST_TAKEOVER_TIME = 01/21/2022 15:26:16.000000 (1642789576)

Tests Conclusion

Let's recall the topology:

Db2-HADR-Pacemaker-Topology

  • hostHADR1 has the Standby Database and the lowest Pacemaker ID (ID 1).
  • hostHADR2 has the Primary Database and the highest Pacemaker ID (ID 2).

And now let's see the conclusions of the tests.

Test Scope 1 Conclusion

  • Machine hostHADR2 was turned off.

  • Pacemaker made a "takeover by force peer window only" successfully

  • after the machine hostHADR2 was turned back on:
    - Pacemaker started the Db2 instance on the hostHADR2 machine
    - HADR got CONNECTED and PEER with no problems
    - the resources got Started/Online in the "crm status" output
    - no Split Brain occurred

Test Scope 2 Conclusion

  • the network adapter of hostHADR2 was turned off

  • Pacemaker made a "takeover by force" successfully

  • Pacemaker stopped the Db2 instance from hostHADR2

  • at this point, to make sure some transaction log could be created, we created a table in the database

  • after the network adapter from hostHADR2 was reestablished:
    - Pacemaker started the Db2 instance on hostHADR2
    - HADR got DISCONNECTED status
    - the database stayed in "deactivated" status on hostHADR2
    - the "HADR Role" (get db cfg) was shown as PRIMARY on both HADR hosts
    - the "activate database" command failed with SQL1776N RC=6 on the old Primary (hostHADR2)
    - then "start hadr on db hadrpace as standby" was executed successfully on the old Primary (hostHADR2)
    - HADR then got CONNECTED and PEER status, working properly
    - no Split Brain occurred

Highlight

Pacemaker stopped the Db2 instance when the network adapter went down, so no transaction log was created in the old primary.

Because of that, the "start hadr on db hadrpace as standby" command worked fine in the old primary and no Split Brain happened.

In this test the Links 1 and 3 (see the topology image) went down because the network adapter was disabled.

Another great thing is that the qDevice host isn't actually part of the cluster, so a single qDevice host can be configured as the arbitrator node for several Db2 HADR clusters automated by Pacemaker.
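
A hedged sketch of what this looks like in practice; the db2cm -create -qdevice option is described in the setup article referenced above, and the host name comes from this topology:

# On each HADR cluster (run as root on one of its nodes):
./db2cm -create -qdevice qDeviceHost
# On the single arbitrator host, list every cluster it serves:
corosync-qnetd-tool -l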


References

Automating Db2 HADR with Pacemaker
https://github.com/fsmegale/Db2-luw/wiki/Automating-Db2-HADR-with-Pacemaker

Quorum devices support on Pacemaker
https://www.ibm.com/docs/en/db2/11.5?topic=component-quorum-devices-support-pacemaker

TAKEOVER HADR command
https://www.ibm.com/docs/en/db2/11.5?topic=commands-takeover-hadr

How a TSAMP cluster gets into Db2 Split-brain
https://www.ibm.com/support/pages/how-tsamp-cluster-gets-db2-split-brain
