Patroni Troubleshooting
Generally, once you have a Patroni cluster up and running, there is not much else that needs to be done for day-to-day operations and maintenance. However, High Availability is a complex topic, and there are many subtle ways for a cluster to fail, eventually leading to an outage. This article outlines general tips and suggestions for common Patroni maintenance-related tasks, as well as considerations and steps for recovering from more serious issues. Check out the official Patroni Documentation for more in-depth details on how Patroni works.
We also have some older documentation around how to deploy Patroni and HA databases.
Useful Maintenance Commands
Most of the time, you will be interacting with Patroni directly through its CLI application, `patronictl`. When you run `patronictl`, you will get the following:
Usage: patronictl [OPTIONS] COMMAND [ARGS]...
Options:
-c, --config-file TEXT Configuration file
-d, --dcs TEXT Use this DCS
-k, --insecure Allow connections to SSL sites without certs
--help Show this message and exit.
Commands:
configure Create configuration file
dsn Generate a dsn for the provided member,...
edit-config Edit cluster configuration
failover Failover to a replica
flush Flush scheduled events
list List the Patroni members for a given Patroni
pause Disable auto failover
query Query a Patroni PostgreSQL member
reinit Reinitialize cluster member
reload Reload cluster member configuration
remove Remove cluster from DCS
restart Restart cluster member
resume Resume auto failover
scaffold Create a structure for the cluster in DCS
show-config Show cluster configuration
switchover Switchover to a replica
version Output version of patronictl command or a running Patroni...
The key `patronictl` commands you will want to be familiar with are:
- `failover`
- `history`
- `list`
- `pause`
- `reinit`
- `resume`
- `version`
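On OpenShift, the simplest way to reach `patronictl` is usually to run it inside one of the Patroni pods, where it is already installed and configured. A minimal sketch, assuming a pod named patroni-master-0 in the current project (the pod name matches the examples in this article, but adjust it to your deployment):
$ oc rsh patroni-master-0                        # open a shell inside a Patroni pod
$ patronictl list                                # ...then run patronictl from that shell
$ oc exec patroni-master-0 -- patronictl list    # or run a single command non-interactively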
General Diagnosis Steps
Normally, the first thing you need to do on a Patroni cluster is figure out whether the cluster itself is reporting an issue. You can get an overall view of the state of the cluster with `patronictl list`. An example output is as follows:
$ patronictl list
+---------+------------------+----------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+---------+------------------+----------------+--------+---------+----+-----------+
| master | patroni-master-0 | 192.168.0.1 | Leader | running | 7 | 0 |
| master | patroni-master-1 | 192.168.0.2 | | running | 7 | 0 |
| master | patroni-master-2 | 192.168.0.3 | | running | 7 | 0 |
+---------+------------------+----------------+--------+---------+----+-----------+
The columns you will care about the most are Role, State, TL and Lag in MB. The above output example shows a healthy cluster because:
- Only one pod is marked as Leader/Master under Role
- All pods are running under State
- All timelines are the same number as the Leader/Master under TL
- All values under Lag in MB are 0
If your cluster does not match all four of the signals above, you will need to determine next steps for repairing the degraded cluster.
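While repairing a degraded cluster, it can be useful to keep re-running `patronictl list` so you can watch the four signals above converge. A minimal sketch of a polling loop, run from inside one of the Patroni pods:
$ while true; do clear; patronictl list; sleep 10; done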
Partial Cluster Failure Recovery
Normally, when one of the nodes in the cluster is not behaving correctly, you will be able to easily tell which one it is, either through `patronictl list` or through the Pod overview in the OpenShift console. Many things can make a Patroni node fail (WAL recovery corruption, general table inconsistencies, etc.). A general rule of thumb is that as long as one of the nodes in the cluster is acting as master and behaving correctly, you can force a misbehaving node to reset and synchronize with the current leader.
To reset a specific node, run `patronictl reinit <cluster_name>` and then follow the CLI prompts. It will ask you to select which member needs to be reinitialized. After choosing, in nearly all cases, the defective node will reboot and align itself to the current Patroni timeline. Unless a spurious error appears, your cluster should now be restored.
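A minimal sketch of that interaction, passing the member name directly rather than waiting for the prompt; the cluster and member names (master, patroni-master-2) are illustrative:
$ patronictl reinit master patroni-master-2
Are you sure you want to reinitialize members patroni-master-2? [y/N]: y
Success: reinitialize for member patroni-master-2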
Stuck Replica WAL Replay
One potential partial cluster degradation scenario can occur when there are storage issues on the cluster platform. In situations like these, you may see rapid turnover of pods as they get evacuated off nodes and shuffled to different machines. This shuffling may have some unintended consequences, especially if it happens too quickly within a short window of time. If you see pod logs that look something like the following, you may be in this situation:
2023-10-19 20:35:36,884 WARNING: Postgresql is not running.
2023-10-19 20:35:37.266 UTC [33] LOG: starting PostgreSQL 12.4 (Debian 12.4-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2023-10-19 20:35:37.266 UTC [33] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-10-19 20:35:37.276 UTC [33] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-10-19 20:35:37.323 UTC [35] LOG: database system was interrupted while in recovery at log time 2023-10-18 17:02:13 UTC
2023-10-19 20:35:37.323 UTC [35] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2023-10-19 20:35:37.324 UTC [36] FATAL: the database system is starting up
localhost:5432 - rejecting connections
2023-10-19 20:35:37.349 UTC [35] LOG: entering standby mode
localhost:5432 - rejecting connections
2023-10-19 20:35:37.360 UTC [38] FATAL: the database system is starting up
2023-10-19 20:35:37.372 UTC [35] LOG: redo starts at 13D/6994218
2023-10-19 20:35:37.563 UTC [35] LOG: consistent recovery state reached at 13D/7A3FEB8
2023-10-19 20:35:37.564 UTC [33] LOG: database system is ready to accept read only connections
2023-10-19 20:35:37.571 UTC [35] LOG: invalid resource manager ID 46 at 13D/7B75520
localhost:5432 - accepting connections
2023-10-19 20:35:38.634 UTC [42] LOG: fetching timeline history file for timeline 98 from primary server
2023-10-19 20:35:38.644 UTC [42] LOG: started streaming WAL from primary at 13D/7000000 on timeline 97
2023-10-19 20:35:38.644 UTC [42] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000610000013D00000007 has already been removed
2023-10-19 20:35:38.644 UTC [35] LOG: new target timeline is 98
2023-10-19 20:35:38.659 UTC [48] LOG: started streaming WAL from primary at 13D/7000000 on timeline 97
2023-10-19 20:35:38.659 UTC [48] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000610000013D00000007 has already been removed
2023-10-19 20:35:38.673 UTC [49] LOG: started streaming WAL from primary at 13D/7000000 on timeline 97
2023-10-19 20:35:38.673 UTC [49] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000610000013D00000007 has already been removed
2023-10-19 20:35:43.699 UTC [51] LOG: started streaming WAL from primary at 13D/7000000 on timeline 97
2023-10-19 20:35:43.699 UTC [51] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000610000013D00000007 has already been removed
2023-10-19 20:35:48.703 UTC [52] LOG: started streaming WAL from primary at 13D/7000000 on timeline 97
2023-10-19 20:35:48.703 UTC [52] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000610000013D00000007 has already been removed
2023-10-19 20:35:53.698 UTC [80] LOG: started streaming WAL from primary at 13D/7000000 on timeline 97
2023-10-19 20:35:53.698 UTC [80] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000610000013D00000007 has already been removed
2023-10-19 20:35:58.702 UTC [81] LOG: started streaming WAL from primary at 13D/7000000 on timeline 97
2023-10-19 20:35:58.702 UTC [81] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000610000013D00000007 has already been removed
Note that in this log, the replica is unable to acquire a missing WAL stream segment and continually crash loops on it. This indicates that there may have been enough churn that failover and recovery were interrupted. In this scenario, it is likely you would see only some of the cluster's StatefulSet pods up and running. Unfortunately, when this occurs, the only way to resolve it is to force a `patronictl reinit` on the affected member of the cluster. This will usually resolve the issue, as it forces the replica to start fresh and reacquire all of its data from the master. Below is a quick example of what a Patroni debug and full restore session could look like in this scenario:
$ patronictl list
+------------------+--------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: master (6935853617700180003) ----+---------+----+-----------+
| patroni-master-0 | 10.97.46.248 | Leader | running | 98 | |
| patroni-master-1 | 10.97.72.18 | Replica | running | 97 | 194 |
| patroni-master-2 | 10.97.22.190 | Replica | running | 97 | 168 |
+------------------+--------------+---------+---------+----+-----------+
$ patronictl reinit master patroni-master-2
+------------------+--------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: master (6935853617700180003) ----+---------+----+-----------+
| patroni-master-0 | 10.97.46.248 | Leader | running | 98 | |
| patroni-master-1 | 10.97.72.18 | Replica | running | 97 | 194 |
| patroni-master-2 | 10.97.22.190 | Replica | running | 97 | 168 |
+------------------+--------------+---------+---------+----+-----------+
Are you sure you want to reinitialize members patroni-master-2? [y/N]: y
Success: reinitialize for member patroni-master-2
$ patronictl list
+------------------+--------------+---------+------------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: master (6935853617700180003) ----+------------------+----+-----------+
| patroni-master-0 | 10.97.46.248 | Leader | running | 98 | |
| patroni-master-1 | 10.97.72.18 | Replica | running | 97 | 207 |
| patroni-master-2 | 10.97.22.190 | Replica | creating replica | | unknown |
+------------------+--------------+---------+------------------+----+-----------+
$ patronictl list
+------------------+--------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: master (6935853617700180003) ----+---------+----+-----------+
| patroni-master-0 | 10.97.46.248 | Leader | running | 98 | |
| patroni-master-1 | 10.97.72.18 | Replica | running | 97 | 223 |
| patroni-master-2 | 10.97.22.190 | Replica | running | 98 | 0 |
+------------------+--------------+---------+---------+----+-----------+
$ patronictl reinit master patroni-master-1
+------------------+--------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: master (6935853617700180003) ----+---------+----+-----------+
| patroni-master-0 | 10.97.46.248 | Leader | running | 98 | |
| patroni-master-1 | 10.97.72.18 | Replica | running | 97 | 223 |
| patroni-master-2 | 10.97.22.190 | Replica | running | 98 | 0 |
+------------------+--------------+---------+---------+----+-----------+
Are you sure you want to reinitialize members patroni-master-1? [y/N]: y
Success: reinitialize for member patroni-master-1
$ patronictl list
+------------------+--------------+---------+------------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: master (6935853617700180003) ----+------------------+----+-----------+
| patroni-master-0 | 10.97.46.248 | Leader | running | 98 | |
| patroni-master-1 | 10.97.72.18 | Replica | creating replica | 97 | unknown |
| patroni-master-2 | 10.97.22.190 | Replica | running | 98 | 0 |
+------------------+--------------+---------+------------------+----+-----------+
$ patronictl list
+------------------+--------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: master (6935853617700180003) ----+---------+----+-----------+
| patroni-master-0 | 10.97.46.248 | Leader | running | 98 | |
| patroni-master-1 | 10.97.72.18 | Replica | running | 98 | 0 |
| patroni-master-2 | 10.97.22.190 | Replica | running | 98 | 0 |
+------------------+--------------+---------+---------+----+-----------+
Total Cluster Failure Recovery
While total cluster failure is an exceedingly rare situation in a container environment, it can sometimes happen, such as with this incident. In this specific incident, our team was unfortunately not able to detect the initial Patroni cluster degradation and resolve it as outlined in the previous section. We ended up with total cluster failure because of a series of cluster degradations over the span of a few days.
Analysis
As mentioned in the Root Cause of the incident report, we had already had a defective patroni-master-0 pod for an extended amount of time. While this alone did not take down the cluster, it did leave it vulnerable. Since our StatefulSet had `.spec.podManagementPolicy` configured to `OrderedReady`, our StatefulSet would only deploy pods in sequential ascending order. When patroni-master-2 went down due to expected maintenance, it was unable to come back up because patroni-master-0 was still crash-looping. This left the cluster with patroni-master-1 as the only functional leader and no available replicas. When patroni-master-1 crashed, it was also unable to recover because patroni-master-0 was crash-looping (it never passed any health checks).
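If you are unsure how your own cluster behaves, you can check the policy directly on the StatefulSet. A minimal sketch, assuming the StatefulSet is named patroni-master (adjust to your deployment):
$ oc get statefulset patroni-master -o jsonpath='{.spec.podManagementPolicy}'
OrderedReady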
Recovery Plan
For this specific incident, we observed the following symptoms:
- Health checks (liveness/readiness probes) caused patroni-master-0 to endlessly crash-loop.
- The StatefulSet being in `OrderedReady` mode meant higher-numbered pods would not be able to refresh while patroni-master-0 was defective.
- We were able to determine (based off of Kibana logs from the platform) that patroni-master-1 had our most recent "source of truth".
As such, our effective goal was to get patroni-master-1 online again and have it act as the leader/master. Once that was done, we could leverage that node to reinitialize and reset the other degraded nodes to restore the cluster.
Recovery Steps
The following are the general steps we performed in order to resolve the incident. The notes in italics provide the context and rationale for why each step is done.
- Connect a debug pod as patroni-master-0, and run `patronictl pause` on it.
  - *Since none of the cluster members are functional, and we want to prevent Patroni from automatically selecting a leader, we need to pause it.*
- Modify the StatefulSet replica count to 0, and remove the liveness/readiness probes from the StatefulSet (see the command sketch after this list).
  - *We need to temporarily remove the health checks to prevent the pods from crash looping. The entire cluster must be shut down first to ensure the health check definitions disappear.*
- Modify the StatefulSet replica count to 3.
  - *We want all Patroni pods to be up, and a replica count of 3 will achieve this.*
  - *At this time, we expect that no Postgres DB instances should be running because Patroni is paused.*
  - *In the event this is not the case, you can manually run `pg_ctl -D /home/postgres/pgdata/pgroot/data stop` to kill the database instance(s).*
- Identify which of the Patroni instances was the most recent master.
  - *This step will take a bit of guesswork, but we need to know which member was the rightful master in order to recover correctly without losing data.*
  - *One way to determine this is to manually start up Postgres out of band with `pg_ctl -D /home/postgres/pgdata/pgroot/data start`.*
  - *If the node is able to start up without showing any errors, it is possible that node was formerly the master. If it crashes or exhibits any other errors, it is likely a corrupted replica.*
  - *Alternatively, if there are still Kibana logs available, you can search them to try and determine which node was the master before the cluster failed.*
- Restart the Postgres DB manually on patroni-master-1 with `patronictl restart master patroni-master-1`.
  - *We want to make sure that ONLY the patroni-master-1 database is running.*
  - *While the `pg_ctl` command can spin up the database, we need Patroni to be the one spinning it up so that it can be managed correctly.*
  - *If you did spin up Postgres manually, make sure you stop it before running the `patronictl restart` command.*
- Force Patroni to failover to patroni-master-1 if it is not already flagged as the leader.
  - *This is where we manually tell the cluster to make patroni-master-1 the master.*
  - *At this point, you will likely want to verify the data integrity of patroni-master-1 before proceeding. One way to do this is with `oc port-forward` and a local DB tool of your choosing.*
- Restore Patroni cluster management with `patronictl resume`.
  - *With patroni-master-1 as the elected leader, we can let Patroni manage things again.*
  - *After running this command, some of the other cluster nodes will hopefully realign with the elected leader. Use `patronictl list` frequently to keep tabs on the situation.*
- Hard reset the defective patroni-master-0 with `patronictl reinit master patroni-master-0`.
  - *We can now safely tell the defective node to reset to the timeline that patroni-master-1 is on and rebuild.*
- Run `patronictl list` to check the status of the cluster. You want to ensure that all nodes are online and on the same timeline.
- Force a failover to patroni-master-0 with `patronictl failover master patroni-master-0`.
  - *As our StatefulSet spins up in the OrderedReady configuration, node 0 will always be the first one to come up.*
  - *By setting the master to node 0, we can ensure that when we shut down Patroni and reboot it later, it will be the rightful master instance.*
- Modify the StatefulSet replica count to 0, and restore the liveness/readiness probes in the StatefulSet.
  - *In order to restore the health checks, we need to shut down the entire cluster again.*
- Modify the StatefulSet replica count to 3, and verify cluster health.
  - *Once the health check definitions are back, restart the cluster in full and verify that it is working as intended.*
  - *By the end of this step, you should have a fully operational Patroni database again, with hopefully no data loss during the outage.*
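The scaling and probe changes referenced above are ordinary StatefulSet edits. A rough sketch of the kind of `oc` commands involved, assuming the StatefulSet is named patroni-master and the Patroni container is the first container in the pod template (adjust names and paths to your deployment):
$ oc scale statefulset/patroni-master --replicas=0    # shut the whole cluster down
$ oc scale statefulset/patroni-master --replicas=3    # ...and later bring it back up
$ oc patch statefulset/patroni-master --type=json \
    -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
         {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
$ oc port-forward patroni-master-1 15432:5432    # inspect the data with a local DB tool
Restoring the probes afterwards means re-applying their original definitions, for example by re-running the deployment template that created the StatefulSet.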
Prevention
One of the key lessons from this incident is that while Patroni High Availability is quite resilient by design, it can also easily mask degradation symptoms within the cluster. If these signals are not caught and dealt with early enough, you can end up in a total cluster failure situation. Going forward, we can avoid this situation by being more vigilant and periodically checking, manually, that our Patroni clusters are healthy.
Down the road, it would be more beneficial to have some form of monitoring and alerting in place to inform the team when a pod in the cluster has become defective. One potential avenue for further exploration is to leverage the Patroni REST API and periodically probe it for status. Another avenue of consideration is to change the StatefulSet to use `Parallel` instead of `OrderedReady`. While this would prevent this specific incident from happening, there is potential for a different class of issues (multiple leaders, deadlocks, etc.) to occur. More research will need to be done in order to evaluate the risk of both StatefulSet approaches.
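As a rough sketch of what that probing could look like: each Patroni node serves its REST API on port 8008 by default, and the health endpoints answer with HTTP status codes that are easy to alert on (the hostnames below are illustrative):
$ curl -s -o /dev/null -w '%{http_code}\n' http://patroni-master-0:8008/health    # 200 when PostgreSQL is running
$ curl -s -o /dev/null -w '%{http_code}\n' http://patroni-master-0:8008/leader    # 200 only on the current leader
$ curl -s -o /dev/null -w '%{http_code}\n' http://patroni-master-0:8008/replica   # 200 only on a healthy replica
A monitoring job could, for example, flag the cluster as degraded whenever any member fails /health, or when the number of members answering 200 on /leader is not exactly one.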