Redis Troubleshooting - bcgov/common-service-showcase GitHub Wiki

Redis Troubleshooting

Redis is a performant in-memory data store. While it is relatively easy to use in standalone mode, cluster mode introduces a handful of challenges that you must keep a close eye on during normal operations. High Availability is a complex topic, and there are many ways that a cluster can degrade or fail, eventually leading to an outage. This article outlines general tips and suggestions for common Redis maintenance-related tasks, as well as considerations and steps for recovering from more serious issues. For more background information on Redis deployments and examples, check out the following gist here.

General CLI Commands

Most of the time, you will be interacting with Redis directly through its CLI application, redis-cli. If your Redis instance requires authentication, you will need to add -a YOURPWHERE as an argument. However, our templates define the environment variable REDISCLI_AUTH, which lets you skip passing the password explicitly on the command line. If you are in standalone mode, running redis-cli alone is sufficient to drop into an interactive terminal. If you are in cluster mode, you must also add the -c argument to tell redis-cli to automatically follow redirections between nodes within the cluster.

You can also run one-liner redis commands in-line with the following format:

redis-cli -c <REDIS COMMAND HERE>

The CLI will interpret your command, and then execute the command as a one-off execution, returning the results to stdout. If you do not need interactive mode, this is the recommended way to interact with and manipulate your Redis instance.
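As a minimal sketch of the two invocation styles described above, the helper below adds -c only when talking to a cluster. The function name rcli and the variables REDIS_MODE and REDIS_CLI are hypothetical names introduced for this illustration and are not part of our templates; REDIS_CLI exists only so the helper can be dry-run against a stub.

```shell
# Hypothetical helper: adds -c only when talking to a cluster.
# REDIS_MODE ("cluster" or "standalone") and REDIS_CLI are assumed names.
rcli() {
  if [ "${REDIS_MODE:-cluster}" = "cluster" ]; then
    # Cluster mode: let redis-cli follow redirections between nodes.
    "${REDIS_CLI:-redis-cli}" -c "$@"
  else
    # Standalone mode: no extra flag needed.
    "${REDIS_CLI:-redis-cli}" "$@"
  fi
}
```

With this in place, something like rcli PING would run a one-off command and return the result to stdout, relying on REDISCLI_AUTH for authentication where it is defined.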

Common CLI Commands

Redis has a large variety of commands, all of which are documented in great detail in the official documentation here. With so many commands available, it may be daunting to figure out which ones you need to get the job done. However, in most cases you will only need an extremely small subset of commands, which we list below for general Redis inspection:

There are plenty more commands that may be used depending on how you leverage your Redis, but the above few are generally enough to assist with quick observations. Check the following cheat sheet for a succinct list of other potential commands you may need.
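As a quick illustration of this kind of inspection, the sketch below looks at a single key using three standard Redis commands. The choice of EXISTS, TYPE, and TTL is our assumption of a typical starting point, not an authoritative list, and both inspect_key and REDIS_CLI are hypothetical names used only for this example.

```shell
# Hypothetical helper for a quick look at one key.
# REDIS_CLI is an assumed override used only for dry runs.
inspect_key() {
  key=$1
  "${REDIS_CLI:-redis-cli}" -c EXISTS "$key"   # 1 if the key exists, 0 otherwise
  "${REDIS_CLI:-redis-cli}" -c TYPE "$key"     # data type of the key, or "none"
  "${REDIS_CLI:-redis-cli}" -c TTL "$key"      # seconds to expiry; -1 no expiry, -2 missing key
}
```

Calling inspect_key with a key name would run the three one-off commands in sequence.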

Maintenance CLI Commands

These commands are more targeted towards the maintenance of a general Redis instance:

If you are using Redis cluster instead, the following commands may be of use:

Check the following cheat sheet for more sysadmin tasks.

General Diagnosis Steps

On Redis standalone, there is rarely any situation where you would need to perform recovery - the RDB file should be nearly identical to what is in Redis memory at the time. However, if you are on Redis Cluster, there can be a few situations where the cluster is unable to recover itself. Since Redis Cluster is IP based, cluster failures typically appear when there is either a high degree of Pod churn or networking issues. On pod startup, the Redis node will update its own IP, and then attempt to broadcast the change to the rest of the cluster. However, if it is unable to connect to the existing cluster, it will remain isolated until someone intervenes manually.

To determine the status of the cluster, a combination of the CLUSTER INFO and CLUSTER NODES commands will generally give you enough information. You may need to run those commands on multiple nodes, since each node reports only its own view of the cluster. In the CLUSTER INFO output, the main things to look for are whether all 16384 cluster slots are assigned and ok, and whether any failures are reported. In the CLUSTER NODES output, you are mainly checking whether any nodes are flagged as failed. Normally you would end up running something like the following:

redis-cli -c CLUSTER INFO
redis-cli -c CLUSTER NODES | grep fail
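The checks described above can also be scripted. The following is a minimal sketch that reads CLUSTER INFO output on stdin and reports whether the slot coverage looks healthy; the function name check_cluster_info is hypothetical, and the health criteria (state ok, all 16384 slots assigned and ok, no failed slots) are our reading of the guidance above rather than an official rule.

```shell
# Hypothetical helper: summarize CLUSTER INFO output read from stdin.
# Exits 0 when the cluster looks healthy, 1 otherwise.
check_cluster_info() {
  # redis-cli output may carry carriage returns, so strip them first.
  tr -d '\r' | awk -F: '
    $1 == "cluster_state"          { state = $2 }
    $1 == "cluster_slots_assigned" { assigned = $2 }
    $1 == "cluster_slots_ok"       { ok = $2 }
    $1 == "cluster_slots_fail"     { fail = $2 }
    END {
      printf "state=%s assigned=%s ok=%s fail=%s\n", state, assigned, ok, fail
      healthy = (state == "ok" && assigned == 16384 && ok == 16384 && fail == 0)
      exit healthy ? 0 : 1
    }'
}
```

It could be fed with something like redis-cli -c CLUSTER INFO | check_cluster_info, repeated on more than one node if the results look suspicious.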

Cluster Connectivity Restoration

If you begin to see errors such as ERR! ClusterAllFailedError: Failed to refresh slots cache. in your client application, it is very likely that the nodes of your Redis cluster are not fully connected to each other. As a result, the client application cannot map all key slots to the appropriate nodes, because the cluster itself lacks full internal connectivity.

In the event there are node failures or availability/coverage failures, or if you are seeing the ClusterAllFailedError, the following command forces all running cluster nodes to meet with each other. If enough of the nodes are able to reconnect, they should be able to gossip with each other to rebuild and restore the network connections needed to make the cluster functional.

No authentication required (or you have env variable REDISCLI_AUTH defined on pods):

export NAMESPACE=<YOURNAMESPACE>
export INSTANCE=<YOURCLUSTERNAME>

oc get pods -n $NAMESPACE -l statefulset=redis-$INSTANCE -o jsonpath='{range.items[*]}{.status.podIP} 6379 {end}' | xargs -n2 oc exec -n $NAMESPACE -it redis-$INSTANCE-0 -- redis-cli -c CLUSTER MEET

Explicit authentication alternative:

export NAMESPACE=<YOURNAMESPACE>
export INSTANCE=<YOURCLUSTERNAME>
export AUTH=$(oc get secret -n $NAMESPACE redis-$INSTANCE-secret -o jsonpath='{.data.password}' | base64 -d)

oc get pods -n $NAMESPACE -l statefulset=redis-$INSTANCE -o jsonpath='{range.items[*]}{.status.podIP} 6379 {end}' | xargs -n2 oc exec -n $NAMESPACE -it redis-$INSTANCE-0 -- redis-cli -a $AUTH -c CLUSTER MEET

We have found in practice that the above xargs command resolves many strange concurrency behaviors in applications, and is capable of restoring a Redis cluster from total failure (when none of the nodes can reach each other). Whenever there is strange Redis Cluster behavior, the above xargs command should be tried first as it will likely resolve any operational issues you may be facing.
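To see what the xargs stage does before running it against a live cluster, the sketch below performs a dry run: it takes the same "IP port" pairs that the jsonpath expression emits and prints the CLUSTER MEET command that would be issued for each pair. The function name preview_meet and the sample addresses are made up for illustration.

```shell
# Dry run: print the CLUSTER MEET command that would be issued for each
# "<ip> <port>" pair read from stdin, without contacting any cluster.
preview_meet() {
  # -n2 hands xargs two arguments at a time, i.e. one IP and one port.
  xargs -n2 echo redis-cli -c CLUSTER MEET
}

# Example input in the same shape as the jsonpath output (addresses are made up).
printf '10.0.0.1 6379 10.0.0.2 6379 ' | preview_meet
```

Each pod IP produces one CLUSTER MEET invocation, which in the real command is executed on the first pod of the StatefulSet.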

Prevention

Unfortunately, there is no foolproof way of preventing a Redis Cluster from entering a strange state, especially if it is running within a Kubernetes/OpenShift environment. Pod scheduling and network volatility can easily throw a Redis cluster into disarray, simply because Redis refuses to provide support for intelligent DNS based resolution of other Redis nodes. Since we are stuck with IP based management, when a node is unable to get word that another node now exists on a new IP address within the cluster, that Redis node will become disconnected and be unable to behave as intended.

One potential avenue of further research would be to consider ways to automatically detect failures in the cluster, either by periodically parsing the configuration files Redis maintains, or by running the xargs cluster meet command via a cronjob, to hopefully prevent the nodes from losing touch with each other.
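As one possible starting point for such detection, the sketch below counts nodes in CLUSTER NODES output that are flagged as failed or whose link state is disconnected. It is an unproven idea rather than something we run today; the function name count_unhealthy_nodes is hypothetical, and the decision to treat both fail and fail? flags as unhealthy is an assumption.

```shell
# Hypothetical helper: count unhealthy nodes in CLUSTER NODES output (stdin).
# Field 3 holds the comma-separated flags, field 8 the link state.
count_unhealthy_nodes() {
  tr -d '\r' | awk '
    # Match the exact flags "fail" or "fail?" so "nofailover" is not counted.
    NF >= 8 && ($3 ~ /(^|,)fail[?]?(,|$)/ || $8 == "disconnected") { n++ }
    END { print n + 0 }'
}
```

A scheduled job could run redis-cli -c CLUSTER NODES | count_unhealthy_nodes and only trigger the cluster meet command when the count is greater than zero.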

Mass Key Renaming

In the event you need to mass rename your Redis keys, you may try the following. In this example, every key matching the pattern bull:* is renamed so that it matches {bull}:* instead:

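# Quote the pattern so the shell does not expand it; only the leading "bull:" prefix is rewritten.
redis-cli -c --scan --pattern 'bull:*' | \
awk '{old=$0; new_key=$0; sub(/^bull:/,"{bull}:", new_key); printf "*3\r\n$6\r\nRENAME\r\n$%d\r\n%s\r\n$%d\r\n%s\r\n", length(old), old, length(new_key), new_key;}' | \
redis-cli -c --pipe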

Note that the above command is untested. In effect, it streams all existing keys matching the original pattern, transforms each one through awk into a RENAME command in Redis protocol format, and pipes the resulting stream back into Redis. The original idea for this can be found here.
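Because the rename is untested, it may be worth previewing the mapping before piping anything back into Redis. The sketch below reads key names on stdin and prints each old and new name side by side; the function name preview_rename is hypothetical. Keep in mind that in cluster mode RENAME requires the old and new keys to hash to the same slot, so a preview only confirms the naming, not that every rename will succeed.

```shell
# Hypothetical dry run: show how each key would be renamed, without
# touching Redis. Only keys that start with "bull:" are reported.
preview_rename() {
  awk '{
    old = $0
    renamed = $0
    sub(/^bull:/, "{bull}:", renamed)   # rewrite the leading prefix only
    if (renamed != old) print old " -> " renamed
  }'
}
```

It could be fed from the same scan, for example redis-cli -c --scan --pattern 'bull:*' | preview_rename.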