Kubernetes Liveness and Readiness Probes

Background

Applications can exhibit instability due to various factors, including temporary loss of connectivity, misconfigurations, or internal application faults. Kubernetes enhances application health management through liveness and readiness probes, facilitating automatic container restarts when necessary. However, for detailed insights into application status, resource usage, and errors, developers should integrate monitoring and observability tools alongside Kubernetes.

A probe in Kubernetes is a mechanism that performs periodic health checks on containers to manage their lifecycle. These checks help determine when to restart a container (via the liveness probe) or when a container is ready to handle traffic (via the readiness probe). Developers can define probes in their service deployments using YAML configuration files or the kubectl command-line interface, with the YAML approach being highly recommended for its clarity and version control capabilities.

Types of Probes in Kubernetes

  1. Liveness Probe
  • This determines whether the application running in a container is healthy. If the liveness probe detects an unhealthy state, Kubernetes kills the container and restarts it.
  2. Readiness Probe
  • This determines whether a container is ready to handle requests or receive traffic. When this probe fails, Kubernetes stops sending traffic to the container by removing its IP from the service endpoints, without restarting the container. The application is expected to eventually pass the readiness probe through internal recovery, configuration changes, or by completing its initialization tasks (in practice, we troubleshoot until the readiness probe passes).
  • This is useful when an application performs time-consuming initial tasks, such as establishing network connections, loading files, and warming caches.
  3. Startup Probe
  • This determines whether the application within a container has started. Startup probes are crucial for applications with lengthy startup times, ensuring that the liveness and readiness probes do not interfere prematurely.
  • The startup probe runs before any other probe and disables the other probes until it succeeds. If a container fails its startup probe, the container is killed and handled according to the pod's restartPolicy. A minimal pod spec combining all three probe types is sketched below.
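For orientation, the sketch below shows how all three probe types can be declared together on a container. Every value here (pod name, image, port, and endpoint paths) is a hypothetical placeholder, not VRO configuration; the VRO-specific settings are covered in the sections that follow.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                  # hypothetical pod, for illustration only
spec:
  containers:
    - name: app
      image: example.org/app:latest # hypothetical image
      ports:
        - containerPort: 8080
      # Startup probe: runs first; the other probes are disabled until it succeeds.
      startupProbe:
        httpGet:
          path: /healthz            # hypothetical endpoint
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      # Liveness probe: failing past the threshold causes Kubernetes to restart the container.
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      # Readiness probe: a failure removes the pod from the service endpoints without a restart.
      readinessProbe:
        httpGet:
          path: /ready              # hypothetical endpoint
          port: 8080
        periodSeconds: 10
```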

K8s Probe Configuration for VRO Spring Boot Applications

All VRO Spring Boot applications (BIE Kafka and BIP API), the API gateway, and the Lighthouse API are configured the same way. This section provides a comprehensive guide on configuring Kubernetes liveness and readiness probes for Spring Boot applications.

Pre-configuration Investigation

Before configuring liveness and readiness probes for our Spring Boot applications, we need the following information:

  • Health Check Port: Identify the port used for health checks. It is found in either the application.yaml file or the gradle.properties file of the VRO microservice.
  • Health Check URL Path: Determine the path to the health check URL. This is often specified in the Dockerfile using a HEALTHCHECK CMD directive like this: HEALTHCHECK CMD curl --fail http://localhost:${HEALTHCHECK_PORT}/actuator/health || exit 1
  • Actuator Dependency: Verify if the VRO application includes the Spring Boot Actuator dependency by checking the build.gradle file for this line: implementation 'org.springframework.boot:spring-boot-starter-actuator'.

Configuration Steps

Step 1: Configure Liveness and Readiness probe endpoints

In the application.yaml file (located in the resources directory of the Spring Boot VRO microservice), configure the existing Spring Boot Actuator's health endpoint to include liveness and readiness probes, which are then accessible via specific paths.

(Screenshot: application.yaml Actuator health endpoint configuration)
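The exact settings are shown in the screenshot above. As a rough sketch, enabling the Actuator probe groups in application.yaml looks roughly like this (the port value and exposure list are illustrative assumptions, not the actual VRO values):

```yaml
management:
  server:
    port: 10301          # assumed to match HEALTHCHECK_PORT; the real value lives in application.yaml / gradle.properties
  endpoints:
    web:
      exposure:
        include: health  # expose the health endpoint over HTTP
  endpoint:
    health:
      probes:
        enabled: true    # adds the liveness and readiness health groups
```

With probes enabled, Spring Boot Actuator serves the liveness state at /actuator/health/liveness and the readiness state at /actuator/health/readiness.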

Step 2: Helm Chart values.yaml Configuration

In the Helm chart for the application, modify the values.yaml file to configure the ports, livenessProbe, and readinessProbe. This is where we set the paths for the liveness check (/actuator/health/liveness) and the readiness check (/actuator/health/readiness).

(Screenshot: values.yaml liveness and readiness probe configuration)
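As a sketch of what the screenshot shows, the probe section of values.yaml might look like the following, using the paths above and the timing values explained below (the port and exact key names are assumptions that depend on how the chart is templated):

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 10301            # assumed health check port
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 10301
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 3
```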

initialDelaySeconds: This setting delays the start of the liveness probe checks by 120 seconds after the container has started. This delay allows the application within the container enough time to initialize and start up before Kubernetes begins checking its liveness.

periodSeconds: This configuration specifies the frequency at which the liveness probe checks are performed. With periodSeconds set to 10, Kubernetes will check the liveness of the container every 10 seconds.

timeoutSeconds: This parameter defines the time after which the probe check is considered failed if no response is received. Here, if the liveness probe does not receive a response from the /actuator/health/liveness endpoint within 10 seconds, the check fails. Setting an appropriate timeout prevents spurious failures when the application or system is temporarily slow.

failureThreshold: This setting determines the number of consecutive failures required to consider the probe failed. With a failureThreshold of 3, Kubernetes will mark the liveness probe as failed and restart the container only after three consecutive failures. This threshold helps in avoiding unnecessary restarts for transient or short-lived issues.

Step 3: Helm Chart deployment.yaml Configuration

In the deployment.yaml file, add the configurations for ports, livenessProbe, and readinessProbe within spec.containers:

(Screenshot: deployment.yaml probe configuration under spec.containers)
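A common Helm convention, and likely close to what the screenshot shows, is to copy the probe blocks from values.yaml into the container spec with toYaml. The fragment below is a sketch under that assumption; the values keys for the container name and port are placeholders:

```yaml
# deployment.yaml (fragment of the Deployment template)
spec:
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          ports:
            - containerPort: {{ .Values.healthCheckPort }}   # placeholder values key
          # Copy the probe definitions from values.yaml verbatim.
          livenessProbe:
            {{- toYaml .Values.livenessProbe | nindent 12 }}
          readinessProbe:
            {{- toYaml .Values.readinessProbe | nindent 12 }}
```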

Step 4: Incrementing Chart.yaml version

To track changes and updates, increment the chart version in the Chart.yaml file with every change to the chart and Helm templates. Automating this process is future work that VRO has yet to discuss.

Step 5: Testing endpoints

Ensure the application's health endpoints (/actuator/health/liveness and /actuator/health/readiness) are correctly implemented and return the expected statuses. Locally, spin up the specific VRO microservice and confirm that the service container is running in Docker Desktop. Then, in a browser, open a URL of the form http://localhost:${HEALTHCHECK_PORT}/actuator/health

For example, http://localhost:10301/actuator/health will show a healthy or unhealthy status.

We can also confirm successful implementation of the probes in the pods on Lens after deploying:

(Screenshot: probe configuration shown on the deployed pod in Lens)

It is best practice to regularly review and test the liveness and readiness configurations to ensure they accurately reflect the application's health and readiness states.

K8s Probe Configuration for VRO Ruby Applications

The VRO platform has only one Ruby application at this time (BIS-Api, formerly known as BGS-Api). This section provides a comprehensive guide on configuring Kubernetes liveness and readiness probes for Ruby applications.

Pre-configuration Investigation

Before configuring liveness and readiness probes for our Ruby application, we need the following information:

  • Application Type: Determine whether the application is serverless or server-based (utilizing a web server like Puma). This information can usually be found in the Gemfile (look for gem 'puma'). Our BIS-Api application is serverless, so we won't be able to configure an endpoint to test in a browser.
  • Health Check Necessity: Serverless applications may not require traditional health check endpoints. For server-based applications, ensure a health check URL or endpoint is configured or available.
  • Application Interaction: Understand how the application interacts, which can involve:
    • Reviewing the application source code or consulting with the code owner.
    • Examining logs to understand application interactions.

Configuration Steps

Step 1: Create Scripts for Liveness, Readiness, and Startup Checks

Create liveness_script.rb, readiness_script.rb, and startup_script.rb scripts within a healthcheck directory. These scripts should perform essential health checks relevant to the application's operation. Our BIS-Api startup, liveness, and readiness scripts perform the following:

  • startup_script.rb: Checks the configuration of the BIS-Api environment and application, and prevents the liveness and readiness probes from starting prematurely.
  • liveness_script.rb: Verifies the BIS-Api app's ability to connect to RabbitMQ by establishing a connection using the RabbitSubscriber class initialized with BUNNY_ARGS.
  • readiness_script.rb: Checks both that the BIS-Api app can connect to RabbitMQ and that it can fetch data (vro_participant_id) from BIS.

It is best practice to implement logging and error handling in our startup, liveness, and readiness scripts. This will facilitate troubleshooting, as we can determine which methods were or weren't successful based on the error messages.

Step 2: Update the paths of the startup, liveness, and readiness scripts in the Dockerfile

Modify the application's Dockerfile to include the health check scripts in the BIS-Api image. First, copy the liveness_script.rb, readiness_script.rb, and startup_script.rb files from your local healthcheck directory into the /app/healthcheck/ directory inside the Docker image being built. Then ensure these scripts are executable.

(Screenshot: Dockerfile copying the health check scripts and making them executable)

Step 3: Helm Chart values.yaml Configuration

In the BIS-Api Helm chart, update the values.yaml file to specify how the Kubernetes probes should execute the health check scripts:

(Screenshot: BIS-Api values.yaml probe configuration)
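Because BIS-Api does not expose an HTTP endpoint, the probes execute the health check scripts directly with an exec command (the same bundle exec invocations used for manual testing in Step 6). The sketch below illustrates the idea; the timing values are placeholders, not the chart's actual settings:

```yaml
startupProbe:
  exec:
    command: ["bundle", "exec", "ruby", "healthcheck/startup_script.rb"]
  failureThreshold: 30     # placeholder: give the app time to finish starting
  periodSeconds: 10        # placeholder
livenessProbe:
  exec:
    command: ["bundle", "exec", "ruby", "healthcheck/liveness_script.rb"]
  periodSeconds: 30        # placeholder
  timeoutSeconds: 10       # placeholder
readinessProbe:
  exec:
    command: ["bundle", "exec", "ruby", "healthcheck/readiness_script.rb"]
  periodSeconds: 30        # placeholder
  timeoutSeconds: 10       # placeholder
```

These blocks are then pulled into deployment.yaml under spec.containers in the same way as for the Spring Boot charts (see Step 4).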

Step 4: Configure the deployment.yaml file

Update the BIS-Api deployment.yaml file in the Helm chart to include the probe configurations under spec.containers:

(Screenshot: BIS-Api deployment.yaml probe configuration under spec.containers)

Step 5: Increment the Chart.yaml version

To track changes and updates, increment the chart version in the Chart.yaml file with every change to the chart and Helm templates.

Step 6: Test probes functionality

  • Deploy your custom branch to svc-bgs-api in Dev

(Screenshot: deploying the custom branch to svc-bgs-api in Dev)

  • Open Lens to confirm a successful deployment, then open the svc-bgs-api pod to see the probes configured on it:

(Screenshot: probe configuration shown on the svc-bgs-api pod in Lens)

  • Run the following commands to check the startup, liveness, and readiness probes respectively:
    • Startup check: bundle exec ruby healthcheck/startup_script.rb
    • Liveness check: bundle exec ruby healthcheck/liveness_script.rb
    • Readiness check: bundle exec ruby healthcheck/readiness_script.rb
  • Restart the Dev RabbitMQ container and run these commands again. Both the liveness and readiness probes will fail, since connectivity between BIS-Api and RabbitMQ is a requirement for both probes.
  • If BIS is down and BIS-Api can't retrieve vro_participant_id, then the readiness probe will also fail.

Troubleshooting Guidance

VRO Kubernetes probe scripts are designed with comprehensive logging and error handling mechanisms. This approach significantly aids in troubleshooting efforts, allowing us to quickly ascertain which methods have succeeded or failed. By examining the detailed error messages and log outputs, we can efficiently pinpoint issues and expedite the resolution process.