Chaos Mesh - Pranav-SA/thesis-support-examples GitHub Wiki

Chaos Mesh

Chaos Mesh is an open-source, cloud-native chaos engineering platform. It offers many types of fault simulation and can orchestrate them into complex fault scenarios. With Chaos Mesh, you can simulate abnormalities that might occur in development, testing, and production environments and find potential problems in the system. It was created by PingCAP to test the resilience of its distributed database, TiDB, and it is easy to use with other types of applications running in Kubernetes.

Following a typical operator architecture, a controller manager pod runs in a regular Deployment and watches its own CRDs (NetworkChaos, IOChaos, StressChaos, PodChaos, KernelChaos, and TimeChaos). Users create objects of these types to specify and start chaos experiments.

To carry out the requested actions against applications, the controller may have to contact the Chaos Mesh daemon, which is deployed as a DaemonSet, so that the daemon can, for instance, manipulate the network stack locally to affect target pods running on the same physical node. For the I/O type of chaos, such as simulated failures or delays in file-system reads and writes, the application pods need to share their volume mounts with a sidecar container that intercepts file-system calls. Sidecars are injected during app deployments with the support of an Admission Webhook.
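
As a rough illustration of what an I/O experiment looks like from the user's side, the sketch below declares an IOChaos object that delays half of the file-system calls under one mounted volume. The field names are based on the Chaos Mesh 2.x IOChaos schema and should be checked against the installed version, since the schema has changed between releases. The resource name, the label 'app': 'app1', the path /data, and all values are assumptions chosen for illustration.

```yaml
# Sketch only: names, paths, and values are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-sketch
  namespace: chaos-testing
spec:
  action: latency          # delay file-system calls instead of failing them
  mode: one                # pick a single pod among those matched by the selector
  selector:
    labelSelectors:
      'app': 'app1'
  volumePath: /data        # mount point of the volume inside the target container (assumed)
  path: '/data/**/*'       # files under the volume that the fault applies to
  delay: '100ms'           # latency added to each affected call
  percent: 50              # share of calls that are affected
  duration: '60s'          # how long the fault stays active
```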

Installation

Chaos Mesh is built around CustomResourceDefinitions (CRDs) and can be deployed quickly with the install script shown below. A Helm chart is also available in the project repository. Management is fairly straightforward when the Helm chart is used, since it is maintained by the community. On the other hand, the injection of sidecar containers and the required DaemonSet make Chaos Mesh a bit harder to operate, as it can be considered quite intrusive to the cluster.

curl -sSL https://mirrors.chaos-mesh.org/v2.3.0/install.sh | bash

Other installation options are described in the Chaos Mesh documentation.

Using Chaos Mesh, operators can inject faults into the network, disk, file system, operating system, and other areas. Experiments can be created in a user-friendly GUI or defined in a YAML file. Experiment targets include both Kubernetes environments and physical nodes.

Experiments

Chaos types are grouped into the following categories: network, pod, I/O, time, kernel, and stress, each with its own CRD type. They all share a common selector entry for finding target pods, along with an optional duration or recurring schedule for the chaos. Some of them, like NetworkChaos, have additional options, such as delay, corruption, or partition. For example, you can use Chaos Mesh to simulate stress inside containers. The configuration below defines a sample StressChaos experiment that continually reads and writes memory, consuming up to 256MB. Fields can easily be changed to adjust the duration, target pods, size, and other factors.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress-example
  namespace: chaos-testing
spec:
  mode: one
  selector:
    labelSelectors:
      'app': 'app1'
  stressors:
    memory:
      workers: 4
      size: '256MB'
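
The partition option of NetworkChaos mentioned above can be sketched in the same way. The manifest below is a minimal example, assuming two hypothetical workloads labeled 'app': 'app1' and 'app': 'app2' in the default namespace; it blocks traffic from the first group to the second for 30 seconds. The resource name and the duration are likewise illustrative.

```yaml
# Sketch only: labels, namespace, and resource name are hypothetical.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-sketch
  namespace: chaos-testing
spec:
  action: partition        # cut connectivity rather than delay or corrupt packets
  mode: all                # apply to every pod matched by the selector
  selector:
    namespaces:
      - default
    labelSelectors:
      'app': 'app1'
  direction: to            # affect traffic going from the selected pods to the target
  target:
    mode: all
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'app2'
  duration: '30s'
```

The selector and mode fields work exactly as in the StressChaos example; only the action and its action-specific fields change.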

Chaos Mesh can schedule recurring experiments with a built-in cron feature. For example, this YAML snippet from the documentation configures Chaos Mesh to run a NetworkChaos experiment at the fifth minute of every hour. Each run injects 10ms of network latency for 12 seconds.

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: schedule-delay-example
spec:
  schedule: '5 * * * *'
  historyLimit: 2
  concurrencyPolicy: 'Allow'
  type: 'NetworkChaos'
  networkChaos:
    action: delay
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'web-show'
    delay:
      latency: '10ms'
    duration: '12s'
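
The same Schedule wrapper can carry other chaos types. The sketch below assumes that the embedded spec key follows the type name in camel case (podChaos for PodChaos), as networkChaos does in the example above. It kills one pod of a hypothetical 'app': 'web-show' workload every 15 minutes and uses the Forbid concurrency policy so that a new run is skipped while a previous one is still active.

```yaml
# Sketch only: the resource name, cron expression, and label are illustrative.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: schedule-pod-kill-sketch
spec:
  schedule: '*/15 * * * *'     # every 15 minutes
  historyLimit: 2              # keep only the two most recent runs
  concurrencyPolicy: 'Forbid'  # do not start a run while another is active
  type: 'PodChaos'
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        'app': 'web-show'
```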

Security

The platform also supports RBAC as well as blacklisting and whitelisting to help protect the experimentation process itself from abuse. Chaos Mesh uses some Linux utilities to implement the low-level chaos types, and it also needs to use the Docker API on the host machine. Therefore, the daemon Pods (deployed as a DaemonSet) run as privileged containers and mount the /var/run/docker.sock socket file. If sidecar injection is enabled, the controller manager Pod requires permission to manage MutatingWebhookConfiguration, in addition to the other expected role-based access control (RBAC) permissions.
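
To make the RBAC point concrete, the sketch below uses standard Kubernetes RBAC objects to limit a user to managing chaos experiments in a single namespace. All names here (chaos-operator, chaos-experiment-manager, the default namespace) are hypothetical, and the exact set of resources a real setup needs may be broader than what is shown.

```yaml
# Sketch only: every name below is hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-operator
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-experiment-manager
  namespace: default
rules:
  # Read access to the pods that experiment selectors resolve against.
  - apiGroups: ['']
    resources: ['pods']
    verbs: ['get', 'list', 'watch']
  # Management of Chaos Mesh custom resources, in this namespace only.
  - apiGroups: ['chaos-mesh.org']
    resources: ['*']
    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch', 'delete']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-experiment-manager-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: chaos-operator
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-experiment-manager
```

Because a Role is namespaced, an account bound this way cannot create experiments in other namespaces, which narrows the blast radius of a mistaken or malicious experiment.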

Observation

  • The main project repository mentions a chaos dashboard side project, but it appears to work only for tests with their database product. Building a more generic dashboard project is on the roadmap. So far, the state of chaos experiments can be monitored by inspecting the custom resource objects in the cluster.

  • Unlike the other high-level tools in this list, Chaos Mesh does not have a strict concept of an experiment, and it’s not an orchestrator with different implementation options. In this sense, it works similarly to Pumba, as a simple chaos injector. Because it is available as a Kubernetes operator with a range of chaos options based on CRD types, it’s easy to install and use. While the documentation could be better, the list of chaos types and configuration options is impressive and requires no additional tools.