Litmus Chaos - Pranav-SA/thesis-support-examples GitHub Wiki

Litmus

Litmus is an open-source Chaos Engineering platform that enables teams to identify weaknesses & potential outages in infrastructures by inducing chaos tests in a controlled way.

Developers and SREs can practice chaos engineering with Litmus because it is easy to use and builds on modern chaos engineering practices and community collaboration. Litmus is a complete chaos framework that focuses entirely on Kubernetes workloads. It consists of an operator written in Go that currently uses three main CRDs to execute an experiment:

ChaosExperiment: The definition of the experiment, with default parameters.
ChaosEngine: Binds the experiment definition to the chaos target workload. If the binding succeeds, the Litmus operator initiates the experiment, and any variables specified in the ChaosEngine manifest override the experiment's defaults.
ChaosResult: Displays basic information about the progress and, eventually, the result of the experiment.
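
As a rough illustration of the first CRD, a ChaosExperiment carries the runner definition together with default environment variables. The sketch below assumes the `spec.definition` layout used by experiments published on ChaosHub. The image, command, and default values are placeholders, not the contents of any particular Litmus release.

```yaml
# Illustrative ChaosExperiment; values are assumptions, not copied from a release.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: default
spec:
  definition:
    scope: Namespaced
    # Hypothetical runner image and entrypoint; take the real ones from ChaosHub
    image: 'example.registry/chaos-runner:latest'
    command:
      - /bin/bash
    args:
      - -c
      - echo "pod-delete logic would run here"
    # Defaults that apply unless a ChaosEngine overrides them
    env:
      - name: TOTAL_CHAOS_DURATION
        value: '15'
      - name: CHAOS_INTERVAL
        value: '5'
      - name: FORCE
        value: 'true'
```

A ChaosEngine that lists `pod-delete` and sets `TOTAL_CHAOS_DURATION` to `'30'` would then run with 30 seconds instead of the default shown here.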

Once a ChaosEngine object is created, Litmus creates the chaos runner pod in the target namespace. This runner orchestrates the experiment in the specified namespace and against the specified targets. Target identification is what sets Litmus apart. To zero in on the target, the user adds a specific annotation to the deployment (DaemonSet, StatefulSet, and DeploymentConfig workloads are also supported). The user then sets the labels and fields in the ChaosEngine object (an example appears under Experiments) so that Litmus can locate all (or some) of the pods of the target deployment. Once the operator verifies that all of these prerequisites are met (correct labeling, annotation, ChaosExperiment object, permissions), it creates the experiment runner pod, which is responsible for executing the experiment. This workflow limits the blast radius of an experiment and allows experiments to run concurrently.
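
To make the annotation and labeling step concrete, here is a minimal sketch of a target deployment. It assumes the `litmuschaos.io/chaos` annotation key and reuses the `chaos=true` label from the ChaosEngine example on this page. The deployment name, container name, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                    # hypothetical workload name
  namespace: default
  labels:
    chaos: 'true'                   # matched by applabel: 'chaos=true'
  annotations:
    litmuschaos.io/chaos: 'true'    # opts this workload in to chaos
spec:
  replicas: 2
  selector:
    matchLabels:
      chaos: 'true'
  template:
    metadata:
      labels:
        chaos: 'true'               # pods carry the same label so they can be located
    spec:
      containers:
        - name: web
          image: nginx:1.25
```

With `annotationCheck: 'true'` in the ChaosEngine, a workload that has the label but lacks the annotation would not be selected, which is one way the blast radius stays limited.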

Installation

The Litmus operator is a lightweight, stateless Go application that can be deployed as a simple Deployment object in a Kubernetes cluster. Litmus provides two modes for orchestrating an experiment. The default mode restricts the experiment to a particular namespace, which is the process described above; the second, Admin mode, is covered under Security. In the default mode, cluster administrators need to be mindful of resource utilization, because the correct execution of the experiments depends on the resources available in each namespace. Litmus is easy to install using Helm. Helm 3 requires a release name; the release name and namespace below are examples, and the litmuschaos chart repository is assumed to have been added already:

helm install litmus litmuschaos/litmus --namespace litmus --create-namespace

Experiments

Once installed, engineers can choose a chaos scenario from a number of predefined Litmus workflows. ChaosHub is an open marketplace that hosts many Litmus experiments for running chaos on various infrastructures. Litmus can also chain experiments into sequences, so several experiments can run one after another to wreak as much havoc as you like.

Litmus currently uses an Ansible runner to define and execute the experiment, depending on the chosen chaos library. However, there is active development on a simpler, more lightweight Go runner, which the community seems to agree is the way forward. Litmus provides a well-defined way to choose your own experiment runner. It uses the concept of chaos libraries, which define the packages used to execute the experiment. For example, one experiment can use the Litmus native library to kill a pod, while another can use the Pumba library to perform a network experiment.

The following ChaosEngine manifest binds the pod-delete experiment to a deployment labeled chaos=true:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: 'chaos=true'
    appkind: 'deployment'
  # It can be true/false. If true, it will enforce the appinfo checks
  annotationCheck: 'true'
  # It can be active/stop. Patch to stop to abort an experiment
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: <service_account>
  monitoring: false
  # Determines if Litmus will cleanup at the end of the experiment. It can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '30'
            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
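
The chaos library choice described before the manifest is typically made through the same `env` list. The fragment below is a sketch of how a network experiment might select the Pumba library. The `LIB` and `NETWORK_LATENCY` variable names and their values are assumptions based on ChaosHub conventions and should be checked against the chosen experiment's definition.

```yaml
# Fragment of a ChaosEngine spec; env names are assumptions to verify
# against the experiment's ChaosHub entry.
experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
          # select the chaos library that performs the fault injection
          - name: LIB
            value: 'pumba'
          # injected latency in milliseconds
          - name: NETWORK_LATENCY
            value: '2000'
          - name: TOTAL_CHAOS_DURATION
            value: '60'
```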

Internally, once the ChaosEngine object is validated and created, Litmus creates a regular Kubernetes Job with all the required parameters, which executes the experiment against the target.

Security

In terms of security, Litmus requires a well-defined set of cluster role permissions. In addition, every experiment requires its experiment-specific service account, role, and role binding objects to exist in the target namespace. Litmus is a multi-faceted framework whose layers each need appropriate attention from a security standpoint. Litmus also supports an Admin mode, in which the chaos runner and experiment runner are both created alongside the operator in the same namespace. From there, the experiment runner locates the target namespace and application to perform the experiment. In this mode, the focus is on centralizing the created chaos resources.
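
The per-experiment service account, role, and role binding mentioned above might look like the following for pod-delete in the `default` namespace. This is a sketch: the `pod-delete-sa` name and the exact rule set are assumptions, and the permissions an experiment really needs should be taken from its documentation.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: default
rules:
  # pods are deleted; logs and events are read and written for reporting
  - apiGroups: ['']
    resources: ['pods', 'pods/log', 'events']
    verbs: ['create', 'get', 'list', 'patch', 'update', 'delete', 'deletecollection']
  # the target workload is looked up, not modified
  - apiGroups: ['apps']
    resources: ['deployments']
    verbs: ['get', 'list']
  # the experiment itself runs as a Job
  - apiGroups: ['batch']
    resources: ['jobs']
    verbs: ['create', 'get', 'list', 'delete', 'deletecollection']
  # Litmus custom resources
  - apiGroups: ['litmuschaos.io']
    resources: ['chaosengines', 'chaosexperiments', 'chaosresults']
    verbs: ['create', 'get', 'list', 'patch', 'update']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: default
```

The service account name is what would go into the `chaosServiceAccount` field of the ChaosEngine.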

Observations

Litmus has some limitations, mainly around observability and cluster permissions. With multiple concurrent executions, it is hard to get a clear picture of all the experiments. As for permissions, Litmus needs elevated cluster privileges: it requires control not only of the workload resources of the related API groups but also of node resources.
Litmus is easy to use. However, finishing an experiment requires a bit more work: labels and annotations must be removed and CRs deleted manually, steps that the user should eventually automate. There is an ongoing effort to create Argo workflows that add this extra management layer and orchestrate different experiments end to end.
Using Litmus, engineers can also create custom workflows and schedule workflows to run on a regular basis. For a free, open-source tool, Litmus is surprisingly comprehensive, offering a feature-rich platform with a SaaS-like console.