
Fault Injection

Apply some chaos engineering by throwing in some HTTP errors or network delays. Understanding failure scenarios is a critical aspect of microservices architecture (a.k.a. distributed computing).

Important
Before Start

You should have NO virtualservice nor destinationrule in the tutorial namespace. Check with:

istioctl get virtualservice
istioctl get destinationrule

If any are listed, run:

./scripts/clean.sh

HTTP Error 503

By default, recommendation v1 and v2 are randomly load-balanced, as that is the default behavior in Kubernetes/OpenShift.

$ oc get pods -l app=recommendation -n tutorial
or
$ kubectl get pods -l app=recommendation -n tutorial

NAME                                  READY     STATUS    RESTARTS   AGE
recommendation-v1-3719512284-7mlzw   2/2       Running   6          18h
recommendation-v2-2815683430-vn77w   2/2       Running   0          3h

You can inject 503s for approximately 50% of the requests

istioctl create -f istiofiles/destination-rule-recommendation.yml -n tutorial
istioctl create -f istiofiles/virtual-service-recommendation-503.yml -n tutorial
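
For reference, the virtual service applied above attaches an Istio fault-injection rule to the normal route, while the destination rule defines the v1/v2 subsets used elsewhere in the tutorial. A minimal sketch of what istiofiles/virtual-service-recommendation-503.yml roughly contains (exact field values may differ in the repo; older Istio releases use percent: 50 instead of percentage):

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: recommendation
    spec:
      hosts:
      - recommendation
      http:
      - fault:
          abort:
            httpStatus: 503      # respond with HTTP 503 ...
            percentage:
              value: 50.0        # ... for roughly half of the requests
        route:
        - destination:
            host: recommendation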

curl customer-tutorial.$(minishift ip).nip.io
customer => preference => recommendation v1 from '99634814-sf4cl': 88
curl customer-tutorial.$(minishift ip).nip.io
customer => 503 preference => 503 fault filter abort
curl customer-tutorial.$(minishift ip).nip.io
customer => preference => recommendation v2 from '2819441432-qsp25': 51

Clean up

istioctl delete -f istiofiles/virtual-service-recommendation-503.yml -n tutorial

Delay

The most insidious of possible distributed computing faults is not a "down" service but a service that is responding slowly, potentially causing a cascading failure in your network of services.

istioctl create -f istiofiles/virtual-service-recommendation-delay.yml -n tutorial
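
The delay rule has the same shape as the 503 abort rule, but injects latency instead of an error. A rough sketch of istiofiles/virtual-service-recommendation-delay.yml (the delay duration and percentage used in the repo may differ):

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: recommendation
    spec:
      hosts:
      - recommendation
      http:
      - fault:
          delay:
            fixedDelay: 7.000s   # hold each matched request before forwarding it
            percentage:
              value: 50.0        # delay roughly half of the requests
        route:
        - destination:
            host: recommendation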

And hit the customer endpoint

./scripts/run.sh

You will notice many requests to the customer endpoint now have a delay. If you are monitoring the logs for recommendation v1 and v2, you will also see that the delay happens BEFORE the recommendation service is actually called.

stern recommendation -n tutorial

or

./kubetail.sh recommendation -n tutorial

Clean up

istioctl delete -f istiofiles/destination-rule-recommendation.yml -n tutorial
istioctl delete -f istiofiles/virtual-service-recommendation-delay.yml -n tutorial

Retry

Instead of failing immediately, retry the service N more times.

We will make pod recommendation-v2 fail 100% of the time. Open a shell inside one of the recommendation-v2 pods; the command below picks the first matching pod name automatically:

oc exec -it $(oc get pods|grep recommendation-v2|awk '{ print $1 }'|head -1) -c recommendation /bin/bash
or
kubectl exec -it $(kubectl get pods|grep recommendation-v2|awk '{ print $1 }'|head -1) -c recommendation /bin/bash

You will be inside the application container of your pod recommendation-v2-2036617847-spdrb. Now execute:

curl localhost:8080/misbehave
exit

This is a special endpoint that will make our application return only `503`s.

Now, if you hit the customer endpoint several times, you should see some 503s

./scripts/run.sh

customer => preference => recommendation v1 from 'b87789c58-h9r4s': 864
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 865
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 866
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 867
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 868
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 869
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'

Now add the retry rule

istioctl create -f istiofiles/virtual-service-recommendation-v2_retry.yml -n tutorial
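
The retry rule tells the Envoy sidecar to re-attempt failed calls instead of returning the 503 to the caller. A sketch of what istiofiles/virtual-service-recommendation-v2_retry.yml roughly looks like (the attempt count and per-try timeout here are assumptions and may differ in the repo):

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: recommendation
    spec:
      hosts:
      - recommendation
      http:
      - route:
        - destination:
            host: recommendation
        retries:
          attempts: 3            # retry a failed request up to 3 times
          perTryTimeout: 2s      # give each attempt 2 seconds before retrying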

You will see it work every time because Istio will retry the failing requests to the recommendation service until they land on v1.

./scripts/run.sh

customer => preference => recommendation v1 from '2036617847-m9glz': 196
customer => preference => recommendation v1 from '2036617847-m9glz': 197
customer => preference => recommendation v1 from '2036617847-m9glz': 198

You can see the active Virtual Services via

istioctl get virtualservices -n tutorial

Now, delete the retry rule and see the old behavior, where v2 throws 503s

istioctl delete virtualservice recommendation -n tutorial

./scripts/run.sh

customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1118
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1119
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1120
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1121
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1122
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1123
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1124
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1125
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'
customer => preference => recommendation v1 from 'b87789c58-h9r4s': 1126
customer => 503 preference => 503 recommendation misbehavior from '6f64f9c5b-ltrhl'

Now, make the pod v2 behave well again

oc exec -it $(oc get pods|grep recommendation-v2|awk '{ print $1 }'|head -1) -c recommendation /bin/bash
or
kubectl exec -it $(kubectl get pods|grep recommendation-v2|awk '{ print $1 }'|head -1) -c recommendation /bin/bash

You will be inside the application container of your pod recommendation-v2-2036617847-spdrb. Now execute:

curl localhost:8080/behave
exit

The application is back to random load-balancing between v1 and v2

./scripts/run.sh

customer => preference => recommendation v1 from '2039379827-h58vw': 129
customer => preference => recommendation v2 from '2036617847-m9glz': 207
customer => preference => recommendation v1 from '2039379827-h58vw': 130

Timeout

Wait only N seconds before giving up and failing. At this point, no other virtual service nor destination rule (in the tutorial namespace) should be in effect. To check, run:

istioctl get virtualservice
istioctl get destinationrule

If any are listed, delete them:

istioctl delete virtualservice <virtualservicename> -n tutorial
istioctl delete destinationrule <destinationrulename> -n tutorial

First, introduce some wait time in recommendation v2 by uncommenting the line that calls the timeout() method. Update RecommendationVerticle.java, making it a slow performer with a 3-second delay:

    @Override
    public void start() throws Exception {
        Router router = Router.router(vertx);
        router.get("/").handler(this::logging);
        router.get("/").handler(this::timeout);
        router.get("/").handler(this::getRecommendations);
        router.get("/misbehave").handler(this::misbehave);
        router.get("/behave").handler(this::behave);

        HealthCheckHandler hc = HealthCheckHandler.create(vertx);
        hc.register("dummy-health-check", future -> future.complete(Status.OK()));
        router.get("/health").handler(hc);

        vertx.createHttpServer().requestHandler(router::accept).listen(8080);
    }

Rebuild and redeploy

cd recommendation/java/vertx

mvn clean package

docker build -t example/recommendation:v2 .

docker images | grep recommendation

oc delete pod -l app=recommendation,version=v2 -n tutorial
or
kubectl delete pod -l app=recommendation,version=v2 -n tutorial

cd ../../..

Hit the customer endpoint a few times to see the load-balancing between v1 and v2, with v2 now taking a bit of time to respond

./scripts/run.sh

Then add the timeout rule

istioctl create -f istiofiles/virtual-service-recommendation-timeout.yml -n tutorial
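
The timeout rule caps how long the caller waits for recommendation before Envoy gives up with a 504. Since the output below shows failures after about 1 second against the 3-second delay, the file presumably sets a 1-second timeout, roughly like this sketch:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: recommendation
    spec:
      hosts:
      - recommendation
      http:
      - route:
        - destination:
            host: recommendation
        timeout: 1.000s          # give up if recommendation has not responded within 1 second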

You will see it return v1 OR "upstream request timeout" after waiting about 1 second

./scripts/run.sh

customer => 503 preference => 504 upstream request timeout
curl customer-tutorial.$(minishift ip).nip.io  0.01s user 0.00s system 0% cpu 1.035 total
customer => preference => recommendation v1 from '2039379827-h58vw': 210
curl customer-tutorial.$(minishift ip).nip.io  0.01s user 0.00s system 36% cpu 0.025 total
customer => 503 preference => 504 upstream request timeout
curl customer-tutorial.$(minishift ip).nip.io  0.01s user 0.00s system 0% cpu 1.034 total

Clean up by deleting the timeout rule

istioctl delete -f istiofiles/virtual-service-recommendation-timeout.yml -n tutorial

or you can run:

./scripts/clean.sh