1.5.4 Horizontal pod autoscaling

Scaling up and down your services based on workload metrics

One great potential benefit of using cloud computing, and GKE in particular, is that you can adjust your resource utilization to match the current demand. In traditional IT operations we, as a business, need to buy machines that can handle the largest load we will ever take, and preferably a little more than that in case we grow in popularity. Those machines then just stand there over time, costing us money while doing mostly nothing or very little.

GKE allows you to define so-called HorizontalPodAutoscalers (HPAs) that, based on metrics from the services themselves, can scale deployment replicas up and down within a desired range.

Clean up any existing deployment in your cluster and change to the following folder:

cd ../horizontal-pod-autoscaling

Scale up the cluster

To make this work decently we need to allow some more nodes in our cluster. When running a cluster with a single node, some of the capacity of that node is already used up by the Kubernetes system services themselves, so we need to make some more room for our own services.

One way of scaling up is to use the SDK by running the following command:

gcloud container clusters resize pingpong-site1-cluster \
  --zone=europe-west3-a \
  --node-pool default-pool \
  --num-nodes 3

Alternatively, using the web console, navigate to Kubernetes Engine, select Clusters and then the default-pool of your cluster. A node pool size of 3 nodes is recommended.

Node pool autoscaling is a good thing to use in real-world setups to save resources when you are not using them, but in this showcase it works best with a fixed number of nodes.
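When the resize is done you can verify that the extra nodes have joined the cluster. A quick check, assuming your kubectl context points at pingpong-site1-cluster:

# List the cluster nodes; all three should report STATUS Ready
kubectl get nodes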

A few words before you apply

The pingpong-service is written in Java, with the Spring Boot framework, and Java processes are quite CPU and memory intensive at startup, which is a downside when you're trying to demonstrate an HPA in a quick example. There is a risk that the HPA will think that the CPU has been very high over the last minutes, simply because of the startup, and therefore start a new instance of the service, and so on, until the maximum number of replicas for the HPA is reached; 4 in this case.

I've tried to avoid this by also setting a longer readiness probe initialDelaySeconds, in the hope that the HPA will not collect the high startup metrics before the service has cooled down, but there is no guarantee!

If scale up happens please have some patience and allow the services to "cool down". The HPA will scale down the services when the CPU utilization has been low for some time.

Go ahead and kustomize!
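If you have followed the earlier sections you already know the drill. A minimal sketch, assuming the kustomization in this folder targets the pingpong namespace:

# Apply everything referenced by the kustomization in the current folder
kubectl apply -k .

# Keep an eye on the pods while they start up
watch kubectl get pods -n pingpong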

Once the service is stable at one replica continue with the demo.

A look into the HPA

The definition of the HPA in this example looks as follows:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: pingpong-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pingpong-deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 80

The most interesting parts here are:

  • minReplicas, which says how many replicas/pods we will at least have in any situation
  • maxReplicas, which says how many replicas/pods we will at most have in any situation
  • targetAverageUtilization, which defines the average CPU utilization (in %) at which we scale up/down. The percentage is relative to the pods' CPU requests (not limits); in our case 80% of 250m CPU = 200m CPU
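Once the HPA is applied you can ask Kubernetes how it interprets this configuration; describe also lists the scaling events the HPA performs later on (the exact output format varies a bit between Kubernetes versions):

# Show the HPA with min/max replicas and current vs. target utilization
kubectl get hpa pingpong-autoscaler -n pingpong

# More detail, including conditions and scaling events
kubectl describe hpa pingpong-autoscaler -n pingpong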

Driving some load against pingpong-service

Attempting to increase the CPU utilization using the same ping as before is possible, but not very easy, because it is a very fast operation. We need to do something heavier!

Fortunately pingpong-service comes with a performance-breaking parameter of its own. All you need to do is add the parameter expensive=true to the regular ping request and the service will be forced to do some costly maths. To save you from wearing out your keyboard trying to send simultaneous curl requests, there is a folder called /vegeta inside the horizontal-pod-autoscaling folder.
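If you want to feel the difference first, you can fire a single expensive request by hand once you have the LoadBalancer IP from the steps below. A sketch, assuming the ping endpoint from the earlier sections is served at /ping on port 8080 (adjust the path to whatever you used before):

# One expensive ping by hand; replace <YOUR-IP> with the LoadBalancer EXTERNAL-IP
curl "http://<YOUR-IP>:8080/ping?expensive=true"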

Vegeta is a load test tool that can be installed with brew install vegeta on a Mac.
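If you are curious what the attack scripts do under the hood, a vegeta attack is essentially a target definition piped through the attack command. A minimal sketch, assuming the same endpoint and a rate of 1 request/s for 10 minutes (the bundled attack-*.sh scripts may differ in the details):

# Fire 1 request/s for 10 minutes at the expensive ping endpoint, then print a summary
echo "GET http://<YOUR-IP>:8080/ping?expensive=true" | \
  vegeta attack -rate=1 -duration=10m | \
  vegeta report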

Obtain your LoadBalancer public IP

Before any attacks can be made you must obtain your LoadBalancer public IP. Type:

watch kubectl get services -n pingpong 

You should see something like this after a while:

NAME               TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)          AGE
pingpong-service   LoadBalancer   172.21.43.51   35.198.78.241   8080:31301/TCP   2m22s

Initially the EXTERNAL-IP will be <pending> but it will change state to a valid IP-address once GCP has allocated one for you. This is the "simple type" of exposure that we learned about before. The service will only be available through HTTP (not HTTPS) but that's ok for this experiment.

Inside the vegeta attack-*.sh scripts (under the /vegeta folder) there is a placeholder called <YOUR-IP> which you need to replace with the EXTERNAL-IP of the LoadBalancer to make the test work.
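You can edit the scripts by hand, or do it with a quick substitution. A sketch using sed and the EXTERNAL-IP from the example above (note that the in-place flag differs between macOS/BSD and GNU sed):

# macOS/BSD sed: replace the placeholder in all attack scripts in place
sed -i '' 's/<YOUR-IP>/35.198.78.241/g' vegeta/attack-*.sh

# GNU/Linux sed equivalent
# sed -i 's/<YOUR-IP>/35.198.78.241/g' vegeta/attack-*.sh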

Slow rate attack to see some metrics

Let's start by turning on a very low level of requests (1/s). Start the vegeta script called attack-slow.sh under the /vegeta folder. You can run the script from the horizontal-pod-autoscaling folder if you want to, just type:

sh ./vegeta/attack-slow.sh

You won't see much output until the attack is over, and it will last for 10 minutes unless you abort it. However, if you DO abort, you will see a summary of the vegeta attack performed up until the abort occurred.

Go ahead and let the attack go on for a couple of seconds and then press Ctrl+C to abort. You should see something like this in your console:

Requests      [total, rate, throughput]  22, 1.05, 1.04
Duration      [total, attack, wait]      21.127823562s, 20.996823121s, 131.000441ms
Latencies     [mean, 50, 95, 99, max]    134.703995ms, 131.097469ms, 174.913912ms, 227.359395ms, 227.359395ms
Bytes In      [total, mean]              396, 18.00
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:22  
Error Set:
Bucket           #   %       Histogram
[0s,     100ms]  1   4.55%   ###
[100ms,  200ms]  20  90.91%  ####################################################################
[200ms,  300ms]  1   4.55%   ###
[300ms,  +Inf]   0   0.00%   

There is a lot of info here but the most important part is that we are getting 100% success and 200 response codes. It means that the attack is hitting its target so you got the IP right!

For the remainder of this exercise the output of vegeta is not very important. Instead we're going to put a watch on our service metrics to see what's going on there.

Open a new console, or reuse one that is not busy, and type:

watch kubectl top pods -n pingpong

The top command is sort of like Unix top; it shows the CPU and memory utilization of your pods, and those are also the metrics that the HorizontalPodAutoscaler in this example uses to make its scale up/down decisions. So by observing these metrics we can verify that the HPA behaves as expected. Now, with the watch running, start attack-slow.sh again, let it run, and see how it reflects itself in the metrics.
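If you like, you can also follow the HPA itself in a third console. Its TARGETS column shows the measured average utilization against the 80% target, which is what actually drives the scaling decisions:

# Follow the autoscaler's view of the current load
watch kubectl get hpa -n pingpong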

The metrics of a resting service should look something like:

NAME                                   CPU(cores)   MEMORY(bytes)
pingpong-deployment-578f9d6768-x64pf   3m           123Mi

While a service under slow load looks more like:

NAME                                   CPU(cores)   MEMORY(bytes)
pingpong-deployment-578f9d6768-x64pf   55m          124Mi

The exact numbers may of course vary, but you should see a noticeable increase. If you don't, check again that you're really getting 200 OK from the service.

Increase rate and observe scale up

Ok, so based on the metrics above, the resource requests and the HPA configuration, a rate of 4 requests/s should probably put us above the 80% average CPU utilization target, while at the same time being a load that 2 pods can handle. Go ahead and stop the attack-slow.sh script and start the attack-medium.sh script instead.

You should see the metrics go up similar to this:

NAME                                   CPU(cores)   MEMORY(bytes)
pingpong-deployment-7bd4ff76bd-zdlkg   226m         123Mi

This should cause your deployment to scale out after a little while:

NAME                                   READY   STATUS    RESTARTS   AGE
pingpong-deployment-7bd4ff76bd-wlgbr   1/1     Running   0          43s
pingpong-deployment-7bd4ff76bd-zdlkg   1/1     Running   0          6m25s

And after the scale out, and some time, the CPU utilization should settle at a level that the two pods can handle:

NAME                                   CPU(cores)   MEMORY(bytes)
pingpong-deployment-7bd4ff76bd-wlgbr   102m         114Mi
pingpong-deployment-7bd4ff76bd-zdlkg   82m          126Mi

The exact distribution of workload between the two pods may vary. Sometimes one of the pods may appear to take all the load, and then suddenly the situation is the opposite. We don't need to worry about that; it is Kubernetes' job to make it work well.

Max out!!!

Ok, let's kill this one. Stop the attack-medium.sh script and start the attack-fast.sh script. It fires away enough requests to take us all the way up to 4 pods (the max).

After a little while of running the script you should see the pods scale up to 4 instances.

Decrease rate and observe scale down

Stop the attack-fast.sh script and start the slow one again.

After a couple of minutes you should see a scale down occur, back to one single pod. It takes a couple of minutes because Kubernetes avoids what is called thrashing. Thrashing would occur if Kubernetes responded immediately to a change in load and scaled down. Then, if a new, higher load came in a few seconds later, Kubernetes would be forced to scale up again, and so on. This could end up in chaos with pods everywhere going up and down at the same time... Therefore Kubernetes assumes that a load rarely comes alone; it's likely that more will follow. Only after some time does Kubernetes draw the conclusion that the load appears to be gone for real and take action. By default it takes about 5 minutes for this to happen.

When you feel finished playing around, remember to delete your deployments and scale the cluster back down to 1 node to avoid extra costs.
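A cleanup sketch, assuming you deployed with kustomize from this folder and resized the node pool as described above:

# Remove the pingpong resources created from this folder
kubectl delete -k .

# Scale the node pool back down to a single node
gcloud container clusters resize pingpong-site1-cluster \
  --zone=europe-west3-a \
  --node-pool default-pool \
  --num-nodes 1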

Real world considerations

CPU and memory are not always the limiting factors of a microservice, unless you're doing scientific computations. More often, request rate, response time or database connections are better metrics. To base your HPAs on these kinds of metrics you need to have a custom metrics solution in place. Here is an article that describes how to do so on GCP.

In the next section you'll get acquainted with how to integrate your services with SaaS solutions in the cloud.
