Load shedding

Here we'll give an overview of how RTBKit copes with high load scenarios. We'll start by describing how RTBKit detects high load within its various components, then show how it uses that information to shed messages in order to remain responsive.

Loop Monitor

Before we can start shedding load, we first have to detect whether our system is being strained. To do this, we introduced a new component into RTBKit called the Loop Monitor, whose job is to monitor the duty cycles of the various message loops that form the basis of RTBKit's event handling mechanism. It produces a series of values between 0 and 1, where 0 indicates a loop that is completely idle and 1 indicates a loop that is operating at maximum capacity.
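As a rough sketch of how such a duty-cycle value could be computed (illustrative only, not RTBKit's actual implementation; the LoopSample type and loopLoad function are made up for this example):

// Sketch of a duty-cycle load metric, assuming the loop records how long
// it spent handling messages during each sampling window.
#include <chrono>

struct LoopSample {
    std::chrono::nanoseconds busyTime;   // time spent handling messages
    std::chrono::nanoseconds windowTime; // wall-clock length of the window
};

// Returns a value in [0, 1]: 0 means the loop sat idle for the whole
// window, 1 means it never stopped working.
double loopLoad(const LoopSample & s)
{
    if (s.windowTime.count() <= 0) return 0.0;

    double load = double(s.busyTime.count()) / double(s.windowTime.count());
    if (load < 0.0) load = 0.0;
    if (load > 1.0) load = 1.0;
    return load;
}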

The loop monitor periodically samples the load of each message loop within a component. These samples are published to carbon under the key pattern <installation>.<node>.<component>.loopMonitor.<loop>, and the highest load of all the loops is published under <installation>.<node>.<component>.loopMonitor.maxLoad. These keys can be monitored by outside tools like Zabbix to raise alerts or automatically spin up new instances of a service. Note that we're also planning to publish these values to a REST API in a future release of RTBKit.
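As a concrete example, a hypothetical installation named rtb-prod running a router on node router-0 with loops named main and bids would publish:

rtb-prod.router-0.router.loopMonitor.main
rtb-prod.router-0.router.loopMonitor.bids
rtb-prod.router-0.router.loopMonitor.maxLoad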

Load Stabilizer

Now that we can detect message loops that are under heavy load, we need a way to react and alleviate the strain. Since RTBKit is arranged in a pipeline, we decided that the best and most efficient way of handling high load scenarios is to cut off messages at the head of the pipeline. In other words, we reduce the number of incoming auctions that we accept into the system using a probability that is adjusted with the load of the pipeline. This probability will henceforth be referred to as the shed probability.
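Conceptually, shedding at the head of the pipeline boils down to a single probabilistic check per incoming auction. A minimal sketch (the names are illustrative, not taken from RTBKit's sources):

#include <random>

// Returns true when the incoming auction should be dropped before any
// work is done on it.
bool shouldShed(double shedProbability, std::mt19937 & rng)
{
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    return coin(rng) < shedProbability;
}

With shedProbability at 0.0 every auction is accepted; at 1.0 every auction is dropped at the door.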

In more concrete terms, our load stabilizer aims to maintain a load of 0.9 at all times. This allows us to fully utilize the capabilities of our system while still being able to absorb occasional load spikes, and it makes it easier to measure the effects of our changes to the shed probability. To achieve this, we periodically sample the load of the system through the loop monitor and modify the shed probability according to the following rules:

if      (load == 1.0) shedProbability += 10%
else if (load > 0.9)  shedProbability +=  5%
else if (load < 0.9)  shedProbability -=  1%
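
In code, the update might look like the following sketch; only the three adjustment rules come from the rules above, while the clamping and the >= comparison (to be robust against floating-point) are assumptions:

// Periodic stabilizer update, where `load` is the maxLoad value sampled
// from the loop monitor.
double updateShedProbability(double shedProbability, double load)
{
    if (load >= 1.0)      shedProbability += 0.10; // saturated: react hard
    else if (load > 0.9)  shedProbability += 0.05;
    else if (load < 0.9)  shedProbability -= 0.01; // recovering: ease off
    // load == 0.9 exactly: leave the probability untouched

    // Keep the result a valid probability.
    if (shedProbability > 1.0) shedProbability = 1.0;
    if (shedProbability < 0.0) shedProbability = 0.0;
    return shedProbability;
}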

Note how the probability rises faster than it falls. An overloaded system starts to fall behind and slowly snowballs out of control, which is why it's important to react quickly. Once the system starts to recover, we can slowly ease back to a more stable level.

The shed probability is also published to carbon under different keys depending on the component being monitored:

  • Router: <installation>.<node>.router.auctionKeepPercentage
  • Augmentors: <installation>.<node>.augmentor.shedProbability (the exact key varies depending on how the augmentor is built).

Note that for augmentors, the load shedding mechanism is slightly different because an augmentor can't directly affect the components of the router. Instead, it simply returns a null augmentation response, which may later be used as a signal by the router's loop monitor.
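A hedged sketch of what that could look like on the augmentor side; every name below is hypothetical and not RTBKit's actual augmentor interface:

#include <random>

// Stand-ins for the real augmentation types, kept minimal for the example.
struct AugmentationRequest { /* auction data omitted */ };
struct Augmentation { bool isNull = true; /* real payload omitted */ };

struct SheddingAugmentor {
    double shedProbability = 0.0;
    std::mt19937 rng{std::random_device{}()};

    Augmentation onRequest(const AugmentationRequest & req)
    {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng) < shedProbability)
            return Augmentation();   // null response: this request is shed
        return augment(req);         // normal, expensive path
    }

    Augmentation augment(const AugmentationRequest &)
    {
        Augmentation a;
        a.isNull = false; // a real augmentor would compute actual data here
        return a;
    }
};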
