HA semantics

Introduction

This document attempts to describe the desired semantics for high availability/fault tolerance.

Actors

It's assumed that actors follow a simple model: they receive a single initial parameter, produce a single result, and may spawn additional actors, send initial parameters to them, and receive their results. It's likely that this hierarchical model is not sufficient and we will have to extend it.
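For concreteness, here is a minimal sketch of this hierarchical model. The `Actor`, `Dataflow` and `Promise` types are hypothetical illustrations, not the actual API of this repository.

```haskell
{-# LANGUAGE GADTs #-}
-- Hypothetical sketch of the hierarchical actor model described above;
-- none of these names come from the actual code base.
import Control.Distributed.Process (ProcessId)

-- | An actor receives a single initial parameter @a@ and eventually
--   produces a single result @b@.
newtype Actor a b = Actor { runActor :: a -> Dataflow b }

-- | What an actor may do while computing its result.
data Dataflow b where
  -- | Finish with the final result.
  Done  :: b -> Dataflow b
  -- | Spawn a child actor, send it its initial parameter and continue
  --   with a handle to its future result.
  Spawn :: Actor x y -> x -> (Promise y -> Dataflow b) -> Dataflow b
  -- | Wait for the result of a previously spawned child.
  Await :: Promise y -> (y -> Dataflow b) -> Dataflow b

-- | Handle for the not-yet-delivered result of a child actor.
newtype Promise y = Promise ProcessId
```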

Another important feature is that a group of CH processes is considered a single logical actor, and its result is a single set of values, albeit a distributed one.
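A group could be modelled along the following lines; this is again a purely illustrative sketch, and `GroupActor`/`Distributed` are made-up names.

```haskell
import Control.Distributed.Process (ProcessId)

-- Hypothetical sketch: a set of CH processes treated as one logical actor.
newtype GroupActor a b = GroupActor [ProcessId]

-- | The result of a group actor is a single logical value, but it consists
--   of one chunk per member process, i.e. it is distributed.
newtype Distributed b = Distributed [(ProcessId, b)]
```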

Actors can fail for the following reasons:

  • CH process crash due to a programming mistake or an unhandled exception.
  • OS process crash, again due to a programming mistake (e.g. segfault, OOM).
  • Hardware failure.

Note that some failures are essentially random (hardware failures) and some could be deterministic. The latter pose a problem with respect to restarts: we may end up restarting an actor in an infinite loop.
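A common guard against such loops is to bound the number of restart attempts. Below is a minimal sketch using plain distributed-process primitives; the supervision policy itself is made up here and not taken from the code base.

```haskell
import Control.Distributed.Process

-- Restart a child at most 'maxRestarts' times, so that a deterministic
-- failure cannot drive the supervisor into an infinite restart loop.
superviseWithLimit :: Int -> Process ProcessId -> Process ()
superviseWithLimit maxRestarts spawnChild = go 0
  where
    go attempts = do
      pid <- spawnChild
      _   <- monitor pid
      ProcessMonitorNotification _ _ reason <- expect
      case reason of
        DiedNormal                 -> return ()         -- finished normally
        _ | attempts < maxRestarts -> go (attempts + 1)  -- try again
          | otherwise              -> die ("child failed "
                                           ++ show (attempts + 1) ++ " times")
```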

Another problem is the interaction between restarts and errors due to timeouts. Should we reset the timeout when we restart a process? If the process being restarted is CPU/IO intensive and we don't reset the timeout, it will most likely fail due to the timeout again. Another question is whether we should restart an actor which timed out at all.
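The two options can at least be named explicitly. The following sketch (hypothetical names) just computes the time budget a restarted actor would get under each policy.

```haskell
-- Hypothetical sketch of the two timeout policies discussed above.
data TimeoutPolicy
  = KeepDeadline    -- ^ the restarted actor inherits the remaining time budget
  | ResetDeadline   -- ^ the restarted actor gets a fresh, full timeout
  deriving (Show, Eq)

-- | Time budget (in seconds) for a restarted actor, given the policy, the
--   full timeout and the time already spent before the crash.
remainingBudget :: TimeoutPolicy -> Double -> Double -> Double
remainingBudget KeepDeadline  fullTimeout spent = max 0 (fullTimeout - spent)
remainingBudget ResetDeadline fullTimeout _     = fullTimeout
```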

Fault tolerance

A useful way of thinking about high availability is to separate detection of failures from recovery. Cloud Haskell provides us with detection of failures. When a failure is detected we may try to recover from it. If that's not possible, the only thing we can do is forcefully terminate dependent actors until we either reach a point from which we can recover or have to terminate the entire program.
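The detection side maps directly onto Cloud Haskell's monitoring API; the recovery decision is application specific and is only represented here by a hypothetical `Recovery` callback.

```haskell
import Control.Distributed.Process

-- Detection via Cloud Haskell monitoring; the recovery policy is passed in
-- as a callback and is not specified here.
data Recovery = Recover | GiveUp

watchChild :: (DiedReason -> Recovery) -> ProcessId -> Process ()
watchChild decide child = do
  _ <- monitor child
  ProcessMonitorNotification _ pid reason <- expect
  case reason of
    DiedNormal -> return ()          -- normal termination, not a failure
    _          -> case decide reason of
      Recover -> say ("recovering from failure of " ++ show pid)
      GiveUp  -> die ("unrecoverable failure of " ++ show pid)
```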

Tear down and restart actor

The simplest strategy for restarting an actor is to tear down the actor's computation as a whole and start from scratch. No special action is needed for tearing the computation down (it's implemented already), but the following conditions should be met, assuming that the restart is handled by the parent actor:

  1. If the actor is not connected yet, we can restart it freely.

  2. If it receives its initial parameter from the parent, we need to store that parameter locally and resend it to the restarted actor.

  3. If the actor receives data from another actor, we need to ensure that the source actor is still alive and will send its data to the freshly respawned actor. There's room for a race: for example, the source actor may have already sent data to the crashed actor while the parent actor doesn't yet know about it. It seems the only way to prevent this is to force every actor to send a confirmation to the parent, so the parent can check that the data was sent to the correct destination and have it resent otherwise (see the sketch after this list).
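Here is a sketch of conditions 2 and 3: the parent keeps the initial parameter so it can resend it, and source actors confirm to the parent which destination they actually delivered to. All names and the message protocol are hypothetical.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Control.Distributed.Process
import Control.Distributed.Process.Serializable (Serializable)
import Data.Binary (Binary)
import GHC.Generics (Generic)

-- Confirmation sent by a source actor to the parent, saying which
-- destination it delivered its data to (condition 3).
data SentConfirmation = SentConfirmation
  { sentBy :: ProcessId
  , sentTo :: ProcessId
  } deriving (Generic)
instance Binary SentConfirmation

-- Restart a child: respawn it, resend the stored initial parameter
-- (condition 2) and make sure the source's data reaches the new incarnation.
restartChild :: Serializable a
             => a                  -- ^ initial parameter stored by the parent
             -> Process ProcessId  -- ^ how to respawn the child
             -> ProcessId          -- ^ source actor feeding the child
             -> Process ProcessId
restartChild initParam respawn source = do
  child <- respawn
  send child initParam
  -- Wait for the source to report where it sent its data.
  SentConfirmation _ dest <- expect
  if dest == child
    then return child          -- the data reached the new incarnation
    else do send source child  -- ask the source to resend to the new pid
            return child
```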

Restart actor and reinsert it into existing dataflow graph

This is a more complicated case, since the respawned actor needs to reestablish monitoring of its child processes, so we need to store a description of the dataflow graph somewhere. It is, however, absolutely necessary for restarting the top-level node.
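What "storing a description of the dataflow graph" could amount to, as a purely illustrative data structure (none of these names exist in the code base):

```haskell
import Control.Distributed.Process (ProcessId)
import qualified Data.Map.Strict as Map

type NodeId = Int

-- Hypothetical stored description of one dataflow-graph node: enough to let
-- a respawned actor reestablish monitoring of its children and its outputs.
data DFGNode = DFGNode
  { nodePid      :: Maybe ProcessId  -- ^ Nothing while the node is being respawned
  , nodeChildren :: [NodeId]         -- ^ nodes this actor has to monitor
  , nodeSendsTo  :: [NodeId]         -- ^ nodes its result is sent to
  }

type DFG = Map.Map NodeId DFGNode

-- | Children a restarted node has to re-monitor, according to the stored graph.
childrenToMonitor :: DFG -> NodeId -> [NodeId]
childrenToMonitor graph n = maybe [] nodeChildren (Map.lookup n graph)
```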

Another problem is related to error propagation. We need to check the reason for the failure. If an actor failed because of an irrecoverable failure of its child, there is no point in restarting the actor and reinserting it into the DFG, since the DFG is already in an invalid state; in that case we can still fall back to tear down and restart.
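One way to express that decision, assuming the failure notification carries enough information to classify the cause (the classification below is hypothetical):

```haskell
-- Hypothetical classification of failure causes and the restart strategy
-- chosen for each, following the reasoning above.
data FailureCause
  = ChildIrrecoverable  -- ^ a child failed and could not be recovered
  | LocalCrash          -- ^ the actor itself crashed (exception, segfault, ...)
  | TimedOut            -- ^ the actor exceeded its time budget

data RestartStrategy
  = ReinsertIntoDFG     -- ^ respawn and splice back into the existing graph
  | TearDownAndRestart  -- ^ the graph is already invalid: rebuild the subtree
  | NoRestart           -- ^ give up and propagate the failure upwards

chooseStrategy :: FailureCause -> RestartStrategy
chooseStrategy ChildIrrecoverable = TearDownAndRestart  -- DFG already invalid
chooseStrategy LocalCrash         = ReinsertIntoDFG
chooseStrategy TimedOut           = NoRestart           -- open question, see above
```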