High Availability - acaprojects/ruby-engine GitHub Wiki

Database

A full copy of the database is replicated to each edge location using XDCR (cross data center replication).

Data structures

This is how each edge controller is represented in the database:

  • class ControlSystem
    • edge_control_id (optional)
  • class EdgeControl
  • doc name edge
    • name – String name of that edge location
    • description
    • failover – Boolean; whether the master server should take over if this location goes down
    • timeout – Failover timeout; how long to wait before acting on the failure
    • window_start – A cron string representing the start of the restore window (if undefined, the location must be restored manually)
    • window_length – The length of the restore window
    • admins – Array of users that can administer this location
    • commit – The current commit version of the edge location's code
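
As a rough illustration of the fields above, the two documents could be modelled as plain Ruby structs. This is only a sketch: the real engine presumably uses its own database model classes, the `manual_restore?` and `edge_controlled?` helpers are hypothetical, and the unit of `timeout` and `window_length` (seconds here) is an assumption.

```ruby
# Hypothetical plain-Ruby stand-ins for the database documents described above.
EdgeControl = Struct.new(
  :name, :description, :failover, :timeout,
  :window_start, :window_length, :admins, :commit,
  keyword_init: true
) do
  # Without a window_start there is no restore window, so restores are manual.
  def manual_restore?
    window_start.to_s.strip.empty?
  end
end

ControlSystem = Struct.new(:id, :edge_control_id, keyword_init: true) do
  # edge_control_id is optional; systems without one stay on the master.
  def edge_controlled?
    !edge_control_id.nil?
  end
end

edge = EdgeControl.new(
  name: "Example edge",
  failover: true,
  timeout: 20,               # assumed to be seconds
  window_start: "0 2 * * *", # cron string: 02:00 every day
  window_length: 3600,       # assumed to be seconds
  admins: ["user-1"],
  commit: "abc123"
)
puts edge.manual_restore? # false, because a restore window is defined
```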

Coordination

Master Boot process

This describes the process of booting the master control system, including module loading.

  1. Load all the edge locations
  2. Load all the edge systems
  3. Build a list of edge location modules
  4. Load all modules, except for those controlled by edge locations
  5. Wait for edge locations to connect (they retry every 3 seconds, so wait at most 6 seconds)
    • perform any failover actions that are required
  6. Mark system as booted
    • inform edge servers that master is ready
    • accept API requests
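
Steps 3 and 4 of the boot sequence amount to splitting the module list by ownership. The sketch below shows one way to do that; the hash layout and the `partition_modules` name are assumptions made for illustration, as is the rule that a module shared between a master system and an edge system stays with the edge.

```ruby
# Hypothetical input: each system lists its module ids and, optionally,
# the id of the edge location that controls it.
def partition_modules(systems)
  edge_modules = Hash.new { |hash, edge_id| hash[edge_id] = [] }
  local_modules = []

  systems.each do |system|
    if system[:edge_control_id]
      edge_modules[system[:edge_control_id]].concat(system[:modules])
    else
      local_modules.concat(system[:modules])
    end
  end

  edge_owned = edge_modules.values.flatten.uniq

  {
    # Modules the master loads itself (step 4): everything not owned by an edge.
    local: local_modules.uniq - edge_owned,
    # Modules grouped by the edge location that controls them (step 3).
    edge: edge_modules.transform_values(&:uniq)
  }
end

systems = [
  { id: "sys-1", edge_control_id: nil,      modules: ["mod-a", "mod-b"] },
  { id: "sys-2", edge_control_id: "edge-1", modules: ["mod-b", "mod-c"] }
]
p partition_modules(systems)
```

Here `mod-b` is used by both systems, so under the stated assumption the master loads only `mod-a` and reaches `mod-b` through the edge.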

Modules controlled by edge locations will still be contactable from control systems on the master server via proxies and promises over a TCP link to the edge location.

Edge controller boot process

An environment variable tells the process that it is an edge system.

  1. Loads only its edge location document
  2. Loads the systems under its control (map-reduce)
  3. Loads a list of modules it is to control
  4. Attempts to connect to the master server
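
The environment check that selects the edge boot path might look like the following. The page does not name the variable, so `ENGINE_EDGE_ID` is invented for this sketch, as are the helper names.

```ruby
# ENGINE_EDGE_ID is a made-up variable name; the real one is not documented here.
def edge_id_from(env = ENV)
  value = env["ENGINE_EDGE_ID"].to_s.strip
  value.empty? ? nil : value
end

# Picks the boot path: an edge loads only its own location document,
# while the master loads every edge location.
def boot_role(env = ENV)
  edge_id_from(env) ? :edge : :master
end

puts boot_role("ENGINE_EDGE_ID" => "edge-1") # edge
puts boot_role({})                           # master
```

Passing the environment in as an argument keeps the check easy to test without touching the real process environment.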

On disconnected

An edge is considered disconnected once it has waited 3 seconds and the subsequent reconnect attempt has failed.

  1. Loads the modules and connects to devices, if it has not already
  2. Continues trying to reconnect to the master
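
The disconnect rule above (wait 3 seconds, retry once, then act) can be sketched as a small state holder. `EdgeLink`, the injected `connector` callable, and the state names are all hypothetical; the real engine would hook this into its TCP client.

```ruby
class EdgeLink
  RETRY_INTERVAL = 3 # seconds, as stated in the text above

  attr_reader :state

  # connector: hypothetical callable returning true when the master is reachable.
  # sleeper:   injectable so the wait can be skipped in tests.
  def initialize(connector:, sleeper: ->(seconds) { sleep(seconds) })
    @connector = connector
    @sleeper = sleeper
    @state = :connected
  end

  # Called when the link drops: wait, retry once, and only then
  # declare the edge disconnected.
  def handle_link_lost
    @state = :waiting
    @sleeper.call(RETRY_INTERVAL)
    @state = @connector.call ? :connected : :disconnected
  end

  # When true, the edge should load its modules and connect to devices.
  def take_control?
    @state == :disconnected
  end
end

link = EdgeLink.new(connector: -> { false }, sleeper: ->(_seconds) {})
link.handle_link_lost
puts link.take_control? # true
```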

On connection

  1. Authenticates with master using a shared secret over SSL
  2. Negotiates control
    • If devices are loaded and connected, the edge location wins (assumes some kind of master or network outage)
    • If within the restore window, the edge location wins
    • If the master is still loading (and hasn’t loaded edge modules), the edge wins
    • Otherwise the master wins (and will take control of the devices)
  3. Loads the modules and connects to devices if it is the winner and hasn’t already

Effectively, the master only wins control if it already has control.
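
The negotiation rules reduce to a short decision function. This is a sketch only: the function name and keyword arguments are hypothetical, and working out whether the current time falls inside the restore window (from the cron string) is left to the caller.

```ruby
# Returns :edge or :master following the rules listed above, in order.
def negotiate_control(edge_devices_connected:, in_restore_window:, master_loaded_edge_modules:)
  return :edge if edge_devices_connected       # edge already took over during an outage
  return :edge if in_restore_window            # control is handed back to the edge
  return :edge unless master_loaded_edge_modules # master is still booting
  :master                                      # master already has control and keeps it
end

puts negotiate_control(
  edge_devices_connected: false,
  in_restore_window: false,
  master_loaded_edge_modules: true
) # master
```

The only path to `:master` is the one where the master has already loaded the edge's modules, which matches the summary above.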

Ongoing

  1. TCP messaging connection maintained
  2. Accept the following requests:
    • Repository request (reset, pull, etc., possibly followed by a live reload)
    • Live reload (might include file data if edited inline)
    • Data model updated (settings, IP address, port, URI, etc)
    • Expire system cache (due to data update)
    • Stop / Start system
    • Execute method request
    • Status request
    • Debug message proxy

Multiple requests can execute simultaneously; each carries an ID for tracking.
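
One way to track concurrent requests by ID is a table of pending entries keyed by a generated identifier. The `RequestTracker` class and its method names are assumptions for illustration; the page does not describe the wire format or how IDs are generated.

```ruby
require "securerandom"

# Hypothetical tracker for in-flight requests on the master/edge link.
class RequestTracker
  def initialize
    @pending = {}
  end

  # Registers a request and returns the id to send along with it.
  def track(type, &on_response)
    id = SecureRandom.uuid
    @pending[id] = { type: type, callback: on_response }
    id
  end

  # Resolves the matching request. Responses may arrive in any order;
  # unknown or already resolved ids are ignored and return false.
  def resolve(id, result)
    entry = @pending.delete(id)
    return false unless entry

    entry[:callback]&.call(result)
    true
  end

  def pending_count
    @pending.size
  end
end

tracker = RequestTracker.new
status_id = tracker.track(:status) { |result| puts "status: #{result}" }
exec_id = tracker.track(:execute) { |result| puts "execute: #{result}" }
tracker.resolve(exec_id, "done")  # the later request answers first
tracker.resolve(status_id, "ok")
```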

Considerations

  1. Client must be aware of the edge server
    • Edge server only services API (no interface code)
    • Interface code needs to be cached on the client device
  2. Master and edge should be able to direct clients to each other (when restores or failures occur)
    • Clients can ignore this
    • If ignored then the edge or master server will do its best to proxy requests (might not be possible in the case of failure)