Clustering and High Availability - netfishers-onl/Netshot GitHub Wiki

Introduction

Starting with version 0.17, Netshot supports running in cluster mode. In this mode, several Netshot instances (servers) interact to load balance the tasks (snapshots, compliance checks, etc.). In case of failure of a member, its pending and running tasks are automatically redistributed to the other instances.

Note: Clustering requires PostgreSQL as the underlying database.

The other component to look at, in order to achieve full redundancy, is obviously the database. With PostgreSQL, multiple options exist. Netshot 0.17 also brings some enhancements in the way it connects to the database, for a better support of PostgreSQL primary/secondary databases.

Application-level redundancy

Roles

Master

Among the members of a cluster, one member (and only one) is elected as master unit. The role of the master instance is to load balance the new tasks to the possible runners (including itself if selected as task runner). It also monitors the other members and reassigns tasks in case of failure of one of them.

The master instance is chosen at startup from the available members, based on the highest master priority (100 by default; this can be customized in the Netshot configuration file). The local instance ID is used as a tie breaker. There is no preemption of the master role.
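As an illustration, the election rule above can be sketched as follows. This is a hypothetical model, not Netshot's actual code; in particular, the direction of the instance-ID tie breaker is an implementation detail, and a simple lexicographic comparison is assumed here.

```python
# Minimal sketch of the master election rule: the available member with
# the highest master priority wins; the instance ID breaks ties
# (lexicographic ordering assumed here for illustration).
def elect_master(members):
    """members: list of (name, master_priority, instance_id) tuples."""
    return max(members, key=lambda m: (m[1], m[2]))

cluster = [
    ("server1", 110, "a" * 20),
    ("server2", 100, "b" * 20),
    ("server3", 100, "c" * 20),
]

print(elect_master(cluster)[0])      # server1 (highest master priority)
print(elect_master(cluster[1:])[0])  # tie on priority, decided by instance ID
```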

Runner

The master distributes jobs only to the (available) instances configured with the highest runner priority (100 by default). If multiple instances share this highest runner priority, the jobs are load balanced based on the relative weights of these instances.

For instance, if the cluster has the following members:

| Member  | Master priority | Runner priority | Runner weight |
|---------|-----------------|-----------------|---------------|
| server1 | 110             | 90              | 100           |
| server2 | 100             | 100             | 80            |
| server3 | 100             | 100             | 20            |

Assuming that all servers are simultaneously started:

  • server1 will be cluster master.
  • About 4/5 of the tasks will be assigned to server2, and 1/5 to server3.
  • In case of failure of server2 or server3, all tasks will run on the remaining one.
  • In case of failure of both server2 and server3, server1 will execute the tasks itself.
  • In case of failure of server1, the new master will be either server2 or server3, based on their instance IDs (their master priorities are equal).
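The runner selection described above can be modeled as a weighted random pick among the instances sharing the highest runner priority. The sketch below is a hypothetical illustration of that rule, using the example cluster; it is not Netshot's actual scheduling code.

```python
import random
from collections import Counter

# Example cluster from the table above.
members = [
    {"name": "server1", "runner_priority": 90,  "runner_weight": 100},
    {"name": "server2", "runner_priority": 100, "runner_weight": 80},
    {"name": "server3", "runner_priority": 100, "runner_weight": 20},
]

def pick_runner(available):
    """Keep the instances with the highest runner priority,
    then pick one at random, weighted by runner weight."""
    top = max(m["runner_priority"] for m in available)
    candidates = [m for m in available if m["runner_priority"] == top]
    weights = [m["runner_weight"] for m in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

# Over many picks, server2 gets roughly 80% of the tasks, server3 about
# 20%, and server1 none (its runner priority is below the highest).
counts = Counter(pick_runner(members)["name"] for _ in range(10000))
print(counts)
```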

Instance ID

Each instance joining the cluster requires a unique identifier. It should be set in the Netshot configuration file. If this setting is missing, an ID is automatically generated, but this is not recommended as the ID could change after a restart.
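To pick a stable identifier yourself, you can generate a 20-character string of digits and lowercase letters (the format expected by `netshot.cluster.id`, see the configuration example below) and write it into the configuration file once. A small sketch, assuming that format:

```python
import secrets
import string

# Allowed characters for the cluster ID: digits and lowercase letters.
ALPHABET = string.ascii_lowercase + string.digits

def generate_cluster_id(length: int = 20) -> str:
    """Generate a random 20-character cluster instance ID."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(f"netshot.cluster.id = {generate_cluster_id()}")
```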

Communication between members

Cluster members exchange HA messages using the PostgreSQL notification system as a message bus. Thus, there is no need to open additional protocols/ports on firewalls, or to install any additional software. Only access to the PostgreSQL server/cluster from each Netshot cluster member is required.

User Interface

The northbound interface (Web GUI, REST API) remains available on all cluster members, whatever their actual role is.

New API endpoints have been added to retrieve the clustering status (please check the OpenAPI browser).

Configuration files

Each server instance has its own configuration file. You must ensure that the various instance configurations are consistent, especially the clustering configuration (of course) and the encryption settings.

Device drivers

As of Netshot 0.17, the device drivers are NOT automatically synchronized between the cluster members. If you install custom drivers on one instance (additional JavaScript drivers), they need to be manually copied to the other instances. The synchronization status of the device drivers between cluster instances can be checked in the Admin page, by looking at the driver hash of each instance member.

Note: The Refresh drivers button on the Admin page automatically propagates the refresh to the other instances.

Device configuration files

For the device drivers (Checkpoint, for example) which store data as files (outside of the database), an additional synchronization mechanism must be set up to ensure that these files are copied to all instances exposing the northbound interface. The folder to synchronize is /var/local/netshot (or whatever is configured in the main configuration file under netshot.snapshots.binary.path). Possible solutions:

  • Shared partition mount point, for example using NFS.
  • Frequent rsync's between servers.
  • ...
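For the rsync option, a periodic job could look like the fragment below. The hostnames, user, and schedule are examples only; /var/local/netshot matches the default netshot.snapshots.binary.path.

```
# /etc/cron.d/netshot-binary-sync (example; adjust hosts, user and schedule)
*/5 * * * * netshot rsync -az --delete /var/local/netshot/ server2:/var/local/netshot/
```

Note that rsync-based replication is only eventually consistent; a shared mount (NFS) avoids the replication delay.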

SNMP collector

The integrated SNMP collector reacts to SNMP traps indicating a configuration change on a device by triggering an automatic snapshot. When working in cluster mode, starting with Netshot 0.18.3, any Netshot instance can receive the trap; it will then notify the master to trigger a snapshot for the given device, assuming another automatic snapshot is not already pending.

Database redundancy

PostgreSQL offers multiple options to achieve database redundancy for load balancing and high availability. The PostgreSQL documentation describes them in detail: https://www.postgresql.org/docs/12/high-availability.html

A common choice is to deploy a primary database, in read/write mode, with synchronous replication to a secondary database, which can be used for read-only requests, and can be promoted to primary in case of major failure on the first node.

Netshot 0.17+ can take advantage of this setup, by configuring two database URIs:

  • The main one, netshot.db.url, points to the read-write database instance.
  • The second one, netshot.db.readurl, points to the read-only database instance. If this second option is configured, Netshot will use the given server for all read-only requests, which effectively balances the load.

Assuming that the PostgreSQL database is set up using server1 as primary instance and server2 as secondary instance, one can configure Netshot as follows:

netshot.db.url = jdbc:postgresql://server1,server2/netshot01?sslmode=disable&targetServerType=primary&hostRecheckSeconds=5
netshot.db.readurl = jdbc:postgresql://server1,server2/netshot01?sslmode=disable&targetServerType=preferSecondary&hostRecheckSeconds=5

With this configuration, all read-only requests will be sent to server2, while server1 will receive read-write requests. In case of failure of one of the two servers, Netshot will use the remaining one. If server1 fails and server2 is not promoted to primary, all read operations will still be possible in Netshot.

High availability configuration

As previously stated, this requires Netshot 0.17+ and PostgreSQL as database.

  • The same version of Netshot must be deployed on the target servers, using the standard installation procedure.
  • On each server, the following minimal configuration is required in /etc/netshot.conf:
netshot.cluster.enabled = true
netshot.cluster.id = [20 digits or lowercase letters, unique to each server]
  • For database high availability, deploy two PostgreSQL instances (server1 and server2) in streaming replication mode, and configure the Netshot instances as follows:
netshot.db.url = jdbc:postgresql://server1,server2/netshot01?sslmode=disable&targetServerType=primary&hostRecheckSeconds=5
netshot.db.readurl = jdbc:postgresql://server1,server2/netshot01?sslmode=disable&targetServerType=preferSecondary&hostRecheckSeconds=5