Distributed Systems Replication

  • Replication - Keeping a copy of the same data on multiple machines that are connected by a network.
  • Why Replicate
    • Reduce Latency, by keeping data geographically close to users
  • Increase Availability, by having copies of the data on multiple machines
    • Increase Read Throughput, by scaling out the number of machines that can serve read queries
  • Handling changes to replicated data is what makes replication hard; if the data never changed, replication would be a one-time copy with no further synchronization needed.
  • For this discussion, assume that the database is small enough to fit on a single node.
  • Replica - each node that stores a copy of the database
  • Every write to the database needs to be processed by every replica

Leader-based replication (also known as active/passive or master/slave replication)

  • One replica is designated as the Leader; clients issue all write requests to the Leader
  • The other replicas are designated as Followers
  • Whenever the Leader writes data to its local storage, it also sends the data change to all its Followers as part of a replication log or change stream.
  • Each Follower takes the log from the Leader and updates its local copy of the database accordingly
  • For reads, clients can issue requests to the Leader or any of the Followers
  • Examples: PostgreSQL, MySQL, Oracle Data Guard, MongoDB, Kafka, RabbitMQ

Leader-based replication can be synchronous or asynchronous

Synchronous Replication

The Leader waits for all Followers to confirm that they have processed the write before reporting success to the client.

Asynchronous Replication

The Leader reports success to the client without waiting for Followers to confirm that they have processed the write.

It is impractical for all Followers to be synchronous: any single node outage would then block all writes. In practice, "synchronous" replication usually means semi-synchronous, with one synchronous Follower and the rest asynchronous. Often, Leader-based replication is configured to be completely asynchronous. In this scenario, data loss is possible when the Leader fails, because recent writes on the Leader may not yet have been propagated to Followers.
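
A minimal in-memory sketch of this write path (hypothetical classes, not any real database's API): the synchronous flag controls whether the Leader waits for Follower acknowledgements before reporting success, and the asynchronous mode shows how an unshipped write can be lost if the Leader crashes.

```python
# Hypothetical sketch of leader-based replication, sync vs. async.

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, entry):
        """Apply one replication-log entry and acknowledge it."""
        key, value = entry
        self.data[key] = value
        return True  # acknowledgement

class Leader:
    def __init__(self, followers, synchronous=False):
        self.data = {}
        self.log = []            # replication log / change stream
        self.pending = []        # entries not yet shipped (async mode)
        self.followers = followers
        self.synchronous = synchronous

    def write(self, key, value):
        entry = (key, value)
        self.data[key] = value   # write to local storage
        self.log.append(entry)
        if self.synchronous:
            # Wait for every follower to confirm before reporting success.
            for f in self.followers:
                assert f.apply(entry)
        else:
            # Report success immediately; ship the entry later. If the
            # leader crashes before flush(), this write is lost.
            self.pending.append(entry)
        return "ok"

    def flush(self):
        while self.pending:
            entry = self.pending.pop(0)
            for f in self.followers:
                f.apply(entry)

followers = [Follower(), Follower()]
leader = Leader(followers, synchronous=False)
leader.write("user:1", "alice")  # "ok" returned before replication happens
print(followers[0].data)         # {}: follower is lagging
leader.flush()
print(followers[0].data)         # {'user:1': 'alice'}
```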

Implementation approaches for Leader-based replication

1. Statement Based Replication

The Leader logs every write request (statement) that it executes and sends that statement log to its Followers. For a relational database, this means every INSERT, UPDATE, and DELETE statement is forwarded to the Followers. There are several issues with this approach:

  • Statements that call non-deterministic functions, such as NOW(), produce different data on each Follower (illustrated in the sketch after this list).
  • If statements use an auto-incrementing column, or depend on existing data in the database, they must be executed in exactly the same order on every replica.
  • Statements that have side effects (triggers, stored procedures, UDFs) may result in incorrect or duplicated actions on each replica.
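
A toy illustration of the first issue (hypothetical, not a real replication protocol): replaying a statement containing NOW() on each replica evaluates the function locally, so the replicas diverge.

```python
import time

# Toy illustration: replaying the *statement* on each replica evaluates
# NOW() independently, so replicas diverge. Shipping the resulting row
# (logical replication) avoids this.

def execute(statement, replica):
    # Stand-in for "INSERT INTO events VALUES (NOW())"
    if statement == "INSERT NOW()":
        replica.append(time.time())   # evaluated locally on each replica

leader, follower = [], []
execute("INSERT NOW()", leader)
time.sleep(0.01)                      # replication delay
execute("INSERT NOW()", follower)     # same statement, replayed later

print(leader[0] == follower[0])       # False: replicas have diverged
```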

2. Write Ahead Log (WAL) Shipping

A log is an append-only sequence of bytes containing all writes to the database, and it can be used to build a replica on another node. With this method, the Leader, besides writing the log to disk, also ships it to its Followers. This replication method is used by PostgreSQL and Oracle. The main disadvantage is that the WAL describes data at a very low level: which bytes were changed in which disk blocks. This couples replication tightly to the storage engine; if the database changes its storage format from one version to the next, it is typically not possible to run different versions of the database software on the Leader and Followers.
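
A rough sketch of why this coupling exists; the record layout below is hypothetical, but physical WAL records similarly describe raw bytes at positions in disk blocks, which only an identical on-disk format can interpret.

```python
from dataclasses import dataclass

# Hypothetical shape of a physical WAL record: it says *which bytes* in
# *which block* changed, not which row of which table. A follower can
# only apply it if its on-disk layout matches the leader's exactly.

@dataclass
class WalRecord:
    block_no: int      # disk block that changed
    offset: int        # byte offset within the block
    new_bytes: bytes   # raw bytes written at that position

rec = WalRecord(block_no=8123, offset=440, new_bytes=b"\x01\x00alice")
print(rec)
# Contrast with a logical record: ("insert", "users", {"id": 1, "name": "alice"})
```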

3. Logical (row-based) log replication

A logical log for a relational database is usually a sequence of records describing writes to database tables at the granularity of a row: the new values of an inserted row, enough information to uniquely identify a deleted row, and the identity plus new column values of an updated row. Because a logical log is decoupled from storage engine internals, the Leader and Followers can run different database versions. MySQL's binlog, when configured for row-based replication, uses this approach.
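
A sketch of what row-granularity log records might look like and how a Follower could apply them; the record shapes here are made up for illustration.

```python
# Sketch of a row-based logical log (hypothetical record format). Each
# record describes a write at row granularity, independent of how the
# storage engine lays bytes out on disk.

log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "alice"}},
    {"op": "update", "table": "users", "key": {"id": 1},
     "changes": {"name": "bob"}},
    {"op": "delete", "table": "users", "key": {"id": 1}},
]

def apply(entry, db):
    """Apply one logical log entry to a dict-of-dicts 'database'."""
    table = db.setdefault(entry["table"], {})
    if entry["op"] == "insert":
        table[entry["row"]["id"]] = dict(entry["row"])
    elif entry["op"] == "update":
        table[entry["key"]["id"]].update(entry["changes"])
    elif entry["op"] == "delete":
        del table[entry["key"]["id"]]

follower_db = {}
for entry in log:
    apply(entry, follower_db)
print(follower_db)   # {'users': {}} after the insert, update, delete
```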

4. Trigger-based replication

Statement, WAL, and logical log replication are all implemented by the database system itself. Some replication scenarios need additional flexibility, such as controlling which subset of the data gets replicated. Triggers and stored procedures offer this flexibility, letting application code drive replication, for example by recording changes into a changelog table that an external process reads. Trigger-based replication has greater overhead than the other three methods.
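
A sketch of the idea in application terms (hypothetical API; real systems register SQL triggers that write into a changelog table read by an external process):

```python
# Sketch: a "trigger" is a callback fired on each write. Here it copies
# only a subset of the data (rows marked replicate=True) into a changelog
# that an external replication process would consume.

changelog = []

def replication_trigger(table, row):
    if row.get("replicate"):            # application-controlled filtering
        changelog.append((table, row))

class Table:
    def __init__(self, name, triggers):
        self.name, self.rows, self.triggers = name, [], triggers

    def insert(self, row):
        self.rows.append(row)
        for trig in self.triggers:      # fire triggers after the write
            trig(self.name, row)

users = Table("users", [replication_trigger])
users.insert({"id": 1, "replicate": True})
users.insert({"id": 2, "replicate": False})
print(changelog)   # only the row marked for replication
```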

Problems with Replication Lag

It is impractical to make Leader-based replication fully synchronous: the more Follower nodes in a cluster, the greater the chance that at least one of them is down or unreachable over the network at any moment. Most real-world replication setups are therefore asynchronous. Asynchronous replication introduces replication lag: a client reading from a Follower may see stale data. Such systems offer only "eventual consistency", meaning the data at Follower replicas becomes consistent with the Leader only after some unspecified time.

1. Reading your own writes

If a write to the Leader has not yet been replicated to a Follower, a client reading from that Follower will not see its own write; the write appears lost. The distributed data system needs to provide read-after-write consistency, also known as read-your-own-writes consistency: a guarantee that a user always sees any updates they themselves submitted.
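
One common way to provide this guarantee, sketched below with hypothetical names: remember when each user last wrote, and route that user's reads to the Leader for a window assumed to exceed the typical replication lag.

```python
import time

# Sketch of read-your-own-writes routing (hypothetical names). For a
# window after a user's last write, serve that user's reads from the
# Leader; otherwise any Follower will do. LAG_WINDOW is an assumed upper
# bound on replication lag.

LAG_WINDOW = 60.0                    # seconds; assumption, tune per system
last_write_at = {}                   # user_id -> time of last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def choose_replica(user_id, leader, followers):
    recently_wrote = time.time() - last_write_at.get(user_id, 0.0) < LAG_WINDOW
    # A real router would load-balance across followers; we just pick one.
    return leader if recently_wrote else followers[0]

record_write("u1")
print(choose_replica("u1", "leader", ["f1", "f2"]))   # leader (recent write)
print(choose_replica("u2", "leader", ["f1", "f2"]))   # f1 (no recent write)
```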

2. Monotonic Reads

When reading from asynchronous Followers, a client can see things moving backward in time. This happens when two successive reads go to two different Follower replicas with different replication lags: if the first read hits the Follower with the smaller lag and the second hits the one with the greater lag, the second read returns older data than the first. Monotonic reads is the guarantee that this anomaly does not happen. One way to achieve it is to ensure that each client always makes its reads against the same replica, for example chosen by a hash of the user id (see the sketch below).
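
A sketch of that routing rule: choose the replica by a hash of the user id, so successive reads from the same user always hit the same replica (ignoring replica failure, which a real system must handle).

```python
import hashlib

# Sketch: route each user's reads to one fixed replica, chosen by hashing
# the user id, so the user never reads from a replica that is further
# behind than one it already read from.

def replica_for(user_id, replicas):
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[digest[0] % len(replicas)]

replicas = ["follower-a", "follower-b", "follower-c"]
print(replica_for("user-42", replicas))   # same replica on every call
```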


3. Multi-Leader Replication

A natural extension of the Leader-based replication model is to allow more than one node to accept writes. Replication still happens in the same way: each node that processes a write must forward that data change to all the other nodes. This is a multi-leader configuration, also known as master/master or active/active replication.

3.1. Use Cases for Multi-Leader Replication

3.1.1. Multi-datacenter operations

3.1.2. Clients with offline operation

3.1.3. Collaborative editing

3.2. Handling write conflicts

3.2.1. Conflict Avoidance

The simplest strategy for dealing with conflicts is to avoid them: if the application ensures that all writes for a particular record go through the same Leader, conflicts cannot occur. Since many implementations of multi-leader replication handle conflicts quite poorly, conflict avoidance is a frequently recommended approach (see the routing sketch below). However, there are scenarios where conflict avoidance breaks down, such as when a record's designated Leader has to change, and the data system and/or application then has to deal with conflict resolution.
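
A sketch of conflict avoidance by routing (hypothetical names): each record gets a "home" leader chosen by hashing its key, and all writes for that record go there, so concurrent writes to the same record on different leaders cannot happen.

```python
import hashlib

# Sketch: every record has a single "home" leader; all writes for it are
# routed there. Breaks down when the home leader must change (e.g. a
# datacenter failure), at which point conflicts become possible again.

def home_leader(record_key, leaders):
    digest = hashlib.sha256(record_key.encode()).digest()
    return leaders[digest[0] % len(leaders)]

leaders = ["dc-us", "dc-eu", "dc-ap"]
print(home_leader("user:1:profile", leaders))   # always the same leader
```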

3.2.2. Convergent Conflict Resolution

3.2.2.1. Last write wins (LWW)

Give each write a unique id (a timestamp, a long random number, a UUID, or a hash of the key and value), pick the write with the highest id as the winner, and throw away the other writes. If a timestamp is used, this technique is known as last write wins (LWW). Although this technique is popular, it is prone to data loss.
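
A minimal LWW resolver as a sketch: each write carries a timestamp id, the highest id wins, and everything else is silently discarded.

```python
# Sketch of last write wins (LWW): keep only the conflicting write with
# the highest (timestamp, value) id. Simple and convergent, but every
# discarded write is silent data loss.

def lww_resolve(conflicting_writes):
    """conflicting_writes: list of (timestamp, value) pairs."""
    return max(conflicting_writes)[1]

print(lww_resolve([(1700000001.2, "B"), (1700000000.5, "A")]))   # "B"
```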

3.2.2.2. Replica precedence

Give each replica a unique id, and let writes that originated at a higher-numbered replica take precedence over writes that originated at a lower-numbered replica. This approach also implies data loss.

3.2.2.3. Merge the values

Develop an algorithm that merges the conflicting values, if the data permits (for example, ordering the values and then concatenating them).

3.2.2.4. Record the conflict for later resolution

Record the conflict in an explicit data structure that preserves all the conflicting versions, and resolve the conflict in application code at some later time, for example the next time the value is read (see the sketch below).
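
A sketch of this approach combined with a merge function: all conflicting versions ("siblings") are kept, and application code resolves them on the next read. The set-union merge shown suits shopping-cart-like values; other data needs other merge logic.

```python
# Sketch: keep all conflicting versions ("siblings") in an explicit
# structure and resolve in application code on read.

store = {}   # key -> list of sibling values (here: sets)

def write_conflict(key, value):
    store.setdefault(key, []).append(value)

def read_resolved(key):
    siblings = store.get(key, [])
    merged = set().union(*siblings) if siblings else set()
    store[key] = [merged]              # write back the resolved value
    return merged

write_conflict("cart:1", {"milk", "eggs"})
write_conflict("cart:1", {"milk", "bread"})
print(read_resolved("cart:1"))         # {'milk', 'eggs', 'bread'}
```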

3.2.3. Multi-Leader Replication Topologies

3.2.3.1. Circular
3.2.3.2. Star
3.2.3.3. All-to-all

4. Leaderless replication

Popularized by Amazon's Dynamo paper and its internal Dynamo implementation. Cassandra is an open-source database with leaderless replication inspired by Dynamo.

4.1. Catching up on missed writes

The replication system should ensure that eventually all data is copied to every replica. After an unavailable node comes back online, how does it catch up on the writes it missed? Two mechanisms are commonly used:

4.1.1. Read repair

When a client reads from several nodes in parallel, it can detect stale responses, for example by comparing version numbers. On detecting a stale response, the client writes the newer value back to the replica that returned it.
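
A sketch of read repair with version numbers (hypothetical replica interface): read from several replicas, take the value with the highest version, and write it back to any replica that returned an older one.

```python
# Sketch of read repair. Each replica stores key -> (version, value);
# the client reads from all of them, picks the newest version, and
# repairs any replica that returned a stale one.

def read_with_repair(key, replicas):
    responses = [(r.get(key, (0, None)), r) for r in replicas]
    (latest_version, latest_value), _ = max(responses, key=lambda x: x[0][0])
    for (version, _value), replica in responses:
        if version < latest_version:        # stale: write back newer value
            replica[key] = (latest_version, latest_value)
    return latest_value

r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}                      # missed the latest write
print(read_with_repair("k", [r1, r2]))      # "new"
print(r2["k"])                              # repaired to (2, 'new')
```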

4.1.2. Anti-entropy

A background process constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before the data is copied.
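
A sketch of one anti-entropy pass between two replicas; real systems such as Cassandra compare Merkle trees rather than every key.

```python
# Sketch of one anti-entropy pass between two replicas storing
# key -> (version, value). Copies newer data in both directions, in no
# particular order.

def anti_entropy(a, b):
    for key in set(a) | set(b):
        va, vb = a.get(key, (0, None)), b.get(key, (0, None))
        if va[0] > vb[0]:
            b[key] = va
        elif vb[0] > va[0]:
            a[key] = vb

r1 = {"x": (3, "newer"), "y": (1, "only-on-r1")}
r2 = {"x": (2, "older"), "z": (5, "only-on-r2")}
anti_entropy(r1, r2)
print(r1 == r2)   # True: both replicas now hold the same data
```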

4.2. Quorum for reading and writing

If there are n replicas, each write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get an up-to-date value when reading, because the set of nodes written to and the set of nodes read from must overlap in at least one node. Reads and writes that obey these r and w values are known as quorum reads and writes.
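
A sketch of quorum reads and writes over n in-memory replicas (hypothetical interface), showing how the overlap guarantee plays out.

```python
# Sketch of quorum reads/writes. n replicas, write succeeds after w acks,
# read queries r replicas; w + r > n makes the read and write sets
# overlap, so at least one queried replica has the latest value.

N, W, R = 3, 2, 2
assert W + R > N

replicas = [dict() for _ in range(N)]

def quorum_write(key, version, value):
    acks = 0
    for rep in replicas:          # in reality: sent in parallel, may fail
        rep[key] = (version, value)
        acks += 1
        if acks == W:             # success reported after w acks;
            return True           # remaining replicas may lag behind
    return False

def quorum_read(key):
    # Deliberately query the *last* r replicas: even though the write
    # above stopped after the first w, w + r > n forces at least one
    # queried node to have the newest version.
    responses = [rep.get(key, (0, None)) for rep in replicas[-R:]]
    return max(responses)[1]

quorum_write("k", version=1, value="v1")
print(quorum_read("k"))           # "v1", served by the overlapping node
```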