2.5 Replication

from Designing-Data-Intensive-Applications

Leaders vs. Followers

( Leader-based replication is used in most RDBMS systems and many NoSQL systems ) Replication means keeping a copy of the same database on multiple nodes and keeping the copies up to date somehow.

Replication enables scaling along many dimensions (read throughput, geographic latency, availability)!

The easiest architecture is a single Leader that accepts writes, and many Followers that serve reads.

Synchronous replicas are slower and always consistent; async replicas are faster and eventually consistent. (sync - the leader waits for replicas to confirm the data before acknowledging the write to the user, async - the write is acknowledged right away and replicated in the background)
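To make the distinction concrete, here's a minimal sketch (not a real database; the `Leader`/`Follower` classes and the delay are illustrative assumptions): the synchronous write returns only after every follower has applied it, while the async write returns immediately and replicates in the background.

```python
import threading
import time

class Follower:
    def __init__(self):
        self.data = {}

    def replicate(self, key, value):
        time.sleep(0.05)  # simulated network + apply delay
        self.data[key] = value

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write_sync(self, key, value):
        # Synchronous: wait for every follower to confirm before
        # acknowledging. Slower, but no follower is ever stale.
        self.data[key] = value
        for f in self.followers:
            f.replicate(key, value)
        return "ok"  # acknowledged only after all replicas confirm

    def write_async(self, key, value):
        # Asynchronous: acknowledge immediately, replicate in the
        # background. Faster, but followers are eventually consistent.
        self.data[key] = value
        for f in self.followers:
            threading.Thread(target=f.replicate, args=(key, value)).start()
        return "ok"  # acknowledged before replicas confirm

leader = Leader([Follower(), Follower()])
leader.write_sync("a", 1)   # returns after ~0.1s; replicas up to date
leader.write_async("b", 2)  # returns immediately; replicas catch up later
```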

(cool stuff to check out - chain replication -> https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf)

Provisioning a new follower is typically done by copying a consistent snapshot of the leader and then replaying the replication log from the snapshot's position onward.
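A rough sketch of that provisioning flow, assuming a toy log of (sequence_number, key, value) tuples (the function names here are made up for illustration):

```python
leader_log = [(1, "a", 1), (2, "b", 2), (3, "a", 3), (4, "c", 4)]

def snapshot(log, upto):
    """Materialize the state of the log up to a given sequence number."""
    state = {}
    for seq, key, value in log:
        if seq <= upto:
            state[key] = value
    return state, upto

def provision_follower(log, snap_state, snap_seq):
    """Start from the snapshot, then replay everything newer."""
    state = dict(snap_state)
    for seq, key, value in log:
        if seq > snap_seq:
            state[key] = value
    return state

snap_state, snap_seq = snapshot(leader_log, upto=2)  # taken earlier
new_follower = provision_follower(leader_log, snap_state, snap_seq)
assert new_follower == {"a": 3, "b": 2, "c": 4}
```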

(See Implementation of Replication Logs) Logs are central to replication. Write-ahead logs (WALs) are especially useful, since the storage engine already maintains one.

Handling Node Outages

Followers (readers) are easy: a recovering follower just catches up from the log of writes it missed.

Leaders (writers) are hard. Failover looks like:

Determine that the leader is really down -> choose a new leader -> reconfigure the system to use the new leader
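A toy sketch of the first two steps, assuming heartbeat-based detection and a "most complete log wins" election heuristic (the timeout value and helper names are illustrative assumptions; real systems are far more careful):

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat => presume dead

def leader_is_down(last_heartbeat_at, now=None):
    now = now if now is not None else time.time()
    return (now - last_heartbeat_at) > HEARTBEAT_TIMEOUT

def choose_new_leader(replicas):
    # Common heuristic: promote the replica with the most complete log,
    # to minimize lost writes.
    return max(replicas, key=lambda r: r["log_position"])

replicas = [
    {"name": "follower-1", "log_position": 1041},
    {"name": "follower-2", "log_position": 1037},
]
if leader_is_down(last_heartbeat_at=time.time() - 10):
    new_leader = choose_new_leader(replicas)
    # ...then reconfigure clients/routers to send writes to new_leader.
```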

In general, the database should handle replication. But for advanced cases, writing application code can become necessary (e.g. trigger-based replication to save resources by replicating only a subset of the data).

Problems with Replication Lag

High-read, low-write systems are common across the web, and in the much more common async replication setups, replication lag creates all kinds of strange behavior: not all nodes get the data at the same time, so reads can land on stale nodes.

The key, to me, seems highly related to understanding what systems DO and DO NOT guarantee, so you can figure out which tradeoffs are right for you.

_ READ AFTER WRITE CONSISTENCY _ AKA read-your-own-writes - any user who performs a write will subsequently see that write, e.g. by routing their reads to the leader node that took it. (implementation varies)
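One common implementation, sketched here with an assumed one-minute window (the window length and routing helper are made up for illustration): route a user's reads to the leader for a short time after they write.

```python
import time

READ_FROM_LEADER_WINDOW = 60.0  # seconds; tuning is system-specific
last_write_at = {}  # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def pick_read_node(user_id, leader, followers):
    # A user who wrote recently reads from the leader, so they always
    # see their own writes; everyone else reads from a follower.
    recently_wrote = (
        time.time() - last_write_at.get(user_id, 0) < READ_FROM_LEADER_WINDOW
    )
    return leader if recently_wrote else followers[0]
```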

_ MONOTONIC READS _ Across multiple reads, a user never "goes back in time" and gets a stale read after a fresh one. (differs from the above because this covers any reader of a particular write, not just the writer) (implementation varies)
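A typical trick, sketched with an assumed SHA-256-based routing function: pin each user to one replica, so their reads can't bounce between replicas at different lag.

```python
import hashlib

def pick_replica(user_id, replicas):
    # Hash the user id so the same user always reads from the same
    # replica; that replica may lag, but it never "goes backward".
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]

replicas = ["replica-1", "replica-2", "replica-3"]
assert pick_replica("mrs.cake", replicas) == pick_replica("mrs.cake", replicas)
```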

_ CONSISTENT PREFIX READS _ If a sequence of writes happens in a certain order, anyone reading those writes sees them in that same order when this guarantee is implemented, so a causally ordered exchange never appears reversed ("How far into the future can you see, Mrs. Cake?" must be read before "About ten seconds usually, Mr. Poons.")
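One mitigation in partitioned systems, sketched under the assumption that causally related messages share a conversation id (the id and partition count are illustrative): write them to the same partition, so that partition's single write order is what every reader sees.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(conversation_id):
    # Stable hash: the same conversation always maps to one partition.
    return zlib.crc32(conversation_id.encode()) % NUM_PARTITIONS

# The question and its answer share a conversation id, so they land on
# the same partition and no reader sees the answer before the question.
assert partition_for("poons-cake-chat") == partition_for("poons-cake-chat")
```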

Multi-Leader Replication

If you need to scale writes but still need to replicate your DB, you need multi-leader replication, which is much harder to keep consistent. (it also enables offline clients acting as writers, collaborative editing, etc.)

( I moved all the notes on conflict handling to Conflicts )

In addition to conflict handling, the choice of multi-leader topology (e.g. circular, star, all-to-all) factors into the trade-offs as well: some topologies favor write speed, others favor consistency and fault tolerance for the data.

Leaderless Replication

( See Riak, Cassandra, Voldemort )

A different approach, which scales writes, conflict detection, and many other things much better, is leaderless replication, where all nodes can accept reads and writes.

Consensus and quorum become important not just in leader election (like above) but anytime a write happens. As long as a predefined quorum of nodes accepts the write, the write is considered good. Then, for each subsequent read, another quorum of nodes is consulted, and version numbers decide which reply is freshest.

Ideally, your quorum is set so that w + r > n, which guarantees that every read set overlaps every write set in at least one node (see the sketch after this list).

  • w (nodes needed to confirm a write)
  • r (nodes read on each read)
  • n (total nodes in a cluster)
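A toy sketch of such quorums under these definitions (versioned values, in-memory "nodes"; all names are illustrative, not a real client API):

```python
n, w, r = 3, 2, 2  # w + r > n, so read and write sets always overlap
nodes = [dict() for _ in range(n)]  # each node maps key -> (version, value)

def quorum_write(key, value, version):
    # Real systems send to all n nodes and wait for w acks; this toy
    # version simply stops once w nodes have confirmed.
    acks = 0
    for node in nodes:
        node[key] = (version, value)
        acks += 1
        if acks >= w:
            return True  # quorum reached: the write is considered good
    return False

def quorum_read(key):
    # Ask r nodes and take the value with the highest version number.
    replies = [node[key] for node in nodes[:r] if key in node]
    return max(replies)[1] if replies else None

quorum_write("x", "hello", version=1)
quorum_write("x", "world", version=2)
assert quorum_read("x") == "world"
```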

Of course, if your system can tolerate more staleness, you can relax this quorum requirement and get more speed.

Background processes can also aid in this by detecting and filling in missing data (read repair and anti-entropy processes).
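A minimal read-repair sketch, assuming each replica stores (version, value) pairs: read from several replicas, take the newest version, and write it back to any replica holding a stale copy.

```python
def read_with_repair(key, replicas):
    replies = [(rep, rep.get(key)) for rep in replicas]
    newest = max((v for _, v in replies if v is not None), default=None)
    if newest is None:
        return None
    for rep, v in replies:
        if v != newest:
            rep[key] = newest  # repair the stale replica in passing
    return newest[1]

replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
assert read_with_repair("k", replicas) == "new"
assert replicas[1]["k"] == (2, "new")  # stale copy was repaired
assert replicas[2]["k"] == (2, "new")  # missing copy was filled in
```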

There are many cases in which quorums can still fail to produce consistency (sloppy quorums, concurrent writes, partial write failures). Leaderless systems solve many problems but face different challenges than multi-leader systems. "Transactions" (chapter 7) and "Consensus" (chapter 9) can help provide stronger guarantees.

Leader-based systems can be easily monitored for staleness (compare the leader's log position vs. each replica's), but leaderless systems are tricky. There's experimentation with "expected staleness" figures underway: see Probabilistically Bounded Staleness.

Conflicts

Conflicts are the biggest problem with multi-leader and leaderless replication (different write nodes receive different writes, not necessarily in the same order, and clock skew and network latency make the true order uncertain).

Conflicts are hard to detect and resolve. The tradeoff is speed vs data loss.

  • A fast conflict detector is more likely to throw out data it shouldn't
  • At the opposite extreme, users resolve conflicts manually (like git)

LWW (last write wins), the default in Cassandra, resolves conflicts quickly but may throw out data. It timestamps incoming data (somewhat arbitrarily, thanks to clock skew, latency, etc.) and picks the write with the latest timestamp, discarding the others. LWW can be durable if keys and their data are treated as immutable, so each key is only ever written once (something akin to event sourcing).
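A sketch of LWW resolution, assuming each write carries a timestamp from its node's (possibly skewed) clock; note how the losing write vanishes silently.

```python
def lww_resolve(versions):
    """versions: list of (timestamp, value); highest timestamp wins."""
    return max(versions)[1]

conflicting = [
    (1700000000.120, "B"),  # timestamps come from skewed node clocks,
    (1700000000.118, "A"),  # so "last" may not mean what you think
]
assert lww_resolve(conflicting) == "B"  # "A" is silently discarded
```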

In more advanced algorithms, "concurrency" (in the distributed-computing sense) is important. Concurrency here refers to the knowledge a client has. For two write operations A and B:

  • If B happened on a client that was aware of A, the writes are not concurrent.
  • If neither client knew about the other's write, the operations are considered concurrent in distributed systems.

Thus conflict resolution can use "causal" relationships as a starting point: the client that created write B was aware of A, so B is strictly ordered after A; some write operation C that the writing client did not know about is concurrent, and must be resolved by some other criteria.

Some data types carry the seeds of their own resolution: CRDTs, Mergeable Persistent Data Structures, and Operational Transformation are three of the most popular automatic resolution approaches. But it must be noted that ALL auto-resolution technologies are still highly experimental.
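For flavor, a sketch of one of the simplest CRDTs, a grow-only counter (G-Counter): each replica increments only its own slot, and merges take per-slot maxima, so replicas converge no matter what order merges happen in.

```python
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def merge(self, other):
        # Per-slot max: commutative, associative, idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()  # two increments on node a
b.increment()                 # one increment on node b
a.merge(b); b.merge(a)        # merge in either order...
assert a.value() == b.value() == 3  # ...replicas converge
```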

Monitoring version states can be done using Version Vectors (closely related to, but not quite the same as, vector clocks).
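A sketch of the comparison rule for version vectors (maps of node id -> counter): A happened before B if every counter in A is <= the matching counter in B (and they differ); if neither dominates the other, the writes were concurrent and need conflict resolution.

```python
def compare(a, b):
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a happened before b"
    if b_le_a:
        return "b happened before a"
    return "concurrent"  # conflict: resolve via merge, LWW, CRDT, ...

assert compare({"n1": 1}, {"n1": 2}) == "a happened before b"
assert compare({"n1": 1}, {"n2": 1}) == "concurrent"
```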