6: DESIGN A KEY‐VALUE STORE - swchen1234/systemDesign GitHub Wiki

A key-value store, also referred to as a key-value database, is a non-relational database. key must be unique.

Single server key-value store

容易实现，但放不下两个优化：

Data Compression
memory 中只存常用key，剩下的存在disk 但即使如此，很快也会不够用

Distributed key-value store

CAP theorem

三个条件只能满足其中两个

Consistency: consistency means all clients see the same data at the same time no matter which node they connect to.
Availability: availability means any client which requests data gets a response even if some of the nodes are down.
Partition Tolerance: a partition indicates a communication break between two nodes. Partition tolerance means the system continues to operate despite network partitions.

现在市场上的key-value store 都为CP或是AP(因为network failure不可避免，Partition tolerance必须有）

System components

In this section, we will discuss the following core components and techniques used to build a key-value store:

Data partition

Consistent Hashing 优点：

Automatic scaling
Heterogeneity

Data replication

为了保证data available, 在 N 台server上保持同样的数据。N servers are chosen using the following logic: after a key is mapped to a position on the hash ring, walk clockwise from that position and choose the first N servers(不是virtual node) on the ring to store data copies.

N server的选择通常尽量在不同的physical server, data center, locations, etc.

Consistency

Quorum consensus can guarantee consistency for both read and write operations. N = The number of replicas W = A write quorum of size W. For a write operation to be considered as successful, write operation must be acknowledged from W replicas. R = A read quorum of size R. For a read operation to be considered as successful, read operation must wait for responses from at least R replicas.

W, R and N 的选择 is a typical tradeoff between latency and consistency。If W + R > N, strong consistency is guaranteed because there must be at least one overlapping node that has the latest data to ensure consistency.

Consistency Models

Strong consistency（不推荐）: any read operation returns a value corresponding to the result of the most updated write data item. A client never sees out-of-date data.
Weak consistency: subsequent read operations may not see the most updated value.
Eventual consistency(推荐）: this is a specific form of weak consistency. Given enough time, all updates are propagated, and all replicas are consistent.

Inconsistency resolution

Versioning means treating each data modification as a new immutable version of data. 每次server用过数据后将updated data item写到系统 D[(server_id, version_id), ...]，当不同server同时添加data item时，依照后面的version_id不能小于前一项的原则，判断先后，这个conflict的解决在客户端发生。

the [server: version] pairs in the vector clock could grow rapidly, 解决方法：设置threshold，删掉过于久远的pair。

Handling failures

Failure detection

如果一个server挂了，可以用all-to-all multicasting（向所有server宣布), 但会低效。

解决方法：gossip protocol

Each node maintains a node membership list, which contains member IDs and heartbeat counters. • Each node periodically increments its heartbeat counter. • Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes. • Once nodes receive heartbeats, membership list is updated to the latest info. • If the heartbeat has not increased for more than predefined periods, the member is considered as offline.

如图中，Node s0 sends heartbeats that include s2’s info to a set of random nodes. Once other nodes confirm that s2’s heartbeat counter has not been updated for a long time, node s2 is marked down, and this information is propagated to other nodes.

处理占时server failure

sloppy quorum chooses the first W healthy servers for writes and first R healthy servers for reads on the hash ring. Offline servers are ignored.
若只是一个server down, 会选一个server代为执行，之后把changes push back(aka hinted handoff)。

处理永久server failure

an anti-entropy protocol to keep replicas in sync. Anti-entropy involves comparing each piece of data on replicas and updating each replica to the newest version.
A Merkle tree is used for inconsistency detection and minimizing the amount of data transferred. Merle tree 的具体细节见书。

System architecture diagram

Main features of the architecture are listed as follows: * Clients communicate with the key-value store through simple APIs: get(key) and put(key, value). * A coordinator is a node that acts as a proxy between the client and the key-value store. • Nodes are distributed on a ring using consistent hashing. * The system is completely decentralized so adding and moving nodes can be automatic. • Data is replicated at multiple nodes. * There is no single point of failure as every node has the same set of responsibilities.

Write path

The graph explains what happens after a write request is directed to a specific node. Please note the proposed designs for write/read paths are primary based on the architecture of Cassandra.

The write request is persisted on a commit log file. 2. Data is saved in the memory cache.
When the memory cache is full or reaches a predefined threshold, data is flushed to SSTable [9] on disk. Note: A sorted-string table (SSTable) is a sorted list of <key, value> pairs. For readers interested in learning more about SStable, refer to the reference material [9].

Read path

当有read request时，先检查data是不是在memory cahce中。

如果没有的话，从disk中取。We need an efficient way to find out which SSTable contains the key => Bloom filter

6: DESIGN A KEY‐VALUE STORE - swchen1234/systemDesign GitHub Wiki

Single server key-value store

Distributed key-value store

CAP theorem

System components

Data partition

Data replication

Consistency

Consistency Models

Inconsistency resolution

Handling failures

Failure detection

解决方法：gossip protocol

处理占时server failure

处理永久server failure

System architecture diagram

Write path

Read path

Summary

⚠️ GitHub.com Fallback ⚠️

6: DESIGN A KEY‐VALUE STORE - swchen1234/systemDesign GitHub Wiki

Single server key-value store

Distributed key-value store

CAP theorem

System components

Data partition

Data replication

Consistency

Consistency Models

Inconsistency resolution

Handling failures

Failure detection

解决方法：gossip protocol

处理占时server failure

处理永久server failure

System architecture diagram

Write path

Read path

Summary

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️