6: DESIGN A KEY‐VALUE STORE - swchen1234/systemDesign GitHub Wiki
A key-value store, also referred to as a key-value database, is a non-relational database. key must be unique.
容易实现,但放不下 两个优化:
- Data Compression
- memory 中只存常用key, 剩下的存在disk 但即使如此,很快也会不够用
三个条件只能满足其中两个
- Consistency: consistency means all clients see the same data at the same time no matter which node they connect to.
- Availability: availability means any client which requests data gets a response even if some of the nodes are down.
- Partition Tolerance: a partition indicates a communication break between two nodes. Partition tolerance means the system continues to operate despite network partitions.
现在市场上的key-value store 都为CP或是AP(因为network failure不可避免,Partition tolerance必须有)
In this section, we will discuss the following core components and techniques used to build a key-value store:
Consistent Hashing 优点:
- Automatic scaling
- Heterogeneity
为了保证data available, 在 N 台server上保持同样的数据。N servers are chosen using the following logic: after a key is mapped to a position on the hash ring, walk clockwise from that position and choose the first N servers(不是virtual node) on the ring to store data copies.
- N server的选择通常尽量在不同的physical server, data center, locations, etc.

Quorum consensus can guarantee consistency for both read and write operations. N = The number of replicas W = A write quorum of size W. For a write operation to be considered as successful, write operation must be acknowledged from W replicas. R = A read quorum of size R. For a read operation to be considered as successful, read operation must wait for responses from at least R replicas.
- W, R and N 的选择 is a typical tradeoff between latency and consistency。If W + R > N, strong consistency is guaranteed because there must be at least one overlapping node that has the latest data to ensure consistency.
- Strong consistency(不推荐): any read operation returns a value corresponding to the result of the most updated write data item. A client never sees out-of-date data.
- Weak consistency: subsequent read operations may not see the most updated value.
- Eventual consistency(推荐): this is a specific form of weak consistency. Given enough time, all updates are propagated, and all replicas are consistent.
Versioning means treating each data modification as a new immutable version of data. 每次server用过数据后将updated data item写到系统 D[(server_id, version_id), ...],当不同server同时添加data item时,依照后面的version_id不能小于前一项的原则,判断先后,这个conflict的解决在客户端发生。
- the [server: version] pairs in the vector clock could grow rapidly, 解决方法:设置threshold, 删掉过于久远的pair。

如果一个server挂了,可以用all-to-all multicasting(向所有server宣布), 但会低效。
Each node maintains a node membership list, which contains member IDs and heartbeat counters. • Each node periodically increments its heartbeat counter. • Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes. • Once nodes receive heartbeats, membership list is updated to the latest info. • If the heartbeat has not increased for more than predefined periods, the member is considered as offline.

- sloppy quorum chooses the first W healthy servers for writes and first R healthy servers for reads on the hash ring. Offline servers are ignored.
- 若只是一个server down, 会选一个server代为执行,之后把changes push back(aka hinted handoff)。
- an anti-entropy protocol to keep replicas in sync. Anti-entropy involves comparing each piece of data on replicas and updating each replica to the newest version.
- A Merkle tree is used for inconsistency detection and minimizing the amount of data transferred. Merle tree 的具体细节见书。



The graph explains what happens after a write request is directed to a specific node. Please note the proposed designs for write/read paths are primary based on the architecture of Cassandra.
- The write request is persisted on a commit log file. 2. Data is saved in the memory cache.
- When the memory cache is full or reaches a predefined threshold, data is flushed to SSTable [9] on disk. Note: A sorted-string table (SSTable) is a sorted list of <key, value> pairs. For readers interested in learning more about SStable, refer to the reference material [9].
- 当有read request时,先检查data是不是在memory cahce中。

- 如果没有的话,从disk中取。We need an efficient way to find out which SSTable contains the key => Bloom filter

