6: DESIGN A KEY‐VALUE STORE - swchen1234/systemDesign GitHub Wiki

A key-value store, also referred to as a key-value database, is a non-relational database. Each key must be unique.

Single server key-value store

Easy to implement, but a single server cannot hold much data. Two optimizations:

  • Data compression
  • Keep only frequently used keys in memory and store the rest on disk

Even with these optimizations, a single server quickly becomes insufficient.

Distributed key-value store

CAP theorem

Only two of the three guarantees can be satisfied at the same time:

  • Consistency: consistency means all clients see the same data at the same time no matter which node they connect to.
  • Availability: availability means any client which requests data gets a response even if some of the nodes are down.
  • Partition Tolerance: a partition indicates a communication break between two nodes. Partition tolerance means the system continues to operate despite network partitions.

Key-value stores on the market today are all either CP or AP (since network failures are unavoidable, partition tolerance is a must).

System components

In this section, we will discuss the following core components and techniques used to build a key-value store:

Data partition

Advantages of consistent hashing:

  • Automatic scaling
  • Heterogeneity

Data replication

To keep data available, the same data is replicated on N servers. The N servers are chosen using the following logic: after a key is mapped to a position on the hash ring, walk clockwise from that position and choose the first N servers (not virtual nodes) on the ring to store the data copies.

  • The N servers should, as far as possible, be spread across different physical servers, data centers, locations, etc.
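The partitioning and replica-selection logic above can be sketched as follows (the `HashRing` class, MD5 ring positions, and the `vnodes` parameter are illustrative assumptions, not prescribed by the text):

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Map a string to a position on the ring via MD5."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=3):
        # The ring is a sorted list of (position, server); each server
        # appears at several positions via virtual nodes.
        self.ring = []
        for server in servers:
            for i in range(vnodes):
                self.ring.append((_hash(f"{server}#{i}"), server))
        self.ring.sort()

    def replicas(self, key, n):
        """Walk clockwise from the key's position and pick the first
        n *distinct* servers (virtual nodes of the same server count once)."""
        start = bisect.bisect(self.ring, (_hash(key),))
        chosen = []
        for i in range(len(self.ring)):
            _, server = self.ring[(start + i) % len(self.ring)]
            if server not in chosen:
                chosen.append(server)
            if len(chosen) == n:
                break
        return chosen
```

For example, `HashRing(["s0", "s1", "s2", "s3"]).replicas("user:42", 3)` returns three distinct servers to hold the copies.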

Consistency

Quorum consensus can guarantee consistency for both read and write operations.

  • N = the number of replicas.
  • W = the write quorum size. For a write operation to be considered successful, it must be acknowledged by W replicas.
  • R = the read quorum size. For a read operation to be considered successful, it must wait for responses from at least R replicas.

  • The choice of W, R and N is a typical tradeoff between latency and consistency. If W + R > N, strong consistency is guaranteed because there must be at least one overlapping node that has the latest data.

Consistency Models
  • Strong consistency (not recommended): any read operation returns a value corresponding to the result of the most recent write. A client never sees out-of-date data.
  • Weak consistency: subsequent read operations may not see the most updated value.
  • Eventual consistency (recommended): this is a specific form of weak consistency. Given enough time, all updates are propagated and all replicas become consistent.
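The W + R > N overlap argument can be verified exhaustively for small N (a brute-force sketch; the function name is mine):

```python
from itertools import combinations

def quorums_always_overlap(n, w, r):
    """Check that every size-w write quorum and size-r read quorum
    drawn from n replicas share at least one replica.
    This holds exactly when w + r > n (pigeonhole argument)."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))
```

With N = 3, W = 2, R = 2 every read quorum intersects every write quorum, so a read always sees at least one up-to-date replica; with W = R = 1 the quorums can be disjoint and stale reads are possible.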

Inconsistency resolution

Versioning means treating each data modification as a new immutable version of data. Each time a server updates a data item, it writes the updated item back with a vector clock D = [(server_id, version_id), ...]. When different servers update the same item concurrently, the versions are compared: one version is an ancestor of another only if each of its version_ids is no greater than the corresponding one in the later version; if neither version dominates, there is a conflict, and the conflict is resolved on the client side.

  • The [server: version] pairs in the vector clock could grow rapidly. Solution: set a threshold and, once it is exceeded, remove the oldest pairs.
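The vector-clock comparison can be sketched as follows (the `compare` function and its return labels are illustrative):

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks, given as dicts of server_id -> version_id.
    Returns 'ancestor' (a happened before b), 'descendant', 'equal',
    or 'conflict' (concurrent updates: neither dominates)."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"
    if b_le_a:
        return "descendant"
    return "conflict"
```

For example, `{"s0": 2, "s1": 1}` vs `{"s0": 1, "s1": 2}` is a conflict: each clock is ahead on one server, so the client must merge the two versions.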

Handling failures

Failure detection

If a server goes down, all-to-all multicasting (announcing to every server) would work, but it is inefficient.

Solution: gossip protocol

Each node maintains a node membership list, which contains member IDs and heartbeat counters.

  • Each node periodically increments its heartbeat counter.
  • Each node periodically sends heartbeats to a set of random nodes, which in turn propagate them to another set of nodes.
  • Once nodes receive heartbeats, the membership list is updated to the latest info.
  • If a heartbeat has not increased for more than a predefined period, the member is considered offline.

As shown in the figure, node s0 sends heartbeats that include s2's info to a set of random nodes. Once other nodes confirm that s2's heartbeat counter has not been updated for a long time, node s2 is marked down, and this information is propagated to the other nodes.
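The heartbeat bookkeeping above can be sketched as a single-process simulation (the `Node` class is illustrative; real gossip picks peers at random and exchanges lists over the network):

```python
class Node:
    """Tracks (heartbeat counter, local time of last increase) per member."""

    def __init__(self, node_id, all_ids):
        self.id = node_id
        self.members = {i: (0, 0) for i in all_ids}

    def tick(self, now):
        # Periodically increment our own heartbeat counter.
        hb, _ = self.members[self.id]
        self.members[self.id] = (hb + 1, now)

    def receive(self, peer_members, now):
        # Merge a peer's membership list, keeping the higher counter
        # and recording when we last saw it increase.
        for nid, (hb, _) in peer_members.items():
            if hb > self.members[nid][0]:
                self.members[nid] = (hb, now)

    def offline(self, now, timeout):
        # Members whose counter has not increased within `timeout`
        # are considered down.
        return [nid for nid, (_, last) in self.members.items()
                if now - last > timeout]
```

If s2 never sends heartbeats while s0 and s1 gossip with each other, s0's membership list eventually flags s2 as offline.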
Handling temporary server failures
  • Sloppy quorum: choose the first W healthy servers for writes and the first R healthy servers for reads on the hash ring. Offline servers are ignored.
  • If a single server is down, another server is chosen to process requests on its behalf; once the failed server comes back, the changes are pushed back to it (aka hinted handoff).
Handling permanent server failures
  • An anti-entropy protocol keeps replicas in sync. Anti-entropy involves comparing each piece of data on replicas and updating each replica to the newest version.
  • A Merkle tree is used for inconsistency detection and to minimize the amount of data transferred. See the book for the details of Merkle trees.
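A minimal Merkle-root computation can be sketched as follows (assuming SHA-256 and duplicating the last node on odd levels; real implementations vary in these details):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Hash each leaf, then repeatedly hash adjacent pairs
    until a single root hash remains."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

Two replicas compare roots first: equal roots mean the whole bucket is in sync; different roots mean they recurse into child subtrees, so only the differing pieces of data are transferred.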

System architecture diagram

Main features of the architecture are listed as follows:

  • Clients communicate with the key-value store through simple APIs: get(key) and put(key, value).
  • A coordinator is a node that acts as a proxy between the client and the key-value store.
  • Nodes are distributed on a ring using consistent hashing.
  • The system is completely decentralized, so adding and moving nodes can be automatic.
  • Data is replicated at multiple nodes.
  • There is no single point of failure, as every node has the same set of responsibilities.

Write path

The diagram explains what happens after a write request is directed to a specific node. Please note that the proposed designs for the write/read paths are primarily based on the architecture of Cassandra.

  1. The write request is persisted in a commit log file.
  2. Data is saved in the memory cache.
  3. When the memory cache is full or reaches a predefined threshold, data is flushed to an SSTable [9] on disk. Note: a sorted-string table (SSTable) is a sorted list of <key, value> pairs. For readers interested in learning more about SSTables, refer to the reference material [9].
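The three write-path steps can be sketched as follows (the `StorageNode` class and `memtable_limit` threshold are illustrative; a real commit log and SSTables live on disk):

```python
class StorageNode:
    """Minimal sketch of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log, used for crash recovery
        self.memtable = {}     # in-memory cache of recent writes
        self.sstables = []     # each SSTable is an immutable sorted list of pairs
        self.limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. persist to commit log
        self.memtable[key] = value             # 2. save in the memory cache
        if len(self.memtable) >= self.limit:   # 3. flush when the threshold is hit
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}
```

Note that the flushed table is sorted by key, which is what makes the on-disk structure a *sorted*-string table.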

Read path

  • When a read request arrives, first check whether the data is in the memory cache.
  • If not, retrieve it from disk. We need an efficient way to find out which SSTable contains the key => Bloom filter
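A toy Bloom filter sketch (the bit-array size `m` and hash count `k` are arbitrary choices here). A Bloom filter can return false positives but never false negatives, so a negative answer lets the read path safely skip an SSTable:

```python
import hashlib

class BloomFilter:
    """k hash functions over an m-bit array, stored as a Python int."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0

    def _positions(self, key):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```

In the read path, each SSTable keeps its own filter; only tables whose filter answers "possibly present" are actually read from disk.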

Summary
