5: DESIGN CONSISTENT HASHING - swchen1234/systemDesign GitHub Wiki
目的: distribute requests/data efficiently and evenly across servers.
The rehashing Problem
$$serverIndex = hash(key) % N, where N is the size of the server pool.$$ problems arise when new servers are added, or existing servers are removed. This means that when server 1 goes offline, most cache clients will connect to the wrong servers to fetch data. 解决方法:consistent hashing
Consistent hashing
Basic Steps:
- Map servers and keys on to the ring using a uniformly distributed hash function.
- Hash function的选择和之前不同,如SHA-1 is used as the hash function f, and the output range of the hash function is: x0, x1, x2, x3, ..., xn. In cryptography, SHA-1’s hash space goes from 0 to 2^160 - 1.
- 注意这里不用module了。
- To find out which server a key is mapped to, go clockwise from the key position until the first server on the ring is found.
- 添加/删除 a server,依照顺时针查找的原理,只有 k/n 个keys 需要被 remapped on average, where k is the number of keys, and n is the number of slots.
Two Issues:
- it is impossible to keep the same size of partitions on the ring for all servers considering a server can be added or removed.
- it is possible to have a non-uniform key distribution on the ring
解决方法 => Virtual nodes
Instead of using s0, we have s0_0, s0_1, and s0_2 to represent server 0 on the ring. Similarly, s1_0, s1_1, and s1_2 represent server 1 on the ring. With virtual nodes, each server is responsible for multiple partitions. Partitions (edges) with label s0 are managed by server 0. On the other hand, partitions with label s1 are managed by server 1. 实验证明virtual node 越多,standard deviation越小,从而mapping 更加平均。 virtualNode
Benefits of Consistent Hashing
- Minimized keys are redistributed when servers are added or removed.
- It is easy to scale horizontally because data are more evenly distributed.
- Mitigate hotspot key problem