2.6 Partitioning

Replicas are full copies of a given set of data stored on different nodes; partitions are different subsets of the data spread across different nodes. Scalability is the ultimate goal of partitioning.

Most often, each record belongs to exactly one partition, but potentially to many replicas.

The main challenge of partitioning is keeping the partitions "balanced" and avoiding "skew" and "hotspots". Skew is when one partition holds significantly more data than the others, so its node uses more disk than it should. Hotspots are partitions (and the nodes serving them) that end up getting a disproportionate amount of load.

Partitioning by key range

These issues are avoided by choosing or creating a partition key wisely. Many systems can be configured to set partition boundaries and rebalance data automatically, but both tasks can also be done manually ((I would think the former might be manual sometimes, but rarely the latter?))

Indexes can be used to keep the sort order correct ((Partitions are, implicitly, indexes on the key used to create them. Most systems will treat them as such and use index scans to find the right partition node to send a query to))

Basic key-range partitions are vulnerable to hotspots ((especially, I would think, if usage patterns are not fully understood)). When building a composite key-range partition key, think about hotspots and how you might avoid them.
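A toy sketch of the idea in Python (the boundaries and key formats below are invented for illustration, not taken from any particular database): a key-range partitioner is essentially a sorted list of split points, and a composite key is one way to keep a hot prefix from funnelling everything into a single partition.

```python
import bisect

# Minimal key-range partitioner. Each partition owns a contiguous, sorted
# range of keys, so range scans touch few partitions -- but a skewed key
# (e.g. an always-increasing timestamp) sends every write to one
# partition and creates a hotspot.
BOUNDARIES = ["f", "m", "t"]  # partition 0: < "f", 1: ["f","m"), 2: ["m","t"), 3: >= "t"

def partition_for(key: str) -> int:
    # bisect keeps keys in sorted order across partitions,
    # which is what makes range queries cheap
    return bisect.bisect_right(BOUNDARIES, key)

# A composite key can spread a hot prefix: "sensor_id:timestamp" instead of
# a raw timestamp distributes concurrent writes across sensors.
print(partition_for("2024-06-01T12:00"))      # every timestamp lands in partition 0 -> hotspot
print(partition_for("sensor-42:2024-06-01"))  # prefix spreads the load across partitions
```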

Partitioning by Hash of Key

A good hash function takes skewed data and makes it uniformly distributed.

Your hash function does NOT need to be cryptographically strong. What matters is that it is pure with respect to its input: the same key must always produce the same hash, independent of the process or machine it runs in.
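A small sketch of what "process independent" means in practice (partition count is arbitrary here): Python's built-in hash() for strings is salted per process, so two nodes could disagree about where a key lives; hashing the key bytes with MD5 is stable everywhere and does not need to be cryptographically strong for this purpose.

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    # deterministic, process-independent hash of the key bytes;
    # built-in hash(key) would vary between processes (PYTHONHASHSEED)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

print(partition_for("user:123"))  # same result in every process, on every node
```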

Hashed partition keys do NOT behave like a sorted index, which is a drawback: adjacent keys are scattered across partitions, so databases lose the implicit index and fall back to scanning all partitions for range queries unless other indexes are established.

Cassandra supports compound primary keys, which hash only the first part of the key (the partition key) and keep the other columns as a sorted clustering index within the partition. A range search over the hashed part is not efficient, but once the hashed part is fixed, range scans over the other columns of the compound key are.
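A rough sketch of the compound-key idea (this is not Cassandra's actual storage engine, just the routing/sorting principle): route by a hash of the first column, keep rows sorted by the remaining columns within that partition, and range scans over those columns become cheap once the partition key is fixed.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4
# partitions[i] is a sorted list of (clustering_key, value) tuples
partitions = [[] for _ in range(NUM_PARTITIONS)]

def route(partition_key: str) -> int:
    # only the first part of the compound key is hashed
    h = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(h[:8], "big") % NUM_PARTITIONS

def insert(user_id: str, timestamp: str, value: str) -> None:
    p = partitions[route(user_id)]
    bisect.insort(p, (timestamp, value))  # rows stay sorted by the clustering column

def range_scan(user_id: str, start: str, end: str):
    # efficient: touches exactly one partition and a contiguous sorted slice
    p = partitions[route(user_id)]
    lo = bisect.bisect_left(p, (start,))
    hi = bisect.bisect_right(p, (end, "\uffff"))
    return [value for _, value in p[lo:hi]]
```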

In the end, hotspots and skew cannot be fully accounted for automatically; it's up to the application and the app team to prepare for them.

Partitioning and Secondary Indexes

Any non-primary index, especially multiple indices, might not map well to our partitions. We handle this with document-based or term-based partitioning strategies.

Document based ("local")

Faster writes, slower reads. Partitions are determined by primary key or other unique id (a reference to the "document" or row). This leads to scatter/gather reads, where a read could hit every partition if the filter is not related in any way to the PK or unique id. It is prone to tail latency amplification, but is good enough for MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB.
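A sketch of what a "local" secondary index implies for reads (the data layout below is invented for illustration): each partition indexes only its own rows, so a write touches one partition, but a query on the indexed column has to be sent to every partition and the results merged.

```python
# Each partition keeps a secondary index over its OWN rows only ("local").
partitions = [
    {"rows": {1: {"color": "red"}, 2: {"color": "blue"}},
     "by_color": {"red": [1], "blue": [2]}},
    {"rows": {3: {"color": "red"}},
     "by_color": {"red": [3]}},
]

def write(partition_id: int, pk: int, row: dict) -> None:
    # fast: the write touches a single partition and its local index
    p = partitions[partition_id]
    p["rows"][pk] = row
    p["by_color"].setdefault(row["color"], []).append(pk)

def find_by_color(color: str) -> list:
    # scatter/gather: every partition must be queried, and the slowest
    # one dictates latency (tail latency amplification)
    results = []
    for p in partitions:
        for pk in p["by_color"].get(color, []):
            results.append(p["rows"][pk])
    return results

print(find_by_color("red"))  # hits both partitions
```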

Term Based ("global")

Faster reads, slower writes. We can avoid scatter/gather by partitioning the index itself by the indexed term and storing a section of it on each partition. The initial query gathers PKs from a single index partition, then fetches the needed rows from whichever partitions hold them.

For example, suppose we have a secondary index on a text column name. Partition 1's portion of this index stores only the PKs of rows whose name starts with A - C; partition 2 takes on D - H; partitions are balanced as needed. ((For a more even load, we could hash name to calculate the partition.))
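A sketch of that term-partitioned layout (the letter ranges are just the running example above): the index entry for a name lives on whichever index partition owns that range of the alphabet, and it points at primary keys whose rows may live anywhere.

```python
import bisect

# Term-partitioned ("global") secondary index on `name`, split by term range.
TERM_BOUNDARIES = ["D", "I"]  # index partition 0: A-C, 1: D-H, 2: I-Z

def index_partition_for(name: str) -> int:
    return bisect.bisect_right(TERM_BOUNDARIES, name[0].upper())

# index_partitions[i] maps a name to the PKs of matching rows,
# wherever those rows are actually stored
index_partitions = [{}, {}, {}]

def index_write(name: str, pk: int) -> None:
    # a single row write may now touch a DIFFERENT node than the row itself,
    # which is why writes are slower (and often applied asynchronously)
    part = index_partitions[index_partition_for(name)]
    part.setdefault(name, []).append(pk)

def find_pks(name: str) -> list:
    # the read goes to exactly one index partition instead of all of them
    return index_partitions[index_partition_for(name)].get(name, [])

index_write("Alice", 17)   # index entry lives on index partition 0
index_write("Erin", 42)    # index partition 1
print(find_pks("Alice"))   # [17]
```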

DynamoDB uses this approach, updating its global secondary indexes asynchronously, which can take a while.

Rebalancing Partitions

DON'T use hash mod N. It makes sense until you realize that whenever N changes (a node is added or removed), most keys map to a different node, causing a LOT of data-movement overhead on every rebalance.
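A quick way to see the problem (toy numbers, same deterministic hash as above): count how many keys change partition when N goes from 10 to 11 under hash(key) mod N.

```python
import hashlib

def stable_hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if stable_hash(k) % 10 != stable_hash(k) % 11)
print(f"{moved / len(keys):.0%} of keys move when going from 10 to 11 nodes")
# roughly 90% of keys move -- far more than the ~1/11 that strictly needs to
```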

Fixed Partitions

Greatly simplifies rebalancing; used in Riak, Elasticsearch, Couchbase, and Voldemort. Create many more partitions than nodes, and as you add nodes, each new node steals a few partitions from existing ones (we add nodes and have fewer partitions per node as the database grows, so the system scales without changing the number of partitions). Transfer to the new node happens in a background process while the original node keeps serving the partition. You can add weaker or stronger nodes hardware-wise and divvy up partitions accordingly.

More partitions scale better but have more fixed overhead; fewer partitions have less fixed overhead (there are fewer partitions to manage for the entire lifetime of the database) but don't scale as well (moving big partitions is more expensive).
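A sketch of the fixed-partition scheme (partition and node counts are invented): the key-to-partition mapping never changes, so no data is rehashed; rebalancing only reassigns whole partitions to nodes.

```python
import hashlib

NUM_PARTITIONS = 256          # fixed for the lifetime of the database
nodes = ["node-a", "node-b"]  # current cluster

def partition_for(key: str) -> int:
    # key -> partition never changes, even when nodes come and go
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big") % NUM_PARTITIONS

# partition -> node is the only mapping that rebalancing touches
assignment = {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(new_node: str) -> None:
    nodes.append(new_node)
    target = NUM_PARTITIONS // len(nodes)  # partitions the new node should end up with
    stolen = 0
    for p in range(NUM_PARTITIONS):
        if stolen >= target:
            break
        # steal whole partitions (a real system would steal roughly evenly
        # from each existing node); the data is copied in the background
        # while the old node keeps serving the partition
        if assignment[p] != new_node:
            assignment[p] = new_node
            stolen += 1

add_node("node-c")
print(assignment[partition_for("user:123")])  # which node serves this key now
```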

Dynamic Partitions

RethinkDB, HBase, MongoDB

With some additional complexity and overhead, partitions can merge when data shrinks and split when data grows (at ~10 GB per partition in HBase, for example). It's similar to how a B-tree index splits and merges pages.

The "fixed overhead" associated with a fixed n partitions is now dynamic and responds to the size of the data (with the added overhead of managing this). You can set a higherr than 1 initial partition count when anticipating a large seeding job.

Other

In Cassandra and Ketama, the number of partitions is proportional to the number (and size) of nodes; add nodes to get more partitions.

Auto/manual

There's a gradient: some rebalancing is fully manual, some comes as automated suggestions that an admin approves, and some is fully automatic. Fully automatic rebalancing takes less work but offers less predictability.

Request Routing

NOTE - consistent hashing, per Karger et al., is used in systems like CDNs, but rarely in databases. It's similar to, but really not the same as, the hash-based key partitioning described above.
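For contrast, a minimal sketch of the Karger-style consistent-hashing ring (the virtual-node count and hash choice are arbitrary): nodes and keys are hashed onto the same circle and a key belongs to the next node clockwise, so adding a node only moves the keys between it and its predecessor.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=8):
        # each node gets several ring positions ("virtual nodes") for smoother balance
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )

    def node_for(self, key: str) -> str:
        h = ring_hash(key)
        idx = bisect.bisect_right(self.ring, (h, chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]  # wrap around the circle

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:123"))
```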

NOTE - Postgres supports several partitioning strategies (range, list, and hash) depending on the data