Sharding - rFronteddu/general_wiki GitHub Wiki

Sharding or Data Partitioning is a technique to break up a big DB into many smaller parts. After a certain point, it is cheaper and more feasible to scale horizontally by adding more machines than to grow vertically by adding beefier servers.

Partitioning Methods

Horizontal Partitioning (row partitioning)

Splits a table into multiple smaller tables (partitions) based on a partition key. Each partition contains a subset of the original table's rows, but all columns are preserved.

image

Advantages:

  • Useful to handle large volumes of data with many rows
  • Faster queries: Reduced data size means queries can be faster
  • Higher availability and scalability: Data distribution can improve availability and scalability

Negatives:

  • Increased complexity
  • Higher overhead and maintenance costs
  • Uneven distribution: Some partitions may contain more data than others
  • More complicated order by operations: Data may need to be fetched from different partitions and sorted in the application's code

Horizontal partitioning is different from vertical partitioning, which divides data based on columns, and hybrid partitioning, which combines both horizontal and vertical partitioning.

Vertical Partitioning (col partitioning)

Each partition contains a different set of columns, which can improve data storage and retrieval efficiency.

image

Advantages:

  • Reduce the amount of data that needs to be read from disk,
  • Improve query performance
  • Reduce contention for table locks.
  • It can be used even when a table has already been normalized, which also involves splitting columns across tables.

Negatives:

  • Hard to scale if one of the features scales more than another.

Directory Based Partitioning (metadata-based sharding)

Splits dataset into shards across multiple computers. Uses a lookup service to maintain a mapping of data to shards. The lookup service keeps track of the current partitioning scheme and which database shard each entity is stored on.

image

Advantages:

  • Flexibility: It allows for data distribution based on a variety of criteria, such as business logic and data attributes.
  • Adaptability: It can meet complex distribution needs.
  • Ease of management: It simplifies the process of shard management and rebalancing.
  • Dynamic changes: It supports dynamic changes to data distribution rules.

However:

  • Partition splitting: In directory-based sharding, a partition can only be split into two partitions at a time.
  • Reverting to un-sharded architecture: difficult and expensive operation to revert because backups made before sharding won't include data written after partitioning.

Partitioning Criteria

  • Key or Hash-Based: Distributes data based on a hash function applied to a key attribute.
    • Must ensure a uniform allocation of data among servers.
    • If the servers number change, consider using Consistent Hashing.
  • List Partitioning: divides data into partitions based on predefined lists.
    • Each partition contains a specific list of values that determine which data belongs to that partition. (for example store data from Italy in the Italian partition).
    • Useful when a column can only contain a limited set of values. It allows you to equally distribute the rows by assigning a proper set of values to each partition.
    • Several benefits including: Improved performance, Enhanced data organization, Scalability, and Data isolation and security.
  • Range Partition: divides data into nonoverlapping ranges based on specified values or value ranges in a table.
    • It's often used when a table has a continuous key, such as time, date, region, or product category.
    • For example, a table of commercial transactions, blog posts, or events could be partitioned by year, month, or week.
    • This can help keep recent data in one partition and distribute historical data across larger partitions stored on slower disks.
  • Composite partition: Combine any partitioning scheme to devise new ones.