
YugabyteDB

  • YugabyteDB is a distributed PostgreSQL database for modern applications, with a flexible underlying storage engine called DocDB that supports multiple data models, including relational (SQL) and semi-structured JSON document models.

  • DocDB is the underlying distributed storage engine that powers both YSQL and YCQL. It is short for Document Database and stores data as key-value pairs; each row is stored as a document-like structure, e.g. [Table ID] + [Primary Key] + [Column/Subfield] + [Hybrid Timestamp] -> {Value}

    • (table_id, id, 'name', ts) → "abc"
    • (table_id, id, 'email', ts) → "[email protected]"
  • It extends Postgres's capabilities by adding:

    • Automatic sharding
    • Fault tolerance, and
    • Multi-region support
  • Tables can be partitioned horizontally within the same database, partitioned vertically, and sharded across nodes.

  • These features make it suitable for cloud-native and globally distributed applications.

  • YugabyteDB supports row-level sharding by hashing or by range partitioning on the primary key (see the sketch after this list).

  • When new nodes are added, shards are rebalanced automatically to utilize the new resources. This enables seamless horizontal scaling.

  • It provides strong ACID compliance with distributed transactions in YSQL, and non-distributed (single-partition) ACID transactions with YCQL.

  • YugabyteDB offers high performance by combining table partitioning and sharding.
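
A rough YSQL sketch of the two sharding styles on the primary key; the table and column names here are illustrative, not from this wiki:

```sql
-- Hash sharding (the default for YSQL): rows are distributed by a hash of user_id.
CREATE TABLE users_by_hash (
    user_id BIGINT,
    name    TEXT,
    PRIMARY KEY (user_id HASH)
);

-- Range sharding: rows are stored in sorted order of created_at,
-- which suits range scans such as time-window queries.
CREATE TABLE events_by_time (
    created_at TIMESTAMP,
    event_id   BIGINT,
    payload    JSONB,
    PRIMARY KEY (created_at ASC, event_id)
);
```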


What are the key components in YugabyteDB?

  • Master - Metadata manager (tablet-to-node mapping, cluster state). A cluster of 1-3 master nodes is recommended for high availability and fault tolerance.
  • Tablet - A shard of a table, replicated across nodes using Raft.
  • TServer (Tablet Server) - Stores the actual data (tablets) and handles client read/write requests. One YB-TServer process runs per node.
  • Raft Group - The set of replicas (one leader and multiple followers) for a tablet. For each tablet, one node acts as the leader and the others are followers.

Explain the architecture, fault tolerance, and high availability in YugabyteDB.

  • YugabyteDB is open source and is built on top of RocksDB, which it further enhances. RocksDB is a high-performance, embeddable, persistent key-value store developed by Facebook.
  • RocksDB is based on Log-Structured Merge trees (LSM trees); each tablet is a RocksDB instance stored as an LSM tree, so a node hosts many LSM trees.
  • Client requests can initially land on any node in the cluster because clients typically connect through a load balancer that routes requests to any available node.
  • Client queries are handled by the YB-TServer on that node, which caches the master's metadata locally and routes writes to the tablet's leader.
  • The leader appends the write to its local RocksDB WAL, sends it to the followers, and waits for a quorum of acknowledgments before committing.
  • YugabyteDB uses the Raft consensus protocol to replicate data and ensure fault tolerance and strong consistency across distributed nodes. Among the replicas in a Raft group, one node is elected as the leader. The leader receives client write requests and appends them to its write-ahead log.
  • The Raft consensus protocol is implemented as part of the YB-TServer.
  • Not all reads go to the leader; it depends on the consistency requirements and configuration, but the leader handles all writes and coordinates replication to the follower replicas.
  • By default, both reads and writes are handled by the leader. Reads can be served from followers, but they then become eventually consistent (a follower-read sketch follows this list).
  • Multi-master, multi-region deployment with replication makes it highly available and fault-tolerant.
  • Master Configuration: The master holds the cluster metadata and handles tablet placement, leader election, and cluster-wide operations.
    • Masters and TServers can run on the same nodes (the default/shared mode), which is simpler.
    • YugabyteDB also supports dedicated master nodes, where YB-Master processes run only on specific nodes (1-3 nodes) separate from the nodes running YB-TServers.
  • Each Raft group tracks its own leader and followers for that specific tablet, not for all tablets cluster-wide.
  • Metadata at the master is updated when a table is created, re-sharded, etc.
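
A minimal sketch of enabling follower (eventually consistent) reads from YSQL, assuming the yb_read_from_followers session parameter is available in the running version; the users table is hypothetical:

```sql
-- Opt in to follower reads for this session.
SET yb_read_from_followers = true;

-- Follower reads require a read-only transaction.
BEGIN TRANSACTION READ ONLY;
SELECT * FROM users WHERE user_id = 42;   -- may be served by a nearby follower
COMMIT;
```

Without these settings, reads are routed to the tablet leader and stay strongly consistent.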

Cluster Configuration Type 1

Cluster Configuration Type 2


What is table partitioning and how does it work in YugabyteDB?

  • Table partitioning is horizontal scaling within a single database (a sketch follows this list). YugabyteDB supports:
    • Range Partitioning: Based on a range of values (e.g., date ranges).
    • List Partitioning: Based on a list of values (e.g., country names).
    • Hash Partitioning: Based on a hash function for even data distribution.
    • Composite: Combining two types (e.g., hash + range).
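
A minimal YSQL sketch of range partitioning by date; the orders table and monthly boundaries are illustrative assumptions:

```sql
-- Range partitioning by month; the parent table holds no data itself.
CREATE TABLE orders (
    order_id   BIGINT,
    created_at DATE NOT NULL,
    amount     NUMERIC,
    PRIMARY KEY (order_id, created_at)   -- partition key must be part of the PK
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_01 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE orders_2024_02 PARTITION OF orders
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
```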

How does sharding work in YugabyteDB?

  • Shards in YugabyteDB are called tablets, each of which is distributed across the cluster.
  • Each table is automatically split into tablets (default: 8 tablets per node).
    • Tablet = a shard containing a subset of rows.
    • Multiple tablets can reside on the same node (row partitioning within node).
    • Sharding = distributing these tablets across nodes in the cluster.
  • It splits tablets automatically, but only one logical table is visible to the user. The default tablet size limit is around 10 GB per tablet, but it can be changed. Tablet splitting (resharding) happens automatically when a tablet grows beyond the configured size thresholds.
  • When you create a table with hash sharding (the default for YSQL tables without explicit range partitioning), YugabyteDB starts with roughly one tablet per node in the cluster. As data grows, tablets that exceed size thresholds (e.g., 128 MiB in the low phase, up to 100 GiB in the final phase) are automatically split into smaller tablets to maintain balanced load and performance. For example, with 3 nodes in the cluster, one tablet is created on each node; as data grows, those tablets are further split into sub-tablets across the 3 nodes. A table can also be pre-split at creation time (see the sketch after this list).
  • Data is replicated across multiple nodes using Raft consensus.
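
A hedged sketch of pre-splitting a hash-sharded table at creation time; the page_views table and the tablet count are illustrative:

```sql
-- Pre-split a hash-sharded table into a fixed number of tablets at creation time,
-- instead of relying only on automatic tablet splitting as data grows.
CREATE TABLE page_views (
    page_id BIGINT,
    view_ts TIMESTAMP,
    PRIMARY KEY (page_id HASH, view_ts)
) SPLIT INTO 6 TABLETS;
```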

How does range-based sharding work in YugabyteDB?

  • To achieve range-based sharding, we can specify the number of tablets or the tablet boundaries at table creation, e.g.
    • CREATE TABLE census_stats (age INT, user_id INT, PRIMARY KEY (age ASC, user_id)) SPLIT AT VALUES ((0), (20), (40), (60), (80), (100));
    • This creates tablets covering the age ranges: (-∞, 0), [0, 20), [20, 40), [40, 60), [60, 80), [80, 100), [100, +∞)

How does region-based sharding work in YugabyteDB?

  • It is a custom scheme: YugabyteDB partitions data based on a user-specified column (such as a region or country) and maps those partitions (range shards) to specific geographic regions and to nodes within those regions.

  • Region to Node mapping is done by associating tablets with tablespaces or placement blocks that define geographic or cloud-region boundaries.

  • View tablespaces: SELECT spcname FROM pg_tablespace;

    • We might see tablespaces like us_central1_ts, europe_west1_ts, etc., corresponding to the cluster's regions, or we can create new ones (see the tablespace sketch after this list).
  • Create a partitioned table by region with tablespaces

    • CREATE TABLE users (user_id UUID NOT NULL, region TEXT NOT NULL, name TEXT, email TEXT, PRIMARY KEY (region, user_id)) PARTITION BY LIST (region);
  • Create partitions assigned to region-specific tablespaces

    • CREATE TABLE users_us_central1 PARTITION OF users FOR VALUES IN ('us-central1') TABLESPACE us_central1_ts;
    • CREATE TABLE users_europe_west1 PARTITION OF users FOR VALUES IN ('europe-west1') TABLESPACE europe_west1_ts;
  • Select the data for us-central1:

    • SELECT * FROM users WHERE region = 'us-central1';
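
A hedged sketch of how a region-pinned tablespace might be created, assuming the replica_placement tablespace option; the cloud, region, and zone values are illustrative:

```sql
-- Define a tablespace whose replicas are pinned to a single cloud region
-- (cloud/region/zone values below are illustrative placeholders).
CREATE TABLESPACE us_central1_ts WITH (
    replica_placement = '{
        "num_replicas": 3,
        "placement_blocks": [
            {"cloud": "gcp", "region": "us-central1", "zone": "us-central1-a", "min_num_replicas": 1},
            {"cloud": "gcp", "region": "us-central1", "zone": "us-central1-b", "min_num_replicas": 1},
            {"cloud": "gcp", "region": "us-central1", "zone": "us-central1-c", "min_num_replicas": 1}
        ]
    }'
);
```

Partitions attached to this tablespace (as in the CREATE TABLE ... PARTITION OF statements above) then keep their data and replicas within that region.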

What is a GSI (Global Secondary Index) in YugabyteDB?

  • A global secondary index is an index that enables fast queries on non-partition-key attributes, i.e. other columns of the table.
  • It covers all records across partitions or shards, and is independent of the table’s primary partition key.
  • It is used mainly in distributed databases (like YugabyteDB, DynamoDB, Citus, or CockroachDB) to enable fast queries on non-partition-key attributes.
  • e.g. a time-series app partitions data by created_at (monthly) and uses a GSI on user_id to quickly find all events for a given user across months (see the sketch below).
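
A minimal YSQL sketch of a secondary index; the events table and index name are hypothetical:

```sql
-- Hypothetical time-series table, clustered by created_at within each hash shard.
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    created_at TIMESTAMP,
    payload    JSONB,
    PRIMARY KEY (event_id HASH, created_at)
);

-- Secondary index on a non-key column; it spans all tablets of the table,
-- so lookups by user_id do not need to scan every shard.
CREATE INDEX idx_events_user ON events (user_id);

-- Served via the index rather than a full table scan.
SELECT * FROM events WHERE user_id = 1234;
```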

What are YSQL and YCQL?

  • YSQL: YugabyteDB's PostgreSQL-compatible SQL API for relational databases.
  • YCQL: YugabyteDB's Cassandra-compatible CQL (Cassandra Query Language) API for querying NoSQL tables.
  • YSQL and YCQL are isolated and independent APIs. Data written via YSQL cannot be queried via YCQL and vice versa. You can run both YSQL and YCQL workloads on the same YugabyteDB cluster instance.
  • Cassandra and PostgreSQL compatibility makes it easy to migrate applications to YugabyteDB from those databases.
  • Both APIs are separate but run in parallel on top of the same distributed storage layer, supporting both relational and NoSQL data in separate tables.
  • YSQL is for transactional, relational apps (e.g., e-commerce, finance, ERP, user management).
  • YCQL is for real-time NoSQL apps (e.g., telemetry, IoT, shopping carts, analytics).
  • NoSQL/semi-structured data is supported through JSON or JSONB columns, which can be part of a relational table or a separate table.
  • Use YSQL if tables are created and managed via the PostgreSQL-compatible interface. It uses the PostgreSQL-provided operators and functions to query JSON/JSONB attributes (see the sketch after this list).
  • Use YCQL if tables are created via the Cassandra-compatible interface. It uses CQL-like syntax for querying JSON data.
  • Justuno consolidated Cassandra, Neo4j, MS SQL Server, and CockroachDB into a single YugabyteDB cluster to simplify operations. Justuno provides Conversion Rate Optimization (CRO) for e-commerce websites to help retailers increase their revenue.
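
A minimal YSQL sketch of querying JSONB attributes with standard PostgreSQL operators; the orders table and document shape are illustrative:

```sql
-- Hypothetical YSQL table with a JSONB column for semi-structured data.
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    doc      JSONB
);

INSERT INTO orders VALUES (1, '{"status": "shipped", "items": [{"sku": "A1", "qty": 2}]}');

-- PostgreSQL JSONB operators work through the YSQL API.
SELECT doc->>'status'        AS status,
       doc#>>'{items,0,sku}' AS first_sku
FROM orders
WHERE doc->>'status' = 'shipped';
```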

When should we use YugabyteDB for NoSQL use cases over Cassandra?

  • Use it if you need strong consistency and ACID behavior in a NoSQL workload, or if you have larger data volumes (e.g., 5-10 TB per node) than Cassandra comfortably supports (typically 1-2 TB per node).
  • YCQL combines a Cassandra-like API with strong consistency and ACID guarantees, but it does not support distributed (multi-partition) ACID transactions.

What is an LSM tree and how does it work?

  • An LSM (Log-Structured Merge) tree is a data structure primarily used for storing key-value pairs in storage systems.
  • It is optimized for high write throughput by batching writes and writing them to disk sequentially.
  • It uses a tiered structure of in-memory tables and on-disk sorted files, trading some read cost for fast writes.
  • LSM trees accumulate writes in memory, batching them before flushing to disk to minimize the number of disk writes.
  • LSM trees are built as multiple levels. Level 0 is kept in main memory and might be represented using a tree. The on-disk data is organized into sorted runs of data. Each run contains data sorted by the index key. A run can be represented on disk as a single file, or alternatively as a collection of files with non-overlapping key ranges.

Who is using YugabyteDB in industry?

  • Check - https://www.yugabyte.com/success-stories/
  • Paramount+, Kroger, and many others are using YugabyteDB. It's distributed PostgreSQL with on-demand scale and built-in resilience. Kroger has built its e-commerce microservices platform on YugabyteDB.