PostgreSQL or Postgres - kdwivedi1985/system-design GitHub Wiki
-
Postgres is an open-source RDBMS used for data warehousing, analytics, high-traffic web apps, and scientific and academic workloads. It supports vertical, horizontal, and declarative table partitioning.
-
Postgres supports both SQL and NoSQL-style data (JSON, JSONB).
-
Postgres core provides some sharding features using FDW (Foreign Data Wrappers), but many core capabilities like distributed 2PC and parallel foreign scans are missing; foreign scans are sequential at the moment. It is a work in progress - https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding
-
FDW (Foreign Data Wrappers) - This provides a mechanism to query external/sharded Postgres databases. It enables the bulk of query processing to be done on the foreign-server side (the shard), with only the filtered results sent back to the parent node. The push-down capabilities let the shards (i.e. foreign servers) do the heavy lifting, which greatly improves the performance of this feature.
- Distributed joins and transactions are not supported.
- It can push down SELECT, WHERE clauses, joins between tables on the same server, projection (selected columns), aggregation, UPDATE/DELETE, ORDER BY, etc.
- Query execution is sequential at the moment.
- Doesn't help with horizontal write scalability.
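As a sketch of how FDW-based sharding is wired up (the server, database, and table names here are illustrative assumptions, not part of the original notes):

```sql
-- Assumed setup: a remote Postgres shard at shard1.example.com with an
-- orders table; all names here are illustrative.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', port '5432', dbname 'appdb');

CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app_user', password 'secret');

CREATE FOREIGN TABLE orders_shard1 (
    id          BIGINT,
    customer_id BIGINT,
    amount      NUMERIC
) SERVER shard1 OPTIONS (schema_name 'public', table_name 'orders');

-- The WHERE clause and aggregate can be pushed down to the shard, so only
-- the filtered/aggregated result crosses the network.
SELECT count(*) FROM orders_shard1 WHERE customer_id = 50;
```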
-
Postgres Native Declarative Partitioning - e.g. range partitioning by date (a horizontal split of rows by key)
- CREATE TABLE sales ( id SERIAL, region TEXT, sale_date DATE, amount NUMERIC ) PARTITION BY RANGE (sale_date);
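The parent table above stores no rows itself; a sketch of creating the partitions (the partition names and date ranges are illustrative):

```sql
-- The parent "sales" table holds no rows itself; each range partition
-- must be created explicitly (names and date ranges are illustrative).
CREATE TABLE sales_2024 PARTITION OF sales
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE sales_2025 PARTITION OF sales
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- Rows are routed to the matching partition on insert, and queries that
-- filter on sale_date benefit from partition pruning.
INSERT INTO sales (region, sale_date, amount) VALUES ('us', '2024-06-15', 100);
EXPLAIN SELECT * FROM sales WHERE sale_date = '2024-06-15';
```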
-
-
Citus is the most popular open-source extension for sharding PostgreSQL. It transforms PostgreSQL into a distributed database by automatically sharding tables across multiple nodes and supports parallel queries.
- It is also eventually consistent and uses single-phase commit; it doesn't support distributed 2PC.
- Choose Citus over FDW when sharding the database; it is production grade.
- It provides one logical view - we can write plain SQL and Citus automatically manages the shards, whereas FDW requires us to write the remote connections and joins ourselves.
- High write/read throughput and parallel execution.
-
By default Citus supports only hash-based sharding; range-based sharding is not supported.
-
Citus supports distributed joins across shards and nodes if tables are sharded on the same key.
-
Some small reference tables can be replicated to all nodes to support local joins. Local joins have better performance, e.g. a country table: SELECT create_reference_table('country');
- create_reference_table replicates the table to all nodes synchronously, but you should only do this for small master-data tables. When new data is inserted, it is committed synchronously everywhere to maintain consistency.
-
Sharding example
- Distribute orders table by customer_id-
- SELECT create_distributed_table('orders', 'customer_id');
- Query normally-
- SELECT * FROM orders WHERE customer_id = 50;
- Citus automatically routes that query to the worker node holding the shard for customer_id = 50.
-
Let's say we have 4 nodes. One node will be set up as the coordinator/controller; the other 3 will be the worker nodes.
-
Each node requires a Postgres + Citus installation.
-
The coordinator holds the metadata about shards, and the worker nodes store the actual data.
-
By default Citus creates 32 shards, and those are distributed among the 3 worker nodes. e.g.
- node 1: 11 shards (1- 11)
- node 2: 11 shards (12- 22)
- node 3: 10 shards (23- 32)
- It means if shards are created based on a customer_id hash, and data is distributed among shards based on that hash, one node will host about 10 or 11 separate shard tables.
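One way to inspect this placement, assuming a recent Citus release that ships the citus_shards metadata view:

```sql
-- citus_shards is a Citus metadata view (Citus 10+); with 3 workers and the
-- default 32 shards this typically shows an 11/11/10 split.
SELECT nodename, count(*) AS shard_count
FROM citus_shards
GROUP BY nodename
ORDER BY nodename;

-- The default shard count can be changed before distributing a table.
SET citus.shard_count = 32;
```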
-
There are 2 ways to setup Shards in Postgres with Citus:
-
Schema-based sharding: Schema-based sharding is easier to use since you don't need a distribution key. Good for multi-tenant SaaS & microservices. Each tenant has a separate schema with its own set of tables, in the same database. The benefits are cross-shard queries and centralized management (monitoring and backup). e.g. schema tenant1 and schema tenant2, each with its own user, customer, and contact tables.
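A minimal sketch, assuming Citus 12+ (where schema-based sharding was introduced); the schema and table names are illustrative:

```sql
-- Requires Citus 12+; schema and table names are illustrative.
SET citus.enable_schema_based_sharding TO on;

-- Every schema created from now on becomes its own shard group and is
-- placed as a whole on a single node, so intra-tenant joins stay local.
CREATE SCHEMA tenant_acme;
CREATE TABLE tenant_acme.users    (id BIGINT PRIMARY KEY, name TEXT);
CREATE TABLE tenant_acme.contacts (id BIGINT PRIMARY KEY, email TEXT);
```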
-
Row-based sharding: Row-based sharding is the traditional way Citus does sharding and is useful for all use cases, especially real-time analytics, time series, & IoT. The data from all tenants is in the same set of tables. Each table has a shard-key / tenant ID column (or equivalent) which acts as the distribution column.
-
- HA and failover are set up manually through external tools.
- Citus doesn't provide high availability OOTB, but it can be achieved by integrating an HA tool.
- If any worker node goes down, queries hitting that node fail. If the coordinator is down, the whole database is down.
- We can use streaming replication + Patroni or repmgr for replication and failover.
- Keep at least one primary and one standby for the coordinator, and do the same for each worker node.
- It supports both synchronous and asynchronous (default) replication.
- Asynchronous replication is the default and gives faster writes. With WAL (Write-Ahead Log) shipping, changes are sent to the standby after they're written on the primary.
- Synchronous replication - the primary waits for confirmation that the standby has received the WAL before the commit returns. It slows writes.
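These two replication modes can be toggled on the primary; a sketch using ALTER SYSTEM, where 'standby1' is an assumed standby name:

```sql
-- 'standby1' is an illustrative application_name of a standby server.
-- Synchronous: the commit waits until standby1 confirms the WAL.
ALTER SYSTEM SET synchronous_standby_names = 'standby1';
ALTER SYSTEM SET synchronous_commit = on;
SELECT pg_reload_conf();

-- Asynchronous (the default): an empty standby list means commits return
-- as soon as the WAL is flushed locally.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
```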
- Geo/Region based sharding may be required to improve latency, regulatory compliance, and availability.
- Citus only supports hash-based sharding, but geo- or range-based sharding can be achieved by smartly choosing a geo-based hash key, e.g. a region_code such as us, ca, in, be, etc.
- We can achieve geo-sharding manually with following approaches:
-
- Schema-Based Sharding by Region- Different tables in different regions.
-
- Multi-Cluster Approach - Create a completely separate, isolated cluster in each region.
-
-
Custom Shard Placement Using Node Racks and the pg_dist_node table - Assign a logical region or location to each rack using the noderack field, e.g. noderack='us'.
- The pg_dist_node table holds metadata about nodes, e.g. nodename (hostname/IP), groupid, nodeid, etc.
- UPDATE pg_dist_node SET noderack = 'eu' WHERE nodename LIKE 'eu-%';
- UPDATE pg_dist_node SET noderack = 'us' WHERE nodename LIKE 'us-%';
- It's completely custom, using schema creation per region and querying as in the next point.
-
- Choose a Region-Aware Sharding/hash Key
- There is no automatic geo routing; instead routing is based on the WHERE clause - WHERE region_code = 'X'
- CREATE TABLE orders (
- id BIGINT,
- region_code TEXT, -- e.g., 'us-east', 'eu-west'
- customer_id BIGINT,
- amount NUMERIC,
- order_date TIMESTAMP);
- Define the distributed table/shard - SELECT create_distributed_table('orders', 'region_code');
- Select- SELECT * FROM orders WHERE region_code = 'us';
- The value of region_code is hashed to determine the respective shard.
-
- You can also create sub-partitions if there are large tables or sub-regions.
- CREATE TABLE orders_us_east PARTITION OF orders FOR VALUES IN ('us-east');
-
What are the different data types supported by Postgres?
- Postgres supports both SQL and NoSQL-style data (JSON, JSONB). Apart from the regular data types for text, numbers, etc., Postgres supports the following data types:
-
UUID: Universally Unique Identifier - 128-bit unique, random, non-sequential values. In a sharded cluster collisions can happen, but the chances are extremely rare.
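A minimal sketch of a UUID primary key, assuming PostgreSQL 13+ where gen_random_uuid() is built in (the table name is illustrative):

```sql
-- gen_random_uuid() is built in since PostgreSQL 13; older versions need
-- the pgcrypto extension. Table and column names are illustrative.
CREATE TABLE devices (
    id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT
);
INSERT INTO devices (name) VALUES ('sensor-1');
SELECT id FROM devices;  -- e.g. 6b2f9c2e-... (random, non-sequential)
```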
-
Array: Can hold arrays of any non-array type; values inside the array can also be searched.
- CREATE TABLE sal_emp (mobile_number text[], schedule text[][]);
- SELECT mobile_number[1] FROM sal_emp;
-
JSON: json, jsonb (a JSON attribute can be part of a relational table or a separate table itself, e.g. user data in a relational table and preferences and other metadata as JSON in a separate table).
- JSON: Stores JSON data as plain text; it is retrieved as the exact text that was inserted.
- JSONB: Stores JSON data in a decomposed binary format. This allows for efficient indexing and querying, making jsonb the preferred choice for most applications.
- Filters, comparisons, and queries can be made on both json and jsonb. jsonb supports robust comparison and containment operators; json can only be compared as text, which is less reliable.
- CREATE TABLE products (id serial PRIMARY KEY, attributes jsonb);
- INSERT INTO products (attributes) VALUES ('{"color": "red", "size": "M"}'), ('{"color": "blue"}');
- SELECT * FROM products WHERE attributes->>'size' = 'M';
- SELECT * FROM products WHERE attributes @> '{"color": "blue", "size": "L"}';
- JSONB allows you to create custom indexes on JSON attributes. e.g.
- CREATE TABLE customers (id SERIAL PRIMARY KEY, name TEXT, attributes JSONB); -- attributes holds flexible data: custom fields, preferences, etc.
- attributes value- {"loyalty_level": "gold","email_verified": true, "signup_source": "mobile","tags": ["vip", "beta"]}
- Gin Index(Recommended) (map many index entries per row)- CREATE INDEX idx_customers_attrs ON customers USING GIN (attributes jsonb_path_ops);
- B-tree direct index on column (maps one row to one index entry)- CREATE INDEX idx_customers_loyalty ON customers ((attributes->>'loyalty_level'));
-
Binary: bytea- Stores binary data using the BYTEA data type.
-
Monetary: money- Stores currency amounts.
-
Enumerated: enum - a custom data type that consists of a static, predefined set of values with a specific order.
-
Geometric: point -(1.5, 2.0), line- ((1,2),(3,4)), lseg, box, path, polygon, circle- <(1,2),3>
-
Network address: cidr- '192.168.1.0/24', inet, macaddr- '08:00:2b:01:02:03'
-
Special types: tsvector (Lexeme vector for full-text search), tsquery- Text Search Query (SELECT * FROM docs WHERE document @@ to_tsquery('search')), xml, hstore (Key-value pairs (string-to-string)), etc.
- PostgreSQL supports searching XML attributes (via XPath), hstore key-value pairs (with operators and indexes), and full-text search on tsvector columns (using tsquery and indexes).
- tsquery enables complex, efficient full-text queries and indexing.
- PostgreSQL can replace Elasticsearch for many moderate search needs, but Elasticsearch is still better for advanced, large-scale, or distributed search scenarios.
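A sketch of full-text search with a tsvector column, GIN index, and tsquery, assuming PostgreSQL 12+ for the generated column (table and index names are illustrative):

```sql
-- Requires PostgreSQL 12+ for the generated column; names are illustrative.
CREATE TABLE docs (
    id       SERIAL PRIMARY KEY,
    body     TEXT,
    document tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);
CREATE INDEX idx_docs_fts ON docs USING GIN (document);

INSERT INTO docs (body) VALUES ('Postgres full-text search example');

-- @@ matches a tsvector against a tsquery; & is AND, | is OR, ! is NOT.
SELECT id FROM docs WHERE document @@ to_tsquery('english', 'postgres & search');
```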
-
How does geofencing / the spatial feature work in Postgres?
- Geofencing is the process of defining virtual geographic boundaries (geo-fences) and determining whether objects (such as people, vehicles, or devices) are inside or outside these boundaries.
- In Postgres, geofencing is achieved through the PostGIS extension. PostGIS adds extensive spatial data types and functions, making PostgreSQL a popular choice for GIS and geospatial applications.
- PostGIS uses Spatial Index for efficient querying.
- PostGIS uses GiST (Generalized Search Tree) or SP-GiST indexes on geometry/geography columns for fast spatial queries.
- Spatial Functions:
- ST_Contains(geom, point): Checks if a point is inside a polygon or other geometry.
- ST_Within(point, geom): Checks if a point is within a given geometry.
- ST_DWithin(geog1, geog2, radius): Checks if two geometries are within a specified distance (useful for circular geofences).
- ST_Intersects(geom1, geom2): Checks if two geometries overlap
- GiST and SP-GiST indexes are created explicitly by the user using the CREATE INDEX statement.
- CREATE INDEX index_name ON table_name USING GIST (column_name);
- CREATE INDEX index_name ON table_name USING SPGIST (column_name);
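Putting these pieces together, a hedged geofencing sketch (the table, fence, and coordinates are illustrative assumptions):

```sql
-- Assumes the PostGIS extension is available; names and coordinates are
-- illustrative.
CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE geofences (
    id   SERIAL PRIMARY KEY,
    name TEXT,
    area GEOMETRY(POLYGON, 4326)
);
CREATE INDEX idx_geofences_area ON geofences USING GIST (area);

-- Polygon fence: is a device location inside any fence?
-- (ST_MakePoint takes longitude first, then latitude.)
SELECT name
FROM geofences
WHERE ST_Contains(area, ST_SetSRID(ST_MakePoint(-73.99, 40.73), 4326));

-- Circular fence: fences within 500 meters of the point (the geography
-- cast makes the distance unit meters).
SELECT name
FROM geofences
WHERE ST_DWithin(area::geography,
                 ST_SetSRID(ST_MakePoint(-73.99, 40.73), 4326)::geography,
                 500);
```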
-
How do GiST and SP-GiST work?
- GiST (Generalized Search Tree) - GiST is a flexible, general-purpose indexing framework. Internally it works on a balanced tree (e.g. an R-tree), ensuring logarithmic time complexity for insertion, deletion, and search regardless of the initial shape of the tree.
- Nodes can overlap, good for overlapping data.
- Support K-NN (K-Nearest Neighbor) for search.
- Good for large datasets. Fits well for Uber-like use cases.
- SP-GiST (Space-Partitioned Generalized Search Tree)- SP-GiST is designed for partitioned search trees, such as quad-trees, k-d trees, and radix trees (tries). It recursively splits the search space into non-overlapping subdomains, resulting in non-balanced trees.
- SP-GiST makes search faster and more efficient, and it is particularly effective for data that can be naturally partitioned into non-overlapping regions, such as uniformly distributed points or data with natural clustering.
- SP-GiST also support K-NN.
- Fast for smaller or uniformly distributed datasets, but can degrade with large, overlapping data.
-
How do simultaneous updates work in PostgreSQL?
- MVCC ensures each transaction sees a consistent snapshot.
- Versioning means updates create new row versions, preserving history until old versions are vacuumed.
- Row-level locks prevent conflicting concurrent updates, while reads remain non-blocking.
- When a row is updated or deleted in PostgreSQL, the old version (dead tuple) stays in the same table's file on disk and is marked as obsolete.
- PostgreSQL uses a background process called VACUUM to periodically scan tables, identify dead tuples, and free up the space.
- e.g. The first transaction acquires the lock, reads the value, increments it, and writes back. The second transaction waits for the lock; once it acquires the lock, it re-reads the (now updated) value, increments it, and writes back.
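The example above can be sketched as two concurrent sessions (the accounts table is an assumed example, not from the original notes):

```sql
-- Illustrative "lost update" walkthrough with an assumed accounts table.
-- Session 1:
BEGIN;
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- takes the row lock

-- Session 2 runs the same kind of statement now and BLOCKS on the row lock:
--   UPDATE accounts SET balance = balance + 5 WHERE id = 1;

COMMIT;  -- session 1 commits; session 2 wakes up, re-reads the new row
         -- version, applies its +5, and both increments are preserved.
```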
-
Does Postgres support automatic sharding if new nodes are added?
- FDW doesn't support automatic sharding while Citus does.
- Citus allows adding new nodes while the application is live using SELECT master_add_node(nodename, nodeport).
- Rebalancing in Citus doesn't happen automatically; it needs to be triggered explicitly.
- Rebalancing in Citus can be done while the application is running, but it will have a performance impact.
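A sketch of adding a node and triggering a rebalance (the host name is illustrative; in recent Citus releases master_add_node is also exposed as citus_add_node):

```sql
-- Host name is illustrative.
SELECT citus_add_node('worker-4.example.com', 5432);

-- Rebalancing is explicit: this moves shards onto the new worker.
SELECT rebalance_table_shards();
```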