Database theory - klagan/learning GitHub Wiki

Introduction

CAP Theorem

source

CAP Theorem states that only two of three properties can be guaranteed.

This means that in the case of a partition failure e.g. network failure, a distributed data store must decide on one option:

  • maintain consistency at the cost of availability
  • maintain availability at the cost of consistency

cap-theorem

Consistency

  • waiting for all nodes to be consistent before returning data
  • means data will always be current but availability will be lowered

Availability

  • every node returns an immediate response
  • but data may be stale

Partition

  • guarantees the system continues to operate even if it has lost connection to other nodes

Example

source

Consider a scenario where there are two nodes which are accessible by a load balancer and replicate changes between them via any arbitrary means. We have stated the system contains partitions and by definition must support partition tolerance.

Now consider a network communication failure between the nodes. We cannot keep the data in sync so we must decide on whether we choose consistency or availability.

This problem is the root of CAP Theorem. We cannot have both Consistency and Availability in the event of a partition (a loss of communication between nodes) we must choose between the two.

There are generally three types of systems to solve this:

CP - Consistent and partition tolerant

To keep things consistent we could stop all updates to all nodes. This would prevent data from mutating and mean any node that is read from would be in a consistent state but at the expense of availability.

AP - Available and partition tolerant

To keep things available we would allow all nodes to continue operating as normal. This would mean reads could result in stale data on nodes which have not received the synchronised updates. This would be fine for non critical updates like Twitter and the 'like' feature. It would not be a catastrophic event if a Twitter user had a delay on viewing a stale 'like' count on a post.

CA - Consistent and available

To keep things consistent and available it would mean there is no partition tolerance, which means no partitions. This would mean a single database catering for updates and reads. This is normally offset with a replicated failover database in another region.

Database types

There are generally three database types:

  • Relational
  • NoSQL
  • NewSQL

Relational

Relational databases favour consistency and availability and typically scales vertically.

Relational database management systems (RDBMS) can replicate out databases to secondary databases.

  • primary is write operations
  • secondary can have read operations which reduces load on the primary

The moment we add partition tolerance - sharding - we introduce overheads like connectivity. Latency on write or inability to write will cause a write failure

NoSQL

NoSQL databases favour availability and partition tolerance and typically scales horizontally.

Data stored in a NoSQL database will be eventual consistency and highly available. Clients may write to any node and the architecture lowers dependency on any one node which makes it resilient to failures

NewSQL

NewSQL extends RDBMS with the horizontal scaling and performance of NoSQL

Examples of this are:

  • cockroach db
  • tidb
  • yugabytedb
  • Vitessdb

Further reading

source

⚠️ **GitHub.com Fallback** ⚠️