Database theory - klagan/learning GitHub Wiki
CAP Theorem states that only two of three properties can be guaranteed.
This means that in the case of a partition failure e.g. network failure, a distributed data store must decide on one option:
- maintain consistency at the cost of availability
- maintain availability at the cost of consistency

- waiting for all nodes to be consistent before returning data
- means data will always be current but availability will be lowered
- every node returns an immediate response
- but data may be stale
- guarantees the system continues to operate even if it has lost connection to other nodes
Consider a scenario where there are two nodes which are accessible by a load balancer and replicate changes between them via any arbitrary means. We have stated the system contains partitions and by definition must support partition tolerance.
Now consider a network communication failure between the nodes. We cannot keep the data in sync so we must decide on whether we choose consistency or availability.
This problem is the root of CAP Theorem. We cannot have both Consistency and Availability in the event of a partition (a loss of communication between nodes) we must choose between the two.
There are generally three types of systems to solve this:
To keep things consistent we could stop all updates to all nodes. This would prevent data from mutating and mean any node that is read from would be in a consistent state but at the expense of availability.
To keep things available we would allow all nodes to continue operating as normal. This would mean reads could result in stale data on nodes which have not received the synchronised updates. This would be fine for non critical updates like Twitter and the 'like' feature. It would not be a catastrophic event if a Twitter user had a delay on viewing a stale 'like' count on a post.
To keep things consistent and available it would mean there is no partition tolerance, which means no partitions. This would mean a single database catering for updates and reads. This is normally offset with a replicated failover database in another region.
There are generally three database types:
- Relational
- NoSQL
- NewSQL
Relational databases favour consistency and availability and typically scales vertically.
Relational database management systems (RDBMS) can replicate out databases to secondary databases.
- primary is write operations
- secondary can have read operations which reduces load on the primary
The moment we add partition tolerance - sharding - we introduce overheads like connectivity. Latency on write or inability to write will cause a write failure
NoSQL databases favour availability and partition tolerance and typically scales horizontally.
Data stored in a NoSQL database will be eventual consistency and highly available. Clients may write to any node and the architecture lowers dependency on any one node which makes it resilient to failures
NewSQL extends RDBMS with the horizontal scaling and performance of NoSQL
Examples of this are:
- cockroach db
- tidb
- yugabytedb
- Vitessdb