Designing Data Intensive Applications: Distributed Data 2


Transactions

Failures are an integral part of a system: unexpected system, network, OS, or application issues can cause data loss or corruption. Building fault-tolerance mechanisms by hand is a lot of work, so transactions are the preferred way of simplifying these issues. A transaction is a way for an application to group several reads and writes together into a logical unit. The whole set of operations either executes successfully (commit) or fails as a unit if any one operation fails (abort, rollback), after which the application can safely retry its writes. This avoids the complicated problem of partial failure, where some operations succeed and some fail.
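As a concrete illustration, here is a minimal sketch of grouping two writes into one transaction, using Python's built-in sqlite3 module; the accounts table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    # Both writes belong to the same transaction: either both take effect
    # (commit) or neither does (rollback).
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    conn.commit()
except sqlite3.Error:
    conn.rollback()  # undo the partial writes instead of leaving them half-finished
```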

ACID

Acronym for Atomicity, Consistency, Isolation, and Durability

Atomicity

ACID atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes have been processed. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed (committed) due to a fault, then the transaction is aborted and the database must discard or undo any writes it has made so far in that transaction. Atomicity is basically Abortability, in this context.

Consistency

The idea of ACID consistency is that certain statements (invariants) must always be true in your application before and after writes, e.g. credits and debits always balancing in an accounting application. Consistency is more a property of the application than of the database; some invariants the database itself can check are uniqueness and foreign key constraints.

Isolation

Isolation means that concurrent transactions don't step on each other's toes. E.g. two transactions incrementing the same record should behave as if they happened one after another. It is the appearance of serializability even when operations may actually have happened concurrently.

Durability

Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes. In practice, there is no one technique that can provide absolute guarantees. There are only various risk-reduction techniques, including writing to disk, replicating to remote machines, and backups — and they can and should be used together.

Single and Multi-object writes

The guarantees of atomicity and isolation describe transactions comprising multiple writes. This is usually needed when multiple records must be updated together, e.g. an email app with one table tracking unread mails and another table holding the unread-mail count.

Multi-object transactions require some way of determining which read and write operations belong to the same transaction. In relational databases, that is typically done based on the client’s TCP connection to the database server: on any particular connection, everything between a BEGIN TRANSACTION and a COMMIT statement is considered to be part of the same transaction. Alternatively, a unique transaction ID can be used to group the operations.

Single-object writes

Atomicity guarantees also apply to single objects. E.g. a large value is being updated when the network or power fails, or a read starts while the update is in progress and sees partial data. To avoid these situations, storage engines almost universally aim to provide atomicity and isolation at the level of a single object.

Atomicity can be implemented using a log for crash recovery, and isolation can be implemented using a lock on each object, allowing only one thread to access an object at any one time. Some databases also offer atomic operations such as increment or compare-and-set, which act as useful 'lightweight' transactions.

Multi-object writes

Atomicity is also required for multi-object writes. Relational databases have foreign key references that need to be updated in sync, and document databases often have separate documents that refer to each other in a similar way.

Handling errors and aborts

ACID databases work on the philosophy that if the database is in danger of violating its guarantee of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow it to remain half-finished. Databases with leaderless replication usually don't guarantee this and instead work on a best-effort basis: on failure there may be partial writes, and the application has to manage recovery. Although abort-and-retry is popular, it can cause other issues:

  1. If the transaction actually succeeded but the database's acknowledgement didn't reach the application, a retry will perform the same operation twice.
  2. If transactions are failing due to overload, retrying will only make the problem worse (exponential backoff helps, as in the sketch after this list).
  3. If the commit has other side effects, such as sending an email, a retry may re-send that email as well. (Two-phase commit can help here.)
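A hedged sketch of the retry side of this, assuming the database driver signals an aborted transaction with some transient error type; `run_transaction` and `TransientError` are hypothetical names. Exponential backoff with jitter keeps retries from piling extra load onto an already overloaded database.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for whatever error the driver raises on an aborted transaction."""

def retry_with_backoff(run_transaction, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return run_transaction()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Sleep longer after each failure, plus jitter to avoid retry storms.
            time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
```

Note that backoff only addresses the overload case; lost acknowledgements and external side effects (points 1 and 3 above) still need application-level handling.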

Weak Isolation Levels

If two transactions don’t touch the same data, they can safely be run in parallel, because neither depends on the other. Concurrency issues (race conditions) come into play when 2 transactions try to update the same data.

Transaction isolation

Databases have long tried to hide concurrency issues from developers by providing Transaction isolation. Serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially. This is costly though, and many DBs only provide weak isolation and protect against some, not all, types of concurrency issues. The following are some weak isolation levels and problems that can happen with them:

Read Committed

The most basic level of transaction isolation is read committed. It makes two guarantees:

  1. No dirty reads - When reading from the database, you will only see data that has been committed.
  2. No dirty writes - When writing to the database, you will only overwrite data that has been committed.

Recap: Committed here means that all operations in that transaction have happened successfully.

Implementation: Dirty writes are prevented with row-level locks: a transaction that wants to modify an object must acquire a lock on it and hold it until it commits or aborts, so another transaction can only write that object after the first one has finished. Dirty reads could be prevented with the same lock, but one long-running write would then block many readers, making reads too slow. The more popular option is for the database to remember both the old committed value and the new uncommitted value: until the writing transaction commits, reads are served the old value.
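A minimal, radically simplified sketch of that idea for an in-memory key-value store (one pending write per key; all names invented): writers take a per-object lock, while readers are served the last committed value without any locking.

```python
import threading

class ReadCommittedKV:
    """Radically simplified: one value per key, one pending write per key."""
    def __init__(self):
        self._committed = {}      # key -> last committed value
        self._pending = {}        # key -> value written but not yet committed
        self._row_locks = {}      # key -> lock held by the writing transaction
        self._meta = threading.Lock()

    def _lock_for(self, key):
        with self._meta:
            return self._row_locks.setdefault(key, threading.Lock())

    def write(self, key, value):
        self._lock_for(key).acquire()   # blocks other writers: no dirty writes
        self._pending[key] = value

    def read(self, key):
        return self._committed.get(key) # readers see only committed data: no dirty reads

    def commit(self, key):
        self._committed[key] = self._pending.pop(key)
        self._row_locks[key].release()

    def abort(self, key):
        self._pending.pop(key, None)    # discard the uncommitted value
        self._row_locks[key].release()
```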

Snapshot isolation and Repeatable read

While read-committed transaction isolation provides some protection, it doesn't prevent all kinds of concurrency issues. E.g. transferring money from one account to another: under read committed, a read made mid-transfer can see the sending account already debited while the receiving account still shows its old balance; a moment later, re-reading both accounts shows the correct amounts. This anomaly is called read skew or a nonrepeatable read. It can be a problem in cases like:

  1. Backups - if writes continue while a backup is being taken, parts of the backup may reflect older data and parts newer data, so a restore can be inconsistent.
  2. Analytics queries and integrity checks - These may see incorrect values when read skew exists.

Snapshot isolation is the solution to this problem. The idea is that each transaction reads from a consistent snapshot of the database — that is, the transaction sees all the data that was committed in the database at the start of the transaction. This allows backups or analytics queries to operate on consistent data, not something that is constantly changing through the course of a transaction.

Implementation: A key principle of snapshot isolation is readers never block writers, and writers never block readers. Implementations use write locks to prevent dirty writes. Reads can proceed without any locks.

The database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of an object side by side, this technique is known as **Multi-Version Concurrency Control** (MVCC). A typical approach is that read committed uses a separate snapshot for each query, while snapshot isolation uses the same snapshot for an entire transaction.

MVCC based isolation is implemented by assigning an ID to each transaction. Each row in a table has a created_by field, containing the ID of the transaction that inserted this row into the table. Moreover, each row has a deleted_by field, which is initially empty. If a transaction deletes a row, the row isn’t actually deleted from the database, but it is marked for deletion by setting the deleted_by field to the ID of the transaction that requested the deletion. At some later time, when it is certain that no transaction can any longer access the deleted data, a garbage collection process in the database removes any rows marked for deletion and frees their space. An update is internally translated into a delete and a create.

Visibility rules for observing a consistent snapshot

Transaction IDs are used to determine which object versions are visible to a transaction and which are not. The rules are as follows (a sketch of the visibility check follows the list):

  1. At the start of each transaction, the database makes a list of all the other transactions that are in progress (not yet committed or aborted) at that time. Any writes that those transactions have made are ignored, even if the transactions subsequently commit.
  2. Any writes made by aborted transactions are ignored.
  3. Any writes made by transactions with a later transaction ID (i.e., which started after the current transaction started) are ignored, regardless of whether those transactions have committed.
  4. All other writes are visible to the application’s queries.
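A minimal sketch of these visibility rules, assuming each row version carries the created_by/deleted_by transaction IDs described earlier; the class and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class RowVersion:
    created_by: int                    # txid that inserted this version
    deleted_by: Optional[int] = None   # txid that deleted it, if any

def txid_visible(txid: int, current_txid: int,
                 in_progress_at_start: Set[int], aborted: Set[int]) -> bool:
    """Can the current transaction see writes made by `txid`?"""
    if txid in aborted:               # rule 2: writes by aborted transactions are ignored
        return False
    if txid in in_progress_at_start:  # rule 1: still in progress when we started
        return False
    if txid > current_txid:           # rule 3: started after the current transaction
        return False
    return True                       # rule 4: everything else is visible

def row_visible(row: RowVersion, current_txid: int,
                in_progress_at_start: Set[int], aborted: Set[int]) -> bool:
    created_visible = txid_visible(row.created_by, current_txid,
                                   in_progress_at_start, aborted)
    deleted_visible = (row.deleted_by is not None and
                       txid_visible(row.deleted_by, current_txid,
                                    in_progress_at_start, aborted))
    # The row is visible if its creation is visible and its deletion is not.
    return created_visible and not deleted_visible
```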

Indexes and snapshot isolation

One option is to have the index point to all versions of an object and filter out the versions that are not visible to the current transaction. Another option, used in some B-tree based databases, is to copy-on-write the affected pages: any update creates new pages and never modifies existing ones. However, this approach requires compaction and garbage collection.

Preventing Lost Updates

The lost-update problem happens when two concurrent transactions each perform a read-modify-write cycle on the same value. The writes happen one after the other (because of write locking), but the second write does not include the first one's modification, so the earlier update is clobbered. There are a number of solutions:

Atomic write operations

Many databases provide atomic update operations, which remove the need to implement read-modify-write cycles in application code: atomic counters, atomic JSON document updates, priority queue operations, and so on. They are usually implemented by taking an exclusive lock on the object when it is read, so that no other transaction can read it until the update has been applied (a technique known as cursor stability); another option is to force all atomic operations to run on a single thread. ORMs make it easy to accidentally write unsafe read-modify-write cycles instead of using the database's atomic operations.
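For example, a counter increment can be pushed into a single UPDATE statement so the read-modify-write happens inside the database; a sketch using sqlite3, with an invented counters table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('page_views', 0)")

# Unsafe pattern: read in the application, then write back. Two concurrent
# transactions can interleave between the SELECT and the UPDATE and clobber
# each other's increment.
# value = conn.execute("SELECT value FROM counters WHERE key = 'page_views'").fetchone()[0]
# conn.execute("UPDATE counters SET value = ?", (value + 1,))

# Safe pattern: let the database perform the increment atomically.
conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'page_views'")
conn.commit()
```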

Explicit locking

The application can explicitly lock the objects that are going to be updated (for example with SELECT ... FOR UPDATE), forcing any other transaction that wants to read-modify-write the same objects to wait.
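A hedged sketch of explicit locking, assuming a database that supports SELECT ... FOR UPDATE (e.g. PostgreSQL or MySQL); `get_connection` is a hypothetical helper returning a DB-API connection, the pieces table is invented, and the %s paramstyle follows drivers such as psycopg2.

```python
def move_piece(get_connection, game_id, piece_id, new_position):
    conn = get_connection()
    cur = conn.cursor()
    try:
        # FOR UPDATE locks the returned rows until the transaction ends, so a
        # concurrent transaction doing the same read-modify-write must wait.
        cur.execute(
            "SELECT position FROM pieces WHERE game_id = %s AND id = %s FOR UPDATE",
            (game_id, piece_id),
        )
        current_position = cur.fetchone()[0]
        # ... application-level validation of the move goes here ...
        cur.execute(
            "UPDATE pieces SET position = %s WHERE game_id = %s AND id = %s",
            (new_position, game_id, piece_id),
        )
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```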

Automatically detecting lost updates

Instead of locking and forcing a sequential write, we can allow transactions to execute in parallel and, if the transaction manager detects a lost update, abort the transaction and force it to retry its read-modify-write cycle. An advantage of this approach is that databases can perform this check efficiently in conjunction with snapshot isolation. Applications don't have to worry about locking since databases manage it.

Compare-and-set

The purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read it. If the current value does not match what you previously read, the update has no effect, and the read-modify-write cycle must be retried.
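A minimal sketch of compare-and-set as a conditional UPDATE, using sqlite3 with an invented wiki_pages table: the write only takes effect if the content is still what was previously read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki_pages (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("INSERT INTO wiki_pages VALUES (1, 'old text')")
conn.commit()

def update_page(conn, page_id, old_content, new_content):
    cur = conn.execute(
        "UPDATE wiki_pages SET content = ? WHERE id = ? AND content = ?",
        (new_content, page_id, old_content),
    )
    conn.commit()
    # rowcount is 0 if someone else changed the page since it was read; the
    # caller should then re-read and retry the read-modify-write cycle.
    return cur.rowcount == 1
```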

Conflict resolution and replication

In replicated databases, preventing lost updates takes on another dimension. Locks and compare-and-set operations assume that there is a single up-to-date copy of the data. However, databases with multi-leader or leaderless replication usually allow several writes to happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a single up-to-date copy of the data. Thus, techniques based on locks or compare-and-set do not apply in this context. Commutative operations (like increments) can, however, be applied safely in replicated environments.

Write Skew and Phantoms

Write skew is a generalization of the lost-update problem. Write skew can occur if two transactions read the same objects, and then update some of those objects (different transactions may update different objects). In the special case where different transactions update the same object, you get a dirty write or lost update anomaly (depending on the timing). With write skew our options are more limited, since atomic single-object operations don’t help when multiple objects are involved.

This effect, where a write in one transaction changes the result of a search query in another transaction, is called a phantom. Snapshot isolation avoids phantoms in read-only queries, but not in read-write transactions.

Materializing conflicts

A last resort to avoid phantoms is to maintain a table of items that can be locked before updating certain other rows in a db. This approach is called materializing conflicts, because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database.
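A hedged sketch of the idea, assuming a hypothetical conflict_slots table pre-populated with one row per lockable unit and a database that supports SELECT ... FOR UPDATE; the transaction locks the concrete stand-in row before performing its checks and writes elsewhere.

```python
def with_materialized_lock(conn, slot_key, do_writes):
    """Lock the pre-created row that stands in for the phantom, then run the
    checks and writes that the lock is protecting. `conn` is a DB-API connection."""
    cur = conn.cursor()
    cur.execute(
        "SELECT 1 FROM conflict_slots WHERE slot_key = %s FOR UPDATE",
        (slot_key,),
    )
    do_writes(cur)   # any concurrent transaction on the same slot_key now waits
    conn.commit()
```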

Serializability

Serializable isolation is usually regarded as the strongest isolation level. It guarantees that even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency. In other words, the database prevents all possible race conditions. This is achieved using one of 3 techniques:

  1. Literally executing transactions in a serial order
  2. Two-phase locking (2PL) - for decades the standard option
  3. Optimistic concurrency control techniques such as serializable snapshot isolation

Actual Serial Execution

The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to execute only one transaction at a time, in serial order, on a single thread. Although this was long dismissed over performance concerns, it is now considered feasible. Two developments caused the rethink:

  1. RAM became cheap enough that for many use cases it is now feasible to keep the entire active dataset in memory. When all the data a transaction needs to access is in memory, transactions can execute much faster than if they have to wait for data to be loaded from disk.
  2. OLTP transactions are usually short and only make a small number of reads and writes. Long running analytics transactions can be run on a consistent snapshot (using snapshot isolation).

This approach is used in Redis, VoltDB, and Datomic. Throughput is limited to that of a single CPU core, so in order to make the most of that single thread, transactions need to be structured differently from their traditional form.

Encapsulating transactions in stored procedures

With concurrency removed, interactive transactions become the bottleneck: if the application executes one statement, waits for the result over the network, and then issues the next statement, the single database thread spends most of its time waiting. For this reason, systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must submit the entire transaction code to the database ahead of time, as a stored procedure. The challenges with stored procedures are that each database has its own language for them, the code is harder to maintain and test, and a badly written procedure runs on the database thread and can hurt the latency of every other transaction. More recently, these issues have been handled better by using general-purpose languages (for example, VoltDB uses Java or Groovy and Redis uses Lua).

With stored procedures and in-memory data, executing all transactions on a single thread becomes feasible. As they don’t need to wait for I/O and they avoid the overhead of other concurrency control mechanisms, they can achieve quite good throughput on a single thread. It can also be used for replication by executing the same procedure on all replicas.
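A minimal sketch of this model: whole transactions (plain Python callables standing in for stored procedures) are submitted to a queue and executed one at a time by a single worker thread against in-memory data; all names are invented.

```python
import queue
import threading

class SerialExecutor:
    def __init__(self):
        self.data = {}                    # in-memory dataset
        self.jobs = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            procedure, result = self.jobs.get()
            try:
                result["value"] = procedure(self.data)   # no concurrency, no locks
            except Exception as exc:
                result["error"] = exc
            result["done"].set()

    def submit(self, procedure):
        """Submit the whole transaction up front, not statement by statement."""
        result = {"done": threading.Event()}
        self.jobs.put((procedure, result))
        result["done"].wait()
        if "error" in result:
            raise result["error"]
        return result["value"]

# Usage: the entire transfer runs as one unit on the single worker thread.
executor = SerialExecutor()
def transfer(data):
    data["a"] = data.get("a", 100) - 10
    data["b"] = data.get("b", 0) + 10
    return data["a"], data["b"]
print(executor.submit(transfer))
```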

Partitioning

If you can find a way of partitioning your dataset so that each transaction only needs to read and write data within a single partition, then each partition can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own partition, which allows your transaction throughput to scale linearly with the number of CPU cores. However, for any transaction that needs to access multiple partitions, the database must coordinate the transaction across all the partitions that it touches. The stored procedure needs to be performed in lock-step across all partitions to ensure serializability across the whole system. Since cross-partition transactions have additional coordination overhead, they are vastly slower than single-partition transactions. Simple key-value data can often be partitioned very easily, but data with multiple secondary indexes is likely to require a lot of cross-partition coordination.
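A hedged extension of the SerialExecutor sketch above: single-partition transactions are routed by key hash to that partition's own single-threaded executor, so throughput scales with cores. Cross-partition coordination (lock-step execution across partitions) is deliberately not shown.

```python
class PartitionedExecutor:
    def __init__(self, num_partitions=4):
        self.partitions = [SerialExecutor() for _ in range(num_partitions)]

    def submit(self, key, procedure):
        # Route a single-partition transaction to the executor owning its key.
        partition = hash(key) % len(self.partitions)
        return self.partitions[partition].submit(procedure)
```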

Summary of serial execution

Serial execution of transactions has become a viable way of achieving serializable isolation within certain constraints:

  1. Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.
  2. It is limited to use cases where the active dataset can fit in memory. Rarely accessed data could potentially be moved to disk, but if it needed to be accessed in a single-threaded transaction, the system would get very slow.
  3. Write throughput must be low enough to be handled on a single CPU core, or else transactions need to be partitioned without requiring cross-partition coordination.
  4. Cross-partition transactions are possible, but there is a hard limit to the extent to which they can be used.

Two Phase Locking

In 2PL, writers don’t just block other writers; they also block readers and vice versa. Several transactions are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as anyone wants to write (modify or delete) an object, exclusive access is required:

  1. If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts before it can continue. (This ensures that B can’t change the object unexpectedly behind A’s back.)
  2. If transaction A has written an object and transaction B wants to read that object, B must wait until A commits or aborts before it can continue. (Reading an old version of the object is not acceptable under 2PL.)

Implementation of two-phase locking

The blocking of readers and writers is implemented by having a lock on each object in the database. The lock can be held either in shared mode or in exclusive mode, and is used as follows (a minimal sketch follows the list):

  1. If a transaction wants to read an object, it must first acquire the lock in shared mode. Several transactions are allowed to hold the lock in shared mode simultaneously, but if another transaction already has an exclusive lock on the object, these transactions must wait.
  2. If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No other transaction may hold the lock at the same time (either in shared or in exclusive mode), so if there is any existing lock on the object, the transaction must wait.
  3. If a transaction first reads and then writes an object, it may upgrade its shared lock to an exclusive lock. The upgrade works the same as getting an exclusive lock directly.
  4. After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort). This is where the name “two-phase” comes from: the first phase (while the transaction is executing) is when the locks are acquired, and the second phase (at the end of the transaction) is when all the locks are released.
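A minimal sketch of the shared/exclusive lock itself, built on a condition variable; lock upgrades, deadlock detection, and the bookkeeping of which transaction holds which lock until commit are left out.

```python
import threading

class SharedExclusiveLock:
    def __init__(self):
        self.cond = threading.Condition()
        self.readers = 0
        self.writer = False

    def acquire_shared(self):
        with self.cond:
            while self.writer:                         # wait out any exclusive holder
                self.cond.wait()
            self.readers += 1                          # many readers may hold it together

    def release_shared(self):
        with self.cond:
            self.readers -= 1
            if self.readers == 0:
                self.cond.notify_all()

    def acquire_exclusive(self):
        with self.cond:
            while self.writer or self.readers > 0:     # wait for all readers and writers
                self.cond.wait()
            self.writer = True

    def release_exclusive(self):
        with self.cond:
            self.writer = False
            self.cond.notify_all()
```

Under 2PL a transaction would call the release methods only at commit or abort, never earlier.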

Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for transaction B to release its lock, and vice versa. This situation is called deadlock. The database automatically detects deadlocks between transactions and aborts one of them so that the others can make progress. The aborted transaction needs to be retried by the application.

Performance of two-phase locking

The big downside is performance: transaction throughput is significantly lower and query response times are significantly higher under 2PL than under weak isolation. This is due to the overhead of acquiring and releasing locks and, more importantly, the reduced concurrency: if multiple transactions want to operate on the same object, they get queued up and wait. If a transaction is aborted due to a deadlock, it has to redo all of its work. These problems lead to unstable latencies.

Predicate locks

A database with serializable isolation must prevent phantoms, where a write in one transaction changes the result of a search query in another. Predicate locks address this: a predicate lock is similar to a shared/exclusive lock, but instead of belonging to a single object, it covers all objects that match a search condition. The key idea is that the condition applies both to objects that already exist and to objects that may be written in the future.
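A conceptual sketch of a predicate lock table (names invented): readers register a search condition as a predicate function, and a writer checks the object it is about to insert or modify against every registered predicate. Real databases match predicates against query conditions rather than calling arbitrary functions.

```python
class PredicateLockTable:
    def __init__(self):
        self.predicates = []    # (txid, predicate) pairs held by active transactions

    def lock_predicate(self, txid, predicate):
        """A reading transaction registers the condition its query matched."""
        self.predicates.append((txid, predicate))

    def conflicting_transactions(self, obj):
        # A write to `obj` conflicts with any reader whose condition matches it,
        # whether the object existed before or is only now being created.
        return [txid for txid, predicate in self.predicates if predicate(obj)]
```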

Index-range locks

Predicate locks do not perform well when many locks are held by active transactions, because checking writes against every condition becomes time consuming. Most 2PL databases therefore implement index-range locking, which approximates a predicate by locking a larger range of an index (for example, a whole range of keys or time values). This locks more objects than strictly necessary, but it is much cheaper to check and acts as a good compromise.

Serializable Snapshot Isolation (SSI)

2PL doesn't perform or scale well. Weaker isolation levels perform well but have race conditions. Serializable snapshot isolation (SSI) is able to provide full serializability but has only a small performance penalty compared to snapshot isolation. Contrary to Pessimistic concurrency control like serial execution or 2PL, SSI is an Optimistic concurrency control model. It allows transactions to proceed even if something potentially dangerous happens. When a transaction wants to commit, the database checks whether anything bad happened. If isolation was violated, the transaction is aborted and has to be retried. Only transactions that executed serializably are allowed to commit.

This technique performs poorly if there is high contention as it may cause too many aborts. However, if there is enough spare capacity, and if contention between transactions is not too high, optimistic concurrency control techniques tend to perform better than pessimistic ones.

SSI is based on Snapshot isolation, i.e., all reads within a transaction are made from a consistent snapshot of the database. On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among writes and determining which transactions to abort. Additionally, it needs logic to avoid write skew and phantoms. There are two cases to consider:

  1. Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)
  2. Detecting writes that affect prior reads (the write occurs after the read)

Detecting stale MVCC reads

When a transaction reads from its snapshot, it may ignore a write that another transaction has made but not yet committed; if that other transaction later commits, the value that was read is stale. If the reading transaction then wants to commit a write based on this stale value, the database must abort it. It does so by checking, at commit time, whether any of the ignored writes have since been committed. The abort is deferred until commit because the transaction may still turn out to be read-only, in which case no anomaly is possible.

Detecting writes that affect prior reads

Conversely, if a new write renders a value that another transaction previously read stale, that other transaction must be aborted when it tries to commit its own writes. The database can track the transaction IDs that recently read each value and use them as a tripwire: a committing writer notifies those readers that what they read may be out of date, and a notified reader that has also written is aborted at its commit.
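A much-simplified, hedged sketch of that tripwire (names invented): the database records which transactions read each key, flags those readers when a conflicting write commits, and aborts a flagged transaction at commit time if it has performed writes.

```python
class SSITripwire:
    def __init__(self):
        self.readers_by_key = {}   # key -> set of txids that read it
        self.flagged = set()       # txids whose reads have been invalidated

    def record_read(self, txid, key):
        self.readers_by_key.setdefault(key, set()).add(txid)

    def commit_write(self, txid, key):
        # Trip the wire for every concurrent transaction that read this key.
        for reader in self.readers_by_key.get(key, set()):
            if reader != txid:
                self.flagged.add(reader)   # their premise may now be stale

    def can_commit(self, txid, wrote_anything):
        # A read-only transaction commits fine; a flagged read-write one aborts.
        return not (wrote_anything and txid in self.flagged)
```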

Note: the commit is handled as its own step in the transaction; it is not simply the last query that happens to be sent.

Summary

In this chapter, we went particularly deep into the topic of concurrency control. We discussed several widely used isolation levels, in particular read committed, snapshot isolation (sometimes called repeatable read), and serializable. We characterized those isolation levels by discussing various examples of race conditions:

  1. Dirty reads - One client reads another client’s writes before they have been committed. The read committed isolation level and stronger levels prevent dirty reads.
  2. Dirty writes - One client overwrites data that another client has written, but not yet committed. Almost all transaction implementations prevent dirty writes.
  3. Read skew (non-repeatable reads) - A client sees different parts of the database at different points in time. This issue is most commonly prevented with snapshot isolation, which allows a transaction to read from a consistent snapshot at one point in time. It is usually implemented with multi-version concurrency control (MVCC).
  4. Lost updates - Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write without incorporating its changes, so data is lost. Some implementations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (SELECT FOR UPDATE).
  5. Write skew - A transaction reads something, makes a decision based on the value it saw, and writes the decision to the database. However, by the time the write is made, the premise of the decision is no longer true. Only serializable isolation prevents this anomaly.
  6. Phantom reads - A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks.

Three different approaches to implementing serializable transactions:

  1. Literally executing transactions in a serial order
  2. Two-phase locking
  3. Serializable snapshot isolation (SSI)