database - ghdrako/doc_snipets GitHub Wiki
Information - internal
- https://15445.courses.cs.cmu.edu/fall2024/schedule.html - CMU 15-445 Database Systems course
- https://jepsen.io/
- https://dbmsmusings.blogspot.com/ - blog by Daniel Abadi, a database researcher at the University of Maryland, College Park. He explains easily confused terms and trending database topics.
- An Introduction to Distributed Systems
Build
Ranking:
Transactions and isolation
Being absolutely correct on their own, transactions can start behaving incorrectly when run in parallel, because operations belonging to different transactions often get intermixed. There would be no such issues if the database system first completed all operations of one transaction and then moved on to the next one, but the performance of sequential execution would be unacceptably low.
Correct transactions that behave incorrectly when run together result in concurrency anomalies, or phenomena.
When running transactions concurrently, the database must guarantee that the result of such execution will be the same as the outcome of one of the possible sequential executions. In other words, it must isolate transactions from one another, thus taking care of any possible anomalies.
A transaction is a set of operations that takes the database from one correct state to another correct state (consistency), provided that it is executed in full (atomicity) and without being affected by other transactions (isolation). This definition combines the requirements implied by the first three letters of the ACID acronym.
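A minimal sketch (plain Python, a hypothetical in-memory "database" rather than a real DBMS) shows how two transactions that are each correct on their own can lose an update once their read and write steps interleave:

```python
# Two "transactions" each add 100 to the same account.
balance = {"acc": 500}

def read(key):
    return balance[key]

def write(key, value):
    balance[key] = value

# Sequential execution: 500 + 100 + 100 = 700, as expected.
v = read("acc"); write("acc", v + 100)
v = read("acc"); write("acc", v + 100)
assert balance["acc"] == 700

# Interleaved execution: both transactions read the same old value,
# so one increment is overwritten (the "lost update" anomaly).
balance["acc"] = 500
v1 = read("acc")          # T1 reads 500
v2 = read("acc")          # T2 also reads 500
write("acc", v1 + 100)    # T1 writes 600
write("acc", v2 + 100)    # T2 overwrites with 600
print(balance["acc"])     # 600, not 700
```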
The SQL standard specifies four isolation levels. These levels are defined by the list of anomalies that may or may not occur during concurrent transaction execution. So when talking about isolation levels, we have to start with anomalies.
To a great extent, the difference between the standard isolation levels is defined by the number of locks required for their implementation.
- If the rows to be updated are locked for writes but not for reads, we get the Read Uncommitted isolation level, which allows reading data before it is committed.
- If the rows to be updated are locked for both reads and writes, we get the Read Committed level: it is forbidden to read uncommitted data, but a query can return different values if it is run more than once (non-repeatable reads).
- Locking the rows to be read and to be updated for all operations gives us the Repeatable Read level: a repeated query will return the same result.
- The Serializable level poses a problem: it is impossible to lock a row that does not exist yet. This leaves an opportunity for phantom reads: a transaction can add a row that satisfies the condition of a previous query, and this row will appear in the next run of that query.
Lock-based protocols for transaction management were largely replaced by the Snapshot Isolation (SI) protocol. Snapshot isolation minimizes the number of required locks: in fact, a row will be locked only by concurrent update attempts. In all other cases, operations can be executed concurrently: writes never lock reads, and reads never lock anything.
Snapshot isolation is built on multiversion concurrency control (MVCC), which implies that at any moment the database system can contain several versions of one and the same row.
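A minimal MVCC sketch (an illustration, not a real engine; the PostgreSQL-style names `xmin`/`xmax` are borrowed for clarity): each row keeps several versions tagged with the transaction that created and, if any, replaced them, and a reader sees the version that existed at its snapshot.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: str
    xmin: int            # id of the transaction that created this version
    xmax: Optional[int]  # id of the transaction that replaced it, if any

row = [
    Version("v1", xmin=10, xmax=20),    # replaced by transaction 20
    Version("v2", xmin=20, xmax=None),  # current version
]

def visible(versions, xid):
    """Version seen by a transaction whose snapshot was taken at `xid`."""
    for v in versions:
        if v.xmin <= xid and (v.xmax is None or v.xmax > xid):
            return v.value
    return None

assert visible(row, 15) == "v1"  # an old snapshot still sees v1
assert visible(row, 25) == "v2"  # a newer snapshot sees v2
```

Because readers consult their snapshot instead of taking locks, a long-running report never blocks concurrent updates.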
Query optimizer
- https://xuanwo.io/2024/02-what-i-talk-about-when-i-talk-about-query-optimizer-part-1/
- https://news.ycombinator.com/item?id=39176797
WAL
MVCC
B+ Tree
Data cluster
Clustering data means storing consecutively accessed data close together so that accessing it requires fewer I/O operations. Data clusters are very important for database tuning. Indexes allow one to cluster data: the index leaf nodes store the indexed columns in an ordered fashion, so that similar values are stored next to each other. That means that indexes build clusters of rows with similar values.
The index clustering factor is an indirect measure of the probability that two succeeding index entries refer to the same table block. The optimizer takes this probability into account when calculating the cost value of the TABLE ACCESS BY INDEX ROWID operation.
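A rough sketch of how such a factor can be computed (simplified; real optimizers use their own statistics): walk the index in order and count how often the table block of the next entry differs from the previous one. Fewer switches mean better clustering and a cheaper `TABLE ACCESS BY INDEX ROWID`.

```python
def clustering_factor(blocks_in_index_order):
    """Count block switches while scanning table blocks in index order."""
    switches = 0
    prev = None
    for block in blocks_in_index_order:
        if block != prev:   # next index entry points to a different block
            switches += 1
        prev = block
    return switches

# Well clustered: the index order visits each table block once.
assert clustering_factor([1, 1, 1, 2, 2, 3]) == 3
# Poorly clustered: almost every entry jumps to another block.
assert clustering_factor([1, 2, 1, 3, 2, 3]) == 6
```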
Some databases can indeed use an index as primary table store. The Oracle database calls this concept index-organized tables (IOT), other databases use the term clustered index. In this section, both terms are used to either put the emphasis on the table or the index characteristics as needed.
An index-organized table is thus a B-tree index without a heap table. This results in two benefits: (1) it saves the space for the heap structure; (2) every access on a clustered index is automatically an index-only scan. Both benefits sound promising but are hardly achievable in practice.
The drawbacks of an index-organized table become apparent when creating another index on the same table. Analogous to a regular index, a so-called secondary index refers to the original table data—which is stored in the clustered index. There, the data is not stored statically as in a heap table but can move at any time to maintain the index order. It is therefore not possible to store the physical location of the rows in the index-organized table in the secondary index. The database must use a logical key instead.
Accessing a secondary index does not deliver a ROWID but a logical key for searching the clustered index. A single access, however, is not sufficient for searching clustered index—it requires a full tree traversal. That means that accessing a table via a secondary index searches two indexes: the secondary index once (INDEX RANGE SCAN), then the clustered index for each row found in the secondary index (INDEX UNIQUE SCAN).
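The double lookup can be sketched with dicts standing in for B-trees (table and column names are illustrative only): the secondary index yields logical keys, and each one triggers a second search in the clustered index.

```python
clustered_index = {          # primary key -> full row (the table itself)
    1: ("alice", "admin"),
    2: ("bob", "user"),
    3: ("carol", "user"),
}
secondary_index = {          # role -> list of primary keys (logical keys)
    "admin": [1],
    "user": [2, 3],
}

def lookup_by_role(role):
    # INDEX RANGE SCAN on the secondary index ...
    pks = secondary_index.get(role, [])
    # ... then one INDEX UNIQUE SCAN in the clustered index per key found.
    return [clustered_index[pk] for pk in pks]

assert lookup_by_role("user") == [("bob", "user"), ("carol", "user")]
```

With a heap table the secondary index would store physical ROWIDs instead, so each hit would cost a single block access rather than a full tree traversal.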
Tables with one index only are best implemented as clustered indexes or index-organized tables.
Tables with more indexes can often benefit from heap tables. You can still use index-only scans to avoid the table access. This gives you the select performance of a clustered index without slowing down other indexes.
| DBMS | Heap table | Index-organized / clustered table |
|---|---|---|
| IBM DB2 | + | - |
| Oracle | + | + (IOT) |
| PostgreSQL | + | - (CLUSTER clause only reorders the heap) |
| MySQL | + (MyISAM engine) | + (InnoDB engine) |
| SQL Server | + (use NONCLUSTERED clause in the PK) | + (default) |
By default SQL Server uses clustered indexes (index-organized tables) using the primary key as clustering key. Nevertheless you can use arbitrary columns for the clustering key—even non-unique columns.
To create a heap table you must use the NONCLUSTERED clause in the primary key definition:
```sql
CREATE TABLE (
  id NUMBER NOT NULL,
  [...]
  CONSTRAINT pk PRIMARY KEY NONCLUSTERED (id)
)
```
Dropping a clustered index transforms the table into a heap table.
SQL Server’s default behavior often causes performance problems when using secondary indexes.
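The same heap-vs-clustered choice can be tried out in a runnable form with SQLite (not mentioned above, so this is an analogue, not SQL Server syntax): an ordinary SQLite table is rowid-organized, while `WITHOUT ROWID` stores rows directly in the primary-key B-tree, i.e. an index-organized table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Ordinary table: rows live in a rowid b-tree (heap-like storage).
con.execute("CREATE TABLE heap_t (id INTEGER PRIMARY KEY, v TEXT)")
# WITHOUT ROWID: rows are stored in the primary-key b-tree itself,
# the SQLite equivalent of an index-organized (clustered) table.
con.execute(
    "CREATE TABLE clustered_t (id INTEGER PRIMARY KEY, v TEXT) WITHOUT ROWID"
)
con.executemany("INSERT INTO clustered_t VALUES (?, ?)", [(2, "b"), (1, "a")])
rows = con.execute("SELECT id, v FROM clustered_t ORDER BY id").fetchall()
print(rows)  # [(1, 'a'), (2, 'b')]
```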
PostgreSQL only uses heap tables.
You can, however, use the CLUSTER clause to align the contents of the heap table with an index.