1.3 Storage and retrieval - KeynesYouDigIt/Knowledge GitHub Wiki
from Designing-Data-Intensive-Applications
Data Structures That Power Your Database
- Hash indexes are the simplest case: an in-memory hash map from each key to a location in the data file. The hash function only needs to spread keys evenly; it's not a hash like in cryptography.
- The hash map stores byte offsets: to read a value, seek to the offset and start reading there rather than scanning from 0 every time. For example:

| key | offset |
|-----|--------|
| 1   | 0      |
| 2   | 64     |
| 13  | 568    |
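The idea above can be sketched in a few lines. This is a minimal toy in the style of such hash-indexed logs (class and file names are illustrative, not a real database API): every write appends to a log file, and an in-memory dict remembers the byte offset of the latest record for each key.

```python
# Toy hash-indexed append-only log: an in-memory dict of byte offsets
# over a log file. Names are illustrative, not a real API.
import os
import tempfile

class HashIndexedLog:
    def __init__(self, path):
        self.path = path
        self.index = {}              # key -> byte offset of the latest record
        open(path, "ab").close()     # make sure the log file exists

    def set(self, key, value):
        self.index[key] = os.path.getsize(self.path)  # record starts at EOF
        with open(self.path, "ab") as f:              # append-only write
            f.write(f"{key},{value}\n".encode())

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])                   # jump straight to the record
            _, value = f.readline().decode().rstrip("\n").split(",", 1)
            return value

path = os.path.join(tempfile.mkdtemp(), "segment.log")
db = HashIndexedLog(path)
db.set("1", "a")
db.set("2", "b")
db.set("1", "c")        # newer record supersedes the old offset
print(db.get("1"))      # prints "c"
```

Old records for overwritten keys stay in the file as garbage; compaction (below) is what reclaims them.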
SSTables / LSM trees
- Sorted String Table / Log-Structured Merge tree.
- A key-value store index that keeps sorted segment files and merges them in the background over time.
- LevelDB, RocksDB, Cassandra, and HBase use this newer approach to indexed data storage.
- Extremely fast writes: no index restructuring happens in real time (compaction runs in the background).
- Smaller, simpler databases are very easy to deal with.
- Lower predictability, because compaction takes place in the background rather than as part of the transaction.
- As the database gets bigger, compaction gets more complicated. This is less true of B-trees.
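The background merge step can be sketched as a plain sorted-list merge. This is a simplification (real engines stream segments from disk); segments are modeled as sorted lists of (key, value) pairs, and when a key appears in both, the newer segment's value wins:

```python
# Sketch of SSTable compaction: merge two sorted segments, newer wins.
def compact(older, newer):
    out, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            out.append(older[i]); i += 1
        elif older[i][0] > newer[j][0]:
            out.append(newer[j]); j += 1
        else:                             # same key: keep the newer value
            out.append(newer[j]); i += 1; j += 1
    out.extend(older[i:])                 # one side is exhausted; copy the rest
    out.extend(newer[j:])
    return out

old_segment = [("a", 1), ("c", 3), ("d", 4)]
new_segment = [("b", 9), ("c", 30)]
print(compact(old_segment, new_segment))
# [('a', 1), ('b', 9), ('c', 30), ('d', 4)]
```

Because both inputs are sorted, the merge is a single sequential pass, which is why compaction is cheap even for large segment files.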
B trees
- B-trees are older and more tried and true.
- These are the most widely used indexes and the default on most RDBMSs.
- Each page of the index holds sorted keys that define ranges, and each range holds a reference to a child page. E.g. a page `1 | x | 5 | y | 10` means child page x covers keys between 1 and 5 (2, 3, 4) and child page y covers keys between 5 and 10.
- Most databases fit in a B-tree that is three or four levels deep.
- Usually, LSM-trees are faster for writes; B-trees are faster for reads.
- But this is highly dependent on the data and workload; sometimes experimenting with both is a good idea.
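The page-descent described above can be sketched with an in-memory stand-in for pages (a real B-tree reads fixed-size pages from disk and also handles inserts and splits, which are omitted here):

```python
# Sketch of B-tree lookup: descend interior pages by key range
# until a leaf page holds the value.
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted boundary keys
        self.children = children  # child pages (interior page only)
        self.values = values      # values (leaf page only)

def lookup(page, key):
    while page.children is not None:                 # descend interior pages
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)           # search within the leaf
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

leaf1 = Page([2, 3, 4], values=["two", "three", "four"])
leaf2 = Page([7, 9], values=["seven", "nine"])
root = Page([5], children=[leaf1, leaf2])            # keys < 5 left, >= 5 right
print(lookup(root, 3))   # prints "three"
```

Each level multiplies fan-out by the branching factor, which is why three or four levels cover most databases.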
Other
- Heap file indexes can be helpful.
  - Covering indexes store a key plus several values in one file, then point to the heap for the rest of the row information.
- Multi-dimensional indexes are useful for multi-dimensional queries (like geospatial data: latitude & longitude).
  - R-trees are common for this. PostGIS uses GiST to build R-trees.
- Fuzzy indexes for things like full-text search are also important.
  - Lucene is an example of an index that supports search within a given edit distance. It uses finite state automata to pull this off.
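Edit distance itself is just the Levenshtein metric. The classic dynamic-programming version below shows what is being measured (Lucene computes this with automata over a trie rather than a DP table):

```python
# Levenshtein edit distance: minimum number of single-character
# insertions, deletions, and substitutions to turn a into b.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(edit_distance("hello", "hallo"))    # prints 1
```

A fuzzy index answers "all indexed terms within distance k of this query term" without scanning every term, which is the hard part the automaton approach solves.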
- In-memory databases are faster mainly because they avoid the overhead of encoding their data structures into a form that can be written to disk, not simply because they read from memory instead of disk.
Transaction Processing or Analytics?
A transaction refers to a read or write that is performed on demand, rather than processed in batch.
Online transaction processing (OLTP) vs. online analytic processing (OLAP).
This is a spectrum more than a black-and-white thing:

| | OLTP | OLAP |
|---|---|---|
| Records per query | few | many |
| Latency/downtime tolerance | lower | higher |
| Used to | record decisions | make decisions |
| Primarily serves | the user | the organization |
| Typical system | a "Database" | a "Data Warehouse" |
Data Warehousing
- Using separate systems for OLTP and OLAP accommodates their different priorities and eliminates managing resource competition.
  - Sales is reporting that customers can't buy! Is an analyst running a big query?
  - The analyst team won't have the report ready, because a bunch of customers or bots just logged in...
- Most DWs are read-only from the user's perspective (storing derived data is a different matter).
- DWs tend to still be schematized and queried with SQL; data lakes are unschematized/unstructured.
- DW examples
  - Attempts at BOTH DB/DW: MS SQL Server, SAP HANA (more DW)
  - DW focused: AWS Redshift, Teradata, Vertica, Google BigQuery, FB Presto, Cloudera Impala
    - Apache has a few: Hive, Spark SQL, Tajo, Drill
Stars and Snowflakes: Schemas for Analytics
- Star/Snowflake Schemas (AKA dimensional modeling)
- Fact table holds events as rows
  - some event columns are "attributes" - plain values in a single column (eg "price")
  - others are "dimensions" - foreign key refs to other tables (eg "customers")
- "Star" - everything proceeds from the Fact table as a point of origin
- "Snowflake" - same shape, but dimensions are further normalized into sub-dimension tables (> 2 layers)
Event tables AND dim tables can be 100s of columns wide (& millions of rows long)
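A tiny in-memory stand-in makes the shape concrete (table and column names here are made up for illustration): the fact table holds one row per sale event, with an attribute column and two dimension foreign keys, and a typical OLAP query joins out to a dimension and aggregates.

```python
# Sketch of a star schema: fact rows reference dimension tables by key.
customers = {1: {"name": "Ada"}, 2: {"name": "Lin"}}   # dimension table
products  = {10: {"sku": "tea"}, 11: {"sku": "mug"}}   # dimension table

fact_sales = [  # one row per event; "price" is an attribute,
                # the *_id columns are dimensions (foreign keys)
    {"customer_id": 1, "product_id": 10, "price": 3.50},
    {"customer_id": 2, "product_id": 10, "price": 3.50},
    {"customer_id": 1, "product_id": 11, "price": 8.00},
]

# Typical OLAP query: total spend per customer name.
totals = {}
for row in fact_sales:
    name = customers[row["customer_id"]]["name"]
    totals[name] = totals.get(name, 0) + row["price"]

print(totals)   # prints {'Ada': 11.5, 'Lin': 3.5}
```

In a snowflake schema, `products` would itself reference further tables (e.g. a brand or category table) instead of holding everything inline.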
Column oriented storage
- Some argue that COS replaces star-schema-like systems
- fact tables are very wide, but a typical OLAP query only needs a few columns at a time.
- in COS, each column is stored in its own file, and rows are derived by matching column files together (ORDER MATTERS: entry i of every column file belongs to row i)
- amazing for big reads, slow for writes.
- in-place updates (as in B-trees) are basically impossible; updates are hard due to compression.
- LSM-trees are more feasible.
- for the cost of slower writes, we get efficient reads (only read the columns you need, highly customizable to attributes of interest)
- Writes should be a background ETL anyway
- Compression is easy
- Compressibility is inversely related to the number of distinct values and the flatness of their distribution (a tall bell curve over few distinct values compresses very well with a bitmap)
- Explored in depth in this paper, cited in book
- COS is not to be confused with Column Families from the Bigtable data model (used in Cassandra and HBase)
- COS is more poised to take advantage of modern CPU architecture, via "vectorized processing"
- This is partially because less data and less logic is loaded at once. tight, simple loops can run inside of a CPU's L1 cache.
- you should sort by frequent attributes of interest, with the frequent values of interest first (if querying data by last month is common, sort by date descending)
- compression is strongest on the first column in the sort order, with the effect weakening across later sort attributes.
- C-Store - which became Vertica - introduced the idea of multiple alternate sort orders on different nodes. Nodes serve as backups to each other AND as alternate sort orders for different types of queries.
- Vertica fronts the COS with an in-memory store, which writes to the COS in the background. This way "new" data is still available, and just takes a bit longer to join to the older data. See C-Store 7 Years Later
- Being high read low write (or on-demand read, eventual write) DWs are a great place to use Materialized Views (cached query results that are updated in the background as configured).
- OLAP Cubes extend this idea and store data aggregations across common dimensions.
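The cube idea can be sketched as precomputing sums for every combination of dimensions, using `None` as a stand-in for "all values of this dimension" (dimension names and data here are illustrative):

```python
# Sketch of an OLAP-cube-style pre-aggregation over two dimensions.
from collections import defaultdict

sales = [  # (month, product, amount)
    ("2024-01", "tea", 5), ("2024-01", "mug", 2),
    ("2024-02", "tea", 7), ("2024-02", "tea", 1),
]

cube = defaultdict(float)
for month, product, amount in sales:
    cube[(month, product)] += amount   # finest-grained cell
    cube[(month, None)]    += amount   # rollup across products
    cube[(None, product)]  += amount   # rollup across months
    cube[(None, None)]     += amount   # grand total

print(cube[("2024-02", "tea")])   # prints 8.0
print(cube[(None, "tea")])        # prints 13.0
print(cube[(None, None)])         # prints 15.0
```

Like a materialized view, the cube trades write-time work and storage for instant answers to the aggregations it anticipated; queries outside those dimensions still hit the base tables.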