1.3 Storage and retrieval

from Designing-Data-Intensive-Applications

Data Structures That Power Your Database

  • Hash indexes make up the core. The "hash" here is the kind used in a hash map to locate data by key; it's not a hash like in cryptography.

  • Hash tables are just byte offsets into the data file (seek to the offset and start reading there, rather than scanning from 0 every time):

    | key | offset |
    |-----|--------|
    | 1   | 0      |
    | 2   | 64     |
    | 13  | 568    |
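A minimal sketch of this idea in Python (the file name and CSV-ish record format are my own assumptions; a real store like Bitcask also rebuilds the index on startup and compacts the log):

```python
class HashIndexedLog:
    """Append-only log file plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path="data.log"):
        self.path = path
        self.offsets = {}              # key -> byte offset of its latest record
        open(self.path, "ab").close()  # make sure the log file exists

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()          # byte position where this record starts
            f.write(record)
        self.offsets[key] = offset     # newest write wins

    def get(self, key):
        offset = self.offsets.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)             # jump straight to the record, not byte 0
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexedLog()
db.set("user:1", "alpha")
db.set("user:1", "alpha-v2")           # old record stays in the log until compaction
print(db.get("user:1"))                # -> alpha-v2
```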

SSTables / LSM trees
  • Sorted String Table / Log-Structured Merge tree.
  • A key-value index built on an append-only log whose segments are merged and compacted in the background over time.
  • Used by LevelDB, RocksDB, Cassandra, and HBase. This is a newer approach to indexing data storage.
  • Extremely fast writes; no index processing happens in real time (it's done in the background).
  • Smaller, simpler databases are very easy to deal with.
  • Lower predictability, because compaction takes place in the background rather than as part of the transaction.
  • As the database gets bigger, compaction gets more complicated. This is less true of B-trees. A toy sketch of the LSM idea follows this list.
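A toy sketch, assuming in-memory lists stand in for segment files (real systems flush the memtable to disk and run compaction in a background thread):

```python
import bisect

class ToyLSM:
    """Writes hit an in-memory memtable; full memtables freeze into sorted
    segments (SSTable-like); compaction merges segments, newest value wins."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}       # recent writes
        self.segments = []       # sorted [(key, value), ...] lists, newest first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value          # write cost: one dict insert
        if len(self.memtable) >= self.memtable_limit:
            self.segments.insert(0, sorted(self.memtable.items()))  # "flush"
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # check the freshest data first
            return self.memtable[key]
        for seg in self.segments:           # then newest segment to oldest
            i = bisect.bisect_left(seg, (key,))
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None

    def compact(self):
        """The background merge step: fold all segments into one."""
        merged = {}
        for seg in reversed(self.segments): # oldest first, so newer overwrites
            merged.update(seg)
        self.segments = [sorted(merged.items())]
```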
B-trees
  • B-trees are older and more tried and true.

  • These are the most widely used indexes and the default on most RDBMSs.

  • Each page of the index holds sorted boundary keys with references to child pages covering the key ranges between them, e.g. a root page 1 [x] 5 [y] 10 where child page x covers keys 2, 3, 4 (see the sketch after this list).

  • Most databases can fit into a B-tree that is three or four levels deep.

  • Usually, LSM-trees have faster writes and B-trees have faster reads.

  • But this is highly dependent on the data and workload; sometimes experimenting with both is a good idea.
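A stripped-down sketch of the B-tree read path, assuming pages are already built in memory (real B-trees also split and rebalance pages on write):

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted boundary keys (internal) or row keys (leaf)
        self.children = children    # len(keys) + 1 child pages, internal pages only
        self.values = values        # values aligned with keys, leaf pages only

def lookup(page, key):
    while page.children is not None:          # descend a few levels of pages
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)    # binary-search the leaf page
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# the "1 [x] 5 [y] 10" page from the note, simplified to one boundary key:
leaf_x = Page(keys=[2, 3, 4], values=["two", "three", "four"])  # keys below 5
leaf_y = Page(keys=[6, 8], values=["six", "eight"])             # keys 5 and up
root = Page(keys=[5], children=[leaf_x, leaf_y])
print(lookup(root, 3))   # -> "three", found in two page reads
```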

Other
  • Indexes that point into heap files (where the rows themselves live) can be helpful.

    • Covering indexes store the key plus several frequently read column values in one file, then point to the heap for the rest of the row's information.
  • Multi-dimensional indexes are useful for multidimensional queries (like geospatial data: lat & long).

    • R-trees are common for this. PostGIS uses GiST indexes to produce R-trees.
  • Fuzzy indexes for things like full-text search are also important.

    • Lucene is an example of an index that supports searching within an edit distance; it uses finite state automata to pull this off. A sketch of the underlying metric follows this list.
  • In-memory databases are faster because they do not have to encode their data for the disk, not because they exist in memory as opposed to the disk (the OS often caches disk data in memory anyway).

  • future? -> https://www.sciencedirect.com/book/9780857098030/advances-in-non-volatile-memory-and-storage-technology
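As a reference point for the fuzzy-index bullet above, here is the classic dynamic-programming definition of Levenshtein edit distance. This is just the metric; Lucene avoids computing it term-by-term by compiling queries into automata:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))     # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # -> 3
```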

Transaction Processing or Analytics?

A transaction refers to a read or write that is performed on demand, rather than processed in batch.

Online transaction processing (OLTP) vs. online analytic processing (OLAP).

This is a spectrum more than a black-and-white thing:

|                             | OLTP             | OLAP               |
|-----------------------------|------------------|--------------------|
| Records per query           | few              | few to many        |
| Latency/downtime tolerance  | lower            | higher             |
| Used to                     | record decisions | make decisions     |
| Usually for                 | the user         | the organization   |
| Typical system              | a "Database"     | a "Data Warehouse" |

Data Warehousing

  • Using separate systems for OLTP and OLAP accommodates their different priorities and eliminates the need to manage resource competition.

    • Sales is reporting that customers can't buy! Is an analyst running a big query?
    • The analyst team won't have the report ready, because a bunch of customers or bots just logged in...
  • Most DWs are read-only (if derived data needs to be stored, that's different).

  • DWs tend to still be schematized/SQL; data lakes are unschematized/unstructured.

  • DW examples

    • Attempts at BOTH DB/DW
      • MS SQL Server, SAP HANA (more DW)
    • DW Focused
      • AWS Redshift, Teradata, Vertica, Google BigQuery, FB Presto, Cloudera Impala
      • Apache has a few: Hive, Spark SQL, Tajo, Drill
Stars and Snowflakes: Schemas for Analytics
  • Star/Snowflake Schemas (AKA dimensional modeling)
    • Fact table holds events as rows
      • some event columns are "attributes" - single value columns (e.g. "price")
      • others are "dimensions" - foreign key refs to other tables (e.g. "customers")
      • "Star" - everything proceeds from the Fact table as a point of origin
      • "Snowflake" - same idea, but the dimensions are further normalized into more than 2 layers of tables (a miniature sketch follows below)

Event tables AND dim tables can be 100s of columns wide (& millions of rows long)
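A hypothetical star schema in miniature (table and column names are made up): the fact table holds attribute columns directly, while dimension columns are foreign keys resolved with one hop to each dimension table:

```python
# dimension tables: one row per customer / product, keyed by id
dim_customers = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
dim_products = {10: {"sku": "BOOK-42", "category": "books"}}

# fact table: one row per sale event
fact_sales = [
    # (customer_fk, product_fk, price, quantity) <- fks are "dimensions",
    (1, 10, 9.99, 2),                           #   price/quantity "attributes"
    (2, 10, 9.99, 1),
]

# resolving a fact row is one hop from the center of the "star" per dimension
for customer_fk, product_fk, price, quantity in fact_sales:
    print(dim_customers[customer_fk]["name"],
          dim_products[product_fk]["sku"],
          price * quantity)
```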

Column oriented storage
  • Some argue that COS replaces star-schema-like systems.
  • Many OLAP queries only require a few columns at a time, but analytics in general requires many columns.
  • In COS, all values from one column are stored together in a column file, and rows are derived by matching column files up by position (ORDER MATTERS). A miniature sketch follows this list.
  • Amazingly fast big reads, slow writes.
    • B-trees are basically impossible; in-place updates are hard due to compression.
    • LSM-trees are more feasible.
  • For the cost of slower writes, we get efficient reads (only read the attributes you need; highly customizable to attributes of interest).
    • Writes should be a background ETL anyway.
  • Compression is easy (e.g. bitmap and run-length encoding).
  • COS is not to be confused with Column Families from the Bigtable data model (used in Cassandra and HBase)
  • COS is more poised to take advantage of modern CPU architecture, via "vectorized processing".
    • This is partially because less data and less logic is loaded at once; tight, simple loops can run inside of a CPU's L1 cache.
  • You should sort by frequent attributes of interest, with the frequent values of interest first (if querying data by last month is common, sort by date descending).
  • Compression is strongest on the first attribute in the sort order, with the effect weakening across subsequent sort attributes.
  • C-Store - which became Vertica - introduced the idea of multiple alternate sorts on different nodes. Nodes serve as backups to each other AND as alternate sorts for different types of queries.
  • Vertica fronts the COS with an in-memory DB, which writes to the COS in the background. This way "new" data is still available; it just takes a bit longer to join to the older data. See "C-Store 7 Years Later".
  • Being high-read/low-write (or on-demand read, eventual write), DWs are a great place to use materialized views (cached query results that are updated in the background as configured).
  • OLAP Cubes extend this idea and store data aggregations across common dimensions.
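A miniature sketch of the column-oriented layout (data and names are made up): each column is its own list standing in for a column file, rows line up by position, and run-length encoding shows why the first sort column compresses best:

```python
# each column is stored contiguously; row i is index i of every column
columns = {
    "date":     ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "product":  ["apple", "apple", "pear", "apple", "pear"],
    "quantity": [3, 1, 7, 2, 5],
}

def run_length_encode(column):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# an OLAP-style query touches only the column files it needs
january_total = sum(q for d, q in zip(columns["date"], columns["quantity"])
                    if d == "2024-01")
print(january_total)                        # -> 11
# sorting by date first gives it the longest runs, hence the best compression
print(run_length_encode(columns["date"]))   # -> [['2024-01', 3], ['2024-02', 2]]
```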