Designing Data Intensive Applications: Foundations of Data Systems - gouthamv03/notes GitHub Wiki
Topics
Data models
| Relational | Document |
|---|---|
| Schema on write | Schema on read |
| A relational DB has a primary data row that references other tables. E.g. a LinkedIn profile points to a separate Location table via an id, and to a separate Recommendations table, each with a primary key pointing back to the original row. The first keeps an abstraction where Location data can change globally for everyone pointing to it; the second organizes multiple data points under a separate table rather than keeping them with the primary table. Models exist where data can be inserted as a blob in the primary table, but searching and indexing are usually hard there | The document model represents data as a blob, in JSON, XML or other formats. All data is usually nested within the original blob. References can be made to other document collections with an id, but combining these is a job for the application |
| Not straightforward to use in code | Relatively straightforward to use in code |
| Complex joins handled in db | Application handles joins |
| Relational db can split it into smaller units and keep data contiguous with special features | Whole object is stored contiguously and loaded fully always. For very large documents, pulling the whole file for a small update is wasteful. So it makes sense to keep smaller docs where needed |
| Declarative queries and languages - makes it easier, db can optimize, query stays the same | Imperative querying - which is specific code on client side |
| Declarative queries can be parallelized | Imperative queries are running on one core usually in series |
The distinction between declarative and imperative applies to the UI world as well. E.g. CSS styling (declarative) lets the browser handle styling; Javascript (imperative) can do the same thing, but it handles the style in more complicated code. Leaving styling to the browser, as with CSS, is easier and lets the browser drive workflow and optimizations.
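The same contrast can be sketched in code. The dataset and function names below are purely illustrative; the SQL string stands in for any declarative query a database engine is free to optimize, parallelize, or reorder:

```python
# Hypothetical "animals" dataset, used only for illustration.
animals = [
    {"name": "shark", "family": "Sharks"},
    {"name": "salmon", "family": "Salmonidae"},
    {"name": "hammerhead", "family": "Sharks"},
]

# Imperative: we spell out HOW to walk the data, step by step.
def get_sharks_imperative(rows):
    sharks = []
    for row in rows:
        if row["family"] == "Sharks":
            sharks.append(row["name"])
    return sharks

# Declarative: we state WHAT we want; the engine decides how.
# The engine may use an index, scan in parallel, etc. - the query text
# stays the same either way.
DECLARATIVE_QUERY = "SELECT name FROM animals WHERE family = 'Sharks'"

print(get_sharks_imperative(animals))  # ['shark', 'hammerhead']
```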
Graph DB - when many-to-many relationships are one too many. Data modeled this way usually also has applications that use popular graph algorithms to find solutions. Graph is made of vertices and edges. These vertices can be of any type making the graph db super-flexible. Triple-stores do the same thing with (subject, predicate, object). Several query languages to make declarative queries on graph DBs.
Overall each data model has its own use-cases and applications. Important to learn what to pick.
Storage and Retrieval - what the db does
Databases are usually
- Log-Structured storage engines
- Page-Oriented storage engines
Log-Structured storage engines
Think of it as data stored in a file as key, value string pairs. A write is super-quick: just append to the file; the newest value for a key is its last entry. Reads get slower with larger data sets. To handle this, we maintain an index file - a signpost to where data can be found. This makes writes a bit slower, as the index is usually updated during the write.
Hash Indexes
The index can be a hashmap from key to the offset where its value can be found. The index can be held in memory, and this works well when the number of keys is small enough to fit in memory. It suits applications with frequent writes to the same key. To avoid the database log file running out of disk space from constant appends, we can:
- Break the file into segments once a certain size is reached
- Run compactions where older duplicate key:value pairs are removed leaving only the newest entries, potentially merging segments
- Run compaction in a background thread to avoid slowing down read/write performance of the DB. During read we start from the latest segment and check backwards. Good compaction strategy will reduce number of segments to run through before finding a key.
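A minimal in-memory sketch of this idea, loosely in the spirit of Bitcask; the class and method names are illustrative, and a Python list stands in for the append-only file:

```python
# Log-structured store: appends only, with a hash index of latest offsets.
class LogStore:
    def __init__(self):
        self.log = []      # append-only list standing in for the data file
        self.index = {}    # key -> offset of the newest value for that key

    def put(self, key, value):
        self.index[key] = len(self.log)   # index updated on every write
        self.log.append((key, value))     # write is a cheap append

    def get(self, key):
        return self.log[self.index[key]][1]

    def compact(self):
        # Keep only the newest entry per key, then rebuild the index.
        latest = {}
        for key, value in self.log:       # later entries overwrite earlier
            latest[key] = value
        self.log = [(k, v) for k, v in latest.items()]
        self.index = {k: i for i, (k, _) in enumerate(self.log)}

db = LogStore()
db.put("a", 1); db.put("a", 2); db.put("b", 3)   # "a" written twice
db.compact()                                      # duplicate of "a" dropped
print(db.get("a"), len(db.log))  # 2 2
```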
Points to consider
- File format : The CSV format described is not the best way to save data. Binary encoded formats are better
- Delete records : An entry is made to indicate that the key is deleted. During compaction, the older entries are removed
- Crash handling : The in-memory index is lost in a crash. Reloading all values from segments can be time-consuming until warmup happens. Bitcask writes a snapshot of each segment's index to disk to make loading faster.
- Partial data write : Writes may fail because the DB process died in the middle. Checksums are maintained to detect such occurrences and clear partial entries.
- Concurrency : Maintain a single write thread to append logs. Reads can use multiple threads
Why Append rather than replace?
- Faster to append and operate on memory in a sequential way
- Concurrency and crash handling are less complex. Won't have segments with spliced, incorrect data
- Compacting segments reduces fragmenting caused by deleted entries
Limitations
- Index hash should be in-memory. Keeping it on disk is very expensive in terms of I/O
- Range queries are inefficient. Queries need to be made for each key in the range (key0, key1..key10). Overcome these limitations with Index structuring described next
SSTables and LSM Trees
This is an optimization on top of the hash logic. We add a requirement where segment files are sorted by keys of the hash. This format is called Sorted String Table (SSTable). The advantages are:
- Segments can be easily compacted even from several different segment files. Read from the files and keep adding entries in sorted order to the new compacted segment. If an incoming entry is newer than the current one, replace it; else discard.
- To find a particular key, index of each key need not be kept in-memory. Since they are all arranged in sorted order, one index can be maintained for whole segment. When looking for a key, jump to the appropriate segment and search for the key. This is efficient since it's searching on a smaller dataset.
- Since we need to scan for a key within a segment anyway, we can compress each block of the segment and save it. The in-memory index points to a block, which is then loaded and decompressed. This reduces disk space usage and I/O.
Flow
- When a write comes in, data is stored in-order in a Tree structure (RB-Tree, AVL-Tree). This unit is called a MemTable.
- Read requests are served from this in-memory dataset. If key is missing, segments are checked from newest to oldest.
- When MemTable reaches a certain size it is written out to disk as a segment. Since MemTables become Segments they are already sorted.
- When this MemTable is written to disk, another instance of MemTable continues handling write requests.
- Run compaction from time to time. Segments are compacted in sorted order too. Several compaction strategies exist, either by size or by levels (separated by old/new).
- To protect against crashes, append new writes to disk as log. Load from log if crash happens. Each time memtable write happens, delete the corresponding log.
- Absent entries are identified with a Bloom filter. A bloom filter can definitively say if an entry does NOT exist, and say with some probability that an entry does exist.
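The flow above can be sketched as a toy LSM engine: a sorted memtable that flushes into immutable sorted segments, with reads checking the memtable first and then segments newest-to-oldest. The flush threshold, names, and binary search over a materialized key list are all simplifications for illustration:

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []              # list of sorted (key, value) lists
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # MemTable is written out already sorted by key.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # serve from memory first
            return self.memtable[key]
        for segment in reversed(self.segments):   # newest segment first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)     # binary search: sorted!
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

db = TinyLSM()
db.put("b", 1); db.put("a", 2)   # second put triggers a flush to disk
db.put("b", 9)                   # newer value lives in the memtable
print(db.get("b"), db.get("a"))  # 9 2
```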
Examples
This kind of indexing is called a Log-Structured Merge tree (LSM-Tree). Eg. LevelDB, RocksDB, Cassandra, HBase, Google BigTable. Lucene, the indexing engine for search used by Elasticsearch, uses the same concept for its term dictionary: it maintains a map with a word or term as the key and the list of documents it occurs in as the value.
B Trees
Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size, and read or write one page at a time.
- A tree of pages is constructed, each page with multiple branches out of it (the branching factor), typically in the hundreds.
- Searching for a key is done by walking down the tree from the root.
- During a write, search for the key and update its page in place. If there is not enough space on the page, split it into 2 new branched pages and update the parent.
- B Trees are usually 4 or 5 levels deep. A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
- Resiliency is guaranteed with a Write-ahead Log (WAL). If db crashes, this log is used to update
- Concurrent reads may be affected if the same page is being updated in place. Lightweight locks called latches are used to protect pages. Log-structured engines are simpler here because they merge into a new copy of the segments in the background and then swap old for new.
- Optimization can be to create new pages for modified data and swap pointers
- Sequential reads may be inefficient because successive pages may not be laid out in order on disk. Log-structured tables keep sequential data close together. Ideas like pointers to left and right sibling pages are used to reduce traversals. Fractal trees claim better performance than B-trees.
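The 256 TB figure above can be checked with quick back-of-envelope arithmetic; the only inputs are the numbers already stated (4 KB pages, branching factor 500, four levels):

```python
branching = 500
page_size = 4 * 1024            # 4 KB pages, in bytes
levels = 4

# With branching factor 500, a four-level tree can reach 500^4 leaf pages.
leaf_pages = branching ** levels
capacity_bytes = leaf_pages * page_size

print(capacity_bytes // 10**12)  # 256 (decimal terabytes)
```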
Comparison of LSM Trees and B-Trees
As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads. However, this depends on the benchmarks used and results could vary.
Advantages of LSM Trees (think speed,time,memory)
Write amplification describes multiple writes for a single piece of data. In both B-Tree and LSM-Tree systems, a WAL/commit log is constantly maintained on disk, in addition to the data written to the page/segment. In a B-Tree, a full page needs to be written even for a small change. LSM-Trees compact segments and re-write data many times. The sequential writes of LSM-Trees perform better than the disk seeks needed by B-Trees. LSM-Trees also write out compressed data and reduce fragmentation during compaction. This leads to lower storage overhead and disk I/O.
Disadvantages of LSM Trees The compaction thread can increase query latency and disk I/O rate at higher loads; B-Trees are more predictable in time taken. If compaction threads cannot keep up with the rate of incoming writes, the space taken by the database can grow out of control. B-Trees keep exactly one copy of each key, while LSM-Trees can have several older copies. B-Trees can lock a key directly in the tree, which is useful in transactional processing; B-Trees are preferred in relational DBs for this reason.
Kinds of Indices
Primary The single key we have looked at so far, is often considered the Primary Key. A primary key uniquely identifies one row in a relational table, or one document in a document database, or one vertex in a graph database.
Secondary It is also very common to have secondary indexes. In relational databases, you can create several secondary indexes on the same table, and they are often crucial for performing joins efficiently. A secondary index points to all the rows where the chosen key is present. Since this won't be unique (multiple rows for the same key value), we can either store a list of matching rows under the key, or make the key unique by appending a row identifier.
Keys in an index are what we search for. The value can either be the full row, or a reference to a row stored elsewhere in what is called the heap file. The heap file approach is common as it avoids duplicating the data across multiple secondary indexes. It is also convenient when overwriting a value with new data, though a little tricky if the new data is larger: the row moves to a new heap location and pointers need updating. If this extra hop is too expensive, the index can store the full row itself; this is called a clustered index. The primary index is often clustered, and secondary indices can simply point to it. A hybrid approach, a covering index, stores only a few of the row's columns in the index.
Clustering and covered indices can increase speed, but they duplicate data. DBs need to manage updates and provide transactional guarantees to prevent inconsistent/stale data from being sent back.
Multi-column Concatenated indices combine several fields into one key by appending one column to another. The sort order then lets us retrieve all rows sharing a prefix of the key, e.g. everyone with the same last name when the key is (last-name, first-name).
Full Text or Fuzzy searches Searching for a similar word or a misspelt word requires different techniques. Lucene uses an SSTable-like structure for its term dictionary. This structure requires a small in-memory index that tells queries at which offset in the sorted file they need to look for a key. In Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie. This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance. Basically, the trie can be walked to get the position of the term in a file and its closest matches.
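For reference, the edit distance being bounded here is plain Levenshtein distance. Below is the standard dynamic-programming version, not the automaton Lucene uses (the automaton's point is to avoid recomputing this table for every dictionary word):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] = edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                 # deletion from a
                cur[j - 1] + 1,              # insertion into a
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```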
Keeping everything in memory
Disks have two significant advantages: Durability and Lower cost per GB than RAM. With RAM getting cheaper, databases can be held fully in-memory. Eg. Memcached, Redis
- Disk is used as a backup log from which to load data. Reads are served from memory. The performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the OS caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk.
- In-memory databases also provide some data structures that disk databases dont. Eg. Redis offers a database-like interface to various data structures such as priority queues and sets. Because it keeps all data in memory, its implementation is comparatively simple.
- Anti-caching approach works by evicting the least recently used data from memory to disk when there is not enough memory, and loading it back into memory when it is accessed again in the future
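The anti-caching idea can be sketched with an LRU policy. Everything here is illustrative: a dict stands in for disk, `capacity` is arbitrary, and a real system would evict at a much coarser granularity:

```python
from collections import OrderedDict

class AntiCache:
    def __init__(self, capacity=2):
        self.memory = OrderedDict()   # order tracks recency of use
        self.disk = {}                # stand-in for slower storage
        self.capacity = capacity

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)              # mark most recently used
        while len(self.memory) > self.capacity:
            cold_key, cold_val = self.memory.popitem(last=False)  # LRU item
            self.disk[cold_key] = cold_val        # evict cold data to disk

    def get(self, key):
        if key not in self.memory:
            self.put(key, self.disk.pop(key))     # fault it back into memory
        self.memory.move_to_end(key)
        return self.memory[key]

c = AntiCache(capacity=2)
c.put("a", 1); c.put("b", 2); c.put("c", 3)   # "a" is now coldest: evicted
print(sorted(c.disk))   # ['a']
print(c.get("a"))       # 1  (loaded back from disk on access)
```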
Analytic processing
Transaction processing vs Analytic processing
| Property | Transaction processing systems (OLTP) | Analytic systems (OLAP) |
|---|---|---|
| Read pattern | Small number of queries by key. Usually full row | Large number of queries to aggregate. Usually specific columns |
| Write pattern | Random, low latency | Bulk import or event stream |
| Application | End user, website etc | Data analysts, Business intelligence |
Data warehousing is popular for OLAP. The idea is to extract data from critical OLTP systems and load them into dedicated OLAP systems. By decoupling these two, analysts can work on extracted data without querying production OLTP systems. This process is called Extract-Transform-Load (ETL).
Data warehouses are typically relational, and follow SQL type syntax for queries.
Stars and Snowflakes: Schemas for Analytics
OLTP systems use a range of schemas (doc, relational, graph). OLAP systems have much less diversity.
They follow a relational approach where all relevant events are recorded as part of a Central table. This table then has secondary keys pointing to other relevant tables. Eg. An event_table can capture a customer shopping on a particular day, viewing or buying a product at a particular store. All these what, why, who, how, when questions form the basis of the other tables. So date can be a table with additional holiday or offer information. Similarly, product_table, customer_table etc. can branch off of event_table. This takes the shape of a star or a snowflake (secondary tables branch further).
Column Oriented storage
- The workflows in analytical processing almost never do `select *` and read all rows. They pick specific columns and run SUM/AVG on those. These are huge bulk operations that traditional OLTP methods become too slow for, even when indexed. OLTP systems also save data row by row in a contiguous fashion, so disk seeks to get the next column can be expensive. The efficient way is to arrange columns contiguously rather than rows. This is column-oriented storage.
- If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work. It has the added benefit that compressing contiguous data of the same type can be highly effective. Bitmap encoding is sometimes used along with run-length encoding.
- Compression also aids in lowering bandwidth while loading data into memory. Operations on compressed data can make very efficient use of the CPU and L1 caches; this is called vectorized processing.
- Sort order of columns can help make both queries and compression better. Each column may benefit from a specific ordering technique. In replicated systems, different replicas can be sorted in different ways; use the version that best fits the query.
A row-oriented store keeps every row in one place (in the heap file or a clustered index), and secondary indexes just contain pointers to the matching rows. Column-oriented storage doesn't have such pointers; it is simply the columns containing the values, with the same row order across column files.
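The run-length encoding mentioned above exploits exactly this layout: a column of repeated values collapses into (value, count) runs. A small sketch with an illustrative column:

```python
def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((value, 1))               # start a new run
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

# A column file with long runs of identical values compresses well.
col = ["fruit", "fruit", "fruit", "veg", "veg", "fruit"]
encoded = rle_encode(col)
print(encoded)  # [('fruit', 3), ('veg', 2), ('fruit', 1)]
```

Note this only pays off when the column is sorted or naturally repetitive, which is why sort order matters for compression.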
Writing data
Data can be written in LSM-Trees where an in-memory structure maintains recent entries and sorted segments are maintained during write to disk. B-Trees don't suit the purpose well for inserting data since seeking and jumping memory locations makes things slow and breaks contiguous data units.
Materialized Views
Since aggregate queries like SUM, AVG are common, we can create a Materialized View - a copy of the results of a query written to DB. It is in contrast to a Virtual view, where nothing is written to DB. When underlying data changes, Materialized View should also change. It can get expensive in regular OLTP databases, but it works well on read-heavy OLAP systems. Data cubes are multi-dimensional Materialized Views where several indexes can be combined.
Encoding and Evolution
A codebase continues evolving with requirements. Adding new fields to a schema, or editing it in some way, happens all the time. Each database model (relational, doc, graph) has some way of handling this. Eg. ALTER TABLE in SQL, directly adding fields in the document model. Application code needs to change to handle the new fields. Because of the way rolling upgrades are done and mobile apps are updated (user-dependent), old and new versions of the application and data model can co-exist in all combinations. So the application needs to be:
- Backward compatible - New code able to handle old data
- Forward compatible - Old code able to handle new data => This is tricky
We look at ways to encode data and how different schemas can help with achieving fwd and bkwd compatibility.
Encoding
- In-memory data structures are lists, trees etc. that make it easy for the CPU to operate on the data.
- A database or network stream needs to contain this data in a self-contained format. Eg. json, xml. The translation between the in-memory structure and this representation is called encoding (also known as serialization or marshalling), and the reverse is called decoding (parsing, deserialization, unmarshalling)
Language specific formats like Java Serializable, Python pickle exist. Problems with these are:
- Not able to communicate with programs encoding in other formats/languages. Transmitting data in this format between programs makes it hard to move away from the language.
- Allowing load of this data and instantiating class objects is a security risk
- Versioning is an afterthought and there are problems with fwd and bkwd compatibility
- Efficiency may also be low like with Java's in-built serialization
Standardized encodings JSON, XML, CSV, Binary variants are popular language-independent formats. These are preferred but have some minor drawbacks:
- Numbers and strings containing numbers are often not distinguished correctly in XML. JSON can distinguish between strings and numbers, but not integers from floating point, and it says nothing about precision.
- JSON/XML support Unicode but not binary formats well. Binary encoding with base64 works but can increase size by 33%.
- Use of XML schemas is fairly widespread, but many JSON-based tools don’t bother using schemas. CSV is a bit vague and not all interpreters correctly implement rules the same way.
Despite these flaws, JSON, XML, and CSV are good enough for many purposes.
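The 33% base64 overhead mentioned above follows from the encoding mapping every 3 input bytes to 4 output characters; the 90-byte payload below is arbitrary:

```python
import base64

payload = bytes(range(90))            # 90 arbitrary binary bytes
encoded = base64.b64encode(payload)   # 3 bytes in -> 4 ASCII chars out

print(len(payload), len(encoded))     # 90 120
print(len(encoded) / len(payload))    # ~1.33, i.e. the 33% size increase
```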
- Binary encoding techniques that produce tinier versions of data are also available. Eg. Google protobuf, Facebook Thrift, MessagePack. Google protobuf requires a programmer to define a schema of the model. It has utils to auto-generate code for the classes in any language. Data is then composed with `tags` for field names, `data type` (string, int, each represented as 0, 1 etc.), `length` of the data to follow, and the `data` blob.
- If a field value is not set, it is simply omitted from the encoded record. Thus, adding or removing a field becomes easier. If a field is removed, it is simply not passed. If a new field is added but passed to older code, that code ignores the tag or field it doesn't understand. The only caveat is to make newly added fields optional or give them a default value, never required.
- Changing the data type of a field is problematic. Eg. Int32 can be read as Int64. Protobufs don't have a list/array type but have the `repeated` marker. So a field can be changed from a single variable to a repeated variable (list) with fwd/bkwd compatibility; old code would only pick the last element of the list.
- Apache Avro does something similar, but it completely drops the tag and type fields, leaving only length and data. It does this by comparing the writer and reader schemas and ignoring elements that don't show up in one or the other. When adding a new field, the schema should include a default value to use when data written with the older schema is read by a newer reader.
- Avro's Writer and reader schema can be shared once at the beginning in a file-write or network handshake; a version number included in Db entries to indicate the schema on it.
- Avro schema can also be auto-gen from database schema
- Avro also works nicely with dynamic languages (js, ruby, py) since it doesn't need code generation like with protobufs
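A simplified sketch of the tag/length/data idea described above. This is not the real protobuf wire format (which uses varints and wire types); field numbers, the fixed-width header, and the record contents are all illustrative. The point is that a reader can skip tags it does not know, which is what makes old code forward compatible:

```python
import struct

def encode_field(tag: int, value: bytes) -> bytes:
    # 1-byte tag + 4-byte big-endian length, then the raw data blob.
    return struct.pack(">BI", tag, len(value)) + value

def decode_fields(buf: bytes, known_tags):
    fields, i = {}, 0
    while i < len(buf):
        tag, length = struct.unpack_from(">BI", buf, i)
        i += 5
        if tag in known_tags:          # old code simply skips unknown tags
            fields[tag] = buf[i:i + length]
        i += length                    # advance past the data either way
    return fields

# New writer emits field 2, which the old reader's schema doesn't have.
record = encode_field(1, b"Martin") + encode_field(2, b"new-field")
old_reader = decode_fields(record, known_tags={1})
print(old_reader)  # {1: b'Martin'} - unknown field 2 ignored, not an error
```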
Evolution
Compatibility is a relationship between one process that encodes the data, and another process that decodes it. There are many ways data can flow from one process to another:
- Via databases
Data can be written by a service for its own consumption at a later time or for use by other services. Either way, problems of fwd and bkwd compatibility exist. Data outlives code, so older data schemas will live in DB alongside newer schemas. There could be a scenario where new data was written by a service, which was read by an older service into an object that had the older schema and then written back to DB. This may not have the new field when writing out to DB, thereby losing the data. There are ways to handle this. Eg. @JsonAnyGetter, @JsonAnySetter but we should be aware of this and handle it correctly. LinkedIn’s document database Espresso uses Avro for storage, allowing it to use Avro’s schema evolution rules
- Via service calls Applications can expose API over which clients can reach them. Basics of SOA or microservices arch. Each service can be owned by different teams and talk to each other for a specific function. We can expect old/new versions of service running at the same time.
Web Services : When HTTP is used as the underlying protocol for talking to the service, it is called a web service. REST and SOAP are the popular ones. REST is more easy to use (represents resources as URL, data is usually JSON), SOAP is harder (XML based and relies on tooling support, not readable, can be invoked from code).
RPC : Tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process. Fundamental problems with this: network calls make function run time unpredictable, timeouts can mean failure anywhere, and retries are safe only if the server side handles repeated calls idempotently. Talking across different languages is painful. Frameworks still exist: gRPC, rest.li (json over http). They use futures and streams (connection stays open for many requests/responses, not just one). Some of these frameworks also provide service discovery.
Compatibility We can make an assumption here that RPC/web services are usually updated first and clients later. This relaxes the constraint slightly: responses only need to be fwd compatible and requests only bkwd compatible. If a compatibility-breaking change is required, the service provider often ends up maintaining multiple versions of the service API side by side. For RESTful APIs, common approaches are to use a version number in the URL or in the HTTP Accept header. For services that use API keys to identify a particular client, another option is to store a client's requested API version on the server and to allow this version selection to be updated through a separate administrative interface.
Webservices using REST are flexible and are preferred for public data, RPC is preferred for internal data transfer.
- Via asynchronous message passing Hybrid between Database and RPC. They are similar to RPC in that a client’s request (usually called a message) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but goes via an intermediary called a message broker (also called a message queue or message-oriented middleware), which stores the message temporarily. Eg. RabbitMQ, Kafka
Message broker has advantages: It acts as a buffer (producer-consumer problem), retries message delivery, sender doesn't need to know the recipient's network details, sender can broadcast/multicast its message and logically decouples sender and receiver.
General pattern: One process sends a message to a named queue or topic, and the broker ensures that the message is delivered to one or more consumers of or subscribers to that queue or topic. There can be many producers and many consumers on the same topic. A topic provides only one-way dataflow. We need to combine Request/Reply queues to send and receive messages.
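The producer/topic/consumer pattern can be sketched in-process with a thread and a queue. A real broker (RabbitMQ, Kafka) adds durability, network transport, and fan-out on top of this; the sentinel-based shutdown and message contents are illustrative:

```python
import queue
import threading

topic = queue.Queue()   # stands in for a named broker topic
received = []

def consumer():
    while True:
        message = topic.get()   # blocks until the broker delivers a message
        if message is None:     # illustrative shutdown sentinel
            break
        received.append(message)

worker = threading.Thread(target=consumer)
worker.start()

# Producers only know the topic, not the consumer's network details.
topic.put({"event": "page_view"})
topic.put({"event": "purchase"})
topic.put(None)                 # signal shutdown
worker.join()

print(received)  # [{'event': 'page_view'}, {'event': 'purchase'}]
```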
Distributed actor model launches multiple services listening to brokers and taking action. There may be multiple actors with some on a different machine over the network. Broker takes care of transmitting and receiving messages across the network as well. Compatibility is still a concern when passing messages between nodes running old and new schema. Need to rely on something like Avro for this.