Chapter 3: File Formats ‐ Organizing Data on Disk - abhay-jindal/wallet GitHub Wiki

📁 Chapter 3: File Formats - Organizing Data on Disk 🚀

Welcome to Chapter 3 of Database Internals by Alex Petrov! This chapter dives into File Formats, the unsung heroes of database storage. It explains how databases organize data on disk, from encoding numbers to structuring pages for efficient access. Since you’re relying on these notes without reading the book, we’ll cover every critical concept—binary encoding, page structures, slotted pages, and more—with vivid real-life examples to make it crystal clear. We’ll also describe the chapter’s diagrams to visualize key ideas. Let’s explore how databases store data with enthusiasm! 🎉


🎯 Chapter Scope and Purpose

Chapter 3 explores how databases store data in files on disk, focusing on file formats that balance efficiency, flexibility, and reliability. It covers how raw data (e.g., numbers, strings) is encoded, how files are structured into pages, and how databases manage variable-size data. These concepts are foundational for understanding storage engines in later chapters.

What’s Covered:

  • Motivation: Why file formats matter for performance and storage 🗂️
  • Binary Encoding: Representing numbers, strings, and other data types 📚
  • Page Structure: Organizing data into fixed-size pages 📄
  • Slotted Pages: Managing variable-size data efficiently ⚙️
  • Cells and Layout: Storing records within pages 🧩
  • Versioning and Checksumming: Ensuring data integrity 🔒
  • Diagrams: Visuals of page layouts and encoding 📊

Real-Life Example: Imagine an e-commerce platform like Amazon storing product details on disk. A well-designed file format ensures fast retrieval of product names, prices, and descriptions, even with millions of items! 🛒


🌟 Motivation: Why File Formats Matter

File formats determine how efficiently a database stores and retrieves data. A good format:

  • Minimizes Disk I/O: Reduces the number of reads/writes.
  • Supports Flexibility: Handles variable-size data like strings.
  • Ensures Reliability: Prevents corruption with checksumming.
  • Optimizes Space: Encodes data compactly.

Real-Life Example: A social media platform stores user posts on disk. An efficient file format ensures quick loading of timelines while minimizing storage costs 📱.


📚 Binary Encoding

Databases store data in binary format on disk, converting high-level types (e.g., integers, strings) into bytes. Chapter 3 explains how this encoding works for various data types.

🔢 Primitive Types

  • Integers:
    • Stored as fixed-size binary numbers (e.g., 32-bit or 64-bit).
    • Big-Endian vs. Little-Endian: Determines byte order (e.g., big-endian stores most significant byte first).
    • Example: The number 1234 might be stored as 00 00 04 D2 (big-endian, 32-bit).
  • Floating-Point Numbers:
    • Use IEEE 754 standard (e.g., 32-bit float or 64-bit double).
    • Example: The number 3.14 is encoded with a sign bit, exponent, and mantissa.
  • Booleans:
    • Often stored as a single bit or byte (e.g., 0 for false, 1 for true).

Real-Life Example: A financial database stores transaction amounts as 64-bit doubles. Precise encoding ensures accurate calculations for balances 💸.

📝 Strings and Variable-Size Data

  • Fixed-Length Strings:
    • Allocate a fixed number of bytes, padding if needed (e.g., “cat” in a 10-byte field: 63 61 74 00...).
  • Variable-Length Strings:
    • Store length prefix (e.g., 4-byte length) followed by bytes.
    • Example: “hello” might be 00 00 00 05 68 65 6C 6C 6F (length 5, then ASCII bytes).
  • Null-Terminated Strings:
    • End with a null byte (00), common in older systems but less efficient.

Real-Life Example: A blog platform stores post titles as variable-length strings. A length prefix ensures quick reading without scanning for nulls ✍️.

🔲 Bit-Packed Data

  • Booleans, Enums, Flags:
    • Pack multiple values into a single byte to save space.
    • Example: 8 boolean flags (e.g., “is_active”, “is_admin”) fit in one byte, with each bit representing a flag.
  • Bit-Packing:
    • Compresses small values (e.g., enums with 4 values use 2 bits each).
    • Example: An enum {RED, GREEN, BLUE, YELLOW} can store 4 values in 8 bits.

Real-Life Example: A gaming platform stores player status flags (e.g., online, in-game) in a single byte, reducing storage for millions of users 🎮.

🛠️ General Principles

  • Alignment: Align data to byte boundaries for faster CPU access.
  • Compression: Minimize wasted space (e.g., avoid padding in variable-length data).
  • Portability: Use consistent encoding (e.g., big-endian) for cross-platform compatibility.

Diagram Description:

  • The book includes a Binary Encoding Diagram, showing how an integer (e.g., 1234), a float (e.g., 3.14), and a variable-length string (e.g., “hello”) are encoded in bytes. It compares big-endian vs. little-endian for the integer and shows the length prefix for the string.

📄 Page Structure

Databases organize files into pages, fixed-size blocks (e.g., 4KB or 8KB) that align with disk block sizes. Pages are the unit of I/O, so their structure is critical for performance.

🧱 Basic Page Layout

  • Header: Metadata like page ID, type (e.g., data, index), and checksum.
  • Body: Stores actual data (records or index entries).
  • Free Space: Reserved for future writes or updates.

Real-Life Example: A library database stores book records in 4KB pages. Each page holds multiple records, with a header tracking how many records are stored 📚.

📐 Slotted Pages

Slotted pages are a common format for storing variable-size records (e.g., strings, JSON). They balance flexibility and efficiency.

  • Structure:
    • Header: Tracks number of records and free space.
    • Slot Array: An array of pointers to record locations (cells) within the page.
    • Cells: Variable-size records stored at the end of the page, growing backward.
    • Free Space: Gap between slot array and cells, used for new records.
  • How It Works:
    • Records are stored in cells, and the slot array maps each record’s offset.
    • New records are added to free space, updating the slot array.
    • Deletes reclaim space, marked as free for reuse.

Diagram Description:

  • The book includes a Slotted Page Diagram, showing a page with a header, a slot array (e.g., pointers to offsets 100, 200), and cells at the page’s end (e.g., records “Book A”, “Book B”). Free space is shaded between the slot array and cells, with arrows showing how new records reduce free space.

Real-Life Example: A retail database uses slotted pages to store product descriptions of varying lengths. The slot array ensures quick access to each description, even as new products are added 🛍️.

🧩 Cell Layout

  • Cells: Individual records within a page, containing fields (e.g., ID, name, price).
  • Fixed vs. Variable Fields:
    • Fixed fields (e.g., integers) have predictable sizes.
    • Variable fields (e.g., strings) use length prefixes or pointers to overflow pages.
  • Combining Cells:
    • Cells are packed tightly, with the slot array tracking their locations.
    • Overflow pages store large records that don’t fit in a single page.

Real-Life Example: A hospital database stores patient records in cells, with fixed fields (e.g., patient ID) and variable fields (e.g., medical notes). Overflow pages handle lengthy notes 🩺.


🔒 Versioning and Checksumming

📜 Versioning

  • Purpose: Tracks page or record versions to support concurrency or recovery.
  • How: Include a version number or timestamp in the page header or cell.
  • Use Case: Multi-version concurrency control (MVCC) in PostgreSQL uses versioning to show different record versions to transactions.

Real-Life Example: A banking system versions account records to ensure transactions see consistent data, even during concurrent updates 🏦.

🔍 Checksumming

  • Purpose: Detects data corruption (e.g., from disk errors).
  • How: Compute a checksum (e.g., CRC32) for the page’s content and store it in the header. Verify on read.
  • Trade-Off: Adds computation overhead but ensures reliability.

Real-Life Example: A cloud storage system checksums pages to detect corruption, ensuring user files remain intact ☁️.

Diagram Description:

  • The book includes a Page Header Diagram, showing a page header with fields like page ID, checksum, version number, and free space pointer. It illustrates how the header precedes the slot array and cells in a slotted page.

🌍 Use Cases and Trade-Offs

🛠️ Use Cases

  • Relational Databases: MySQL, PostgreSQL use slotted pages for tables and indexes.
  • Key-Value Stores: RocksDB uses custom file formats for efficient key storage.
  • File Systems: Similar page-based structures for metadata.

Real-Life Example: A streaming service stores movie metadata in slotted pages, enabling fast retrieval of titles and descriptions 🎥.

⚖️ Trade-Offs

Aspect Advantage Disadvantage
Slotted Pages Flexible for variable-size data Complex to manage free space
Checksumming Detects corruption Adds computational overhead
Variable Encoding Saves space for strings Slower to parse than fixed formats

Real-Life Example: A logistics database uses slotted pages for shipment records. Flexibility is key, but managing free space requires careful defragmentation 📦.


🌟 Key Takeaways

  • File Formats: Organize data on disk for efficiency and reliability 📁.
  • Binary Encoding: Converts integers, strings, and flags into compact bytes 🔢.
  • Slotted Pages: Manage variable-size records with headers, slots, and cells 📄.
  • Versioning: Supports concurrency and recovery 📜.
  • Checksumming: Ensures data integrity 🔍.
  • Foundation: Prepares for B-Tree implementations in Chapter 4 🗂️.

🚗 Real-Life Case Study: Online Retail

An online retailer like ShopFast uses file formats to store product data:

  • Relational DB (PostgreSQL): Stores product records in 8KB slotted pages, with headers, slot arrays, and cells for IDs, names, and prices 🛍️.
  • Binary Encoding: Prices as 64-bit floats, names as variable-length strings with length prefixes.
  • Operations:
    • Read: Slot array locates product records quickly.
    • Write: New products added to free space, updating slots.
  • Versioning: MVCC versions records for concurrent checkouts.
  • Checksumming: Detects disk errors to ensure accurate inventory.

Outcome: Customers enjoy fast searches and reliable checkouts, while the database optimizes storage and integrity! 🛒


📖 Further Reading

  • Books:
    • Introduction to Algorithms by Cormen et al. (data structures) 📚
    • Designing Data-Intensive Applications by Martin Kleppmann 🛠️
  • Papers:
    • “The Design of PostgreSQL’s Storage” (covers slotted pages) 📜
  • Resources: Check Database Internals’ bibliography for file format research 📝.

Real-Life Example: A database engineer at a retailer studies PostgreSQL’s storage papers to optimize product indexing, speeding up searches 🛍️.