Architecture Overview - louisphilipmarcoux/rill-json GitHub Wiki

This document explains the internal design of rill-json for contributors. The library is split into two main stages, following the classic compiler design that separates "lexing" from "parsing".

The flow is:

  1. Raw &[u8] Input: The Tokenizer consumes raw bytes.
  2. Tokenizer (Lexer): This is the "engine." It scans the bytes and produces Tokens (e.g., LeftBrace, String("hello"), Comma).
  3. Parser (State Machine): This is the "brains." It consumes Tokens, validates the grammar, and produces ParserEvents (e.g., StartObject, Key(...), String(...)).
  4. ParserEvent Output: The user consumes these events from the StreamingParser iterator.
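To make the hand-off between the two stages concrete, here is a minimal sketch of a token-to-event state machine. It is not the library's actual code. The Token and ParserEvent variants are taken from the examples above, but their exact definitions, the State enum, and the parse_flat_object helper are hypothetical. It only handles a flat object with string values, and it collects events into a Vec, whereas the real StreamingParser yields them lazily through an iterator.

```rust
use std::borrow::Cow;

// Hypothetical subset of the tokens produced by the lexer stage.
#[derive(Debug, Clone, PartialEq)]
enum Token<'a> {
    LeftBrace,
    RightBrace,
    Colon,
    Comma,
    String(Cow<'a, str>),
}

// Hypothetical subset of the events produced by the parser stage.
#[derive(Debug, Clone, PartialEq)]
enum ParserEvent<'a> {
    StartObject,
    EndObject,
    Key(Cow<'a, str>),
    String(Cow<'a, str>),
}

// What the grammar allows next (illustrative names only).
#[derive(Clone, Copy)]
enum State {
    ExpectObject,
    ExpectKeyOrEnd,
    ExpectKey,
    ExpectColon,
    ExpectValue,
    ExpectCommaOrEnd,
    Done,
}

// Validates the token order for a flat object and turns tokens into events.
fn parse_flat_object<'a>(
    tokens: impl IntoIterator<Item = Token<'a>>,
) -> Result<Vec<ParserEvent<'a>>, String> {
    let mut state = State::ExpectObject;
    let mut events = Vec::new();
    for tok in tokens {
        state = match (state, tok) {
            (State::ExpectObject, Token::LeftBrace) => {
                events.push(ParserEvent::StartObject);
                State::ExpectKeyOrEnd
            }
            (State::ExpectKeyOrEnd, Token::RightBrace)
            | (State::ExpectCommaOrEnd, Token::RightBrace) => {
                events.push(ParserEvent::EndObject);
                State::Done
            }
            (State::ExpectKeyOrEnd, Token::String(s))
            | (State::ExpectKey, Token::String(s)) => {
                events.push(ParserEvent::Key(s));
                State::ExpectColon
            }
            (State::ExpectColon, Token::Colon) => State::ExpectValue,
            (State::ExpectValue, Token::String(s)) => {
                events.push(ParserEvent::String(s));
                State::ExpectCommaOrEnd
            }
            (State::ExpectCommaOrEnd, Token::Comma) => State::ExpectKey,
            // Any other (state, token) pair violates the grammar.
            (_, tok) => return Err(format!("unexpected token {:?}", tok)),
        };
    }
    if matches!(state, State::Done) {
        Ok(events)
    } else {
        Err("unexpected end of input".to_string())
    }
}
```

The point of the split is that the state machine never looks at raw bytes. It only decides whether the next token is legal in the current state, which keeps the grammar checks separate from the byte scanning.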

1. src/tokenizer.rs (The "Engine")

This is the performance-critical core of the library. We achieved high performance without unsafe code by using two key techniques:

Byte-based Lookup Table (LUT)

Instead of running a multi-arm match statement on every byte, we use a 256-entry static array (BYTE_PROPERTIES) as a lookup table. Because a u8 index can never fall outside a 256-entry array, this lets us classify any byte with a single, branchless array lookup.

  • BYTE_PROPERTIES[b' ' as usize] -> W (Whitespace)
  • BYTE_PROPERTIES[b'{' as usize] -> S (Structural)
  • BYTE_PROPERTIES[b't' as usize] -> L (Literal)
  • BYTE_PROPERTIES[b'9' as usize] -> D (Digit)
  • BYTE_PROPERTIES[b'"' as usize] -> Q (Quote)
  • BYTE_PROPERTIES[b'z' as usize] -> 0 (Invalid)

The main Iterator::next function in the tokenizer then reduces to a single match on this property value, which keeps the per-byte dispatch cheap.
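A lookup table of this kind can be sketched as follows. The name BYTE_PROPERTIES and the class letters W, S, L, D, Q come from the list above. The numeric values of the classes, the build_table and classify helpers, the describe function, and the decision to treat '-' as a digit-class byte are assumptions made for illustration and may not match the real table.

```rust
// Property codes; 0 means "invalid as the start of a token".
// The numeric values are assumptions for this sketch.
const W: u8 = 1; // whitespace
const S: u8 = 2; // structural: { } [ ] : ,
const L: u8 = 3; // first byte of a literal: true, false, null
const D: u8 = 4; // start of a number
const Q: u8 = 5; // opening quote of a string

// Builds the table at compile time, so no work happens at runtime.
const fn build_table() -> [u8; 256] {
    let mut t = [0u8; 256];
    t[b' ' as usize] = W;
    t[b'\t' as usize] = W;
    t[b'\n' as usize] = W;
    t[b'\r' as usize] = W;
    t[b'{' as usize] = S;
    t[b'}' as usize] = S;
    t[b'[' as usize] = S;
    t[b']' as usize] = S;
    t[b':' as usize] = S;
    t[b',' as usize] = S;
    t[b't' as usize] = L;
    t[b'f' as usize] = L;
    t[b'n' as usize] = L;
    t[b'"' as usize] = Q;
    t[b'-' as usize] = D; // assumption: a leading minus starts a number
    let mut d = b'0';
    while d <= b'9' {
        t[d as usize] = D;
        d += 1;
    }
    t
}

static BYTE_PROPERTIES: [u8; 256] = build_table();

// One array read; a u8 index is always in bounds for a 256-entry array.
fn classify(b: u8) -> u8 {
    BYTE_PROPERTIES[b as usize]
}

// Illustrates the "match on the property" dispatch used by the tokenizer.
fn describe(b: u8) -> &'static str {
    match classify(b) {
        W => "whitespace",
        S => "structural",
        L => "literal",
        D => "digit",
        Q => "quote",
        _ => "invalid",
    }
}
```

Building the table in a const fn keeps the mapping readable in source while still producing a plain static array in the binary.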

memchr for String Parsing

The "hot path" for string parsing (lex_string) is skipping over long stretches of plain text. We use the memchr crate, which relies on SIMD instructions (such as AVX2) where the CPU supports them, to find the next " or \ byte.

  1. Hot Path: Scan for the closing " and the first \.
  2. If no \ is found:
    • Run a quick loop to check for unescaped control characters.
    • Return a Cow::Borrowed slice of the original input (zero-allocation).
  3. Cold Path: Only if a \ is found do we drop into the slow, byte-by-byte loop, which decodes the escape sequences into a new (unescaped) String and returns Cow::Owned.

This makes parsing JSON with many simple strings exceptionally fast.
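The hot/cold split described above can be sketched roughly as below. This is a simplified illustration, not the real lex_string. To stay dependency-free, the hypothetical find_quote_or_backslash helper stands in for memchr::memchr2(b'"', b'\\', haystack) and uses a plain, non-SIMD scan. The sketch also assumes the input is already a valid &str starting just after the opening quote, uses a single scan for either byte rather than separate scans, and leaves out \uXXXX escapes.

```rust
use std::borrow::Cow;

// Stand-in for memchr::memchr2(b'"', b'\\', haystack); not SIMD-accelerated.
fn find_quote_or_backslash(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == b'"' || b == b'\\')
}

// Returns the string contents and the number of bytes consumed,
// including the closing quote. Signature is an assumption of this sketch.
fn lex_string(input: &str) -> Result<(Cow<'_, str>, usize), &'static str> {
    let bytes = input.as_bytes();
    let first = find_quote_or_backslash(bytes).ok_or("unterminated string")?;

    // Raw control characters are not allowed inside a JSON string.
    if bytes[..first].iter().any(|&b| b < 0x20) {
        return Err("unescaped control character");
    }

    // Hot path: the closing quote comes before any backslash,
    // so the contents can be borrowed without allocating.
    if bytes[first] == b'"' {
        return Ok((Cow::Borrowed(&input[..first]), first + 1));
    }

    // Cold path: copy the plain prefix, then decode escapes one by one.
    let mut out = String::with_capacity(first + 16);
    out.push_str(&input[..first]);
    let mut chars = input[first..].char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            '"' => return Ok((Cow::Owned(out), first + i + 1)),
            '\\' => {
                let (_, esc) = chars.next().ok_or("unterminated escape")?;
                out.push(match esc {
                    '"' => '"',
                    '\\' => '\\',
                    '/' => '/',
                    'n' => '\n',
                    't' => '\t',
                    'r' => '\r',
                    'b' => '\u{0008}',
                    'f' => '\u{000C}',
                    // \uXXXX handling is omitted from this sketch.
                    _ => return Err("unsupported escape in this sketch"),
                });
            }
            c if (c as u32) < 0x20 => return Err("unescaped control character"),
            c => out.push(c),
        }
    }
    Err("unterminated string")
}
```

Returning Cow lets callers treat both cases uniformly: strings without escapes stay as slices of the input buffer, and only strings that actually contain a backslash pay for an allocation.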