Architecture Overview - louisphilipmarcoux/rill-json GitHub Wiki
This document explains the internal design of rill-json for contributors.
The library is designed in two main stages, a classic compiler design that separates "lexing" from "parsing".
The flow is:
- Raw &[u8] Input: The Tokenizer consumes raw bytes.
- Tokenizer (Lexer): This is the "engine." It scans the bytes and produces Tokens (e.g., LeftBrace, String("hello"), Comma).
- Parser (State Machine): This is the "brains." It consumes Tokens, validates the grammar, and produces ParserEvents (e.g., StartObject, Key(...), String(...)).
- ParserEvent Output: The user consumes these events from the StreamingParser iterator.
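The two-stage flow can be sketched as two chained iterators. This is a minimal illustration, not the library's actual code: the enum variants are taken from the examples above, but the struct fields, the parse helper, and the restriction to flat objects with escape-free strings are assumptions made to keep the sketch short. It also skips the grammar validation that the real parser performs.

```rust
// Hypothetical, heavily simplified types; the real rill-json definitions may differ.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    LeftBrace,
    RightBrace,
    Colon,
    Comma,
    String(&'a str),
}

#[derive(Debug, PartialEq)]
enum ParserEvent<'a> {
    StartObject,
    EndObject,
    Key(&'a str),
    String(&'a str),
}

/// Stage 1: turns raw bytes into tokens.
struct Tokenizer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Iterator for Tokenizer<'a> {
    type Item = Token<'a>;

    fn next(&mut self) -> Option<Token<'a>> {
        let input = self.input;
        // Skip JSON whitespace between tokens.
        while matches!(input.get(self.pos).copied(), Some(b' ' | b'\t' | b'\n' | b'\r')) {
            self.pos += 1;
        }
        let b = *input.get(self.pos)?;
        self.pos += 1;
        match b {
            b'{' => Some(Token::LeftBrace),
            b'}' => Some(Token::RightBrace),
            b':' => Some(Token::Colon),
            b',' => Some(Token::Comma),
            b'"' => {
                // Sketch only: no escape handling, just scan to the closing quote.
                let start = self.pos;
                let len = input[start..].iter().position(|&c| c == b'"')?;
                self.pos = start + len + 1;
                std::str::from_utf8(&input[start..start + len])
                    .ok()
                    .map(Token::String)
            }
            // Numbers, literals, and arrays are omitted; the sketch simply stops.
            _ => None,
        }
    }
}

/// Stage 2: a tiny state machine that turns tokens into events.
struct StreamingParser<'a> {
    tokens: Tokenizer<'a>,
    expect_key: bool,
}

impl<'a> Iterator for StreamingParser<'a> {
    type Item = ParserEvent<'a>;

    fn next(&mut self) -> Option<ParserEvent<'a>> {
        loop {
            match self.tokens.next()? {
                Token::LeftBrace => {
                    self.expect_key = true;
                    return Some(ParserEvent::StartObject);
                }
                Token::RightBrace => return Some(ParserEvent::EndObject),
                // Separators only update the state; they produce no event.
                Token::Colon => self.expect_key = false,
                Token::Comma => self.expect_key = true,
                Token::String(s) if self.expect_key => return Some(ParserEvent::Key(s)),
                Token::String(s) => return Some(ParserEvent::String(s)),
            }
        }
    }
}

/// Hypothetical entry point wiring the two stages together.
fn parse(input: &[u8]) -> StreamingParser<'_> {
    StreamingParser {
        tokens: Tokenizer { input, pos: 0 },
        expect_key: false,
    }
}
```

Because both stages are iterators, the caller pulls events one at a time and nothing is buffered between the tokenizer and the parser.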
1. src/tokenizer.rs (The "Engine")
This is the performance-critical core of the library. We achieved high performance without unsafe code by using two key techniques:
Byte-based Lookup Table (LUT)
Instead of running a multi-arm match statement on every character, we use a 256-entry static array (BYTE_PROPERTIES) as a lookup table. This allows us to classify any byte with a single, branchless array lookup.
- BYTE_PROPERTIES[b' ' as usize] -> W (Whitespace)
- BYTE_PROPERTIES[b'{' as usize] -> S (Structural)
- BYTE_PROPERTIES[b't' as usize] -> L (Literal)
- BYTE_PROPERTIES[b'9' as usize] -> D (Digit)
- BYTE_PROPERTIES[b'"' as usize] -> Q (Quote)
- BYTE_PROPERTIES[b'z' as usize] -> 0 (Invalid)
The main Iterator::next function in the tokenizer is just a match on this property, which is extremely fast.
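A table like this can be built at compile time with a const fn. The sketch below is an assumption about how such a table might look, not the library's actual definition: the numeric values behind W, S, L, D, and Q, the build_table and classify helpers, and the exact set of bytes assigned to each class (for example, treating '-' as a digit start) are all illustrative.

```rust
// Property codes named after the classes listed above; the numeric values are assumptions.
const W: u8 = 1; // whitespace
const S: u8 = 2; // structural character
const L: u8 = 3; // first byte of a literal (true, false, null)
const D: u8 = 4; // first byte of a number
const Q: u8 = 5; // string quote

/// Builds the 256-entry table at compile time. Every entry defaults to 0 (invalid).
const fn build_table() -> [u8; 256] {
    let mut t = [0u8; 256];

    t[b' ' as usize] = W;
    t[b'\t' as usize] = W;
    t[b'\n' as usize] = W;
    t[b'\r' as usize] = W;

    t[b'{' as usize] = S;
    t[b'}' as usize] = S;
    t[b'[' as usize] = S;
    t[b']' as usize] = S;
    t[b':' as usize] = S;
    t[b',' as usize] = S;

    t[b't' as usize] = L;
    t[b'f' as usize] = L;
    t[b'n' as usize] = L;

    t[b'-' as usize] = D;
    let mut d = b'0';
    while d <= b'9' {
        t[d as usize] = D;
        d += 1;
    }

    t[b'"' as usize] = Q;
    t
}

static BYTE_PROPERTIES: [u8; 256] = build_table();

/// Classifies a byte with one array lookup.
/// A u8 index can never exceed 255, so the index is always in range for a 256-entry array.
fn classify(b: u8) -> u8 {
    BYTE_PROPERTIES[b as usize]
}
```

The tokenizer's main loop would then match on the result of classify(b) rather than on the byte itself, so the match has a handful of arms instead of one per character.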
memchr for String Parsing
The "hot path" for string parsing (lex_string) is skipping over long stretches of plain text. We use the memchr crate, which uses SIMD instructions (like AVX2) under the hood to find the next " or \ byte.
- Hot Path: Scan for the closing " and the first \.
  - If no \ is found:
    - Run a quick loop to check for unescaped control characters.
    - Return a Cow::Borrowed slice of the original input (zero-allocation).
- Cold Path: Only if a \ is found do we drop into the slow, byte-by-byte loop, which decodes the escape sequences into a new String and returns Cow::Owned.
This makes parsing JSON with many simple strings exceptionally fast.
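The hot and cold paths can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual lex_string: the signature and return type are hypothetical, \uXXXX escapes are left out, and a plain standard-library scan stands in for the memchr crate so the example has no dependencies. In the real tokenizer that scan is where memchr's SIMD search would be used.

```rust
use std::borrow::Cow;

/// Stand-in for a memchr-style two-byte search, written with std only.
/// This is a scalar scan; it does not use SIMD.
fn find_quote_or_backslash(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == b'"' || b == b'\\')
}

/// Hypothetical signature: `input` starts just after the opening quote.
/// Returns the decoded string and the number of bytes consumed,
/// including the closing quote, or None on malformed input.
fn lex_string(input: &[u8]) -> Option<(Cow<'_, str>, usize)> {
    let first = find_quote_or_backslash(input)?;

    if input[first] == b'"' {
        // Hot path: no backslash before the closing quote.
        let raw = &input[..first];
        // Unescaped control characters are not allowed inside JSON strings.
        if raw.iter().any(|&b| b < 0x20) {
            return None;
        }
        let s = std::str::from_utf8(raw).ok()?;
        // Borrow directly from the input: no allocation.
        return Some((Cow::Borrowed(s), first + 1));
    }

    // Cold path: a backslash was found, so decode byte by byte into a new buffer.
    let mut out: Vec<u8> = Vec::with_capacity(input.len());
    let mut i = 0;
    loop {
        match *input.get(i)? {
            b'"' => {
                let s = String::from_utf8(out).ok()?;
                return Some((Cow::Owned(s), i + 1));
            }
            b'\\' => {
                let decoded = match *input.get(i + 1)? {
                    b'"' => b'"',
                    b'\\' => b'\\',
                    b'/' => b'/',
                    b'n' => b'\n',
                    b't' => b'\t',
                    b'r' => b'\r',
                    b'b' => 0x08,
                    b'f' => 0x0C,
                    // \uXXXX is omitted from this sketch and rejected.
                    _ => return None,
                };
                out.push(decoded);
                i += 2;
            }
            b if b < 0x20 => return None,
            b => {
                out.push(b);
                i += 1;
            }
        }
    }
}
```

Returning Cow lets the caller treat both cases uniformly as a &str while only the strings that actually contain escapes pay for an allocation.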