Concepts Parser - UBOdin/mimir GitHub Wiki
Mimir uses Sparsity, a simple SQL parser based on FastParse. General SQL primitives should be added directly to Sparsity.
Mimir extends Sparsity with a set of new commands for MimirSQL. If you want to add new commands, the FastParse documentation page is quite thorough, but here's a quick rundown of a few key concepts and gotchas that arise with MimirSQL specifically.
## Defining new rules
Use the following template for new rules (e.g., new command types):

```scala
def myTermName[_: P] = P[MyReturnType](
  ruleDefinitionGoesHere
)
```
Most new rules re-use existing rules defined by Sparsity.
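To make the template concrete, here's a minimal sketch of a hypothetical new command rule. `PlotCommand`, `plotCommand`, and `rawIdentifier` are invented for illustration; in real MimirSQL code you would reuse Sparsity's identifier rule rather than define your own:

```scala
import fastparse._, MultiLineWhitespace._

// Hypothetical AST node for an invented "PLOT <table>" command.
case class PlotCommand(table: String)

// Stand-in for Sparsity's identifier rule, kept local so this
// sketch is self-contained.  `~~` concatenates without consuming
// whitespace between the two pieces.
def rawIdentifier[_: P]: P[String] = P(
  (CharIn("a-zA-Z_") ~~ CharsWhileIn("a-zA-Z0-9_", 0)).!
)

// Follows the template above: P[ReturnType]( ruleDefinition )
def plotCommand[_: P]: P[PlotCommand] = P(
  StringInIgnoreCase("PLOT") ~ rawIdentifier
).map { PlotCommand(_) }
```

Assuming FastParse 2.x, `parse("PLOT inventory", plotCommand(_))` yields a `Parsed.Success` wrapping `PlotCommand("inventory")`.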
FastParse defines a range of rule leaves. The most common in MimirSQL are:

- `StringInIgnoreCase("STRING", "OTHER")` : Case-insensitive match of any string in the list.
- `"STRING"` : Case-sensitive string match.
Rules can be combined:

- `rule1 | rule2` : (alternatives) Try rule1; if it fails, try rule2. Both rule1 and rule2 must return the same type.
- `rule1 ~ rule2` : (concatenation) Try rule1; if it succeeds, continue parsing with rule2.
- `rule1.?` : (optional) Try rule1. If it succeeds, return `Some(...)`, otherwise return `None`.
- `rule1.!` : (capture) Replace rule1's return value with the string matched by rule1.
- `&(rule1)` : (lookahead) Try rule1. If it succeeds, rewind the input to before the try and succeed. If it fails, fail.
- `!(rule1)` : (negative lookahead) Try rule1. If it succeeds, fail. If it fails, rewind the input to before the try and succeed.

Parenthesization works as you would expect.
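As a concrete illustration, the following made-up rule combines several of these operators; none of these names come from MimirSQL itself:

```scala
import fastparse._, MultiLineWhitespace._

// Alternatives: matches any one of these keywords, case-insensitively.
def keyword[_: P]: P[Unit] = StringInIgnoreCase("SELECT", "FROM", "WHERE")

// Negative lookahead: a word only counts as an identifier if it is
// NOT a keyword.  `.!` captures the matched text as a String.
def ident[_: P]: P[String] = P( !keyword ~ CharsWhileIn("a-zA-Z0-9_").! )

// `.?` makes DISTINCT optional (captured, so we get an Option[String]);
// `~` chains the pieces in sequence.
def tinySelect[_: P]: P[(Option[String], String)] = P(
  StringInIgnoreCase("SELECT") ~ StringInIgnoreCase("DISTINCT").!.? ~ ident ~ ";"
)
```

On `"SELECT DISTINCT foo;"` this produces `(Some("DISTINCT"), "foo")`; on `"SELECT foo;"` the optional clause yields `None`.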
In general, I find it most intuitive to think of FastParse parsers as imperative programs with a sort of transactional execution. Each `~`-separated line of code is an imperative statement that consumes a chunk of the input and moves on to the next chunk. Each `|`-separated line of code is like an exception handler: if option 1 fails, rewind the input to before running option 1 and try option 2 instead.
## Backtrack-Blocking Composition (`~/`)

There's one additional frequently used operator for combining rules: `rule1 ~/ rule2`. This behaves almost exactly like `~`, but with one important caveat: it kills the parser's memory of all `|` options higher in the stack. There are good reasons for doing this: it improves parser performance, but more importantly for us, it makes for better error messages.
For example, consider `ANALYZE FEATURES`, implemented towards the top of MimirSQL in the rule `analyzeFeatures`. There are two commands that start with the word `ANALYZE`. However, once the parser sees `FEATURES`, we know that we're locked into this particular branch. Thus, `StringInIgnoreCase("FEATURES")` is followed by `~/`. Let's look at what happens when the parser processes the following input, first with a plain old `~` and then with `~/`:

```sql
ANALYZE FEATURES IN FOO;
```
In either case, the first two rules of `analyzeFeatures` (`StringInIgnoreCase("ANALYZE") ~ StringInIgnoreCase("FEATURES")`) pass, while the third rule (`StringInIgnoreCase("OF")`) fails. Thus, in either case, `analyzeFeatures` fails.
With a plain old `~`, the failure propagates up and out into the `statement` rule. The parser then tries all remaining options in `statement`, but they all fail too. The user then gets an error message associated with `statement`:

```
ANALYZE FEATURES IN FOO;
^-- Expected one of ANALYZE, COMPARE, ...
```
With a `~/`, failure propagation doesn't backtrack any further than the operator itself. Thus, the user gets a far more informative error message associated with `analyzeFeatures`:

```
ANALYZE FEATURES IN FOO;
                 ^-- Expected OF
```
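Putting this together, a simplified sketch of the cut placement looks like the following. The real `analyzeFeatures` returns a proper AST node and uses Sparsity's identifier rule; `tableName` here is a stand-in:

```scala
import fastparse._, MultiLineWhitespace._

// Stand-in for Sparsity's identifier rule.
def tableName[_: P]: P[String] = P( CharsWhileIn("a-zA-Z0-9_").! )

def analyzeFeatures[_: P]: P[String] = P(
  StringInIgnoreCase("ANALYZE") ~
  StringInIgnoreCase("FEATURES") ~/  // cut: no backtracking past this point
  StringInIgnoreCase("OF") ~
  tableName
)
```

Fed `ANALYZE FEATURES IN FOO;`, this version reports the failure at `IN` as `Expected OF` instead of letting it bubble up to the top-level `statement` rule.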
## Common Bugs and Gotchas
### Identifier Bugs
**Symptom:** The parser doesn't seem to want to recognize a keyword in a parsed string.
Classical parsers generally rely on a preprocessing stage called Tokenization. Tokenization transforms the input stream of raw bytes into a stream of slightly more meaningful Tokens (e.g., "QuotedString", "SELECT", or "identifier") that the parser can then build a more useful structure out of. Consider the following example:

```sql
SELECT FOO BAR FROM BAZ;
```

A tokenizer would process this into:

```scala
Seq( SelectKeyword, Identifier("FOO"), Identifier("BAR"), FromKeyword, Identifier("BAZ"), Semicolon )
```

The parser would then take this input and turn it into a selection projecting "BAZ.FOO" to "BAR".
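For contrast, a classical tokenizer stage might look something like this sketch. This is not how FastParse works, and the token names are invented for illustration:

```scala
sealed trait Token
case object SelectKeyword extends Token
case object FromKeyword   extends Token
case object Semicolon     extends Token
case class  Identifier(name: String) extends Token

// Scan the raw input for words and punctuation, then classify each
// lexeme as a keyword, identifier, or symbol.
def tokenize(input: String): Seq[Token] =
  "[A-Za-z_][A-Za-z0-9_]*|;".r.findAllIn(input).toSeq.map {
    case ";"                               => Semicolon
    case s if s.equalsIgnoreCase("SELECT") => SelectKeyword
    case s if s.equalsIgnoreCase("FROM")   => FromKeyword
    case s                                 => Identifier(s)
  }
```

Here `tokenize("SELECT FOO BAR FROM BAZ;")` produces exactly the `Seq` shown above, and the downstream parser never has to worry about whether a given word is a keyword.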
In the documentation, FastParse's author is pretty vocal about Tokenization being inappropriate for FastParse. There are certainly a number of arguments in favor of ditching Tokenization. However, the lack of tokens means that the parser needs a little bit of extra logic to distinguish cases. FastParse does deal with whitespace reasonably, so you might think of the input it sees as:

```scala
Seq( "SELECT", "FOO", "BAR", "FROM", "BAZ", ";" )
```
Note, however, that there's no structural distinction between identifiers and keywords: "BAR" and "FROM" are indistinguishable. Consider the following input:

```sql
SELECT FOO FROM FROM BAZ;
```

This is not legal SQL (`FROM FROM`), but from the perspective of a non-tokenized parser it is 100% indistinguishable from the prior case. This means we need to create that distinction manually. The `Sparsity.identifier` rule provides this distinction by keeping an explicit list of keywords (see `avoidReservedKeywords`) that are not allowed to be identifiers.
This list may need to be extended over time as new keywords are added.
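The mechanism can be sketched roughly like this; the toy `reserved` set and `identifier` rule below are illustrative, and the real list in `avoidReservedKeywords` is considerably longer:

```scala
import fastparse._, MultiLineWhitespace._

// Toy subset of the reserved-word list.
val reserved = Set("SELECT", "FROM", "WHERE", "AND", "OR")

def rawWord[_: P]: P[String] = P( CharsWhileIn("a-zA-Z0-9_").! )

// Reject words on the reserved list, so that in `FROM FROM` the first
// FROM can never be parsed as an identifier.
def identifier[_: P]: P[String] =
  rawWord.filter { w => !reserved.contains(w.toUpperCase) }
```

If a new keyword is added to MimirSQL but not to the reserved list, the parser may happily consume it as an identifier and then fail somewhere confusing further along, which is the usual source of the symptom above.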