Parsing is Difficult - oilshell/oil GitHub Wiki

Blog Theme

Language Design and Theory of Computation -- CFGs aren't powerful enough. Java has two grammars.

Why Lexing and Parsing Should Be Separate -- criticism of PEGs

https://news.ycombinator.com/item?id=13821829 -- Real parsers are hand-written; they use code generators or meta-languages.

https://news.ycombinator.com/item?id=13041646 -- metalanguage feedback in response to old META II paper. Ruby parser is complex. JRuby parser is an example of writing two parsers.

https://news.ycombinator.com/item?id=13630686 -- my own metalanguage

https://news.ycombinator.com/item?id=15713825 -- someone who thinks parsing languages is simpler than it actually is (regarding language server protocol)

https://news.ycombinator.com/item?id=14379114 -- Walter Bright overstating the applicability of CFGs

Note about error messages: are parse errors really hard? They all seem similar, as long as you have location info? Python doesn't seem to have any special handling of parse errors. It uses a grammmar.

I think type-checking error messages are hard, like Clang diagnostics. And runtime errors to an extent. But parse errors may be easy if the language is straightforward enough.

The one place I might want an advanced error message is for $((. You want to know where it started thinking of things as arithmetic. The place you fix is different than the place the error occurred.

My reply to someone who says parsing is easy, with points from this page: Parsing Is Not Solved
- another point: Microsoft and JetBrains have parsing technology that is beyond the state of the art in open source. It's still advancing.

Real Languages have at least Two Parsers

Python: CPython parser, lib2to3, redbaron
Ruby: MRI parser, JRuby parser
Go: C parser, Go parser
Scala: seems like scala-meta might have another parser
JavaScript: many different parsers
Unified parsers:
- Clang
- Roslyn
SQL:
- A third party parser for Postgres SQL
- https://github.com/uber/queryparser/blob/master/dialects/hive/src/Database/Sql/Hive/Parser.hs -- it's written in Haskell with Parsec, like ShellCheck
- It's a pity that the thing has to get reimplemented all over so many times. For example, you'd think you could reuse PostgreSQL's parser, but its code is autogenerated and riddled with side effects -- not side effects per se, but semantic actions that are embedded in the grammar. gram.y is 16K+ lines!!!

From coverage.py. The Python stdlib tokenizer isn't good enough, because it's not lossless! It's also a second tokenizer, since the C one is not wrapped.

def phys_tokens(toks):
    """Return all physical tokens, even line continuations.

    tokenize.generate_tokens() doesn't return a token for the backslash that
    continues lines.  This wrapper provides those tokens so that we can
    re-create a faithful representation of the original source.

    Returns the same values as generate_tokens()

TODO: Look at the list of linters here: https://github.com/mcandre/linters
- what parsers do they use?

Parsing is Slow

Binary AST format proposal: https://github.com/syg/ecmascript-binary-ast
v8 has eager and lazy parsers (see YouTube video)
Streaming Compilation with WebAssembly -- contrast with JavaScript: https://news.ycombinator.com/item?id=16169236

Well another takeaway is that people ship a lot of JavaScript over the wire that they never use!!!

Blog Outline

The state of the art is hand-written parsers
The state of the art is two hand-written parsers (possibly written to the same spec, or not)
The state of the art is two grammars? (TODO: Look at Java). Grammars are no longer declarative.
State of the art:
- v8 and Clang for performance
- TypeScript / C# for generality
- v8 for safety? any others?

Gaps Between Theory and Practice

https://www.reddit.com/r/ProgrammingLanguages/comments/74ktjg/what_compromises_have_you_made/do0rsku/ -- In theory, parsing is about testing whether a language is in a set. In practice, it's also about creating an AST/LST structure. This is where the concept of "ambiguity" arises -- more than one possible sequence of productions can lead to the same string.
- It's also analogous to the "group capture" problem in regular expressions. We want to retain linear time performance and specify greedy/nongreedy behavior when extracting substrings: https://swtch.com/~rsc/regexp/

Parsing Requires Little Tricks

Operator Precedence Parsing, even though C grammars can manually encode it (https://news.ycombinator.com/item?id=15470988)
Handling left-recursive rules in a recursive descent parser: https://tavianator.com/bfs-from-the-ground-up-2/
- Matters for shell's equal-precedence || and &&
- this is the difference between treating a parsing as set membership vs. creating a useful tree structure
Parse Tree vs. AST vs. LST (Lossless Syntax Tree Pattern)
- these are punted onto semantic actions
backtracking or predictive recursive descent?
- lookahead?
C's context sensitivity (types vs. identifiers)
dangling else problem (the canonical example of a shift-reduce conflict)
- Is Java Context Free? My comments on Reddit

Contrast with paper: Pure Declarative Syntax Lost and Regained.

Many Parsers Parse Twice

Python does this silly thing where it re-parses its own tokens. For example, strings, multiline strings, raw strings, byte strings, etc.
- See astcompiler/astbuilder.py module in in PyPy! It calls parse_number() and uses the parsestring module.
- See parsestring() and parsenumber() functions in Python/ast.c. Wow I can't believe they do this!
- Better solution: use lexer state. String literals are their own language with C escapes.
- Conclusion: The Lexer/Parser interface is broken. Lexer modes are better!
  - for string literals
  - for numeric literals
Bash also parsers code twice -- to find the end of $(, $((, etc. and then to parse what's inside it!

Idea for Parsing Tool

Stateful lexer -- generalization of regular lexer.
- it's meant to handle \n in strings, 1.0e100 in numbers, etc.
LL(k) for efficiency
pratt parsing (or shunting yard) for efficiency and compact description of a grammar
LR(k) for C, Awk, maybe Ruby, R, etc.
- TODO: research the ways in which LL(k) is not sufficient for those languages?
- we know about C prototypes declarations vs. definitions. This is LR(k), but requires infinite k for LL(k).
- JavaScript => function.
Zephyr ASDL for abstract syntax and Lossless Syntax Tree Pattern
what language for semantic actions? Maybe reuse the OVM data structures and library? OVM could be based on algebraic data types?

Papers

A Simple, Possibly Correct Parser for C11. They use OCaml lex, menhir, and a Context object to do the "lexical feedback". The Related Work section at the end is very good.

Parsing Network Protocols Is Also Difficult

langsec line of research is using the wrong formalism. The update on calc-regular languages is immature and can't express things like varint decoding (AFAICT):

https://news.ycombinator.com/item?id=16126234

http://duriansoftware.com/joe/Constructing-human-grade-parsers.html