Why Lexing and Parsing Should Be Separate - oils-for-unix/oils GitHub Wiki

Summary: Do the easy thing with the fast algorithm, and the hard thing with the slow algorithm. Lexing and parsing are different things.

Note: the arguments here are more important for big languages/parsers. For small languages, these issues don't matter as much.

  • Code Clarity: Lexing and parsing are fundamentally different tasks. In practice there isn't a precise line between them (within a given language, or across languages), but here's a clear way to organize it:
    • A lexer recognizes the non-recursive structure of a language. Its output is a token stream.
    • A parser recognizes the recursive structure of a language. Its output is a tree. (Both halves are sketched in code after this list.)
  • Code and language stability: Lexing is straightforward and "solved", but parsing isn't. The best parsing model isn't clear up front, since it depends on the details of the language. At some point, you may want to change the parsing model while keeping the lexer the same.
    • Example: Rust changed parsing models early in its life. Graydon Hoare said: "IMO Rust isn’t ideally thought of as LL(k) anymore; I tried to keep it that way for a long time, but it’s grown a lot of bits that work better in LR-family. I highly recommend just deleting anything ANTLR-related and focusing on LR(k) or LALR(k) grammars."
    • Example: CPython did this in Python 3.9, after 30 years: it moved from the LL(1)-based pgen to a PEG parser that operates on tokens. The lexer is kept the same.
    • Note: If you're parsing an existing language, the safest option is to use the same parsing model that the original parser used, and the same definition of tokens to divide the lexer and parser.
  • Performance: Any lexer you write is likely to be fast. On the other hand, parsing is more complex, and can be slow.
    • Lexing with regular languages (i.e. regular expressions) is nice because it can be done in O(n) time and O(1) extra space (using automata-based techniques). Many lexers are also straightforward to write by hand.
      • There is essentially one algorithm for lexing -- march forward through the input exactly once.
    • Parsing CFGs is O(n^3) in general, and parsing PEGs either takes O(n) space (packrat parsing) or exponential time in the worst case (backtracking).
    • Important point: separating lexing and parsing reduces the n in your O(n^3). In any real program, there are many more characters than tokens.
    • Case study: Speeding up the Sixty compiler. This is a compiler written in Haskell with a front end that used parser combinators. Parsing took 45% of the time at first; after other optimizations, it took 30% of the time. Optimization 7: Separate Lexer: "So the change I made here is to write a faster lexer that's separate from the parser, and then make the parser work on the list of tokens that the lexer spits out." This sped up the program by 37%, which was one of the bigger wins for the whole compiler.
    • Case study: Lessons learned from writing ShellCheck, GitHub’s now most starred Haskell project. This part is an argument for a separate lexer, which Oil has: "Be careful if your parser framework makes it too easy to backtrack. Look up good parser design. I naively wrote a character based parser function for each construct like ${..}, $(..), $'..', etc, and now the parser has to backtrack a dozen times to try every possibility when it hits a $. With a tokenizer or a parser that read $ followed by {..}, (..) etc, it would have been much faster."
  • Easier composition of sublanguages with modal lexers. On this thread, there seemed to be the misconception that lexers inhibit language composition. In fact, it's just the opposite: the lexer can take a mode parameter, and each mode corresponds to the different lexical structure of a sublanguage (templated strings, etc.).
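
To make this concrete, here's a minimal sketch of both halves in Python. It is not Oils code: the modes, token names, and toy grammar (sums of numbers, names, and interpolated strings) are invented for illustration. The lexer marches forward through the input exactly once, switching between two modes; the parser consumes the token stream and builds a tree whose leaves are tokens. Note how `${x}` inside a string is matched in a single step, so nothing ever backtracks on `$` (the ShellCheck lesson above).

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str
    value: str

# One pattern per lexer mode; each mode captures the (non-recursive)
# lexical structure of one sublanguage.
LEXER_MODES = {
    'EXPR': re.compile(
        r'(?P<NUM>\d+)|(?P<NAME>\w+)|(?P<PLUS>\+)|(?P<LPAREN>\()'
        r'|(?P<RPAREN>\))|(?P<LEFT_DQ>")|(?P<WS>\s+)'),
    # Inside "...": ${name} is recognized in one step, so the parser
    # never has to try multiple alternatives when it sees a '$'.
    'DQ': re.compile(r'(?P<INTERP>\$\{\w+\})|(?P<RIGHT_DQ>")|(?P<LIT>[^"$]+)'),
}

def lex(source):
    """March forward through the input exactly once, switching modes."""
    mode, pos, tokens = 'EXPR', 0, []
    while pos < len(source):
        m = LEXER_MODES[mode].match(source, pos)
        if not m:
            raise SyntaxError(f'bad character at offset {pos}')
        kind = m.lastgroup
        if kind != 'WS':
            tokens.append(Token(kind, m.group()))
        if kind == 'LEFT_DQ':      # entering the string sublanguage
            mode = 'DQ'
        elif kind == 'RIGHT_DQ':   # back to the expression sublanguage
            mode = 'EXPR'
        pos = m.end()
    tokens.append(Token('EOF', ''))
    return tokens

def parse(tokens):
    """Recognize the recursive structure: expr = term ('+' term)*."""
    pos = 0

    def eat(kind):
        nonlocal pos
        tok = tokens[pos]
        if tok.kind != kind:
            raise SyntaxError(f'expected {kind}, got {tok.kind}')
        pos += 1
        return tok

    def expr():
        node = term()
        while tokens[pos].kind == 'PLUS':
            eat('PLUS')
            node = ('add', node, term())
        return node

    def term():
        kind = tokens[pos].kind
        if kind in ('NUM', 'NAME'):
            return eat(kind)       # tokens are the leaves of the tree
        if kind == 'LPAREN':
            eat('LPAREN')
            node = expr()
            eat('RPAREN')
            return node
        if kind == 'LEFT_DQ':
            eat('LEFT_DQ')
            parts = []
            while tokens[pos].kind in ('LIT', 'INTERP'):
                parts.append(eat(tokens[pos].kind))
            eat('RIGHT_DQ')
            return ('dq', parts)
        raise SyntaxError(f'unexpected {kind}')

    tree = expr()
    eat('EOF')
    return tree

print(parse(lex('1 + "x is ${x}" + (2 + 3)')))
```

Adding another sublanguage is just adding another mode (one more table entry); the parser's recursion doesn't change.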

I used to use a PEG for my language, but I had to move away because the error messages were so bad: I couldn't iterate on the parser, because I couldn't understand what was going wrong.

Threads/comments:

Not a strawman:

https://fosdem.org/2019/schedule/event/perl6newtool/ - "The great part is that you no longer need to split your language implementation in traditional phases: lexer, parser, etc." -- IMO this is a mistake, at least for medium/large languages.

  • Consistent handling of Location Info -- You will likely want to attach filename/line/column information to tokens in the lexer. If you follow the style that tokens are the leaves of the AST, then the parser can be ignorant of this concern.
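
For instance (a generic sketch, not how Oils actually defines its types): the lexer stamps each token with a position, the parser stores whole tokens in tree nodes rather than bare strings, and any later consumer can recover a location by walking down to a leaf.

```python
from typing import NamedTuple, Union

class Token(NamedTuple):
    kind: str
    value: str
    line: int  # filled in by the lexer; the parser never computes positions
    col: int

class BinaryOp(NamedTuple):
    op: 'Token'    # keep the operator *token*, not just the string '+'
    left: 'Node'
    right: 'Node'

Node = Union[Token, BinaryOp]

def location(node: Node):
    """Any node can report a position by walking to its leftmost token."""
    while not isinstance(node, Token):
        node = node.left
    return node.line, node.col
```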

  • IDE support likely favors separate lexers. According to this video about an IDE for Raku, the IntelliJ platform inherently separates lexing from parsing: the leaves of the AST are tokens, and the lexer is responsible for generating them.

  • from matklad -- Lexers are easy to make incremental, but parsers aren't. This matters for IDE support.
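
To make matklad's point concrete, here's a sketch of incremental re-lexing (invented token set; it assumes a stateless, mode-free lexer, so that a token is fully determined by the text at its start offset). After an edit, only the tokens around the edit are rescanned, and the scan stops as soon as it resynchronizes with the old token stream. There is no equally simple trick for a parser, where a one-character edit can reshape the tree arbitrarily far away.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str
    text: str
    start: int  # byte offset

TOKEN_RE = re.compile(r'(?P<NUM>\d+)|(?P<NAME>\w+)|(?P<OP>[+*()=])|(?P<WS>\s+)')

def lex_one(text, pos):
    m = TOKEN_RE.match(text, pos)
    if not m:
        raise SyntaxError(f'bad character at offset {pos}')
    return Token(m.lastgroup, m.group(), pos)

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        tok = lex_one(text, pos)
        tokens.append(tok)
        pos += len(tok.text)
    return tokens

def relex(old_tokens, new_text, edit_start, removed, inserted):
    """Re-lex only around an edit, reusing old tokens before and after it."""
    delta = inserted - removed
    # Tokens that end strictly before the edit can't have changed.
    prefix = [t for t in old_tokens if t.start + len(t.text) < edit_start]
    pos = prefix[-1].start + len(prefix[-1].text) if prefix else 0
    # Tokens that start after the removed span survive, shifted by delta.
    tail = [t._replace(start=t.start + delta)
            for t in old_tokens if t.start >= edit_start + removed]
    out, ti = list(prefix), 0
    while pos < len(new_text):
        while ti < len(tail) and tail[ti].start < pos:
            ti += 1                   # this old token got overwritten
        if ti < len(tail) and tail[ti].start == pos:
            out.extend(tail[ti:])     # resynchronized: reuse the rest
            return out
        tok = lex_one(new_text, pos)
        out.append(tok)
        pos += len(tok.text)
    return out

# Replace '1' with '42' in 'x = 1 + 2'; only two tokens get rescanned.
print(relex(lex('x = 1 + 2'), 'x = 42 + 2', edit_start=4, removed=1, inserted=2))
```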