Parsing: Parser Doc Collections - go-sqlparser/current GitHub Wiki

History on Parsers

https://goccmack.github.io/posts/2020-05-31_gogll/

Most well-known parsing techniques, such as: LL(1)/LL(*) (ANTLR), LALR (Yacc) or LR(1) (gocc), handle only a subset of context-free grammars and can fail unpredictably when hidden left recursion occurs in the grammar.
Generalised LL (GLL) parsing [Scott 2010,2012] is a parsing technique that that can handle any context-free grammar. It parses from left to right and produces a parse forest containing all valid parse trees.
Generalised LR (GLR) parsers [Grune 2008] also handle all context-free grammars but are more complex to implement than GLL.

The signature achievement of Computer Science is the theory of Computation (see [Sipser 2013] for a comprehensive introduction), comprising: automata and languages; computability theory; and complexity theory.

The quest for practical parsing algorithms for context-free languages led to the landmark 1965 paper by Knuth [Knuth 1965] on LR(k) languages.

De Remer invented LALR parsing in 1969 [De Remer 1969] and published an algorithm to generate the lookahead sets in 1982 [De Remer 1982].

David Pager's beautiful paper on his Practical General Method for Generating LR(k) Parsers (PGM) appeared in 1977 [Pager 1977].

These two techniques, LALR and PGM, reduced the size of LR(1) parsing tables sufficiently to make LR parsing practical on the hardware of the time. Together with LL(1), LALR and LR(1) have remained the primary workhorses of parser writers.

Much work was done in the following years to find practical parsing algorithms for more general context-free languages. An excellent source for parsing up to and including generalised LR is by Grune and Jacobs [Grune 2008].

In the mean time most parser writers either hand crafted an LL(1) parser or used tools such as ANTLR (LL(1)/LL(*)), Yacc (LALR) or gocc (LR(1)) to generate parsers.

In 2010 Scott and Johnstone published a practical algorithm for generalised LL (GLL) recognition [Scott 2010], which could handle all context free grammars. Their 2012 paper [Scott 2012] describes GLL parse-tree generation, which is necessary for a functional parser. In a 2019 paper with Van Binsbergen [Scott 2019] they described a variant of the GLL algorithm that represents the parse forest as a set of binary subtrees.

For many years I have disliked the ad hoc conflict resolution required to parse practical languages with a single token look-ahead, be it by LL(1) or LR(1). In 2019 I implemented a GLL parser generator for the [Scott 2019] technique. This compiler generator, which also generates a linear-time FSA lexer for the tokens of the same grammar, is available as open source (gogll). Since 2020 gogll generates code for both Go and Rust.

golang uses a handwritten recursive decent parser and handwritten lexer

https://news.ycombinator.com/item?id=19008781

golang uses a handwritten recursive decent parser and handwritten lexer.

I can't think of a single major tool that uses ANTLR or bison. I say that being a huge ANTLR fan. Parser/lexer generators are really easy to use for DSLs but for real languages and tools, they're too heavyweight and constraining.

That's the gccgo frontend; the more usual Go compiler is here: https://github.com/golang/go/tree/master/src/cmd/compile/internal/syntax
It's still handwritten though so your point stands.

GCC used to use bison. I don't think GCC became a major project recently. But you're correct that recursive descent is more efficient. Not everything is efficiency though (parsers are usually 1% of compile time since they're dirt cheap compared to code generator and especially static analyzer and optimizer). Bison makes parsing extremely simple. I can't see why anyone would use recursive descent unless they're a super major project and need to squeeze every ms. (gnulinux on Jan 27, 2019)
- A big part of why tools move away from Bison and ANTLR isn’t performance, but UX (especially error reporting) (wwright on Jan 27, 2019)
  - In my experience recursive descent doesn't have any better error reporting than Bison or ANTLR. Parser combinators are much better if error reporting is what you're aiming for. Also, for LALR/LR parsers like Bison/ANTLR, the canonical trick is to encode errors in your grammar, which really isn't a huge deal. E.g. in pseudocode ... (gnulinux on Jan 27, 2019)