
Why you should not use (f)lex, yacc and bison

https://tomassetti.me/why-you-should-not-use-flex-yacc-and-bison/

Written by Gabriele Tomassetti
in Application modernization, Parsing


Lex and Yacc were the first popular and efficient lexer and parser generators, and flex and Bison were the first widespread open-source versions compatible with the original software. Each of these tools has more than 30 years of history, which is an achievement in itself. For some people they are still the first software that comes to mind when talking about parsing. So, why should you avoid them? Well, we found a few reasons based on our experience developing parsers for our clients.

For example, we had to work with existing lexers in flex and found it difficult to add modern features, like Unicode support, or to make the lexer re-entrant (i.e., usable from multiple threads). With Bison our clients had trouble organizing large codebases, and we found it difficult to improve the efficiency of a parser without rewriting large parts of the grammar. The short version is that there are tools that are more flexible and productive, like ANTLR.

The first part of this article explains the history of these two tools, while the second analyzes their flaws. If you do not care about their history, you can skip ahead to the second part.

History of Parsing Pioneers

We start from the beginning, of course, with a brief look at the history of these tools.

The Beginnings

Both tools have an interesting history, although Yacc's story looks like a soap opera.

The Story of Lex

Lex is a lexer generator, that is to say, a tool to generate lexical analyzers. In case you do not know what a lexer is, these are the basics. A lexer accepts plain text as input and organizes it into the corresponding tokens. All you need is a grammar that describes the rules to follow. So, for instance, you can tell Lex that any grouping of digits (i.e., [0-9]+ characters) should be classified as an INT. A standalone lexer has few uses, because it basically just organizes sets of characters into categories. It can be used for things like recognizing the elements of a programming language to perform syntax highlighting. You can read more about the basics of parsing in our article: A Guide to Parsing: Algorithms and Terminology.
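To make the digit rule above concrete, here is a minimal sketch of what it might look like in a Lex/flex specification. The token code for INT is an illustrative placeholder; in a real project it would normally come from the parser's generated header.

```lex
%{
/* Illustrative token code; normally provided by the parser
   (e.g., via a y.tab.h header). */
#define INT 258
%}
%%
[0-9]+     { return INT;        /* any run of digits becomes an INT token */ }
[ \t\n]+   { /* skip whitespace */ }
.          { return yytext[0];  /* pass any other character through as-is */ }
%%
```

Each rule pairs a regular expression with an action; the generated scanner repeatedly matches the longest possible input against these patterns and runs the corresponding action.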

Mike Lesk and Eric Schmidt (who later became CEO of Google) developed it as proprietary software in 1975, in C. It was an important piece of software, both for its quality and for the needs of development at that time. In fact, it became part of the POSIX standard: essentially, any respectable OS needed to have a tool like that.

The Story of Yacc

Yacc is a parser generator, specifically a tool to generate LALR parsers. Essentially, a parser groups tokens (like the ones generated by Lex) into logical structures. Again, all you need is a grammar that describes the rules to follow. For example, you can generate a parser that can take an IDENTIFIER token (e.g., a variable), an EQUAL token, and an INT token and understand that together they form an assignment. A parser deals with much more complicated problems than a lexer does.
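The assignment example above could be sketched as a Yacc/Bison grammar rule like the following. The rule and token names are illustrative; a complete grammar would also declare the tokens with %token and define the lexer rules that produce them.

```yacc
/* Sketch of a Yacc/Bison rule: three tokens grouped
   into one logical structure. Token declarations omitted. */
assignment
    : IDENTIFIER EQUAL INT
    ;
```

When the generated parser sees the token sequence IDENTIFIER EQUAL INT, it reduces it to an `assignment` node, and any action attached to the rule (e.g., building an AST node) runs at that point.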

Stephen Johnson developed Yacc during the early 1970s, writing (and rewriting) it many times between 1973 and 1978 as proprietary software. He wrote the latest version in C, although he initially wrote it in B. It started as a practical tool to help the development of the B language. It was continually improved, and it effectively also helped develop the theory of parsing. Johnson was a computer scientist who made sure that every trick to improve speed was theoretically sound and grounded in computer science. That is a level of dedication you rarely see in contemporary software. Yacc also became part of the POSIX standard.

Since Lex is used to generate lexers and Yacc to generate parsers, they were complementary and often used together. They were simply the best software available in their respective niches.

The Open Source Versions

As mentioned, the initial versions were proprietary software. This means that their use was restricted, which left some users dissatisfied. That is why several open-source versions of both tools were created. They were often compatible with the grammars of the original software, but also added their own improvements. The Lex-compatible tool that gained prominence was flex, which is still used today. It uses the BSD license and also supports C++. There are also countless implementations of Lex in different languages.

There were two major replacements for Yacc: Berkeley Yacc (byacc) and GNU Bison. The first is in the public domain, while the other, obviously, uses the GPL license. The original author of both tools was Robert Corbett and, confusingly, byacc was originally called bison. Furthermore, Bison was not originally compatible with Yacc, but it was made compatible by Richard Stallman, who brought it into the GNU world. In the meantime, Corbett rewrote byacc from scratch to be compatible with Yacc. Byacc was originally the most popular Yacc-compatible tool, but now GNU Bison is (byacc is still developed). There are also ports of Bison to other languages.

In many ways Yacc was an innovative piece of software, because of its impact on the development of parsers. Lex was less innovative; however, it also had a long-lasting influence. You will find that many standalone lexer generators are inspired by Lex. This is both because of its quality and because there is less need for standalone lexers. So, Lex-inspired lexer generators are still good enough for common uses.

There are even tools that are compatible with, and replace, both Lex and Yacc together, like PLY in Python.

Comparison between flex, bison and modern parser generators

Now that the introduction is over, we can see why you should avoid these tools, if you have the chance. We want to avoid just talking about what is possible and ideal, and focus instead on what is really feasible today. That is why we are going to compare flex and Bison with a modern parser generator: ANTLR.

A Glorious History Comes with its Set of Problems

Now that we know the history of these tools, we can already imagine some of the problems. These are very old tools, created to be compatible with even older software. Even worse, their design is based on really old principles that are not best practices today. Can you even imagine the quality of code that has been written and fixed over 30 years?

Their main selling point was, and still is, compatibility with old codebases. In short, there are mainly three cases to consider:

  1. if you have existing parsers in maintenance mode using flex and Bison, it makes sense to keep using them;
  2. if you have existing parsers that are actively developed, you should consider porting them to more productive tools; the effort of porting them will be repaid with greater productivity;
  3. if you need to develop new parsers from scratch, there are too many downsides to using them, and you should opt for more productive tools, like ANTLR.

Summary of Differences

Here is a brief summary of the comparison between flex/Bison and ANTLR.

  • Stability and Development of New Features
    • flex and Bison are stable and maintained, but there is no active development. C++ support can be of limited quality.
    • ANTLR is actively developed, and new features are added periodically.
  • Separation between Grammar and Code
    • flex and Bison maintain an old-school design with little support for readability or productivity.
    • ANTLR is a modern parser generator with a design that favors a portable grammar, usable for multiple target languages.
  • Unicode Support
    • flex does not directly support Unicode.
    • ANTLR supports all flavors of Unicode and even makes it easy to select Unicode properties (e.g., Emoji).
  • They Use Different Licenses
    • flex uses a BSD license, while Bison uses the GPL. Bison waives the license requirements for most generated parsers, but it can still be a problem.
    • ANTLR uses the BSD license.
  • Grammar Format
    • Bison only supports BNF, which makes grammars more complicated.
    • ANTLR supports EBNF, which makes it easier to describe most languages.
  • Features of Lexing Algorithms
    • flex supports regular expressions to define rules, which works for most elements, but adds complexity.
    • ANTLR supports context-free expressions to define rules, which simplifies defining some common elements.
  • Features of Parsing Algorithms
    • Bison supports two parsing algorithms that cover the whole range of performance and languages. It gives cryptic error messages.
    • ANTLR supports one algorithm that works for all languages, usually with great performance.
  • Community and Documentation
    • flex and Bison have smaller and fragmented communities, but they have good documentation.
    • ANTLR has one vibrant community and good documentation.
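The grammar-format difference is easiest to see side by side. Here is how a comma-separated list of numbers might be described in each tool; the rule and token names are illustrative sketches, not taken from a real grammar. In plain BNF, repetition has to be encoded through recursion:

```yacc
/* Bison (BNF): repetition expressed with left recursion */
number_list
    : NUMBER
    | number_list ',' NUMBER
    ;
```

With EBNF, the same idea is stated directly with a repetition operator:

```antlr
// ANTLR (EBNF): repetition expressed with '*'
numberList : NUMBER (',' NUMBER)* ;
```

Both describe the same language, but the EBNF version reads as "a NUMBER followed by zero or more comma-NUMBER pairs," which tends to keep larger grammars shorter and easier to follow.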

[…]

Community and Documentation

ANTLR has a community that is alive and vibrant. The community has provided more than a hundred grammars that can be freely used by everybody. There is also a mailing list that can help. The main author of the tool, Terence Parr, wrote the book _The Definitive ANTLR 4 Reference_, which explains the features of ANTLR and much more: the most useful patterns in parsing and the common issues encountered when parsing. We created a Mega Tutorial to help you with ANTLR and even an entire course, Using ANTLR Like a Professional.

Flex and Bison have smaller and less active communities. Bison has an official mailing list; flex does not even have that. Both, though, have a manual that provides ample information about the features and the options of the respective tool.

ANTLR has an involved community that has contributed not just grammars, but even new target languages. Anybody can create and contribute a new target language or help improve the existing ones. For example, we built experimental support for Kotlin as a target language. Flex and Bison, instead, have a long tradition, but they are also less used now, and they have fragmented communities, which makes it harder to find support.

Traditionally, open-source tools have issues in creating comprehensive documentation, which can hurt productivity and slow the acquisition of proficiency in the tool. It is one of the hidden costs of open-source software. However, in this case all three tools have good documentation. Flex and Bison have good official manuals, while ANTLR has a book that explains the tool and how to use it in the most common parsing situations.

Summary

If we had to sum up this article in one sentence, it would be this: flex and Bison still work, but every day they become less of a good choice. It is not that using them will doom your project, but there is really no good reason to pick them over ANTLR or any other modern parser generator available. If for some reason you do not like ANTLR, for instance because you want something simpler, there are also many other valid options. We have reviewed all the major parsing tools and libraries for Java, C#, Python and JavaScript. So pick one of them, any one of them, except flex and Bison.