ANTLR: The ANTLR Mega Tutorial - go-sqlparser/current GitHub Wiki
https://tomassetti.me/antlr-mega-tutorial/
Written by Gabriele Tomassetti
in ANTLR, Language Engineering, Parsing
Parsers are powerful tools, and using ANTLR you can write all sorts of parsers, usable from many different languages.
In this complete tutorial we are going to:
- explain the basics: what a parser is, what it can be used for
- see how to set up ANTLR to be used from JavaScript, Python, Java and C#
- discuss how to test your parser
- present the most advanced and useful features of ANTLR: you will learn all you need to parse all possible languages
- show tons of examples
Maybe you have read a tutorial that was too complicated, or so incomplete that it seemed to assume you already knew how to use a parser. This is not that kind of tutorial. We just expect you to know how to code and how to use a text editor or an IDE. That's it.
At the end of this tutorial:
- you will be able to write a parser to recognize different formats and languages
- you will be able to create all the rules you need to build a lexer and a parser
- you will know how to deal with the common problems you will encounter
- you will understand errors and you will know how to avoid them by testing your grammar.
In other words, we will start from the very beginning and when we reach the end you will have learned all you could possibly need to learn about ANTLR to be productive.
ANTLR Mega Tutorial Giant List of Content
ANTLR is a parser generator, a tool that helps you to create parsers. A parser takes a piece of text and transforms it into an organized structure, a parse tree, also known as an Abstract Syntax Tree (AST). You can think of the AST as a story describing the content of the code, or as its logical representation, created by putting together the various pieces.
Graphical representation of an AST for the Euclidean algorithm
What you need to do to get a parse tree:
- define a lexer and parser grammar
- invoke ANTLR: it will generate a lexer and a parser in your target language (e.g., Java, Python, C#, JavaScript)
- use the generated lexer and parser: you invoke them passing the code to recognize and they return to you a parse tree
So you need to start by defining a lexer and parser grammar for the thing that you are analyzing. Usually the "thing" is a language, but it could also be a data format, a diagram, or any kind of structure that is represented with text.
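To make the two stages concrete, here is a minimal hand-written sketch in Python of what a lexer and a parser do. It is not ANTLR output and does not use the ANTLR runtime: the token names, the tiny arithmetic language, and the nested-tuple tree are all assumptions made up for illustration. With ANTLR you would describe the same rules in a grammar and let the tool generate this kind of code for you.

```python
import re

# Hypothetical token definitions for a tiny arithmetic language (illustration only).
TOKEN_SPEC = [("NUMBER", r"\d+"), ("PLUS", r"\+"), ("TIMES", r"\*"), ("WS", r"\s+")]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Lexer stage: turn characters into (type, text) tokens, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens):
    """Parser stage: build a parse tree of (rule, children) tuples for the rules
    sum     : product (PLUS product)*
    product : NUMBER (TIMES NUMBER)*
    """
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        token = tokens[pos]
        pos += 1
        return token

    def product():
        children = [expect("NUMBER")]
        while pos < len(tokens) and tokens[pos][0] == "TIMES":
            children.append(expect("TIMES"))
            children.append(expect("NUMBER"))
        return ("product", children)

    def sum_():
        children = [product()]
        while pos < len(tokens) and tokens[pos][0] == "PLUS":
            children.append(expect("PLUS"))
            children.append(product())
        return ("sum", children)

    tree = sum_()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return tree


tree = parse(tokenize("1 + 2 * 3"))
print(tree)
```

The lexer only knows about characters and tokens, while the parser only knows about tokens and rules. That separation is the same one you will see between lexer rules and parser rules in a grammar.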
Notice that technically what you get from ANTLR is a _parse tree_ rather than an AST. The difference is that a parse tree is exactly what comes out of the parser, while the AST is a more refined version of the parse tree. You create the AST by manipulating the parse tree, in order to get something that is easier to use by subsequent parts of your program. These changes are sometimes necessary because a parse tree might be organized in a way that makes parsing easier or better performing. However, you might prefer something more user-friendly in the rest of the program.
The distinction is moot in the examples shown here, since they are quite simple, so we use the terms interchangeably. However, it is something to keep in mind while reading other documents.
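As a rough illustration of the difference, the following Python sketch turns a small parse tree into a more compact AST. The tree shape (nested `(rule, children)` tuples with `(TOKEN, text)` leaves) and the function name `to_ast` are assumptions invented for this example; they are not the data structures ANTLR produces.

```python
# A hand-written parse tree for "1 + 2 * 3" (hypothetical shape, for illustration).
# Rule nodes are (rule_name, [children]); leaves are (TOKEN_TYPE, text).
PARSE_TREE = ("sum", [
    ("product", [("NUMBER", "1")]),
    ("PLUS", "+"),
    ("product", [("NUMBER", "2"), ("TIMES", "*"), ("NUMBER", "3")]),
])


def to_ast(node):
    """Refine a parse tree into an AST: numbers become ints, single-child
    rule nodes disappear, and operators become (op, left, right) tuples."""
    kind, payload = node
    if kind == "NUMBER":
        return int(payload)
    # In a rule node, children alternate: operand, operator, operand, ...
    operands = [to_ast(child) for child in payload[0::2]]
    operators = [child[1] for child in payload[1::2]]
    ast = operands[0]
    for op, right in zip(operators, operands[1:]):
        ast = (op, ast, right)
    return ast


AST = to_ast(PARSE_TREE)
print(AST)
```

The parse tree mirrors the grammar rule by rule, including the wrapper node around the lone `1`. The AST keeps only what later stages of a program would care about.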
If you are the typical programmer, you may ask yourself: why can't I use a regular expression? A regular expression is quite useful, for example when you want to find a number in a string of text, but it also has many limitations.
The most obvious is the lack of recursion: you cannot find a (regular) expression inside another one, unless you code it by hand for each level, something that quickly becomes unmaintainable. But the larger problem is that it is not really scalable: if you put together even just a few regular expressions, you create a fragile mess that is hard to maintain.
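The recursion limit can be seen with nested parentheses. The sketch below, using Python's standard `re` module, hand-codes a pattern for exactly one level of nesting and compares it with a small recursive function. The names `ONE_LEVEL`, `regex_matches` and `balanced` are made up for this example.

```python
import re

# A parenthesized group with no nesting, e.g. "(abc)".
FLAT = r"\([^()]*\)"
# One level of nesting coded by hand: each extra level needs another copy of the pattern.
ONE_LEVEL = r"\((?:[^()]|" + FLAT + r")*\)"


def regex_matches(text):
    """True if the whole text is a group with at most one level of nesting."""
    return re.fullmatch(ONE_LEVEL, text) is not None


def balanced(text):
    """True if the whole text is a single parenthesized group, nested to any depth."""

    def group(pos):
        # text[pos] is "("; return the index just past its matching ")", or -1.
        pos += 1
        while pos < len(text):
            if text[pos] == "(":
                pos = group(pos)  # recursion handles the inner group
                if pos == -1:
                    return -1
            elif text[pos] == ")":
                return pos + 1
            else:
                pos += 1
        return -1

    return text.startswith("(") and group(0) == len(text)


for sample in ["(a)", "(a(b)c)", "(a(b(c)))"]:
    print(sample, regex_matches(sample), balanced(sample))
```

The regular expression gives up on the third sample because it was only written for one level. The recursive function needs no changes, which is the same property a grammar rule that refers to itself gives you.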
It is not that easy to use regular expressions
Have you ever tried parsing HTML with a regular expression? It is a terrible idea: for one, you risk summoning Cthulhu, but more importantly it does not really work. You do not believe me? Let's see: you want to find the elements of a table, so you try a regular expression like this one: <table>(.*?)</table>. Brilliant! You did it! Except somebody adds attributes to their table, such as style or id. It does not matter, you do this: <table.*?>(.*?)</table>. Still, you actually cared about the data inside the table, so you now need to parse tr and td, but they are full of tags.
Therefore you need to eliminate those, too. And somebody even dares to use comments like <!-- my comment -->. Comments can be used everywhere, and that is not easy to handle with your regular expression. Is it?
So you forbid the internet to use comments in HTML: problem solved.
Or alternatively you use ANTLR, whatever seems simpler to you.
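The failures described above are easy to reproduce with Python's standard `re` module. The sample HTML strings below are made up for illustration; the two patterns are the ones from the text, compiled with `re.DOTALL` so that `.` also matches newlines.

```python
import re

NAIVE = re.compile(r"<table>(.*?)</table>", re.DOTALL)
WITH_ATTRS = re.compile(r"<table.*?>(.*?)</table>", re.DOTALL)

# Hypothetical inputs, invented for this example.
plain = "<table><tr><td>1</td></tr></table>"
styled = '<table id="t1"><tr><td>1</td></tr></table>'
commented = "<table><!-- </table> --><tr><td>1</td></tr></table>"

# The first attempt works on the simplest input...
print(NAIVE.search(plain).group(1))

# ...but finds nothing as soon as the tag has an attribute.
print(NAIVE.search(styled))

# The second attempt copes with attributes...
print(WITH_ATTRS.search(styled).group(1))

# ...but a comment containing "</table>" ends the match far too early.
print(WITH_ATTRS.search(commented).group(1))
```

Each fix handles one more case and leaves the next one broken, which is the pattern the paragraph above describes.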
Okay, you are convinced, you need a parser. But why use a parser generator like ANTLR instead of building your own?
The main advantage of ANTLR is productivity
If you actually have to work with a parser all the time, because your language or format is evolving, you need to be able to keep pace. This is something you cannot do if you have to deal with the details of implementing a parser. Since you are not parsing for parsing's sake, you must have the chance to concentrate on accomplishing your goals. And ANTLR makes it much easier to do that, rapidly and cleanly.
Second, once you have defined your grammars you can ask ANTLR to generate multiple parsers in different languages. For example, you can get a parser in C# and one in JavaScript to parse the same language in a desktop application and in a web application.
Some people argue that by writing a parser by hand you can make it faster and produce better error messages. There is some truth in this, but in my experience parsers generated by ANTLR are always fast enough. If you really need to, you can tweak them and improve both performance and error handling by working on your grammar. And you can do that once you are happy with your grammar.
Two small notes:
- in the companion repository of this tutorial you are going to find all the code with testing, even where we don't see it in the article
- the examples will be in different languages, but the knowledge is generally applicable to any language
Table of Contents
- Setup
- The ANTLR Mega Tutorial as a PDF
- 1. Setup ANTLR
- Instructions
- Executing the instructions on Linux/Mac OS
- Executing the instructions on Windows
- Typical Workflow
- 2. JavaScript Setup
- 3. Python Setup
- 4. Java Setup
- 5. C# Setup
- Alternatives If You Are Not Using Visual Studio Code
- Picking the Right Runtime
- Beginner
- 6. Lexers and Parsers
- 7. Creating a Grammar
- Top-down approach
- Bottom-up approach
- 8. Designing a Data Format
- 9. Lexer Rules
- 10. Parser Rules
- 11. Mistakes and Adjustments
- Mid-Level
- 12. Setting Up the Chat Project with JavaScript
- 13. Antlr.js
- 14. HtmlChatListener.js
- 15. Working with a Listener
- 16. Solving Ambiguities with Semantic Predicates
- 17. Continuing the Chat in Python
- 18. The Python Way of Working with a Listener
- 19. Testing with Python
- 20. Parsing Markup
- 21. Lexical Modes
- 22. Parser Grammars
- Advanced
- 23. The Markup Project in Java
- 24. The Main App.java
- 25. Transforming Code with ANTLR
- 26. Joy and Pain of Transforming Code
- 27. Advanced Testing
- 28. Dealing with Expressions
- 29. Parsing Spreadsheets
- 30. The Spreadsheet Project in C#
- 31. Excel is Doomed
- 32. Testing Everything
- Final Remarks
- 33. Tips and Tricks
- Catchall Rule
- Channels
- Rule Element Labels
- Problematic Tokens
- 34. Conclusions