The ANTLR Mega Tutorial

https://tomassetti.me/antlr-mega-tutorial/

Written by Gabriele Tomassetti
in ANTLR, Language Engineering, Parsing

Parsers are powerful tools, and with ANTLR you can write all sorts of parsers, usable from many different languages.

In this complete tutorial we are going to:

explain the basics: what a parser is, what it can be used for
see how to set up ANTLR to be used from JavaScript, Python, Java and C#
discuss how to test your parser
present the most advanced and useful features of ANTLR: you will learn all you need to parse all possible languages
show tons of examples

Maybe you have read a tutorial that was too complicated, or so incomplete that it seemed to assume you already knew how to use a parser. This is not that kind of tutorial. We just expect you to know how to code and how to use a text editor or an IDE. That's it.

At the end of this tutorial:

you will be able to write a parser to recognize different formats and languages
you will be able to create all the rules you need to build a lexer and a parser
you will know how to deal with the common problems you will encounter
you will understand errors and you will know how to avoid them by testing your grammar.

In other words, we will start from the very beginning and when we reach the end you will have learned all you could possibly need to learn about ANTLR to be productive.

ANTLR Mega Tutorial Giant List of Content

What is ANTLR?

ANTLR is a parser generator, a tool that helps you to create parsers. A parser takes a piece of text and transforms it into an organized structure, a parse tree, also known as an Abstract Syntax Tree (AST). You can think of the AST as a story describing the content of the code, or also as its logical representation, created by putting together the various pieces.

Graphical representation of an AST for the Euclidean algorithm

What you need to do to get a parse tree:

define a lexer and parser grammar
invoke ANTLR: it will generate a lexer and a parser in your target language (e.g., Java, Python, C#, JavaScript)
use the generated lexer and parser: you invoke them passing the code to recognize and they return to you a parse tree

So you need to start by defining a lexer and parser grammar for the thing that you are analyzing. Usually the "thing" is a language, but it could also be a data format, a diagram, or any kind of structure that is represented with text.
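
To make the two stages concrete, here is a minimal hand-written sketch in Python. It is not code generated by ANTLR and it does not use the ANTLR runtime: the names tokenize and parse_sum, the token names, and the tiny "sum" rule are all assumptions made for illustration. It only mirrors the division of labor you get from a generated lexer and parser: text goes in, tokens come out of the lexer, and a parse tree comes out of the parser.

```python
import re

# Lexer stage: turn raw text into (type, text) tokens.
# The token names are hypothetical, chosen for this sketch.
TOKEN_SPEC = [("NUMBER", r"\d+"), ("PLUS", r"\+"), ("WS", r"\s+")]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WS":  # whitespace is recognized but discarded
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

# Parser stage: apply the single rule  sum : NUMBER (PLUS NUMBER)* ;
def parse_sum(tokens):
    def expect(index, kind):
        if index >= len(tokens) or tokens[index][0] != kind:
            raise SyntaxError(f"expected {kind} at token {index}")
        return tokens[index]

    children = [expect(0, "NUMBER")]
    index = 1
    while index < len(tokens):
        children.append(expect(index, "PLUS"))
        children.append(expect(index + 1, "NUMBER"))
        index += 2
    return ("sum", children)  # a parse tree node: (rule name, children)

tree = parse_sum(tokenize("1 + 22 + 3"))
print(tree)
```

With ANTLR you would describe the tokens and the rule in a grammar and let the tool generate the equivalent of both functions in your target language.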

Notice that technically what you get from ANTLR is a _parse tree_ rather than an AST. The difference is that a parse tree is exactly what comes out of the parser, while the AST is a more refined version of the parse tree. You create the AST by manipulating the parse tree, in order to get something that is easier to use by subsequent parts of your program. These changes are sometimes necessary because a parse tree might be organized in a way that makes parsing easier or better performing. However, you might prefer something more user friendly in the rest of the program.
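
The following Python sketch illustrates the idea of refining a parse tree into an AST. Everything in it is an assumption made for illustration: the nested-tuple tree shape, the rule names expression, sum and atom, and the to_ast function are hypothetical and are not what ANTLR produces. The point is only that the parse tree mirrors the grammar rule by rule, while the AST keeps what the rest of the program cares about.

```python
# Parse tree for "1 + 2" under a hypothetical grammar where
# expression wraps sum, and sum is made of atoms.
# Inner nodes are (rule_name, children); leaves are token text.
parse_tree = ("expression", [
    ("sum", [
        ("atom", ["1"]),
        "+",
        ("atom", ["2"]),
    ]),
])

def to_ast(node):
    """Collapse wrapper rules and turn 'left op right' into an operator node."""
    if isinstance(node, str):
        return int(node) if node.isdigit() else node
    rule, children = node
    if len(children) == 1:
        return to_ast(children[0])  # wrapper rule: keep only its child
    if len(children) == 3:
        left, operator, right = children
        return {"op": operator, "left": to_ast(left), "right": to_ast(right)}
    raise ValueError(f"unexpected shape for rule {rule!r}")

ast = to_ast(parse_tree)
print(ast)  # {'op': '+', 'left': 1, 'right': 2}
```

The parse tree needs three levels of nesting to say what the AST says in one node, which is why later stages of a program are often easier to write against the AST.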

The distinction is moot in the examples shown here, given that they are quite simple, so we use the terms interchangeably. However, it is something to keep in mind while reading other documents.

Are not Regular Expressions Enough?

If you are the typical programmer, you may ask yourself: why can't I use a regular expression? A regular expression is quite useful, such as when you want to find a number in a string of text, but it also has many limitations.

The most obvious is the lack of recursion: you cannot find a (regular) expression inside another one, unless you code it by hand for each level. That quickly becomes unmaintainable. But the larger problem is that it is not really scalable: if you put together even just a few regular expressions, you create a fragile mess that is hard to maintain.
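
The lack of recursion can be seen in a small Python sketch, using nested parentheses as a stand-in for any nested construct. The pattern names LEVEL_1 and LEVEL_2 and the function parse_group are hypothetical names chosen for this example. Python's standard re module has no recursive patterns, so each extra nesting level needs another hand-written pattern, while a single recursive function handles any depth.

```python
import re

# One hand-written pattern per nesting level: each level embeds the previous one.
LEVEL_1 = r"\([^()]*\)"
LEVEL_2 = rf"\((?:[^()]|{LEVEL_1})*\)"

def matches_whole(pattern, text):
    return re.fullmatch(pattern, text) is not None

def parse_group(text, pos=0):
    """Recursive rule  group : '(' (group | char)* ')' ;
    returns (nested list, position after the closing parenthesis)."""
    if pos >= len(text) or text[pos] != "(":
        raise SyntaxError(f"expected '(' at position {pos}")
    pos += 1
    items = []
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            child, pos = parse_group(text, pos)  # the rule refers to itself
            items.append(child)
        else:
            items.append(text[pos])
            pos += 1
    if pos >= len(text):
        raise SyntaxError("missing ')'")
    return items, pos + 1

print(matches_whole(LEVEL_1, "((a))"))    # False: one level is not enough
print(matches_whole(LEVEL_2, "((a))"))    # True
print(matches_whole(LEVEL_2, "(((a)))"))  # False: a LEVEL_3 would be needed
print(parse_group("(((a)))")[0])          # [[['a']]]
```

A grammar rule can refer to itself in the same way parse_group does, which is what lets a parser handle nesting of any depth with a single rule.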

It is not that easy to use regular expressions

Have you ever tried parsing HTML with a regular expression? It is a terrible idea: for one, you risk summoning Cthulhu, but more importantly it does not really work. You do not believe me? Let's see: you want to find the elements of a table, so you try a regular expression like this one: <table>(.*?)</table>. Brilliant! You did it! Except somebody adds attributes to their table, such as style or id. It does not matter, you do this: <table.*?>(.*?)</table>. Still, what you actually care about is the data inside the table. So you now need to parse tr and td, but they are full of tags.

Therefore you need to eliminate those, too. And somebody even dares to use comments like <!-- my comment -->. Comments can be used everywhere, and that is not easy to handle with your regular expression. Is it?

So you forbid the internet to use comments in HTML: problem solved.

Or alternatively you use ANTLR, whatever seems simpler to you.
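
The table example above can be reproduced with a few lines of Python. The two patterns are the ones quoted in the text. The HTML strings and the variable names are made up for illustration, and re.DOTALL is an added assumption so that the patterns would also work across line breaks.

```python
import re

first_try = re.compile(r"<table>(.*?)</table>", re.DOTALL)
second_try = re.compile(r"<table.*?>(.*?)</table>", re.DOTALL)

# Hypothetical inputs: a bare table, one with an attribute,
# and one whose comment happens to contain a closing tag.
plain = "<table><tr><td>1</td></tr></table>"
styled = '<table id="t"><tr><td>1</td></tr></table>'
commented = '<table id="t"><!-- </table> --><tr><td>1</td></tr></table>'

print(first_try.search(plain).group(1))       # <tr><td>1</td></tr>
print(first_try.search(styled))               # None: the attribute breaks the match
print(second_try.search(styled).group(1))     # <tr><td>1</td></tr>
print(repr(second_try.search(commented).group(1)))  # '<!-- ': cut off inside the comment
```

Each fix handles the case that prompted it and nothing more. The pattern has no notion of what a comment or a nested element is, so the next unusual input breaks it again.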

ANTLR vs Writing Your Own Parser by Hand

Okay, you are convinced: you need a parser. But why use a parser generator like ANTLR instead of building your own?

The main advantage of ANTLR is productivity

If you have to work with a parser all the time, because your language or format is evolving, you need to be able to keep pace. This is something you cannot do if you have to deal with the details of implementing a parser. Since you are not parsing for parsing's sake, you must have the chance to concentrate on accomplishing your goals. And ANTLR makes it much easier to do that, rapidly and cleanly.

Second, once you have defined your grammars you can ask ANTLR to generate multiple parsers in different languages. For example, you can get a parser in C# and one in JavaScript to parse the same language in a desktop application and in a web application.

Some people argue that by writing a parser by hand you can make it faster and produce better error messages. There is some truth in this, but in my experience parsers generated by ANTLR are always fast enough. If you really need to, you can tweak them and improve both performance and error handling by working on your grammar. And you can do that once you are happy with your grammar.

Two small notes:

in the companion repository of this tutorial you are going to find all the code with tests, even where we do not show them in the article
the examples will be in different languages, but the knowledge is generally applicable to any language
