Abstract Syntax Tree - BHsketch/Athena GitHub Wiki

The abstract syntax tree (AST) structure:

This page explains what an AST is and why we might want to build one when implementing a compiler.

Consider a single expression, "a+b", that appears in the code. Ultimately, our compiler's first module, whatever that may be, must convert the code written in the input file into an intermediate form such that, if we were to put the language grammar and that intermediate form side by side and compare them directly, it would be "easy" to figure out where the mistakes in the code are (i.e. where it does not follow the language grammar).

So maybe something like this: a tree with + at the root and two children, 'a' and 'b'. Seems intuitive. But if you look at the grammar, it's much more involved: it uses a whole bunch of different non-terminals to represent an expression, so the parse tree for "a+b" surely doesn't look as simple as the little tree drawn below. So do we need all those extra non-terminals? Do they have any significance at all? Must we include them when we build our tree?

First of all, let me clarify why we need a 'tree' in the first place. If you've gone through some basics on grammars, you might have come across parse trees. Any string belonging to a language must be "derived" from its grammar, and deriving anything from a grammar is intuitively like building a tree: each non-terminal branches off into a sequence of terminals and non-terminals, with only one of the possible productions actually being used. In short, every derivation from a grammar looks like a tree. Hence if we can convert our code (which, since it belongs to the language, must have a derivation) into its corresponding parse tree, we can traverse that tree and easily recognize any place where an "illegal derivation" has been made.

  +
 / \
a   b
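In code, the tree above might be represented with a small node class. This is a minimal sketch under my own naming assumptions (the `Node` class and its fields are illustrative, not taken from the project):

```python
# A minimal AST node: leaves hold operand names, inner nodes hold an
# operator symbol plus left/right subtrees.
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value   # operator symbol or operand name
        self.left = left
        self.right = right

    def __repr__(self):
        if self.left is None and self.right is None:
            return self.value
        return f"({self.left} {self.value} {self.right})"

# The tree drawn above: '+' at the root, 'a' and 'b' as leaves.
tree = Node("+", Node("a"), Node("b"))
print(tree)  # (a + b)
```

Note that this structure records only the operator and its operands; none of the grammar's helper non-terminals appear in it, which is exactly the simplification discussed below.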

As for all those "extra" non-terminals, it is relatively simple to understand why you want them in the grammar (why they are not required in an AST will become apparent when I explain the first reason):

  1. They ensure that expressions follow the grammatical rules (a format) for how an expression should look.
  2. They implement a precedence order on the operators within the grammar itself (instead of implementing precedence in the parsing phase as a separate algorithm).
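Both points can be seen in the common textbook grammar for arithmetic expressions (a standard form for illustration, not necessarily the exact grammar this project uses):

```
expr   → expr + term   | term
term   → term * factor | factor
factor → id | ( expr )
```

Because a `term` must be fully derived before two terms are joined by `+`, the `*` and `/` operators automatically bind tighter than `+`, with no extra precedence logic needed in the parser.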

The first one is pretty much what a grammar is supposed to do: it defines a structure for the statement in terms of generic words like "term", "op" and "factor", but that is all. Once our program knows what the statement should look like, it can parse it once according to that format, report errors if necessary, and, if the statement is correct, create a tree out of it. When building that tree, however, there is no need to store the format information again. The very fact that an AST fragment for the statement was built at all implies that its format was already checked and found correct; in other words, a grammatically correct statement is a prerequisite for building an AST fragment for it. So those extra non-terminals no longer tell us anything new: they can be forgotten, and the tree simplified to just a '+' at the root with 'a' and 'b' as its children.

The second one is less obvious, and it's somewhat disconnected from our thread of discussion, so I'll only explain it briefly. In an expression like a*2 + b/3 + c, the pluses are executed last, once all the other terms, i.e. 'a*2', 'b/3' and 'c', have been calculated. This means that terms, which are clusters of variables (or literals) separated by the * or / operators, must be recursively derived and solved before they are added together. Hence they get a whole new non-terminal in the grammar, one that is parsed before the +'s are.
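A small recursive-descent parser makes both ideas concrete: the term/factor split enforces precedence during parsing, yet neither non-terminal survives into the resulting AST. This is a sketch under my own assumptions (tokenizer, tuple-shaped AST nodes, and function names are all illustrative, not the project's actual code):

```python
import re

# Grammar being implemented (iterative form of the textbook rules):
#   expr   -> term (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> id | number | '(' expr ')'

def tokenize(src):
    # Identifiers, integer literals, and single-character operators/parens.
    return re.findall(r"[A-Za-z_]\w*|\d+|[+\-*/()]", src)

def parse(src):
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "(":
            eat()                      # consume '('
            node = expr()
            assert eat() == ")", "missing closing parenthesis"
            return node
        return eat()                   # identifier or literal -> AST leaf

    def term():
        # Fully resolves * and / chains before returning to expr().
        node = factor()
        while peek() in ("*", "/"):
            op = eat()
            node = (op, node, factor())  # tuple = tiny AST node
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = eat()
            node = (op, node, term())
        return node

    return expr()

# '+' ends up at the root; the term/factor non-terminals vanish from the tree.
print(parse("a*2 + b/3 + c"))
# ('+', ('+', ('*', 'a', '2'), ('/', 'b', '3')), 'c')
```

Because `expr` calls `term`, which calls `factor`, each multiplicative cluster is reduced to a subtree before any `+` node is built, so the AST encodes precedence purely in its shape, with no record of the helper non-terminals that enforced it.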
