ANTLR4 - nus-cs4215/x-slang-t1-xz-jj GitHub Wiki

Syntax Validation - Parsing Python3

For the syntax validation process, we rely on ANTLR4 to perform the necessary checking before proceeding to the next step, i.e., AST generation. This process allows the user program to be checked and fine-grained error messages can be returned to the user, if any.

See verify_syntax() in src/sicpy/syntax_verifier.ts.

However, due to unforeseen circumstances, this module is currently disabled. See below for more details.

ANTLR4

Part of the ANTLR4 project is a parser generator that can be used to build languages, tools and frameworks. SICPy's Python parser was the product of this ANTLR4 parser generator, and was based on the Python3 grammar.

Originally, ANTLR4, which was based on Java, generates parsers as a series of the Java source files. The ANTLR4TS project extended the original project and enabled it to produce a parser based in TypeScript.

The following is example of a generated parse tree in Python 3:

parser output

Vistors and Listeners

In order to walk a ANTLR4 parse tree, we declare a visitor or a listener in a manner specified by ANTLR4's specifications. An example of a listener (written in Java) is given here. Listeners function primarily as event listeners that get triggered whenever a certain Python grammar was encountered in the parse tree. Listeners offer less autonomy to its users compared to visitors as the walk was premeditated by the ParseTreeListener class in ANTLR4. Nevertheless, we deemed that this functionality was enough for our use-case, as we simply only need to filter away/ block out syntaxes that students weren't suppose to use in specific SICPy chapters.

Encountered Issues

Convulated Parse Trees

Initially we thought that Listeners alone are sufficient for our use case -- we simply need to know what keywords are used by in the source code. One way to do this is to install listeners for specific keywords, and shall these listeners get triggered and called, we will return a flag that tells us the forbidden keywords that were used in the students' code.

However, we quickly found out that listener event handlers does not provide the enough information about the context surrounding the keyword when it was triggered. For example, for the simple expression "x=1", when the walk eventually reaches the numeric value "1", we do not know what type is "1", as the event handler does not provide us with information such as who are its parents, what type of node is it (what "token" is it in the grammar), etc.

The next progressive thing we did is to implement visitors and generate our own Abstract Syntax Tree. However, we found that the Python grammar that we had been using thus far had other more serious issues that we had not anticipated, as explained below.

Only One Expression at a Time?

The generated ANTLR4 parser does not parse the entire program at once, but instead expression by expression. This is a fundamental issue with the ANTLR4 parser, as each expression is evaluated independently of each other. We spent significant time to remedy this problem in our project. In the end, we found that this was part of a bigger issue in the Python3 grammar. See the following related GitHub issues: 1061 1073 1078 1086 1352 1646 1801.

Updating the Grammar from 3.6 to 3.8

The current supported Python3 Grammar provided on ANTLR's Github repository is for Python version 3.6. We extended the grammar to version 3.8.

The updated grammar can be found in x-slang/src/sicpy/Python3.g4. Once deemed fit, a pull request would be lodged to ANTLR4's grammars-v4 repository.