ANTLR - BlueObelisk/jumbo-converters GitHub Wiki
ANTLR is a tool for generating parsers from a grammar. It can generate parsers in many target languages, such as Java, C++, Perl, and PHP.
Using ANTLR to generate parsers for computational chemistry seems to be rather new and is under investigation as a future alternative. Previous work with parser generators includes:
* parsing Gaussian input files ([[http://dx.doi.org/10.1109/ISISE.2008.50|Liu et al., 2008]]);
* parsing SMILES and SMARTS strings using ANTLR ([[http://www.journal.chemistrycentral.com/content/2/S1/P40|Dalke, 2007]]). By the same author, see also [[http://www.dalkescientific.com/writings/diary/archive/2007/10/30/antlr_mw.html|Calculating MW with ANTLR]];
* work by some Quixote members, who have dealt with parser generators ([[http://dx.doi.org/10.1039/b411033a|Townsend et al., 2004]]) and have started a specific project to use ANTLR ([[http://bitbucket.org/petermr/jumbo-antlr]]).
It is important to record problems and their solutions for future reference, and to show the advantages and disadvantages of the method so that it is clear when to apply it.
This page collects tips, problems, and solutions that are useful when creating grammars for parsing computational chemistry files.
Besides parser generators, the other option for writing parsers is doing it by hand. What are the advantages and disadvantages of each option?
Pros:
* Automatic generation.
* Higher-level description of a language and its recognition process, which eases understanding of the language, as well as maintenance and modification of the language description.
* Automatic detection and notification of errors.
* A single grammar can be used as a template for different target programming languages (depending on the parser generation tool used).
* Theoretical support for grammar definitions and for solving common problems.
Cons:
* Bound to a given target programming language (depending on the parser generation tool used).
* Requires learning the grammar definition language and the parser generation tool.
* Lacks flexibility, since the generated parser follows a fixed architecture (in ANTLR, an LL(*) recursive-descent parser).
1. Use multiple lexers to analyse different sections of your text stream. In many cases, a document contains sections that differ so radically that it is hard to write a single set of lexical rules to tokenize all of them at once (under investigation).
1. How to change the lexer at runtime? For instance, when recognizing a C comment, the tokenizing process changes and everything is treated as part of the comment until the end of the comment is found. Although ANTLR offers simple solutions for this specific problem (see the ANTLR wiki page on [[http://www.antlr.org/wiki/pages/viewpage.action?pageId=1573|How do I match multi-line comments?]]), the solution employed in Lex and Yacc (see [[http://www.cs.man.ac.uk/~pjj/cs212/ex2_str_comm.html|an example]]) can be applied to ANTLR for switching lexers. What is needed is [[http://dinosaur.compilertools.net/lex/index.html|Left Context Sensitivity]], which can be achieved in ANTLR with semantic predicates (alternatively, one could probably connect a Lex lexer to an ANTLR grammar). See this [[http://stackoverflow.com/questions/2798545|related stackoverflow discussion]]. The following is an example of semantic predicates used to switch between sets of rules in the lexer:
and this is the output with valid and invalid input sentences:
Note that this solution may conflict with syntactic predicates in the grammar.
1.#3 How to change to a different lexer and grammar? This may happen with so-called island grammars. The [[http://www.antlr.org/download.html|ANTLR examples]] include one such case in the {{{island-grammar}}} folder (this solution also requires the lexer to detect the point at which the grammar must change). The ANTLR wiki entry [[http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control|Island Grammars Under Parser Control]] shows a more complex example.
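The rule-switching idea from the runtime lexer question above can also be illustrated outside ANTLR. The following is a minimal Python sketch, not ANTLR code and not the grammar used on this page: each rule carries a guard that plays the role of a semantic predicate, and the `in_comment` flag decides which set of rules is active. The class `ModalLexer`, the token names, and the patterns are all hypothetical and chosen only for illustration.

```python
import re

# Each rule: (guard, token name, pattern, new value for in_comment or None).
# The guard stands in for a semantic predicate attached to a lexer rule.
RULES = [
    (lambda lx: not lx.in_comment, "COMMENT_START", re.compile(r"/\*"), True),
    (lambda lx: not lx.in_comment, "WORD", re.compile(r"\w+"), None),
    (lambda lx: not lx.in_comment, "WS", re.compile(r"\s+"), None),
    (lambda lx: lx.in_comment, "COMMENT_END", re.compile(r"\*/"), False),
    (lambda lx: lx.in_comment, "COMMENT_TEXT",
     re.compile(r"(?:[^*]|\*(?!/))+"), None),
]


class ModalLexer:
    """Toy lexer whose rules are enabled or disabled by a state flag."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.in_comment = False  # the variable tested by the guards

    def tokens(self):
        out = []
        while self.pos < len(self.text):
            for guard, name, pattern, new_state in RULES:
                # A rule whose guard is false is invisible to the lexer.
                match = pattern.match(self.text, self.pos) if guard(self) else None
                if match:
                    out.append((name, match.group()))
                    self.pos = match.end()
                    if new_state is not None:
                        self.in_comment = new_state  # switch rule set
                    break
            else:
                raise ValueError(f"unexpected input at position {self.pos}")
        return out
```

With this sketch, `ModalLexer("a /* b c */ d").tokens()` produces a single `COMMENT_TEXT` token for the comment body, while a stray `*/` outside a comment raises `ValueError` because no enabled rule matches it.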
1. The start rule must not be recursive. Check:
  * [[http://www.jguru.com/faq/view.jsp?EID=422381]]
  * [[http://thesoftwarelife.blogspot.com/2008/07/antlr-frustrations.html]]
1. ANTLR resolves conflicting lexer rules by rule precedence (the first rule in the lexer file is selected). The following simple grammar shows the problem:
We wish to detect any character other than an equals sign, followed by any character other than a comma. Some inputs obviously belong to that language. However, when ANTLR can match the same input with both lexer rules, it chooses the first one, and recognition fails.
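The effect of first-rule-wins resolution can be reproduced with a small Python sketch. This is only an analogy, not the ANTLR grammar discussed above: the rule names `NOT_EQ` and `NOT_COMMA`, the regular expressions, and the `parse` helper are hypothetical, and the toy tokenizer simply tries rules in the order they are listed.

```python
import re

# Hypothetical lexer rules in priority order (the earlier rule wins a tie).
RULES = [
    ("NOT_EQ", re.compile(r"[^=]")),     # any character except '='
    ("NOT_COMMA", re.compile(r"[^,]")),  # any character except ','
]


def tokenize(text):
    """Return token names, resolving ambiguity by rule order."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match:
                tokens.append(name)
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens


def parse(text):
    """Accept exactly the token sequence NOT_EQ NOT_COMMA."""
    return tokenize(text) == ["NOT_EQ", "NOT_COMMA"]
```

Under these assumptions, `tokenize("ab")` labels both characters `NOT_EQ`, because the first rule matches each of them, so `parse("ab")` is rejected even though the input fits the intended language. Only an input such as `"a="`, whose second character cannot match the first rule, is accepted.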