pegLeg - smartnuf/ees GitHub Wiki
I thought it might be interesting to look at creating some simple interpreter for a tiny language for an embedded system with a parser generator. peg/leg looked like a reasonable choice as it generates a C language recursive decent parser using an extended parsing expression grammar (PEG) with a lex/yacc feel.
I downloaded the source from http://piumarta.com/software/peg/, and poked around the source tree. The man page is here. I found the MAN page useful -- and there are some good examples in the examples/ sub-directory. We'll examining these later, meantime I scanned Wikipedia/PEG to give me some background info, then divert to the original paper, Bryan Ford 2004, as Wikipedia's discussion only raised more questions for me.
It is not a "generating" grammar (in which rules define ways to generate strings in the grammar), but rather a "recognising" one (in which rules are used to decide whether a given string is or is not in the grammar). Any PEG can be directly converted to a recursive descent parser.
PEG is stylistically similar to a combination of Regular Expressions (REs) and Context-Free Grammars (CFGs). PEG differs from RE in that it is always greedy and doesn't backtrack. It differs from CFGs in that the choice operator, '/', selects the first match in PEG, while it, '|', is ambiguous in CFGs.
A <- e
,
which defines non-terminal A
to recognise the parsing expression e
. PEG operators used to construct expressions are given by Bryan Ford 2004 as
Operator | Type | Precedence | Description |
---|---|---|---|
’ ’ |
primary | 5 | Literal string |
" " |
primary | 5 | Literal string |
[ ] |
primary | 5 | Character class |
. |
primary | 5 | Any character |
(e) |
primary | 5 | Grouping |
e? |
unary suffix | 4 | Optional |
e* |
unary suffix | 4 | Zero-or-more |
e+ |
unary suffix | 4 | One-or-more |
&e |
unary prefix | 3 | And-predicate |
!e |
unary prefix | 3 | Not-predicate |
e1 e2 |
binary | 2 | Sequence |
e1 / e2 |
binary | 1 | Prioritised Choice |
Excepting the predicates, &e
and !e
, these are familiar as either RE or CFG syntax. Two important differences, however, are
- REs are not always greedy, while PEGs are. The PEG operators
e*
,e+
ande?
match as much as can be matched, and never backtrack (which is to say they are greedy). So that, f.e., where as the string "aa" will match the RE"a*a"
, it won't match the PEG'a' * 'a'
, as the'a'*
will match all the 'a's first, leaving none remaining to match the trailing 'a'. - The choice operator, '/', selects the first match in PEG, while it, '|', is ambiguous in CFGs.
-
A <- ! e1 e2
: means e2 so long as e1 does not match too. # Note the text of! e1
is part of e2. -
A <- & e1 e2
: means e2 so long as e1 matches too. # Note the text of& e1
is part of e2. -
A <- e1 ! e2
: means e1 so long as it is not followed by e2. # Note the text of! e2
is not consumed. -
A <- e1 ! e2
: means e1 so long as it is followed by e2. # Note the text of& e2
is not consumed.
From Bryan Ford 2004
"Both left and right recursion are permissible in CFGs, but as with top-down parsing in general, left recursion is unavailable in PEGs because it represents a degenerate loop. For example, the CFG rules ‘A → a A | a’ and ‘A → A a | a’ represent a series of ‘a’s in a CFG, but the PEG rule ‘A ← A a / a’ is degenerate because it indicates that in order to recognize nonterminal A, a parser must first recognize nonterminal A... This restriction applies not only to direct left recursion as in this example, but also to indirect or mutual left recursion involving several nonterminals. Since both left and right recursion in a CFG merely represent repetition, however, and repetition is easier to express in a PEG using repetition operators, this limitation is not a serious problem in practice."
From Bryan Ford 2004
"Like a CFG, a PEG is a purely syntactic formalism, not by itself capable of expressing languages whose syntax depends on semantic predicates. Although the Java language can be described as a single unified PEG, Bryan Ford 2002, C and C++ parsers require an incrementally constructed symbol table to distinguish between ordinary identifiers and typedef-defined type identifiers, Parr and Quong 1994."
Some discussion wanted here...
The MAN PAGE has
start <- "username" { printf( "%s\n", getlogin() ); }
/ < . > { putchar( yytext[ 0 ] ); }
about which it says
"The first line instructs the parser to print the user's login name whenever it sees "username" in the input. If that match fails, the second line tells the parser to echo the next character on the input the standard output. Our parser is now performing useful work: it will copy the input to the output, replacing all occurrences of "username" with the user's account name."
That is all very well, but what exactly is intended here? Won't .
match any single character is it encountered, and consume it, preventing the parser from ever seeing "username"
?
- The answer to that is, no, I was thinking of feeding the parser with one character at a time; but this parser needs to have available all the text necessary to reject the current set of possible matches.
- This could be an issue for a REPL. The parsing functions would need to have three possible results, (more, accept, and reject), to enable the REPL to determine whether it should wait for more input.
The MAN page goes on to say,
" Note the angle brackets ('
<
' and '>
') that were added to the second alternative. These have no effect on the meaning of the rule, but serve to delimit the text made available to the following action in the variableyytext
."
Then
"If the above grammar is placed in the file username.peg, running the command
peg -o username.c username.peg
will save the corresponding parser in the file username.c. To create a complete program this parser could be included by a C program as follows.
#include <stdio.h> /* printf ( ), putchar ( ) */
#include <unistd.h> /* getlogin ( ) */
#include "username.c" /* yyparse ( ) */
int main()
{
while ( yyparse ( ) ) /* repeat until EOF */
;
return 0;
}
I did exactly that on cygwin
, and ran the resulting parser using stdin
and stdout
for I/O. I found that the parser echoed characters immediately as typed in, but otherwise appeared to process them only on receipt of '\n'
(eol). The echo looks like it is mode of the terminal, rather than a behaviour of the parser per se, and the block until '\n'
will be line buffering.
I issued stty -echo; ./username
and the immediate echo was turned off (I restored it with stty echo
). Then I added setvbuf( stdin, ( void * ) 0, _IONBF, 0 );
to main, to turn off C-stdlib line buffering, but this did not help. Using stty raw -echo; ./username
did the trick though, and I found that while there is a partial match f.e. for "username"
, matching .
is suppressed. This is exactly what we need for an REPL.
I'll inspect the generated parser to see how an application can query the parser as to whether more input is required.
It is not clear how this is accomplished on a cursory inspection (the generated code is not as readable as it could be), I'll return to it later.
Meantime the MAN page goes on to describe the peg and leg grammars. The peg grammar differs from [Ford 2004] only minimally. It introduces { action }
, and <
and >
delimiters.
{ action }
Curly braces surround actions. The action is arbitrary C source code to be executed at the end of matching. Any braces within the action must be properly nested. Any input text that was matched before the action and delimited by angle brackets (see below) is made available within the action as the contents of the character arrayyytext
. The length of (number of characters in) yytext is available in the variable yyleng. (These variable names are historical; see lex(1).)
<
An opening angle bracket always matches (consuming no input) and causes the parser to begin accumulating matched text. This text will be made available to actions in the variableyytext
.
>
A closing angle bracket always matches (consuming no input) and causes the parser to stop accumulating text foryytext
.
& { expression }
In this predicate the simple C expression (not statement) is evaluated immediately when the parser reaches the predicate. If the expression yields non-zero (true) the 'match' succeeds and the parser continues with the next element in the pattern. If the expression yields zero (false) the 'match' fails and the parser backs up to look for an alternative parse of the input.
Note that [these] predicates are evaluated immediately during the search for a successful match, since they contribute to the success or failure of the search. Actions, however, are evaluated only after a successful match has been found.
#
introduces a comment that continues to the end of the line.
The leg grammar adds further features from lex/yacc:
-
%{ text... %}
introduces a declaration section that can be placed anywhere a Definition is expected. The text is copied verbatim to the generated C parser code before the code that implements the parser itself. - Uses '=' instead of
<-
. - Uses '|' instead of
/
. - Hyphen,
-
, is valid in an Identifier, as is a single-
. -
;
optionally terminates a rule. - Introduces
@{ action }
that is performed during the matching process at the point at which it is encountered. - Introduces
exp ~ { action }
which is performed like any action, when the match completes, but in this case is called iff the exp failed. (Used for error handling; yytext is not available). -
$$ = value
, a sub-rule can return a semantic value from an action by assigning it to the pseudo-variable '$$'. All semantic values must have the same type (which defaults to 'int'). This type can be changed by definingYYSTYPE
in a declaration section. -
identifier:name
: The semantic value returned (by assigning to '$$') from the sub-rule name is associated with the identifier and can be referred to in subsequent actions.
Examining examples/Makefile, we see examples are
- test
- rule
- accept
- wc
- dc
- dcv
- calc
- basic
- localpeg
- localleg
- erract
The examples are built by issuing make test
. Examining the generated code for the first example, we see
YY_RULE(int) yy_start(yycontext *yy)
{ int yypos0= yy->__pos, yythunkpos0= yy->__thunkpos;
yyprintf((stderr, "%s\n", "start")); if (!yy_body(yy)) goto l13; if (!yymatchChar(yy, '.')) goto l13; yyDo(yy, yy_1_start, yy->__begin, yy->__end);
yyprintf((stderr, " ok %s @ %s\n", "start", yy->__buf+yy->__pos));
return 1;
l13:; yy->__pos= yypos0; yy->__thunkpos= yythunkpos0;
yyprintf((stderr, " fail %s @ %s\n", "start", yy->__buf+yy->__pos));
return 0;
}
etc.
It looks challenging to read. There has to be a generator with a cleaner RD output...
I'll take a look at apg by Lowell D Thomas at APG.
- Update, I did that, and found my way to [Roman Roman R. Redziejowski 2007] (http://www.romanredz.se/papers/FI2007.pdf), sited by B FORD. It looks like it might be what I'm looking for.
Could also look at