Parsing Guards - RopleyIT/GLRParser GitHub Wiki
In this document, we explain how the parser transforms a grammar containing non-terminals that have guard conditions applied into a grammar where guards only appear on terminal tokens in rules. This enables parsers with a simple extension to support guarded terminal tokens to handle the more complex grammars that allow guards on non-terminal tokens.
When appying a guard expression to a terminal token in a grammar rule, this is interpreted as meaning that the specified terminal token must be seen in the input stream, and the guard must evaluate to true once the token has been seen, before the rule can shift to the next token, or reduce to the parent rule if the guard is applied to the last token in the rule. Applying guards to terminal tokens in an LR grammar is a simple but useful extension to the parsing engine.
A guard condition on a non-terminal symbol in a rule production implies that the guard condition should be applied to the last terminal token within each expansion of that non-terminal. Hence a new set of productions should be added for the non-terminal with that condition applied. The rules for generating a grammar containing these expansions are described below.
We are going to show
the transformations that are applied using the following sample input grammar.
In this grammar, lower-case identifiers represent non-terminal tokens, while
upper-case identifiers are terminal tokens as received from the input tokeniser.
Note that the top-level
non-terminal token that will have been recognised if the input token
sequence obeys that grammar is topRule
. Note also that there are
two places where a guard condition is applied to a non-terminal token, both of
them being the application of the guard condition [C1]
to the
non-terminal token z
.
The numbered alpha symbols in curly braces represent the places in the grammar where an action is executed in response to a production being recognised, immediately prior to its reduction. Our aim is to show that the actions still get executed at the correct time and in the correct order, despite the transformations to the grammar:
topRule: // The original top-level rule
Z w {α0}
| w {α1};
w: x y z[C1] V {α2}
| x y z[C1] x {α3};
y: Y Y {α4}
| ε {α5};
z: Z y {α6}
| y W {α7}
| ε {α8};
x: X Y {α9}
| ε {α10};
The first transformation ensures you
start with an expanded top-level grammar rule, as required by LR parsers. This
guarantees that the entry rule never appears on the right hand side of an
expansion. Note that the expanded grammar rule, called _start
in
the code sample below, now includes a synthetic first
terminal token (SOF
) before the top rule. This is because the algorithm used to
recognise guards on non-terminals recursively applies the guard condition to
tokens on the non-terminal’s left, if the non terminal includes the empty
expansion among its productions. Thus in the extreme case of an empty grammar
being a valid completion of the grammar, we have a token to which we can attach
a guard condition that has moved left through the grammar right up to the parent
rule. The grammar is rewritten below:
_start: SOF topRule; // Auto-expanded rule
topRule: // The original top-level rule
Z w{α0}
| w {α1};
w: x y z[C1] V {α2}
| x y z[C1] x {α3};
y: Y Y {α4}
| ε {α5};
z: Z y {α6}
| y W {α7}
| ε {α8};
x: X Y {α9}
| ε {α10};
Next, for each non-terminal with a guard condition, add a new set of productions for that non-terminal guard pair as the LHS of the production. The RHS will have its last token modified to take the same guard condition.
_start: SOF topRule;
topRule:
Z w{α0}
| w {α1};
w: x y z[C1] V {α2}
| x y z[C1] x {α3};
y: Y Y {α4}
| {α5};
z: Z y {α6}
| y W {α7}
| {α8};
x: X Y {α9}
| {α10};
zC1:
Z y[C1] {α6}
| y W[C1] {α7}
| [C1] {α8};
Apply this rule recursively for any further non-terminals that the guard condition gets moved to in the expansion:
yC1:
Y Y[C1] {α4}
| [C1] {α5};
Now we notice that for some of the productions in the examples given above, some of the expansions result in a guard condition at the beginning of the rule, rather than on a terminal or non-terminal symbol in the rule. These are what we shall term carry-back guard conditions. For convenience in later expansions, we shall divide these rules into two sets, the set consisting of an embedded guard condition but with no guard at the beginning of the rule (denoted using subscript φ), and a set beginning with the guard condition (denoted using subscript ε) as follows:
zφC1: Z y[C1] {α6}
| y W[C1] {α7};
zεC1: [C1] {α8};
yφC1: Y Y[C1] {α4};
yεC1: [C1] {α5};
The carry-back guards now have to be applied to the rules in which
the guarded non-terminals appear. Any place in a rule where we see
p: q r[C] s
, and r[C]
has a carry-back set of rules, we shall replace
it with two rules. One rule merely replaces r[C]
with rφC
,
while the other rule in which we carry back the condition to the previous symbol
replaces r[C]
with its empty set rule: p: q[C] rεC s
.
In our grammar, this means we shall replace rules as follows:
w: x y zφC1 V {α2}
| x y zφC1 x {α3}
| x y[C1] zεC1 V {α2}
| x y[C1] zεC1 x {α3};
We also have a similar modification to apply in one of our previously added rules:
zφC1: Z yφC1 {α6}
| Z[C1] yεC1 {α6};
The newly written rules for w
have uncovered some more non-terminals followed by
guard conditions (y[C1]
), so we apply the same rule replacement steps in
w
again, but for y[C1]
:
w: x yφC1 zεC1 V {α2}
| x[C1] yεC1 zεC1 V {α2}
| x yφC1 zεC1 x {α3}
| x[C1] yεC1 zεC1 x {α3};
We now notice that the non-terminal and guard x[C1]
has not been expanded yet,
so we need to apply that expansion:
xφC1: X Y[C1] {α9};
xεC1: [C1] {α10};
Causing the rules for w
to be remodelled as:
w: x y zφC1 V {α2}
| x y zφC1 x {α3}
| x yφC1 zεC1 V {α2}
| xφC1 yεC1 zεC1 V {α2}
| [C1] xεC1 yεC1 zεC1 V {α2}
| x yφC1 zεC1 x {α3}
| xφC1 yεC1 zεC1 x {α3}
| [C1] xεC1 yεC1 zεC1;
As we now have caused carry-backs for w, so must we split the rules into two groups:
wφ: x y zφC1 V {α2}
| x y zφC1 x {α3}
| x yφC1 zεC1 V {α2}
| xφC1 yεC1 zεC1 V {α2}
| x yφC1 zεC1 x {α3}
| xφC1 yεC1 zεC1 x {α3};
wεC1:
[C1] xεC1 yεC1 zεC1 V {α2}
| [C1] xεC1 yεC1 zεC1;
As w
has now been split into two sets of rules, one with a guard
condition to be carried back, we next recursively apply carry-back rule rewrites
on all rules in the grammar containing w among the tokens on the right hand side of
any production:
topRule: Z wφ {α0}
| wφ {α1}
| Z[C1] wεC1 {α0}
| [C1] wεC1 {α1};
Once again, we now see that topRule
has its own carry-back rule,
so we split it into two groups:
topRuleφ: Z wφ{α0}
| wφ {α1};
| Z[C1] wεC1 {α0};
topRuleεC1: [C1] wεC1 {α1};
Since this uncovers a carry back for topRule
that must be applied to rules containing
topRule
in the RHS expansion, we rewrite
any rules in which topRule
appears on the RHS:
_start: SOF topRuleφ
| SOF[C1] topRuleεC1;
Notice that by rewriting the extended starting rule to have a synthetic
terminal first token that precedes any real tokens in the input stream, we guarantee
that the starting token _start
will never have a carry-back token of its own.
Hence the start of the grammar applies any carry-back for the entire grammar
onto the synthetic token. Notice too that at this point, no part of the
grammar contains a guard condition applied to a non-terminal. Guard
conditions are always applied to preceding terminal tokens. The fully
reconstructed grammar is:
_start:
SOF topRuleφ
| SOF[C1] topRuleεC1;
topRuleφ:
Z wφ {α0}
| wφ {α1}
| Z[C1] wεC1 {α0};
topRuleεC1:
wεC1 {α1}; // Carries back [C1]
wφ: x y zφC1 V {α2}
| x y zφC1 x {α3}
| x yφC1 zεC1 V {α2}
| xφC1 yεC1 zεC1 V {α2}
| x yφC1 zεC1 x {α3}
| xφC1 yεC1 zεC1 x {α3};
wεC1: xεC1 yεC1 zεC1 V {α2} // Carries back [C1]
| xεC1 yεC1 zεC1 x {α3}; // Carries back [C1]
x: X Y {α9}
| {α10};
xφC1: X Y[C1] {α9};
xεC1: {α10}; // Carries back [C1]
y: Y Y {α4}
| {α5};
yφC1: Y Y[C1] {α4};
yεC1: {α5}; // Carries back [C1]
z: Z y {α6}
| y W {α7}
| {α8};
zφC1: Z yφC1 {α6}
| Z[C1] yεC1 {α6}
| y W[C1] {α7};
zεC1: {α8}; // Carries back [C1]
This can now be used to construct the first sets and item sets for the LR(1) parser in the usual way. A few points to note on the implementation of these rules are worth mentioning:
- Any epsilon rule (a rule that carries back a guard condition) has already had its guard condition mapped back onto the previous token in the grammar. Hence when one of these rules is encountered in the rewritten grammar, the leading guard condition has already been mapped back onto any preceding tokens;
- In a pedantic implementation of this process, every rule group would be split into two groups of tokens, being those that have no carry-back and those that do. In practice only a subset of the rules will have carry-backs into previous elements, and an even smaller group will have epsilon rules but no phi rules. The final grammar only needs to keep those rules that appear on the RHS of a production, so there may be occasions when rules can just be dropped altogether if the pedantic approach is taken;
- The worst case scenario is that the total number of productions doubles for the grammar. This naturally implies a bigger cluster of item sets for the state machine, and is the price paid for supporting guard conditions on non-terminal tokens. The size of the machine can be mitigated by using parser table simplifications such as LALR(1) or SLR(1) parsing, but this will have the usual caveats relating to a reduction in the number of unambiguous grammars that can be read. If a combination of LALR(1) parsing with GLR techniques is used, this may give the benefits of a reduced size state machine, but at cost of running multiple state machines over the same input token sequence to see which ones survive the entire input sequence.
Both the in-line and the offline parser generators used in the ParseLR suite
apply the transformations described above, if they encounter non-terminal
tokens that have guard conditions attached to them. They also perform
canonicalisation of boolean guard conditions to test them for equivalence
against each other as part of the optimisation process. Hence, for example, the expressions
[!X & !Y]
and [!(X | Y)]
would be treated as
the same guard condition when encountered.