logical grammar specification - adesutherland/CREXX GitHub Wiki

Grammar Specification

Page Status: This page is ready for review for Phase 0 (PoC). It may have errors and be changed based on feedback and implementation experience

ANSI Standard

The current REXX ANSI specification describes the grammar in terms of implementation details, specifically splitting the logic across the interactions between a stateful lexer and parser (although it does not describe it in these terms). And although not explicitly stated, it is possible that the parser is assumed to be an LALR one.

The difficulty with this approach is that the specification logic is split up and therefore hard to interpret, or indeed validate. It is also often not clear whether certain rules are:

  • Essential to the REXX Grammar being described
  • Needed to remove ambiguities in the grammar inherent in REXX itself
  • Needed to overcome weaknesses of the LALR parser (i.e. one with a single token of lookahead)

Parsing Expression Grammar (PEG)

We want a format that describes the grammar in an implementation-independent fashion and that is comprehensible to a REXX programmer.

At the logical level we will use a PEG grammar to describe REXX syntax, and (for REXX Level C - Classic REXX) convert the ANSI specification to this format.

For readability, we will assume that the Parser / Grammar can support Left Recursion.

For Phases 0-2 we will specify (at the Physical Level) how we implement these grammars in three stages - lexer, parser, and finally AST Manipulation.

For Phase 3 we intend to use a PEG parser delivered in REXX Level L directly; likely a PIKA Parser. We are therefore using a PEG format defined by, and inspired by, its reference implementation.

Parser Errors

For all cREXX Phases and REXX levels, parsing errors will be embedded into the AST tree, and it is expected that an AST Manipulation stage (even for Phase 3) will be needed to ensure that error reporting is useful and standards compliant.

REXX PEG Format

This extends the PEG format to concisely support REXX constructs, and is inspired by the PIKA Parser reference implementation on GitHub.

Currently this format is only supported by "human interpretation". In time it is envisaged that it will be developed and used in a REXX Level L PARSE Instruction to allow language parsing and AST tree generation.

Whitespace and Comments

An issue with the PEG format is the treatment of non-program tokens (i.e. whitespace and comments). Either they are assumed to have been stripped out, in which case the grammar cannot make use of this information (recall that REXX uses abuttal as an operator), or they have to be referenced throughout the grammar rules, which is very messy and does not support the common approach of implementing a separate lexing stage.

Our format defines two master rules:

  • \WHITESPACE - to indicate whitespace tokens; these are discarded

  • \COMMENT - to indicate comment tokens. These tokens can be considered to have been created but hidden. For human interpretation this is no different from \WHITESPACE, but at a later date the distinction could be used, for example, by language transformation applications.

In addition, the &| (Abuttal) notation - see below - allows grammars to detect when there was no whitespace (or comments) between two tokens.

Strings (" or ') of characters, and rule names that begin with a capital letter, are treated as tokens (see below); in this case whitespace and comments are NOT ignored.
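As an illustration only (these definitions are an assumption, not part of the specification), the two master rules might be sketched as:

```
\WHITESPACE <- [ \t\r\n]+;
\COMMENT    <- "/*" (!"*/" .)* "*/";
```

Note that real REXX comments nest, so a full \COMMENT rule would need to be recursive, allowing \COMMENT to appear within itself.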

Rule Format

The rules are of the form RuleName <- Clause;.

If a rule name starts with a capital letter (recommendation: capitalise the whole rule name) then it is considered to be a TOKEN. This just means that whitespace and comments are included in the matched token rather than being suppressed.

AST node labels may be specified in the form RuleName <- ASTNodeLabel:Clause;.
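For example, a hypothetical pair of rules (names invented for illustration) showing a TOKEN rule and a labelled rule:

```
NUMBER <- [0-9]+;
assignment <- ASSIGNMENT:(symbol '=' expression);
```

Because NUMBER starts with a capital letter it is a TOKEN, so whitespace and comments are not suppressed within its match; assignment attaches the AST node label ASSIGNMENT to its clause.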

Sub-rules are processed in the order specified by the parent rule to remove ambiguities.

Deprecated: The rule name may be followed by optional square brackets containing the precedence of the rule (as an integer), optionally followed by an associativity modifier (,L or ,R).

Nonterminal clauses can be specified using the following notation:

  • X Y Z for a sequence of matches (X should match, followed by Y, followed by Z), i.e. Seq
  • X / Y / Z for ordered choice (X should match, or if it doesn't, Y should match, or if it doesn't, Z should match), i.e. First
  • X+ to indicate that X must match one or more times, i.e. OneOrMore
  • X* to indicate that X must match zero or more times, i.e. ZeroOrMore - nongreedy
  • X** to indicate that X must match zero or more times, i.e. ZeroOrMore - greedy
  • X? to indicate that X may optionally match, i.e. Optional
  • &X to look ahead and require X to match without consuming characters, i.e. FollowedBy
  • !X to look ahead and require that there is no match (the logical negation of &X), i.e. NotFollowedBy
  • &| to look ahead (without consuming characters) and require that there is no whitespace before the next token, i.e. Abuttal
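A hypothetical sketch combining several of these operators (rule names are invented, not taken from the cREXX grammar):

```
expression  <- term (('+' / '-') term)*;
abut_concat <- term &| term;
loop_start  <- &'do' clause;
non_keyword <- !'end' SYMBOL;
```

Here abut_concat only matches when no whitespace or comment separates the two terms - the REXX abuttal concatenation case - while &'do' and !'end' look ahead without consuming any input.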

Terminal clauses can be specified using the following notation. Standard character escaping is supported, including for Unicode codepoints.

  • '[' for individual characters
  • "import" for strings of characters (case sensitive)
  • 'import' for strings of characters (case insensitive)
  • [0-9] for character ranges
  • [+\-*/] for character sets (note that - is escaped)
  • [^\n] for negated character sets (note that this will slow down the parser, since a negated matching rule will spuriously match in many more places)
  • \ = Escape character
  • . = wildcard (matches any single character)
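For example, some hypothetical token rules (names invented) using these terminal notations:

```
DIGIT    <- [0-9];
NUMBER   <- DIGIT+ ('.' DIGIT+)?;
OPERATOR <- [+\-*/];
SAY_KW   <- 'say';
```

'say' (single-quoted) matches case-insensitively, so SAY, say and Say are all accepted, while a double-quoted "say" would match only the exact lowercase string; in OPERATOR the - is escaped so it is not read as a character range.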

Logic can be used between clauses

  • // = logical or
  • && = logical and
  • ^ = not
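As an illustration of the intended usage (an assumption, not taken from the specification), a token could be defined with clause-level logic:

```
HEX_DIGIT <- [0-9] // [a-fA-F];
```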

Unordered Sets

  • {a b c} - Unordered set, one of each member in any order
  • {+ a b c} - Unordered set, one or more of each member in any order
  • {* a b c} - Unordered set, zero or more of each member in any order
  • {n* a b c} - Unordered set, n or more of each member in any order
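Unordered sets suit REXX constructs whose sub-clauses may appear in any order, such as the TO/BY/FOR phrases of a DO loop. A hypothetical sketch (rule names invented); note that {* ...} also permits each phrase to repeat, which would have to be rejected in a later AST Manipulation stage:

```
do_count   <- expression {* to_phrase by_phrase for_phrase};
to_phrase  <- 'to' expression;
by_phrase  <- 'by' expression;
for_phrase <- 'for' expression;
```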

Special Rules

  • \eof - Platform specific detection of EOF (or End of Stream)
  • \eol - Platform specific detection of EOL
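For example (a sketch, not the actual cREXX rules), a program could be anchored to the end of the input stream and split into lines:

```
program <- clause* \eof;
clause  <- (!\eol .)* \eol;
```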

AST Generation

  • Each rule returns a set of ordered nodes (which can be empty)
  • rule <- a:subrule b:subrule -> * Any sub-rules' AST nodes are kept as a set of sibling nodes. Equivalent to:
  • rule <- a:subrule b:subrule No AST node is created; any sub-rules' AST nodes are kept as a set of sibling nodes (for the use of parent rules)
  • ... -> TYPE Creates an AST node of type TYPE made up of the complete rule selection; any sub-rules' AST nodes are added as children. Equivalent to:
  • rule <- a:subrule b:subrule -> (TYPE *) AST Tree with node type TYPE as root and nodes as children == -> TYPE
  • rule <- subrule subrule -> TYPE * AST Tree with node type TYPE as root and any nodes as siblings following TYPE
  • rule <- a:subrule b:subrule -> TYPE a b AST Tree with node type TYPE as root and a and b nodes as siblings following TYPE
  • rule <- a:subrule b:subrule -> a b AST Tree with nodes a and b as siblings
  • ... -> TYPE[parm] As -> TYPE but with a parameter stored in the node
  • rule <- a:subrule b:subrule c:subrule ... -> (a b c ...) AST Tree with rule label a as root and b and c (etc.) nodes as children (b and c can be null, in which case they are added as '\' nodes)
  • rule <- a:subrule b:subrule* -> (a b) AST Tree with rule label a as root and many children from subrule b
  • rule <- a:subrule b:subrule -> (TYPE a b) AST Tree with node type TYPE as root and a and b nodes as children
  • rule <- a:subrule b:subrule c:subrule -> (TYPE (a b) c) AST Tree with node type TYPE as root and a and c nodes as children b as a grandchild
  • rule <- subrule -> TYPE1 / subrule -> TYPE2 Shows how different sub-rules may lead to different AST node types
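Putting this together, a hypothetical assignment rule (names invented) and the tree it would produce:

```
assignment <- a:SYMBOL '=' b:expression -> (ASSIGN a b);
```

For the input x = 1 + 2 this would create an ASSIGN root node with the SYMBOL node for x and the expression node for 1 + 2 as its two children.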