Parsing Expression Grammars - troyp/jq GitHub Wiki

jq as a PEG engine

jq contains all the ingredients required for conveniently writing PEGs and PEG parsers. In effect, one can write a PEG parser for a PEG grammar simply by transcribing the grammar into jq, though some care with respect to jq's "declare before use" rule may be required.

Other tasks, such as evaluation of arithmetic expressions defined by a PEG grammar, can be achieved with either no or only minimal changes to the transcribed grammar. This is illustrated in the second example below.

jq also has the machinery required to add parameterized non-terminals, as illustrated in the third example below.

For an introduction to Parsing Expression Grammars, see for example https://en.wikipedia.org/wiki/Parsing_expression_grammar

Transcribing a PEG grammar to jq

Transcription to jq is possible because each of the seven basic[^1] PEG operations can be mapped directly to jq filters with the same semantics. Here is the mapping that will be used in this article:

 Sequence: e1 e2             e1 | e2
 Ordered choice: e1 / e2     e1 // e2
 Zero-or-more: e*            star(E)
 One-or-more: e+             plus(E)
 Optional: e?                optional(E)
 And-predicate: &e           amp(E) 
 Not-predicate: !e           neg(E)

where:

def star(E): (E | star(E)) // .;

def plus(E): E | (plus(E) // . );

def optional(E): E // .;

def amp(E): . as $in | E | $in;

def neg(E): select( [E] == [] );

Example 1 - `a^n b^n c^n : n >= 1`

Following the article on PEGs, the PEG for the string expressions a^n b^n c^n where n >= 1 can be written as follows:

 S ← &(A 'c') 'a'+ B !.
 A ← 'a' A? 'b'
 B ← 'b' B? 'c'

This is a nice example as it uses five of the seven basic PEG operations, including the two look-ahead operations & and !. In addition, it uses '.' to signify that the input is nonempty.

The PEG can be expressed in jq by simple transcription:

def A: literal("a") | optional(A) | literal("b");
def B: literal("b") | optional(B) | literal("c");
def S: amp(A | literal("c")) | plus(literal("a")) | B | neg(nonempty);

assuming that literal and nonempty have been suitably defined. Their implementations will depend on the task at hand. In the following we consider two specific tasks: recognition, and tokenization.

Recognition

For recognition of an input string, the following would suffice:

def literal($s): select(startswith($s)) | .[$s|length :];

def nonempty: select(length > 0);

If we want the output to be true or false depending on whether the input string conforms to the grammar, we could write:

(S | true) // false

Example:

Assuming all the above jq code has been assembled into a file, parse.jq:

   $ jq -f parse.jq <<< abc
   true

Tokenization

Let's suppose we wish to build up an array of the recognized terminals. In that case, we need only modify the functions that consume or examine the string input, i.e. literal and nonempty.

For convenience, on the jq side, we'll pass a JSON object of the form {"remainder": _, "result": _} through the filter.

The definitions of literal and nonempty thus become:

def literal($s):
  select(.remainder | startswith($s))
  | .result += [$s]
  | .remainder |= .[$s | length :] ;


def nonempty: select( (.remainder | length) > 0 );

And for convenience, we could define:

def parse: {remainder: .} | S | .result;

So, with parse as our main program, a run would now look like this:

jq -R -cf abc.jq <<< abc
["a","b","c"]

Declare-before-use

In jq, a filter can only be invoked after it has been declared. This means that when transcribing PEG rules into jq filters, the order is important. To handle circularly-defined PEG expressions, one must take advantage of jq's support for subfunctions, that is, defs within defs. This is illustrated in the next example, also taken from the Wikipedia article.

Example 2 - arithmetic

As presented in the aforementioned Wikipedia article, a PEG for recognizing simple arithmetic expressions for addition, subtraction, multiplication and division applied to non-negative integers is as follows:

Expr    ← Sum
Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*

In this case, for the reason mentioned in the previous section, care must be taken with transcription. Since the main goal is in any case to define Expr, it makes sense to define all the other PEG expressions as subfunctions of Expr:

# Expr    ← Sum
def Expr:

  # Value   ← [0-9]+ / '(' Expr ')'
  def Value: parseNumber( "[0-9]+" ) // (literal("(") | Expr | literal(")"));
   
  # Product ← Value (('*' / '/') Value)*
  def Product: (Value | star( (literal("*") // literal("/")) | Value));

  # Sum     ← Product (('+' / '-') Product)*
  def Sum: (Product | star( (literal("+") // literal("-")) | Product));

  Sum ;

With literal as previously defined for a recognition engine, and parseNumber as defined in the Appendix, we thus have a recognition engine.

Evaluation

In this section, we'll illustrate how the transcribed PEG can be adapted to evaluate the recognized arithmetic expressions.

The strategy will be to pass along an object {"remainder":, "result":} as with the previous example, except that when a parenthesized subexpression is encountered, the result will be augemented by an array. Thus the def of Value above will be modified to:

def Value: parseNumber( "[0-9]+" ) // (consume("[(]") | box(Expr) | consume("[)["));

where consume/1 simply consumes the string matching the regex input, and box/1 records Expr in an array. These helper functions, together with an eval function for evaluating the parsed result, are given in the Appendix below.

We can then define a jq function to evalute a string representing strings defined by the PEG as follows:

def evaluate:
  {remainder: .}
  | Expr.result
  | eval ;

Example:

   $ jq -f arithmetic.jq <<< "(1+2)*(3+4)+(1*1)"
   22

Example 3: `[a-z]+ing`

Because a PEG parser does not backtrack, the task of determining the first match of the regex [a-z]+ing in a string cannot be accomplished simply by writing a PEG rule such as:

plus(parse("[a-z]")) | literal("ing")

because the first expression will gobble up all the "i-n-g" characters, resulting in failure.

An appropriate PEG grammar for the task at hand could be achieved in jq as follows:

def atoz: parse("[a-z]");
def ing:  literal("ing");

plus(neg(ing) | atoz) | ing

Better yet, we could define a parameterized non-terminal to capture the pattern for reusability:

def Aplus_B(A; B): plus(neg(B) | A) | B;

So the PEG grammar now looks like the corresponding regular expression:

Aplus_B( atoz; ing )

Appendix - Helper Functions

def eval:
  if type == "array" then
    if length == 0 then null
    else .[-1] |= eval
    | if length == 1 then .[0]
      else (.[:-2] | eval) as $v
      | if   .[-2] == "*" then $v * .[-1]
        elif .[-2] == "/" then $v / .[-1]
        elif .[-2] == "+" then $v + .[-1] 
        elif .[-2] == "-" then $v - .[-1] 
        else tostring|error
	end
      end
    end
  else .
  end ;

def box(E):
   ((.result = null) | E) as $e
   | .remainder = $e.remainder
   | .result += [$e.result]  # the magic sauce
   ;

# Consume a regular expression rooted at the start of .remainder, or emit empty;
# on success, update .remainder and set .match but do NOT update .result
def consume($re):
  # on failure, match yields empty
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string ;

def parse($re):
  consume($re)
  | .result = .result + [.match] ;

def parseNumber($re):
  consume($re)
  | .result = .result + [.match|tonumber] ;

[^1]: Note that some of these can be written in terms of the others, e.g. E+ is just EE*, so "basic" is not meant in the sense of "fundamental".