Recursive Descent Parser - Spicery/Nutmeg GitHub Wiki

Recursive Descent Parsing with Precedence

The syntax is suitable for a style of parsing called recursive descent and this section outlines an implementation of an extensible parser are outlined here. The idea is to drive the parser from a table that maps tokens into specialised 'mini-parsers'.

Tokens

For our purposes, tokens will fall into the following categories:

  • literal constants e.g. "foo", 123.7
  • identifiers e.g. x, the_bad_play, Oboe
  • syntax-word e.g. if, +, endfor, ), ], then
  • punctuation e.g. ), ], endif, then, do

Syntax-words have mini-parser-functions associated with them. They come in two flavours, prefixers and postfixers. Normally these are completely distinct but there are a few tokens (e.g. () that are both.

  • prefixers e.g. if, [, let, (
  • postfixers e.g. +, *, ,, (

Prefix-words are allowed to start expressions and postfix-words allow expressions to be continued. The trick is to implement operator precedence. To do this we give postfix-words numerical precedences, which conventionally is in a fixed range (e.g. 0 to 100). These are used to determine binding. A lower numerical precedence binds more tightly and, by convention, the precedence of general expressions might be a fixed value (e.g. 100). For example, the + token might have a precedence of (say) 60 and * might have a precedence of (say) 30. This would mean that in the expression a * b + c the parser will group it the way we want as (a * b) + c.

Recursive Descent

The idea is that we drive a main expression parser (readExpression) that uses the mini-parsers. Each mini-parser is a function that consumes tokens from a (peekable) token-generator to return a code-tree. Here's the basic structure of a recursive descent parser that includes precedence:

def readExpression( prec, source ):
    token = source.peekOrElse()
    if token:
        if token.isPostfixer():
            raise Exception( f'Unexpected token in bagging area: {token}' )
        else:
            source.pop()
            if token.isPrefixer():
                sofar = token.runPrefixMiniParser( source )
            else:
                sofar = token.toCodeTree()
            while True:
                token = source.peekOrElse()
                if not token or not token.isPostfixer(): break
                p = token.precedence()
                if not p <= prec: break
                source.pop()
                sofar = token.runPostfixMiniParser( p, sofar, source )
            return sofar
    else:
        raise Exception( 'Unexpected end of input' )

We can show this working on the expression a * b + c * d. First, we cheat a bit by hard-coding the token-source as shown below:

class TokenSource:
  """See Peekable-Pushable Generators for how to do this properly"""

  def __init__( self ):
    self.tokens = [ Token( t ) for t in 'a * b + c * d'.split() ]

  def peekOrElse( self ):
    return self.tokens[0] if self.tokens else None

  def pop( self ):
    self.tokens = self.tokens[1:]

We will need to set up the mini-parsers. In practice we would make tokens a class whose methods did the table look-up. But we'll just hard-code everything here.

class Token:

    def __init__( self, token_text ):
        self._token = token_text

    def isPrefixer( self ):
      """None in this demo"""
      return False

    def isPostfixer( self ):
      return self._token in '*+'

    def toCodeTree( self ):
      return dict( id=self._token )

    def runPrefixMiniParser( self, src ):
      """None needed in this demo"""
      raise Exception('Not implemented yet')

    def runPostfixMiniParser( self, prec, sofar, src ):
      return dict( call=self._token, lhs=sofar, rhs=readExpression(prec, src) )

    def precedence(self):
      if self._token == '+':
        return 60
      elif self._token == '*':
        return 30
      else:
        return 100

You can now test it out like this:

steve% python3 -i test.py 
>>> readExpression(10,TokenSource())
{'kind': 'call', 'name': '+', 'lhs': {'kind': 'call', 'name': '*', 'lhs': {'id': 'a'}, 'rhs': {'id': 'b'}}, 'rhs': {'kind': 'call', 'name': '*', 'lhs': {'id': 'c'}, 'rhs': {'id': 'd'}}}
>>>