Recursive Descent Parser - Spicery/Nutmeg GitHub Wiki
Recursive Descent Parsing with Precedence
The syntax is suitable for a style of parsing called recursive descent and this section outlines an implementation of an extensible parser are outlined here. The idea is to drive the parser from a table that maps tokens into specialised 'mini-parsers'.
Tokens
For our purposes, tokens will fall into the following categories:
- literal constants e.g.
"foo",123.7 - identifiers e.g.
x,the_bad_play,Oboe - syntax-word e.g.
if,+,endfor,),],then - punctuation e.g.
),],endif,then,do
Syntax-words have mini-parser-functions associated with them. They come in two flavours, prefixers and postfixers. Normally these are completely distinct but there are a few tokens (e.g. () that are both.
- prefixers e.g.
if,[,let,( - postfixers e.g.
+,*,,,(
Prefix-words are allowed to start expressions and postfix-words allow expressions to be continued. The trick is to implement operator precedence. To do this we give postfix-words numerical precedences, which conventionally is in a fixed range (e.g. 0 to 100). These are used to determine binding. A lower numerical precedence binds more tightly and, by convention, the precedence of general expressions might be a fixed value (e.g. 100). For example, the + token might have a precedence of (say) 60 and * might have a precedence of (say) 30. This would mean that in the expression a * b + c the parser will group it the way we want as (a * b) + c.
Recursive Descent
The idea is that we drive a main expression parser (readExpression) that uses the mini-parsers. Each mini-parser is a function that consumes tokens from a (peekable) token-generator to return a code-tree. Here's the basic structure of a recursive descent parser that includes precedence:
def readExpression( prec, source ):
token = source.peekOrElse()
if token:
if token.isPostfixer():
raise Exception( f'Unexpected token in bagging area: {token}' )
else:
source.pop()
if token.isPrefixer():
sofar = token.runPrefixMiniParser( source )
else:
sofar = token.toCodeTree()
while True:
token = source.peekOrElse()
if not token or not token.isPostfixer(): break
p = token.precedence()
if not p <= prec: break
source.pop()
sofar = token.runPostfixMiniParser( p, sofar, source )
return sofar
else:
raise Exception( 'Unexpected end of input' )
We can show this working on the expression a * b + c * d. First, we cheat a bit by hard-coding the token-source as shown below:
class TokenSource:
"""See Peekable-Pushable Generators for how to do this properly"""
def __init__( self ):
self.tokens = [ Token( t ) for t in 'a * b + c * d'.split() ]
def peekOrElse( self ):
return self.tokens[0] if self.tokens else None
def pop( self ):
self.tokens = self.tokens[1:]
We will need to set up the mini-parsers. In practice we would make tokens a class whose methods did the table look-up. But we'll just hard-code everything here.
class Token:
def __init__( self, token_text ):
self._token = token_text
def isPrefixer( self ):
"""None in this demo"""
return False
def isPostfixer( self ):
return self._token in '*+'
def toCodeTree( self ):
return dict( id=self._token )
def runPrefixMiniParser( self, src ):
"""None needed in this demo"""
raise Exception('Not implemented yet')
def runPostfixMiniParser( self, prec, sofar, src ):
return dict( call=self._token, lhs=sofar, rhs=readExpression(prec, src) )
def precedence(self):
if self._token == '+':
return 60
elif self._token == '*':
return 30
else:
return 100
You can now test it out like this:
steve% python3 -i test.py
>>> readExpression(10,TokenSource())
{'kind': 'call', 'name': '+', 'lhs': {'kind': 'call', 'name': '*', 'lhs': {'id': 'a'}, 'rhs': {'id': 'b'}}, 'rhs': {'kind': 'call', 'name': '*', 'lhs': {'id': 'c'}, 'rhs': {'id': 'd'}}}
>>>