Recursive Descent Parser - Spicery/Nutmeg GitHub Wiki
# Recursive Descent Parsing with Precedence
The syntax is suitable for a style of parsing called recursive descent, and this section outlines an implementation of an extensible recursive descent parser. The idea is to drive the parser from a table that maps tokens to specialised 'mini-parsers'.
## Tokens
For our purposes, tokens will fall into the following categories:
- literal constants e.g. `"foo"`, `123.7`
- identifiers e.g. `x`, `the_bad_play`, `Oboe`
- syntax-words e.g. `if`, `+`, `endfor`, `)`, `]`, `then`
- punctuation e.g. `)`, `]`, `endif`, `then`, `do`
Syntax-words have mini-parser functions associated with them. They come in two flavours, prefixers and postfixers. Normally these are completely distinct, but a few tokens (e.g. `(`) are both.

- prefixers e.g. `if`, `[`, `let`, `(`
- postfixers e.g. `+`, `*`, `,`, `(`
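The two flavours can be pictured as a pair of lookup tables. Here is a minimal sketch (the table names and `roles` helper are illustrative, not part of the original design) showing how a token such as `(` can appear in both:

```python
# Hypothetical dispatch tables: each syntax-word may have a prefix
# mini-parser, a postfix mini-parser, or - like '(' - both.
PREFIXERS = { 'if', '[', 'let', '(' }
POSTFIXERS = { '+', '*', ',', '(' }

def roles( token_text ):
    """Report which flavours of mini-parser this token has."""
    flavours = []
    if token_text in PREFIXERS: flavours.append( 'prefix' )
    if token_text in POSTFIXERS: flavours.append( 'postfix' )
    return flavours
```

In a real parser the table entries would be the mini-parser functions themselves rather than bare token texts.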
Prefix-words are allowed to start expressions, and postfix-words allow expressions to be continued. The trick is to implement operator precedence. To do this we give postfix-words numerical precedences, conventionally in a fixed range (e.g. 0 to 100). These are used to determine binding: a lower numerical precedence binds more tightly and, by convention, the precedence of a general expression is a fixed value (e.g. 100). For example, the `+` token might have a precedence of (say) 60 and `*` might have a precedence of (say) 30. This means that the parser groups the expression `a * b + c` the way we want, as `(a * b) + c`.
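The convention can be captured in a small table. This sketch (names are illustrative) checks the binding rule described above:

```python
# Assumed precedence table from the text: lower numbers bind tighter.
PRECEDENCE = { '+': 60, '*': 30 }
EXPRESSION_PREC = 100  # conventional precedence of a general expression

def binds_tighter( op1, op2 ):
    """True if op1 grabs its operands before op2 does."""
    return PRECEDENCE[ op1 ] < PRECEDENCE[ op2 ]
```

So `binds_tighter('*', '+')` holds, which is exactly why `a * b + c` groups as `(a * b) + c`.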
## Recursive Descent
The idea is that we drive a main expression parser (`readExpression`) that uses the mini-parsers. Each mini-parser is a function that consumes tokens from a (peekable) token-generator and returns a code-tree. Here's the basic structure of a recursive descent parser that includes precedence:
```python
def readExpression( prec, source ):
    token = source.peekOrElse()
    if token:
        if token.isPostfixer():
            raise Exception( f'Unexpected token in bagging area: {token}' )
        else:
            source.pop()
            if token.isPrefixer():
                sofar = token.runPrefixMiniParser( source )
            else:
                sofar = token.toCodeTree()
            while True:
                token = source.peekOrElse()
                if not token or not token.isPostfixer(): break
                p = token.precedence()
                if p > prec: break
                source.pop()
                sofar = token.runPostfixMiniParser( p, sofar, source )
            return sofar
    else:
        raise Exception( 'Unexpected end of input' )
```
We can show this working on the expression `a * b + c * d`. First, we cheat a bit by hard-coding the token-source as shown below:
```python
class TokenSource:
    """See Peekable-Pushable Generators for how to do this properly"""

    def __init__( self ):
        self.tokens = [ Token( t ) for t in 'a * b + c * d'.split() ]

    def peekOrElse( self ):
        return self.tokens[0] if self.tokens else None

    def pop( self ):
        self.tokens = self.tokens[1:]
```
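As a quick sanity check of the peek/pop contract (with `Token` reduced to a bare text holder so this snippet stands alone):

```python
class Token:
    """Minimal stand-in: just remembers its text."""
    def __init__( self, token_text ):
        self._token = token_text

class TokenSource:
    def __init__( self ):
        self.tokens = [ Token( t ) for t in 'a * b + c * d'.split() ]
    def peekOrElse( self ):
        return self.tokens[0] if self.tokens else None
    def pop( self ):
        self.tokens = self.tokens[1:]

src = TokenSource()
assert src.peekOrElse()._token == 'a'   # peeking does not consume
src.pop()                               # now 'a' is gone
assert src.peekOrElse()._token == '*'
```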
We will also need to set up the mini-parsers. In practice we would make tokens a class whose methods perform the table look-up, but here we'll just hard-code everything.
```python
class Token:

    def __init__( self, token_text ):
        self._token = token_text

    def isPrefixer( self ):
        """None in this demo"""
        return False

    def isPostfixer( self ):
        return self._token in '*+'

    def toCodeTree( self ):
        return dict( id=self._token )

    def runPrefixMiniParser( self, src ):
        """None needed in this demo"""
        raise Exception( 'Not implemented yet' )

    def runPostfixMiniParser( self, prec, sofar, src ):
        return dict( kind='call', name=self._token, lhs=sofar, rhs=readExpression( prec, src ) )

    def precedence( self ):
        if self._token == '+':
            return 60
        elif self._token == '*':
            return 30
        else:
            return 100
```
You can now test it out like this. Note that we pass 100, the conventional precedence of a general expression, so that both operators are within range:

```
steve% python3 -i test.py
>>> readExpression(100,TokenSource())
{'kind': 'call', 'name': '+', 'lhs': {'kind': 'call', 'name': '*', 'lhs': {'id': 'a'}, 'rhs': {'id': 'b'}}, 'rhs': {'kind': 'call', 'name': '*', 'lhs': {'id': 'c'}, 'rhs': {'id': 'd'}}}
>>>
```
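For readers who want to run the whole demo as one file, here it is assembled into a single self-contained script; the only variation (mine, for convenience) is that `TokenSource` takes the expression text as a parameter:

```python
def readExpression( prec, source ):
    token = source.peekOrElse()
    if not token:
        raise Exception( 'Unexpected end of input' )
    if token.isPostfixer():
        raise Exception( f'Unexpected token in bagging area: {token}' )
    source.pop()
    # A prefixer runs its own mini-parser; anything else is a leaf.
    sofar = token.runPrefixMiniParser( source ) if token.isPrefixer() else token.toCodeTree()
    while True:
        token = source.peekOrElse()
        if not token or not token.isPostfixer(): break
        p = token.precedence()
        if p > prec: break          # binds too loosely for this context
        source.pop()
        sofar = token.runPostfixMiniParser( p, sofar, source )
    return sofar

class Token:
    def __init__( self, token_text ):
        self._token = token_text
    def isPrefixer( self ):
        return False                # no prefixers in this demo
    def isPostfixer( self ):
        return self._token in '*+'
    def toCodeTree( self ):
        return dict( id=self._token )
    def runPrefixMiniParser( self, src ):
        raise Exception( 'Not implemented yet' )
    def runPostfixMiniParser( self, prec, sofar, src ):
        return dict( kind='call', name=self._token, lhs=sofar, rhs=readExpression( prec, src ) )
    def precedence( self ):
        return { '+': 60, '*': 30 }.get( self._token, 100 )

class TokenSource:
    def __init__( self, text ):
        self.tokens = [ Token( t ) for t in text.split() ]
    def peekOrElse( self ):
        return self.tokens[0] if self.tokens else None
    def pop( self ):
        self.tokens = self.tokens[1:]

tree = readExpression( 100, TokenSource( 'a * b + c * d' ) )
assert tree['name'] == '+'          # '+' outermost: '*' bound more tightly
assert tree['lhs']['name'] == '*'
assert tree['rhs']['name'] == '*'
```

The assertions at the end confirm the grouping `(a * b) + (c * d)` that the precedence table was designed to produce.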