HOW TO Add a New Language Feature

In this article we walk through how we added the if-syntax to Nutmeg. This gives a rough idea of the work involved in adding a new feature to the language.

Design

The first step is designing the new feature. There's no recipe for this, except for the need to write it up, which acts as an essential form of reflection. We write it up by adding a wiki page that describes the feature. So in this case I wrote a page called "If Syntax" and added the description to that. This wiki page does not have to be perfect first time, by the way, but it should certainly convey the overall idea and be substantially correct. The write-up should include:

  • An overview of the new feature, explaining its typical use and meaning, giving some examples and providing some motivation.
  • Since the if statement introduces its own syntax, the write-up should include an EBNF grammar for the feature. This should be checked using https://bottlecaps.de/rr/ui and turned into a railroad diagram (see the sketch just after this list). This snippet will be included in the manual page in a later step.
  • A more formal explanation of its behaviour.
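
For the if-syntax, one plausible EBNF sketch is shown below. Expression and Statements stand for the existing grammar rules, and the production matches the behaviour of the mini-parser written later in this walkthrough.

IfExpression ::= 'if' Expression 'then' Statements
                 ( 'elseif' Expression 'then' Statements )*
                 ( 'else' Statements )? 'endif'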

Syntax Colouring

When a new feature introduces new syntax, the VS Code syntax-colouring extension will need updating. This lives in its own repository. The if-syntax introduces the following keywords: if, then, elseif, else, endif. These need to be added to the syntaxes/nutmeg.tmLanguage.json file in the "keywords" section, shown below.


"keywords": {
	"patterns": [{
		"name": "keyword.control.nutmeg",
		"match": "\\b(def|enddef|if|then|else|elseif|endif|while|for|do|endfor|return|begin|end|let|in|var)\\b"
	}]
},

Create a Feature Branch

All the other work that is required is best done in a 'feature branch'. This is a fairly short-lived branch that will contain the code changes and also the addition to the Nutmeg in a Nutshell programmer's guide. Once the new feature is working in the branch, a pull-request is issued, reviewed and completed.

The feature branch for this work is v0.1/41_if_syntax. Branches are named according to the scheme "v{version}/{issueNumber}_{descriptiveTitle}", which is set out in the definition of ready.
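
For example, assuming you are starting from an up-to-date copy of the default branch, the branch for this work can be created with:

% git checkout -b v0.1/41_if_syntax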

Add IfCodelet in codetree.py

Fortunately this was already done for us. However, in general, we might need to add another type of Codelet.
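
For orientation, here is a rough sketch of the shape such a codelet class takes. The names below are inferred from how IfCodelet is used later in this walkthrough (the constructor keywords, the accessors and the visitor hook); the real class in codetree.py already exists and will differ in detail.

class IfCodelet( Codelet ):
    # "if" matches the "kind" used in the JSON form shown later; the constant name is an assumption.
    KIND = "if"

    def __init__( self, testPart=None, thenPart=None, elsePart=None, **kwargs ):
        super().__init__( **kwargs )
        self._testPart = testPart
        self._thenPart = thenPart
        self._elsePart = elsePart

    def testPart( self ):
        return self._testPart

    def thenPart( self ):
        return self._thenPart

    def elsePart( self ):
        return self._elsePart

    def visit( self, visitor, *args ):
        # Double-dispatch onto the visitor, as in the resolver and optimizer sections below.
        return visitor.visitIfCodelet( self, *args )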

Add the keywords to tokenizer.py

The first change we need to make is to add the keywords (if, then, elseif, else, endif) to the tokenizer. The parser is driven by the token attributes, so we need to set the attributes correctly. The attributes are:

  • isPrefix: Can this token appear at the start of an expression?
  • isPostfix: Can this token appear after an expression?
  • precedence: Only meaningful for postfix tokens; 1000 for loose bindings, 0 for the tightest binding.
  • category: A key that is used to pick the right mini-parser & is typically the name of the named capture group.

Tokens get their attributes from a TokenType object, whose main role is to hold these attributes, although it also provides a token-factory. The tokenizer is extended by adding TokenType objects to the token_spec table. Each entry in the table needs at least a regular expression with a named capture group and a function for constructing the token.

The if keyword will be a prefix syntax-word that is responsible for starting the "mini-parser" that reads if-expressions. The other keywords are simply punctuation (not prefix & not postfix). All these entries should go ahead of the end token; order matters because we want the regex engine to match endif before end.
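
To see why the ordering matters, here is a small illustration (not actual tokenizer.py code) of how a combined regex picks the first alternative that matches, so endif must come before end:

import re

# Hypothetical illustration only: the per-token regexes joined into one alternation.
right_order = re.compile( r"(?P<END_IF>endif)|(?P<END>end)" )
wrong_order = re.compile( r"(?P<END>end)|(?P<END_IF>endif)" )

print( right_order.match( "endif" ).lastgroup )   # END_IF
print( wrong_order.match( "endif" ).lastgroup )   # END - only 'end' gets matched!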

We already have convenience functions available for common cases such as being a syntax-word or a punctuation-token. The make keyword argument specifies the token factory.

        TokenType( r"(?P<IF>if)", prefix=True, make=SyntaxToken.make ),
        TokenType( r"(?P<THEN>then)", make=PunctuationToken.make ),
        TokenType( r"(?P<ELSE_IF>elseif)", make=PunctuationToken.make ),
        TokenType( r"(?P<ELSE>else)", make=PunctuationToken.make ),
        TokenType( r"(?P<END_IF>endif)", make=PunctuationToken.make ),
        TokenType( r"(?P<END>end)", make=PunctuationToken.make ),                           # MUST come after all other end... token types.

When the tokenizer runs, it will use the supplied make function to build the Token object. These are called with the token-type object and the string that got matched and must return the new token.


make( TOKEN_TYPE, MATCHED_TEXT ) -> Token
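
As a rough illustration, such a factory might be no more than a classmethod that forwards to the constructor. This is a hypothetical sketch; the real SyntaxToken.make and PunctuationToken.make in tokenizer.py may be organised differently.

class PunctuationToken( Token ):

    @classmethod
    def make( cls, token_type, matched_text ):
        # Build a token of this class; its attributes come from the token type.
        return cls( token_type, matched_text )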

The correct tokenisation of these keywords needs to be checked by writing a suitable unit test in test/test_tokenizer.py. In this case we verify the tokenisation of all the keywords in one test.

def __isPunctuation( token, category ):
    return not token.isPrefixer() and not token.isPostfixer() and isinstance( token, PunctuationToken ) and token.category() == category

def test_if_syntax():
    ts = tokenize( "if then elseif else endif" )
    assert 5 == len( ts )
    assert ts[0].isPrefixer() and isinstance( ts[0], SyntaxToken ) and ts[0].category() == "IF"
    assert __isPunctuation( ts[1], "THEN" )
    assert __isPunctuation( ts[2], "ELSE_IF" )
    assert __isPunctuation( ts[3], "ELSE" )
    assert __isPunctuation( ts[4], "END_IF" )

Add the mini-parser for if-syntax to parser.py

The next step is to implement the mini-parser, which is invoked when the parser encounters the if token. This is a prefix mini-parser, which is a function with the signature mini_parser( parser: Parser, token: Token, source: PeekablePushable<Token> ) -> Codelet. It will need to build a codetree using the IfCodelet class.

Our first step is to insert the soon-to-be-written mini-parser into the table for prefix mini-parsers. This looks like:


PREFIX_TABLE = {
    "LPAREN": lparenPrefixMiniParser,
    "DEC_FUNCTION_1": defPrefixMiniParser,
    "IF": ifPrefixMiniParser,
    BasicToken: lambda parser, token, source: codetree.StringCodelet( value=token.value() ),
    IdToken: lambda parser, token, source: codetree.IdCodelet( name=token.value(), reftype="get" ),
    IntToken: lambda parser, token, source: codetree.IntCodelet( value=token.value() ),
    StringToken: lambda parser, token, source: codetree.StringCodelet( value=token.literalValue() ),
}

Note that this table is keyed off the token category. Here this is the string "IF", which was defined in the regex r"(?P<IF>if)". When the parser encounters an if-token, it will find the category "IF" and run the function ifPrefixMiniParser.
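
As a simplified picture of the dispatch (not the actual parser.py code, which will differ in detail, and where source.pop() stands in for whatever the real PeekablePushable operation is called), the prefix lookup can be thought of as trying the category first and then falling back on the token's class, which is why BasicToken, IdToken and friends also appear as keys above:

def readPrefixExpr( parser, source ):
    token = source.pop()
    # Prefer the category key (e.g. "IF"), otherwise fall back to the token's class.
    mini_parser = PREFIX_TABLE.get( token.category() ) or PREFIX_TABLE.get( type( token ) )
    if mini_parser is None:
        raise Exception( f"Unexpected token: {token}" )
    return mini_parser( parser, token, source )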

The mini-parser itself needs a bit of care but is largely self-explanatory:

def ifPrefixMiniParser( parser, token, source ):
    # Read an expression up to the next punctuation; math.inf = loosest precedence.
    testPart = parser.readExpr( math.inf, source )
    # Check we have the keyword `then`.
    mustRead( source, "THEN" )
    # Read the statements up to the next punctuation.
    thenPart = parser.readStatements( source )
    # It can continue with endif, elseif or else.
    if tryRead( source, "END_IF" ):
        return codetree.IfCodelet( testPart=testPart, thenPart=thenPart, elsePart=codetree.SeqCodelet() )
    elif tryRead( source, "ELSE_IF" ):
        # Invoke the mini-parser recursively, which will consume up to the endif.
        elsePart = ifPrefixMiniParser( parser, None, source )
        return codetree.IfCodelet( testPart=testPart, thenPart=thenPart, elsePart=elsePart )
    else:
        mustRead( source, "ELSE" )
        elsePart = parser.readStatements( source )
        mustRead( source, "END_IF" )
        return codetree.IfCodelet( testPart=testPart, thenPart=thenPart, elsePart=elsePart )
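
The helpers tryRead and mustRead consume an expected punctuation token from the source. Roughly speaking they behave like the following sketch; the real versions in parser.py will differ, for instance in how errors are reported, and peekOrElse/pop stand in for the actual PeekablePushable operations.

def tryRead( source, category ):
    # If the next token has the given category, consume it and report success.
    token = source.peekOrElse( None )
    if token is not None and token.category() == category:
        source.pop()
        return True
    return False

def mustRead( source, category ):
    # As tryRead, but the expected token has to be there.
    if not tryRead( source, category ):
        raise Exception( f"Expected '{category}' token" )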

We also need to write some unit tests to go with this. These are a bit more verbose, so I only include the shortest of the tests.

def test_if2_ifPrefixMiniParser():
    if2 = parseOne( "if x then y endif" )
    assert isinstance( if2, codetree.IfCodelet )
    assert isinstance( if2.testPart(), codetree.IdCodelet )
    assert isinstance( if2.thenPart(), codetree.IdCodelet )
    assert isinstance( if2.elsePart(), codetree.SeqCodelet )
    assert len(if2.elsePart().body()) == 0

Extend the resolver to correctly handle the lexical-scopes of an if-then-else in resolver.py

A key detail of the if-syntax is that a new lexical scope is introduced for each branch (the then-part and the else-part). Lexical scopes are handled by the Resolver in resolver.py. The Resolver scans the entire tree and annotates all variables ("kind": "id"). We use the visitor pattern to ensure that the correct action is taken for each different type of codelet.

Here's the appropriate code for the IfCodelet. Note that LexicalScopes are linked to their enclosing scopes via the previous keyword parameter.

    def visitIfCodelet( self, if_codelet, scopes ):
        """
        Fix up if-expression e.g. if t then x else y endif
        """
        if_codelet.testPart().visit( self, scopes )
        if_codelet.thenPart().visit( self, LexicalScope( previous = scopes ) )
        if_codelet.elsePart().visit( self, LexicalScope( previous = scopes ) )
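
To make the previous chaining concrete, a lexical scope can be pictured like this hypothetical sketch (the real LexicalScope in resolver.py will differ): a lookup searches the innermost scope first and then walks the chain of enclosing scopes.

class LexicalScope:

    def __init__( self, previous=None ):
        self._previous = previous        # The enclosing scope, or None at the outermost level.
        self._declarations = {}          # Maps a name to its label for variables declared in this scope.

    def declare( self, name, label ):
        self._declarations[ name ] = label

    def lookup( self, name ):
        # Search this scope first, then the chain of enclosing scopes.
        if name in self._declarations:
            return self._declarations[ name ]
        if self._previous is not None:
            return self._previous.lookup( name )
        return None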

To test this, we can use a test such as the following, where the three bindings of x should each get a distinct label.

def f(): x := 0; if t then x := 1; x else x := 2; x endif enddef

The catch is that the := operator has not been implemented yet! This is a typical problem when developing a programming language, where so many elements need to be developed in parallel. The answer we choose here is to write the test but to defer it using the decorator @pytest.mark.skip(reason="no way of currently testing this").

This test is complicated by the fact that we don't, as yet, have a convenient way of navigating to the variables in the tree produced by the parse. We hand-code the path navigation and, because this is so verbose and easy to make mistakes in, pepper it with assertions.

@pytest.mark.skip(reason="no way of currently testing this")
def test_if3_visitIfCodelet():
    # Arrange
    tree = parseOne( "def f(): x := 0; if t then x := 1; x else x := 2; x endif enddef" )
    # Act
    resolveCodeTree( tree )
    # Re-arrange for Assert
    _function = tree.rhs()
    assert isinstance( _function, codetree.LambaCodelet )
    _seq = _function.body()
    assert isinstance( _seq, codetree.SeqCodelet )
    _binding0 = _seq[0]
    assert isinstance( _binding0, codetree.BindingCodelet )
    _x0 = _binding0.lhs()
    assert isinstance( _x0, codetree.IdCodelet )
    _if3 = _seq[1]
    assert isinstance( _if3, codetree.IfCodelet )
    _seqThen = _if3.thenPart()
    assert isinstance( _seqThen, codetree.SeqCodelet )
    _x1_decl = _seqThen[0].lhs()
    assert isinstance( _x1_decl, codetree.IdCodelet )
    _x1_ref = _seqThen[1]
    assert isinstance( _x1_ref, codetree.IdCodelet )
    _seqElse = _if3.elsePart()
    assert isinstance( _seqElse, codetree.SeqCodelet )
    _x2_decl = _seqElse[0].lhs()
    assert isinstance( _x2_decl, codetree.IdCodelet )
    _x2_ref = _seqElse[1]
    assert isinstance( _x2_ref, codetree.IdCodelet )
    # Assert
    assert _x1_decl.label() == _x1_ref.label()
    assert _x2_decl.label() == _x2_ref.label()
    assert _x1_decl.label() != _x2_decl.label()
    assert _x0.label() != _x1_decl.label()
    assert _x0.label() != _x2_decl.label()

Add any optimization rules needed in optimize.py

Actually there are a lot of potential optimizations associated with if-syntax. In this walkthrough we only add the simplest, which is to collapse the IfCodelet when the test is a constant. This is one of a group of straightforward simplifications, so we add a simplifying pass. This is a recursive transformation done by coordinating the transform and visit methods. It starts off like this:

class Simplify( codetree.CodeletVisitor ):
    
    def __call__( self, tree ):
        return tree.visit( self )

    def visitCodelet( self, codelet ):
        return codelet.transform( self )

To this basic shell we need to add a visitIfCodelet specialisation.

    def visitIfCodelet( self, if_codelet ):
        test_part = if_codelet.testPart()
        if isinstance( test_part, codetree.BoolCodelet ):
            if test_part.valueAsBool():
                return if_codelet.thenPart()
            else:
                return if_codelet.elsePart()
        else:
            return self.visitCodelet( if_codelet )

To check this works we should add a few unit tests. In this case our code-trees are so simple that we can hand-code them from JSON. Here's the first such test.

def test_if3_simplification():
    # Arrange
    jcodelet = {
        "kind": "if",
        "test": {
            "kind": "bool",
            "value": "true"
        },
        "then": {
            "kind": "id",
            "name": "x",
            "reftype": "get"
        },
        "else": {
            "kind": "id",
            "name": "y",
            "reftype": "get"
        }
    }
    tree = codetree.codeTreeFromJSONObject( jcodelet )
    # Act
    after = optimizer.Simplify()( tree )
    # Assert
    assert isinstance( after, codetree.IdCodelet )
    assert after.name() == 'x'

We also added a unit test for the false case and a negative test in which no transformation is applied.
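
For example, the false case is the mirror image of the test above; sketched roughly, it looks like this (the actual test may be organised a little differently):

def test_if4_simplification_false():
    # Arrange: the same shape as before, but the test is the constant false.
    jcodelet = {
        "kind": "if",
        "test": { "kind": "bool", "value": "false" },
        "then": { "kind": "id", "name": "x", "reftype": "get" },
        "else": { "kind": "id", "name": "y", "reftype": "get" }
    }
    tree = codetree.codeTreeFromJSONObject( jcodelet )
    # Act
    after = optimizer.Simplify()( tree )
    # Assert: the whole if collapses to its else-part.
    assert isinstance( after, codetree.IdCodelet )
    assert after.name() == 'y'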

Add the IfCodelet to the C# runner

Thanks for sticking with me this far. As it happens, this had already been done for us, so this is the end of the walkthrough. But if you would like to hear how the IfCodelet was added to the runner, do bug me (Steve Leach) and I will write this up as well. But for now, enjoy the break!

Final Touches

All the work is now done. Run the unit tests and make sure that everything compiles and runs. In the compiler subfolder run:

% pytest tests/

And then, having reviewed all your changes in the feature branch locally, push your branch to GitHub and create a pull-request. I always create pull-requests in a web browser. If you get comments, respond to them all, push the changes up and your work is done!
