Macros extended course. Part 1 - vilinski/nemerle GitHub Wiki

Table of Contents

Introduction

In this article I begin an extended description of the macro creation process and debugging. Regrettably, manpower and magazine space are limited, so this issue contains only the first part of the article. We will not get to actual examples of macros here. However, I will try to explain the compiler's inner workings, which, in my opinion, is necessary to fully understand how macros work and, consequently, how to develop them more effectively.

Examples of macros can be found in previously published articles on our site, on the site of the language, and in the sources of the compiler itself.

In the second part of the article, I will try to describe some aspects of specific macro type creation and give examples demonstrating the creation process.

As for now... the theory :).

What is a macro and why do we need it?

A Nemerle macro is a compiler plugin that enables some kind of code generation. This definition is too vague, but I will try to clarify below.

Warning

Before starting an extended story about macros, I want to give a small warning, to save you from rash decisions and from the injuries your colleagues might inflict on you for your excessive enthusiasm.

Macros are an extremely powerful tool that can significantly simplify solving complex problems, but can also make your life (and, more importantly, your colleagues' lives) unbearable.

As Igor Tkachev (IT) astutely noticed, macros are like the atom. The atom can be used to make weapons of mass destruction, but it can also be used for creation of cheap electricity. Likewise, macros can be a mega-tool simplifying your life, but can also blow your brains out by shooting you in the foot :).

Macros should only be used when the alternatives provide no satisfactory solution. If you are not sure whether a problem should be solved with a library or with a new macro, better choose the former route (at least until you get strong arguments for macro use).

Here are some possible arguments in favor of using macros:

  • Necessity to generate code according to a certain template. Such an example could come from a design (or implementation) pattern that requires fairly voluminous manual work for implementation and/or information about types in the project or external assemblies.
  • Opportunity to solve a problem by creating a simple DSL (Domain-Specific Language). For instance, you might describe all your data in a certain language and then use this description to generate numerous complicated types implementing the necessary details. Using a DSL can become a mighty weapon in the fight against complexity, bringing the system's implementation closer to its specification.
  • If you feel that you can automate your work if you get access to the internal structures and compiler's API.
  • If you truly miss some construct that would make your code significantly safer or more concise. For instance, if you work with multithreaded or pseudo-parallel applications, it might be worthwhile to develop a strategy for non-blocking execution and introduce special language constructs that integrate this strategy into the language more transparently.
In other words, you should think twice before rushing headlong, if you want to keep your head :).

Kinds of macros

Macros in Nemerle can be classified into the following types:

  • Expression macros. These could look in code like ordinary functions or introduce new syntax.
  • Operator macros.
  • Attribute macros.
  • Lexical macros.
Expression macros, or plain macros, are macros that either look like ordinary functions, or can be used in expressions and introduce specialized syntax. Plain macro-functions are convenient when some task cannot be solved with a function or a class but can be solved with a macro, and you are satisfied with calling the macro like a function. Unlike a plain function, a macro can use the compiler's API, analyze and transform the code passed to it as arguments, and analyze the code of types declared in the project sources. Some examples of plain macros are the macros for formatted input and output (print, printf, sprint, etc.) and the lazy macro, which adds the ability to declaratively mark code regions and data structures as supporting deferred computation. Examples of macros introducing syntax are the majority of operators in the Nemerle language itself (the && operator, for instance). Such constructs as return, break, continue, if/else, while, do/while, for, foreach, using, lock, and many other operators that look built-in are macros. At the same time, these macros are much more powerful than their built-in analogues in other programming languages. For instance, the foreach macro in Nemerle not only performs the same optimizations as its C# analogue, but also supports pattern matching, which makes working with enumerations much more flexible.
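For illustration, here is a usage sketch showing how a couple of these macros look at the call site (this assumes the standard Nemerle.IO macros; we are not defining any macros yet, only calling them):

```nemerle
using Nemerle.IO;

def name = "world";
// print looks like an ordinary function call, but it is a macro that
// expands the interpolated string into formatted-output code at compile time.
print("hello, $name\n");

// foreach with pattern matching: each tuple is decomposed right in the header.
def points = [(1, 2), (3, 4)];
foreach ((x, y) in points)
  print("x=$x y=$y\n");
```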

Operator macros allow one to introduce new operators into the language. Nemerle supports extension of types with operators the same way as C#. Additionally, such operators would be accessible from C# (if the type in which they are defined is public). However, macro operators are more flexible in some cases.

First, operator macros can work with all data types (even those defined in other assemblies unknown at the time the macro is compiled; that is, not yet existing). This can be both an advantage and a liability. It is an advantage because such an operator immediately covers a larger class of types and can be used even when the operands have different types. It is a liability because you cannot simultaneously use two identical operators (declared in different namespaces).

Second, you can introduce genuinely new operators. The only restriction is that operators must consist of non-alphanumeric symbols (not letters, digits, or underscores). Even C++, for all its flexibility and extensibility, has nothing like this. However, despite the many advantages, there is another side to it. Nemerle operators are defined as sequences of such symbols, and if you mistakenly concatenate two operators, the lexical analyzer will interpret them as a single operator (most likely an undefined one). For instance, you cannot use double negation, "!!someValue", since it will be interpreted as the operator "!!". Of course, such tricks have no use in Nemerle (unlike in C++, where this is a common pattern for converting an integer value into a boolean). Prefix operators are fairly rare in Nemerle, while binary operators inherently have no such problem, so I have never encountered this issue in practice. Also, the RSDN code formatting guidelines require the programmer to set operators apart with spaces. In any case, the ability to define new operators is truly invaluable for those who want to create DSLs!
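The double-negation pitfall can be sketched as follows (assuming, as above, that no "!!" operator is defined anywhere in scope):

```nemerle
def flag = false;
// def x = !!flag;  // error: the lexer reads a single (undefined) operator "!!"
def x = ! !flag;    // fine: the space splits it into two prefix "!" operators
def y = !(!flag);   // equivalent and arguably clearer
```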

Meta-attributes are visually identical to custom attributes in C#, but, unlike them, they invoke execution of the macro with the corresponding name and do not generate custom attributes in the assembly metadata. Meta-attributes are perfectly suited for implementing programming patterns (many of which are already implemented in the standard library), various frameworks, and miscellaneous automation (such as attributes adding efficient serialization code to classes). Like regular macros, meta-attributes can analyze and modify the code of types declared in the project being compiled.

A lexical macro introduces a certain keyword followed by a series of tokens delimited by any of the following kinds of brackets: {}, (), [], <[ ]>. Any combination of Nemerle tokens can be used inside the brackets. The only restriction is that all brackets must be paired (the lexical analyzer first divides tokens into groups, so unpaired brackets lead to errors). Lexical macros are perfectly suited to the creation of complex embedded DSLs and language extensions (for instance, they are used to add "Design by Contract" support to Nemerle).

Development and debugging

Since macros are, in essence, compiler plugins, they can be debugged like regular applications.

Macros can be developed and debugged in one of two ways:

The first way was described in my article "Macros in Nemerle" (accessible only in Russian). It consists of using the Scintilla editor and a small batch file automating the compilation of the assembly containing the macro and of the assembly in which it is used. With this approach, the standard means of debugging a macro are logging the code produced by the macro to the console of the Scintilla editor, and placing System.Diagnostics.Trace.Assert(false) or System.Diagnostics.Debugger.Break() calls in the macro body in order to attach a debugger (Scintilla itself does not support .NET application debugging).

The second way is to use our integration of Nemerle into Visual Studio. This way is the more promising one, but at the time of writing the integration project is still at the alpha stage and contains errors. However, these errors will be fixed in the near future, and using the integration makes development much more comfortable.

To create and debug a macro in VS 2005, one needs to:

  1. Create project "Nemerle\Macro Library" (Ctrl+Shift+N in an empty solution, choose the Nemerle folder, then project type "Macro Library").
  2. Create project of the type "Console Application" in the same solution.
  3. Add to the second project (console application) a reference to the first project (menu "Project\Add Reference...", choose the only available value in the Project tab).
After this, compiling the macro project will make the console project use the new version of the macros.

This makes it possible to study code generated by the macro by simply holding the mouse cursor over its occurrences in code and studying the tooltip.

To enable debugging, one should do the following:

  1. Make the macro project to be debugged the active one.
  2. In the project properties, on the "Debug" tab, in the "Start Program" field, specify the path to ncc.exe. It is usually located in the folder "%ProgramFiles%\Nemerle", so this field can be written as: "$(ProgramFiles)\Nemerle\NCC.exe".
  3. In field "Working Directory", one should set the path to the console project. For instance: "$(MSBuildProjectDirectory)\..\MacroTestConsoleApplication".
  4. In the "Command Arguments" field, specify the names of the files to compile (or, more exactly, relative paths to them), then the "-m:" switch (or "-r:", if the assembly also contains ordinary types), then the path to the macro assembly to be debugged. In the end, you should get something like: "Main.n -m:$(MSBuildProjectDirectory)\bin\Debug\MacroLibrary2.dll".
NOTE
VS 2005 project files are actually MSBuild files. Therefore, they can use constructs $(...) to access project properties and environment variables. Property "MSBuildProjectDirectory" expands into the path to the current project, while "ProgramFiles" expands into the path to the "Program Files" folder. More on this can be found in the article "MSBuild" (published in RSDN Magazine #6 2004).
WARNING
At this time, the dialog for editing "Nemerle Integration in VS" expands $(...) constructs when project settings are opened. Hence, all paths written with the use of $(...) constructs are expanded into full paths. We will try to remove this problem in the near future, but right now it is possible to edit the macro project file directly in a text editor. Tags responsible for debugging in the project might look like this:

<PropertyGroup Condition=" '$(Configuration)' == 'Debug' ">
  <StartProgram>$(ProgramFiles)\Nemerle\NCC.exe</StartProgram>
  <WorkingDirectory>$(MSBuildProjectDirectory)\..\MacroTestConsoleApplication</WorkingDirectory>
  <CmdArgs>Main.n -m:$(MSBuildProjectDirectory)\bin\Debug\MyMacroLibrary.dll</CmdArgs>
</PropertyGroup>

If you set everything up correctly, you can set a breakpoint in the body of a macro, press F5, and then debug the macro step by step.

Most importantly, keep in mind that macro code is regular library code and can be debugged like any DLL.

Another problem during macro debugging may be that a macro might produce syntactically correct, but semantically incorrect code for the language. As a result, the project, in which the macro is used, won't compile. Compiler developers are working on reducing the negative impact of such situations, but the problem cannot be removed entirely. To debug cases like this, we can recommend inspecting the generated code in the IDE. To do this, move the mouse to the location in which the macro is used, then examine the tooltip. Besides this, you can use a macro converting code into text, or simply call ToString() on the expressions returned from the body of a macro or from added types.

Of course, unit tests can be very useful for this. With their help, one can emulate complicated boundary conditions for use. Unit tests are also an excellent way of debugging macro code (since macros must be called from somewhere).

It is also possible that the code produced by a macro is syntactically and semantically correct, but contains logical errors. The program using the macro might compile successfully, but execute with errors. Here, again, conversion of produced code into text (with subsequent analysis) can be used, but also other methods. For instance, you can use Reflector to decompile produced code, or add code produced by macros into debug logging.

Besides this, the compiler can be forced to produce debugging information for generated code. At this time, debugging information can only be produced for the code of types and their members added with the DefineWithSource method of the TypeBuilder class. I think that this idea will be developed further, and the ability to generate debugging information will become available for expression-level macros as well.

This is currently a new function and I have not had the opportunity to test it. Therefore, I cannot comment on its quality. If you encounter any problems, write error reports ;).

And, of course, when you create macros, it helps to manually write a prototype of code that will be generated by the macro. This will make it possible to develop the logic of this code, and then simply transfer it into the macro, making it general.

I hope that with time macro debugging will become not much more complicated than debugging regular code. At this time this is not so, but the capabilities given by macros are worth spending some time debugging them. Most importantly, a debugged macro can easily be used even by beginners.

Compilation stages

Compilation is a multistage process. Macros can execute at different stages. Depending on which stage a macro executes in, the data available to it changes, and its abilities for modification of the program being compiled change.

To write a macro, it is not necessary to know all the internals of the compiler, but an understanding of the compilation process is certainly useful: it will allow you to make more informed decisions.

Compilation process

In listing 1, I give source code for the function ManagerClass.Run(), which calls other functions, responsible for the different compilation stages:

https://github.com/rsdn/nemerle/tree/master/ncc/passes.n

Listing 1. ManagerClass.Run() method – compiler's main method.

/**
 * Run passes of the compiler.
 */
public Run() : void
{
  Instance = this;
  try
  {
    InitCompiler();

    ProgressBar(1);

    LoadExternalLibraries();

    ProgressBar(2);

    Hierarchy = TypesManager(this);

    def trees = List.RevMap(Options.Sources, fun(x) { ParsingPipeline(LexingPipeline(x)) });

    Message.MaybeBailout();    // we have detected multiple files already

    ProgressBar(5);

    // create N.C.TypeBuilders for all parsed types and add
    // them to namespace hierarchy
    foreach (group in trees)
      List.Iter(group, ScanningPipeline);

    when (Options.DoNotLoadStdlib)
      InternalType.InitNemerleTypes();

    ProgressBar(8);

    Hierarchy.Run();
    Message.MaybeBailout();

    Hierarchy.CreateAssembly();

    ProgressBar(10);

    Hierarchy.EmitAuxDecls();
    Message.MaybeBailout();

    NameTree.CheckReferencedNamespaces();

    Hierarchy.EmitDecls();
    Message.MaybeBailout();

    NameTree.FinishMacroContextClass();

    Hierarchy.CheckForUnusedGlobalSymbols();
    Hierarchy.CheckFinalization();

    when (Options.XmlDocOutputFileName != null)
    {
      def docs = XmlDoc(DocComments, Options.XmlDocOutputFileName);
      Hierarchy.SourceTopIter(docs.DumpType);
      docs.Save();
    }

    unless (Options.CompileToMemory)
      Hierarchy.SaveAssembly();

    Message.MaybeBailout();

    KillProgressBar();
    Stats.Run(this);
  }
  finally
  {
    CleanUp();
    when (Options.PersistentLibraries)
      Hierarchy.RemoveProgramTypes();
  }
}

First, internal data structures are initialized (InitCompiler). At this step, in particular, the so-called namespace tree is created. It provides fast access to descriptions of namespaces, types, and macros. Types from the current project are placed into this tree, as well as macros from assemblies referenced by the project. The namespace tree is described by the Node class, which is nested in the NamespaceTree class, which, in turn, is located in the Nemerle.Compiler namespace. The implementation of these classes can be found in the file https://github.com/rsdn/nemerle/tree/master/ncc/hierarchy/NamespaceTree.n.

When creating complex macros that need information about the project's types, you will have to use this class, so it deserves a more detailed description. To avoid going off on a tangent, I will describe it a bit later (see the section "NamespaceTree").

At the next step, types from external assemblies are loaded (LoadExternalLibraries). They are loaded in a "lazy" manner; you will find more about this in the NamespaceTree section. For now, it is important to mention that it is at this step that macros from external assemblies are loaded. In essence, macros are regular .NET types marked with a special attribute and adhering to certain contracts (since macros are declared with a special syntax, there is no need to go into the details of these types). A macro's description is placed in a namespace and becomes accessible if the programmer "opens" this namespace or uses the fully qualified path to the macro.

After this, the TypesManager is created and initialized. This class contains and manipulates the list of TypeBuilders. A TypeBuilder describes a type, such as a class, a structure, a variant, or an enumeration. A TypeBuilder is created for every type obtained by parsing the sources or created by macros or by the compiler itself. In other words, these are the types that will be placed into the resulting assembly.

After this step, the compiler is ready for the main work. In fact, any compiler performs the above steps, so they are not particularly interesting. It is here that the marked differences from the "average" compiler for an "average" programming language begin.

In order to support macros and syntax extensions (as well as to support a Python-like syntax without brackets), the Nemerle compiler adds the unusual compilation stage PreParse and a fairly unusual code representation.

So, let's go in order. First, lexing is performed. A reference to the lexical analyzer is handed to the parser, which actually uses it. In the file https://github.com/rsdn/nemerle/tree/master/ncc/parsing/Lexer.n you can see the code of the lexical analyzer: the LexerBase class from the Nemerle.Compiler namespace and its descendants LexerFile, which lexes files; LexerString, capable of lexing code from a string; and LexerCompletion, which is used for syntax highlighting in the IDE.

The lexical analyzer is written by hand (without lexer or parser generators). This is motivated by the following:

  • (Most importantly) the list of keywords recognized by the lexer can be extended (and restricted) dynamically.
  • It produces tokens as Nemerle variant data structures.
  • It is much simpler to debug than a generated one, and a lexer is a fairly simple module anyway, so the effort was small compared to the benefits.
  • At the time it was written, there were no good lexer generators for .NET. Antlr and CocoR did not allow extending the grammar and syntax dynamically; besides, their code generators would have had to be retargeted to Nemerle.
  • It was an easy part of the compiler that could be used for testing while the compiler was bootstrapping (compiling itself).
And so, the purpose of the lexical analyzer is to "read" a file (or a string) and return a stream (list) of tokens describing the lexemes of the processed file. Nemerle has a small list of predefined keywords and operators. Here it is:
def tab = array[
    "_", "abstract", "and", "array", "as", "base", "catch",
    "class", "def", "delegate", "enum", "event",
    "false", "finally", "fun", "implements",
    "interface", "internal", "is", "macro", "match", "matches",
    "module", "mutable", "namespace", "new", "null", "out",
    "override", "params", "private", "protected", "public",
    "ref", "sealed", "static", "struct", "syntax", "this",
    "throw", "true", "try", "type", "typeof", "using",
    "variant", "virtual", "void", "volatile", "when", "where",
    "partial", "extern", "with"
];

mutable kes = Set();

foreach (el in tab)
  kes = kes.Add (el);

BaseKeywords = kes;

def tab = array ['=', '<', '>', '@', '^', '&', '-', '+', '|', '*',
                 '/', '$', '%', '!', '?', '~', '.', ':', '#'];

opchars = array (256);

foreach (x in tab)
  opchars[x :> int] = true;
...

Besides that, the brackets are predefined. Nemerle supports the following kinds of brackets: (...), {...}, [...], and <[...]>.

The last kind of brackets, <[ and ]>, is used for quoting. The use of the others is almost analogous to C#, except that generic type parameters are delimited by square brackets [ and ] instead of angle brackets < and >.

It is no accident that I mentioned brackets. Brackets in Nemerle are processed in a special way, and this, too, has to do with syntax extensions. You will understand why a little later.

Lexemes (tokens) in Nemerle are represented by the variant Token. Here is its description:

namespace Nemerle.Compiler
{
  public variant Token : System.Collections.IEnumerable
  {
    | Identifier { name : string; } // ie "main"
    // Identifier prefixed by @.
    | QuotedIdentifier { name : string; }
    | IdentifierToComplete { prefix : string; } // Used by IntelliSense.
    | Keyword { name : string; }
    | Operator { name : string; }

    // Various literals
    | StringLiteral { value : string; }
    | CharLiteral { value : char; }
    | IntegerLiteral { lit : Literal.Integer; cast_to : Parsetree.PExpr }
    | FloatLiteral { value : float; }
    | DoubleLiteral { value : Double; }
    | DecimalLiteral { value : Decimal; }

    | Comment { value : string; }

    | Semicolon { generated : bool; }    // ;
    | Comma                              // ,
    | BeginBrace { generated : bool; }   // {
    | EndBrace { generated : bool; }     // }
    | BeginRound                         // (
    | EndRound                           // )
    | BeginSquare                        // [
    | EndSquare                          // ]
    | BeginQuote                         // <[
    | EndQuote                           // ]>

    // The following three tokens are only used in IDE lexers
    | Indent     { value : string; }
    | WhiteSpace { value : string; }
    | NewLine    { value : string; }

    // The following group of tokens appears only after the PreParse step.
    | RoundGroup { Child : Token; }             // ( ... )
    | BracesGroup { Child : Token; }            // { ... }
    | SquareGroup { mutable Child : Token; }    // [ ... ]
    | QuoteGroup { Child : Token; }             // <[ ... ]>
    | LooseGroup { mutable Child : Token; }     // ; ... ;

    // These tokens also appear only after the PreParse step.
    // Moreover, their Env fields receive a special object describing
    // the list of open namespaces, keywords, and operators connected
    // with them. This enables type linkage for the namespaces used
    // further on.
    | Namespace { Env : GlobalEnv; Body : Token; }
    | Using { Env : GlobalEnv; }

    | EndOfFile
    | EndOfGroup

    // Determines the location of a token in a file.
    public mutable Location : Nemerle.Compiler.Location;

    // Reference to the following token.
    public mutable Next : Token;

    public this () { }
    public this (loc : Location) { this.Location = loc; }
    public override ToString () : string { ... }
  }
...

Describing tokens with a variant data type makes it possible to simplify their subsequent analysis by using pattern matching.
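For instance, a simple token classifier could be written by matching on the variant. This is a hypothetical helper (the describe function is my own name, not part of the compiler), assuming the Token variant shown above:

```nemerle
using Nemerle.Compiler;

// Render a token for diagnostics by matching its variant option.
def describe(tok : Token) : string
{
  match (tok)
  {
    | Token.Identifier(name) => $"identifier '$name'"
    | Token.Keyword(name)    => $"keyword '$name'"
    | Token.Operator(name)   => $"operator '$name'"
    | Token.BracesGroup      => "a { ... } group"
    | _                      => tok.ToString()
  }
}
```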

PreParse step

So, the differences from a "normal" compiler begin immediately after the lexical analysis step.

You have probably noticed that, compared to C# or, say, C++, the Nemerle compiler recognizes surprisingly few keywords and operators.

Of course, this does not mean that Nemerle is a limited language. It is just that the given list of keywords is sufficient for the base language; the rest of the keywords and operators are defined in the form of macros. The PreParse step is where the additional keywords are recognized.

Since keywords in Nemerle are added by including namespaces, PreParse has to analyze the "using" and "namespace" directives.

But namespaces can follow one another and contain type definitions, whose full parsing would require knowledge of all the types and namespaces in the project. To allow embedding DSLs in code, the authors of Nemerle came up with an original idea: real parsing of the project types is deferred until later stages, while this stage merely matches brackets.

"Matching brackets" means that all brackets in a Nemerle source file must match recursively. If brackets fail to match at the PreParse step, the compiler issues an error message. For instance, code such as:

 { ( } )

is considered erroneous.

The lexical analyzer parses brackets as separate lexemes:

  | BeginBrace { generated : bool; }   // {
  | EndBrace { generated : bool; }     // }
  | BeginRound                         // (
  | EndRound                           // )
  | BeginSquare                        // [
  | EndSquare                          // ]
  | BeginQuote                         // <[
  | EndQuote                           // ]>

One of the goals of the PreParse step is to analyze these tokens and transform them into a hierarchy:

  | RoundGroup { Child : Token; }          // ( ... )
  | BracesGroup { Child : Token; }         // { ... }
  | SquareGroup { mutable Child : Token; } // [ ... ]
  | QuoteGroup { Child : Token; }          // <[ ... ]>
  | LooseGroup { mutable Child : Token; }  // ; ... ;

All the "ordinary" tokens (that is, those other than brackets and the Namespace and Using tokens) are placed inside the bracket-group tokens. In this manner, after the PreParse step the stream of tokens changes from flat to hierarchical, and all bracket mismatches are recognized already at this step. Sequences of tokens that follow one another and are not enclosed in a bracket group are placed into a LooseGroup token.
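As a sketch (my own rough rendering, not actual compiler output), a simple declaration turns into approximately the following hierarchy:

```nemerle
// Source:
//   def x = (1 + 2);
//
// After PreParse, roughly:
//   LooseGroup
//     Keyword "def" -> Identifier "x" -> Operator "=" -> RoundGroup
//                                                          LooseGroup
//                                                            IntegerLiteral 1
//                                                            Operator "+"
//                                                            IntegerLiteral 2
```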

Lexer, PreParse, and keywords

The way in which keywords are recognized is a rather subtle point. The thing is that lexical analysis (token recognition) and the PreParse step are in fact executed simultaneously! PreParse simply requests tokens from the lexer one after another. The lexer then recognizes the next lexeme in the stream of characters and, if it is an identifier, compares it with the current list of keywords contained in the lexer's Keywords variable.

PreParse, in turn, recognizes namespaces and "using" directives and forms the set of keywords that replaces the lexer's keyword list:

...
def parse_using_directive(tok)
{
  finish_current(current_begin);
  def (id, idLocs) = get_qualified_identifier();

  match (get_token())
  {
    | Token.Semicolon as st =>
      def loc = tok.Location + st.Location;
      Env = Env.AddOpenNamespace(id, loc);
      lexer.Keywords = Env.Keywords;
...

In this manner, if the namespaces "opened" by this moment contain a new keyword, it is automatically recognized and placed into a Token.Keyword token.

When "using" directives are processed, syntax extensions are registered (if the namespace being "opened" contains a macro extending the syntax). This enables, at the parsing stage, recognition of the token sequences controlled by macros and their placement into the special AST node MacroCall (more about the AST a little later). At the typing stage, the MacroCall "expands", forming a final expression, which is then typed and later (at the code generation stage) turned into MSIL.

After the PreParse step, all tokens of the analyzed file are turned into a tree consisting of group tokens (RoundGroup, BracesGroup, SquareGroup, QuoteGroup, LooseGroup) and the simple tokens contained within them. Group tokens differ from simple tokens in that, besides the Next field pointing to the next token at the same level of the hierarchy, they have a Child field pointing to the nested tokens.

Figure 1 shows a tree formed for the following source file:

using System;
using System.Console;
using Nemerle.Utility;

class A
{
  public Test(x : int) : string { $"result='$x'" }
}

module Program
{
  Main() : void
  {
    WriteLine(A().Test(123));
  }
}

Figure 1. Token tree after the PreParse step.

As you see, each "using" directive has turned into the token Using, containing a reference to a GlobalEnv (describing a list of "open" namespaces), while tokens for class A and module Program are in their respective LooseGroup tokens.

Tokens for methods and other type members are distributed among groups in the same way.

Figure 2 shows a fragment of the tree describing method Test:

Figure 2. Token tree fragment describing method Test.

A bird's-eye view shows the tokens as a linked list, some elements of which are groups storing sublists. Since, once the tree is built, every language construct ends up in a separate list, the parser's job is very much simplified. Also, in case of errors in the parsed code, the parser can skip the ambiguous group and attempt to recognize the rest. This has a positive impact on the quality of error reporting.

Parsing top-level constructs (i.e., parsing of types)

Once the token tree is formed, parsing time comes. At this stage, syntactical constructs, including those built into the language, as well as those defined by macros, are recognized.

The principle of recognizing syntactic constructs defined by macros is very simple. If there are keywords among the tokens, the compiler checks whether these keywords begin any of the syntax extensions "opened" up to that point. If so, control is handed over to the function responsible for recognizing that syntax extension. The result of such a function is the MacroCall branch of the AST.

The result of parsing is a set of AST branches, so we should examine the AST in some more detail.

AST

We have two pieces of news regarding the AST: one good and one bad. The bad news is that, as a matter of fact, there is no single AST. The AST consists of a set of types, mostly variants.

The good news is that in most cases, when creating a macro, you will not have to work with the AST at the type level. Quoting is used for both generation and decomposition of the AST. This simplifies the problem enormously; much can be described declaratively.
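To give a taste of it (a sketch, assuming the code runs inside a macro body, where quotation is available): both construction and decomposition of expression AST use the same <[ ... ]> syntax, with $ for splicing.

```nemerle
// Building a PExpr without touching the AST types directly:
def expr = <[ 1 + 2 * x ]>;

// Decomposing it with pattern matching and splicing ($):
match (expr)
{
  | <[ $a + $b ]> => Message.Hint($"a sum of $a and $b")
  | _             => ()
}
```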

Still, to fully grasp the significance of this, we should familiarize ourselves with the types comprising the AST.

We can distinguish the AST describing types (defined by the TopDeclaration variant), the AST describing type members (defined by the ClassMember variant), and the AST describing expressions (defined by the PExpr variant).

The TopDeclaration and ClassMember variants inherit from the DeclarationBase class:

[Record]
public class DeclarationBase : Located
{
  public mutable name : Splicable;
  public mutable modifiers : Modifiers;
  public ParsedName : Name { get { name.GetName () } }
  public Name : string { get { name.GetName ().Id } }
  public GetCustomAttributes () : list[PExpr] { modifiers.custom_attrs }

  public Attributes : NemerleAttributes
  {
    get { modifiers.mods }
    set { modifiers.mods = value }
  }

  public AddCustomAttribute (e : PExpr) : void
  {
    modifiers.custom_attrs = e :: modifiers.custom_attrs
  }
}

As you can see, this class allows top-level AST branches (those defining types and their members) to contain a name, modifiers (such as public and static), and custom attributes.

All AST classes directly or indirectly descend from the class Located, which adds support for storing a construct's location in the source file. The placement, described by the Location structure, includes the source file's index and name (members FileIndex : int and File : string), the upper and lower coordinates (members Line, Column, EndLine, and EndColumn, of type int), as well as a flag indicating whether the given element was retrieved from code or generated by the compiler or a macro (member IsGenerated : bool).

[Record (Exclude = [_definedIn])]
public variant TopDeclaration : DeclarationBase
{
  | Class // Describes classes
    {
      mutable t_extends : list[PExpr];
              decls     : list[ClassMember];
    }
  | Alias { ty : PExpr; } // Describes the type construct
  | Interface // Describes interfaces
    {
      mutable t_extends : list[PExpr];
              methods   : list[ClassMember]; // only iface_member
    }
  | Variant // Describes variants
    {
      mutable t_extends : list[PExpr];
      mutable decls     : list[ClassMember];
    }
  | VariantOption { decls : list[ClassMember]; } // Describes a variant option
  | Macro // Describes macros
    {
      header : PFunHeader;
      synt   : list[PExpr];
      expr   : PExpr;
    }
  | Delegate { header : PFunHeader; } // Describes delegates
  | Enum // Describes enumerations
    {
      t_extends : list[PExpr];
      decls     : list[ClassMember];
    }

  public mutable typarms : Typarms; // Generic type parameters

  /// If this declaration is nested in another declaration,
  /// this property contains a reference to it. It contains
  /// null if this is a top-level declaration.
  public DefinedIn : TopDeclaration { get { ... } };

  public this (tp : Typarms) { ... }
  public this () { ... }
  public override ToString() : string { ... }
}

[Record (Exclude = [_env, _tokens, _bodyLocation, _definedIn])]
public variant ClassMember : DeclarationBase
{
  | TypeDeclaration { td : TopDeclaration; }
  | Field { mutable ty : PExpr; }
  | Function
    {
              header      : PFunHeader;
      mutable implemented : list[PExpr];
      mutable body        : FunBody;
    }
  | Property
    {
      ty      : PExpr;
      prop_ty : PExpr;
      dims    : list[PParameter]; // parameters of indexer property
      get : option[ClassMember];
      set : option[ClassMember];
    }
  | Event
    {
      ty     : PExpr;
      field  : ClassMember.Field;
      add    : ClassMember.Function;
      remove : ClassMember.Function;
    }
  | EnumOption { value : option [PExpr]; }

  public SetEnv(env : GlobalEnv) : void { _env = env; }
  [Accessor] internal mutable _env          : GlobalEnv;
  [Accessor] internal mutable _tokens       : Token.BracesGroup;
  [Accessor] internal mutable _bodyLocation : Location;

  /// This property contains a reference to the type (TopDeclaration),
  /// in which this member is defined.
  [Accessor] internal mutable _definedIn    : TopDeclaration;

  /// Only accessible if this is an instance of ClassMember.Function
  /// with an untyped body.
  public Body : PExpr
  {
    get { ... }
    set { ... }
  }

  public IsMutable() : bool { modifiers.mods %&& NemerleAttributes.Mutable }
  internal PrintBody (writer : LocatableTextWriter) : void { ... }
  public override ToString() : string { ... }
}

You can get a ClassMember or TopDeclaration instance by constructing them directly, or by using quoting.

There is one subtlety here, however. By default, quoting assumes that the quotation contains a PExpr. Therefore, you have to tell the compiler when you want the AST of a different kind of construct. The "decl:" prefix tells the compiler that the quotation contains a type declaration rather than an expression. For instance, the following macro demonstrates adding a type, described by means of quoting, to the project:

macro BuildClass ()
{
  def ctx = Nemerle.Macros.ImplicitCTX();
  // Describe the class with quoting. To make the compiler understand
  // that it is a class, we add the "decl:" prefix.
  def astOfClass = <[ decl:
    internal class FooBar
    {
      public static SomeMethod () : void
      {
        System.Console.WriteLine ("Hello world");
      }
    }
  ]>;

  def builder = ctx.Env.Define(astOfClass);

  builder.Compile();

  <[ FooBar.SomeMethod () ]> // Now, this is an expression. It does not need the prefix.
}

A class member can be described the same way:

tb.Define(<[ decl:
  public InternalType : InternalTypeClass
  {
    get { Manager.InternalType }
  }
]>);

Such a prefix is called a "quoting type". Here is a full list of quoting types:

  • decl – top-level declaration. A type or a type member.
  • fundecl – local function declaration.
  • case – match operator.
  • parameter – function parameter description.
  • ttype - "typed" reference to a type. When you use it, the quotation should contain a link to a type. Partial qualification is allowed (with respect to the open namespaces at the point of declaration).
All quoting types, except the last, can be used not only for forming the AST, but also for its decomposition (with pattern matching). ttype cannot be a pattern.
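As a sketch, here is what the less common quoting types look like in use, plus a decomposition via decl (all identifiers are illustrative; treat the exact patterns as assumptions):

```nemerle
// Building AST fragments with the other quoting types:
def fn   = <[ fundecl: sqr (x : int) : int { x * x } ]>; // local function
def arm  = <[ case: | Some (v) => v ]>;                  // one match arm
def parm = <[ parameter: text : string ]>;               // a parameter

// All of the above (unlike ttype) can also appear as patterns,
// e.g. taking a class declaration apart; ..$ matches a list.
match (someDecl)
{
  | <[ decl: class $name { ..$members } ]> =>
    Message.Hint ($"class $(name) has $(members.Length) member(s)")
  | _ => ()
}
```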

Let us return to the parsing process. At the end of this process, a list of recognized type ASTs is created for each parsed file. If a type is declared in the code as partial, several ASTs are created for it (one for each part). The parts of a type can be located in different files (as they typically are).

Of course, it is inconvenient to work with parts of a type. In addition, the code must be typed. Therefore, the next step has the compiler create, for each type declared in the project, a separate TypeBuilder class, in which all the AST parts are combined.

You may laugh, but all of the above describes a single line of the Run function shown above. Here is that line:

def trees = List.RevMap(Options.Sources,
  fun(x) { ParsingPipeline(LexingPipeline(x)) });

TypeBuilder's creation

At the next step, the compiler goes through all TopDeclaration objects (AST types) received from the previous step, creates TypeBuilder objects for each type, then adds them to the namespace tree. This is handled by the following line of the Run function:

foreach (group in trees)
  group.Iter(ScanningPipeline);

All parts of a partial type are placed in a single TypeBuilder object. Their members are combined into a single list, which makes it possible to work with the type as a single entity instead of messing around with individual parts. The TopDeclaration list for a partial type is still available, though, as is the list of Locations of its parts.

Generation of classes for macros declared in the project being compiled and for delegates is performed at this step. Delegates generate special placeholder classes, which are then finalized at execution time by the JIT.

TypeBuilder typing

At this step, TypeBuilder objects created at the previous step are "typed".

Numerous important operations are performed at this step, of which the two most important for us are macro "expansion" and type resolution. This step is handled by the call:

Hierarchy.Run();

It does the following:

  • Creates type environments.
  • Links types.
  • Deduces inheritance relationships between types.
  • Matches types.
  • Adds member descriptions (first types, then untyped).
  • Adds special support methods for pattern matching to variant types.
The typing process is divided into several stages. Macro expansion is performed at different phases of this process. These phases are described by the MacroPhase enumeration:
  • BeforeInheritance - executes before the typing process begins. At this moment the TypeBuilder list already exists, but it contains only information retrieved from parsing. The type hierarchy is not yet built, and all type references are unresolved. This leads to a situation in which, on the one hand, almost any modification of the code is still possible, while on the other, almost no type information is available (only the untyped AST is present). This phase is wonderfully suited for tasks such as adding interfaces to the list of implemented interfaces, adding base classes, and adding new types (to the complete type list of the project).
  • BeforeTypedMembers - executes before type members (methods, properties, and fields) are "linked" with real types. Until the WithTypedMembers stage, type descriptions are practically untyped expressions (in reality, they are more complicated data structures that store, among other things, a context reference (GlobalEnv) holding the list of "open" namespaces, rather than simple strings). Considering that the type hierarchy is completely built at this point, this stage provides much more information. And even though parameter and return value types are not yet "linked" with real types, these links can be deduced using the MonoBindType method of the class TypeBuilder (even though it is, in essence, a hack, the compiler developers use it in practice). This function uses the context (GlobalEnv) and the AST references to the type to deduce a reference to the type's description (remember that by this time type descriptions are already located in the namespace tree). The BeforeTypedMembers phase is attractive for macro-attributes and other type-level macros, since it allows painless addition of new members.
  • WithTypedMembers - executes at the very last stage of typing. As a matter of fact, the only code processed after this stage is code generated by the compiler for internal use, which is of no interest to the programmer. Complete type information is accessible at this stage, which makes it seem like the most convenient stage for executing a macro. One might ask, why bother with the other phases at all? The answer is both simple and sad. The thing is that at this point there is almost no way left to change the type descriptions of the compiled project. Attempts to add new methods, classes, or base types, or to modify anything at all, will meet active resistance from the compiler. Therefore, all the power of the type information accessible in this phase can only be applied to generating and modifying the code of methods (and other type members). This is why the ability to use the MonoBindType method, described in the previous item, is so valuable. The WithTypedMembers stage is thus best suited for macros operating at the type member level.
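As a sketch, a macro-attribute bound to one of these phases might look like this (the attribute name is hypothetical, and the exact TypeBuilder call is an assumption of the sketch):

```nemerle
// Hypothetical macro-attribute running in the BeforeInheritance phase,
// where adding to the implemented-interface list is still allowed.
[Nemerle.MacroUsage (Nemerle.MacroPhase.BeforeInheritance,
                     Nemerle.MacroTargets.Class)]
macro MarkDisposable (tb : TypeBuilder)
{
  // Add System.IDisposable to the class's list of implemented
  // interfaces; treat this exact call as an assumption.
  tb.AddImplementedInterface (<[ System.IDisposable ]>)
}
```

Applied as `[MarkDisposable]` on a class declaration, it would run once for that class during the chosen phase.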

Further compilation stages

The following compilation stages are of little interest for macro developers. This is why I will not go into their detail. If you are curious about what goes on there, you can return to listing 1.

Compilation and typing of method bodies

During regular compilation (that is, when the compiler is run on its own rather than under the IDE's control), method bodies are processed along with the types that contain them. The behavior is different when the compiler works under the IDE's control, in IntelliSense mode.

In IntelliSense mode, function code parsing is deferred until the user requests information about a function's contents (for instance, by hovering for a tooltip or by auto-completing an identifier), or until a background code check is performed. Background checks run (with a small delay) after a source code file is opened in the IDE editor, or some time after a file is edited.

If you want your macro to correctly perform in the IDE, you have to take this feature into account. That is, you should not add classes that would be accessible to IntelliSense from expression-level macros.

It is also possible to add specialized support for IntelliSense-mode editing (although writing mode-agnostic code is simpler). To find out which mode the compiler is running in, query the IsIntelliSenseMode property of the class ManagerClass, a reference to which can be acquired in the following way:

Nemerle.Macros.ImplicitCTX().Manager

Error recovery mode

By default, during typing, the compiler works in fast mode. In this mode it does not analyze errors, rightly assuming that there will not be any. If an error is nevertheless encountered, the compiler generates a RestartInErrorMode exception, which is caught at the top level of the typing process. In this case, the compiler switches into the thorough error-checking mode and repeats the typing process.

Code that does not take this behavior into account can produce cascading errors that confuse the application programmer.

The problem is that macro code can add new types and change existing types' descriptions. In case of an error, such a macro will be executed twice, which can lead to the creation of duplicate descriptions.

To avoid this unpleasant situation, one can either refrain from adding or modifying types from expression-level macros (this is optimal) or, at the very least, check whether the compiler is working in normal mode or in recovery mode.

One can find out if the compiler is in the error recovery mode by querying the InErrorMode property from the same ManagerClass object that was used to access the IsIntelliSenseMode property.
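Putting the two checks together, an expression-level macro that mutates types might guard itself like this (a sketch; the macro name and the elided body are illustrative):

```nemerle
macro AddHelperOnce ()
{
  def manager = Nemerle.Macros.ImplicitCTX ().Manager;
  // Skip type mutation during IntelliSense and during the second,
  // error-recovery typing pass, to avoid duplicate definitions.
  unless (manager.IsIntelliSenseMode || manager.InErrorMode)
  {
    // ...add or modify types here...
  }
  <[ () ]> // the expression the macro expands to
}
```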

WARNING
Note that when running in IntelliSense mode (IsIntelliSenseMode == true), the InErrorMode property is always true, and no second pass is made. This is because one of IntelliSense's responsibilities is error checking, and code in the IDE is unfinished much of the time. Double passes would only have slowed down and complicated the IDE's work.

Namespace Tree

This section is not directly related to macros, but truly powerful macros are unlikely to do without analyzing types available in the project.

Information about all types available in the project (declared in it, as well as imported from external assemblies) is placed in the namespace tree or, as it is also known, the type tree.

A simplified description of this tree is given below.

This tree's branches are described by the Node class, which is nested in the class NamespaceTree:

[ManagerAccess]
public class NamespaceTree
{
  /// Used for "lazy" type loading from external assemblies and
  /// erasing differences between types declared in the assembly
  /// being compiled and exterior types.
  public variant TypeInfoCache
  {
    | No
    | Cached { tycon : TypeInfo; }
    | CachedAmbiguous { elems : list[TypeInfo] }
    | NotLoaded { e : ExternalType; }
    | NotLoadedList { elems : list[ExternalType] }
    | MacroCall { m : IMacro; }
    | NamespaceReference
  }

  // Describes a single branch of the namespace tree (type tree).
  public class Node
  {
    public Parent : Node; // Branch in which the current branch is nested.

    /// Name of the current branch. If, say, this is the System.IO.File
    /// branch, this property will have the value "File".
    [Accessor(PartName)] name : string;

    /// Value of the branch. See TypeInfoCache description above and the
    /// EnsureCached() description.
    public mutable Value : TypeInfoCache;

    /// Child branch list.
    public Children : Hashtable[string, Node] { get; }

    /// Guarantees that type information for this branch is fully loaded,
    /// parsed, and available for analysis by generic methods.
    /// The thing is that information about types from external assemblies
    /// is loaded "lazily". When an assembly is loaded at the beginning
    /// of compilation, only the list of types is loaded, while detailed
    /// information is retrieved only when it is first requested. At this
    /// time, .NET types are placed in TypeInfoCache.NotLoaded or
    /// TypeInfoCache.NotLoadedList. Consequently, when it is necessary
    /// to retrieve information about these types, a loading procedure is
    /// executed, in the process of which the branch type is changed to
    /// TypeInfoCache.Cached or TypeInfoCache.CachedAmbiguous. The method
    /// EnsureCached executes this procedure.
    public EnsureCached() : void;

    /// Returns the list of types declared in this project, types created
    /// in the process of parsing the project's source code, types
    /// created by macros, and types generated by the compiler (for
    /// delegates, for instance).
    public GetTypeBuilders(onlyTopDeclarations : bool) : array[TypeBuilder];

    /// Textual representation of the full name of the branch (the way it
    /// is displayed in compiler messages).
    public GetDisplayName() : string;

    /// Full name of the branch in the form of a list.
    public FullName : list[string] { get; }

    /// Indicates that the current branch is an alias, that is,
    /// an alternative name for another type.
    public IsFromAlias : bool { get { name == null } }

    /// Clear the subbranch list for the current branch.
    /// Better not call this. :)
    public Clear();

    /// Get a reference to the branch by path, in the form of a period-
    /// separated string. Attention! If the path does not exist, it will be
    /// created automatically. To be honest, this should have been called
    /// something like "GetNodeByPathOrCreateIt".
    public Path (n : string) : Node;

    /// Same as the preceding method, but the path is given by a list.
    public Path (n : list [string]) : Node;

    /// The given method is analogous to the preceding, except that it does
    /// not create a path in case the branch does not exist, and returns
    /// the value of a branch, instead of the branch itself. To be honest,
    /// the latter makes it of little interest for practical use.
    public TryPath (n : list [string]) : TypeInfoCache;

    /// Method analogous to Path(), but not creating missing branches.
    public PassTo (name : list [string]) : Node { ... }

    /// Makes it possible to get a branch by a relative path. Several root
    /// branches to start the search from can be specified.
    public static PassTo (nss : list [Node], name : list [string]) : Node;

    /// Methods LookupType make it possible to find a type given a full
    /// path. Like the Path method, they create an empty branch if the
    /// path cannot be found.
    public LookupType (split : list [string], args_count : int)
      : option[TypeInfo];
    public LookupTypes (split : list [string], for_completion = false)
      : list[TypeInfo];
    public LookupValue() : option[TypeInfo];
    public LookupMacro(split : list[string]) : option[IMacro];

    /// A number of unimportant methods and properties have been skipped...
  }

  /// The root branch of the namespace tree (the property has name NamespaceTree).
  [Accessor] internal namespace_tree : Node;

  /// This method makes it possible to do that, which is not officially
  /// permitted... Dynamically load a macro.
  public AddMacro (split : list[string], m : IMacro) : void;
  public static AddMacro (ns : Node, m : IMacro) : void;

  /// A number of unimportant methods and properties have been skipped...
}

The source code comments should explain the various members' purpose. The purpose of the classes themselves is simple... Node objects form a tree. The Children property contains an associative list of sub-branches accessed by their names. A branch can contain a type description, description for several types (if there is ambiguity), be a macro description, or a namespace branch. Besides, a branch can contain a reference to a description of an unloaded type from an external assembly. Before working with the branch's value, the method EnsureCached should be called. This will cause type information to be loaded, if it has not been already.

Notice that nested types are not in this tree! To get the list, you should call the method GetNestedTypes of class TypeBuilder.
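For instance, inspecting a branch of the tree from inside a macro might look roughly like this (a sketch; I assume the tree is reachable through Manager.NameTree and that Message.Hint is available for diagnostics):

```nemerle
def manager = Nemerle.Macros.ImplicitCTX ().Manager;
// Path() creates missing branches, so use it only for paths that are
// known to exist (or probe with TryPath/PassTo first).
def node = manager.NameTree.NamespaceTree.Path ("System.IO.File");
// Force lazy loading of external type information before reading Value.
node.EnsureCached ();
match (node.Value)
{
  | Cached (ti)          => Message.Hint ($"found type $(ti.FullName)")
  | CachedAmbiguous (_)  => Message.Hint ("ambiguous name")
  | _                    => Message.Hint ("not a cached type")
}
```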

Conclusion of part one

The goal of this part of the article is to give you basic information that will make it easier to deeply understand the Nemerle macro system (and the compiler in general). It touches many subtle aspects of the compiler's work. Do not worry if you did not memorize everything at once. You can always return to this part and read a section over.

In the next part I will tell you how to create specific macro types, about the kinds of problems that can be solved with the various types, and I will give you examples for each type of macro.

References

This text is based on an article from RSDN Magazine #1-2007 by Vlad Chistiakov (VladD2).
