Lexical structure (ref)

Table of Contents

Programs
Tokens
Blanks
Preprocessing directives
Identifiers
Keywords
Literals
String literals
Regular string literals - "string"
Verbatim string literals - @"string"
Recursive string literals - <#string#>
String interpolation i.e. $-notation
Character literals
Numeric literal
Integer literals
Floating point literals
Other literals

Programs

Programs are written in the Unicode character set, encoded as UTF-8. Every Nemerle source file is reduced to a sequence of lexical units (tokens) separated by sequences of whitespace characters (blanks).

Tokens

There are four classes of lexical tokens:

identifiers
keywords
literals
blanks

/* A comment. */
// Also a comment

foo                                              // identifier
foo_bar foo' foo3                                // other identifiers
'a 'foo'bar'baz                                  // more identifiers
42                                               // integer literal
1_000_000                                        // _ can be used for readability
1_42_00                                          // or unreadability...
0x2a                                             // hexadecimal integer literal
0o52                                             // octal integer literal
0b101010                                         // binary integer literal
'a'                                              // character literal
'\n'                                             // also a character literal
"foo\nbar"                                       // string literal
"foo" "bar"                                      // same as "foobar"
@"x\n"                                           // same as "x\\n"
@"x
 y"                                              // same as "x\n y"
<#This string type can contain any symbol, including "
and new lines. It does not support escape codes
like "\n".#>                                     // same as "This string type can contain any symbol, including \"\nand new lines. "
                                                 //       + "It does not support escape codes\nlike \"\\n\"."
@if                                              // keyword used as identifier
<#Test <# Inner #> end#>                         // same as "Test <# Inner #> end" (this string type supports nesting)

3.14f                                            // float literal
3.14d, 3.14                                      // double literal
3.14m                                            // decimal literal

10                                               // int
10u                                              // unsigned int
10b, 10sb, 10bs                                  // signed byte
10ub, 10bu                                       // unsigned byte
10L                                              // long
10UL, 10LU                                       // unsigned long

Blanks

Spaces, vertical and horizontal tabulation characters, new-page characters, new-line characters and comments (collectively called blanks) are discarded, but they can separate other lexical tokens.

A traditional comment begins with /* and ends with */. Traditional comments cannot be nested. An end-of-line comment starts with // and ends at the line terminator (ASCII LF character).

Preprocessing directives

A set of preprocessing directives is available for conditional compilation and for changing the line numbering context. They are the same as in C#. The allowed directives are:

#define
#undef
#if
#elif
#else
#endif
#line
#error
#warning
#region
#endregion
#pragma
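
As a minimal sketch of conditional compilation, assuming the directives behave as they do in C#: the symbol name VERBOSE is made up for this example, and #define is placed at the very top of the file, as C# requires.

```nemerle
#define VERBOSE

using System.Console;

#if VERBOSE
WriteLine("verbose build");   // kept, because VERBOSE (a hypothetical symbol) is defined above
#else
WriteLine("quiet build");     // dropped before the compiler sees it
#endif
```

Removing the #define line would select the #else branch instead.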

Identifiers

Ordinary identifiers consist of letters, digits, underscores and apostrophes, but cannot begin with a digit. An identifier may be quoted with the @ character, which is stripped. Quoting removes any lexical and syntactic meaning from the following string of characters up to the next blank, which lets the programmer use keywords as identifiers.
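
For illustration, a short sketch of quoting and of apostrophes in names. The variable names are arbitrary; only the @ quoting rule and the identifier character set come from the text above.

```nemerle
def @if = 10;                   // if is a keyword, but @if is an ordinary identifier
def x' = @if + 1;               // apostrophes are allowed in identifiers
System.Console.WriteLine(x');
```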

There is an important difference between identifiers that start with the underscore character (_) and all others. When you define a local value whose name starts with _ and never use it, the compiler does not complain. It does warn about other unused values.
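
A small sketch of that difference; both names are invented for the example, and the exact warning text depends on the compiler.

```nemerle
def _scratch = 40 + 2;   // never used: no warning, because the name starts with _
def answer   = 40 + 2;   // never used: the compiler is expected to warn about this value
```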

Symbolic identifiers consist of the following characters: =, <, >, @, ^, |, &, +, -, *, /, $, %, !, ?, ~, ., :, #. Symbolic identifiers are treated like standard identifiers, except that they are always treated as infix operators.
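
Because a symbolic identifier is still an identifier, it can name a member once it is quoted with @. The sketch below assumes the usual Nemerle operator-overloading form (a public static method whose name is the quoted operator); the Vec class and the Program module are hypothetical.

```nemerle
class Vec
{
  public X : int;
  public Y : int;

  public this(x : int, y : int) { X = x; Y = y; }

  // @+ quotes the symbolic identifier + so it can be used as a method name
  public static @+ (a : Vec, b : Vec) : Vec
  {
    Vec(a.X + b.X, a.Y + b.Y)
  }
}

module Program
{
  Main() : void
  {
    def v = Vec(1, 2) + Vec(3, 4);   // unquoted, + is used as an infix operator
    System.Console.WriteLine($"$(v.X), $(v.Y)");
  }
}
```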

Keywords

The following identifiers are keywords and may not be used unquoted in any other context: _, abstract, and, array, as, base, catch, class, def, delegate, do, else, enum, event, extern, false, finally, for, foreach, fun, if, implements, in, interface, internal, lock, macro, match, module, mutable, namespace, new, null, out, override, params, private, protected, public, ref, sealed, static, struct, syntax, this, throw, true, try, type, typeof, unless, using, variant, virtual, void, when, where, while, assert, ignore.

The following infix identifiers are reserved keywords: =, $, ?, |, ->, =>, <[, ]>, &&, ||.

Literals

There are several kinds of literals:

String literals, enclosed in " ", @" " or <# #>.
Character literals, enclosed in ' '.
Numeric literals, divided into integer literals and floating point literals.

String literals

A string literal represents a string constant. Nemerle supports three forms of string literal:

Regular string literals.
Verbatim string literals.
Recursive string literals.

Regular string literals - "string"

A regular string literal consists of zero or more characters enclosed in double quotes. It may include simple escape sequences (such as \n for the newline character) as well as hexadecimal and Unicode escape sequences (see Character literals for details).

Verbatim string literals - @"string"

A verbatim string literal consists of an @ character followed by a double-quote character, zero or more characters, and a closing double-quote character. The characters between the double quotes are taken verbatim; the only exception is the sequence "", which stands for a single " character. Simple escape sequences and hexadecimal and Unicode escape sequences are not recognized in verbatim string literals. A verbatim string literal may span multiple lines.

Examples:

def s1 = "Nemerle string !";            // Nemerle string !
def s2 = @"Nemerle string !";           // Nemerle string !
def s3 = "Nemerle\tstring !";           // Nemerle    string !
def s4 = @"Nemerle\tstring !";          // Nemerle\tstring !
def s5 = "I heard \"zonk !\"";          // I heard "zonk !"
def s6 = @"I heard ""zonk !""";         // I heard "zonk !"
def s7 = "\\\\trunk\\ncc\\ncc.exe";     // \\trunk\ncc\ncc.exe
def s8 = @"\\trunk\ncc\ncc.exe";        // \\trunk\ncc\ncc.exe
def s9 = "\"Nemerle\"\nrocks\n!";       // "Nemerle"
                                        // rocks
                                        // !
def s10 = @"""Nemerle""                 // same as s9
rocks
!";

String s10 is a verbatim string literal that spans 3 lines.

Recursive string literals - <#string#>

A recursive string literal is similar to a verbatim string literal, but it allows quote characters to appear unescaped (which makes it more flexible) and it allows nested <# #> strings.

def s1 = @"Nemerle\tstring !";          // Nemerle\tstring !
def s2 = <#Nemerle\tstring !#>;         // Nemerle\tstring !
def s3 = @"I heard ""zonk !""";         // I heard "zonk !"
def s4 = <#I heard "zonk !"#>;          // I heard "zonk !"
def s5 = "\"Nemerle\"\nrocks\n!";       // "Nemerle"
                                        // rocks
                                        // !
def s6 = <#"Nemerle"
rocks
!#>;                                    // same as s5
def s7 = <#"Nemerle"<#Nested#>string#>; // "Nemerle"<#Nested#>string

String interpolation i.e. $-notation

You can put the $ operator before a string literal to enable string interpolation (the $ operator is now shorthand for Nemerle.IO.sprint).

def x = 40;
def y = 42;
System.Console.Write ($ "$(x + 2) == $y\n")

Any expression can be used in $(...), but embedded strings and similar constructs may cause problems. The notation is meant for simple expressions such as array or field access, method calls, etc. If you need embedded strings, use the <# #> string type. For example:

WriteLine($<#Test $("concate" + "nation")!#>); // => Test concatenation!

You can also use the ..$ notation, which simplifies printing sequences (anything that implements System.Collections.Generic.IEnumerable[T]).

The syntax: ..$sequence
Or: ..$(seq; separatorStringExpression)
Or: ..$(seq; separatorStringExpression; elementConversionFunction)

For example:

using System.Console;

def x = 1;
def lst = [1, 2, 3, 52];
WriteLine($<#lst = ..$(lst; "; ");#>); // Similar to @"lst = ..$(lst; ""; "");" but easier to read.
def sep = "; ";
def cnv = x => "0x" + x.ToString("X");
WriteLine($@"x = $x; lst = ..$(lst; sep; cnv);");
WriteLine($".$x;");
WriteLine($"lst = '..$lst';");

The output of the snippet above will be:

lst = 1; 2; 3; 52;
x = 1; lst = 0x1; 0x2; 0x3; 0x34;
.1;
lst = '1, 2, 3, 52';

See more examples in https://github.com/rsdn/nemerle/tree/master/ncc/testsuite/positive/printf.n

Character literals

A character literal consists of either a single character enclosed in single quotes (' ') or an escape sequence of the form '\X', where X is one of the following: [FIXME:]

\, ', " - represent (respectively) a backslash, a single quote and a double quote
0 - zero character
$ - dollar sign
0X - where X is an octal ASCII code (up to three digits) of the character we want to represent (N)
xX - where X is a hexadecimal ASCII code (exactly two digits) of the character we want to represent (N)
xX - where X is a hexadecimal Unicode code (at most four digits) of the character we want to represent
uX - where X is a hexadecimal Unicode code (exactly four digits) of the character we want to represent (N)
UX - where X is a hexadecimal Unicode code (exactly eight digits) of the character we want to represent (N)
a - matches a bell (alarm) (N)
b - matches a backspace \u0008
r - matches carriage return \u000D
v - matches vertical tab \u000B (N)
t - matches horizontal tab
f - matches form feed \u000C (N)
n - matches a new line \u000A
e - matches an escape \u001B
cX - matches an ASCII control character; for example \cC represents control-C (N)

It has type char.
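
A few character literals, as a sketch that uses only escapes from the list above that carry no (N) mark; the names are arbitrary.

```nemerle
def letter    = 'a';    // a single character
def newline   = '\n';   // new line, \u000A
def tab       = '\t';   // horizontal tab
def backslash = '\\';   // the backslash itself
def quote     = '\'';   // a single quote
def nul       = '\0';   // zero character
```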

Numeric literal

Numeric literals may carry a suffix describing their type, as shown in the Tokens section. The suffix is insensitive to case and to the order of its symbols.

The underscore can be used to separate groups of digits.

Integer literals

Integer literals may be prefixed with 0x, 0o or 0b to denote hexadecimal, octal or binary encoding respectively. Prefixes are case-insensitive too.

decimal_literal = digits [ { '_'  digits } ] [suffix]
digits = { decimal_digit }
suffix = integer_suffix
integer_suffix = 'b' | 'sb' | 'ub' | 's' | 'us' | 'u' | 'l' | 'lu'
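
A few declarations that follow these rules, as a sketch. The variable names are arbitrary, and the type comments restate the examples in the Tokens section.

```nemerle
def million = 1_000_000;   // underscores separate digit groups; type int
def hex     = 0x2a;        // hexadecimal
def hexUp   = 0X2A;        // prefixes are case-insensitive
def oct     = 0o52;        // octal
def bin     = 0b101010;    // binary
def big     = 10UL;        // unsigned long
def big2    = 10lu;        // same type: suffix case and order do not matter
```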

Floating point literals

Floating point literals are defined as:

floating_point_literal =
    [ digits_ ] '.' digits_ [ exponent ] [ suffix ]
  | digits_ exponent [ suffix ]
  | digits_ suffix
exponent = exponential_marker [ sign ] digits
digits = { digit }
digits_ = digits [ { '_' digits } ]
exponential_marker = 'e' | 'E'
sign = '+' | '-'
digit = decimal_digit
suffix = floating_point_suffix
floating_point_suffix = 'f' | 'd' | 'm'
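
The declarations below are a sketch derived from the grammar above, one per production or option. They have not been checked against a particular compiler version, and the names are arbitrary.

```nemerle
def pi      = 3.14;      // no suffix: double
def piF     = 3.14f;     // float
def price   = 19.99m;    // decimal
def small   = 2.5e-3;    // fraction followed by an exponent
def large   = 1e10;      // digits followed by an exponent
def grouped = 1_000.5;   // underscores between digit groups
def ten     = 10f;       // digits followed by a suffix
```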

Other literals

literal = 'true' | 'false' | 'null' | '()'

true and false have type bool and represent the true and false boolean values respectively.

null represents a special instance of any reference type that you cannot dereference.

() represents the only instance of type void.