Vellum Manual - mr-martian/quest-editor GitHub Wiki

Introduction

Vellum is best described as a procedural language that feels very much like a functional language, while technically being neither. Like most modern languages, it is multi-paradigm, albeit with an intentional neglect of object-orientation at the language level. It blends elements from the

  • procedural
  • functional
  • concatenative
  • generic

programming paradigms to make (what we hope is) a rich and productive programming experience.

Intro

Structured syntax

The main way Vellum is intended to be used is via the structured syntax. It provides a rich and familiar programming experience, which should be easy to approach for most programmers. The structured syntax can represent anything that can be expressed in an applicative style, so it is possible to use Vellum entirely through the structured syntax, treating it as if it were just an ordinary procedural language with functional aspects. With more familiarity with the language, however, it is expected that Substrate will be used when its properties are useful.

UFCS

Vellum has uniform function call syntax. That is, f(a) and a.f() can call the same function. These rules also subsume some of the functionality of argument-dependent lookup (ADL) in C++. Specifically, given let a : A, and an unqualified function name f, the rules are as follows: [Todo: formalize these rules better.]

  • For the expression f(a), look up f in the current scope, and then attempt to type-match it with either a or &a as the argument, in that order. If no function can be bound, look up f in the scope of A, and repeat type matching.
  • For the expression a.f(), look up f in the scope of A, and then attempt to type-match it with either a, &a, or a^ as the argument, in that order. If no function can be bound, look up f in the current scope, ignoring variable names, and repeat type-matching.
  • For the expression a.f, look up f in the scope of A. If that lookup finds a function callable as f(a), but the expression a.f itself does not type-check, then rewrite the expression to a.f().
  • In the case of a.f(b), for any number of parameters in place of b, if lookup of a.f(b) fails, but a candidate function is found in the scope of A such that a.f() would succeed, rewrite the expression to a.f()(b) and retry. This rewrite is performed at most once.

The latter two rules allow leaving off empty parentheses for function calls with . where it is unambiguous. [Note: These rules are very complex and will probably be altered to be simpler and more intuitive.] The rules in this section are applied after any rewriting of overloaded operator expressions into ordinary function calls, or other similar rewrites that produce unqualified names.
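
As an informal illustration of the first two rules (the function here is hypothetical), the following calls resolve to the same free function:

fn double(x : i32) -> i32 = x + x;

let n = 21;
let a = double(n);     // ordinary call syntax: double is found in the current scope
let b = n.double();    // UFCS: the same function, called in method style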

Substrate syntax

The computational model underlying Vellum is very different from most languages you may be familiar with. Instead of the usual imperative paradigm, or the somewhat less common functional one, Vellum is concatenative. Additionally, it provides direct access to that low-level model via the sub-language named Substrate, so named because it is the foundation the rest of the language is built on. A general primer on concatenative programming is included later in this manual; this section describes Vellum's implementation of the concept.

Substrate expressions are, syntactically, extremely simple. There is only one primary syntactic primitive, the term. Terms include (non-exhaustively) literals, variables, function names, certain keywords, variable bindings and unbindings, and quotations, which are similar to lambdas. A Substrate expression is simply a list of terms. Almost all of the syntactic complexity of structured Vellum is completely absent in Substrate.
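
As a minimal sketch (using the (: ... :) delimiters described under Special Operators to embed a Substrate expression in structured code), a Substrate expression is nothing more than a flat sequence of terms:

(: 2 3 + :)        // 2 and 3 are placed on the notional tuple; + consumes both and leaves 5
(: 2 3 + 4 * :)    // equivalent to the structured expression (2 + 3) * 4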

However, and perhaps uniquely among concatenative languages, Vellum has function overloading, and in particular, variadic functions. This requires a more sophisticated way to name things than bare identifiers, so Vellum introduces discriminators, which are parts of terms rather than terms in themselves. There are three kinds of discriminator. Two of them, the scope resolution discriminator (ex: scope::id) and the signature discriminator (ex: (fn(T, U) -> R) id), are useful in structured Vellum as well, while the overload disambiguation discriminator (ex: id!3) is almost entirely useless outside of Substrate because its meaning is redundant with the information encoded by the structured syntax.

Unlike most concatenative languages, Substrate is statically-typed and statically-bound, which each impose certain restrictions compared to languages such as Forth. Instead of being described by a stack, the Substrate execution environment is described by a "notional tuple" and the set of names in scope. Every term has an environment effect, which is a description of its effects on the notional tuple, and a lexical effect, which is its effect on the set of names in scope. Because Vellum uses lexical scoping and static binding, lexical effects are generally uncommon and obvious, being essentially limited to variable bindings and function/procedure definitions, and so need little discussion.

In languages such as Forth, it is possible to redefine any term at all at any time, including literals, so that for instance, the term 4 actually pushes the value 5 to the stack. Vellum has static binding of names, and only allows valid identifiers to be defined (no keywords or literals), which makes such redefinition impossible. This is considered to be no loss at all, but rather an improvement, because there is no valid use-case for such tricks.

Static typing is inconsistent with dynamic environment effects, which means that every term must consume and output a fixed number of values, which must have statically-resolvable types. Usually, the compiler is able to keep track of these effects without any annotation; however, sometimes it is necessary to use signature discriminators to explicitly declare environment effects. If such an annotation disagrees with what the compiler can determine, the program is ill-formed.

The overload disambiguation discriminator is used to specify the number of inputs an overloaded function name should consume. When not explicitly disambiguated, function overload resolution in Substrate works in a different, and sometimes unwanted, way: with the exception of certain operators, given a notional tuple such as ... w x y z and a term such as n::f identifying an overloaded function, the compiler attempts to bind n::f(z), n::f(y, z), n::f(x, y, z), n::f(w, x, y, z), ..., in turn, binding the first function which successfully type-checks. If n::f(z) is valid, but n::f(x, y, z) was intended, the programmer would have to write n::f!3 instead, forcing the compiler to pass exactly 3 arguments, no more and no less; if that call cannot be bound, the program is ill-formed. Additionally, if an overloaded function takes many arguments but has forms taking fewer, having to consider so many bindings could be slow for the compiler even when type checking is fully unambiguous, so it is a good idea to specify the discriminator for any overloaded function.
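
For example (the area overloads below are hypothetical and are declared in structured syntax for clarity), the discriminator forces a particular number of inputs to be consumed:

fn area(s : Float64) -> Float64 = s * s;                  // square with side s: one argument
fn area(w : Float64, h : Float64) -> Float64 = w * h;     // rectangle: two arguments

(: 3.0 4.0 area :)      // binds area(4.0), the fewest arguments that type-check
(: 3.0 4.0 area!2 :)    // binds area(3.0, 4.0); exactly two values are consumed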

The operators which are an exception to the usual rules for order of binding are those with both binary and unary forms, such as + and &. Pending a case-by-case review, such operators are tried with two operands, and terms such as +!1 must be used to invoke the unary forms.

Additionally, instead of using () to invoke callables, the call term is used. TODO: define how call works (the semantics assumed in loc_stats.vll may not be sound.)

Substrate flow control

In Substrate, several flow control keywords work differently from how they work in structured Vellum.

if takes as inputs (... condition true-function false-function), plus whatever the environment effects of the branches are (they must be compatible). There is no else for if; it always takes exactly those inputs. unless has the same caveats as if does. (? :) (space required) is exactly the same as if. (?:) and (??) both take (... input else-function), where input is a plain value and else-function is invokable, but otherwise they work exactly as you would expect from the structured syntax.

TODO: match

Though they are not strictly control flow operators, the logical AND and OR operators do short-circuit in structured Vellum. TODO.

TODO: Do loops even exist in Substrate?

For result expressions, the order is changed from 'result' [label] [expression] to expression 'result'[label] (there can be no space between result and the : of the label), and the same for the other related keywords. The equivalent of structured nullary result is None result. However, in general, result-expressions are not very natural within Substrate expressions.

Definitions

Types

This section describes the Vellum type system.

Builtin types

These types are built into the language. All compound types are ultimately composed of some combination of builtin types.

Integers

Like many languages, Vellum has two families of integer types: signed and unsigned. Each can be declared to have any bit width. The signed types are spelled iN or int(N) and the unsigned types are spelled unsigned(N) or uN, for any value of N between 1 and an implementation-defined limit of no less than 65,536 (2^16) and no greater than 2,147,483,647 (2^31 - 1). (i0 and u0 are valid, but are aliases of None and are not integer types. Attempting to specify a width greater than 2,147,483,647 is a syntax error.)
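
For instance (a sketch using the variable syntax described under Declarations), both spellings name the same types:

let a : i32 = 100;           // 32-bit signed integer
let b : int(32) = 100;       // the same type, spelled via the underlying template
let c : u8 = 255;            // 8-bit unsigned integer
let d : unsigned(8) = 255;   // the same type, longer spelling
let e : i24 = 0;             // arbitrary widths are permitted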

The int(N) and unsigned(N) types are, in fact, builtin templated structs defined in the compiler. This means that they fit in very naturally with the rest of Vellum's type system.

Additionally, the type iSize is defined to be an alias to the signed integer type with the width of a pointer on the machine. (There is no complementary uSize, as it is not generally needed in Vellum, but there is a type abi::size_t provided by the library, intended for interfacing with C and C++. Several other ABI-compatibility integer types are also defined in the library. For more details, see the ABI/FFI section.)

Integer types are also used to represent string code units, often ambiguously called "characters". Encoding is specified externally. For ABI compatibility with C, the type abi::wchar_t is defined as an alias to an implementation-defined integer type. The types abi::char, abi::char8_t, abi::char16_t, and abi::char32_t are also defined as aliases to the obvious corresponding integer types.

Floating-point Numbers

Vellum has two binary floating-point types, Float32 and Float64. These are defined to be IEEE754 single and double precision, respectively, and generally correspond to C's float and double, respectively. (There is also abi::long_double, which is an implementation-defined type compatible with the C long double type on the platform.)

Unlike the integer types, the floating-point types are not part of a larger generalized family. There is an underlying template for the purposes of genericity, but its argument is an enum, defined as follows:

enum vll::Float_Tag {
    single,
    double,
    long_double,
    // implementation-defined
}

struct Float(f : vll::Float_Tag);

Bool

The Bool type is in fact a builtin enum type, exactly equivalent to:

enum(u1) Bool #[exhaustive, castable] {
    false = 0,
    true = 1,
}

let const false = Bool::false;
let const true  = Bool::true;

Except that Bool, true, and false are all keywords and thus this code would not be valid Vellum. (Note however that despite having a literal following the ::, which is normally invalid syntax, Bool::true is a valid construct which is identical to the unqualified literal true.)

That is, Bool is an enumeration of two values which is explicitly convertible to and from an underlying one-bit unsigned integer type and, in fact, explicitly or contextually convertible from any integer type at all, and from optional or optional-semantic types as well.

Byte

The type Byte is the type of raw memory.

References and pointers

Vellum has three reference type categories, which are spelled &T (reference to T), &mut T (mutable reference to T), and &const T (reference to immutable T). Both &mut T and &const T can be implicitly converted to &T, but the reverse conversions are impossible. Additionally, an lvalue of type T can be implicitly converted to &T (or &const T if valid) in certain situations, such as function calls, but not to &mut T. The unary prefix & operator may be used on an lvalue of type T to obtain a &mut T.

Vellum also distinguishes between references to scalar (non-array) types (&T), array types of fixed bound (&[N]T), and array types of unknown or dynamic bound (&[]T), as well as (for multidimensional arrays) combinations of the latter two. For scalar references, the unary postfix ^ operator may be used on any kind of reference to obtain an lvalue of the same mutability. The . operator may implicitly dereference a scalar pointer when looking up the name to bind. No other operations are defined.

For array references, the postfix [] operator can be used to index the referenced array and yield an lvalue referring to an element thereof. Additionally, pointer arithmetic may be performed on array references. No other operations are defined. The . operator may be used on array references when the called function acts on the reference itself, but not when the called function acts on the element type. That is, pa.f will never call pa[0].f.
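
As a brief sketch of the scalar-reference operators described above:

let mut x : i32 = 5;
let p = &x;      // the unary prefix & operator yields a &mut i32
p^ := 10;        // the unary postfix ^ operator yields a mutable lvalue
let y = p^;      // reading through the reference: y is 10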

Vellum does not have null references. Any reference is guaranteed to refer to an object of the specified type. Additionally, dynamic array references keep track of the size of the referenced array. A regular reference may refer to a mutable object and may alias with either mutable or immutable references (but never both). Immutable references cannot refer to mutable objects and cannot alias with mutable references. This lack of aliasing allows for certain optimizations to be made. If nullable reference semantics are desired, use an optional reference instead. This specific case is optimized to avoid storing a discriminator in the union, instead using the underlying null pointer as a discriminating value.

For compatibility with unsafe languages such as C, raw pointer types are also defined. These may be qualified as mutable or immutable in the same way as for references. Note that *const T (as opposed to *mut T and *T) has no direct equivalent in C or C++ and therefore cannot be used with those languages' ABIs.

References may be converted to the corresponding raw pointer type. For references to arrays of unknown bound, this strips the size. For references to arrays of known bound, two conversions exist, one preserving the bound and one dropping it (thus, &[N]T is convertible to both *[N]T and *T). A raw pointer may be null. Raw pointers can be dereferenced with either the ^ or [] operators. Thus, *T acts somewhat like an unchecked (?&T)|(?&[]T).

A raw pointer may be pattern-matched on to eliminate null; the other branch may be a non-optional reference of compatible mutability (not to an array of unknown bound, however, because the dynamic size information cannot be re-introduced).

Optional and Result

Denoted by the ? type constructor (as in ?T), the optional type is in fact a language-level alias for the union type T | None. However, several builtin operators are defined for this specific union configuration.

The Result type, denoted by the ! type constructor, is an alias for T | Fail.
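
A brief sketch of the optional alias (the conversions rely on the rules in the Type conversions section):

let absent  : ?i32 = None;    // ?i32 is an alias for the union i32 | None
let present : ?i32 = 42;      // any type converts implicitly to a union containing it
let value = absent ?? 0;      // recovery: yields 0 because the dynamic type of absent is None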

Owner and Shared

The primary means for managing dynamic memory is with the owner(T) and shared(T) smart pointers. owner(T) models unique ownership and may not be copied, only moved. shared(T) models shared ownership via reference counting, and therefore allows copying. Generally it is uncommon to need explicit heap allocations, and when you do need them, owner(T) is typically used at least an order of magnitude more often than shared(T) is. Unlike references, smart pointers may be null.

The declaration

let mut x : owner(T) = f();

declares x as a mutable owner of an immutable object. That is, x may be mutated (such as to release the owned object and obtain a new one) but the owned object cannot be. In contrast,

let y : owner(mut T) = f();

declares y as an immutable owner of a mutable object. That is, the T instance may be mutated, but y will always own exactly that object until its destruction.

None

The None type and singleton is a builtin alias of the empty tuple (). It serves as the primary Unit type for Vellum, though it is possible to define others using structs. It is used to represent empty values and expressions with no meaningful result. It is also used as the replacement for void return types in C.

Because there is only one value of type None, the token None refers to both the value and the type of that value.

In type theory, the Unit type is the analogue of the number 1, and acts as a "multiplicative constant" in that a product type (tuple) gains no extra values by including it. That is, (T, None) is compatible in both directions with (T).

Noreturn

Noreturn is the Bottom type in Vellum. It is a type with no values, aliasing the empty union. It is impossible to instantiate a value of type Noreturn and is used as a signal that an expression does not return a result to its caller. It is the type of flow control operators like result and end, which jump to a different part of the program, as well as functions such as abort, which end execution of the program. As an empty union, Noreturn is a supertype of (is conformable with in expressions) all other types, and cannot itself be a union alternative.

In type theory, the Bottom type is the analogue of the number 0, and acts as an "additive constant", in that a sum type (union) gains no extra values by including it. That is, T | Noreturn is compatible in both directions with (and is in fact definitionally identical to) T.

Type

Type is a builtin alias for a union type with alternatives for structs, enums, tuples, unions, arrays, references, pointers, and functions. It is used for reflection and metaprogramming, and is most commonly used as a parameter type for templates. Values of type Type may be constructed at runtime, but no mutable operations can be performed on them; they are merely tokens which can be inspected to get such properties as the name of a type.

__intN and __uintN

The __intN and __uintN type families are used as the lowest-level implementation types of integers within the compiler (for instance, __int32 is the type of the argument to the int(N) template). There is almost no reason to use them directly, as they are less featured than the regular int(N) and unsigned(N) types. The full list of low-level integer types is exactly: __int8, __int16, __int32, __int64, __uint8, __uint16, __uint32, and __uint64. Each such type is an independent type; they are not instantiations of templates. No other bit widths of low-level integers are defined.

Enums

enum-definition ::= 'enum' enum-name '{' enumerator {',' enumerator} [','] '}'
enumerator      ::= identifier ['=' expression]

An enum is a type with enumerated, named values. Enums are similar to scoped enums in C++, but by default the conversion to integer type is one-way only. Several attributes are provided to extend the basic functionality.

The builtin Bool type is in fact just an enum type with two values. The only special properties it has are that its enumerators, true and false, are also keywords, and certain unique builtin operators are defined for it.

enum types may be templated.
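
For example (a sketch following the grammar above; the scope-resolution access mirrors Bool::true):

enum Direction {
    north = 0,
    east,
    south,
    west,
}

let d = Direction::east;    // enumerators are named via scope resolution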

Tuples

Tuples, also known as product types in type theory, are the most basic way to aggregate multiple values together into a single object. A tuple is composed of an ordered, fixed-size, heterogeneously-typed sequence of anonymous subobjects. Tuples are implicitly created and unwrapped by the language in many contexts. Because the members are anonymous, a tuple type is exactly identified by its list of members. A tuple of tuples is conformable with a tuple formed from the concatenation of all of its member tuples. A tuple of a single element is conformable with that element's bare type.

The None type is in fact an alias for the empty tuple. Tuples may contain the None type, but it may be implicitly dropped or introduced by conversions, given that concatenation with an empty tuple is the identity function.

The primary tuple type constructor is the parenthesized comma-separated list. In contexts where commas and parentheses have special meaning, tuple types may also be constructed with the named tuple(Ts...) type constructor. The empty tuple may be written as (), but None is generally the preferred spelling.
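
For instance (a sketch; the parenthesized value syntax is assumed here for illustration):

let pair : (i32, Bool) = (42, true);
let same : tuple(i32, Bool) = pair;    // the named constructor spells the same type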

Unions

Unions, also known as variants or as sum types in type theory, are used when a value may be of exactly one of several possible alternative types. Union alternatives are unordered and duplicates are ignored, which is to say that a union type is composed of a set, rather than a list, of types.

The Noreturn type is in fact an alias for the empty union.

The primary union type constructor is |. When in generic variadic contexts, where binary operators are less ergonomic to write, the named union(Ts...) type constructor may also be used.
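
For instance (a sketch relying on the implicit conversion of a type into a union containing it):

let v : i32 | Bool = true;        // the dynamic type is currently Bool
let w : union(i32, Bool) = v;     // the named constructor spells the same type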

Functions/closures

Functions and closure types are categorized along several axes. The most obvious one is the type signature, which is a combination of a tuple of argument types and a return type (which may be, but doesn't have to be, a tuple), separated by ->. Every such combination is a unique function type signature. The normal tuple conformability rules do not apply to function arguments or return types; however, function type signatures that differ only in the insertion or removal of None are still conformable.

Aside from the infinite set of type signatures, function types are divided by purity and by captures. A pure function type is denoted by the fn keyword, while an impure function type is denoted by the proc keyword. References to pure functions can be converted to references to impure functions. Functions may capture either a simple reference, which is denoted by the & character before the argument list, or they may capture an arbitrary tuple of values, which are enclosed in braces before the argument list. A reference to a non-capturing function can be converted to a reference to a reference-capturing function.

Language linkage is part of the type of a function, and whether function types of different language linkage are convertible is implementation-defined.

No other conversions amongst function types are valid.

Function types are written as (described via a form resembling a regex) (fn|proc) (&|{Cs...})? (As...) -> R, where As... represents the arguments, R represents the return type, and Cs represents the value captures, if any. A reference to a function type is written by placing the & before the fn or proc keyword. Function types may need to be parenthesized in order to include them in unions.
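
Some example spellings (the exact whitespace and the Counter type are illustrative assumptions):

fn(i32) -> i32                 // pure, non-capturing
proc(i32, i32) -> None         // impure, non-capturing
fn&(i32) -> i32                // pure, capturing a simple reference
proc{Counter}() -> i32         // impure, capturing a Counter by value
&fn(i32) -> i32                // reference to a pure, non-capturing function
(&proc(i32) -> None) | None    // parenthesized so it can appear as a union alternative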

Arrays

Arrays are built into Vellum's type system. Both fixed-size and dynamic array types are provided, and multidimensional types may have any combination of bounds. Dynamic bounds are indicated by the lack of a number in the type-id. A fixed-size linear array is spelled [N]T, while a dynamically-sized linear array is spelled []T. A square array of size 10×10 is spelled [10, 10]T, while a 2D array whose size is unknown in both dimensions is spelled [,]T. Arrays of constant size are allocated on the stack by default, while arrays of dynamic bound are allocated automatically on the heap. This allocation is managed via RAII.
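
For instance (a sketch; the braced initializer form is assumed from the variable grammar):

let fixed : [4]i32 {1, 2, 3, 4};    // fixed bound: stack-allocated
let dynam : []i32 {1, 2, 3, 4};     // dynamic bound: heap-allocated, released via RAII
let mut grid : [10, 10]Float64;     // 10×10 square array, uninitialized (so it must be mut)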

Structures

struct-declaration ::= 'struct' struct-name [argument-list] (';' | struct-body)
struct-body   ::= '{' {struct-member} '}'
struct-member ::= data-member-definition
                | [access-key] fn-declaration
                | [access-key] default-fn-definition
                | 'defer' trait-id 'to' identifier new-line
                | 'implements' trait-id new-line
                | alias-definition
data-member-definition ::= 'let' var-name ':' type-id access-specifier
access-specifier ::= '{' (access-key [',' access-key2] ['mut']) '}'
access-key       ::= 'public' | access-key2
access-key2      ::= 'module' | 'private'

default-fn-definition ::= [access-key] 'fn' default-fn-id '=' 'default' ';'
default-fn-id         ::= 'new' [argument-list] [type-constraint]

The most powerful way to define a new type in Vellum is the struct.

This

Within the body of a struct, the type This refers to that struct. This allows for a more uniform and less noisy declaration syntax.
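
For example (an illustrative sketch following the struct grammar; the access keys and the member function are assumptions):

struct Point {
    let x : i32 {public};
    let y : i32 {public};

    fn new = default;

    // Within the body, This refers to Point.
    fn manhattan(p : &This) -> i32 = p.x + p.y;
}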

Type conversions

Generally, conversions between unrelated types are disallowed. However, certain conversions are deemed safe and can be performed implicitly. For instance, integer types can be implicitly converted to wider integer types, and any type may be implicitly converted to a union type containing it. Safe conversions are pure and cannot result in a loss of data. All other conversions are considered potentially-unsafe, and can only occur at defined locations in the program, including converting contexts and explicit casts. Certain operators apply the potentially-unsafe conversion to Bool to some of their operands; in this instance the operator is said to be a converting context. This conversion is as-if a condition expression x were replaced with x.cast(Bool). Function calls can apply the object-to-reference conversion, normally performed by the unary & operator, to their arguments.

Explicit casts are done with the fn as(val, type) and fn cast(val, type) reserved functions. They are identical except that as shall be safe, as described above, and may be considered for implicit type conversions. They simply convert their argument val to the type type, and can be defined by the user for structs. as and cast are somewhat similar to the new reserved function, in that both result in a new object of a particular type. The difference is that as and cast are defined in the scope of the original type, while new is defined in the scope of the target type. Type conversion functions should be pure; a warning is issued for declaring either function with the signature proc (val, type). It is an error if x.cast(Bool) resolves to an impure function when called implicitly in a converting context.
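
For instance (a sketch using UFCS to call the reserved conversion functions):

let small : i32 = 1000;
let wide : i64 = small;          // safe widening conversion, performed implicitly
let narrow = small.cast(i8);     // potentially-unsafe narrowing requires an explicit cast
let flag = small.cast(Bool);     // the conversion a converting context applies implicitly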

Declarations

Variables

var-declaration    ::= 'let' var-definition var-initializer ';'
                     | 'let' 'const' var-definition var-initializer ';'
                     | 'let' 'mut' var-definition [var-initializer] ';'
var-definition     ::= var-name [':' type-constraint]
var-initializer    ::= equal-initializer | braced-initializer
equal-initializer  ::= '=' expression
braced-initializer ::= '{' expression-list '}'

(* no mutable global variables *)
global-var-declaration ::= 'let' var-definition var-initializer ';'
                         | 'let' 'const' var-definition var-initializer ';'

Variables are introduced by the let keyword. By default, variables are constant, meaning they cannot be modified after declaration. Constant variables must be initialized, while uninitialized mutable variables must have their type fully specified. The mut keyword can be used to declare a mutable variable, which can be modified or reassigned after declaration. The const keyword is used to declare a variable whose value is computed at compile-time and is usable in further compile-time computation; this differs from the meaning of the const reference qualifier. Variables at module or namespace scope may not be declared with the mut qualifier.
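
For instance:

let a = 5;               // constant; must be initialized, type inferred
let b : i32 = 5;         // constant with an explicit type constraint
let mut c : i32;         // mutable and uninitialized, so the type must be fully specified
c := 7;                  // assignment to a mutable variable
let const d = 1 << 20;   // computed at compile time and usable in compile-time computation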

Variables may also be introduced by function parameters. In such instances the syntax is similar, but the let keyword is dropped and initializers correspond to default arguments. Initializers are not required for constant parameters, nor are type constraints required for mutable parameters without default arguments, because the initialization of function arguments is performed by the caller. In all other respects, function parameters behave as if they were variables declared, in order, in a hypothetical scope that encloses the body of the function and returns its result.

Variables are bound to scopes and have lifetimes. The scope of a variable is the context in which its name is visible, which generally extends from immediately after its point of declaration until the end of the enclosing scope. The lifetime of a variable depends on where it is declared. At module or namespace scope, a variable's lifetime is that of the whole program. At struct scope, a variable's lifetime is bound to that of an instance of the enclosing struct. For more detail, see the section on structures.

At local scope, a variable's lifetime is from when its declaration is evaluated until the scope is exited. Local variables are destroyed promptly on exit, in reverse order of declaration (as if pushed to a stack), which enables a form of automatic resource management known as RAII, or Resource Acquisition Is Initialization.

Local variables, however, have the additional property that their lifetimes, as well as their scopes, may be explicitly ended by the program at any point using the consume operator, as long as the consumption is unconditional: there must never be an expression in which whether a variable is in scope depends on which path control flow took to reach it. Otherwise, the program is ill-formed.

This constraint is purely syntactic and does not take into account any reachability analysis the compiler may be able to perform. If a variable is consumed in one branch of an if statement, it must be consumed in the other as well, and the same goes for match. Only variables local to the body of a loop may be consumed within it.

The consume operator triggers either a destructive move, if its value is used, or else simple destruction. Unlike ordinary move operations, a variable does not need to be mutable to be consumed, because the variable ceases to exist after destruction. Because variable scopes begin after their complete declaration, it is possible to write

let mut a : i32;
a := 1;
let a = consume a;

which has the effect of replacing the mutable variable a with a constant variable also named a, having the same value. At a higher level, it may be described as dropping the mutability of a, and 'finalizing' its value. [Note: may consider replacing this feature with an explicit 'finalize' feature.]

Functions and closures

fn-declaration ::= prototype fn-body
                 | prototype ';'
fn-body        ::= block | '=' expression ';' | '=' 'delete' [reason-expr] ';'
prototype      ::= fn-prologue fn-name argument-list [return-spec]
anon-prototype ::= fn-prologue         argument-list [return-spec]
return-spec    ::= '->' type-constraint
fn-prologue    ::= ['extern' [string-literal]] ('fn' | 'proc')
argument-list   ::= '(' {arg-definition [default-arg]} ')'
arg-definition  ::= arg-name [':' arg-type-constraint]

The basic unit of behavior in Vellum, as in all procedural languages, is the function/procedure. Vellum maintains a distinction in the type system between pure and impure functions. Pure functions are introduced by the fn keyword, while impure functions, or 'procedures' in Vellum parlance, are introduced with the proc keyword. Pure functions are free of side-effects and are referentially transparent, which means that they are very easy to reason about. Additionally, pure functions can be executed at compile-time, which enables very powerful and seamless metaprogramming. However, in all but the most trivial programs, some manner of stateful interaction with the environment must eventually be performed for the program to do its job. Thus, procedures exist.

Signatures

A function signature represents its type.

fn(InType) -> OutType 
proc(InType) -> OutType 

It is relatively uncommon in Vellum to have to fully specify the types of all arguments to a function or procedure. Most functions, rather, are generic. This may sound problematic if you are familiar with most other statically typed programming languages, because generic functions are more difficult to write. However, in Vellum, the most common way to be generic is to simply write the simplest code that works. Type annotations are generally extra details in Vellum, usually not critical to understanding programs.
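
For example (an illustrative sketch; the literal forms are assumed), a function written without annotations is usable at any types for which its body type-checks:

// Generic by default: no type annotations are required.
fn sum3(a, b, c) = a + b + c;

let i = sum3(1, 2, 3);           // instantiated for integers
let f = sum3(1.0, 2.0, 3.0);     // instantiated for floating-point values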

Linkage

Functions and procedures have language linkage. This is denoted by extern plus any string constant. This is used for linking to code written in other languages. If unspecified, the default language linkage is extern "Vellum". extern "C" and extern "C++" are the most likely other values.

ABI

Vellum is a native language, and like all native languages, provides the ability to call or be called by programs written in any other native language. However, native interfaces are dominated by—nay, defined by—C and its particular strengths and weaknesses. Thus, no matter what is natural in any other language, they must support the data representations and the level of abstraction provided by C. Vellum is no different. The industry-standard way that C interoperability is achieved is ABI compatibility. Vellum additionally aims to target C++ ABI compatibility, in order to more efficiently communicate higher-level concepts such as ownership semantics. However, C++ is a much more complex target.

For this purpose, Vellum provides a number of ABI-compatibility features. Other than language linkage, most of these are in the forms of type definitions in the std::abi namespace. Using these types, function signatures can be written that are compatible between languages.

Statements

Vellum is an expression-oriented language, which means that statements are a very minor part of the syntax. In fact, the only kinds of statements that are not also expressions are variable declarations and assignments.

Templates

Various constructs besides functions can also be parametrized; the mechanism Vellum provides for this is templates. struct, enum, and trait declarations can all be templated.

Traits

Concatenative Macros

Reflection

Expressions

Primary expressions

Literals

Flow control

All control flow in Vellum is done with expressions, rather than statements. This is unusual for a procedural, structured language, but it is the standard for concatenative languages. Aside from having atypical and complex syntaxes, all control flow keywords are simply operators.

Conditionals

if-expression ::= ('if' | 'unless') expression block ['else' (block | if-expression)]
                | expression '?' expression ':' expression

Binary-conditional expressions include if-expressions and expressions using the operators ? :, ?:, and ??. The latter two are infix binary operators. match-expressions are also conditional expressions, but are described in a separate subsection for clarity.

if-expressions and the ? : operator work in essentially the same way. The difference is that if requires its branches to be brace-enclosed blocks, but its else branch is optional, while ? : takes three plain expressions but its else-equivalent is non-optional. if is intended to be used for conditional execution of multiple lines of code, where the result is usually unused, while ? : is intended to be used within expressions and its result is expected to always be used. If the else block of an if-expression is not present, the behavior is as if an empty block were present. unless acts exactly like if not.

The 'orelse' operator ?:, also known colloquially as the 'elvis' operator, returns its left operand if it evaluates to true, or else its right operand. a ?: b is similar in effect to a ? a : b, except that a is evaluated only once. It is an error if the conversion of its left operand to Bool potentially-modifies that object.

The 'recovery' operator ?? acts as a type-level conditional operator. If the left operand's dynamic type is None or Fail, it evaluates and returns its right operand. Otherwise, it returns its left operand. It is similar to the "null-coalescing" operator found in many dynamic/GC languages, except that Vellum does not have universal nullability, rather it uses sum types to represent optionality. (TODO: figure out how to handle None results which can fail)
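
A brief sketch of the conditional forms (f, print, and the string literals here are illustrative assumptions):

let n : ?i32 = f();                          // f may yield None
let x = n ?? 0;                              // recovery: 0 only if the dynamic type of n is None
let y = n ?: 1;                              // orelse: n if it converts to true, otherwise 1
let parity = (x % 2 == 0) ? "even" : "odd";
if n { print(x) }                            // else omitted: behaves as if an empty block were present
unless n { print(0) }                        // equivalent to: if not n { print(0) }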

match

match-expression ::= 'match' [label] expression '{'
                         match-rule {',' match-rule} [',' [trivial-match]]
                     '}'
match-rule       ::= match-specifier '->' (expression | block)
match-specifier  ::= argument-list | case-specifier
(* can stack cases *)
case-specifier   ::= 'case' primary-expression [case-specifier]

match is like a generalization of switch in many other languages. It can be used to inspect objects in complex ways, including matching on dynamic type or on values of members. Semantically, each match specifier is checked in order and the first matching one is evaluated. match expressions may be labeled, which allows result expressions to be bound to them by name, even in the case of nesting.
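
A small sketch (assuming that a case specifier matches a value and that an argument-list specifier matches and binds by dynamic type):

let n : i32 | Bool = 3;
let rank = match n {
    case true -> 1,
    case false -> 0,
    (i : i32) -> i,
};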

Loops

(* All loops can be labeled, allowing for named multi-level break *)
loop-expression ::=
    (* See below for examples *)
    'for' [label] for-loop-header block ['else' expression]
    (* while and do-while loops:
      while keepgoing() { go() }
      do { go() } until stop();
    *)
    | loop-constraint [label] condition block ['else' expression]
    | 'do' [label] block loop-constraint condition (';' | ['else' expression])
    (* infinite loop:
      loop {
        go();
        break if stop()
        go();
      }
      'loop' loops have no else because they have no condition that can be false
    *)
    | 'loop' [label] block

for-loop-header ::=
    (* iteration:
      for i in range { print(i) }
    *)
      var-definition 'in' expression
    (* counting:
      //count 0 to 9
      for i = 0 -> 10        { print(i) }
      // count 10 to 1
      for i = 10 ; -1        { print(i) }
      // count -10 to 8 by 2s
      for i = -10 -> 10 ; +2 { print(i) }
    *)
    | loop-var-init ['->' expression]
      [';' ('+' | '-' | ) integer-literal]
    (* counting from 0:
      // count 0 to 9
      for i -> 10 { print(i) }
    *)
    | ['mut'] var-definition '->' expression
      [';' ('+' | '-' | ) integer-literal]
    (* generalized for loop:
      for i = 10      while i > 1      ; i / 2    { print(i) }
      for i = begin() while i != end() ; i.next() { print(i) }
    *)
    | loop-var-init loop-constraint
      [';' expression]

loop-constraint ::= 'while'
                  (* equivalent to `while not` *)
                  | 'until'
loop-var-init ::= ['mut'] var-definition equal-initializer
                | '_'
label ::= ':' identifier
        | ':' ('for' | 'while' | 'do' | 'loop')

condition ::= expression
            | 'let' var-definition equal-initializer

There are four basic types of loop expressions: loop, while, do, and for loops. Any kind of loop can be labeled, which allows continue, end, and result-based expressions to be linked to specific loops despite nesting. The identifiers used for labels do not share a namespace with local variables, but may not be the same as any enclosing label or function/procedure name. Unlike in other cases, the last expression in a loop body is not used as the result of the loop, because it may be evaluated arbitrarily many times. Instead, the result of a loop expression is provided either by the operand of a result expression, or by evaluating an else clause attached to the loop. If there is no else clause, it is as if one simply containing None were present.

There is a special kind of result expression solely for loops, called break. break, by default, exits the nearest enclosing loop with result None. A label, operand, or both may be provided, in which case it acts exactly the same as result. The control flow expressions continue and end are unique to loops. continue jumps directly to the end of the loop body, evaluating local destructors but nothing else. It may be labeled, but takes no operand. end is similar to break, but it jumps to the else clause instead of to outside the loop. It can be labeled, but cannot take an operand. As normal, the result type of all of these operators is Noreturn.

The simplest kind of loop is the loop-loop, which simply repeatedly evaluates a block until explicitly exited by some means. Other than lacking an else clause (because it would be unreachable), loop loops are semantically defined by a direct translation to while true loops.

while (including until) and do loops are very similar, in that they repeatedly evaluate their body while a condition holds. while loops evaluate the condition first, then the loop body, then repeat, while do loops evaluate the loop body first, then the condition, then repeat. Thus, a do loop will always evaluate its body at least once. If the loop-constraint of a while or do loop is while, the body is evaluated if the condition evaluates to true, while if the loop-constraint is until, the body is evaluated if the condition evaluates to false. In the event that the loop exits because of the condition, the loop's else clause, if any, is executed.

for loops are the highest-level form of loops, and have the distinguishing feature of binding an iteration variable. The most general form is the generalized for loop, which binds an iteration variable, evaluates a condition (which may be constrained by either while or until), evaluates the loop body, and then rebinds the iteration variable to the result of an iteration-expression. If no iteration-expression is provided, the iteration variable is unchanged.

For while loops, the condition can be either an expression or a variable declaration; in the latter case, the variable is initialized once and lives until the end of the loop's associated else expression, and its value is used as the expression for the loop condition. For do loops, which check their condition at the end of the body, as well as loop loops, which do not check any condition, a declaration can be inserted before the body, but it is not implicitly used as the condition.

In the translation to substrate, while, do, and generalized for loops are translated to tail-recursive immediately-invoked lambda expressions.

The simplest other form of for loop is the counting for loop, which iterates over a range of numbers until it reaches an upper bound. Semantically, it is defined by a straightforward translation from something like for i = init -> bound ; + incr to for i = init while i < bound ; i + incr. If no increment is provided, it is assumed to be +1 if init <= bound and -1 if init > bound.
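
Concretely, borrowing an example from the grammar comments above:

for i = -10 -> 10 ; +2 { print(i) }
// is defined as equivalent to
for i = -10 while i < 10 ; i + 2 { print(i) }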

The other form is the range for loop, known in many languages as a "foreach" loop. It binds the loop variable to each element of a range in turn, where a range is any iterable type. Range for loops are semantically defined by a translation to generalized for loop which caches the end iterator and loops from the begin iterator to it, binding a local variable in the loop body to the result of dereferencing the iterator.

result and yield

result-expression  ::= 'result' [label] [expression]
                     | 'yield' [label] [expression]
                     | 'return' [expression]
                     | 'break' [label] [expression]

Result expressions are used to exit scopes prematurely, potentially carrying a value. All result expressions, including those with operands, have type Noreturn. result simply returns to an outer scope, while yield suspends execution of the generator expression or coroutine it is within, allowing it to be resumed at a later point. return is equivalent to result except that it is restricted and can only return from the nearest enclosing function. break is equivalent to result except that by default it exits the nearest enclosing loop expression. result with no label returns from the nearest enclosing labelable scope of any kind, that is, loops, match expressions, or functions/procedures. yield suspends and returns from the nearest enclosing generator expression or coroutine. [N.B.: coroutines and generator expressions have yet to be operationally defined.]
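
A brief sketch of a labeled break carrying a value (items and matches are hypothetical):

let found = for :search x in items {
    if matches(x) { break :search true }
} else { false };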

[Experimental: A labeled result can be used to exit from nested lambdas and reference-capturing local functions/procedures that meet certain restrictions, that can be summarized as maintaining a predictable stack/notional tuple state. Recursing using annotated #[call/cc] calls is allowed, but not if the lambda or local function/procedure potentially recurses in any other way. This is used in the language to lower loops to substrate expressions, but probably should be strongly avoided in regular code due to breaking the expected execution model.]

Compound expressions

Arithmetic operators

operator  arity  position  meaning
+         2      infix     addition
+?        2      infix     checked addition
+%        2      infix     wrapping addition
+         1      prefix    unary plus*
-         2      infix     subtraction
-%        2      infix     wrapping subtraction
-?        2      infix     checked subtraction
-         1      prefix    negation
*         2      infix     multiplication
/         2      infix     division
%         2      infix     remainder
mod       2      infix     modulus

* The unary plus operator is defined in the core language, for numeric operands, to return its operand unchanged. It exists for consistency with unary minus, better known as negation.

Bitwise operators

operator  arity  position  meaning
&         2      infix     bitwise AND
|         2      infix     bitwise OR
xor       2      infix     bitwise exclusive OR
~         1      prefix    bitwise complement
<<        2      infix     left shift
>>        2      infix     right shift

Logical operators

operator  arity  position  meaning
and       2      infix     logical AND
or        2      infix     logical OR
not       1      prefix    logical complement

The logical AND and OR operators short-circuit, as in most languages.

Relational Operators

operator  arity  position  meaning                     overloadable
==        2      infix     equal to                    Yes
!=        2      infix     not equal to                No
<=>       2      infix     3-way comparison*           Yes
<         2      infix     less than                   No
>         2      infix     greater than                No
<=        2      infix     less than or equal to       No
>=        2      infix     greater than or equal to    No

* Also known as the spaceship operator. It is used as a basis for the more usual comparison operators.

Assignment Operators

operator  arity  position  meaning
=         2      infix     initialization*
:=        2      infix     assignment*
+=        2      infix     increase
+?=       2      infix     checked increase
+%=       2      infix     wrapping increase
-=        2      infix     decrease
-?=       2      infix     checked decrease
-%=       2      infix     wrapping decrease
*=        2      infix     multiply
/=        2      infix     divide
%=        2      infix     remainder

* Technically, the = and := tokens are not operators, but rather syntactic elements. They cannot be used within expressions, but only in their dedicated syntactic constructs.

Indirection operators

operator  arity  position  meaning
&         1      prefix    address-of
.         2      postfix   member access
^         1      postfix   dereference*
[]        2+     postfix   element access

Contrasting with C's prefix *, the Vellum dereference operator ^ is postfix, such that all pointer-consuming operators are uniformly on the right. Also contrasting with C, the . operator will automatically* dereference pointers, rendering the C -> operator redundant.

* Due to UFCS, the . operator may find a function taking a pointer as an argument and thus it will not dereference the argument. This is usually the desired behavior anyway, but in rare instances it may be necessary to write p^.x, which is actually the direct equivalent to -> in C. The lack of need for parentheses makes a dedicated operator for this case unhelpful.

Special Operators

operator  arity  position  meaning
(:        1      prefix    begin substrate expression
:)        1      postfix   end substrate expression
::        2      infix     scope resolution
!         2      infix     overload disambiguation
|>        2      infix     pipeline rewrite
#         1      prefix    reflection
@         1      prefix    injection
consume   1      prefix    destructive move
drop      1      prefix    discard value
?:        2      infix     orelse
??        2      infix     recovery

The sequence (: ... :) is an abbreviation of, and exactly identical in meaning to, the sequence substrate { ... }.

:: and ! are discriminators, rather than being operators in the conventional sense, and have the same infix syntax even within substrate expressions. :: is used to refer to a member of a specified scope. ! is used to disambiguate overloaded functions by determining the number of arguments to pass. This is typically useful only within substrate expressions, which otherwise pass the least number of arguments that they can to an overloaded function. The other kind of discriminator is to prefix the function name with its signature in parentheses, which is needed only when taking a reference to a function, and never when calling it.

|> is the pipeline rewrite operator.

# and @ are the reflection and injection operators, respectively.

The consume operator is described in the section on variables. The drop operator is used to explicitly discard the value of an expression; this is analogous to casting to void in languages such as C and C++, and in the structured syntax it is typically used only to suppress warnings. In substrate expressions, however, it is an essential low-level primitive which discards the rightmost value from the notional tuple.

?: and ?? are conditional operators, described in detail in the section on control flow.

Custom operators

Custom operators may be declared by operator declarations, which assign a fixity and precedence to a particular custom operator.

Functions and Procedures

Vellum supports several mechanisms by which an object can participate in the environment as a callable entity.

  • Functions, which may not have side effects.
  • Procedures, which may have side effects.
  • Overloaded data structures (not finalized)

Vellum has two syntaxes for handling callable functions and procedures:

  • Structured Syntax
  • Concatenative Syntax

A Primer for Concatenative Programming

The concatenative substrate of Vellum is what makes it somewhat unique, as concatenative programming is largely overlooked by most programmers, but nonetheless quite powerful. It is useful to think of it as the dual of Lisp: rather than everything being a list that may be reduced, in concatenative expressions everything is a function composed of other functions.

Others have noted that concatenative programming is like an amplifier: bad or naive programmers produce unreadable write-once code, whereas good programmers can write brilliant, crystal clear programs. Be warned that concatenative syntax is seductive and easy to misuse; as such, we include a synoptic guide to avoid misunderstandings and avoidable mistakes.

Tenets of Concatenative Programming

Words and Vocabularies

Concatenative programming languages borrow much terminology from Forth, the first concatenative language, most notably, the terms word and vocabulary. A word is analogous to a function, and a vocabulary is analogous to a namespace. Forth didn't have explicit namespaces, using only the global namespace, but modern languages like Factor and Kitten do. The distinction between words and functions is at least somewhat useful because as will be explained later, functions in applicative languages are not equivalent to words in concatenative languages. We will hence borrow this terminology. Keep in mind that for all intents and purposes, a function is a word, but a word is not necessarily a function.

Point Free Style

Point Free programming is a style in which named variables are not used. Instead, data is passed from one function to another function without necessarily being given an identifier, like a variable. This is similar to the Unix pipeline, in which a program often provides text data to stdout, and the | operator can pipe it in to the next program without the original data needing to be explicitly named, such as in the example find . | grep "snakes and dogs".

Named variables are useful whenever a single invariant value appears multiple times in an expression, such as in mathematical formulas. It obfuscates the intent to use stack shuffling (described below) to avoid binding a name to a value to compute such things, and defining words for each value pollutes the environment, and so there are mechanisms to create scoped variable names. In general, however, concatenative expressions should not be used in such situations directly, as the infix notation of the main language is more than adequate for expressing those operations.

Stack Shuffling

Functions in most procedural languages make use of a stack to store temporary local variables. In concatenative programming, this is more true than ever (although the vocabulary is another story, analogous to the heap). Concatenative programming requires stack shuffling for the lowest level of abstraction. The operations, in plain English, are

  • duplicate (the top item)
  • swap (the top two items)
  • rotate (the top three items i.e. item 3->1, 1->2, 2->3)
  • drop (the top item)
  • over (duplicate the second item, put it on top)

These words are the interface between the low level, literal machine operations, and the high level, (hopefully) crystal clear abstractions. It is common for mid-level words to use these to get data from the environment, but it is usually discouraged as they more often than not impair the readability of the code and the ability to reason about it.

Everything is a function

Some languages, like Forth, take this to an absurd extreme, allowing this as valid:

: 5 4 ;

This redefines 5 as a word that puts 4 onto the stack. This is not a valuable feature in any language outside of experimental or esoteric programming or code obfuscators (whose value is questionable to begin with), and hence most modern concatenative languages do not allow it. Nonetheless, it highlights the notion that all expressions in a concatenative language, even literals, behave exactly like regular functions that consume no data and produce their own value.

The Concatenative Monoid

A monoid in functional programming is a set equipped with an associative binary operation that is closed over the set and has an identity value. Unlike a group, a monoid does not require inverses, and the operation need not be commutative.

Concatenative programming takes the concept of a monoid to another extreme because all functions accept the working environment as an argument and return the new working environment as a result: concatenation of programs is the monoid operation, and the empty program is its identity. This is referred to as Stack Polymorphism in stack based languages. Vellum's model includes a stack in much the same way that procedural languages use stack allocation, but it also takes advantage of registers where possible.

Quotations

Quotations are one of several methods for metaprogramming in Vellum (others including reflection, AST transformation, and templates). They can be thought of as anonymous words and create a higher level of indirection. Higher order functions (functions that accept functions as arguments) in applicative languages are analogous to words that accept quotations. Quotations, however, provide more power than lambdas or anonymous functions because they also allow for the creation of concatenative macros, i.e. transforming one or more simple functions into a more complex function at compile time.

Concatenative Best Practices

  • Definitions should be no longer than 10 words in the vast majority of cases, and have very descriptive names.
  • Definitions should always have a clear environment effect. Many concatenative languages are stack based and hence refer to this as the stack effect. Vellum uses a different model for better conceptual compatibility with the main applicative language and hence refers to this as the environment effect.
  • Quotations should not contain large sub-expressions, but preferably one or two clearly named words or literals.
  • Avoid stack shuffling in mid-level abstraction words. One of the powers of Vellum is that you can blend applicative code with concatenative code. Feel free to create functions to let the compiler handle stack shuffling for you so you can focus (and let those reading your code focus) on the intent.

While this may be obvious, concatenative programming is not a silver bullet. It can make expressing some solutions easier or clearer, but this usually requires a great deal of care to do well. The benefit is programs that are easier to read, but there is always the risk of writing incomprehensible code.

Further Reading

Will add more resources later.

Structured Syntax

Concatenative Syntax

Signatures

TBD

Modules