Compiler Overview

draft tech-compiler-internals march 03, 2021

The compiler is fairly complicated, with two different tokenizers and three different argument evaluators to go from ScriptText to compiled Blueprint code. This document details the specifics of what happens so we don't forget how it works a year from now.

graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
subgraph ALL[OUTLINE]
  T[text] -.-> CB([Split Text into Lines]) -.-> SL[script lines]
  SL -.-> ST([ScriptLine Tokenizer]) -.-> SU[script units]
  SU -.-> SC([ScriptUnit Compiler]) -.-> SMC[SMC functions]
  class CB OutNode
  class ST OutNode
  class SC OutNode
end
class ALL Block

Here's what the entire process looks like. There are two main phases during compilation:

PHASE 1: Converting ScriptText to ScriptUnits
PHASE 2: Converting ScriptUnits to SMC Functions

SMC Functions are what are executed at runtime.

The following sections describe the compilation process in more detail.

Phase 1: Convert Text to ScriptUnits

The ScriptText is a string with lines delimited by linefeeds. It is split into an array of strings, one string per line of ScriptText.

The Script Tokenizer scans the array of lines and converts each one into a ScriptUnit, which is an array of data objects consisting of a keyword node followed by an arbitrary number of argument nodes.
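As a rough illustration of Phase 1, the sketch below splits text on linefeeds and turns each line into a keyword node followed by argument nodes. The node shape (`kind`/`value`), the function names, and the sample script lines are assumptions made for this example; the real tokenizer also handles expressions, blocks, and objrefs, which are left out here. Skipping blank lines is likewise an assumption.

```typescript
// Hypothetical node shape; the actual ScriptUnit node types are not shown in this document.
type SketchNode =
  | { kind: "keyword"; value: string }
  | { kind: "number"; value: number }
  | { kind: "string"; value: string }
  | { kind: "identifier"; value: string };

type SketchUnit = SketchNode[];

// Phase 1a: split ScriptText into lines on linefeeds.
function splitLines(scriptText: string): string[] {
  return scriptText.split("\n");
}

// Phase 1b: convert one line into a ScriptUnit (keyword node + argument nodes).
function tokenizeLine(line: string): SketchUnit {
  // Match either a double-quoted string or a run of non-whitespace characters.
  const parts: string[] = line.match(/"[^"]*"|\S+/g) ?? [];
  return parts.map((part, index): SketchNode => {
    if (index === 0) return { kind: "keyword", value: part };
    if (part.length >= 2 && part.startsWith('"') && part.endsWith('"')) {
      return { kind: "string", value: part.slice(1, -1) };
    }
    if (!Number.isNaN(Number(part))) return { kind: "number", value: Number(part) };
    return { kind: "identifier", value: part };
  });
}

// Made-up sample lines, used only to exercise the sketch.
const sketchUnits: SketchUnit[] = splitLines('prop x setTo 10\nprop skin setTo "bee.json"')
  .filter((line) => line.trim() !== "")
  .map(tokenizeLine);
```

Each entry of `sketchUnits` is one ScriptUnit-like array whose first element is the keyword node.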

graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
classDef KeyFill fill:#9d9
classDef ArgFill fill:#cfc

subgraph PAR[LINE PROCESSOR]
  B1[script lines] -.-> B2
  B2([ScriptLine Tokenizer]) -.-> B3[KeyNode]
  B3 -.-> B4
  B2 -.-> B22[ArgNode1] -.-> B4[ScriptUnit Array]
  B2 -.-> B23[ArgNode2] -.-> B4
  B2 -.-> B24[ArgNodeN] -.-> B4
  B4 --> B5([ScriptUnit Compiler])
  class B3 KeyFill
  class B22 ArgFill
  class B23 ArgFill
  class B24 ArgFill
  class B2 OutNode
  class B5 OutNode
  class B6 OutNode
  class B7 OutNode
end
class PAR Block

Phase 2: Convert ScriptUnits to SMC Functions

Each ScriptUnit is a complete command followed by its parameters. Since the ScriptUnit consists of data objects, these must be processed in three steps.

First, convert all the argument nodes into arguments that are Javascript literal values, except for the ones that can't be represented this way (expressions, blocks, objrefs).
Then, look up the keyword compiler function for the keyword, and pass the converted arguments to it to generate runtime code in the form of an SMC Function.
Finally, as each ScriptLine is processed, push the resulting SMC Functions into a Program Bundle to create the Agent Blueprint. The bundle is essentially a collection of arrays of SMC Functions, which are our "executable runtime code".

graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
classDef KeyFill fill:#9d9
classDef ArgFill fill:#cfc

subgraph PAR[SCRIPTUNIT COMPILER]
  B4[ScriptUnit Array] -.-> B5([ScriptUnit Compiler])
  B5 -.-> SU[keyword, token, ...]
  SU --> |step 1|B6([Convert Tokens to Args])
  SU -.-> |step 2|B7([Compile Keyword with Args])
  B6 -.-> B8[Args]
  B7 -.-> B9[SMC Function]
  class B3 KeyFill
  class B22 ArgFill
  class B23 ArgFill
  class B24 ArgFill
  class B2 OutNode
  class B5 OutNode
  class B6 OutNode
  class B7 OutNode
end
class PAR Block

Phase 2.1: Token Conversion

The ScriptUnit array is converted from an array of Token to an array of Arg. An Arg is an "expanded" version of a Token: types that can be represented as Javascript types (literals and identifiers) are taken out of their Token wrapper. The converted Args are the input to the SMC Function generator.

graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
subgraph CEXP[Phase 2.1. Token Conversion]
  C2[keyword, token, ...] --> C4([Process Args])
  C4 -.-> C41[primitive] -.- |pass as-is|C5A[javascript literal] -.-> C5
  C4 -.-> |special|C42[expression string] -.- |convert|C5B[AST for expr-evaluator] -.-> C5
  C4 -.-> |special|C43[code block] -.- |recursive compile|C5C[array of SMC Functions] -.-> C5
  C4 -.-> |special|C44[objref] -.- |convert|C5D[objref as parts array] -.-> C5
  C5[keyword, arg1, ...]
  class C4 OutNode
end
class CEXP Block

Primitive argument types like string and number are converted directly into their Javascript equivalents.
Special type expression is converted from {{ Javascript expression }} into an ExpressionTree which can be evaluated at runtime using expr-evaluator.
Special type code block is an array of ScriptLines that were contained inside a [[ ]] block, where a trailing [[ on a line denotes the start and a single ]] on a line by itself denotes the end. This type of argnode is recursively run through the Compiler, eventually returning an array of SMC Functions.
Special type object ref is an object.property reference of arbitrary path length, used to designate properties like agent.x and agent.Costume.pose.
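The four cases above can be sketched as a single conversion function. This is a minimal illustration, not the project's code: the token wrapper shapes (`value`, `expr`, `objref`, `block`), the `{ ast }` stand-in for a parsed expression tree, and the injected `compileUnit` callback are all assumptions introduced for the example.

```typescript
// Stand-in for an SMC Function; agent and state are left untyped in this sketch.
type SketchOp = (agent: unknown, state: unknown) => void;

// Hypothetical token wrappers, one per argument type described in this section.
type SketchToken =
  | { value: string | number | boolean } // primitive
  | { expr: string }                     // text found inside {{ }}
  | { objref: string }                   // dotted path such as agent.Costume.pose
  | { block: SketchToken[][] };          // tokenized lines found inside [[ ]]

type SketchArg =
  | string | number | boolean
  | { ast: string } // placeholder for the tree a real expression parser would return
  | string[]        // objref as a parts array
  | SketchOp[];     // compiled code block

function convertToken(
  token: SketchToken,
  compileUnit: (unit: SketchToken[]) => SketchOp[]
): SketchArg {
  if ("value" in token) return token.value;               // pass as-is
  if ("expr" in token) return { ast: token.expr.trim() }; // a real version would parse here
  if ("objref" in token) return token.objref.split(".");  // path -> parts array
  // Code block: recursively compile every contained unit into one flat program.
  const ops: SketchOp[] = [];
  token.block.forEach((unit) => ops.push(...compileUnit(unit)));
  return ops;
}

// Placeholder compiler so the sketch runs without a keyword dictionary.
const noopCompile = (_unit: SketchToken[]): SketchOp[] => [() => { /* no-op */ }];

const sampleUnit: SketchToken[] = [
  { value: "prop" },
  { objref: "agent.Costume.pose" },
  { expr: " agent.x + 1 " },
  { value: 10 },
];
const convertedArgs = sampleUnit.map((token) => convertToken(token, noopCompile));
```

Passing the unit compiler in as a callback keeps the recursion explicit: a code block token is compiled with the same function that compiles top-level units.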

Phase 2.2: Keyword Compilation

The keyword's corresponding CompileFunction is passed the converted arguments from Phase 2.1. It generates an SMC Function, which is the code that is actually executed at runtime. An SMC Function is a regular Javascript function that performs the script action, using the arguments that were bound to it at compile time as needed.

graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0

subgraph KEY[Phase 2.2 Keyword Compiler Detail]
  D1[keyword, arg1, arg2, ...] -.-> D2{KeywordCompile}
  D2 --> |step 1|D3[Generate new SMC Function]
  D2 --> |step 2|D4[Bind Args to SMC Function]
  D2 --> |step 3|D5[Return SMC Function]
  class D2 OutNode
end
class KEY Block

The compiler function is interesting in that it relies on Javascript's first-class functions and closures to bind arguments at compile time for use later at runtime:

function compile(scriptUnit) {
  const [kw, arg1, arg2] = scriptUnit;
  return [
    (agent, state)=>{
      console.log(kw,arg1,arg2);
    }
  ]
}

The compile() function example above unpacks the contents of the ScriptUnit and returns an array containing an SMC Function that uses them in the execution context (agent, state) when the simulation is actually running. Thanks to closures, the arguments arg1 and arg2 are still available even after the compile() function has returned, as they are bound to the returned function inside the array.
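To make the closure behavior concrete, here is a small runnable variant that records a message on a stack instead of logging it. The agent shape (`id`), the state shape (`stack`), and the keyword name `setPosition` are made up for this sketch and are not the project's types.

```typescript
// Simplified stand-ins for the agent and state passed at runtime.
type ClosureOp = (agent: { id: string }, state: { stack: unknown[] }) => void;

function compileSketch(scriptUnit: [string, ...unknown[]]): ClosureOp[] {
  // These locals are captured by the arrow function returned below.
  const [kw, arg1, arg2] = scriptUnit;
  return [
    (agent, state) => {
      state.stack.push(`${agent.id}: ${kw} ${String(arg1)} ${String(arg2)}`);
    },
  ];
}

// "Compile time": compileSketch() runs once and returns.
const compiledOps = compileSketch(["setPosition", 10, 20]);

// "Runtime": the returned function still sees kw, arg1 and arg2.
const runState = { stack: [] as unknown[] };
compiledOps.forEach((op) => op({ id: "bee01" }, runState));
// runState.stack is now ["bee01: setPosition 10 20"]
```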

Phase 2.3: Program Bundle Assembly

As each ScriptLine in the ScriptText is compiled, the resulting SMC Function is added to one of several Program Arrays stored in a Program Bundle.

graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0

subgraph PROG[Phase 2.3. Program Bundle Assembly]
  ST[ScriptText] -.-> SL1[ScriptLine 1]
  ST -.-> SL2[ScriptLine 2]
  ST -.-> SL3[ScriptLine N]
  SL1 -.-> SU1[ScriptUnit 1]
  SL2 -.-> SU2[ScriptUnit 2]
  SL3 -.-> SU3[ScriptUnit N]
  SU1 -.-> SF1[SMCFunction 1]
  SU2 -.-> SF2[SMCFunction 2]
  SU3 -.-> SF3[SMCFunction N]
  SF1 -.-> PB([Program Bundler])
  SF2 -.-> PB
  SF3 -.-> PB
  PB -.-> AR[Array of SMCFunctions 1..N]
  class PB OutNode
end
class PROG Block

The Program Bundle actually stores several named programs for each Blueprint definition:

define - run at blueprint instantiation
update - run during simulation update phase
think - run during simulation AI phase
exec - run during simulation action phase
condition - run during global condition checks
event - run during event messaging phase

See datacore/dc-script-bundle.ts for implementation details.
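A minimal sketch of such a bundle, assuming it can be modeled as a record of named program arrays. The type and function names here are hypothetical, and the explicit `target` parameter is a simplification made for the example; dc-script-bundle.ts remains the authority on how compiled code is actually routed to a program.

```typescript
type BundleOp = (agent: unknown, state: unknown) => void;
type BundleProgramName = "define" | "update" | "think" | "exec" | "condition" | "event";
type SketchBundle = Record<BundleProgramName, BundleOp[]>;

// Create a bundle with one empty program array per named program.
function makeBundle(): SketchBundle {
  return { define: [], update: [], think: [], exec: [], condition: [], event: [] };
}

// Append compiled SMC Functions to the chosen program, preserving script order.
function addToBundle(bundle: SketchBundle, target: BundleProgramName, ops: BundleOp[]): void {
  bundle[target].push(...ops);
}

const sketchBundle = makeBundle();
addToBundle(sketchBundle, "define", [() => { /* e.g. declare a property */ }]);
addToBundle(sketchBundle, "update", [() => { /* first update op */ }, () => { /* second update op */ }]);
```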

Runtime Details

At runtime, each agent instance executes several programs that are defined in the blueprint's program bundle. Since a program is just an array of functions, running a program is just a matter of calling each function one-by-one.

Execution Context

For persistent memory operations, the current agent instance and a state object are passed to each function. The state object implements a stack and an ALU similar to that of an 8-bit microprocessor. An SMC Function that needs input data from previous SMC Functions can retrieve it from the stack. Likewise, an SMC Function can push values onto the stack.

SMC Functions have the following signature:

type SMCFunction = (agent:IAgent, state:SMState) => any[];

The GAgent class, which is the base of all Agents, implements a single execution entry point: exec(program, context, ...args). To run any program on an agent instance, call its exec() function. It looks something like this:

type TSMCProgram = Array<SMCFunction>;

class GAgent {
  .
  .
  exec_smc(program: TSMCProgram, ctx, ...args) {
    const state = new SM_State([...args], ctx);
    program.forEach((op, index) => {
      op(this, state);
    });
    return state.stack;
  }
  .
  .
}
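The following self-contained sketch mirrors the loop above with stand-in types, to show two functions communicating through the stack. `SketchState`, `SketchAgent`, `runProgram`, and the two sample ops are assumptions for illustration; the real SM_State offers more than a bare stack, and like the listing above this sketch ignores the ops' return values.

```typescript
interface SketchState { stack: unknown[]; ctx: Record<string, unknown>; }
interface SketchAgent { id: string; x: number; }
type StackOp = (agent: SketchAgent, state: SketchState) => void;

// Run every op in order against one shared state, then return the final stack.
function runProgram(
  agent: SketchAgent,
  program: StackOp[],
  ctx: Record<string, unknown> = {},
  ...args: unknown[]
): unknown[] {
  const state: SketchState = { stack: [...args], ctx };
  program.forEach((op) => op(agent, state));
  return state.stack;
}

// First op pushes a value; second op pops it, doubles it, and pushes the result.
const pushX: StackOp = (agent, state) => {
  state.stack.push(agent.x);
};
const doubleTopOfStack: StackOp = (_agent, state) => {
  const value = state.stack.pop();
  state.stack.push(typeof value === "number" ? value * 2 : value);
};

const finalStack = runProgram({ id: "bee01", x: 21 }, [pushX, doubleTopOfStack]);
// finalStack is [42]
```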

SMC Function Arguments at Runtime

Every SMC Function retains access to all the arguments provided at compile time. Because each keyword compiler function knows the order of arguments and their meaning, it can use their values as needed. For special case argument types, additional runtime processing may be necessary. For example, the value of a referenced agent property is unknown at compile time, so it must be looked up at runtime.

The expr-evaluator module provides the following runtime lookup services:

Evaluate( exprAST, context ) - evaluate the expression exprAST, which is a binary tree, using the provided context object to look up object references.
EvalArg( arg, context ) - if the passed argument is an expression or objref, return the evaluated value using the provided context. Program blocks are returned as a program array.
EvalUnitArgs( scriptUnitArray, context ) - calls EvalArg() on each argument, returning a new array of the evaluated arguments.
DerefProp( refArg ) - given an object reference path for a GAgent property, drill down into the referenced object and return the indicated agent property object (a GVar).
DerefFeatureProp( refArg ) - given an object reference path for a GFeature property, drill down into the agent's GFeature property dictionary and return the indicated property object (also a GVar).

Note on context: The context object passes { agent } at minimum, and also any other referenced agent instances keyed by their blueprint name (e.g. { agent, Bee, Fish } in when clauses). It is the responsibility of the program invoker to provide this context, as the SMCFunction cannot look it up.

To keep execution overhead lower, the Evaluate and Dereference operations are not automatically run on every SMCFunction at runtime. It's up to the keyword author to selectively apply runtime argument expansion using one of the above methods in expr-evaluator.
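As an illustration of runtime dereferencing, the sketch below resolves an objref parts array against a context object. It is not the expr-evaluator implementation: the `props` dictionary, the `SketchVar` shape, the two-argument signature, and the rule that a one-part path refers to `agent` are all assumptions made for this example.

```typescript
interface SketchVar { value: unknown; }
interface PropAgent { props: Record<string, SketchVar>; }

// Resolve a path like ["agent", "x"] or ["Bee", "energy"] to a property object.
function derefPropSketch(ref: string[], context: Record<string, PropAgent>): SketchVar {
  // Assumption: a one-part path such as ["x"] refers to the current agent.
  const [ownerName, propName] = ref.length === 1 ? ["agent", ref[0]] : ref;
  const owner = context[ownerName];
  if (owner === undefined) throw new Error(`context has no '${ownerName}'`);
  const prop = owner.props[propName];
  if (prop === undefined) throw new Error(`'${ownerName}' has no property '${propName}'`);
  return prop;
}

// The invoker supplies the context, keyed by `agent` and by blueprint name.
const sketchContext: Record<string, PropAgent> = {
  agent: { props: { x: { value: 5 } } },
  Bee: { props: { energy: { value: 9 } } },
};

const agentX = derefPropSketch(["agent", "x"], sketchContext);
const beeEnergy = derefPropSketch(["Bee", "energy"], sketchContext);
```

Because the lookup happens only when a keyword's SMC Function calls it, keywords that never touch property references pay no dereferencing cost.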

TSMCProgram Execution Entry Points

The most common place that compiled scripts run is in the GAgent class and during scriptable moments during the simulation lifecycle. You can search for them by looking for the string .exec( in the codebase, searching for the various phases listed in api-sim-gameloop.js, or consulting this list:

sim-agents Update() - calls GAgent.agentUPDATE() for every instance in the sim
sim-agents Think() - calls GAgent.agentTHINK() for every instance in the sim
sim-agents Exec() - calls GAgent.agentEXEC() for every instance in the sim
sim-agents queueUpdateMessage() - pushes a program into the agent's Update queue, executed during GAgent.agentUPDATE()
sim-agents queueThinkMessage() - pushes a program into the agent's Think queue, executed during GAgent.agentTHINK()
sim-agents queueExecMessage() - pushes a program into the agent's Exec queue, executed during GAgent.agentEXEC()
sim-conditions Update() - related to global tests which are computed once, and inspected by agent scripts during their own update
sim-conditions Update() - also runs any queued system events, looking up blueprints that have subscribed to them and sending all their instances a "handler program" that is defined by an onEvent keyword
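The queue-then-run pattern in this list can be sketched as follows. `QueueAgent` is a hypothetical stand-in for GAgent that models only the update queue; the method names are borrowed from the list above, but their bodies are assumptions.

```typescript
type QueuedOp = (agent: QueueAgent) => void;

class QueueAgent {
  log: string[] = [];
  private updateQueue: QueuedOp[][] = [];

  // Called by the simulation to schedule a program for the next update phase.
  queueUpdateMessage(program: QueuedOp[]): void {
    this.updateQueue.push(program);
  }

  // Called once per update phase: run every queued program, then clear the queue.
  agentUPDATE(): void {
    const pending = this.updateQueue;
    this.updateQueue = [];
    pending.forEach((program) => program.forEach((op) => op(this)));
  }
}

const queuedAgent = new QueueAgent();
queuedAgent.queueUpdateMessage([(agent) => { agent.log.push("handled update message"); }]);
const logLengthBeforeUpdate = queuedAgent.log.length; // nothing has run yet
queuedAgent.agentUPDATE();                            // queued program runs here
```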

There are three special case programs that use a similar mechanism but are not the same as a TSMCProgram.

test programs receive an arbitrary number of testing arguments and return true or false, and are used in sim-conditions. They are stored in a global dictionary.
filter programs are regular javascript functions that are used to filter sets of agents based on criteria. The "touches" test, for example, is such a function. They are stored in a global dictionary.
pragma programs are generated by the _pragma internal keyword, and emit runtime code that expects to receive the special COMPILER_AGENT and COMPILER_STATE contexts at compile-time. These are NOT put into a program bundle, but instead are run immediately to return pragma information or change the compiler state.

Compiler-to-Instance Call Structure

Pseudocode follows:

// CONVERTING TEXT TO BLUEPRINT
scriptUnits = ScriptifyText( text )
  lines = text.split('\n');
  scriptUnits = scriptifier.tokenize(lines)
  return scriptUnits
bundle = CompileBlueprint( scriptUnits )
  bundle = new ProgramBundle();
  check scriptUnits[0] is 'blueprint' pragma
  scriptUnits.forEach as unit
    check unit[0] is pragma and process
    check unit[0] is not duplicate 'blueprint' pragma
    objcode = r_CompileUnit(unit)
      unit = r_ExpandArgs(unit)
        unit.forEach as item
          if item.expr, return ParseExpression(expr)
          if item.objref, return as-is for proc at runtime
          if item.program, return GetProgram(program) 
          if item.block:
            blockScript = scriptifier.tokenize(block)
            blockObjcode = r_CompileBlock(blockScript)
              blockObjcode = []
              blockScript.forEach as blockUnit
                skip directives and comments
                blcode = r_CompileUnit(blockUnit) -- recursive call
                blockObjcode.push(blcode);
              return blockObjcode
            return blockObjcode
          if anything else, return arg as-is
      keyword = unit[0]
      unitCompiler = GetKeyword(keyword)
      return unitCompiler.compile(unit)
    AddToBundle(bundle,objcode)
  return bundle
RegisterBlueprint( bundle )

// ADDING AN INSTANCE DEFINITION
instanceDef = DefineInstance( specObj )
  { blueprint:string, init:TSMCProgram, name:string } from specObj
  INSTANCES is a Map of Array<specObj>
  store instance defs by blueprint name
  specObj.id = INSTANCE_COUNTER++
  INSTANCES.get(blueprint).push(specObj);

// CREATING AGENTS FROM ALL INSTANCE DEFINITIONS
instances = GetAllInstances()
instances.forEach as specObj
  MakeAgent(specObj)
    { blueprint:string, name:string } from specObj  -- this is instance name
    bundle = GetBlueprint(blueprint)
    agent = new GAgent(name)
    agent.setBlueprint(bundle)
      agent.blueprint = bundle;
      agent.exec(bundle.define);
      agent.exec(bundle.init);
    return SaveAgent(agent)
      AGENT_DICT is a Map of Map
      store agent instance id by blueprint name

Compiler Overview - theRAPTLab/gsgo GitHub Wiki