Compiler Overview - theRAPTLab/gsgo GitHub Wiki
draft tech-compiler-internals march 03, 2021
The compiler is fairly complicated, with two different tokenizers and three different argument evaluators to go from ScriptText to compiled Blueprint code. This document details the specifics of what happens so we don't forget how it works a year from now.
graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
subgraph ALL[OUTLINE]
T[text] -.-> CB([Split Text into Lines]) -.-> SL[script lines]
SL -.-> ST([ScriptLine Tokenizer]) -.-> SU[script units]
SU -.-> SC([ScriptUnit Compiler]) -.-> SMC[SMC functions]
class CB OutNode
class ST OutNode
class SC OutNode
end
class ALL Block
Here's what the entire process looks like. There are two main phases during compilation:
- PHASE 1: Converting ScriptText to ScriptUnits
- PHASE 2: Converting ScriptUnits to SMC Functions
SMC Functions are what are executed at runtime.
The following sections describe the compilation process in more detail.
The ScriptText is a string with lines demarcated by linefeeds. This is split into an array of strings, each string representing a line of ScriptText.
The Script Tokenizer scans the array of lines and converts each one into a ScriptUnit, which is an array of data objects consisting of a keyword node followed by an arbitrary number of argument nodes.
graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
classDef KeyFill fill:#9d9
classDef ArgFill fill:#cfc
subgraph PAR[LINE PROCESSOR]
B1[script lines] -.-> B2
B2([ScriptLine Tokenizer]) -.-> B3[KeyNode]
B3 -.-> B4
B2 -.-> B22[ArgNode1] -.-> B4[ScriptUnit Array]
B2 -.-> B23[ArgNode2] -.-> B4
B2 -.-> B24[ArgNodeN] -.-> B4
B4 --> B5([ScriptUnit Compiler])
class B3 KeyFill
class B22 ArgFill
class B23 ArgFill
class B24 ArgFill
class B2 OutNode
class B5 OutNode
class B6 OutNode
class B7 OutNode
end
class PAR Block
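The first phase can be sketched in a few lines. This is a minimal illustration, not the project's tokenizer: the token shapes (`DemoToken`), the function names `splitLines` and `tokenizeLine`, and the sample script lines are all assumptions made for the example, and the sketch ignores expressions, blocks and objrefs.

```typescript
// Hypothetical token shapes; the real tokenizer's node types are not shown in this document.
type DemoToken =
  | { token: string }   // keyword or identifier
  | { value: number }   // numeric literal
  | { string: string }; // quoted string literal

type DemoScriptUnit = DemoToken[];

// Split ScriptText on linefeeds, dropping blank lines.
function splitLines(scriptText: string): string[] {
  return scriptText
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0);
}

// Convert one line into a keyword node followed by argument nodes.
// Tokens are whitespace-delimited; double-quoted strings stay in one piece.
function tokenizeLine(line: string): DemoScriptUnit {
  const parts: string[] = line.match(/"[^"]*"|\S+/g) ?? [];
  return parts.map((part): DemoToken => {
    if (part.length >= 2 && part.startsWith('"') && part.endsWith('"')) {
      return { string: part.slice(1, -1) };
    }
    const num = Number(part);
    if (!Number.isNaN(num)) return { value: num };
    return { token: part };
  });
}

// Illustrative script text only; one ScriptUnit is produced per non-blank line.
const demoUnits = splitLines('prop x setTo 10\n\nprop skin setTo "bee.png"').map(tokenizeLine);
console.log(JSON.stringify(demoUnits));
```

Each resulting array starts with the keyword node, which is what the ScriptUnit Compiler later uses to pick a keyword compiler function.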
Each ScriptUnit is a complete command followed by its parameters. Since the ScriptUnit consists of data objects, these must be processed in three steps.
- First, convert all the argument nodes into arguments that are Javascript literal values, except for ones that can't be represented this way (expressions, blocks, objrefs).
- Then, look up the keyword compiler function for the keyword, and pass the converted arguments to it to generate runtime code in the form of an SMC Function.
- As each ScriptLine is processed, the resulting SMC Functions are pushed into a Program Bundle to create the Agent Blueprint. This is essentially a collection of arrays of SMC Functions, which are our "executable runtime code".
graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
classDef KeyFill fill:#9d9
classDef ArgFill fill:#cfc
subgraph PAR[SCRIPTUNIT COMPILER]
B4[ScriptUnit Array] -.-> B5([ScriptUnit Compiler])
B5 -.-> SU[keyword, token, ...]
SU --> |step 1|B6([Convert Tokens to Args])
SU -.-> |step 2|B7([Compile Keyword with Args])
B6 -.-> B8[Args]
B7 -.-> B9[SMC Function]
class B3 KeyFill
class B22 ArgFill
class B23 ArgFill
class B24 ArgFill
class B2 OutNode
class B5 OutNode
class B6 OutNode
class B7 OutNode
end
class PAR Block
The ScriptUnit array is converted from an array of Token to an array of Arg. An Arg is the "expanded" form of a ScriptUnit element: types that can be represented as Javascript types (literals and identifiers) are taken out of their Token wrapper. The converted Tokens are the input to the SMC Function generator.
graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
subgraph CEXP[Phase 2.1. Token Conversion]
C2[keyword, token, ...] --> C4([Process Args])
C4 -.-> C41[primitive] -.- |pass as-is|C5A[javascript literal] -.-> C5
C4 -.-> |special|C42[expression string] -.- |convert|C5B[AST for expr-evaluator] -.-> C5
C4 -.-> |special|C43[code block] -.- |recursive compile|C5C[array of SMC Functions] -.-> C5
C4 -.-> |special|C44[objref] -.- |convert|C5D[objref as parts array] -.-> C5
C5[keyword, arg1, ...]
class C4 OutNode
end
class CEXP Block
- Primitive argument types like string and number are converted directly into their Javascript equivalent.
- The special type expression is converted from {{ Javascript expression }} into an ExpressionTree which can be evaluated at runtime using expr-evaluator.
- The special type code block is an array of ScriptLines that were contained inside a [[ ]] block, where a trailing [[ on a line denotes the start and a single ]] on a line by itself denotes the end. This type of argnode is recursively run through the Compiler, eventually returning an array of SMC Functions.
- The special type object ref is an object.property reference of arbitrary path length, used to designate properties like agent.x and agent.Costume.pose.
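The four conversion cases can be sketched as a single dispatch function. Everything here is an assumption for illustration: the token and arg shapes, the names `expandArg` and `ExpandHooks`, and the stub parser and block compiler, which stand in for the real expression parser and the recursive compiler call.

```typescript
type SketchSMC = (agent: unknown, state: unknown) => void;

// Hypothetical token wrappers produced by the tokenizer.
type SketchToken =
  | { value: number | boolean }
  | { string: string }
  | { expr: string }
  | { objref: string }
  | { block: string[] };

// Hypothetical expanded args handed to the keyword compiler.
type SketchArg =
  | number | boolean | string
  | { exprAST: unknown }
  | { objref: string[] }
  | SketchSMC[];

// The expression parser and block compiler are injected so the sketch stays self-contained.
interface ExpandHooks {
  parseExpr: (source: string) => unknown;
  compileBlock: (lines: string[]) => SketchSMC[];
}

function expandArg(tok: SketchToken, hooks: ExpandHooks): SketchArg {
  if ('value' in tok) return tok.value;                              // primitive: pass as-is
  if ('string' in tok) return tok.string;                            // primitive: pass as-is
  if ('expr' in tok) return { exprAST: hooks.parseExpr(tok.expr) };  // convert to an AST
  if ('objref' in tok) return { objref: tok.objref.split('.') };     // convert to a parts array
  return hooks.compileBlock(tok.block);                              // recursive compile
}

// Stubs only: a real parser would build a tree, a real compiler would emit working functions.
const stubHooks: ExpandHooks = {
  parseExpr: source => ({ type: 'stub', source }),
  compileBlock: lines => lines.map(() => () => {}),
};

const sampleTokens: SketchToken[] = [
  { value: 10 },
  { expr: 'agent.x + 1' },
  { objref: 'agent.Costume.pose' },
  { block: ['prop x setTo 0'] },
];
const expandedArgs = sampleTokens.map(tok => expandArg(tok, stubHooks));
console.log(expandedArgs);
```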
The keyword's corresponding CompileFunction is passed the converted arguments from Step 1. It generates a SMC Function that is the code that is actually executed during runtime. An SMC Function is a regular Javascript function that performs the script action, using any arguments that were passed to it at compile time as needed.
graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
subgraph KEY[Phase 2.2 Keyword Compiler Detail]
D1[keyword, arg1, arg2, ...] -.-> D2{KeywordCompile}
D2 --> |step 1|D3[Generate new SMC Function]
D2 --> |step 2|D4[Bind Args to SMC Function]
D2 --> |step 3|D5[Return SMC Function]
class D2 OutNode
end
class KEY Block
The compiler function is interesting in that it relies on Javascript's first-class functions and closures to keep compile-time arguments available later, at runtime:
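function compile(scriptUnit) {
  // unpack the keyword and its already-converted arguments
  const [kw, arg1, arg2] = scriptUnit;
  // return a program: an array holding one SMC Function
  return [
    (agent, state) => {
      // kw, arg1, arg2 are captured by the closure and read at runtime
      console.log(kw, arg1, arg2);
    }
  ];
}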
The compile() function example above unpacks the contents of the ScriptUnit and returns an array containing an SMC Function that uses them in the execution context (agent, state) when the simulation is actually running. Thanks to closures, the arguments arg1 and arg2 are still available even after the compile() function has ended, as they are bound to the returned function inside the array.
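To see the closure at work, here is a runnable variant of the same pattern. It is a sketch under assumptions: `compileSketch`, the numeric argument types, and the simplified agent and state shapes are invented for the example, and the function pushes to a stack instead of logging so the result can be inspected.

```typescript
// Simplified, hypothetical shapes for the execution context.
type ClosureSMC = (agent: { id: number }, state: { stack: unknown[] }) => void;

function compileSketch(scriptUnit: [string, number, number]): ClosureSMC[] {
  const [kw, arg1, arg2] = scriptUnit;
  return [
    (agent, state) => {
      // kw, arg1 and arg2 were bound at compile time; agent and state arrive at runtime.
      state.stack.push(`${kw}:${agent.id}:${arg1 + arg2}`);
    },
  ];
}

// "Compile time": compileSketch() runs once and returns.
const closureProgram = compileSketch(['add', 2, 3]);

// "Runtime": the compile-time arguments are still reachable through the closure.
const closureState = { stack: [] as unknown[] };
closureProgram.forEach(op => op({ id: 7 }, closureState));
console.log(closureState.stack); // [ 'add:7:5' ]
```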
As each ScriptLine in the ScriptText is compiled, the resulting SMC Function is added to one of several Program Arrays stored in a Program Bundle.
graph LR
classDef OutNode color:red,stroke-width:2px,stroke:#ff0000
classDef Block fill:#fff0d0
subgraph PROG[Phase 2.3. Program Bundle Assembly]
ST[ScriptText] -.-> SL1[ScriptLine 1]
ST -.-> SL2[ScriptLine 2]
ST -.-> SL3[ScriptLine N]
SL1 -.-> SU1[ScriptUnit 1]
SL2 -.-> SU2[ScriptUnit 2]
SL3 -.-> SU3[ScriptUnit N]
SU1 -.-> SF1[SMCFunction 1]
SU2 -.-> SF2[SMCFunction 2]
SU3 -.-> SF3[SMCFunction N]
SF1 -.-> PB([Program Bundler])
SF2 -.-> PB
SF3 -.-> PB
PB -.-> AR[Array of SMCFunctions 1..N]
class PB OutNode
end
class PROG Block
The Program Bundle actually stores several named programs for each Blueprint definition:
- define - run at blueprint instantiation
- update - run during simulation update phase
- think - run during simulation AI phase
- exec - run during simulation action phase
- condition - run during global condition checks
- event - run during event messaging phase
See datacore/dc-script-bundle.ts for implementation details.
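As a rough mental model, a bundle can be pictured as a record of named program arrays. The class below is a hypothetical sketch and not the contents of dc-script-bundle.ts; the names `SketchBundle` and `add` are assumptions, and only the six program names come from the list above.

```typescript
type BundleSMC = (agent: unknown, state: unknown) => void;
type ProgramName = 'define' | 'update' | 'think' | 'exec' | 'condition' | 'event';

// Hypothetical stand-in for a Program Bundle: one array of SMC Functions per named program.
class SketchBundle {
  readonly programs: Record<ProgramName, BundleSMC[]> = {
    define: [], update: [], think: [], exec: [], condition: [], event: [],
  };

  // Append the functions produced by compiling one ScriptUnit to the chosen program.
  add(target: ProgramName, code: BundleSMC[]): void {
    this.programs[target].push(...code);
  }
}

const sketchBundle = new SketchBundle();
const noop: BundleSMC = () => {};
sketchBundle.add('define', [noop, noop]);
sketchBundle.add('update', [noop]);
console.log(sketchBundle.programs.define.length, sketchBundle.programs.update.length); // 2 1
```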
At runtime, each agent instance executes several programs that are defined in the blueprint's program bundle. Since a program is just an array of functions, running a program is just a matter of calling each function one-by-one.
To provide memory that persists between operations, the current agent instance and a state object are passed to each function. The state object implements a stack and an ALU similar to that of an 8-bit microprocessor. An SMC Function that needs to receive input data from previous SMC Functions can retrieve it from the stack. Likewise, an SMC Function can push values onto the stack.
SMC Functions have the following signature:
type SMCFunction = (agent:IAgent, state:SMState) => any[];
The GAgent class, which is the base of all Agents, implements a single execution entry point: exec( program, context, ...args ). To run any program on an agent instance, call its exec() function. It looks something like this:
type TSMCProgram = Array<SMCFunction>;
class GAgent {
  // ...
  exec_smc(program: TSMCProgram, ctx, ...args) {
    const state = new SM_State([...args], ctx);
    program.forEach((op, index) => {
      op(this, state);
    });
    return state.stack;
  }
  // ...
}
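The stack-passing idea can be tried out in isolation. The following is a simplified sketch, not the real SM_State or GAgent: `SketchState`, `SketchAgent`, `runProgram` and the three sample operations are assumptions, and the functions return nothing rather than following the exact SMCFunction signature.

```typescript
// Hypothetical minimal state object: just a stack seeded with the caller's arguments.
class SketchState {
  stack: unknown[];
  constructor(initial: unknown[] = []) {
    this.stack = [...initial];
  }
  push(value: unknown): void {
    this.stack.push(value);
  }
  pop(): unknown {
    return this.stack.pop();
  }
}

interface SketchAgent { x: number }
type StackSMC = (agent: SketchAgent, state: SketchState) => void;

// Three tiny operations that communicate only through the stack.
const pushX: StackSMC = (agent, state) => state.push(agent.x);
const pushConst = (n: number): StackSMC => (_agent, state) => state.push(n);
const addTopTwo: StackSMC = (_agent, state) => {
  const b = state.pop() as number;
  const a = state.pop() as number;
  state.push(a + b);
};

// Running a program means calling each function in order with the same agent and state.
function runProgram(agent: SketchAgent, program: StackSMC[], ...args: unknown[]): unknown[] {
  const state = new SketchState(args);
  program.forEach(op => op(agent, state));
  return state.stack;
}

const stackResult = runProgram({ x: 4 }, [pushX, pushConst(6), addTopTwo]);
console.log(stackResult); // [ 10 ]
```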
Every SMC Function retains access to all the arguments provided at Compile time. Because each keyword compiler function knows the order of arguments and their meaning, it can use their values as-needed. For special case argument types, additional runtime processing may be necessary. For example, reference to an agent property value is unknown at compile time, so it must be looked-up at run time.
The expr-evaluator module provides the following runtime lookup services:
- Evaluate( exprAST, context ) - evaluate the expression exprAST, which is a binary tree, using the provided context object to look up object references.
- EvalArg( arg, context ) - if the passed argument is an expression or objref, return the evaluated value using the provided context. Program blocks are returned as a program array.
- EvalUnitArgs( scriptUnitArray, context ) - calls EvalArg() on each argument, returning a new array of the evaluated arguments.
- DerefProp( refArg ) - given an object reference path for a GAgent property, drill down into the referenced object and return the indicated agent property object (a GVar).
- DerefFeatureProp( refArg ) - given an object reference path for a GFeature property, drill down into the agent's GFeature property dictionary and return the indicated property object (also a GVar).
Note on context: The context object passes { agent } at minimum, and also any other referenced agent instances under their blueprint name (e.g. { agent, Bee, Fish } in when clauses). It is the responsibility of the program invoker to provide this context, as the SMCFunction cannot look it up.
To keep execution overhead low, the Evaluate and Dereference operations are not automatically run on every SMCFunction at runtime. It's up to the keyword author to selectively apply runtime argument expansion using one of the above methods in expr-evaluator.
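A keyword author's selective expansion might look roughly like the sketch below. It is not the expr-evaluator module: `evalArgSketch` and `derefSketch` are hypothetical names, the sketch handles only objrefs and primitives, and it returns the raw value where the real dereference functions are described as returning a GVar.

```typescript
// Hypothetical arg and context shapes for the example.
type RuntimeArg = number | string | boolean | { objref: string[] };
type EvalContext = { agent: Record<string, unknown>; [blueprintName: string]: Record<string, unknown> };

// Walk an objref path (e.g. ['agent', 'Costume', 'pose']) starting from the context object.
function derefSketch(parts: string[], context: EvalContext): unknown {
  let cursor: unknown = context;
  for (const part of parts) {
    if (typeof cursor !== 'object' || cursor === null || !(part in cursor)) {
      throw new Error(`cannot resolve '${parts.join('.')}' at '${part}'`);
    }
    cursor = (cursor as Record<string, unknown>)[part];
  }
  return cursor;
}

// Expand only the args that need it; primitives are returned untouched.
function evalArgSketch(arg: RuntimeArg, context: EvalContext): unknown {
  if (typeof arg === 'object' && arg !== null && 'objref' in arg) {
    return derefSketch(arg.objref, context);
  }
  return arg;
}

// The invoker supplies the context; the compiled function cannot look it up itself.
const sketchContext: EvalContext = { agent: { x: 5, Costume: { pose: 'fly' } } };
const resolvedPose = evalArgSketch({ objref: ['agent', 'Costume', 'pose'] }, sketchContext);
const resolvedLiteral = evalArgSketch(42, sketchContext);
console.log(resolvedPose, resolvedLiteral); // fly 42
```

Because the lookup happens inside the call, the value reflects the agent's state at the moment the program runs, not at compile time.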
Compiled scripts most commonly run in the GAgent class and at scriptable moments in the simulation lifecycle. You can find them by searching for the string .exec( in the codebase, searching for the various phases listed in api-sim-gameloop.js, or consulting this list:
- sim-agents Update() - calls GAgent.agentUPDATE() for every instance in the sim
- sim-agents Think() - calls GAgent.agentTHINK() for every instance in the sim
- sim-agents Exec() - calls GAgent.agentEXEC() for every instance in the sim
- sim-agents queueUpdateMessage() - pushes a program into the agent's Update queue, executed during GAgent.agentUPDATE()
- sim-agents queueThinkMessage() - pushes a program into the agent's Think queue, executed during GAgent.agentTHINK()
- sim-agents queueExecMessage() - pushes a program into the agent's Exec queue, executed during GAgent.agentEXEC()
- sim-conditions Update() - related to global tests which are computed once, and inspected by agent scripts during their own update
- sim-conditions Update() - also runs any queued system events, looking up blueprints that have subscribed to them and sending all their instances a "handler program" that is defined by an onEvent keyword
There are three special case programs that use a similar mechanism but are not the same as a TSMCProgram.
- test programs receive an arbitrary number of testing arguments and return true or false, and are used in sim-conditions. They are stored in a global dictionary.
- filter programs are regular Javascript functions that are used to filter sets of agents based on criteria. The "touches" test, for example, is such a function. They are stored in a global dictionary.
- pragma programs are generated by the _pragma internal keyword, and emit runtime code that expects to receive the special COMPILER_AGENT and COMPILER_STATE contexts at compile-time. These are NOT put into a program bundle, but instead are run immediately to return pragma information or change the compiler state.
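The "global dictionary" storage for test and filter programs can be pictured as two maps keyed by name. This is only a sketch: the map names, the function signatures, and the sample entries `greaterThan` and `leftOf` are invented for illustration and are not taken from sim-conditions.

```typescript
interface DictAgent { id: number; x: number }

// Hypothetical signatures: tests return a boolean, filters return a subset of agents.
type SketchTest = (...args: number[]) => boolean;
type SketchFilter = (agents: DictAgent[], ...criteria: number[]) => DictAgent[];

// Two global dictionaries, looked up by name at runtime.
const SKETCH_TESTS = new Map<string, SketchTest>();
const SKETCH_FILTERS = new Map<string, SketchFilter>();

SKETCH_TESTS.set('greaterThan', (a, b) => a > b);
SKETCH_FILTERS.set('leftOf', (agents, limit) => agents.filter(a => a.x < limit));

// Look up a program by name, failing loudly if it was never registered.
function getOrThrow<T>(dict: Map<string, T>, key: string): T {
  const found = dict.get(key);
  if (found === undefined) throw new Error(`no program registered as '${key}'`);
  return found;
}

const sketchAgents: DictAgent[] = [{ id: 1, x: 10 }, { id: 2, x: 50 }, { id: 3, x: 90 }];
const testPassed = getOrThrow(SKETCH_TESTS, 'greaterThan')(5, 3);
const leftAgents = getOrThrow(SKETCH_FILTERS, 'leftOf')(sketchAgents, 60);
console.log(testPassed, leftAgents.map(a => a.id)); // true [ 1, 2 ]
```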
Pseudocode follows:
// CONVERTING TEXT TO BLUEPRINT
scriptUnits = ScriptifyText( text )
  lines = text.split('\n');
  scriptUnits = scriptifier.tokenize(lines)
  return scriptUnits
bundle = CompileBlueprint( scriptUnits )
  bundle = new ProgramBundle();
  check scriptUnits[0] is 'blueprint' pragma
  scriptUnits.forEach as unit
    check unit[0] is pragma and process
    check unit[0] is not duplicate 'blueprint' pragma
    objcode = r_CompileUnit(unit)
      unit = r_ExpandArgs(unit)
        unit.forEach as item
          if item.expr, return ParseExpression(expr)
          if item.objref, return as-is for proc at runtime
          if item.program, return GetProgram(program)
          if item.block:
            blockScript = scriptifier.tokenize(block)
            blockObjcode = r_CompileBlock(blockScript)
              blockObjcode = []
              blockScript.forEach as blockUnit
                skip directives and comments
                blcode = r_CompileUnit(blockUnit) -- recursive call
                blockObjcode.push(blcode);
              return blockObjcode
            return blockObjcode
          if anything else, return arg as-is
      keyword = unit[0]
      unitCompiler = GetKeyword(keyword)
      return unitCompiler.compile(unit)
    AddToBundle(bundle,objcode)
  return bundle
RegisterBlueprint( bundle )
// ADDING AN INSTANCE DEFINITION
instanceDef = DefineInstance( specObj )
  { blueprint:string, init:TSMCProgram, name:string } from specObj
  INSTANCES is a Map of Array<specObj>
  store instance defs by blueprint name
  specObj.id = INSTANCE_COUNTER++
  INSTANCES.get(blueprint).push(specObj);
// CREATING AGENTS FROM ALL INSTANCE DEFINITIONS
instances = GetAllInstances()
instances.forEach as specObj
  MakeAgent(specObj)
    { blueprint:string, name:string } from specObj -- this is instance name
    bundle = GetBlueprint(blueprint)
    agent = new GAgent(name)
    agent.setBlueprint(bundle)
      agent.blueprint = bundle;
      agent.exec(bundle.define);
    agent.exec(bundle.init);
    return SaveAgent(agent)
      AGENT_DICT is a Map of Map
      store agent instance id by blueprint name