Logical REXX VM

REXX VM Specification

Page Status: This page is a work in progress - incomplete, inconsistent and full of errors ... read with care!

Acknowledgement

Much of this work is based on the excellent "Language Implementation Patterns" by Terence Parr.

https://pragprog.com/titles/tpdsl/language-implementation-patterns/

REXX VM Components - Summary

The REXX interpreter simulates a computer with the following components:

  • Code memory: This 64 or 32-bit unsigned integer array holds a program’s "bytecode" instructions (bytecodes plus operands). Addresses are integers.

  • ip register: The instruction pointer is a special-purpose “register” that points into code memory at the next instruction to execute.

  • CPU: To execute instructions, the interpreter has a simulated CPU that conceptually amounts to a loop around a giant “switch on bytecode” statement. This is called the instruction dispatcher. We will use a Threaded VM instead.

  • Constant pool: Anything that we can’t store as an integer operand goes into the constant pool. This includes strings, arbitrary precision numbers, and function symbols. Instructions like say [STRING] use an index into the constant pool instead of the actual operand.

  • Function call stack: The interpreter has a stack to hold function call return addresses as well as parameters and local variables.

  • fp register: A special-purpose register called the frame pointer that points to the top of the function call stack. StackFrame represents the information we need to invoke functions.

  • Register set: an effectively infinite, uniform set of registers per function call. Functions can access any element in the register array, whereas a stack can only access elements at the end. Registers hold integer, float, string and object values, plus flags that record which value formats are currently valid (a sketch follows this list).
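
To illustrate the register model, a register slot could be a tagged value able to hold several formats at once. This is a sketch only - the names, flag values and layout below are assumptions, not the actual cREXX definitions:

/* Illustrative sketch only - one register slot holding several value
   formats, with flags recording which formats are currently valid. */
#define RX_INT_VALID    0x01
#define RX_FLOAT_VALID  0x02
#define RX_STRING_VALID 0x04
#define RX_OBJECT_VALID 0x08

typedef struct rxvalue {
    unsigned int flags;         /* which members below currently hold a valid value */
    long long    int_value;     /* integer form */
    double       float_value;   /* float form */
    char        *string_value;  /* string form (e.g. pointing into string storage) */
    void        *object_value;  /* object form (e.g. a struct instance) */
} rxvalue;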

The following capabilities will be implemented in REXX Level B - REXX called by the virtual machine

  • Variable Pool: Holds slots for variables; each function/procedure has its own pool, and a function/procedure can "link" to variables from the parent's pool (to support dynamic expose scenarios). The memory slots can point at Integer, Float, String, and struct instances. Variables are accessed via a linked register. This linking can be done statically or dynamically via a variable search capability.

  • Runtime libraries that are used as "exits"; for example, REXX code to format error messages.

REXX Machine Architecture

cREXX will target a bytecode interpreter tuned for the specific needs of the REXX language. This has two advantages:

  • Provides platform independence and portability

  • Provides a much simpler target for the cREXX compiler compared to real CPUs' instruction sets (which after all have to be implemented in hardware)

A bytecode interpreter is like a real processor. The biggest difference is that bytecode instruction sets are much simpler. Also, for example, we can assume there are an infinite number of registers.

The cREXX VM is a register based bytecode interpreter that uses Threading and Super-Instructions, meaning:

Register Based

A register-based VM's instructions operate on registers rather than on an operand stack. The cREXX VM gives each stack frame an “infinite” set of registers.

Threading

Threading means that the "bytecode" loader converts instruction opcodes to the address of the subroutine that emulates the instruction. This removes a level of indirection from each instruction dispatch (i.e. instead of switch(opcode) { case ... } we can do goto *opcode_address).

In addition, the code to dispatch to the next instruction is repeated at the end of each instruction subroutine rather than having a single dispatcher. This allows the real CPU's pipelining logic to work better.

The following two examples demonstrate the concept.

Traditional Interpreter

/* opcode values (ICONST_1, ICONST_2, IADD, ...) are assumed defined elsewhere */
char code[] = {              /* bytecode program: push 1, push 2, add, ... */
  ICONST_1, ICONST_2,
  IADD, ...
};
char *pc = code;             /* program counter */
int stack[1024];             /* operand stack */
int *sp = stack - 1;         /* stack pointer - points at the top element */

/* dispatch loop */
while (true) {
  switch (*pc++) {
    case ICONST_1: *++sp = 1; break;
    case ICONST_2: *++sp = 2; break;
    case IADD:
      sp[-1] += *sp; --sp; break;
    ...
  }
}

Equivalent Threaded Interpreter

void *code[] = {             /* "bytecode" after threading: handler addresses */
  &&ICONST_1, &&ICONST_2,
  &&IADD, ...
};
void **pc = code;            /* program counter */

goto **(pc);                 /* dispatch the first instruction */

/* instruction implementations (GCC computed-goto labels) */
ICONST_1: pc++; *++sp = 1; goto **(pc);
ICONST_2: pc++; *++sp = 2; goto **(pc);
IADD:
  pc++; sp[-1] += *sp; --sp; goto **(pc);
...

Note how the next instruction pointer pc is calculated up front (pc++) before the instruction logic. This allows the CPU pipeline to work: by the time the goto is being decoded, the new value of pc should already be available.

Super Instructions

In any interpreter the dispatch to the next instruction is an overhead. The fewer instructions needed to execute the logic, the less dispatching overhead. Moreover, a complex instruction's native implementation code running linearly maximises the effectiveness of the real CPU's pipeline.

We will therefore have a large number of instructions to cater for common sequences. For example, a decrement instruction (decr) is often followed by a branch-if-not-zero instruction (brnz), e.g. in a loop; a super-instruction (e.g. decrbrnz) combines these two instructions into one, as sketched below.
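
As a sketch (the opcode names and operand layout are assumptions, not the actual cREXX instruction set), decr, brnz and the combined decrbrnz could look like this in the threaded style shown earlier, with reg being the current frame's register array, next a void ** work variable, and pc[1]/pc[2] holding a register number and an already-threaded branch target:

/* Illustrative sketch only */
DECR:                                /* decrement a register */
  next = pc + 2;                     /* opcode + 1 operand */
  reg[(size_t)pc[1]]--;
  pc = next; goto **(pc);

BRNZ:                                /* branch if a register is not zero */
  next = reg[(size_t)pc[1]] ? (void **)pc[2] : pc + 3;
  pc = next; goto **(pc);

DECRBRNZ:                            /* super-instruction: decrement, then branch if not zero */
  next = --reg[(size_t)pc[1]] ? (void **)pc[2] : pc + 3;
  pc = next; goto **(pc);

With the super-instruction a loop body dispatches once per iteration instead of twice.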

Low-level Function Instructions

The VM provides low-level functions as instructions, covering areas such as variable manipulation (e.g. substr()), IO functions, environment access, and virtual hardware like the timer. This ensures that:

  • Platform independence / platform drivers / porting etc. is achieved only by implementing these instructions for each platform.

  • Higher level functionality can all be implemented in REXX.

Instruction coding

There is a balance here - memory usage versus performance, made more complicated because reducing memory usage also has performance benefits. Tests will be needed across different CPU types to validate our optimisations.

Compiled Mode

REXX VM binary code can be generated for storage (i.e. compiled) to be loaded and run later. In this case the code will have to be loaded and then threaded.

The threading process converts the op_codes into addresses, branch targets into addresses, and constant indexes into addresses - everything is converted into real addresses on the machine running the REXX VM, so that when executed (for example) data is accessed via its real address with no lookups/indexes needed. This all needs to be done at load time because the exact real addresses will be different on each machine and for each run (modern OSes randomise load addresses).

We will investigate if there is value in compressing the saved REXX VM binary to speed load times. Java uses ZIP for this and cREXX could do the same - and/or we could pack the instruction coding.

In the short term we will save it as 32 or 64 bit instructions (see following).

This means that the threading process can simply replace the opcode with the opcode's handler address: the 32- or 64-bit memory address fits in the 32- or 64-bit integer instruction slot, so we overwrite the opcode integer with the opcode function's address, as sketched below.
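
A minimal sketch of such a load-time threading pass follows. handler_address and operand_count are hypothetical lookup tables (opcode to handler address, and opcode to number of operands); the real loader will also need to rewrite branch targets and constant-pool indexes as described above:

#include <stdint.h>
#include <stddef.h>

extern void *handler_address[];       /* hypothetical: opcode -> implementation address */
extern const size_t operand_count[];  /* hypothetical: opcode -> number of operands */

/* Illustrative sketch only - overwrite each stored opcode with the address of
   the subroutine that implements it, skipping over its operands. */
void thread_program(uint64_t *code, size_t length) {
    size_t i = 0;
    while (i < length) {
        uint64_t opcode = code[i];
        code[i] = (uint64_t)(uintptr_t)handler_address[opcode];
        /* branch targets and constant-pool indexes would also be rewritten here */
        i += 1 + operand_count[opcode];
    }
}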

Interpreter Mode

When interpreting a REXX program the compiler will emit code that will be executed there and then. In principle all the real addresses will be known and this means that the compiler can generate threaded code from the get-go. This way the loading / threading stage can be skipped.

Initial versions will only use the Compiled Mode while the Interpreter Mode is checked for feasibility and to see if there is a noticeable performance gain. One issue we may have is that the loader/threader may be needed anyway for other reasons like late binding/linking of libraries.

Endianness

Where the compiler and runtime are known to be on the same machine then the Endianness and float format will be machine specific.

Where the REXX VM binary is to be saved, big-endian ("network order") byte order will be used, and floats will be stored as big-endian IEEE 754 binary64 doubles.

The loader / threader will need to convert as required, as sketched below.
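
As a sketch of what that conversion might look like (assuming 64-bit saved slots and a byte-wise swap rather than platform-specific intrinsics), the loader could convert each slot like this:

#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only - read a big-endian ("network order") 64-bit slot
   into host byte order, one byte at a time, so it works on any host. */
static uint64_t read_big_endian64(const unsigned char *p) {
    uint64_t value = 0;
    for (int b = 0; b < 8; b++) value = (value << 8) | p[b];
    return value;
}

/* Convert a saved program image into native-order code memory at load time. */
void load_image(const unsigned char *image, uint64_t *code, size_t slots) {
    for (size_t i = 0; i < slots; i++)
        code[i] = read_big_endian64(image + i * 8);
}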

32 / 64 bit

Although we are only expecting a few hundred opcodes (if that!!), we need to store them in an integer of the same size as the machine address pointer size because of the need to overwrite the op_code with its implementation address.

For a 64bit OS we therefore need to store all opcodes and register numbers as 64-bit integers.

Note: If not for threading, there could have been many schemes to compress this, for example a 64bit integer could have held a 16-bit opcode and 3 16-bit register numbers as parameters - very efficient. Therefore we will confirm that the threaded interpreter is still faster than a classic interpreter using this more efficient scheme. Modern CPUs may be much smarter at pipelining!

The 64 bit machine code format is therefore as follows (a sketch of a code-memory slot follows the list):

  • A 64bit Opcode, followed by
  • Zero or more (depending on opcode) 64bit parameters. Each parameter (depending on the opcode) can be one of:
    • An integer constant
    • A double float constant
    • Index to an item in the constant pool (e.g. a String, High-precision Number or Object Constant)
    • Register Number
    • An instruction Number (i.e. a virtual address for a jump)
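
A sketch of one 64-bit code-memory slot is shown below (the union and member names are illustrative assumptions): before threading the first slot of an instruction holds the opcode number, after threading it holds the handler address, and the following slots hold the operands listed above.

#include <stdint.h>

/* Illustrative sketch only - one 64-bit slot in code memory */
typedef union instruction_slot {
    uint64_t opcode;          /* opcode number (before threading) */
    void    *handler;         /* handler address (after threading) */
    int64_t  int_constant;    /* integer constant operand */
    double   float_constant;  /* IEEE 754 binary64 constant operand */
    uint64_t index;           /* constant-pool index, register number,
                                 or instruction number (jump target) */
} instruction_slot;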

For 32-bit machines we can use two options

  1. Have an alternative but equivalent format using 32bit integers/floats/addresses. This reduces the size of binaries but reduces compatibility between 64-bit and 32-bit machines. Two versions of binaries may be needed, or the loader could convert on load. An issue is that float constants would need to be put in the constant pool.
  2. Use the 64-bit format for 32 bit machines. This means that 50% of the program space will be wasted but will provide compatibility. Note that only 4,294,967,295 registers per stack frame will be supported!

We will explore the different options here including looking at program size and performance.

Standards/Rules for VM Instruction Implementation

The following are the rules that should be followed when implementing VM Instructions.

Key Principles

  1. To maximise CPU performance with respect to speculative execution (including simple pipelining) branches should be avoided and if required (e.g. when dispatching to the next instruction) the target address should be calculated and assigned as early as possible.

  2. The compiler (or human writing the assembler code) is responsible for producing valid assembler, including ensuring that assembler registers are initialised and have the required type, and ensuring (for example) that bounds checks have been done. Run-time validation at the VM instruction level should be avoided except:

  • When in debug mode
  • When "safe" versions of instructions are used by the compiler - see later.

Badly formed REXX Assembler code will have undefined runtime behaviour.

The justification for this approach is that SPECTRE attacks mean it is possible for a program to access all process memory in a VM despite any access checks implemented by the VM (see Google's SPECTRE JavaScript demonstrator).

For information, a SPECTRE attack exploits a side-effect of speculative execution: a data read performed by speculatively executed code (code that should never architecturally run) pulls data into the L1 cache, and its presence there can then be detected through timing. My assessment (and that of published security advice) is that designers should assume untrusted code can read an entire process memory space (and across a CPU core too), and that countermeasures will have limited impact because of the fundamental need for the performance boost provided by speculative execution and caches (we are not giving those up!)

Therefore, when deploying cREXX code in a secure manner, each VM instance will need to be in its own process space - and that is the way we will need to protect against nefarious code. All the clever work done in the Java and .NET VMs to prove code is safe might very well prove to be pointless :-(

Rules

  1. Calculate the next instruction address as early as possible.
  2. Do not use "if" statements or run-time validation.
  3. Conversion errors should cause SIGNALs ("goto SIGNAL;").
  4. When needed by the compiler we will define large instructions that combine several instructions into one - this avoids multiple instruction dispatches.
  5. Use DEBUG macros for debug-only validation - these will not be included in production/release compiled code.
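
Putting these rules together, a couple of instruction handlers might look like the following sketch. The opcode names, the DEBUG_CHECK_REGISTER macro, the rxvalue register layout sketched earlier and the string_to_int helper are illustrative assumptions, not the actual cREXX source:

/* Illustrative sketch only */
ITOF:                                          /* make the float form of a register valid */
  next = pc + 2;                               /* Rule 1: next address calculated first */
  DEBUG_CHECK_REGISTER((size_t)pc[1]);         /* Rule 5: debug-build-only validation */
  reg[(size_t)pc[1]].float_value = (double)reg[(size_t)pc[1]].int_value;
  reg[(size_t)pc[1]].flags |= RX_FLOAT_VALID;
  pc = next; goto **(pc);                      /* Rule 2: no "if"s, no run-time checks */

STOI:                                          /* string to integer - conversion can fail */
  next = pc + 2;
  if (!string_to_int(&reg[(size_t)pc[1]])) goto SIGNAL;   /* Rule 3: errors raise a SIGNAL */
  pc = next; goto **(pc);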