Parser:Formatter - hpgDesigns/hpgdesigns-dev.io GitHub Wiki

One of the most basic concepts behind compiling a game made in ENIGMA is converting the high-level language available to the user into a lower-level language; here, C++. In our case, C++ is already a decently high-level language, so ENIGMA inherits features from it as well. The trick becomes determining which differences need to be patched during the parse so that a competent C++ compiler can read the result. These changes include the following:

  1. Semicolons need to be added. This is the most obvious change. Parentheses need to be added as well.
  2. Template parameters that have no default need to be defaulted to variant, according to ENIGMA's standard.
  3. The . operator needs to be handled to enable integer access for all types and object classes. This means that the left-hand operand of . must be checked for type and, if the type does not contain the right-hand operand, cast to an integer and passed to the instance system. If the type contains neither the right-hand operand nor an integer conversion method, an error should be thrown.
  4. High-level controls such as switch() and with() need to be replaced with lower-level calls and parsed appropriately.

Design

ENIGMA's parser uses a pair of strings to perform the parse. One is called code, the other synt. The former contains the human-readable version of the code; the latter is a copy of it with words and some symbols replaced by more easily identified "tokens." Among these are 'n' for "name": a simple identifier; 't' for "type": anything that can be used to declare; and '0': any numeral.
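The code/synt pairing can be illustrated with a minimal sketch. Everything named here (lex, kTypes, kLocals, and the tiny keyword tables) is hypothetical and is not ENIGMA's actual implementation; it only assumes what the text states, namely that synt mirrors code character for character, with words replaced by 'n', 't', '0', and so on. The sketch drops all whitespace and does not handle strings, comments, or the spacing of like-tokens.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical keyword tables; a real parser would know far more words.
static const std::set<std::string> kTypes = {"int", "double", "string", "variant"};
static const std::set<std::string> kLocals = {"local", "global"};

// Returns {code, synt}: the source with whitespace removed, and a string of
// the same length in which every character is replaced by its token.
std::pair<std::string, std::string> lex(const std::string& src) {
  std::string code, synt;
  std::size_t i = 0;
  while (i < src.size()) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    if (std::isspace(c)) { ++i; continue; }  // whitespace is simply dropped here
    if (std::isalpha(c) || c == '_') {
      // Read a whole word and classify it.
      std::size_t j = i;
      while (j < src.size() &&
             (std::isalnum(static_cast<unsigned char>(src[j])) || src[j] == '_')) ++j;
      std::string word = src.substr(i, j - i);
      char tok = 'n';                          // default: a plain name
      if (kTypes.count(word)) tok = 't';       // usable in a declaration
      else if (kLocals.count(word)) tok = 'L'; // resolved in a later sweep
      else if (word == "while") tok = 'w';     // only loop keyword in this sketch
      code += word;
      synt.append(word.size(), tok);
      i = j;
    } else if (std::isdigit(c)) {
      // Read a whole numeral; every digit becomes '0'.
      std::size_t j = i;
      while (j < src.size() && std::isdigit(static_cast<unsigned char>(src[j]))) ++j;
      code += src.substr(i, j - i);
      synt.append(j - i, '0');
      i = j;
    } else {
      // Symbols stand for themselves in both strings.
      code += src[i];
      synt += src[i];
      ++i;
    }
  }
  return {code, synt};
}
```

Calling lex on a source string returns both strings at once, so that any index into synt refers to the same character position in code.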

The system is not quite so simple, though; adjacent tokens are examined in pairs to decide whether an ambiguous token resolves to an existing token or to a new one. Take the following code snippet for example:

local int a1 = 0; while a1 > 1 a1 -= 10

When we lex the code, we get this:

localinta1=0;whilea1>1a1-=10
LLLLLtttnn=0;wwwwwnn>0nn-=00

Now, with a quick sweep we resolve the 'L' token, which means two different things depending on context. Since we see an 'L'-'t' pair, it resolves to 't':

localinta1=0;whilea1>1a1-=10
ttttttttnn=0;wwwwwnn>0nn-=00
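A sketch of that resolving sweep follows. The function name resolveLocalTokens is hypothetical. The text only states the 'L'-'t' case; what 'L' becomes in any other context is an assumption made here for illustration (it is treated as a plain name, 'n').

```cpp
#include <cstddef>
#include <string>

// Rewrites every run of 'L' in a token string according to what follows it.
std::string resolveLocalTokens(std::string synt) {
  std::size_t i = 0;
  while (i < synt.size()) {
    if (synt[i] != 'L') { ++i; continue; }
    // Find the end of this run of 'L' characters.
    std::size_t j = i;
    while (j < synt.size() && synt[j] == 'L') ++j;
    // 'L' followed by 't' is part of a declaration, so it becomes 't'.
    // ASSUMPTION: in any other context it is treated as a name ('n').
    char resolved = (j < synt.size() && synt[j] == 't') ? 't' : 'n';
    for (std::size_t k = i; k < j; ++k) synt[k] = resolved;
    i = j;
  }
  return synt;
}
```

Because only synt changes, the code string keeps the original word "local" for later stages to inspect.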

From there we start our semicolon sweep. This is done by keeping a stack of semicolon tokens. At the bottom of the stack, there will always be a single ';'. Each time we need to insert a spacer, the top of the stack is inserted, and then the stack is popped. So, in our example code, when we see "while" ("wwwww" in the lex), we push ')' and then '(' onto the stack. Here is a line by line look at the process:

At 'tn' / Nothing needs to be done. // Stack: ';'
ttttttttnn=0;wwwwwnn>0nn-=00

At '0;' / Nothing needs to be done. // Stack: ';'
ttttttttnn=0;wwwwwnn>0nn-=00

At 'w' / Push ')', then '('. // Stack: '(' ')' ';'
ttttttttnn=0;wwwwwnn>0nn-=00

At 'wn' / Pop the stack, inserting '('. // Stack: ')' ';'
ttttttttnn=0;wwwww(nn>0nn-=00

At 'n>' / Nothing needs to be done. // Stack: ')' ';'
ttttttttnn=0;wwwww(nn>0nn-=00

At '>0' / Nothing needs to be done. // Stack: ')' ';'
ttttttttnn=0;wwwww(nn>0nn-=00

At '0n' / Pop the stack, inserting ')'. // Stack: ';'
ttttttttnn=0;wwwww(nn>0)nn-=00

...

At the end of the code, pop the stack until it is empty, appending each popped token to the end.

localinta1=0;while(a1>1)a1-=10;
ttttttttnn=0;wwwww(nn>0)nn-=00;

As you can see, the stack is pulled each time a specific pair of tokens is encountered. These include '0n', '0t', ')n', ')t', 'wn' and many more. It's a relatively simple list to figure out. The list specifically excludes pairs such as 'tn' and 'n(', which are declarations and function calls, respectively.
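The sweep walked through above can be sketched as follows. The names semicolonSweep and triggersPull are hypothetical, and the pair list contains only the five pairs named in the text, so this is an illustration rather than ENIGMA's actual rule set. Two further assumptions: code and synt have equal length, and loop conditions are written without parentheses, as in the example. Keeping the bottom ';' in place when it is used mid-code is one reading of "there will always be a single ';'".

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// True if the adjacent token pair (a, b) should pull a spacer from the stack.
// Only the pairs named in the text are listed; the real list is longer.
static bool triggersPull(char a, char b) {
  static const std::vector<std::string> pairs = {"0n", "0t", ")n", ")t", "wn"};
  for (const std::string& p : pairs)
    if (p[0] == a && p[1] == b) return true;
  return false;
}

// Inserts the missing '(' ')' and ';' into both strings in lockstep.
std::pair<std::string, std::string> semicolonSweep(const std::string& code,
                                                   const std::string& synt) {
  std::vector<char> stack = {';'};  // the bottom of the stack is always ';'
  std::string outCode, outSynt;
  for (std::size_t i = 0; i < synt.size(); ++i) {
    if (i > 0 && triggersPull(synt[i - 1], synt[i])) {
      char spacer = stack.back();
      if (stack.size() > 1) stack.pop_back();  // never remove the bottom ';' early
      outCode += spacer;
      outSynt += spacer;
    }
    // At the first character of a 'w' keyword, queue up its parentheses.
    if (synt[i] == 'w' && (i == 0 || synt[i - 1] != 'w')) {
      stack.push_back(')');
      stack.push_back('(');
    }
    outCode += code[i];
    outSynt += synt[i];
  }
  // At the end of the code, pop until empty, appending each spacer.
  while (!stack.empty()) {
    outCode += stack.back();
    outSynt += stack.back();
    stack.pop_back();
  }
  return {outCode, outSynt};
}
```

Pushing ')' before '(' means the opening parenthesis is on top, so it is the first spacer inserted after the keyword, and the closing one is inserted at the next pull.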

As you have probably noticed, whitespace is removed at the beginning. During this process, like-tokens that would otherwise be accidentally concatenated are left with a single space between them. A space translates to an automatic stack pull.
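As a rough illustration of that spacing rule, the sketch below joins already-classified lexemes and keeps a single space only where two word-like tokens of the same kind would otherwise run together. The Lexeme struct and joinLexemes are hypothetical names, and representing the kept space as a literal ' ' in both strings is an assumption.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// A word, numeral, or single symbol together with its one-character token.
struct Lexeme {
  std::string text;
  char token;
};

// Concatenates lexemes without whitespace, except between adjacent
// word-like lexemes that share a token (for example two names in a row).
std::pair<std::string, std::string> joinLexemes(const std::vector<Lexeme>& lexemes) {
  std::string code, synt;
  for (const Lexeme& lx : lexemes) {
    if (lx.text.empty()) continue;
    bool wordLike = std::isalnum(static_cast<unsigned char>(lx.token)) != 0;
    if (wordLike && !synt.empty() && synt.back() == lx.token) {
      // Without this space, "y z" would collapse into the single name "yz".
      code += ' ';
      synt += ' ';
    }
    code += lx.text;
    synt.append(lx.text.size(), lx.token);
  }
  return {code, synt};
}
```

For the input x = y z = 2, this yields "x=y z=2" with tokens "n=n n=0"; the remaining space is where a later sweep would pull from the stack.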

At the same time that whitespace is removed, strings and comments are removed as well. They are stored so they can be reinserted later. This is fine, as the entire token string needs to be stored for later linking anyway.
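One way to picture this step is the sketch below. All names (Stripped, stripStringsAndComments, restoreStrings) are hypothetical, and the placeholder scheme is an assumption: each string literal is left behind as an empty "" so it can be matched up again by position. It handles only double-quoted strings without escape sequences, plus // and /* */ comments.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Stripped {
  std::string code;                   // source with literals emptied, comments removed
  std::vector<std::string> strings;   // contents of each string literal, in order
  std::vector<std::string> comments;  // text of each removed comment, in order
};

Stripped stripStringsAndComments(const std::string& src) {
  Stripped out;
  std::size_t i = 0;
  while (i < src.size()) {
    if (src[i] == '"') {
      std::size_t end = src.find('"', i + 1);
      if (end == std::string::npos) end = src.size();  // unterminated: take the rest
      out.strings.push_back(src.substr(i + 1, end - i - 1));
      out.code += "\"\"";                              // placeholder literal
      i = (end < src.size()) ? end + 1 : end;
    } else if (src.compare(i, 2, "//") == 0) {
      std::size_t end = src.find('\n', i);
      if (end == std::string::npos) end = src.size();
      out.comments.push_back(src.substr(i, end - i));
      i = end;                                         // the newline itself is kept
    } else if (src.compare(i, 2, "/*") == 0) {
      std::size_t end = src.find("*/", i + 2);
      end = (end == std::string::npos) ? src.size() : end + 2;
      out.comments.push_back(src.substr(i, end - i));
      i = end;
    } else {
      out.code += src[i++];
    }
  }
  return out;
}

// Puts the stored contents back into each "" placeholder, in order.
std::string restoreStrings(const std::string& code,
                           const std::vector<std::string>& strings) {
  std::string out;
  std::size_t next = 0;
  for (std::size_t i = 0; i < code.size(); ++i) {
    if (code[i] == '"' && i + 1 < code.size() && code[i + 1] == '"' &&
        next < strings.size()) {
      out += '"' + strings[next++] + '"';
      ++i;  // skip the closing quote of the placeholder
    } else {
      out += code[i];
    }
  }
  return out;
}
```

Emptying the literals before lexing keeps their contents from being mistaken for names or numerals, and the stored vectors hold what is needed to reinsert them afterwards.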