RXAS CovidOptimiser - adesutherland/CREXX GitHub Wiki
REXX Assembler Keyhole Optimiser (RXAS Covid-Optimiser)
Named after the fog of COVID through which the optimiser was written
Introduction
After processing the input REXX, creating syntax and flow trees, and analysing and manipulating these trees to optimise performance, the compiler finally emits REXX Assembler (RXAS) by recursively processing blocks of logic (aka REXX Instructions) and using code templates.
Ensuring this RXAS is optimised has a number of issues:
-
The template logic can get very complex (spaghetti) as it tries to find a faster specific instruction rather than a slower more generic one. For example iadd r1,r1,1 could be faster as inc r1 or inc1. This complexity grows exponentially once you consider instruction combinations.
-
It is very difficult (or even impossible) to optimise the output across the logic blocks. For example a branch from an if statement may directly jump to a branch instruction in a different control block because the emitter for the first was not aware of the second.
-
As we introduce new specialist RXAS instructions for optimisation, many parts of the compiler emitter logic will need changing to use these new instructions. Time consuming and error prone.
To address these issues we have a keyhole optimiser in the assembler (after the compiler) that applies rewrite rules. When a new specialist instruction is introduced then new rules can also be added. This is simpler and less error prone.
The assembler processing goes through a number of steps in a single pass:
- Lexer / Scanner - Tokenises the RXAS code
- Parser - Parses the structure into a series of instructions
- Binary writer - generates the binary code and constant pool
- Backpatcher - handles forward references
The keyhole optimiser sits between stage 2 and 3. It has an instruction queue (the keyhole) of around 20 instructions (configurable). The parser adds instructions to the back of the queue, the optimiser scans the queue rewriting instructions based on the rules, and sends instruction from the front of the queue to the binary writer.
Use of the Optimiser
By default the Assembler RXAS uses the Covid-Optimiser. This obviously applies to any assembler source (hand written or generated by the compiler), in all cases by default assembler rewriting takes place.
As with the compiler (RXC), the RXAS optimiser can be turned off with the -N options (N=No Optimisation):
RXAS -N file
To investigate / debug the optimiser and find out how it transformed the input assembler, the output binary file needs disassembling with RXDAS disassembler. This will allow the transformed assembler to be viewed.
RXDAS file
HINT: using diff on the two disassembly outputs of the optimised and non-optimised versions will allow the optimiser changes to be investigated "comfortably".
Covid-Optimiser Rules
Each rule set is made up of a number of instructions mappings (individual rules). A rule set ends with a rule flagged END_OF_RULE. The rule sets as a whole end with a rule set just made up of a rule flagged END_OF_RULE
Each rule has
-
a flag indicating
- END_OF_RULE described above
- NO_GAP There can be no instructions before this rule in the ruleset
- NO_HAZARD There can be non-hazardous instructions before matching this rule
- ANY_GAP There can be any instructions before matching this rule
-
The input mapping which is used to map a rule to an instruction.
-
0,1, or 2 output template mappings that are used to replace the mapped instruction.
All the rules of a ruleset need to map to instructions correctly. When they do all the mapped instructions are replaced by the output templates.
Each instruction mapping can have a type of
- 'i' - INSTRUCTION (a normal instruction)
- 'l' - LABEL (a label instruction)
The operand matching is done by mapping the actual register to the rules register number, when that actual register is found again it keeps the same mapping. So each input rule much match the instruction and operands. See examples!
NO_HAZARD
In this mode, there may be other instructions in the source code between the matched input instructions as long as they are not hazardous including being 'irrelevant', meaning that they do not use any of the register operands in the matching rules. Again see examples!
Hazardous instructions, change the data flow and include:
- labels (causes un-analysable flow control)
- branches (causes un-analysable flow control)
- function calls (these use dynamic registers)
- Procedure boundaries
- instructions not part of the ruleset but using registers used in the ruleset
If there is a match then each found instruction is removed from the queue and replaced with the output instruction templates (if any).
NOTE that a branch to an unconditional branch is optimised later, as part of the assembler branch address backpacking logic, so no rules are needed for these scenarios
Annotated Examples (assembler/rxas_opt.c for the actual rule declarations)
1. Rule for two swaps cancelling out: swap r0,r1; swap r0,r1
* flag type type inst op1 op2 op3
input * {NO_HAZARD, 'i', "swap", 'r', 0, 'r', 1, 0, 0,
output 1 * 0},
output 2 *
input * {NO_HAZARD, 'i', "swap", 'r', 0, 'r', 1, 0, 0,
output 1 * 0},
output 2 *
input * {END_OF_RULE},
This rule helps the situation where there are two consecutive calls to procedures and the compiler swaps back a register after the first call only to swap it back for the next call. Note that a second ruleset is needed for the case where the register numbers are reversed in the second swap (see the rules in assembler/rxas_opt.c)
NO_HAZARD is used as a branch in or out, or use of a relevant register would invalidate the rule logic
Example 1.1
swap r4,r8
swap r4,r8
Rule matches - rule r0 maps to register r4, and rule r1 maps to register r8.
No output - the swaps are removed
Example 1.2
swap r4,r8
iadd r2,r3,r5
swap r4,r8
Rule matches - rule r0 maps to register r4, and rule r1 maps to register r8 The iadd instruction is 'irrelevant' as it doesn't use r4 or r8
Output - the swaps are removed
iadd r2,r3,r5
Example 1.3
swap r4,r8
say r4
swap r4,r8
Rule does not match - although rule r0 maps to register r4, and rule r1 maps to register r8 the say instruction uses r4. So no match, the swaps cannot be removed
Output same as input - no change
2. Rule for converting a concat to a faster append
E.g. concat r0,r0,r1 to append r0,r1
* flag type type inst op1 op2 op3
input * {NO_HAZARD, 'i', "concat", 'r', 0, 'r', 0, 'r', 1,
output 1 * 'i', "append", 'r', 0, 'r', 1, 0, 0},
output 2 *
input * {END_OF_RULE},
Example 2.1
concat r4,r4,r8
Rule matches - rule r0 maps to register r4, and rule r1 maps to register r8 noting that the first two operands are the same register Output - concat is removed and replaced with the faster append
append r4,r8
3. Rule for optimising an unconditional branch (br) to a branch true (brt) to a brtf
This is a complex ruleset and one of 3 rulesets (currently) designed to improve performance by reducing branches to branches
* flag type type inst op1 op2 op3
input * {ANY_GAP, 'i',"br", 'b', 0, 0, 0, 0, 0,
output 1 * 'i',"brtf", 'b', 1, 'b', 2, 'r', 0},
output 2 *
input * {ANY_GAP, 'l',0, 'l', 0, 0, 0, 0, 0,
output 1 * 'l',0, 'l', 0, 0, 0, 0, 0},
output 2 *
input * {NO_GAP, 'i',"brt", 'b', 1, 'r', 0, 0, 0,
output 1 * 'i',"brt", 'b', 1, 'r', 0, 0, 0,
output 2 * 'l',0, 'l', 2, 0, 0, 0, 0},
input * {END_OF_RULE},
Especially with control statements (like IF) the compiler glues the different logic blocks together with branches, this leads to scenarios where a branch jumps directly to another branch.
Rule 1 matches an unconditional branch which it proposes to change to a conditional branch directly to the two eventual destinations Rule 2 matches to a label which is the destination of the previous matched br. ANY_GAP is used as intervening instruction can be safely ignored The output template shows this label is unchanged Rule 3 matches the brt. The NO_GAP indicated that the brt must directly follow the label above. The output of this rule is the brt followed by a new label (to be used by the new brtf, from rule 1)
The important details is the 'l' and 'b' mappings. These are intrinsically linked by the optimiser (e.g. 'b' 0 branches to label 'l' 0). So
- b0 (rule 1) maps to l0 (rule 2)
- b1 (rule 3) is the branch true target and is used in the rule 1 output
- b2 is special - there is no input b2 so the system creates a new unique label for it. Rule 1 uses this as the branch false destination, and rule 3's output 2 makes the required label instruction.
NOTE that by the nature of a keyhole optimiser this optimisation only works when the branches are near (currently upto 20 instructions or so apart) each other.
Example 3.1
br lb1
...
lb1:
brt lb2,r3
instf
...
lb2:
instt
As can be seen the control flow from "br lb1" either ends up at "instt" if r3 is true, or "instf" but requires two branches.
The output from the ruleset is:
brtf lb2,lbnew,r3
...
lb1:
brt lb2,r3
lbnew:
instf
...
lb2:
instt
As can be seen the branch now goes directly to the eventual destinations. The rest of the code is unchanged as other logic may be branching to lb1 and lb2; by leaving this alone we know we are not breaking other areas. (Note that after disassembly lp1 will be "removed" (ignored) if it is not used by any other code).
Rule Flag Values
- END_OF_RULE Marks the end of a rule
- NO_GAP No instruction between the last match and this instruction
- NO_HAZARD No hazard instruction (like a branch) before this instruction. This is used for non-branch rules
- ANY_GAP Any instructions are allowed before this matched instruction. This is for branch rules
Operand Type Values
- 'r': Register
- 'i': Integer
- 's': String
- 'c': Char
- 'f': Float
- 'l': Label
- 'b': Branch
- 'p': Procedure
- 0 / NULL - No operand
Special Rules for Branch / Label Operands
Intrinsically Linked
The l and b tokens are intrinsically linked - "quantum!" - they are different token types but must have matching values. This means, as an example, that b1 matches l1 - they need the same value across rules to match.
This also means that if (again, as an example) b4 is mapped to an operand, then l4 also has the same value.
Auto-Generation
If a l or b operand is used in a template (output) rule but has not been used in an input rule then the optimiser automatically generates a new unique label value - and ensured the same new label is used for both the l and b values.