Threaded interpretation and instruction merging - tsafin/tarantool GitHub Wiki
Google Summer of Code 2021 - Tarantool:threaded interpretation
Student: Anastasia Neklepaeva
Mentors: Igor Munkin, Nikita Pettik, Timur Safin
Description: Tarantool uses SQL VM with conventional interpretation technique: loop over instruction, inside which dispatching switch-case executes logic of current opcode. I’ve worked on threaded-code technique. Threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It helps to avoid the second jump in code processing by jumping directly to the code handling the next instruction.
Pre-GSoC period
I started working on this project with learning about internals of Sqlite VDBE. I have read several articles about virtual machines and their differences and analyzed a Tarantool VDBE source code.
Coding period
First steps were defining DISPATCH()
macro and replacing case OP_value
with Exec_OP_value
label. This action required
some changes for vdbe.c file parser to generate opcodes numeric values. Some compilers provide computed goto feature
(e.g. gcc ).
But there are some compilers not providing computed goto feature. That's why threaded code should be supported
by Tarantool as well as swith-technique.
For this purpose switch-based interpreter was rewritten with START_EVAL()
, EXECUTE()
macros. After that
tarantool supports both techniques: switch-based interpreter and threaded code.
Also, some macros for debug were defined. Script for generation of array of goto addresses was created.
Performance have been measured.
Tasks completed during GSoC period:
- Define macro
DISPATCH()
that hides internal implementation of dispatching: if flag-DWITHOUT_COMPUTED_GOTO
was specified thenDISPATCH()
implements jump to switch otherwise jump directly to the code handling the next instruction - Define macro
START_EVAL()
to hide switch switch-based interpreter and jump to the label of first instruction - Define macro
EXECUTE()
to hide case OP_value for switch-based interpreter and labelExec_OP_value
for threaded code - Replace break and case keywords with mentioned macro
- Define some macros for debug
- Generate an array of goto addresses in the same order as corresponding
OP_values
are defined - Benchmark results using recursive query that calculates sum from 1 to 1000000
Query:
WITH RECURSIVE temp (n, sum, dif) AS
(SELECT 0, 0, 0
UNION ALL
SELECT n+1, (n+1)+sum, dif+(n-1) FROM temp
WHERE n < 1000000)
SELECT * FROM temp;
Results:
CPU usage
Time of query execution
Compiler: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
Tarantool Version: 2.9.0, Linux-x86_64-RelWithDebInfo
Perf command to measure performance: sudo perf record -g -- ./src/tarantool ./src/test.lua
./src/test.lua contains recursive query
The computed goto version is faster because of two reasons:
-
the switch does a bit more per iteration because of bounds checking;
-
the effects of hardware branch prediction.
What’s left to do:
-
Add instruction merging as another optimization
-
Fix reviewed code