Threaded interpretation and instruction merging - tsafin/tarantool GitHub Wiki

Google Summer of Code 2021 - Tarantool:threaded interpretation

Student: Anastasia Neklepaeva

Mentors: Igor Munkin, Nikita Pettik, Timur Safin

Description: Tarantool uses SQL VM with conventional interpretation technique: loop over instruction, inside which dispatching switch-case executes logic of current opcode. I’ve worked on threaded-code technique. Threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It helps to avoid the second jump in code processing by jumping directly to the code handling the next instruction.

Pre-GSoC period

I started working on this project with learning about internals of Sqlite VDBE. I have read several articles about virtual machines and their differences and analyzed a Tarantool VDBE source code.

Coding period

First steps were defining DISPATCH() macro and replacing case OP_value with Exec_OP_value label. This action required some changes for vdbe.c file parser to generate opcodes numeric values. Some compilers provide computed goto feature (e.g. gcc ). But there are some compilers not providing computed goto feature. That's why threaded code should be supported by Tarantool as well as swith-technique. For this purpose switch-based interpreter was rewritten with START_EVAL(), EXECUTE() macros. After that tarantool supports both techniques: switch-based interpreter and threaded code. Also, some macros for debug were defined. Script for generation of array of goto addresses was created. Performance have been measured.

Tasks completed during GSoC period:

Define macro DISPATCH() that hides internal implementation of dispatching: if flag -DWITHOUT_COMPUTED_GOTO was specified then DISPATCH() implements jump to switch otherwise jump directly to the code handling the next instruction
Define macro START_EVAL() to hide switch switch-based interpreter and jump to the label of first instruction
Define macro EXECUTE() to hide case OP_value for switch-based interpreter and label Exec_OP_value for threaded code
Replace break and case keywords with mentioned macro
Define some macros for debug
Generate an array of goto addresses in the same order as corresponding OP_values are defined
Benchmark results using recursive query that calculates sum from 1 to 1000000

Query:

WITH RECURSIVE temp (n, sum, dif) AS
(SELECT 0, 0, 0
UNION ALL
SELECT n+1, (n+1)+sum, dif+(n-1) FROM temp
WHERE n < 1000000)
SELECT * FROM temp;

Results:

CPU usage

cpu_usage

Time of query execution

query_time

Compiler: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0

CPU: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Tarantool Version: 2.9.0, Linux-x86_64-RelWithDebInfo

Perf command to measure performance: sudo perf record -g -- ./src/tarantool ./src/test.lua

./src/test.lua contains recursive query

The computed goto version is faster because of two reasons:

the switch does a bit more per iteration because of bounds checking;
the effects of hardware branch prediction.

PR with code

What’s left to do:

Add instruction merging as another optimization
Fix reviewed code