GSoC 2022 ideas - tsafin/tarantool GitHub Wiki
Google Summer of Code 2022 ideas
Please find below the preliminary list of ideas we would like to explore during this season of Google Summer of Code. Each idea has corresponding difficulty level, required/optional skill sets, and mentor names.
Not all ideas have same amount of detailiness in explanation, but we believe they might be of a good start in an interesting direction.
Table of Contents
Database engines
Add column storage engine
Tarantool storage engines MemTx and Vinyl are row (tuple) based, which is working fine for OLTP activity, but is suboptimal for OLAP queries. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with column based storage. As a bonus, for cluster mode, column storage may store different column families on different nodes for better locality. Such approach may provide (or may be not) better scalability for OLAP queries in a wide-column cluster.
- We need to experiment with local column storage scenario
- And wide-column scenario for column families spread over cluster nodes;
- Measure benefits and overhead;
Difficulty: Hard
Required skills: C/C++
Mentor: Kirill Yukhin
Add B+ Tree storage engine
Persistent, LSM-based Tarantool Vinyl storage engine provides good performance for write-only activities, but is suboptimal for read queries, especially if one query cold data. On the other hand B+ tree based storage engines better fit the scenarios when we not only write data (across clusters) but also need to query cold data for simple analytic needs. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with B+ tree based storage.
- One need to create this engine, plug it in to the generic query flow in Tarantool (making sure that Lua and SQL requests still work for new engine);
- One need to compare its performance against Vinyl, measure memory benefits or losses.
Difficulty: Hard
Required skills: C/C++
Mentor: Timur Safin, Cyrill Gorkunov
Calculate bloom filter and page size automatically
Vinyl is a disk storage based on LSM tree data structure. To speed up data access (which may take a while for a disk storage) we use probabilistic data structure - bloom filter. It allows us to eliminate excess data lookups - we can skip some files on disk if bloom filter application returns negative result. We should adjust bloom filters and page size automatically to fit a pre-defined page index size. We have the following automatic adjustment strategies, which we should all employ:
- avoid having bloom filters on lower levels; the deeper the LSM level, the less value per byte used they provide
- increase page size (again, starting from the bottom-most pages) - the larger is the page size, the smaller is the page index.
Related issues :
Difficulty: Medium
Required skills: C
Mentor: Nikita Pettik, Anastasiy Belyaev
Implement persistent tuple cache
To speed up data access in Vinyl we also maintain cache of tuples (to be more precise - chains of tuples) in RAM. However, it is not persistent, i.e. in instance reload its content is erased. It would be nice to dump tuple cache during checkpoint, and recover it during vinyl recovery.
Related issues :
vinyl: persistent tuple cache #2065
Difficulty: Hard
Required skills: C
Mentor: Nikita Pettik, Anastasiy Belyaev
Make vinyl memory level dumping independent from memtx
Currently, the only way of forcing LSM-tree based vinyl engine's memory level dump is by calling box.snapshot()
. However, this call can be slow, because it also dumps memtx engine's memory and it has some side effects, such as creating a checkpoint and invoking xlog garbage collection. It would be nice if there were a separate handle for that, e.g. box.ctl.vinyl.dump()
.
Related issues:
Difficulty: Medium
Required skills: C
Mentors: Nikita Pettik, Georgiy Lebedev, Anastasiy Belyaev
Add massive end-to-end testing for concurrent operations in vinyl
In #4821 we found that Tarantool may crash or get stuck if a failed memory dump happens concurrently with a DDL operation. However, this is not the only case where dump failure occurring in parallel with another operation may lead to bugs. Thus, we need provide test scenarios covering combination of dump/compact fails and concurrent index/space drop/alter/creation operations.
This projects involves researching various test scenarios and test means (e.g., generation of DDL operations, possibly even fuzzing), and also investigating and fixing the bugs discovered during research.
Related issues:
- vinyl: provide long run stress-like testing
- vinyl: assertion fault in vy_scheduler_peek_dump()
- Improve test checking secondary index consistency
Difficulty: Medium
Required skills: C
Mentors: Nikita Pettik, Georgiy Lebedev
LuaJIT
Fuzzing Lua (LuaJIT) in Tarantool
LuaJIT is a heart of Tarantool and the stability of it is closely connected to the stability of LuaJIT. But statistics of issues said that we catch bugs with LuaJIT [1] using Tarantool on production. We need to decrease the probability of such bugs and introduce more generative testing for LuaJIT in the Tarantool release cycle.
Some ideas for testing are:
- Metamorphic testing using existing corpus of Lua scripts.
- Automated generation of scripts using Lua grammar and introspection of Tarantool embedded functions.
- Fault injection in memory allocation using embedded error injection engine (see
src/lib/core/errinj.c
) or using external library with LD_PRELOAD
References
- LuaJIT/LuaJIT#568
- https://github.com/tarantool/tarantool/issues?q=is%3Aissue+is%3Aopen+label%3Acrash+label%3Aluajit
- https://github.com/tarantool/tarantool/issues/4823
Difficulty: Moderate
Required skills: C/C++, Lua
Desired skills: LuaJIT internals, metamorphic testing, fuzzing testing, LibFuzzer
Mentors: Sergey Bronnikov, Igor Munkin
SQL
AST-based fuzzing of SQL engine
Tarantool has an SQL engine inherited from SQLite. We are interested in a good testing of SQL engine.
Difficulty: Medium
Required skills: C/C++
Optional skills: SQL, Lua
Mentor: Sergey Bronnikov
Integration with SQLancer
SQLancer is a tool for finding logic bugs in SQL engines, written in Java. It uses Pivoted Query Synthesis (PQS), Non-optimizing Reference Engine Construction (NoREC), Ternary Logic Partitioning (TLP) for building correct from syntax point of view queries and execute in database. Efficiency of SQLances has proven with widely-used DBMSs like PostgreSQL, MySQL, sqlite etc. where SQLancer found over 450 unique, previously unknown bugs.
References
Difficulty: Medium
Required skills: Java
Optional skills: SQL
Mentor: Sergey Bronnikov
LLVM JIT for Tarantool SQL engine (stage 2)
There are multiple virtual machines used at the moment by Tarantool:
- LuaJIT within its 2 operating modes (Lua bytecode interpreter, and JIT traces with native code generated);
- And SQLite-derivative VM code for SQL engine known as vdbe.
Vdbe is not providing best performance and looks alien to the rest of Tarantool code base. One of GSoC 2021 projects explored LLVM JIT facilities for SQL engine speed improvements - LLVM JIT engine for Tarantool's DQL
One should pick this project up, make it feature complete, benchmark performance, and fine-tune for general case.
Difficulty: Hard
Required skills: C/C++, JIT
Optional skills: LLVM, SQLite
Mentor: Georgiy Lebedev, Timur Safin, Nikita Pettik
Ecosystem
Implement Tarantool/LuaJIT backend for Compiler Explorer
Code Explorer also known as Godbolt (godbolt.org) became the greatest community tool used by multiple languages community for code share and performance investigations. It started with C++ compilers, but now it includes Ada, Go, Haskell, and many others (and more to come).
Unfortunately, there is no LuaJIT/Tarantool support and furthermore there is no any good substitution for missing godbolt inside of LuaJIT community.
There is a nice LuaJIT web inspector by @mejedi though, which shows bytecode IR and generated x86 assembly for the given Lua script.
- But there is no Tarantool version of LuaJIT integrated;
- Tere is no way to generate assembly listing for non-x86 targets;
- and there is no convenient way to share link as we used to in Godbolt;
It looks like that godbolt.org is the best infrastructure which should be used here:
- It has script persistence and short-cutter activated;
- It has facilities to run any compiler inside of Docker VM;
- It has builtin facilities to parse compiler assembly and for intercepting and redirecting it to disassembly (or IR) panes inside of Compiler Explorer.
So Lua/LuaJITs should be integrated to the Godbolt infrastructure similarly to many other languages already presented there.
Difficulty: Medium
Required skills: TypeScript/JavaScript, Docker
Optional skills: C/C++, LuaJIT
Mentor: Timur Safin, Igor Munkin
Introduce dynamic trace probes to Tarantool kernel for SystemTap/dtrace
Dynamic tracing probes as introduced to DTrace in Solaris, have been ported to BSD*, Linux and even Windows lately. Similarly, Linux has SystemTap probing facilities which support tracing functionality but with less ergonomics.
It's common approach to use trace probes in database engines to make easier collecting of system internals on production sites.
Please see examples of dynamic tracing developed for database engines:
- PostgreSQL dynamic tracing https://www.postgresql.org/docs/current/dynamic-trace.html
- Tracing Tracing mysqld Using DTrace
- Adding User Space Probing to an Application https://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps
One should add tracing probes, evaluate overhead, and provide tutorial guidelines for using DTrace/SystemTap for Tarantool performance investigations.
Difficulty: Medium
Required skills: C/C++, Linux kernel
Mentor: Cyrill Gorkunov, Timur Safin
Deterministic simulator
Testing of distributed systems is hard, because any non-deterministic behaviour (changed timings in networking, differences of cpu scheduler quotas, or even cpu shrtages) may lead to flacky tests failures.
Injection of errors (box.error.injection
) used by Tarantool considered a
reliable approach to deterministically generate events during testing. But
there is no way to make multi-mode networking or cpu timings behave
deterministically in current architecture.
One may try to virtualize various layers of API (e.g. monotonic clock or networking) via framework similar to whirl-framework
Another approach similar to error injection - is to generate single-threaded system simulator with all IO and scheduling being fully virtualized, using FoundationDB approach.
- “Testing Distributed Systems w/ Deterministic Simulation” by Will Wilson
- Simulation and Testing
- FoundationDB or: How I Learned to Stop Worrying and Trust the Database by Markus Pilman
One should explore possible approaches to gain full determenism of testing for "distributed" system, select the least invasive, and try to apply the selected approach to some of currently available replication tests.
Difficulty: Hard
Required skills: C/C++, Linux
Mentor: Timur Safin
Full Tarantool/Lua support for IntelliJ IDEA platform
At the moment there is rather decent Lua syntax support available for IntelliJ IDEA platform. There is no integrated Tarantool box API yet, so there is no native code complete available for builtin modules. And there is no currently possible to debug remotely Tarantool Lua scripts. Or run specific tests natively from inside of IDE.
- One should add Tarantool box api definitions to syntax highlighter;
- Integrate LuaTest framework support;
- and modifying mobdebug by Paul Kluchenko for Tarantool flavour support, one should create IntelliJ IDEA-extension which would allow to remotely attach to Tarantool lua debuggee, set breakpoints, watch variables, step into and out.
Difficulty: Medium
Required skills: Java, Lua
Optional skills: C/C++
Mentor: Timur Safin