GSoC 2022 ideas - tsafin/tarantool GitHub Wiki

Google Summer of Code 2022 ideas

Please find below the preliminary list of ideas we would like to explore during this season of Google Summer of Code. Each idea has corresponding difficulty level, required/optional skill sets, and mentor names.

Not all ideas have same amount of detailiness in explanation, but we believe they might be of a good start in an interesting direction.

Table of Contents

Database engines

Add column storage engine

Tarantool storage engines MemTx and Vinyl are row (tuple) based, which is working fine for OLTP activity, but is suboptimal for OLAP queries. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with column based storage. As a bonus, for cluster mode, column storage may store different column families on different nodes for better locality. Such approach may provide (or may be not) better scalability for OLAP queries in a wide-column cluster.

  • We need to experiment with local column storage scenario
  • And wide-column scenario for column families spread over cluster nodes;
  • Measure benefits and overhead;

Difficulty: Hard

Required skills: C/C++

Mentor: Kirill Yukhin

Add B+ Tree storage engine

Persistent, LSM-based Tarantool Vinyl storage engine provides good performance for write-only activities, but is suboptimal for read queries, especially if one query cold data. On the other hand B+ tree based storage engines better fit the scenarios when we not only write data (across clusters) but also need to query cold data for simple analytic needs. There is infrastructure in Tarantool to switch engines for spaces, and we may experiment with B+ tree based storage.

  • One need to create this engine, plug it in to the generic query flow in Tarantool (making sure that Lua and SQL requests still work for new engine);
  • One need to compare its performance against Vinyl, measure memory benefits or losses.

Difficulty: Hard

Required skills: C/C++

Mentor: Timur Safin, Cyrill Gorkunov

Calculate bloom filter and page size automatically

Vinyl is a disk storage based on LSM tree data structure. To speed up data access (which may take a while for a disk storage) we use probabilistic data structure - bloom filter. It allows us to eliminate excess data lookups - we can skip some files on disk if bloom filter application returns negative result. We should adjust bloom filters and page size automatically to fit a pre-defined page index size. We have the following automatic adjustment strategies, which we should all employ:

  • avoid having bloom filters on lower levels; the deeper the LSM level, the less value per byte used they provide
  • increase page size (again, starting from the bottom-most pages) - the larger is the page size, the smaller is the page index.

Related issues :

Difficulty: Medium

Required skills: C

Mentor: Nikita Pettik, Anastasiy Belyaev

Implement persistent tuple cache

To speed up data access in Vinyl we also maintain cache of tuples (to be more precise - chains of tuples) in RAM. However, it is not persistent, i.e. in instance reload its content is erased. It would be nice to dump tuple cache during checkpoint, and recover it during vinyl recovery.

Related issues :

vinyl: persistent tuple cache #2065

Difficulty: Hard

Required skills: C

Mentor: Nikita Pettik, Anastasiy Belyaev

Make vinyl memory level dumping independent from memtx

Currently, the only way of forcing LSM-tree based vinyl engine's memory level dump is by calling box.snapshot(). However, this call can be slow, because it also dumps memtx engine's memory and it has some side effects, such as creating a checkpoint and invoking xlog garbage collection. It would be nice if there were a separate handle for that, e.g. box.ctl.vinyl.dump().

Related issues:

Difficulty: Medium

Required skills: C

Mentors: Nikita Pettik, Georgiy Lebedev, Anastasiy Belyaev

Add massive end-to-end testing for concurrent operations in vinyl

In #4821 we found that Tarantool may crash or get stuck if a failed memory dump happens concurrently with a DDL operation. However, this is not the only case where dump failure occurring in parallel with another operation may lead to bugs. Thus, we need provide test scenarios covering combination of dump/compact fails and concurrent index/space drop/alter/creation operations.

This projects involves researching various test scenarios and test means (e.g., generation of DDL operations, possibly even fuzzing), and also investigating and fixing the bugs discovered during research.

Related issues:

Difficulty: Medium

Required skills: C

Mentors: Nikita Pettik, Georgiy Lebedev

LuaJIT

Fuzzing Lua (LuaJIT) in Tarantool

LuaJIT is a heart of Tarantool and the stability of it is closely connected to the stability of LuaJIT. But statistics of issues said that we catch bugs with LuaJIT [1] using Tarantool on production. We need to decrease the probability of such bugs and introduce more generative testing for LuaJIT in the Tarantool release cycle.

Some ideas for testing are:

  • Metamorphic testing using existing corpus of Lua scripts.
  • Automated generation of scripts using Lua grammar and introspection of Tarantool embedded functions.
  • Fault injection in memory allocation using embedded error injection engine (see src/lib/core/errinj.c) or using external library with LD_PRELOAD

References

  1. LuaJIT/LuaJIT#568
  2. https://github.com/tarantool/tarantool/issues?q=is%3Aissue+is%3Aopen+label%3Acrash+label%3Aluajit
  3. https://github.com/tarantool/tarantool/issues/4823

Difficulty: Moderate

Required skills: C/C++, Lua

Desired skills: LuaJIT internals, metamorphic testing, fuzzing testing, LibFuzzer

Mentors: Sergey Bronnikov, Igor Munkin

SQL

AST-based fuzzing of SQL engine

Tarantool has an SQL engine inherited from SQLite. We are interested in a good testing of SQL engine.

Difficulty: Medium

Required skills: C/C++

Optional skills: SQL, Lua

Mentor: Sergey Bronnikov

Integration with SQLancer

SQLancer is a tool for finding logic bugs in SQL engines, written in Java. It uses Pivoted Query Synthesis (PQS), Non-optimizing Reference Engine Construction (NoREC), Ternary Logic Partitioning (TLP) for building correct from syntax point of view queries and execute in database. Efficiency of SQLances has proven with widely-used DBMSs like PostgreSQL, MySQL, sqlite etc. where SQLancer found over 450 unique, previously unknown bugs.

References

Difficulty: Medium

Required skills: Java

Optional skills: SQL

Mentor: Sergey Bronnikov

LLVM JIT for Tarantool SQL engine (stage 2)

There are multiple virtual machines used at the moment by Tarantool:

  • LuaJIT within its 2 operating modes (Lua bytecode interpreter, and JIT traces with native code generated);
  • And SQLite-derivative VM code for SQL engine known as vdbe.

Vdbe is not providing best performance and looks alien to the rest of Tarantool code base. One of GSoC 2021 projects explored LLVM JIT facilities for SQL engine speed improvements - LLVM JIT engine for Tarantool's DQL

One should pick this project up, make it feature complete, benchmark performance, and fine-tune for general case.

Difficulty: Hard

Required skills: C/C++, JIT

Optional skills: LLVM, SQLite

Mentor: Georgiy Lebedev, Timur Safin, Nikita Pettik

Ecosystem

Implement Tarantool/LuaJIT backend for Compiler Explorer

Code Explorer also known as Godbolt (godbolt.org) became the greatest community tool used by multiple languages community for code share and performance investigations. It started with C++ compilers, but now it includes Ada, Go, Haskell, and many others (and more to come).

Unfortunately, there is no LuaJIT/Tarantool support and furthermore there is no any good substitution for missing godbolt inside of LuaJIT community.

There is a nice LuaJIT web inspector by @mejedi though, which shows bytecode IR and generated x86 assembly for the given Lua script.

  • But there is no Tarantool version of LuaJIT integrated;
  • Tere is no way to generate assembly listing for non-x86 targets;
  • and there is no convenient way to share link as we used to in Godbolt;

It looks like that godbolt.org is the best infrastructure which should be used here:

  • It has script persistence and short-cutter activated;
  • It has facilities to run any compiler inside of Docker VM;
  • It has builtin facilities to parse compiler assembly and for intercepting and redirecting it to disassembly (or IR) panes inside of Compiler Explorer.

So Lua/LuaJITs should be integrated to the Godbolt infrastructure similarly to many other languages already presented there.

Difficulty: Medium

Required skills: TypeScript/JavaScript, Docker

Optional skills: C/C++, LuaJIT

Mentor: Timur Safin, Igor Munkin

Introduce dynamic trace probes to Tarantool kernel for SystemTap/dtrace

Dynamic tracing probes as introduced to DTrace in Solaris, have been ported to BSD*, Linux and even Windows lately. Similarly, Linux has SystemTap probing facilities which support tracing functionality but with less ergonomics.

It's common approach to use trace probes in database engines to make easier collecting of system internals on production sites.

Please see examples of dynamic tracing developed for database engines:

One should add tracing probes, evaluate overhead, and provide tutorial guidelines for using DTrace/SystemTap for Tarantool performance investigations.

Difficulty: Medium

Required skills: C/C++, Linux kernel

Mentor: Cyrill Gorkunov, Timur Safin

Deterministic simulator

Testing of distributed systems is hard, because any non-deterministic behaviour (changed timings in networking, differences of cpu scheduler quotas, or even cpu shrtages) may lead to flacky tests failures.

Injection of errors (box.error.injection) used by Tarantool considered a reliable approach to deterministically generate events during testing. But there is no way to make multi-mode networking or cpu timings behave deterministically in current architecture.

One may try to virtualize various layers of API (e.g. monotonic clock or networking) via framework similar to whirl-framework

Another approach similar to error injection - is to generate single-threaded system simulator with all IO and scheduling being fully virtualized, using FoundationDB approach.

One should explore possible approaches to gain full determenism of testing for "distributed" system, select the least invasive, and try to apply the selected approach to some of currently available replication tests.

Difficulty: Hard

Required skills: C/C++, Linux

Mentor: Timur Safin

Full Tarantool/Lua support for IntelliJ IDEA platform

At the moment there is rather decent Lua syntax support available for IntelliJ IDEA platform. There is no integrated Tarantool box API yet, so there is no native code complete available for builtin modules. And there is no currently possible to debug remotely Tarantool Lua scripts. Or run specific tests natively from inside of IDE.

  • One should add Tarantool box api definitions to syntax highlighter;
  • Integrate LuaTest framework support;
  • and modifying mobdebug by Paul Kluchenko for Tarantool flavour support, one should create IntelliJ IDEA-extension which would allow to remotely attach to Tarantool lua debuggee, set breakpoints, watch variables, step into and out.

Difficulty: Medium

Required skills: Java, Lua

Optional skills: C/C++

Mentor: Timur Safin