Tarantool proxy
Automatic sharding with Tarantool/Proxy
Introduction to Tarantool/Box
Tarantool is an in-memory database extensible with Lua stored procedures and triggers. There is no built-in sharding or automatic fail-over mechanism.
To extract maximum performance from the database, even on a single host, it's recommended to shard user data across multiple instances. An instance owns a chunk of memory available on a machine and processes connections on its own port.
For example, a machine with 32GB of RAM typically runs 3 instances, each given 8GB of RAM.
These 8GB, in turn, are usually split into 2GB of index memory and 6GB of tuple memory. The remaining 8GB are reserved for snapshotting.
What the proxy is
The goal of the proxy server is to hide the existence of data sharding from the end user of Tarantool. A proxy typically runs on the same host as the application server, so that the added latency of an additional hop is minimal. The application server directs all requests to the local proxy. The task of the proxy is to route requests to the appropriate shard, depending on the key specified in the request.
The proxy also needs to account for a resharding that may be in progress, as well as facilitate it.
For these purposes, the proxy maintains a connection pool to all existing shards, and has a formula to evaluate a shard id from request data.
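As an illustration, here is a minimal sketch of the routing idea in Lua. The shard addresses, the hash function and the space/index numbers are placeholders, and the connector call (box.net.box.new) is an assumption in the style of the 1.5-era Tarantool/Box Lua API; this is not a definitive implementation.

```lua
-- Minimal proxy routing sketch. Shard addresses, 'hash' and the
-- connector call are assumptions; adjust to the actual deployment.

local shards = {
    { host = '10.0.0.1', port = 33013 },
    { host = '10.0.0.2', port = 33013 },
}

local pool = {}

-- Lazily open and cache a connection to the given shard.
local function connection(id)
    if pool[id] == nil then
        pool[id] = box.net.box.new(shards[id].host, shards[id].port)
    end
    return pool[id]
end

-- The sharding formula: any hash works as long as every proxy and shard
-- uses the same one. The '+ 1' adapts the 0-based formula to 1-based
-- Lua tables.
local function shard_id(key)
    return hash(key) % #shards + 1
end

-- Route a SELECT on space 0, index 0 to the shard owning the key.
function proxy_select(key)
    return connection(shard_id(key)):select(0, 0, key)
end
```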
Sharding formula and resharding
Sharding formulae and principles vary and depend greatly on application requirements. In general, a formula is a function which accepts a unique key and returns a shard id. Depending on the formula, it may or may not be possible to add an additional shard purely by means of master-slave or multi-master replication. In general, resharding may require that a subset of keys that belong to one shard migrates to a different shard. Migration paths may go from old shards to the new shard or shards, as well as between old shards.
Resharding is not an instant procedure - it may take a while to migrate data in accordance with the new sharding formula.
Resharding is also not a task done by the proxy. However, the proxy needs to be aware of a possible ongoing resharding.
To find the right shard in the presence of ongoing resharding, the proxy maintains a list of formulae, from newest to oldest, which it evaluates in turn to find a shard id. If there is no key in the shard found according to the first formula, the proxy tries the second formula, and checks the key again (provided that the second formula returns a different shard id). If the key is found on an "old" shard, the proxy moves the key to the new shard after returning it to the client.
For example, let's assume that initially the sharding formula is
shard_id = hash(key) % 2
After two more shards are added to the cluster, the formula becomes
shard_id = hash(key) % 4
The proxy first tries to find the key on the shard according to the latest formula, and, on failure, uses the previous one.
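A sketch of this lookup over a list of formulae, newest first, might look as follows. Here 'hash', 'lookup' and 'schedule_move' are placeholders for whatever calls the proxy uses to hash a key, probe a shard and queue the key migration described below; the '+ 1' only adapts the formulae to 1-based Lua tables.

```lua
-- Shard lookup over a list of formulae, newest first.
local formulae = {
    function(key) return hash(key) % 4 + 1 end,  -- newest formula
    function(key) return hash(key) % 2 + 1 end,  -- previous formula
}

local function find_shard(key)
    local newest = formulae[1](key)
    local checked = {}
    for _, formula in ipairs(formulae) do
        local id = formula(key)
        -- Only probe a shard once, even if several formulae agree on it.
        if not checked[id] then
            checked[id] = true
            if lookup(id, key) then
                if id ~= newest then
                    -- The key sits on an "old" shard: move it to the new
                    -- shard after the response is sent to the client.
                    schedule_move(key, id, newest)
                end
                return id
            end
        end
    end
    -- The key does not exist anywhere: it belongs on the newest shard.
    return newest
end
```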
Steps for resharding (DBA perspective)
A new resharding cannot be started until the previous resharding has finished.
- Add a new shard or shards, set up new instances of the server, ensuring all instances contain the same configuration. Do not start the instances yet.
- Update the list of existing shards and the sharding formula on each instance (be it a proxy or a shard).
The sharding formula and the list of shards are best stored in a Lua function and a Lua table respectively (see the sketch after this list). Proxies need to know them to route requests. Old shards need to know them to reshard data in the background.
To update the list of shards it's necessary to reload the Lua module responsible for sharding. Since such a reload in Tarantool can happen without a server restart, it's best to perform it with a deployment management tool, e.g. Capistrano or similar:
a) modify a central VCS repository with new shard data
b) issue a cap deploy, so that all hosts pull changes from VCS and upload them to local Tarantool/Box instances.
c) start background resharding (safe to do since the new shards are still down).
The background resharding is a Lua background process which moves keys between shards. What it does depends significantly on the application. In the simplest case, there is no background resharding at all - instead, the process expires old keys after a period of time, thus guaranteeing that eventually all keys not accessed for a sufficiently long time are deleted. Another option is to never stop background resharding, and run it similarly to an expiration daemon, iterating over all existing keys in an infinite loop and checking whether or not they need to be moved or deleted. From an implementation perspective, the background resharding logic is not much different from the proxy logic, i.e. it needs to "touch" every existing key on the shard, check the new formula for it and move it if necessary (a sketch of such a loop is given after this list).
- Start the new shards. It's important to only start new shards after all servers are brought up to date with the new configuration, to avoid cases where some keys have already been moved to the new shards, but not all proxies are aware of their existence yet.
- Wait for resharding to complete. Resharding is complete when the background fiber running the resharding procedure has exited successfully -- thus a simple status check is sufficient. This step is optional - if resharding is lazy and is part of key expiration, it can just run in the background at all times.
- Drop the old sharding formula from each proxy. This step is optional, as it only affects performance. It is best done with the help of a deployment management tool as well.
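As referenced in the steps above, here is a minimal sketch of how the shard list, the formula list and the background resharding loop might be laid out in Lua. All names (shardcfg, hash, local_keys, move_key, self_shard_id) are illustrative, and the box.fiber.sleep call is an assumption based on the 1.5-era Lua API; the real layout depends on the application.

```lua
-- shardcfg.lua: the module edited in the VCS and reloaded on every host
-- during deployment. All names are illustrative.

local shardcfg = {
    shards = {
        { host = '10.0.0.1', port = 33013 },
        { host = '10.0.0.2', port = 33013 },
        { host = '10.0.0.3', port = 33013 },  -- added by the resharding
        { host = '10.0.0.4', port = 33013 },  -- added by the resharding
    },
    -- Newest formula first; the old one is dropped once resharding is over.
    formulae = {
        function(key) return hash(key) % 4 + 1 end,
        function(key) return hash(key) % 2 + 1 end,
    },
}

-- Background resharding on a shard: walk the local keys and move the
-- ones that the newest formula assigns to a different shard.
-- 'local_keys', 'move_key' and 'self_shard_id' are placeholders.
local function reshard()
    for _, tuple in local_keys() do
        local key = tuple[0]                 -- first field assumed to be the key
        local owner = shardcfg.formulae[1](key)
        if owner ~= self_shard_id then
            move_key(key, owner)
        end
        box.fiber.sleep(0)                   -- yield so the loop does not block the server
    end
end

return { cfg = shardcfg, reshard = reshard }
```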
Implementation of the proxy
A proxy server configuration (spaces and indexes) must be identical to shard configuration, with the exception that a proxy should not store any data.
Tarantool/Box supports a system-wide INSTEAD OF trigger, and a Lua stored procedure executed from the trigger performs the job of routing the request to the right shard. The format of the binary protocol, which explicitly states the spaces, indexes and keys used in the request, makes this particularly easy to do.
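A sketch of what the trigger body might look like follows, assuming the request object exposes the type, space, index, key and tuple decoded from the binary protocol; 'find_shard' and 'connection' are the helpers sketched earlier, and the registration call at the end is purely hypothetical, since the document does not name it.

```lua
-- Routing performed by the stored procedure behind the INSTEAD OF
-- trigger. The 'request' fields and the registration call are assumptions.

local function route_request(request)
    local id = find_shard(request.key)
    local conn = connection(id)
    if request.type == 'SELECT' then
        return conn:select(request.space, request.index, request.key)
    elseif request.type == 'INSERT' then
        return conn:insert(request.space, unpack(request.tuple))
    elseif request.type == 'DELETE' then
        return conn:delete(request.space, request.key)
    end
    -- UPDATE and CALL are routed along the same lines.
end

-- Hypothetical registration of the system-wide INSTEAD OF trigger:
-- box.set_instead_of_trigger(route_request)
```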
Moving an individual key
If a key is not found on a shard according to the latest formula, it needs to be moved. Moving a key from one shard to another may incur a race condition when two concurrent requests for the same key arrive from two proxies, initiating two concurrent moves. One of the proxies is bound to overwrite the results of the other proxy.
To avoid this condition, before moving a key, the proxy updates the destination shard with a stub key (a key which has no corresponding tuple, or which has an appropriate flag indicating that it's locked for reads and updates).
If another proxy meets a stub key, it waits for the stub to get replaced with a value, rather than initiating a concurrent move. Waiting is done using the box.ipc.channel API.
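Below is a minimal sketch of the move with a stub and a per-key channel. The helpers (write_stub, read_key, write_key, delete_key) are placeholders, and the channel here only coordinates fibers within one proxy; the stub itself is what keeps other proxies from starting a second move.

```lua
-- Moving a key with a stub plus a per-key channel so that concurrent
-- fibers of the same proxy wait instead of starting a second move.

local moves = {}   -- key -> channel, present while a move is in progress

local function move_key(key, from_shard, to_shard)
    local ch = moves[key]
    if ch ~= nil then
        -- Another fiber is already moving this key: wait for it.
        ch:get()
        ch:put(true)   -- pass the token on to any other waiter
        return
    end
    ch = box.ipc.channel(1)
    moves[key] = ch
    write_stub(to_shard, key)            -- lock the key on the destination
    local tuple = read_key(from_shard, key)
    write_key(to_shard, key, tuple)      -- replace the stub with the value
    delete_key(from_shard, key)          -- drop the original
    moves[key] = nil
    ch:put(true)                         -- wake up the waiters
end
```

A proxy that finds a stub written by another proxy would, in the same spirit, simply re-read the key until the stub is replaced with a value.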
Proxying queries on the secondary key
Proxying CALL statements
Secondary keys and stored procedure CALLs are notoriously difficult to shard. The first version of the proxy does not need to support sharded execution of these calls.
The following options are on the table for the future:
- let the user define sharding principles for these, using Lua
- for example, for CALL, the shard id can be evaluated based on the CALL arguments. This still won't work while resharding is in progress, so, during resharding, the proxy may have to execute the procedure locally, routing the individual requests made inside the procedure.
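A sketch of what such user-defined CALL routing could look like; the rule table, 'resharding_in_progress' and 'execute_locally' are illustrative names, not an existing API.

```lua
-- User-defined CALL routing: the application maps a procedure name to a
-- function that extracts the sharding key from the call arguments.

local call_rules = {
    -- For 'user_update(uid, ...)' the first argument is the sharding key.
    user_update = function(args) return args[1] end,
}

local function route_call(proc, args)
    local rule = call_rules[proc]
    if rule == nil or resharding_in_progress() then
        -- No rule, or resharding is running: run the procedure on the
        -- proxy itself and let the proxy route the individual requests
        -- the procedure makes.
        return execute_locally(proc, args)
    end
    local id = find_shard(rule(args))
    return connection(id):call(proc, unpack(args))
end
```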
Future development steps
Manual maintenance of configuration on each shard is very error-prone. In the future, domain-based replication needs to be used to distribute information about new shards across the existing configuration instead of a VCS.
Domain-based replication
In domain-based replication, a replica subscribes to a domain, not to the entire data set contained on the master. For sharding, the spaces which must belong to the global domain include:
- information about existing shards
- superspaces, i.e. spaces containing information about user-data spaces.
When all spaces which belong to a domain are replicated automatically, addition of a new shard is done by simply modifying the space with meta-information.
Open issues
QQ. How to best wait until the status check reports that resharding is complete? Monitor server logs? Send an email once done?
QQ. Should the instead-of trigger be space-wide, not system-wide?
QQ. It could be useful to specify a shard id in the request explicitly.
QQ. Moving data key-by-key is not always optimal. With master-master replication, moving entire ranges of keys may be a better idea. Alternatively, each vshard could map to its own space. In that case it becomes easy to bring a single space down, copy it to the new shard, and then bring it back online. What's important is not how resharding is done specifically, but that it may be useful to allow shutting down a range of keys in the proxy, so that it can be moved without a hassle.