Full performance info


visualisation + database

The current state of performance

Currently, Grafana uses data from an InfluxDB data source.

The Grafana machine is hosted in VKCS too: https://37.139.41.116/grafana/. Both machines are connected to the same network: grafana_performance_network.

To log in to the InfluxDB machine, use: ssh -i grafana-performance-nd6XDTOq.pem [email protected]

The same applies to the Grafana boxes.

The .pem key and credentials can be found in 1Password.

bench-run

How we use bench-run:

numactl --membind=1 --cpunodebind=1 --physcpubind=6,7,8,9,10,11 \
    sysbench $test --db-driver=tarantool --threads=200 \
    --time=TIME --warmup-time=5 run

We pin the run to specific CPUs with --physcpubind=6,7,8,9,10,11, where 6-11 are the CPU IDs on the host where we execute the performance test:

[host tarantool]# lscpu
…
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11

Sysbench

We use bench-run with 200 threads.

The --time option defines how long the test runs (making selects/updates and so on, depending on the test). There are different transaction types (read and write), and sysbench counts read and write transactions separately. Consequently, transactions per second = total transactions / total test time.
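For example, if a run reports 240000 total transactions over a 20-second test, the result is 240000 / 20 = 12000 transactions per second (the numbers here are made up for illustration).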

Sysbench includes 11 SQL-based tests:

ARRAY_TESTS=(
    "oltp_read_only"
    "oltp_write_only"
    "oltp_read_write"
    "oltp_update_index"
    "oltp_update_non_index"
    "oltp_insert"
    "oltp_delete"
    "oltp_point_select"
    "select_random_points"
    "select_random_ranges"
    "bulk_insert"
)

Most of them are based on the functions from src/lua/oltp_common.lua. Before execution, the tests prepare tables, data, and so on, according to the database driver (set via --db-driver). Preparation time is not included in the perf test results.
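For orientation, a sysbench Lua test script boils down to the following shape. This is a simplified sketch, not the actual oltp_common.lua code; it assumes sysbench's standard sbtest1 table already exists after the prepare step:

-- simplified sketch of a sysbench Lua test script
function thread_init()
    -- each worker thread opens its own connection via the chosen --db-driver
    drv = sysbench.sql.driver()
    con = drv:connect()
end

function event()
    -- one event is one transaction in the resulting counters
    local id = sysbench.rand.default(1, 1000)
    con:query("SELECT c FROM sbtest1 WHERE id = " .. id)
end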

oltp_read_only includes simple, sum, order, and distinct selects.

oltp_write_only includes index_updates, non_index_updates, and deletes+inserts.

oltp_read_write is a combination of both previous tests: simple, sum, order, and distinct selects plus index_updates, non_index_updates, and deletes+inserts.

oltp_update_index includes index_updates.

oltp_update_non_index includes non_index_updates.

oltp_insert includes inserts. (NB! For Tarantool we use a non-standard sysbench insert.)

oltp_delete includes deletes+inserts.

oltp_point_select includes point selects.

select_random_points includes random point selects.

select_random_ranges includes random range selects.

bulk_insert is turned off.

So, during Tarantool perf testing, the 10 tests (all except bulk_insert) each run for ~20 seconds with 10 reruns (the rerun count is a bench-run setting).

As mentioned above, the database driver can be passed via the --db-driver option. The Tarantool driver exists only on the tarantool-integration-draft branch of the tarantool/sysbench repo. Tarantool is started before sysbench is executed in the bench-run script run.sh, which calls run_tnt.sh to set up Tarantool on localhost:3301.
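As a rough sketch, the instance file that run_tnt.sh launches amounts to something like this (the actual bench-run settings, such as memtx memory size or WAL options, may differ):

-- minimal Tarantool instance listening where sysbench expects it
box.cfg{
    listen = 3301,
}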

Cbench

Cbench consists of an infrastructure shell coded in Lua that drives C-based functions. At the start, it generates random encoded strings and/or unsigned integers, according to the test. There are different tests for num and str keys and their pairs: (num, num), (num, str), (str, num), (str, str). The tests are also divided by engine and, consequently, by engine-specific options.

For memtx we can test different index types (there are tree and hash indexes now). The number of keys to benchmark: count = 1000000.

For vinyl, the tests run with wal_mode = fsync and wal_mode = write, with count = 500 (a bench-run option) benchmarked keys.

Both engines repeat their performance tests twice.
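In Tarantool terms, these engine and index-type variants differ only in schema options. An illustrative sketch (the space name and field layout here are made up, not cbench's actual schema):

-- memtx space with a tree index; use type = 'HASH' for the hash variant
local s = box.schema.space.create('bench', {engine = 'memtx'})
s:create_index('pk', {type = 'TREE', parts = {{1, 'unsigned'}}})

-- vinyl supports only tree indexes; wal_mode is a global setting,
-- e.g. box.cfg{wal_mode = 'write'} or box.cfg{wal_mode = 'fsync'}
local v = box.schema.space.create('bench_vinyl', {engine = 'vinyl'})
v:create_index('pk', {type = 'TREE', parts = {{1, 'unsigned'}}})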

The list of tests:

tests = {
  'replaces',
  'selects',
  'selrepl',  -- select + replace
  'updates',
  'deletes'
}

These tests execute C-based functions from Tarantool: box_index_get, box_replace, box_update, box_delete.

Linkbench

Linkbench builds a graph and then makes requests from different threads. Please read the README.md.

Linkbench has a properties file; in the case of Tarantool it is called LinkConfigTarantool.properties. During bench-run preparation we change/set some of the properties:

requesters = 1
requests = 2000000

We also change the base property in FBWorkload.properties: maxid1 = 5000000

Our linkbench testing is based on engine = 'vinyl' and type_idx = 'tree'. Although linkbench can make requests from multiple threads, testing currently uses only one thread (provided by requesters = 1).

Linkbench provides the -l option for loading and the -r option for making requests. These steps are separated in bench-run and executed sequentially.

Linkbench uses its database driver mechanism; Tarantool support is implemented by a Java class that maps the driver methods onto Lua functions:

private static final String METHOD_ADD_LINK = "linkbench.insert_link";
private static final String METHOD_ADD_BULK_LINKS = "linkbench.insert_links";
private static final String METHOD_GET_LINK = "linkbench.get_link";
private static final String METHOD_MULTI_GET_LINK = "linkbench.multi_get_link";
private static final String METHOD_DELETE_LINK = "linkbench.delete_link";
private static final String METHOD_GET_LINK_LIST = "linkbench.get_link_list";
private static final String METHOD_GET_LINK_LIST_TIME = "linkbench.get_link_list_time";
private static final String METHOD_COUNT_LINKS = "linkbench.count_links";
private static final String METHOD_ADD_COUNTS = "linkbench.add_counts";
private static final String METHOD_ADD_BULK_NODES = "linkbench.add_bulk_nodes";
private static final String METHOD_GET_NODE = "linkbench.get_node";
private static final String METHOD_UPDATE_NODE = "linkbench.update_node";
private static final String METHOD_DELETE_NODE = "linkbench.delete_node";
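Each constant names a Lua function that the Java driver invokes via a remote call. A hypothetical sketch of the server side of one such call (the space name and key layout are assumptions for illustration, not the real implementation):

-- hypothetical server-side counterpart of METHOD_GET_LINK
local linkbench = {}

function linkbench.get_link(id1, link_type, id2)
    -- a 'links' space keyed by (id1, link_type, id2) is assumed here
    return box.space.links:get({id1, link_type, id2})
end

rawset(_G, 'linkbench', linkbench)  -- expose the module for remote calls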

Linkbench reports percentiles in the following structure:

GET_LINKS_LIST count = 12678653  p25 = [0.7,0.8]ms  p50 = [1,2]ms
               p75 = [1,2]ms  p95 = [10,11]ms  p99 = [15,16]ms
               max = 2064.476ms  mean = 2.427ms

After the requests are completed, linkbench shows the following info:

INFO 2021-09-30 15:38:34,631 [Thread-0]: ThreadID = 0 total requests = 2000000 requests/second = 1203 found = 7179 not found = 19402 history queries = 480/1014695
INFO 2021-09-30 15:38:34,634 [main]: ADD_NODE count = 51363  p25 = [0.1,0.2]ms  p50 = [0.1,0.2]ms  p75 = [0.1,0.2]ms  p95 = [0.2,0.3]ms  p99 = [3,4]ms  max = 14.981ms  mean = 0.228ms
INFO 2021-09-30 15:38:34,635 [main]: UPDATE_NODE count = 147323  p25 = [0.1,0.2]ms  p50 = [0.2,0.3]ms  p75 = [0.2,0.3]ms  p95 = [0.4,0.5]ms  p99 = [4,5]ms  max = 28.076ms  mean = 0.358ms
INFO 2021-09-30 15:38:34,635 [main]: DELETE_NODE count = 20243  p25 = [0.1,0.2]ms  p50 = [0.2,0.3]ms  p75 = [0.2,0.3]ms  p95 = [0.4,0.5]ms  p99 = [4,5]ms  max = 24.185ms  mean = 0.367ms
INFO 2021-09-30 15:38:34,635 [main]: GET_NODE count = 258153  p25 = [0.1,0.2]ms  p50 = [0.1,0.2]ms  p75 = [0.1,0.2]ms  p95 = [0.3,0.4]ms  p99 = [3,4]ms  max = 46.07ms  mean = 0.274ms
INFO 2021-09-30 15:38:34,636 [main]: ADD_LINK count = 179229  p25 = [0.1,0.2]ms  p50 = [0.2,0.3]ms  p75 = [0.2,0.3]ms  p95 = [0.3,0.4]ms  p99 = [4,5]ms  max = 18.727ms  mean = 0.31ms
INFO 2021-09-30 15:38:34,636 [main]: DELETE_LINK count = 60449  p25 = [0.1,0.2]ms  p50 = [0.1,0.2]ms  p75 = [0.2,0.3]ms  p95 = [0.3,0.4]ms  p99 = [3,4]ms  max = 15.768ms  mean = 0.255ms
INFO 2021-09-30 15:38:34,636 [main]: UPDATE_LINK count = 160440  p25 = [0.1,0.2]ms  p50 = [0.2,0.3]ms  p75 = [0.2,0.3]ms  p95 = [0.3,0.4]ms  p99 = [3,4]ms  max = 20.695ms  mean = 0.306ms
INFO 2021-09-30 15:38:34,636 [main]: COUNT_LINK count = 97566  p25 = [0.1,0.2]ms  p50 = [0.1,0.2]ms  p75 = [0.1,0.2]ms  p95 = [0.2,0.3]ms  p99 = [3,4]ms  max = 16.465ms  mean = 0.207ms
INFO 2021-09-30 15:38:34,637 [main]: MULTIGET_LINK count = 10539  p25 = [0.1,0.2]ms  p50 = [0.1,0.2]ms  p75 = [0.4,0.5]ms  p95 = [1,2]ms  p99 = [7,8]ms  max = 26815.559ms  mean = 3.465ms
INFO 2021-09-30 15:38:34,637 [main]: GET_LINKS_LIST count = 1014695  p25 = [0.1,0.2]ms  p50 = [0.1,0.2]ms  p75 = [0.2,0.3]ms  p95 = [0.5,0.6]ms  p99 = [4,5]ms  max = 1198.111ms  mean = 0.724ms
INFO 2021-09-30 15:38:34,637 [main]: REQUEST PHASE COMPLETED. 2000000 requests done in 1662 seconds. Requests/second = 1203

The final result, which we write to the linkbench.ssd_result.txt file, is Requests/second.

TPC-C

TPC-C includes different OLTP transactions. More info can be found in the original TPC-C benchmark specification.

The first step is creating the SQL tables via create_table.lua. (NB! A few years ago box.sql.execute was changed to box.execute, and TPC-C perf testing wasn't fixed accordingly, so bench-run applies the following awkward correction: sed 's#box.sql#box#g' -i /opt/tpcc/create_table.lua)
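In other words, the sed rewrites the old API calls into the current ones; the table definition below is only an illustration, not the actual TPC-C schema:

-- before (pre-2.x Tarantool API), as create_table.lua still has it:
box.sql.execute([[CREATE TABLE warehouse (w_id INT PRIMARY KEY)]])

-- after the sed correction (current API):
box.execute([[CREATE TABLE warehouse (w_id INT PRIMARY KEY)]])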

Then, a bench-run script loads the data part by part (the -l option selects the part: 1 = ITEMS, 2 = WAREHOUSE, 3 = CUSTOMER, 4 = ORDERS). If no part is provided, all parts are loaded:

LoadItems();  - random items
LoadWare();   - generate warehouse data
LoadCust();   - IDs connected with warehouses and the DIST_PER_WARE constant
LoadOrd();    - init random orders

Several threads make different selects/updates/deletes/inserts. If the response-time requirement is not met (more than 90% of responses must fit the time limit), the log contains NG instead of OK for New-Order, Payment, Order-Status, Delivery, and Stock-Level.

The result is transactions per minute (tpmC).

At the current time, this benchmark is not working.

YCSB

Full information about the benchmark can be found here.

The most important part of the YCSB benchmark is the ability to run different configurations based on the ratio of read/update/insert transactions.

The following workloads are used now:

a: Read/update ratio: 50/50
b: Read/update ratio: 95/5
c: Read/update ratio: 100/0
d: Read/update/insert ratio: 95/0/5
e: Scan/insert ratio: 95/5
f: Read/read-modify-write ratio: 50/50

Other benchmarks can be created easily by setting the needed proportions via the config params.
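For example, a custom workload is just a properties file with the desired proportions. The values below are hypothetical; see the YCSB core workload files for the full parameter set:

# hypothetical 80/20 read/update workload
recordcount=1000000
operationcount=1000000
readproportion=0.8
updateproportion=0.2
insertproportion=0
scanproportion=0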

The first step is the load phase; the second is running the bench.

It executes runs = 1 times (a bench-run setting). Full results are presented in the following format:

[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 338
[UPDATE], Return=0, 50396
[UPDATE], 0, 0.10264765784114054
[UPDATE], 2000, 0.026989343690867442
[UPDATE], 4000, 0.0352882703777336
[UPDATE], 6000, 0.004238958990536277
[UPDATE], 8000, 0.052813085033008175
[UPDATE], 10000, 0.0
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 230
[READ], Return=0, 49604
[READ], 0, 0.08997245741099663
[READ], 2000, 0.02207505518763797
[READ], 4000, 0.03188493260913297
[READ], 6000, 0.004869141813755326
[READ], 8000, 0.04355329949238579
[READ], 10000, 0.005405405405405406

But bench-run keeps only one line per configuration as the result: Throughput(ops/sec), 9923.58836955443.

The tests are also divided by index type. Now we have tests for tree and hash indexes.

nosqlbench

There is a good README.md in the repo.

All configuration is written in the config file nosqlbench.conf. In our workflow we change some of the values:

port 3301
benchmark 'time_limit' (run the benchmark until the time limit is reached)
time_limit 2000
request_batch_count 10 (the number of requests per query)

Statistics are provided in the following format.

TOTAL RPS STATISTICS:

.----------.---------------.---------------.---------------.
|   type   |    minimal    |    average    |    maximum    |
.----------.---------------.---------------.---------------.
| read/s   |    %7d    |    %7d    |    %8d   |
| write/s  |    %7d    |    %7d    |    %8d   |
| req/s    |    %7d    |    %7d    |    %8d   |
'----------'---------------'---------------'---------------'

Percentiles are also included in the default report.

.---------.---------.---------.----------------.----------------.------------.------------.------------.
|  req/s  | read/s  | write/s |   min lat. %s  |   max lat. %s  |     90%<   |     99%<   |    99.9%<  |
'---------'---------'---------'----------------'----------------'------------'------------'------------'

Indexes can be defined by the user. Now we use hash and tree.

The number of threads is 10; they are created at_once.

TPC-H

TBC