Full performance info - tsafin/tarantool GitHub Wiki
Table of contents
visualisation + database
The current state of performance
Currently, Grafana uses data from an InfluxDB data source.
The Grafana machine is also hosted in VKCS: https://37.139.41.116/grafana/. Both machines are connected to the same network: grafana_performance_network.
To log in to the InfluxDB machine, use ssh -i grafana-performance-nd6XDTOq.pem [email protected]
The same applies to the Grafana boxes.
The .pem key and credentials can be found in 1Password.
bench-run
How we use bench-run:

```
numactl --membind=1 --cpunodebind=1 --physcpubind=6,7,8,9,10,11 \
    sysbench $test --db-driver=tarantool --threads=200 \
    --time=TIME --warmup-time=5 run
```
We pin the run to specific CPUs with --physcpubind=6,7,8,9,10,11, where 6-11 are the CPU IDs on the host where the performance test is executed:

```
[host tarantool]# lscpu
…
NUMA node0 CPU(s): 0-5
NUMA node1 CPU(s): 6-11
```
Sysbench
We run bench-run with 200 threads.
The time option defines how long the test runs (performing selects/updates and so on, depending on the test). There are different transaction types (read and write), and sysbench counts read and write transactions separately. Consequently, transactions per second = total transactions / total test time.
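As a toy illustration of that formula (all numbers below are invented, not real results):

```python
# Sysbench counts read and write transactions separately; the reported
# throughput is simply total transactions divided by total test time.
# The counts and time here are made up for illustration only.
read_tx = 150_000    # hypothetical read transactions
write_tx = 50_000    # hypothetical write transactions
test_time_sec = 20.0

tps = (read_tx + write_tx) / test_time_sec
print(tps)  # 10000.0
```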
Sysbench includes 11 SQL-based tests:

```
ARRAY_TESTS=(
    "oltp_read_only"
    "oltp_write_only"
    "oltp_read_write"
    "oltp_update_index"
    "oltp_update_non_index"
    "oltp_insert"
    "oltp_delete"
    "oltp_point_select"
    "select_random_points"
    "select_random_ranges"
    "bulk_insert"
)
```
Most of them are based on functions from src/lua/oltp_common.lua. Before execution, the tests prepare tables, selects, etc., according to the database driver (set via --db-driver). Preparation time is not included in the reported perf test results.
- oltp_read_only: simple, sum, order, and distinct selects
- oltp_write_only: index_updates, non_index_updates, deletes+inserts
- oltp_read_write: a combination of the two previous tests (simple, sum, order, and distinct selects plus index_updates, non_index_updates, deletes+inserts)
- oltp_update_index: index_updates
- oltp_update_non_index: non_index_updates
- oltp_insert: inserts (NB! For Tarantool we use a non-standard sysbench insert)
- oltp_delete: deletes+inserts
- oltp_point_select: point selects
- select_random_points: random points
- select_random_ranges: random ranges
- bulk_insert: turned off
So, during Tarantool perf testing, 10 tests (all except bulk_insert) each run for ~20 seconds with 10 reruns (the rerun count is a bench-run setting).
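For a rough sense of scale, the total pure run time of one bench-run pass follows from those numbers (preparation and warmup time are not counted here):

```python
# 10 tests (bulk_insert excluded), ~20 seconds each, 10 reruns each.
tests = 10
run_time_sec = 20
reruns = 10

total_sec = tests * run_time_sec * reruns
print(total_sec)  # 2000 seconds of pure run time, i.e. roughly half an hour
```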
As mentioned above, the database driver is passed via the --db-driver option. The Tarantool driver exists only on the tarantool-integration-draft branch of the tarantool/sysbench repo.
Tarantool is started before sysbench is executed by the bench-run script run.sh, which calls run_tnt.sh to set up Tarantool on localhost:3301.
Cbench
Cbench consists of an infrastructure shell written in Lua that calls C-based functions. At the start it generates random encoded strings and/or unsigned integers, depending on the test. There are separate tests for num and str keys and for their pairs: (num, num), (num, str), (str, num), (str, str). The tests are also divided by engine and, consequently, by engine-specific options.
For memtx we can test different index types (currently tree and hash indexes). The number of keys to benchmark is count = 1000000.
For vinyl, the tests run with wal_mode = fsync and wal_mode = write, with count = 500 (a bench-run option) benchmarked keys.
Both engines repeat their performance tests twice.
```lua
tests = {
    'replaces',
    'selects',
    'selrepl',  -- select + replace
    'updates',
    'deletes'
}
```

These tests execute C-based functions from Tarantool: box_index_get, box_replace, box_update, box_delete.
Linkbench
Linkbench builds a graph and then makes requests to it from different threads. Please read the README.md for details.
Linkbench has a property file; in the case of Tarantool it is called LinkConfigTarantool.properties.
During bench-run preparation we change/set some of the properties:

```
requesters = 1
requests = 2000000
```

We also change the base property in FBWorkload.properties:

```
maxid1 = 5000000
```
Our linkbench testing uses engine='vinyl' and type_idx = 'tree'. Although linkbench can make requests from different threads, testing currently uses only one thread (this is set by requesters = 1).
Linkbench provides the -l option for loading and the -r option for making requests. These steps are separated in bench-run and executed sequentially.
Linkbench uses its driver mechanism; Tarantool support is implemented by a Java class that maps operations to Lua functions:
```java
private static final String METHOD_ADD_LINK = "linkbench.insert_link";
private static final String METHOD_ADD_BULK_LINKS = "linkbench.insert_links";
private static final String METHOD_GET_LINK = "linkbench.get_link";
private static final String METHOD_MULTI_GET_LINK = "linkbench.multi_get_link";
private static final String METHOD_DELETE_LINK = "linkbench.delete_link";
private static final String METHOD_GET_LINK_LIST = "linkbench.get_link_list";
private static final String METHOD_GET_LINK_LIST_TIME = "linkbench.get_link_list_time";
private static final String METHOD_COUNT_LINKS = "linkbench.count_links";
private static final String METHOD_ADD_COUNTS = "linkbench.add_counts";
private static final String METHOD_ADD_BULK_NODES = "linkbench.add_bulk_nodes";
private static final String METHOD_GET_NODE = "linkbench.get_node";
private static final String METHOD_UPDATE_NODE = "linkbench.update_node";
private static final String METHOD_DELETE_NODE = "linkbench.delete_node";
```
Linkbench reports percentiles in the following structure (each percentile is given as a bucket interval; e.g. p25 = [0.7,0.8]ms means the 25th-percentile latency falls between 0.7 and 0.8 ms):

```
GET_LINKS_LIST count = 12678653 p25 = [0.7,0.8]ms p50 = [1,2]ms
p75 = [1,2]ms p95 = [10,11]ms p99 = [15,16]ms
max = 2064.476ms mean = 2.427ms
```
After the requests are completed, linkbench prints the following info:

```
INFO 2021-09-30 15:38:34,631 [Thread-0]: ThreadID = 0 total requests = 2000000 requests/second = 1203 found = 7179 not found = 19402 history queries = 480/1014695
INFO 2021-09-30 15:38:34,634 [main]: ADD_NODE count = 51363 p25 = [0.1,0.2]ms p50 = [0.1,0.2]ms p75 = [0.1,0.2]ms p95 = [0.2,0.3]ms p99 = [3,4]ms max = 14.981ms mean = 0.228ms
INFO 2021-09-30 15:38:34,635 [main]: UPDATE_NODE count = 147323 p25 = [0.1,0.2]ms p50 = [0.2,0.3]ms p75 = [0.2,0.3]ms p95 = [0.4,0.5]ms p99 = [4,5]ms max = 28.076ms mean = 0.358ms
INFO 2021-09-30 15:38:34,635 [main]: DELETE_NODE count = 20243 p25 = [0.1,0.2]ms p50 = [0.2,0.3]ms p75 = [0.2,0.3]ms p95 = [0.4,0.5]ms p99 = [4,5]ms max = 24.185ms mean = 0.367ms
INFO 2021-09-30 15:38:34,635 [main]: GET_NODE count = 258153 p25 = [0.1,0.2]ms p50 = [0.1,0.2]ms p75 = [0.1,0.2]ms p95 = [0.3,0.4]ms p99 = [3,4]ms max = 46.07ms mean = 0.274ms
INFO 2021-09-30 15:38:34,636 [main]: ADD_LINK count = 179229 p25 = [0.1,0.2]ms p50 = [0.2,0.3]ms p75 = [0.2,0.3]ms p95 = [0.3,0.4]ms p99 = [4,5]ms max = 18.727ms mean = 0.31ms
INFO 2021-09-30 15:38:34,636 [main]: DELETE_LINK count = 60449 p25 = [0.1,0.2]ms p50 = [0.1,0.2]ms p75 = [0.2,0.3]ms p95 = [0.3,0.4]ms p99 = [3,4]ms max = 15.768ms mean = 0.255ms
INFO 2021-09-30 15:38:34,636 [main]: UPDATE_LINK count = 160440 p25 = [0.1,0.2]ms p50 = [0.2,0.3]ms p75 = [0.2,0.3]ms p95 = [0.3,0.4]ms p99 = [3,4]ms max = 20.695ms mean = 0.306ms
INFO 2021-09-30 15:38:34,636 [main]: COUNT_LINK count = 97566 p25 = [0.1,0.2]ms p50 = [0.1,0.2]ms p75 = [0.1,0.2]ms p95 = [0.2,0.3]ms p99 = [3,4]ms max = 16.465ms mean = 0.207ms
INFO 2021-09-30 15:38:34,637 [main]: MULTIGET_LINK count = 10539 p25 = [0.1,0.2]ms p50 = [0.1,0.2]ms p75 = [0.4,0.5]ms p95 = [1,2]ms p99 = [7,8]ms max = 26815.559ms mean = 3.465ms
INFO 2021-09-30 15:38:34,637 [main]: GET_LINKS_LIST count = 1014695 p25 = [0.1,0.2]ms p50 = [0.1,0.2]ms p75 = [0.2,0.3]ms p95 = [0.5,0.6]ms p99 = [4,5]ms max = 1198.111ms mean = 0.724ms
INFO 2021-09-30 15:38:34,637 [main]: REQUEST PHASE COMPLETED. 2000000 requests done in 1662 seconds. Requests/second = 1203
```
The final result, which we write to the linkbench.ssd_result.txt file, is Requests/second.
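Since only the Requests/second value is kept, extracting it from the log can be sketched like this (the log line is taken from the sample above; the parsing code is illustrative, not part of bench-run):

```python
import re

# The final summary line from the linkbench log shown above.
line = ("INFO 2021-09-30 15:38:34,637 [main]: REQUEST PHASE COMPLETED. "
        "2000000 requests done in 1662 seconds. Requests/second = 1203")

# Pull out the integer after "Requests/second = ".
match = re.search(r"Requests/second = (\d+)", line)
if match:
    print(int(match.group(1)))  # 1203
```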
TPC-C
TPC-C includes different OLTP transactions. More info can be found in the original TPC-C benchmark specification.
The first step is creating the SQL tables via create_table.lua. (NB! A few years ago box.sql.execute was changed to box.execute; the TPC-C perf testing wasn't fixed for it, so bench-run applies the following awkward correction.)

```
sed 's#box.sql#box#g' -i /opt/tpcc/create_table.lua
```
Then, a bench-run script loads the data part by part ([part]: 1 = ITEMS, 2 = WAREHOUSE, 3 = CUSTOMER, 4 = ORDERS; selected by the -l option). If no part is provided, all parts are loaded:

```
LoadItems();  -- random items
LoadWare();   -- generate warehouse data
LoadCust();   -- ids connected with warehouses and the DIST_PER_WARE constant
LoadOrd();    -- init random orders
```
Several threads perform various selects/updates/deletes/inserts. If the response-time requirement is not met (more than 90% of responses must be within the limit), the log shows NG instead of OK for New-Order, Payment, Order-Status, Delivery, and Stock-Level.
The result is transactions per minute (tpmC).
At the moment, this benchmark is not working.
YCSB
Full information about the benchmark can be found here.
The most important feature of the YCSB benchmark is the ability to run different configurations based on the ratio of read/update/insert transactions.
The following workloads are used now:
- a: Read/update ratio: 50/50
- b: Read/update ratio: 95/5
- c: Read/update ratio: 100/0
- d: Read/update/insert ratio: 95/0/5
- e: Scan/insert ratio: 95/5
- f: Read/read-modify-write ratio: 50/50
Other benchmarks can easily be created by setting the needed proportions in the config params.
The first step is to load the data; the second is to run the benchmark. It executes runs=1 times. Full results can be presented in the following format:
```
[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 338
[UPDATE], Return=0, 50396
[UPDATE], 0, 0.10264765784114054
[UPDATE], 2000, 0.026989343690867442
[UPDATE], 4000, 0.0352882703777336
[UPDATE], 6000, 0.004238958990536277
[UPDATE], 8000, 0.052813085033008175
[UPDATE], 10000, 0.0
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 230
[READ], Return=0, 49604
[READ], 0, 0.08997245741099663
[READ], 2000, 0.02207505518763797
[READ], 4000, 0.03188493260913297
[READ], 6000, 0.004869141813755326
[READ], 8000, 0.04355329949238579
[READ], 10000, 0.005405405405405406
```
But for each configuration bench-run reports only the following line: Throughput(ops/sec), 9923.58836955443.
The tests are also divided by index type; currently we have tests for tree and hash indices.
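Pulling that single Throughput line out of the full YCSB output can be sketched as follows (the snippet is illustrative, not the actual bench-run code; the sample format is taken from the output above):

```python
# YCSB prints CSV-like lines; here we keep only the overall throughput.
ycsb_output = """\
[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
"""

for line in ycsb_output.splitlines():
    fields = [f.strip() for f in line.split(",")]
    if fields[:2] == ["[OVERALL]", "Throughput(ops/sec)"]:
        print(float(fields[2]))  # 9923.58836955443
```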
nosqlbench
There is a good README.md in the repo.
All configuration is written in the config file nosqlbench.conf. In our workflow we change some of the settings:

- port 3301
- benchmark 'time_limit' (the benchmark runs until its time limit expires)
- time_limit 2000
- request_batch_count 10 (the number of requests per query)
Statistics are provided in the following format:

```
TOTAL RPS STATISTICS:
.----------.---------------.---------------.---------------.
| type     | minimal       | average       | maximum       |
.----------.---------------.---------------.---------------.
| read/s   | %7d           | %7d           | %8d           |
| write/s  | %7d           | %7d           | %8d           |
| req/s    | %7d           | %7d           | %8d           |
'----------'---------------'---------------'---------------'
```
Percentiles are also included in the default report:

```
.---------.---------.---------.----------------.----------------.------------.------------.------------.
| req/s   | read/s  | write/s | min lat. %s    | max lat. %s    | 90%%<      | 99%%<      | 99.9%%<    |
.---------.---------.---------.----------------.----------------.------------.------------.------------.
```
Indices can be defined by the user; currently we use hash and tree. The number of threads is 10, and they are created at_once.
TPC-H
TBC