Performance Evaluation: Scalability in Terms of CPU Number

Table of Contents

1. Introduction

2. IoTDB Benchmark

3. Experiment settings

3.1. Hardware

3.2. IoTDB Benchmark Settings

4. Throughput

4.1. Write Throughput

4.2. Read Throughput

5. Response time

5.1. Write Response Time

5.2. Read Response Time

6. CPU, memory, IO

7. Conclusion

1. Introduction

This performance evaluation shows how TickTock scales with the number of CPU cores on a single host.

A few quick notes:

  • We use docker to run TSDBs.
  • We use IoTDB benchmark.
  • We use a mixed read-write workload scenario (Read : write = 1 : 9).

Please also be advised that performance is not the only aspect to consider when comparing TSDBs, though it may be the most important one. You may also need to consider ease of use, API adoption, reliability, community support, costs, etc. This report focuses on performance only.

2. IoTDB Benchmark

We use IoTDB-benchmark for performance evaluation. IoTDB-benchmark is developed by THULAB at Tsinghua University, Beijing. Please refer to the same description in our other performance evaluation reports.

3. Experiment Settings

3.1. Hardware

We run our experiments on an Ideapad Gaming laptop with the following specification:

  • CPU: AMD Ryzen 5 5600H, 6 cores / 12 hyper-threads, 3.3GHz
  • Memory: 20GB DDR4 3200MHz
  • Disk: 1TB 5400 RPM HDD
  • OS: Ubuntu 20.04.3 LTS

We run TickTock in a docker container, and we increase the container resources by 1 vCPU and 2GB of memory at a time (with docker options such as --cpuset-cpus 0-1 -m 4g), starting from (1 vCPU + 2GB) up to (4 vCPU + 8GB). For example, to run TickTock 0.4.0-beta with 2 vCPU + 4GB:

```
docker run -d --privileged --name ticktock -h ticktock -p 6182:6182 -p 6181:6181 --cpuset-cpus 0-1 -m 4g ytyou/ticktock:0.4.0-beta --tsdb.timestamp.resolution millisecond --tcp.buffer.size 10mb --http.listener.count 10 --tcp.listener.count 10 --http.responders.per.listener 1 --tcp.responders.per.listener 1
```

To avoid network congestion under high ingestion rates, we run the benchmark on the same laptop instead of on a separate machine.

3.2. IoTDB Benchmark Settings

We use a mixed read-write scenario with a read:write ratio of 1:9. In our experience, TSDBs in DevOps mostly handle writes sent by the machines being monitored; the query workload is relatively low, often even lower than 10%.

We use asynchronous writes (i.e., the [TCP protocol](https://github.com/ytyou/ticktock/wiki/Usage-Examples#11-support-three-protocols-tcp-udp-and-http)) in our tests.
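
To show what an asynchronous (TCP) write looks like, here is a minimal Python sketch. It assumes the TCP listener is the one on port 6181 mapped in the docker command above and that TickTock accepts OpenTSDB-style put lines over that connection (see the linked wiki page); the metric and tag names are made up for illustration.

```python
import socket
import time

# Assumptions (not from this report): the TCP listener is on port 6181 as
# mapped in the docker command above, and it accepts OpenTSDB-style "put"
# lines. Metric/tag names below are made up.
HOST, PORT = "localhost", 6181

def send_points(lines):
    """Send a batch of line-protocol writes over one TCP connection,
    without waiting for a per-point acknowledgement (asynchronous write)."""
    payload = "".join(lines).encode()
    with socket.create_connection((HOST, PORT)) as sock:
        sock.sendall(payload)

now_ms = int(time.time() * 1000)  # millisecond resolution, as configured above
batch = [
    f"put wind_farm.sensor{i} {now_ms} {20.5 + i} device=d_0\n"
    for i in range(10)  # 10 sensors per device, as in the benchmark settings
]
send_points(batch)
```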

As explained above, the IoTDB benchmark simulates a wind farm with a number of devices, each with multiple sensors (the sensor number is also the write batch size), whose data are sent by a number of clients. In our experiments, we use 200 clients. Each client is bound to 1 device with 10 sensors, so there are 2000 metrics (200 devices * 10 sensors). Our goal is to compare maximum throughput, so we require that at least one of the CPU/memory/IO resources is saturated or close to saturation. In all of our tests, CPU is the bottleneck and becomes saturated before the other resources.

4. Throughput

4.1. Write Throughput

The figure above shows that TickTock's write throughput increases as more CPUs and memory are added. Let's look at the scaling factor in detail.

| core + memory | 1 vCPU + 2GB | 2 vCPU + 4GB | 3 vCPU + 6GB | 4 vCPU + 8GB |
|---|---|---|---|---|
| throughput (data points/s) | 2,623,767 | 3,262,576 | 4,890,237 | 6,202,833 |
| ratio to (1 vCPU + 2GB) | 1 | 1.24 | 1.86 | 2.36 |
| ratio to (2 vCPU + 4GB) | 0.80 | 1 | 1.50 | 1.90 |

The table above shows that (2 vCPU + 4GB) only achieves 1.24x the throughput of (1 vCPU + 2GB), instead of 2x. However, TickTock scales almost linearly with 2, 3, and 4 vCPUs, at about 1.6M data points/s per (1 vCPU + 2GB). We repeated the tests several times with similar results.
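
The ratios in the table can be reproduced directly from the raw throughput numbers. The short Python sketch below (illustration only; the numbers are copied from the table above) also prints the per-vCPU throughput, which is roughly 1.6M data points/s for the 2, 3, and 4 vCPU configurations:

```python
# Raw write throughput (data points/s) from the table above, keyed by vCPU count.
throughput = {1: 2_623_767, 2: 3_262_576, 3: 4_890_237, 4: 6_202_833}

for vcpu, tp in throughput.items():
    ratio_to_1 = tp / throughput[1]   # ratio to (1 vCPU + 2GB)
    ratio_to_2 = tp / throughput[2]   # ratio to (2 vCPU + 4GB)
    per_vcpu = tp / vcpu              # throughput per vCPU
    print(f"{vcpu} vCPU: ratio to 1 vCPU = {ratio_to_1:.2f}, "
          f"ratio to 2 vCPU = {ratio_to_2:.2f}, "
          f"per vCPU = {per_vcpu / 1e6:.2f}M points/s")
```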

We could not test 5 or more vCPUs since the test machine (the Ideapad) is saturated by TickTock (4 vCPU + 8GB) plus the benchmark running on the same machine. The benchmark (written in Java) consumes much more CPU than TickTock (written in C++), almost 8 vCPUs.

4.2. Read Throughput

Similar to write throughput, read throughput also goes up as the vCPU count increases. 2 vCPUs do not provide 2x the throughput of 1 vCPU, but TickTock scales close to linearly with 2, 3, and 4 vCPUs.

5. Response Time

5.1. Write Response Time

The above figure shows the response time of write (INGESTION) operations. Most of the response times (P99 and below) are less than 1 millisecond. Please note that one INGESTION operation contains BATCH_PER_WRITE * SENSOR_NUMBER data points. We use BATCH_PER_WRITE=20 and SENSOR_NUMBER=10, so there are 200 data points per operation.
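
For reference, a minimal sketch of the arithmetic behind the 200 data points per operation mentioned above, plus the rough amortized per-point response time it implies; the 1 ms figure is simply the "less than 1 millisecond" P99 bound from the text:

```python
BATCH_PER_WRITE = 20  # rows per INGESTION operation (benchmark setting above)
SENSOR_NUMBER = 10    # sensors per device, i.e., data points per row

points_per_op = BATCH_PER_WRITE * SENSOR_NUMBER
print(points_per_op)  # 200 data points per INGESTION operation

p99_ms = 1.0  # P99 response time per operation is below ~1 ms
print(p99_ms / points_per_op * 1000)  # roughly 5 microseconds per data point, amortized
```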

The P999 response time suddenly spikes to several seconds, except in the 4 vCPU case, whose P999 is 4.46 milliseconds. The high P999 response time is due to the nature of the asynchronous protocol used for writes. Once the server's network buffer is full, balancing between connections becomes crucial; any imbalance will cause certain connections to be blocked longer than usual.

We also tested synchronous writes (i.e., the HTTP protocol). Their P999 response time is very close to P99, although the average response time is higher than with asynchronous writes. We will release another performance report comparing TCP and HTTP writes.
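
For comparison, a synchronous (HTTP) write blocks until the server responds. The sketch below is an assumption-laden illustration only: it assumes the HTTP listener is on port 6182 and that TickTock accepts OpenTSDB-style JSON at /api/put (neither detail is stated in this report); the metric and tag names are made up.

```python
import json
import time
import urllib.request

# Assumptions (not from this report): the HTTP listener is on port 6182 and
# accepts OpenTSDB-style JSON at /api/put. Metric/tag names are made up.
URL = "http://localhost:6182/api/put"

point = {
    "metric": "wind_farm.sensor0",
    "timestamp": int(time.time() * 1000),  # millisecond resolution, as configured above
    "value": 20.5,
    "tags": {"device": "d_0"},
}

req = urllib.request.Request(
    URL,
    data=json.dumps([point]).encode(),
    headers={"Content-Type": "application/json"},
)
# Synchronous write: the client blocks here until the server responds.
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```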

5.2. Read Response Time

5.2.1. PRECISE_POINT

Precise point query: select v1... from data where time=? and device in ?

The figure above shows the read operation (PRECISE_POINT) response time. Interestingly, the more vCPUs there are, the higher the response time. We think more vCPUs mean more parallelism, which incurs more synchronization overhead since all threads need to access the same data and resources.

There is a big jump from P99 to P999 response time, but the relative gap is not as large as for writes, because read operations are synchronous.

The sections below (5.2.2 to 5.2.6) cover the other read operations. They show a similar pattern to PRECISE_POINT, so for simplicity we do not explain them further.

5.2.2. TIME_RANGE

Time range query: select v1... from data where time > ? and time < ? and device in ?

5.2.3. AGG_RANGE

Aggregation query with time filter: select func(v1)... from data where device in ? and time > ? and time < ?.

5.2.4. GROUP_BY

A group-by-time-range query is hard to represent in standard SQL but is useful for time series data, e.g., for downsampling. Suppose there is a time series covering one day of data. By grouping the data by 1 hour, we get a new time series containing only 24 data points.
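
As a concrete illustration of this kind of downsampling (not TickTock's actual implementation), here is a minimal Python sketch that groups one day of hypothetical per-minute samples into 24 hourly averages:

```python
from collections import defaultdict

# Hypothetical input: one day of (timestamp_ms, value) samples at 1-minute intervals.
DAY_MS = 24 * 3600 * 1000
samples = [(t, float(t % 97)) for t in range(0, DAY_MS, 60_000)]  # 1440 samples

# Group samples into 1-hour buckets and aggregate (average here).
buckets = defaultdict(list)
for ts, value in samples:
    buckets[ts // 3_600_000].append(value)

# The downsampled series has one point per hour: 24 data points.
downsampled = [(hour * 3_600_000, sum(vs) / len(vs)) for hour, vs in sorted(buckets.items())]
print(len(downsampled))  # 24
```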

5.2.5. LATEST_POINT

Latest point query: select time, v1... where device = ? and time = max(time)

6. CPU, memory, IO

As described above, the CPUs are saturated throughout all tests; CPU is our performance bottleneck. We show an example from the (2 vCPU + 4GB) test below.

7. Conclusion

  • For writes, TickTock scales nearly linearly with the number of CPUs on a single host. The maximum write throughput is about 1.6M data points/s per vCPU.
  • For reads, more CPUs incur more synchronization overhead and thus higher response times.