Performance Optimization - nsfcac/MetricsBuilder GitHub Wiki

Motivation

In our original design and implementation, Metrics Builder typically required an unacceptably long waiting time to query and process data from InfluxDB. The following figure shows the data query performance at different time intervals over different time ranges. In general, for a fixed query interval, the query time grows with the time range, and it grows faster when the query interval is small. In our experiments, even the shortest time was about 50 seconds, which indicates that Metrics Builder was not a responsive service.

To find out what was causing the poor performance, we used cProfile to analyze the details of time consumption in Metrics Builder. As shown in the figure, querying BMC-related data points takes almost 80% of the total running time, and querying UGE-related data points takes more than 10%. Together, these queries account for about 90% of the total running time. Therefore, if we can reduce the querying time, we can achieve a significant performance improvement. To that end, we have explored the following optimization strategies.
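As a minimal sketch of how such a profile can be collected, the snippet below runs cProfile around a placeholder function and prints the functions sorted by cumulative time; `query_bmc_data` is a stand-in, not the actual Metrics Builder query path.

```python
import cProfile
import io
import pstats

def query_bmc_data():
    # Placeholder for the BMC query path; in Metrics Builder this would
    # issue InfluxDB queries for BMC-related data points.
    return sum(i * i for i in range(100_000))

# Profile the call and report the top functions by cumulative time,
# which is the view that lets us attribute runtime to specific queries.
profiler = cProfile.Profile()
profiler.enable()
query_bmc_data()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```

Sorting by cumulative time (rather than per-call time) is what surfaces high-level query functions that dominate the total runtime.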

Storing data on SSDs

The original InfluxDB service resided on a host where data are stored on HDDs (hard disk drives) with a disk bandwidth of 103 MB/sec. To test the performance of using SSDs (solid-state drives) without affecting the collection process, we migrated the collected data to a host with SSDs, which provides around 391 MB/sec of I/O bandwidth, nearly 4x that of the HDDs. As depicted in the figure, even with the faster storage, the performance gain is limited to roughly 1.5x to 2.1x, and the response time is still long.

Optimizing database schemas

Our next approach was to redesign the time-series database schema. This optimization is based on the knowledge that query performance degrades as series cardinality grows. In our original design and implementation, we had two versions of the database schema. The first version used different measurements to store different metrics. For example, CPU temperature, fan speed, and job information were all stored in separate measurements. We also saved metadata such as threshold information into fields. The second version saved all metrics into a unified measurement, and each job's information was stored in a dedicated measurement. Both versions of the schema coexisted in the same database, which introduced a large series cardinality.

In order to better manage the data, we proposed an optimized schema and converted all historical data into this redesigned schema. We used binary integer epoch time instead of date strings. The optimized schema not only results in considerable performance improvement but also reduces the total data volume. As shown in the figure, the new schema has only 28.02% of the data volume of the previous schema. We also gained a 1.6x to 1.76x performance boost compared to using the previous schema on the SSDs, as depicted in the figure. These experiments have shown that the database schema plays a vital role in the performance of the monitoring system.
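To make the timestamp change concrete, the sketch below contrasts a point carrying a date string with one carrying an integer epoch timestamp, using InfluxDB line-protocol strings. The measurement, tag, and field names here are illustrative assumptions, not the exact names used in our schema.

```python
import time

# Old-style point (assumed names): the timestamp is a date string stored
# as a field value, which is bulky and slow to parse.
old_point = 'cpu_temperature,host=node001 value=54.0,time_str="2020-04-01T12:00:00Z"'

# Optimized-style point (assumed names): metrics share one measurement,
# distinguished by a tag, and the timestamp is a binary integer epoch
# time in nanoseconds appended in the native line-protocol position.
epoch_ns = int(time.time() * 1e9)
new_point = f"metrics,host=node001,label=cpu_temperature value=54.0 {epoch_ns}"

print(new_point)
```

Integer epoch timestamps are both smaller on disk than date strings and cheaper to compare during range queries, which is consistent with the data-volume reduction reported above.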

Concurrent Querying

The next approach we investigated to improve performance was to take advantage of concurrent queries in InfluxDB, where the data points in each measurement are searched concurrently. The following figure shows the performance improvement compared to the sequential approach. We achieved a 5.5x to 6.5x performance improvement from concurrent querying, which shows that concurrent querying is another vital technique and design consideration in the monitoring system.
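Because the per-measurement queries are independent, they can be issued in parallel. The sketch below illustrates the idea with a thread pool; `run_query` is a hypothetical stand-in for an InfluxDB client call, not the actual Metrics Builder code.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query(measurement: str) -> list:
    # Placeholder for an InfluxDB query against one measurement;
    # here it just fabricates three data points per measurement.
    return [f"{measurement}-point-{i}" for i in range(3)]

measurements = ["cpu_temperature", "fan_speed", "power_usage", "job_info"]

# Issue all per-measurement queries concurrently; pool.map preserves
# the input order, so results line up with the measurement names.
with ThreadPoolExecutor(max_workers=len(measurements)) as pool:
    results = dict(zip(measurements, pool.map(run_query, measurements)))

print(sum(len(points) for points in results.values()))  # 12 data points in total
```

Threads suit this workload because the queries are I/O-bound: each thread spends most of its time waiting on the database, so the GIL is not a bottleneck.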

The following figure summarizes the performance improvements when the above approaches are applied collectively. Overall, the proposed strategies allow Metrics Builder to perform 17x to 25x faster than the original implementation. The query and processing time was as low as 3.78 seconds when querying 6 hours of data and 12.9 seconds when querying 72 hours of data.

Transmitting compressed data

The optimization approaches discussed earlier collectively reduce the waiting time to an acceptable range. However, when data analytics applications invoke the Metrics Builder API remotely, the response time is still long, especially when obtaining long-range data. To further understand the reasons, we decompose the time consumption into query-processing time and transmission time, as depicted in the figure. From the figure, we can observe that when querying long-range data, the transmission time is much longer than the query-processing time, up to 1.65 times longer. This observation motivates another optimization: compressing the data points and transmitting the compressed data to reduce the transmission time.

The Metrics Builder API provides data in JSON format to data analytics applications. In our experiments, we used the zlib library to compress the JSON data before transmission. The following figure illustrates the compression ratio: the compressed data volume is only about 5% of the uncompressed data volume.
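The compression step can be sketched as follows. The payload here is a toy example, so its compression ratio is merely illustrative; the roughly 5% figure above comes from real monitoring data.

```python
import json
import zlib

# Toy JSON payload standing in for a Metrics Builder API response.
payload = {"node001": {"cpu_temperature": [54.0] * 1000}}
raw = json.dumps(payload).encode("utf-8")

# Compress the serialized JSON with zlib before sending it over the wire.
compressed = zlib.compress(raw, level=6)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")

# The client side decompresses and parses the payload transparently.
restored = json.loads(zlib.decompress(compressed))
assert restored == payload
```

Monitoring data compresses well because the JSON is highly repetitive (recurring keys and similar numeric values), which is why the savings dominate the extra CPU cost of compressing.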

This figure shows the total response time (on top) and the total response time distribution (at bottom) when using compressed data, compared against the cases without compression. Owing to the compression, the transmission time is significantly shorter; although the query-processing time increases slightly, the overall performance is about 2x faster than transmitting uncompressed data.