spark sql parquet io - animeshtrivedi/notes GitHub Wiki
Parquet on NVMe and 100 Gbps
In this blog post we explore the performance of the Parquet file format. Parquet (https://parquet.apache.org/) is a popular, standard, open-source format for storing columnar data sets. It is supported out-of-the-box by Spark, where it is widely used by many developers.
The following table summarizes our findings from this blog post:
In conclusion, ...
Context and Setup
In a recent Spark Summit talk, Matei Zaharia sketched a hardware configuration (http://www.slideshare.net/SparkSummit/trends-for-big-data-and-apache-spark-in-2017-by-matei-zaharia, slide 11) that Spark targets. He points out that recent efforts in Spark have focused on optimizing its performance for 100-1,000 MB/sec storage and 10 Gbps networking speeds, which are fairly standard configurations in today's cloud environments. In our research, we went a step further and asked how these optimizations perform on the hardware of tomorrow, which can provide 1,000-10,000 MB/sec storage and 100+ Gbps networking bandwidth (more than a 10x improvement over today's hardware) with ultra-low device access latency. A natural question is: does my workload see similar performance gains if I upgrade my hardware?
For this experiment, we chose the Spark/SQL framework. SQL is one of the most important workloads, and Spark/SQL received a lot of development effort between the 1.6 and 2.0 release cycles of Spark. Spark/SQL takes parquet data stored on HDFS as input, processes the data (e.g., a join), and writes it out again as parquet data. In this blog we focus on the input and output parts of the system, leaving the processing part for a future post.
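The read-process-write path described above can be sketched in a few lines of PySpark. This is only an illustrative sketch, not the benchmark used in this experiment; the paths, session settings, and the `other_df` join input are placeholders.

```python
# Sketch of the parquet read/process/write path (illustrative only;
# paths and cluster settings are placeholders, not our configuration).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-io-sketch")
         .getOrCreate())

# Read columnar parquet data from HDFS ...
df = spark.read.parquet("hdfs://namenode:9000/input.parquet")

# ... optionally process it (e.g., a join), which this post leaves
# for a future post ...
# df = df.join(other_df, "key")

# ... and write it out again as parquet.
df.write.mode("overwrite").parquet("hdfs://namenode:9000/output.parquet")

spark.stop()
```

Running such a job exercises exactly the input and output stages this post measures; the processing step in the middle is a no-op here.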
The specific cluster configuration used for the experiments in this blog:

Cluster: 128-node OpenPower cluster
Node configuration:
  CPU: 2x OpenPOWER Power8 10-core @ 2.9 GHz
  DRAM: 512 GB DDR4
  Storage: 4x Huawei ES3600P V3 1.2 TB NVMe SSD
  Network: 100 Gbit/s Ethernet, Mellanox ConnectX-4 EN (RoCE)
Software:
  Ubuntu 16.04 with Linux kernel version 4.4.0-31
  Spark 2.0.0
  Crail 1.0 (Crail is only used during shuffle; input/output is on HDFS)
Setup
We first start with a simple 3-machine setup, where one machine runs the HDFS NameNode, the YARN ResourceManager, and the Spark driver. The second machine runs an HDFS DataNode, which stores data on various storage media (e.g., HDD, SSD, NVMe) and serves data over 1, 10, and 100 Gbps networks. The last machine runs a YARN NodeManager and a Spark executor, where our benchmark code runs. The benchmark reads a large parquet file from HDFS, materializes the rows, and writes the data out as is. The goal of the benchmark is to stress the parquet read/write path of Spark. Our benchmark code is open sourced here [?]. The input/output parquet data is 200 GB in size and is generated using parquet-generator. All experiments start with a cold HDFS cache.
Parquet Performance: Past, Present, and Future
We first start our analysis by establishing the baseline performance with disk and a 1 Gbps network. For this setup, the figure shows the performance (in time/row) as we increase the amount of parallelism. The Y-axis shows the time, split between input and output, and the X-axis shows the number of cores employed. It takes about abc msec/row for input and xyz msec/row for output. The output appears fast because data is not written out to the media. With HDD and 1 Gbps, Spark/SQL delivers 99% of the raw hardware performance to the application: in this setup, the whole stack is bottlenecked by the raw device performance.
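As a sanity check on these time/row numbers, the best possible time per row is bounded by the slowest raw resource in the pipeline. A back-of-the-envelope model (the row size and bandwidths below are purely illustrative, not measurements from this experiment):

```python
# Lower bound on time per row, dictated by the slowest raw resource.
# All numbers below are illustrative, not measured values.

def min_usec_per_row(row_bytes, bandwidths_mb_per_sec):
    # The pipeline can go no faster than its slowest component.
    slowest = min(bandwidths_mb_per_sec)          # MB/sec
    rows_per_sec = slowest * 1e6 / row_bytes      # rows/sec
    return 1e6 / rows_per_sec                     # usec per row

# Example: 100-byte rows, ~125 MB/sec disk and 1 Gbps (~125 MB/sec) link.
bound = min_usec_per_row(100, [125, 125])
print(round(bound, 3))  # -> 0.8 usec/row at best
```

When the measured time/row sits close to this bound, the stack is delivering nearly all of the raw hardware performance, which is what we observe in the HDD + 1 Gbps case.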
We now run the same experiment on SSDs with 1 GB/sec read performance and a 10 Gbps network. Figure 2 shows our results. The first thing we notice is that the gains are not proportional: we see only a 2x performance gain for a 10x hardware performance improvement. This is where overheads from the software stack come into play. For example, in this setup HDFS can only deliver 7 Gbps of network bandwidth.
When we run our setup on NVMe (2 GB/sec read, 2 devices = 4 GB/sec) and 100 Gbps, the gap between the performance of the hardware and the application increases further.
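To put the three setups on a common scale, it helps to convert the nominal storage and network numbers to MB/sec and look at the raw-hardware ceiling of each generation (the HDD and SSD figures are the nominal ones discussed above; this is unit conversion, not measurement):

```python
# Nominal raw bandwidths of the three setups, converted to MB/sec.
# Network: Gbit/sec divided by 8; these are link speeds, not goodput.

def gbps_to_mbps(gbit_per_sec):
    return gbit_per_sec * 1000 / 8  # Gbit/sec -> MB/sec (decimal units)

setups = {
    # name: (storage MB/sec, network MB/sec)
    "HDD + 1 Gbps":    (125,  gbps_to_mbps(1)),    # ~125 MB/sec each
    "SSD + 10 Gbps":   (1000, gbps_to_mbps(10)),   # 1 GB/sec, ~1.25 GB/sec
    "NVMe + 100 Gbps": (4000, gbps_to_mbps(100)),  # 2x2 GB/sec, 12.5 GB/sec
}

for name, (storage, network) in setups.items():
    print(f"{name}: raw ceiling {min(storage, network):.0f} MB/sec")
```

The raw ceiling jumps by roughly 8x from the first setup to the second and another 4x to the third, which is why sub-proportional application gains point at software, not hardware, limits.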
Understanding performance
There are three main components involved in the parquet IO path. First, the raw performance of the IO devices. Second, the performance of the shared, distributed file system, in our case HDFS. And lastly, Spark's own IO path. The end-to-end performance of the benchmark application is determined by the minimum of these three components.
We now give a detailed breakdown of where the time is spent in each of these components.
Applying CrailFS
Summary
Misc notes
Other data formats of interest include plain text, CSV, and JSON. However, these formats incur various overheads in data storage efficiency. In this blog we focus on Parquet/HDFS with Spark on a high-performance cluster.