Parquet io performance - animeshtrivedi/notes GitHub Wiki
# Performance of Parquet reading from HDFS vs Crail

## Input preparation
The input Parquet data file is generated using the Parquet Generator tool as:

```
-p 10 -r 100000000 -t 80 -o crail://flex11-40g0:9060/sql/parquet-100m
```

for the 10 slave nodes we have. In the end, the block distribution is symmetric, as shown by:
```
./bin/crail fsck -t blockStatistics -f /sql/parquet-100m
...
10.40.0.14 1223
10.40.0.15 1221
10.40.0.16 1222
10.40.0.21 1220
10.40.0.22 1220
10.40.0.23 1222
10.40.0.13 1220
10.40.0.20 1220
10.40.0.18 1221
10.40.0.19 1221
```
The total file size is 11.9 GB, and it contains 100 million rows of schema type [Int, Long, Double, Float, String] from the `ParquetExample` case class in the tool.
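For reference, a case class with this schema might look like the following sketch; the field names here are assumptions for illustration, only the types come from the schema above.

```scala
// Hypothetical sketch of the ParquetExample schema [Int, Long, Double, Float, String];
// field names are illustrative, not the tool's actual definitions.
case class ParquetExample(
  intKey: Int,       // Int column
  longVal: Long,     // Long column
  doubleVal: Double, // Double column
  floatVal: Float,   // Float column
  payload: String    // String column (100 bytes per row in this benchmark)
)
```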
## Spark Config setup
## Reading benchmark
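The per-row numbers below are end-to-end elapsed time divided by the row count. A minimal sketch of such a timing harness follows; the function and its names are assumptions, since the actual benchmark code is not shown here. In the real benchmark, the read routine would wrap something like `spark.read.parquet("crail://flex11-40g0:9060/sql/parquet-100m")` plus consuming the rows.

```scala
// Measures nanoseconds per row for a read routine.
// readRows performs the read and returns the number of rows it consumed.
def nsPerRow(readRows: () => Long): Double = {
  val t0 = System.nanoTime()
  val rows = readRows()
  val elapsedNs = System.nanoTime() - t0
  elapsedNs.toDouble / rows
}
```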
### HDFS vs Crail performance
These are the first set of performance numbers (all times in ns per row) for the `ParquetExample` format, whose schema is &lt;int, long, double, float, String(size: 100)&gt;.
Source | Sink | min | max | average
:---: | :---: | :---: | :---: | :---:
null | null | 63 | 68 | 64.9
hdfs | null | 86 | 363 | 140
crail | null | 85 | 334 | 141
hdfs | hdfs | 1102 | 1874 | 1203
crail | crail | 686 | 1320 | 880*

\* Triggers the network, as Crail/HDFS file writes are distributed.
### What does the table tell us?

- Crail and HDFS have the same read performance for this Parquet schema (141 vs. 140 ns/row on average).
- Crail is ~30% better at writing (880 vs. 1203 ns/row on average).
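The ~30% write-path improvement follows directly from the average end-to-end numbers in the table; a quick check, using only values from the table above:

```scala
// Average end-to-end times (ns/row) from the read+write rows of the table.
val hdfsToHdfs   = 1203.0
val crailToCrail = 880.0
// Relative improvement of Crail over HDFS on the read+write path.
val improvement = (hdfsToHdfs - crailToCrail) / hdfsToHdfs
println(f"$improvement%.3f")  // roughly 0.27, i.e. ~30% better
```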