Parquet io performance - animeshtrivedi/notes GitHub Wiki
# Performance of Parquet reading from HDFS vs Crail

## Input preparation
The input Parquet data file is generated using the Parquet Generator tool as:

```
-p 10 -r 100000000 -t 80 -o crail://flex11-40g0:9060/sql/parquet-100m
```

for the 10 slave nodes we have. In the end, the block distribution is symmetric, as shown by:
```
./bin/crail fsck -t blockStatistics -f /sql/parquet-100m
...
10.40.0.14 1223
10.40.0.15 1221
10.40.0.16 1222
10.40.0.21 1220
10.40.0.22 1220
10.40.0.23 1222
10.40.0.13 1220
10.40.0.20 1220
10.40.0.18 1221
10.40.0.19 1221
```
The total file size is 11.9 GB, and it contains 100 million rows of schema type [Int, Long, Double, Float, String] from the `ParquetExample` case class in the tool.
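For reference, a case class with this schema might look like the following sketch; the field names here are assumptions for illustration, only the types come from the schema above.

```scala
// Hypothetical sketch of the ParquetExample schema [Int, Long, Double, Float, String];
// field names are illustrative, not the tool's actual definitions.
case class ParquetExample(
  intKey: Int,       // Int column
  longVal: Long,     // Long column
  doubleVal: Double, // Double column
  floatVal: Float,   // Float column
  payload: String    // String column (100 bytes per row in this benchmark)
)
```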
## Spark Config setup
## Reading benchmark
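The per-row numbers below are end-to-end elapsed time divided by the row count. A minimal sketch of such a timing harness follows; the function and its names are assumptions, since the actual benchmark code is not shown here. In the real benchmark, the read routine would wrap something like `spark.read.parquet("crail://flex11-40g0:9060/sql/parquet-100m")` plus consuming the rows.

```scala
// Measures nanoseconds per row for a read routine.
// readRows performs the read and returns the number of rows it consumed.
def nsPerRow(readRows: () => Long): Double = {
  val t0 = System.nanoTime()
  val rows = readRows()
  val elapsedNs = System.nanoTime() - t0
  elapsedNs.toDouble / rows
}
```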
### HDFS vs Crail performance
These are the first set of performance numbers (all times in ns per row) for the `ParquetExample` format, whose schema is &lt;int, long, double, float, String(size: 100)&gt;.
Source | Sink | min | max | average
:---: | :---: | :---: | :---: | :---:
null | null | 63 | 68 | 64.9
hdfs | null | 86 | 363 | 140
crail | null | 85 | 334 | 141
hdfs | hdfs | 1102 | 1874 | 1203
crail | crail | 686 | 1320 | 880*

\* Triggers the network, as Crail/HDFS file writes are distributed.
### What does the table tell us?

- Crail and HDFS have the same read performance for this Parquet schema (141 vs. 140 ns/row on average).
- Crail is ~30% better at writing (880 vs. 1203 ns/row on average).
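The ~30% write-path improvement follows directly from the average end-to-end numbers in the table; a quick check, using only values from the table above:

```scala
// Average end-to-end times (ns/row) from the read+write rows of the table.
val hdfsToHdfs   = 1203.0
val crailToCrail = 880.0
// Relative improvement of Crail over HDFS on the read+write path.
val improvement = (hdfsToHdfs - crailToCrail) / hdfsToHdfs
println(f"$improvement%.3f")  // roughly 0.27, i.e. ~30% better
```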