Parquet - zhongjiajie/zhongjiajie.github.com GitHub Wiki

Parquet

Differences between the textfile, ORC, and Parquet file storage formats

Parquet vs ORC vs ORC with Snappy

In terms of vectorization, the main differences (correct as of Hive 2.0 and Spark 2.1) are:

  • Hive has a vectorized ORC reader but no vectorized Parquet reader.
  • Spark has a vectorized Parquet reader but no vectorized ORC reader.
  • Spark performs best with Parquet, and Hive performs best with ORC. I have seen similar differences when running ORC and Parquet with Spark. Vectorization means that rows are decoded in batches, which dramatically improves memory locality and cache utilization.
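
To make "decoded in batches" concrete, here is a minimal sketch in plain Python. It is not Spark's or Hive's reader: the column layout (contiguous little-endian int32 values) and the function names are assumptions chosen for illustration. It only contrasts one decode call per value with one decode call per batch.

```python
import struct

# Hypothetical column chunk: int32 values stored contiguously,
# the way a columnar format lays out a single column.
values = [3, 1, 4, 1, 5, 9, 2, 6]
column_bytes = struct.pack("<8i", *values)


def decode_row_at_a_time(buf):
    """Decode one value per call, as a non-vectorized reader would."""
    out = []
    for offset in range(0, len(buf), 4):
        (v,) = struct.unpack_from("<i", buf, offset)
        out.append(v)
    return out


def decode_in_batches(buf, batch_size=4):
    """Decode up to batch_size values per call, as a vectorized reader would."""
    out = []
    step = batch_size * 4  # bytes covered by one batch
    for offset in range(0, len(buf), step):
        # The last batch may be shorter than batch_size.
        n = min(batch_size, (len(buf) - offset) // 4)
        out.extend(struct.unpack_from(f"<{n}i", buf, offset))
    return out


row_result = decode_row_at_a_time(column_bytes)
batch_result = decode_in_batches(column_bytes)
```

Both functions return the same values. The batched version makes fewer calls and walks a contiguous byte range each time, which is the access pattern that helps memory locality in a real engine.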

In terms of strengths and weaknesses:

  • Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does (see here).
  • Apache ORC might be better if your file structure is flat.
  • As far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially for sum operations.
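
The idea behind a lightweight index plus a Bloom filter can be sketched as follows. This is an illustration in plain Python, not ORC's on-disk format: the group size, the `BloomFilter` class, and the helpers `build_index` and `groups_to_read` are all hypothetical. Each group of values records its min, max, and a Bloom filter, so a reader can skip groups that cannot match an equality predicate.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(values, group_size):
    """Record min, max, and a Bloom filter for each group of values."""
    index = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        bloom = BloomFilter()
        for v in group:
            bloom.add(v)
        index.append({"start": start, "min": min(group),
                      "max": max(group), "bloom": bloom})
    return index


def groups_to_read(index, target):
    """Return start offsets of groups that could contain target."""
    return [
        entry["start"]
        for entry in index
        if entry["min"] <= target <= entry["max"]
        and entry["bloom"].might_contain(target)
    ]


column = [10, 12, 15, 18, 40, 42, 47, 49, 90, 93, 95, 99]
index = build_index(column, group_size=4)
candidates = groups_to_read(index, 42)  # only the group starting at offset 4
```

The min/max check alone rules out groups whose range excludes the target. The Bloom filter additionally helps when the target falls inside a group's range but is usually absent from it.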

Hive ORC and Parquet

