ORC vs Avro vs Parquet - ignacio-alorre/Hive GitHub Wiki

Big data pipelines usually receive data in human-readable formats such as JSON, XML and CSV. But storing data in those raw formats is terribly inefficient. Plus, those file formats do not lend themselves to being split and processed in parallel.

There are three optimized file formats for use in Hadoop clusters:

  • Avro
  • Parquet
  • Optimized Row Columnar (ORC)

Overview of the three formats

Parquet


Features

  • Column-oriented (stores data in columns): column-oriented data stores are optimized for read-heavy analytical workloads
  • High compression rates (up to 75% with Snappy compression)
  • Only the required columns are fetched/read, reducing disk I/O
  • Can be read and written using the Avro API and Avro schemas
  • Supports predicate pushdown, further reducing disk I/O cost (see the sketch below)
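
As a minimal sketch of how this plays out in Hive (the table and column names below are hypothetical), a Parquet-backed table is declared with `STORED AS PARQUET`; a query that projects only a couple of columns then benefits from both column pruning and predicate pushdown:

```sql
-- Hypothetical table, used only to illustrate the columnar access pattern
CREATE TABLE events_parquet (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  payload    STRING,
  event_date DATE
)
STORED AS PARQUET;

-- Only event_type and event_date are read from disk; the filter on
-- event_date can be pushed down so entire row groups are skipped.
SELECT event_type, COUNT(*) AS events
FROM events_parquet
WHERE event_date >= DATE '2020-01-01'
GROUP BY event_type;
```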

Avro


Features

  • Row-based (stores data in rows): row-based formats are best for write-heavy transactional workloads
  • Supports serialization
  • Fast binary format
  • Supports block compression and is splittable
  • Supports schema evolution (uses JSON to describe the data, while using a binary format to optimize storage size)
  • Stores the schema in the header of the file, so the data is self-describing (see the sketch below)
  • Sync markers between data blocks allow files to be split and read in parallel, which helps read throughput
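
A rough sketch from the Hive side (table and column names are invented): with `STORED AS AVRO` (available since Hive 0.14), Hive derives the Avro schema from the table definition and embeds it in the header of every file it writes:

```sql
-- Hypothetical row-oriented table backed by Avro files
CREATE TABLE orders_avro (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DOUBLE,
  order_ts STRING
)
STORED AS AVRO;

-- Whole rows are appended, which suits write-heavy pipelines
INSERT INTO TABLE orders_avro
VALUES (1, 42, 19.99, '2020-01-01 10:00:00');
```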

ORC


Features

  • Column-oriented (stores data in columns): column-oriented data stores are optimized for read-heavy analytical workloads
  • High compression rates (ZLIB by default)
  • Hive type support (datetime, decimal, and the complex types struct, list, map and union; see the sketch below)
  • Metadata stored using Protocol Buffers, which allows fields to be added and removed
  • Compatible with HiveQL
  • Supports serialization
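
A minimal sketch (the schema is invented for illustration) showing ORC's support for Hive's richer types, including decimal and the complex types:

```sql
-- Hypothetical ORC-backed table using decimal, timestamp and complex types
CREATE TABLE customers_orc (
  customer_id BIGINT,
  name        STRING,
  balance     DECIMAL(12,2),
  signup_ts   TIMESTAMP,
  address     STRUCT<street:STRING, city:STRING, zip:STRING>,
  tags        ARRAY<STRING>
)
STORED AS ORC;
```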

Similarities and Differences

What is Common

  • The three formats are optimized for storage on Hadoop and provide some degree of compression.
  • Are machine-readable binary formats, which is to say that the files look like gibberish to humans
  • Files stored in these formats can be split across multiple disks, which lends itself to scalability and parallel processing. You cannot split JSON and XML files, and that limits their scalability and parallelism.
  • All three formats carry the data schema in the files themselves, which is to say they’re self-described. You can take an ORC, Parquet, or Avro file from one cluster and load it on a completely different machine, and the machine will know what the data is and be able to process it.
  • Are on-the-wire formats, which means you can use them to pass data between nodes in your Hadoop cluster.

What is Different

Read/Write Intensive vs Query Pattern

  • Row-based data formats are overall better for storing write-intensive data because appending new records is easier.
  • If only a small subset of columns will be queried frequently, columnar formats will be your good friends, as only the needed columns are accessed and transmitted (whereas row-based formats have to pull all the columns); see the example below.
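
To make the two access patterns concrete (both tables here are hypothetical), compare a narrow analytical query, which a columnar format answers by touching just two columns, with a full-row retrieval that favours a row-based layout:

```sql
-- Narrow analytical query: a columnar format (ORC/Parquet) only has to
-- read the user_id and amount columns.
SELECT user_id, SUM(amount) AS total_spent
FROM orders              -- imagine this table stored as ORC or Parquet
GROUP BY user_id;

-- Full-row retrieval: every column is needed anyway, so a row-based
-- format (Avro) reads each record as one contiguous chunk.
SELECT * FROM products;  -- imagine this table stored as Avro
```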

Compression
When you ingest huge amounts of data (e.g. IoT data), you need good compression. Columnar formats are better than row-based formats in terms of compression because storing values of the same type together allows more efficient compression. To be specific, a different and more efficient encoding can be chosen for each column. That is also why columnar formats are good for sparse datasets. ORC has the best compression rate of the three, thanks to its stripe structure.
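
As a hedged illustration (the table names are invented), the codec is usually chosen per table; ZLIB is ORC's default, and Snappy is a common choice for Parquet:

```sql
-- ORC: ZLIB is the default codec, made explicit here
CREATE TABLE readings_orc (
  device_id   STRING,
  reading_ts  TIMESTAMP,
  temperature DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Parquet: Snappy trades some compression ratio for speed
CREATE TABLE readings_parquet (
  device_id   STRING,
  reading_ts  TIMESTAMP,
  temperature DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');
```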

Schema Evolution
One challenge of handling big data is that the data schema changes frequently: e.g. adding/dropping columns and renaming columns. If your data schema changes a lot and you need high compatibility between old and new applications, Avro is here for you. Plus, Avro's data schema is in JSON and Avro is able to keep data compact even when many different schemas exist. Of the two columnar formats, ORC offers better schema evolution than Parquet.
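
A small sketch of the most common kind of evolution, adding a column (the table is the hypothetical one from the Avro example above); the change is metadata-only and files written earlier remain readable:

```sql
-- Add a new column to an existing table
ALTER TABLE orders_avro ADD COLUMNS (coupon_code STRING);

-- Files written before the change contain no coupon_code; Hive
-- typically surfaces it as NULL for those older rows.
SELECT order_id, coupon_code FROM orders_avro LIMIT 10;
```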

Nested Columns
If you have a lot of complex nested columns in your dataset and often only query a subset of the subcolumns, Parquet would be a good choice. Parquet is implemented using the record shredding and assembly algorithm described in the Dremel paper, which allows you to access and retrieve subcolumns without pulling the rest of the nested column.
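
A brief sketch (schema invented for illustration): with a nested struct stored in Parquet, selecting a single subfield only materialises that leaf column:

```sql
CREATE TABLE users_parquet (
  user_id BIGINT,
  profile STRUCT<
    name:STRING,
    email:STRING,
    address:STRUCT<street:STRING, city:STRING, country:STRING>
  >
)
STORED AS PARQUET;

-- Only the profile.address.city leaf column is read from disk
SELECT profile.address.city AS city, COUNT(*) AS users_per_city
FROM users_parquet
GROUP BY profile.address.city;
```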

Platform Support
You should also consider the platform/framework you are using when choosing a data format, as data formats perform differently depending on where they are used. ORC works best with Hive (since it is made for Hive). Spark provides great support for processing Parquet. Avro is often a good choice for Kafka.

However, if your use case is more along the lines of retrieving all rows from a table, for example displaying every available product, that is a row-oriented query and a row-based format will be more efficient.

Because of the way the data is laid out for fast retrieval, column-based stores also offer higher compression rates than row-based formats. Take as an example a scenario where you are collecting massive amounts of data, such as IoT readings: a column-oriented format stores all the values of the same type next to each other, which allows more efficient compression than storing whole rows of data.

Another aspect to consider is support for schema evolution, i.e. the ability of the file structure to change over time. Of the two columnar formats, ORC offers better schema evolution. However, Avro offers superior schema evolution overall, thanks to its use of JSON to describe the data while using a binary format to optimize storage size.

Summary

  • Row-based: if the data is wide (has a large number of attributes) and the workload is write-heavy
  • Column-based: if the data is narrow (has a small number of attributes) and the workload is read-heavy
