Apache Parquet

Apache Parquet is a file format designed to support fast data processing for complex data, with several notable characteristics:

  1. Columnar: Unlike row-based formats such as CSV or Avro, Apache Parquet is column-oriented – meaning the values of each table column are stored next to each other, rather than those of each record.

  2. Open-source: Parquet is free to use and open source under the Apache License, and is compatible with most Hadoop data processing frameworks.

  3. Self-describing: In addition to data, a Parquet file contains metadata including schema and structure. Each file stores both the data and the standards used for accessing each record – making it easier to decouple services that write, store, and read Parquet files.
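
These properties are easy to see with a short sketch using the pyarrow library (the file name and column names below are made up for illustration): a small table is written to Parquet, and the schema is then read back from the file itself, with no external description needed.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table; in the Parquet file, the values of
# each column are stored next to each other rather than row by row.
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["TH", "US", "TH"],
})

# The schema and column metadata are embedded in the file itself.
pq.write_table(table, "users.parquet")

# A separate reader can recover the schema from the file alone.
print(pq.read_schema("users.parquet"))
```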

Parquet offers advantages in the following areas:

  • Compression

File compression is the act of taking a file and making it smaller. In Parquet, compression is performed column by column, and the format supports flexible compression options and extensible encoding schemes per data type – e.g., different encodings can be used for compressing integer and string data.
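
As a sketch of what this looks like in practice, pyarrow lets you pick a compression codec globally or per column when writing (the table, file name, and codec choices below are illustrative, not prescriptive):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "views": [10, 11, 10, 12],           # integer column
    "page": ["home", "home", "a", "b"],  # string column
})

# Compression is applied column chunk by column chunk, so different
# codecs can be chosen for different columns of the same file.
pq.write_table(
    table,
    "metrics.parquet",
    compression={"views": "snappy", "page": "gzip"},
)
```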

Parquet data can be compressed using these encoding methods:

  1. Dictionary encoding: enabled automatically and dynamically for data with a small number of unique values.
  2. Bit packing: integers are usually stored with a dedicated 32 or 64 bits each; bit packing allows small integers to be stored in fewer bits.
  3. Run-length encoding (RLE): when the same value occurs multiple times, the value is stored once along with the number of occurrences.

Parquet implements a combined version of bit packing and RLE, in which the encoding switches based on which produces the best compression results.
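
One way to see which encodings were actually chosen is to inspect the column-chunk metadata with pyarrow (continuing with the hypothetical metrics.parquet file from the sketch above; the exact encoding names reported depend on the writer version):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("metrics.parquet").metadata
for i in range(meta.row_group(0).num_columns):
    col = meta.row_group(0).column(i)
    # Low-cardinality columns typically report a dictionary encoding
    # combined with the RLE/bit-packing hybrid described above.
    print(col.path_in_schema, col.encodings)
```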

  • Performance

As opposed to row-based file formats like CSV, Parquet is optimized for query performance. When running queries on a Parquet-based file system, you can read only the relevant data, so far less data is scanned and I/O drops accordingly. Because Parquet is self-describing, each file contains both data and metadata. A Parquet file is composed of a header, one or more row groups, and a footer; within each row group, the values of each column are stored together.

This structure is well-optimized both for fast query performance and for low I/O (minimizing the amount of data scanned). For example, suppose you have a table with 1,000 columns that you usually query with only a small subset of them. Parquet files enable you to fetch only the required columns and their values, load those into memory, and answer the query. If a row-based file format like CSV were used, the entire table would have to be loaded into memory, resulting in increased I/O and worse performance.
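
A minimal sketch of this column pruning with pandas (the file and column names are hypothetical):

```python
import pandas as pd

# Only the two named columns are read from disk; the remaining
# columns of the wide table are never scanned, keeping I/O low.
df = pd.read_parquet("wide_table.parquet", columns=["user_id", "revenue"])
```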

  • Schema evolution

When using columnar file formats like Parquet, users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. In these cases, Parquet readers support automatic schema merging across these files.
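
A sketch of this with pyarrow (file and column names invented for the example): two files are written with different but compatible schemas, their schemas are unified, and both are read back as one dataset, with the missing column filled with nulls.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Two files written at different times: the second adds a column.
pq.write_table(pa.table({"id": [1, 2]}), "part1.parquet")
pq.write_table(pa.table({"id": [3], "email": ["a@example.com"]}),
               "part2.parquet")

# Merge the two schemas, then read both files as a single dataset;
# rows from part1.parquet get null for the missing "email" column.
paths = ["part1.parquet", "part2.parquet"]
merged = pa.unify_schemas([pq.read_schema(p) for p in paths])
print(ds.dataset(paths, schema=merged).to_table())
```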

  • Open source and non-proprietary

Apache Parquet is part of the open-source Apache Hadoop ecosystem. Development efforts around it are active, and it is being constantly improved and maintained by a strong community of users and developers.

  • Column-oriented vs. row-based storage for analytic querying

Data is often generated and more easily conceptualized in rows. We are used to thinking in terms of Excel spreadsheets, where we can see all the data relevant to a specific record in one neat and organized row. However, for large-scale analytical querying, columnar storage comes with significant advantages in cost and performance. Complex data such as logs and event streams would need to be represented as a table with hundreds or thousands of columns and many millions of rows. Storing this table in a row-based format such as CSV would mean:

  1. Queries will take longer to run, since the whole table must be scanned rather than only the subset of columns needed to answer the query (which typically aggregates by dimension or category).
  2. Storage will be more costly, since CSVs are not compressed as efficiently as Parquet – see the sketch after this list.

Columnar formats avoid both problems: they provide better compression and improved performance out of the box, and they allow data to be queried vertically, column by column.
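
A rough way to see the storage difference is to write the same repetitive, log-like table to both formats and compare file sizes (a sketch assuming pandas with a Parquet engine such as pyarrow installed; the names and exact sizes are illustrative):

```python
import os
import pandas as pd

# A tall, repetitive table – the kind of event data discussed above.
df = pd.DataFrame({
    "event": ["click", "view"] * 500_000,
    "country": ["TH", "US"] * 500_000,
})

df.to_csv("events.csv", index=False)
df.to_parquet("events.parquet")  # dictionary/RLE shrink the repeats

# Parquet is typically several times smaller for data like this.
print(os.path.getsize("events.csv"), os.path.getsize("events.parquet"))
```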
  • Parquet metadata

Each Parquet file stores the following metadata in its footer:
  1. Version (of Parquet format)
  2. Data Schema
  3. Column metadata (type, number of values, location, encoding)
  4. Number of row groups
  5. Additional key-value pairs
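
All of these fields can be read back programmatically, for example with pyarrow (reusing the hypothetical events.parquet file from the earlier sketch):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata
print(meta.format_version)   # version of the Parquet format
print(meta.schema)           # data schema, with per-column types
print(meta.num_row_groups)   # number of row groups
print(meta.metadata)         # additional key-value pairs (may be None)
```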

parquet.ipynb

test_parquet.ipynb