binary file format ORC

ORC (Optimized Row Columnar), like Parquet, is a columnar file format; it is the successor to Apache Hive’s standard RCFile format. ORC provides ACID support, has built-in indexes for faster data retrieval, and supports complex data types such as structs, lists, and maps.

The ORC file format is as follows:

  • Each ORC file contains the header, footer, and multiple data blocks known as stripes.
  • The header has details to indicate that the file is in ORC format.
  • Each stripe consists of index data, row data, and stripe footer.
  • Index data holds the indexes for the stored data and stores the min and max values for row groups.
  • Row data holds the actual data used in scanning.
  • The stripe footer contains details about each column in the stripe, including its encoding, location, and min and max values.
  • The file footer contains statistics about the stripes in the file, the number of rows in each stripe, and other information that lets a reader skip data; the inspection sketch below shows how to read this metadata from Python.
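pyarrow exposes much of this footer metadata directly. A minimal inspection sketch, assuming a reasonably recent pyarrow build with ORC support and that user.orc (created by the example at the end of this page) already exists:

import pyarrow.orc as orc

# Open an existing ORC file and inspect the metadata kept in its
# footer and postscript.
f = orc.ORCFile('user.orc')
print(f.schema)       # column names and types
print(f.nrows)        # total number of rows in the file
print(f.nstripes)     # number of stripes (data blocks)
print(f.compression)  # compression codec used for the file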

ORC has 3 levels of indexes:

  • File level to store column statistics across the entire file
  • Stripe level to store column statistics for each stripe
  • Row level to store column statistics across each set of 10,000 rows within a stripe

File- and stripe-level column statistics are stored in the file footer (similar to Parquet), so a reader can skip whole files or stripes during scanning.
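The 10,000-row stride is also the default in pyarrow’s writer and can be changed with the row_index_stride option, and stripes can be read individually once statistics have excluded the rest. A short sketch, with numbers.orc as a made-up file name:

import pyarrow as pa
import pyarrow.orc as orc

# row_index_stride controls how many rows share one row-level index
# entry (10,000 by default).
table = pa.table({'id': list(range(100_000))})
orc.write_table(table, 'numbers.orc', row_index_stride=10000)

# Stripes can be read one at a time, which is what engines do after
# stripe-level statistics have ruled the other stripes out.
f = orc.ORCFile('numbers.orc')
first_stripe = f.read_stripe(0)  # returns a RecordBatch
print(first_stripe.num_rows)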

ORC has been used mostly in Hadoop ecosystems that use Hive extensively for processing data.

ORC is the only file format that enables Hive to offer ACID features for transactional processing.

Key features and benefits of ORC

The ORC (Optimized Row Columnar) format combines storage efficiency with query performance. Its key features include:

  • High compression: ORC typically compresses better than comparable columnar formats, which reduces storage costs and the amount of data scanned per query
  • Lightweight compression algorithms: ORC supports codecs such as Zlib and Snappy, offering a good balance between storage efficiency and query performance
  • Predicate pushdown: ORC supports predicate pushdown, which reduces the amount of data read from disk during queries and thus improves query performance (see the sketch after this list)
  • Built-in support for complex data types: ORC natively supports complex data types such as structs, lists, and maps
  • ACID support: ORC provides support for ACID transactions in Hive, allowing users to perform update and delete operations
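Both the codec choice and predicate pushdown are accessible from pyarrow. A sketch, assuming pyarrow’s dataset API with ORC support; people.orc and the age filter are made-up examples:

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.orc as orc

# Pick the compression codec at write time.
table = pa.table({'age': [30, 25], 'lastName': ['Doe', 'Smith']})
orc.write_table(table, 'people.orc', compression='zlib')

# Filters passed to the dataset API are pushed down, so stripes and
# row groups whose min/max statistics cannot match are skipped.
dataset = ds.dataset('people.orc', format='orc')
print(dataset.to_table(filter=ds.field('age') > 26).to_pandas())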

ORC schema design and data types

ORC uses a schema to define the structure of the stored data. The schema consists of columns, each with a specific data type. ORC supports various data types, including the following:

  • Primitive data types: integer, long, float, double, Boolean, string, date, and timestamp
  • Complex data types: struct, list, and map (illustrated below)
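As an illustration of the complex types, a pyarrow schema can mix struct, list, and map columns. A sketch only: nested.orc is a made-up file name, and support for every nested combination depends on the pyarrow version.

import pyarrow as pa
import pyarrow.orc as orc

# A schema mixing primitive and complex ORC types.
schema = pa.schema([
    ('id', pa.int64()),                              # long
    ('name', pa.string()),                           # string
    ('address', pa.struct([('street', pa.string()),  # struct
                           ('city', pa.string())])),
    ('tags', pa.list_(pa.string())),                 # list
    ('phones', pa.map_(pa.string(), pa.string())),   # map
])

table = pa.table({
    'id': [1],
    'name': ['John Doe'],
    'address': [{'street': '123 Main Street', 'city': 'City'}],
    'tags': [['customer', 'active']],
    'phones': [[('mobile', '1234567890')]],  # map entries as (key, value) pairs
}, schema=schema)

orc.write_table(table, 'nested.orc')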

Many data processing frameworks, such as Hive, Spark, and Presto, have built-in support for reading and writing ORC files. This makes it easy to integrate ORC into your data processing pipelines without additional libraries or tools. The following example does the same from Python with pyarrow:

import pyarrow.orc as orc
import pyarrow as pa
import pandas as pd

# Creating a pandas DataFrame (the email values are placeholder examples)
data = pd.DataFrame({
    'id': [123456, 123457],
    'lastName': ['Doe', 'Smith'],
    'firstName': ['John', 'Jane'],
    'age': [30, 25],
    'email': ['[email protected]', '[email protected]'],
    'address': ['123 Main Street', '456 Oak Avenue'],
    'city': ['City', 'Oak'],
    'country': ['Country', 'Tree'],
    'phoneType': ['mobile', 'work'],
    'phoneNumber': ['1234567890', '0987654321']
})

# Convert the DataFrame into an Arrow Table
table = pa.Table.from_pandas(data)

# Write the Table to an ORC file
with open('user.orc', 'wb') as f:
    orc.write_table(table, f)

# Reading the ORC file
with open('user.orc', 'rb') as f:
    table2 = orc.ORCFile(f).read()

# Convert the Table back into a DataFrame
data2 = table2.to_pandas()
print(data2)

This code creates a pandas DataFrame with user data, converts it into an Arrow Table, and writes the Table to an ORC file. It then reads the ORC file back into a second Table and converts that Table to a DataFrame for printing.