Parquet binary file format
Parquet is an efficient, binary file format for table data.
Parquet offers several key features and benefits that make it well-suited for large-scale data processing tasks:
- Columnar storage: By organizing data in a columnar format, Parquet enables better compression and more efficient query execution, particularly for analytical workloads.
- Schema evolution: Parquet supports schema evolution, allowing users to modify the schema of a dataset without needing to rewrite the entire dataset.
- Compression and encoding: Parquet supports a variety of compression algorithms and encoding techniques, enabling users to optimize storage efficiency and query performance based on the specific characteristics of their data.
- Integration with data processing frameworks: Parquet is widely supported by popular data processing frameworks such as Apache Spark, Apache Hive, and Apache Impala, making it easy to integrate into existing data processing pipelines.
- Vectorized processing: By storing data in a columnar format, Parquet enables modern analytical engines to leverage vectorized processing, further improving query performance.
One of the main advantages of Parquet is that the format has an explicit schema embedded within the file, and that schema includes type information.
The Parquet file structure is as follows:
- Every Parquet file consists of a header, a footer, and a data block.
- The header has details that indicate the file is in Parquet format.
- The data block consists of multiple row groups that logically combine various rows within the file.
- The row groups consist of columns present in the file.
- The values within each column are stored as pages. Pages are the most granular data elements within Parquet.
- The file footer consists of metadata about the row groups and columns. The metadata includes statistics such as min/max values, which compute engines use for data skipping (see the sketch below).
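As a minimal sketch, assuming a local Parquet file exists (here hypothetically named user.parquet, like the one written in the PyArrow example further down), the footer metadata and its min/max statistics can be inspected with PyArrow:

```python
import pyarrow.parquet as pq

# Hypothetical file name; any local Parquet file works here.
pf = pq.ParquetFile("user.parquet")

meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

# Per-row-group, per-column-chunk metadata, including the min/max
# statistics that compute engines use for data skipping.
first_group = meta.row_group(0)
for i in range(first_group.num_columns):
    col = first_group.column(i)
    stats = col.statistics
    if stats is not None:
        print(col.path_in_schema, stats.min, stats.max)
```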
Parquet schema design and data types
Parquet uses a hierarchical schema representation, similar to JSON or Avro, which allows for complex and nested data structures. The schema is defined using a combination of basic data types (for example, int, long, float, double, Boolean, and binary) and complex data types (for example, arrays, maps, and structs).
When designing a Parquet schema, it is important to consider the specific requirements of the data and the intended analytical workloads. Factors such as data types, nullability, and column ordering can impact storage efficiency and query performance. For example, placing frequently accessed columns together can help reduce the amount of I/O required for analytical queries.
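As an illustration, a nested schema can be declared with PyArrow roughly like this (the field names are made up for the example):

```python
import pyarrow as pa

# Illustrative schema mixing basic and nested types.
schema = pa.schema([
    ("id", pa.int64()),
    ("name", pa.string()),
    ("scores", pa.list_(pa.float64())),                  # array
    ("attributes", pa.map_(pa.string(), pa.string())),   # map
    ("address", pa.struct([                              # struct
        ("city", pa.string()),
        ("zip", pa.string()),
    ])),
])
print(schema)
```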
The representation of types is also standardised. Parquet has both a date type and a datetime (timestamp) type, both sensibly recorded as integers (timestamps in UTC).
Parquet provides a single way to represent missing data: the null type.
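A small sketch of how dates, timestamps, and nulls look through PyArrow (column and file names are illustrative):

```python
import datetime
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    # Dates are stored as integer days; None becomes a Parquet null.
    "event_date": pa.array([datetime.date(2024, 1, 1), None], type=pa.date32()),
    # Timestamps are stored as integers, here in UTC at millisecond precision.
    "created_at": pa.array(
        [datetime.datetime(2024, 1, 1, 12, 0, tzinfo=datetime.timezone.utc), None],
        type=pa.timestamp("ms", tz="UTC"),
    ),
})
pq.write_table(table, "types_demo.parquet")  # hypothetical file name
print(pq.read_table("types_demo.parquet").schema)
```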
Parquet is partly row oriented and partly column oriented. The data going into a Parquet file is broken up into "row groups" - largish sets of rows. Inside a row group, each column is stored separately in a "column chunk" - this is what facilitates all the tricks to make the data smaller. Compression works better when similar data is adjacent. Run-length encoding is possible. So is delta encoding.
At the end of the file is the index, which contains references to all the row groups, column chunks, and so on. Because the index is at the end of the file, you can't stream it. Instead, with Parquet, you tend to split your data across multiple files (there is explicit support for this in the format) and then use the indexes to skip around and find the data you want. But again, that requires random access - no streaming.
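As a rough sketch of how a reader uses that footer index to skip data with PyArrow (the file name and row-group size below are arbitrary choices for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small illustrative table; in practice this would be much larger.
table = pa.table({"id": list(range(1_000)), "value": [i * 0.5 for i in range(1_000)]})

# Write with an explicit row-group size (arbitrary here).
pq.write_table(table, "chunked.parquet", row_group_size=250)

# Read only the columns you need; the footer lets the reader skip the rest.
subset = pq.read_table("chunked.parquet", columns=["id"])

# Or read a single row group via the footer index.
pf = pq.ParquetFile("chunked.parquet")
first = pf.read_row_group(0, columns=["id"])
print(pf.metadata.num_row_groups, first.num_rows)
```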
Compression and encoding techniques in Parquet
Parquet supports a variety of compression algorithms, including Snappy, LZO, Gzip, and LZ4, allowing users to choose the best compression method based on their data characteristics and performance requirements. In addition to compression, Parquet also supports several encoding techniques, such as dictionary encoding, run-length encoding, and delta encoding, which can further improve storage efficiency and query performance.
Choosing the right combination of compression and encoding techniques depends on the specific characteristics of the data, as well as the requirements of the analytical workloads. In general, it is recommended to test different compression and encoding options to determine the optimal configuration for a given dataset.
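With PyArrow, the codec and dictionary encoding can be chosen at write time, roughly as follows (the codec choice, column names, and output path are just examples):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["US", "US", "DE", "DE"],   # low cardinality: dictionary-encodes well
    "amount": [10.0, 12.5, 9.9, 14.2],
})

pq.write_table(
    table,
    "compressed.parquet",          # hypothetical output path
    compression="snappy",          # also e.g. "gzip", "zstd", "lz4", or "none"
    use_dictionary=["country"],    # dictionary-encode just this column
)
```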
Performance considerations and best practices
When working with Parquet, it is essential to consider various performance factors to ensure optimal storage efficiency and query performance. Here are some best practices and performance considerations to keep in mind:
- Choose the right compression and encoding techniques: As mentioned earlier, selecting the appropriate compression algorithm and encoding technique can significantly impact storage efficiency and query performance. Test different options to find the best combination for your specific data and workload.
- Partitioning: Partitioning your data can dramatically improve query performance by reducing the amount of data that needs to be read for a given query. Use partition columns that are commonly used in filter conditions to achieve the most significant performance gains (see the partitioning sketch after this list).
- Column ordering: Place frequently accessed columns together in the schema to minimize I/O during analytical queries. This can help improve query performance by reducing the amount of data that needs to be read from disk.
- Row group size: Parquet organizes data into row groups, which are the unit of parallelism during query execution. Choosing the right row group size can impact query performance, as smaller row groups may lead to increased parallelism, while larger row groups can result in better compression. The optimal row group size depends on the specific data and workload, so it's essential to experiment with different row group sizes to determine the best configuration.
- Use vectorized processing: Modern analytical engines can leverage vectorized processing to improve query performance further. Ensure that your data processing framework supports vectorized processing with Parquet and enable it when possible.
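For the partitioning point above, a minimal sketch with PyArrow (the directory name and partition column are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "country": ["US", "DE", "US", "FR"],
    "amount": [10.0, 9.9, 12.5, 14.2],
})

# Writes one subdirectory per partition value, e.g. events/country=US/...
pq.write_to_dataset(table, root_path="events", partition_cols=["country"])

# Readers that filter on the partition column only touch the matching directories.
us_only = pq.read_table("events", filters=[("country", "=", "US")])
print(us_only.num_rows)
```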
You can add .parquet to any csvbase table URL to get a Parquet file, so that's an easy way to try the format out:
import pandas as pd
df = pd.read_parquet("https://csvbase.com/meripaterson/stock-exchanges.parquet")
You can also install parquet-tools and inspect the file from the command line:
pip install -U parquet-tools
curl -O "https://csvbase.com/meripaterson/stock-exchanges.parquet"
parquet-tools inspect --detail stock-exchanges.parquet
That shows a lot of detail and in conjunction with the spec can help you understand exactly how the format is arranged.
Example of how one might write and read data with the Parquet format using the PyArrow library in Python:
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
# Creating a pandas DataFrame
data = pd.DataFrame({
    'id': [123456, 123457],
    'lastName': ['Doe', 'Smith'],
    'firstName': ['John', 'Jane'],
    'age': [30, 25],
    'email': ['[email protected]', '[email protected]'],
    'address': ['123 Main Street', '456 Oak Avenue'],
    'city': ['City', 'Oak'],
    'country': ['Country', 'Tree'],
    'phoneType': ['mobile', 'work'],
    'phoneNumber': ['1234567890', '0987654321']
})
# Convert the DataFrame into an Arrow Table
table = pa.Table.from_pandas(data)
# Write the Table to a Parquet file
pq.write_table(table, 'user.parquet')
# Reading the Parquet file
table2 = pq.read_table('user.parquet')
# Convert the Table back into a DataFrame
data2 = table2.to_pandas()
print(data2)
Please note that this code requires the pyarrow and pandas libraries, which you can install with pip (for example, pip install pyarrow pandas).