Apache Parquet

Parquet

  • Apache Parquet is a columnar storage format optimized for efficient data processing, storage, and retrieval, especially in big data and analytical workloads. It is widely used in the Hadoop, Spark, and cloud data ecosystems.
  • It stores data in a column-oriented binary file format; for example, a Pandas DataFrame can be written to Parquet as-is (see the sketch after this list).
  • Columnar Storage: Stores data by column instead of by row, enabling efficient filtering, compression, and projection (reading only the columns you need).
  • Because each column holds values of a similar data type, the data compresses better (e.g., with Snappy, Gzip, or Brotli).
  • It is supported by Apache Spark, Hadoop, Hive, Trino, Flink, Dask, and other engines.
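A minimal sketch of the DataFrame-to-Parquet round trip described above, assuming pandas with the pyarrow engine is installed; the file name, column names, and data are illustrative only:

```python
import pandas as pd

# Hypothetical sample data; any DataFrame works the same way.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "IN", "DE"],
    "revenue": [120.5, 87.0, 230.1],
})

# Write the DataFrame to a Parquet file. pandas delegates to the
# pyarrow engine; Snappy compression is the default codec.
df.to_parquet("users.parquet", engine="pyarrow", compression="snappy")

# Projection: read back only the columns we need. With a columnar
# layout, the other columns are never deserialized from disk.
subset = pd.read_parquet("users.parquet", columns=["country", "revenue"])
print(subset)
```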

Parquet vs. CSV vs. JSON

| Feature | Parquet | CSV | JSON |
|---|---|---|---|
| Format | Binary, columnar | Plain text, row-based | Plain text, row-based |
| Compression | ✅ Built-in (Snappy, Gzip, etc.) | ❌ None (unless externally zipped) | ❌ None (unless externally zipped) |
| Read Efficiency | ✅ Very fast for selective columns | ❌ Reads entire rows | ❌ Reads entire rows |
| Write Efficiency | ❌ Slower due to encoding overhead | ✅ Very fast | 🟡 Moderate |
| Schema | ✅ Strong, self-describing schema | ❌ None (requires manual handling) | 🟡 Some (self-describing, flexible) |
| Data Types | ✅ Rich (nested, typed) | ❌ Flat, all strings unless parsed | ✅ Supports nested & typed data |
| Splittable | ✅ Yes (ideal for parallel processing) | ✅ Yes | ❌ No (not naturally splittable) |
| Human-readable | ❌ No (binary format) | ✅ Yes | ✅ Yes |
| Storage Size | 🟢 Very small (compressed, binary) | 🔴 Large (plain text) | 🔴 Large (verbose) |
| Best For | Analytics, data lakes, ML pipelines | Simple exports, small datasets | APIs, config, logs, semi-structured data |
| Tools Supported | ✅ Hadoop, Spark, Pandas, Trino, etc. | ✅ Excel, Pandas, R, databases | ✅ Python, Node, APIs, NoSQL DBs |
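One hedged way to see the storage-size and compression rows of the table in practice is to write the same DataFrame in all three formats and compare file sizes. This sketch assumes pandas with pyarrow installed; the file names and data are hypothetical, and exact sizes depend on the data and codec, though Parquet is typically the smallest:

```python
import os
import pandas as pd

# Hypothetical numeric dataset, large enough to show size differences.
df = pd.DataFrame({
    "id": range(100_000),
    "value": [i * 0.5 for i in range(100_000)],
})

df.to_parquet("data.parquet")               # binary, compressed, columnar
df.to_csv("data.csv", index=False)          # plain text, row-based
df.to_json("data.json", orient="records")   # plain text, verbose

for path in ("data.parquet", "data.csv", "data.json"):
    print(f"{path}: {os.path.getsize(path):,} bytes")
```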