Apache Parquet

Parquet

  • Apache Parquet is a columnar storage format optimized for efficient data processing, storage, and retrieval, especially in big data and analytical workloads. It is widely used in the Hadoop, Spark, and cloud data ecosystems.
  • It stores data in a column-oriented binary file format; for example, a Pandas DataFrame can be written to Parquet as-is (see the sketch after this list).
  • Columnar Storage: Stores data by column instead of by row, enabling efficient filtering, compression, and projection (reading only the columns you need).
  • Because each column holds values of a similar data type, the data compresses better (e.g., with Snappy, Gzip, or Brotli).
  • It is supported by Apache Spark, Hadoop, Hive, Trino, Flink, Dask, and other engines.
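A minimal sketch of the DataFrame-to-Parquet round trip described above, assuming pandas with the pyarrow engine is installed; the file name, column names, and data are illustrative only:

```python
import pandas as pd

# Hypothetical sample data; any DataFrame works the same way.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "IN", "DE"],
    "revenue": [120.5, 87.0, 230.1],
})

# Write the DataFrame to a Parquet file. pandas delegates to the
# pyarrow engine; Snappy compression is the default codec.
df.to_parquet("users.parquet", engine="pyarrow", compression="snappy")

# Projection: read back only the columns we need. With a columnar
# layout, the other columns are never deserialized from disk.
subset = pd.read_parquet("users.parquet", columns=["country", "revenue"])
print(subset)
```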

Parquet vs. CSV vs. JSON

| Feature | Parquet | CSV | JSON |
|---|---|---|---|
| Format | Binary, columnar | Plain text, row-based | Plain text, row-based |
| Compression | ✅ Built-in (Snappy, Gzip, etc.) | ❌ None (unless externally zipped) | ❌ None (unless externally zipped) |
| Read Efficiency | ✅ Very fast for selective columns | ❌ Reads entire rows | ❌ Reads entire rows |
| Write Efficiency | ❌ Slower due to encoding overhead | ✅ Very fast | 🟡 Moderate |
| Schema | ✅ Strong, self-describing schema | ❌ None (requires manual handling) | 🟡 Some (self-describing, flexible) |
| Data Types | ✅ Rich (nested, typed) | ❌ Flat, all strings unless parsed | ✅ Supports nested & typed data |
| Splittable | ✅ Yes (ideal for parallel processing) | ✅ Yes | ❌ No (not naturally splittable) |
| Human-readable | ❌ No (binary format) | ✅ Yes | ✅ Yes |
| Storage Size | 🟢 Very small (compressed, binary) | 🔴 Large (plain text) | 🔴 Large (verbose) |
| Best For | Analytics, data lakes, ML pipelines | Simple exports, small datasets | APIs, config, logs, semi-structured data |
| Tools Supported | ✅ Hadoop, Spark, Pandas, Trino, etc. | ✅ Excel, Pandas, R, databases | ✅ Python, Node, APIs, NoSQL DBs |
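One hedged way to see the storage-size and compression rows of the table in practice is to write the same DataFrame in all three formats and compare file sizes. This sketch assumes pandas with pyarrow installed; the file names and data are hypothetical, and exact sizes depend on the data and codec, though Parquet is typically the smallest:

```python
import os
import pandas as pd

# Hypothetical numeric dataset, large enough to show size differences.
df = pd.DataFrame({
    "id": range(100_000),
    "value": [i * 0.5 for i in range(100_000)],
})

df.to_parquet("data.parquet")               # binary, compressed, columnar
df.to_csv("data.csv", index=False)          # plain text, row-based
df.to_json("data.json", orient="records")   # plain text, verbose

for path in ("data.parquet", "data.csv", "data.json"):
    print(f"{path}: {os.path.getsize(path):,} bytes")
```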