# Architecture
## Overview
iparq is a small, focused CLI for inspecting Parquet files without loading full datasets into memory. The project is intentionally compact: nearly all application logic lives in `src/iparq/source.py`, with Typer exposing the command-line interface, PyArrow reading Parquet metadata, Pydantic modeling the output structure, and Rich rendering terminal-friendly tables.
At a high level, the tool:
- Accepts one or more file paths or glob patterns.
- Expands and deduplicates the file list.
- Reads Parquet metadata through PyArrow.
- Builds a column-by-column view by enriching a shared `ParquetColumnInfo` model in stages.
- Emits either Rich terminal output or JSON.
## Repository Structure

```
iparq/
├── src/iparq/
│   ├── __init__.py
│   ├── source.py
│   └── py.typed
├── tests/
│   ├── conftest.py
│   ├── test_cli.py
│   └── dummy.parquet
├── media/
│   └── iparq.png
├── .github/workflows/
│   ├── python-package.yml
│   ├── test.yml
│   ├── merge.yml
│   ├── python-publish.yml
│   └── copilot-setup-steps.yml
├── pyproject.toml
├── uv.lock
├── CONTRIBUTING.md
├── LICENSE
└── README.md
```
## Architectural Style
The codebase follows a single-module CLI architecture:
- One primary source file keeps the project easy to scan and maintain.
- Functional orchestration is preferred over deep class hierarchies.
- Typed data models define the internal data contract.
- Progressive enrichment builds column metadata in multiple passes.
This is a good fit for iparq because the domain is narrow, the execution path is linear, and startup simplicity matters more than framework abstraction.
## Technology Stack
| Concern | Technology | Role |
|---|---|---|
| CLI | Typer | Defines the iparq command and options |
| Metadata access | PyArrow | Reads Parquet metadata efficiently |
| Data modeling | Pydantic | Structures and validates file/column metadata |
| Terminal rendering | Rich | Prints styled tables and console output |
| Build backend | Hatchling | Packaging and distribution |
| Environment / dependency workflow | uv | Lockfile and dependency management |
| Linting | Ruff | Static lint checks |
| Type checking | mypy | Type analysis |
| Formatting | Black | Code formatting |
| Testing | pytest + pytest-cov | CLI and behavior validation |
## Entry Point and CLI Wiring
The CLI entry point is the Typer application:

```python
app = typer.Typer(...)
```
`pyproject.toml` exposes it as:

```toml
[project.scripts]
iparq = "iparq.source:app"
```
That means installing the package creates an `iparq` executable that dispatches into `iparq.source`.

The main user-facing command is `inspect`, which is also registered as the default command:

```
iparq ...
iparq inspect ...
```
Both invoke the same handler.
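A minimal sketch of what that wiring could look like, with illustrative option names (this is not the actual source; making `inspect` also the default command needs extra wiring, such as a Typer callback or argv preprocessing, which is omitted here):

```python
from typing import List

import typer

app = typer.Typer()

@app.command()
def inspect(filenames: List[str], output: str = "rich") -> None:
    """Inspect Parquet metadata for one or more files."""
    for name in filenames:
        typer.echo(f"inspecting {name}")

if __name__ == "__main__":
    app()
```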
## Core Domain Models
The project uses Pydantic models as the internal schema for output-ready data.
### OutputFormat

An Enum that constrains output to:
- `rich`
- `json`
This keeps CLI option parsing explicit and avoids ad hoc string handling.
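A sketch of the shape this implies (member values are assumed to match the CLI option strings):

```python
from enum import Enum

class OutputFormat(str, Enum):
    """Allowed values for the output option."""
    rich = "rich"
    json = "json"
```

Subclassing `str` keeps comparisons and serialization simple, and Typer renders enum-typed parameters as a fixed set of CLI choices.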
### ParquetMetaModel

Represents file-level metadata:
- `created_by`
- `num_columns`
- `num_rows`
- `num_row_groups`
- `format_version`
- `serialized_size`
This model is the stable contract between raw PyArrow metadata and presentation.
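A minimal sketch, assuming straightforward field types (the real annotations may differ):

```python
from pydantic import BaseModel

class ParquetMetaModel(BaseModel):
    """File-level Parquet metadata, decoupled from PyArrow objects."""
    created_by: str
    num_columns: int
    num_rows: int
    num_row_groups: int
    format_version: str
    serialized_size: int
```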
### ColumnInfo
Represents a single column chunk within a row group. It includes:
- identity: row group, column name, column index
- storage: compression type, compressed size, uncompressed size
- statistics: min, max, exactness flags
- feature flags: bloom filter presence, encryption status
- cardinality-style information: number of values
Because Parquet metadata is row-group based, `ColumnInfo` represents a column chunk, not just a logical schema field.

### ParquetColumnInfo

A container model holding `List[ColumnInfo]`.
This acts as the shared mutable aggregate that the metadata collection functions progressively populate.
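A hedged sketch of both models; every field name and type below is an assumption derived from the descriptions above, not copied from the source:

```python
from typing import Any, List, Optional
from pydantic import BaseModel

class ColumnInfo(BaseModel):
    """Metadata for one column chunk within one row group."""
    row_group: int
    column_name: str
    column_index: int
    compression: str
    compressed_size: int
    uncompressed_size: int
    num_values: int = 0
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None
    min_max_exact: Optional[bool] = None
    has_bloom_filter: bool = False
    is_encrypted: bool = False

class ParquetColumnInfo(BaseModel):
    """Shared aggregate that the collection passes progressively fill."""
    columns: List[ColumnInfo] = []
```

Pydantic deep-copies mutable defaults per instance, so the empty-list default is safe here.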
## Module-Level Flow

```
CLI args
   |
   v
inspect(...)
   |
   +--> glob expansion + deduplication
   |
   +--> for each file
           |
           v
        inspect_single_file(...)
           |
           +--> read_parquet_metadata(...)
           |
           +--> build ParquetMetaModel
           |
           +--> create empty ParquetColumnInfo
           |
           +--> print_compression_types(...)
           +--> print_bloom_filter_info(...)
           +--> print_min_max_statistics(...)
           |
           +--> optional column filtering
           |
           +--> rich output OR JSON output
```
## Detailed Execution Pipeline

### 1. Input Collection

`inspect()` accepts one or more filenames or glob patterns. It:
- expands patterns with `glob.glob`
- preserves literal filenames when no match is found
- removes duplicates while preserving original order
This gives users flexible shell-like matching without requiring shell expansion.
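A minimal sketch of that behavior (the helper name `expand_files` is hypothetical):

```python
import glob

def expand_files(patterns: list[str]) -> list[str]:
    """Expand globs, keep unmatched literals, dedupe while preserving order."""
    seen: dict[str, None] = {}
    for pattern in patterns:
        matches = glob.glob(pattern) or [pattern]  # fall back to the literal name
        for path in matches:
            seen.setdefault(path, None)            # dict keys keep insertion order
    return list(seen)
```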
### 2. File Metadata Read

`read_parquet_metadata(filename)` opens the file with `pyarrow.parquet.ParquetFile` and reads only metadata, not table data. It also scans row groups and columns to collect the set of compression codecs used in the file.
This is the key performance design choice of the project: inspect structure, not payload.
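A sketch of how this can look with the public PyArrow API (the return shape is an assumption):

```python
import pyarrow.parquet as pq

def read_parquet_metadata(filename: str):
    """Read footer metadata and collect the set of codecs used in the file."""
    parquet_file = pq.ParquetFile(filename)  # reads the footer, not the data pages
    metadata = parquet_file.metadata         # pyarrow FileMetaData
    codecs: set[str] = set()
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for idx in range(row_group.num_columns):
            codecs.add(row_group.column(idx).compression)
    return metadata, codecs
```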
### 3. File-Level Model Construction

`inspect_single_file()` converts raw PyArrow metadata into a `ParquetMetaModel`.
This creates a clean separation:
- PyArrow objects remain external implementation details.
- Pydantic models become the app's internal representation.
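In sketch form, the mapping is one-to-one, since PyArrow's `FileMetaData` exposes attributes with the same names as the model fields (the helper name `to_meta_model` is hypothetical):

```python
def to_meta_model(metadata) -> ParquetMetaModel:
    """Map a pyarrow FileMetaData object onto the internal Pydantic model."""
    return ParquetMetaModel(
        created_by=metadata.created_by,
        num_columns=metadata.num_columns,
        num_rows=metadata.num_rows,
        num_row_groups=metadata.num_row_groups,
        format_version=metadata.format_version,
        serialized_size=metadata.serialized_size,
    )
```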
### 4. Column Information Assembly

An empty `ParquetColumnInfo` instance is created and then filled in stages:
- `print_compression_types()` creates the initial `ColumnInfo` entries.
- `print_bloom_filter_info()` updates those entries with bloom filter flags.
- `print_min_max_statistics()` updates them again with statistics and exactness.
This is effectively a builder-style pipeline:
- step 1 creates the base objects
- later steps enrich the same objects
- the final aggregate is presentation-ready
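A simplified sketch of the first and a later pass, reusing the hypothetical model fields from the sketches above (the real helpers also print as they go; the bloom filter pass is omitted because its PyArrow access path varies by version):

```python
def pass_compression(metadata, info: ParquetColumnInfo) -> None:
    """Pass 1: create one ColumnInfo record per column chunk."""
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for idx in range(row_group.num_columns):
            chunk = row_group.column(idx)
            info.columns.append(ColumnInfo(
                row_group=rg,
                column_name=chunk.path_in_schema,
                column_index=idx,
                compression=chunk.compression,
                compressed_size=chunk.total_compressed_size,
                uncompressed_size=chunk.total_uncompressed_size,
                num_values=chunk.num_values,
            ))

def pass_statistics(metadata, info: ParquetColumnInfo) -> None:
    """Later pass: enrich the same records with min/max statistics."""
    for entry in info.columns:
        chunk = metadata.row_group(entry.row_group).column(entry.column_index)
        stats = chunk.statistics
        if stats is not None and stats.has_min_max:
            entry.min_value = stats.min
            entry.max_value = stats.max
```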
### 5. Optional Filtering
If the user passes a column filter, the final list is narrowed to matching column names before rendering.
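In sketch form (exact-match semantics are an assumption; the real filter may match differently):

```python
if column_filter is not None:
    info.columns = [c for c in info.columns if c.column_name == column_filter]
```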
### 6. Presentation

Two output modes are supported:
- Rich mode: prints the metadata model, a formatted table, and codec summary
- JSON mode: serializes `metadata`, `columns`, and `compression_codecs`
## Column Metadata Builder Pattern

Although there is no formal builder class, iparq uses a practical builder pattern through staged mutation of `ParquetColumnInfo`:
```
empty ParquetColumnInfo
        |
        v
compression pass    -> create ColumnInfo records
        |
        v
bloom filter pass   -> update existing records
        |
        v
statistics pass     -> update existing records
        |
        v
final output-ready model
```
Why this works well:
- each pass has a single responsibility
- PyArrow feature access stays localized
- the final rendering code can assume a normalized structure
Tradeoff:
- later passes search for existing entries by row group and column index, which is simple and readable but not optimized for very large metadata graphs
For this project size and use case, readability wins.
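If that lookup ever became a bottleneck, a dictionary keyed by `(row_group, column_index)` would make it constant-time; a hypothetical sketch:

```python
index = {(c.row_group, c.column_index): c for c in info.columns}
entry = index[(rg, idx)]  # O(1) lookup instead of a linear scan
```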
## Presentation Layer

### Rich Output

`print_column_info_table()` builds a Rich Table with:
- row group
- column name
- column index
- compression
- bloom filter status
- encryption status
- min / max values
- exactness indicator
- optional size and compression ratio details
The table is optimized for interactive terminal use and quick human inspection.
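A minimal sketch of such a table with Rich (the column subset and field names are illustrative):

```python
from rich.console import Console
from rich.table import Table

def print_column_info_table(info: ParquetColumnInfo) -> None:
    table = Table(title="Parquet column chunks")
    for header in ("Row group", "Column", "Index", "Compression", "Bloom", "Min", "Max"):
        table.add_column(header)
    for c in info.columns:
        table.add_row(
            str(c.row_group), c.column_name, str(c.column_index),
            c.compression, "yes" if c.has_bloom_filter else "no",
            str(c.min_value), str(c.max_value),
        )
    Console().print(table)
```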
### JSON Output

`output_json()` converts the Pydantic models with `model_dump()` and prints a single JSON document.
This makes the CLI usable in scripts, automation, and downstream tooling.
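A sketch of that path (the top-level keys follow the description above; `default=str` is an assumption for values such as dates that are not JSON-native):

```python
import json

def output_json(metadata: ParquetMetaModel, info: ParquetColumnInfo, codecs: set[str]) -> None:
    document = {
        "metadata": metadata.model_dump(),
        "columns": [c.model_dump() for c in info.columns],
        "compression_codecs": sorted(codecs),
    }
    print(json.dumps(document, indent=2, default=str))
```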
## Error Handling Strategy
The project uses lightweight CLI-oriented error handling:
- file open or metadata read failures are surfaced at the file-processing stage
- metadata collection helpers catch exceptions and print styled error messages
- multi-file inspection continues after individual file failures
This favors resilience and good CLI ergonomics over strict fail-fast behavior.
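In sketch form, the multi-file loop favors continuing over aborting (styling and message text are illustrative):

```python
from rich.console import Console

console = Console()

for path in files:
    try:
        inspect_single_file(path)
    except Exception as exc:  # report the failure, keep processing other files
        console.print(f"[red]Failed to inspect {path}: {exc}[/red]")
```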
## Design Decisions

### Why a single-file implementation?
Benefits:
- very low cognitive overhead
- easy onboarding for contributors
- straightforward debugging
- minimal packaging complexity
Tradeoff:
- as features grow, `source.py` becomes a mixed home for CLI, domain models, orchestration, and presentation
For the current scope, the simplicity is intentional and reasonable.
### Why Pydantic?
Pydantic gives the project:
- explicit schemas
- typed fields
- predictable serialization
- a clean boundary between raw library objects and output contracts
That is especially useful for JSON output and for keeping the Rich and JSON paths aligned.
### Why PyArrow metadata access only?
Reading metadata without loading table contents keeps the tool:
- fast
- memory-efficient
- safe for large Parquet files
This matches the core mission of iparq: inspection, not data processing.
### Why a staged enrichment pipeline?
The column metadata is not assembled in one large function. Instead, each pass adds one concern:
- storage/compression
- bloom filter signals
- min/max statistics
This keeps feature growth manageable and reduces cross-coupling.
## Testing and Quality Gates

The test suite lives under `tests/` and focuses on CLI behavior and output. Supporting quality tooling is defined in `pyproject.toml` and exercised in GitHub Actions workflows.
The repository includes automation for:
- linting
- type checking
- tests
- coverage
- merge validation
- package publishing
This gives the project a small runtime surface area but a stronger delivery pipeline around it.
## Future Refactoring Boundaries
If the project grows, the current architecture naturally splits into these modules:
- `cli.py` - Typer commands and argument parsing
- `models.py` - Pydantic models and enums
- `metadata.py` - PyArrow access and metadata extraction
- `renderers.py` - Rich and JSON output
The present architecture keeps those seams implicit, which is acceptable for the current size.
## Summary
iparq is a deliberately compact CLI with a simple architecture:
- one main module
- one command-oriented execution flow
- typed output models
- staged metadata enrichment
- dual human/machine output modes
Its design prioritizes clarity, low overhead, and fast metadata inspection over abstraction-heavy structure.