# Architecture
## Overview
iparq is a small, focused CLI for inspecting Parquet files without loading full datasets into memory. The project is intentionally compact: nearly all application logic lives in `src/iparq/source.py`, with Typer exposing the command-line interface, PyArrow reading Parquet metadata, Pydantic modeling the output structure, and Rich rendering terminal-friendly tables.
At a high level, the tool:
- Accepts one or more file paths or glob patterns.
- Expands and deduplicates the file list.
- Reads Parquet metadata through PyArrow.
- Builds a column-by-column view by enriching a shared `ParquetColumnInfo` model in stages.
- Emits either Rich terminal output or JSON.
## Repository Structure

```
iparq/
├── src/iparq/
│   ├── __init__.py
│   ├── source.py
│   └── py.typed
├── tests/
│   ├── conftest.py
│   ├── test_cli.py
│   └── dummy.parquet
├── media/
│   └── iparq.png
├── .github/workflows/
│   ├── python-package.yml
│   ├── test.yml
│   ├── merge.yml
│   ├── python-publish.yml
│   └── copilot-setup-steps.yml
├── pyproject.toml
├── uv.lock
├── CONTRIBUTING.md
├── LICENSE
└── README.md
```
## Architectural Style
The codebase follows a single-module CLI architecture:
- One primary source file keeps the project easy to scan and maintain.
- Functional orchestration is preferred over deep class hierarchies.
- Typed data models define the internal data contract.
- Progressive enrichment builds column metadata in multiple passes.
This is a good fit for iparq because the domain is narrow, the execution path is linear, and startup simplicity matters more than framework abstraction.
## Technology Stack
| Concern | Technology | Role |
|---|---|---|
| CLI | Typer | Defines the iparq command and options |
| Metadata access | PyArrow | Reads Parquet metadata efficiently |
| Data modeling | Pydantic | Structures and validates file/column metadata |
| Terminal rendering | Rich | Prints styled tables and console output |
| Build backend | Hatchling | Packaging and distribution |
| Environment / dependency workflow | uv | Lockfile and dependency management |
| Linting | Ruff | Static lint checks |
| Type checking | mypy | Type analysis |
| Formatting | Black | Code formatting |
| Testing | pytest + pytest-cov | CLI and behavior validation |
## Entry Point and CLI Wiring
The CLI entry point is the Typer application:

```python
app = typer.Typer(...)
```
`pyproject.toml` exposes it as:

```toml
[project.scripts]
iparq = "iparq.source:app"
```
That means installing the package creates an `iparq` executable that dispatches into `iparq.source`.

The main user-facing command is `inspect`, which is also registered as the default command:

```
iparq ...
iparq inspect ...
```
Both invoke the same handler.
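A minimal sketch of what that wiring could look like, with illustrative option names (this is not the actual source; making `inspect` also the default command needs extra wiring, such as a Typer callback or argv preprocessing, which is omitted here):

```python
from typing import List

import typer

app = typer.Typer()

@app.command()
def inspect(filenames: List[str], output: str = "rich") -> None:
    """Inspect Parquet metadata for one or more files."""
    for name in filenames:
        typer.echo(f"inspecting {name}")

if __name__ == "__main__":
    app()
```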
## Core Domain Models
The project uses Pydantic models as the internal schema for output-ready data.
### OutputFormat

An Enum that constrains output to:
- `rich`
- `json`
This keeps CLI option parsing explicit and avoids ad hoc string handling.
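A sketch of the shape this implies (member values are assumed to match the CLI option strings):

```python
from enum import Enum

class OutputFormat(str, Enum):
    """Allowed values for the output option."""
    rich = "rich"
    json = "json"
```

Subclassing `str` keeps comparisons and serialization simple, and Typer renders enum-typed parameters as a fixed set of CLI choices.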
### ParquetMetaModel

Represents file-level metadata:
- `created_by`
- `num_columns`
- `num_rows`
- `num_row_groups`
- `format_version`
- `serialized_size`
This model is the stable contract between raw PyArrow metadata and presentation.
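A minimal sketch, assuming straightforward field types (the real annotations may differ):

```python
from pydantic import BaseModel

class ParquetMetaModel(BaseModel):
    """File-level Parquet metadata, decoupled from PyArrow objects."""
    created_by: str
    num_columns: int
    num_rows: int
    num_row_groups: int
    format_version: str
    serialized_size: int
```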
### ColumnInfo
Represents a single column chunk within a row group. It includes:
- identity: row group, column name, column index
- storage: compression type, compressed size, uncompressed size
- statistics: min, max, exactness flags
- feature flags: bloom filter presence, encryption status
- cardinality-style information: number of values
Because Parquet metadata is row-group based, `ColumnInfo` represents a column chunk, not just a logical schema field.

### ParquetColumnInfo

A container model holding `List[ColumnInfo]`.
This acts as the shared mutable aggregate that the metadata collection functions progressively populate.
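A hedged sketch of both models; every field name and type below is an assumption derived from the descriptions above, not copied from the source:

```python
from typing import Any, List, Optional
from pydantic import BaseModel

class ColumnInfo(BaseModel):
    """Metadata for one column chunk within one row group."""
    row_group: int
    column_name: str
    column_index: int
    compression: str
    compressed_size: int
    uncompressed_size: int
    num_values: int = 0
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None
    min_max_exact: Optional[bool] = None
    has_bloom_filter: bool = False
    is_encrypted: bool = False

class ParquetColumnInfo(BaseModel):
    """Shared aggregate that the collection passes progressively fill."""
    columns: List[ColumnInfo] = []
```

Pydantic deep-copies mutable defaults per instance, so the empty-list default is safe here.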
## Module-Level Flow

```
CLI args
   |
   v
inspect(...)
   |
   +--> glob expansion + deduplication
   |
   +--> for each file
           |
           v
        inspect_single_file(...)
           |
           +--> read_parquet_metadata(...)
           |
           +--> build ParquetMetaModel
           |
           +--> create empty ParquetColumnInfo
           |
           +--> print_compression_types(...)
           +--> print_bloom_filter_info(...)
           +--> print_min_max_statistics(...)
           |
           +--> optional column filtering
           |
           +--> rich output OR JSON output
```
## Detailed Execution Pipeline

### 1. Input Collection

`inspect()` accepts one or more filenames or glob patterns. It:
- expands patterns with `glob.glob`
- preserves literal filenames when no match is found
- removes duplicates while preserving original order
This gives users flexible shell-like matching without requiring shell expansion.
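A minimal sketch of that behavior (the helper name `expand_files` is hypothetical):

```python
import glob

def expand_files(patterns: list[str]) -> list[str]:
    """Expand globs, keep unmatched literals, dedupe while preserving order."""
    seen: dict[str, None] = {}
    for pattern in patterns:
        matches = glob.glob(pattern) or [pattern]  # fall back to the literal name
        for path in matches:
            seen.setdefault(path, None)            # dict keys keep insertion order
    return list(seen)
```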
### 2. File Metadata Read

`read_parquet_metadata(filename)` opens the file with `pyarrow.parquet.ParquetFile` and reads only metadata, not table data. It also scans row groups and columns to collect the set of compression codecs used in the file.
This is the key performance design choice of the project: inspect structure, not payload.
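A sketch of how this can look with the public PyArrow API (the return shape is an assumption):

```python
import pyarrow.parquet as pq

def read_parquet_metadata(filename: str):
    """Read footer metadata and collect the set of codecs used in the file."""
    parquet_file = pq.ParquetFile(filename)  # reads the footer, not the data pages
    metadata = parquet_file.metadata         # pyarrow FileMetaData
    codecs: set[str] = set()
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for idx in range(row_group.num_columns):
            codecs.add(row_group.column(idx).compression)
    return metadata, codecs
```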
### 3. File-Level Model Construction

`inspect_single_file()` converts raw PyArrow metadata into a `ParquetMetaModel`.
This creates a clean separation:
- PyArrow objects remain external implementation details.
- Pydantic models become the app's internal representation.
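In sketch form, the mapping is one-to-one, since PyArrow's `FileMetaData` exposes attributes with the same names as the model fields (the helper name `to_meta_model` is hypothetical):

```python
def to_meta_model(metadata) -> ParquetMetaModel:
    """Map a pyarrow FileMetaData object onto the internal Pydantic model."""
    return ParquetMetaModel(
        created_by=metadata.created_by,
        num_columns=metadata.num_columns,
        num_rows=metadata.num_rows,
        num_row_groups=metadata.num_row_groups,
        format_version=metadata.format_version,
        serialized_size=metadata.serialized_size,
    )
```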
### 4. Column Information Assembly

An empty `ParquetColumnInfo` instance is created and then filled in stages:
- `print_compression_types()` creates the initial `ColumnInfo` entries.
- `print_bloom_filter_info()` updates those entries with bloom filter flags.
- `print_min_max_statistics()` updates them again with statistics and exactness.
This is effectively a builder-style pipeline:
- step 1 creates the base objects
- later steps enrich the same objects
- the final aggregate is presentation-ready
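A simplified sketch of the first and a later pass, reusing the hypothetical model fields from the sketches above (the real helpers also print as they go; the bloom filter pass is omitted because its PyArrow access path varies by version):

```python
def pass_compression(metadata, info: ParquetColumnInfo) -> None:
    """Pass 1: create one ColumnInfo record per column chunk."""
    for rg in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg)
        for idx in range(row_group.num_columns):
            chunk = row_group.column(idx)
            info.columns.append(ColumnInfo(
                row_group=rg,
                column_name=chunk.path_in_schema,
                column_index=idx,
                compression=chunk.compression,
                compressed_size=chunk.total_compressed_size,
                uncompressed_size=chunk.total_uncompressed_size,
                num_values=chunk.num_values,
            ))

def pass_statistics(metadata, info: ParquetColumnInfo) -> None:
    """Later pass: enrich the same records with min/max statistics."""
    for entry in info.columns:
        chunk = metadata.row_group(entry.row_group).column(entry.column_index)
        stats = chunk.statistics
        if stats is not None and stats.has_min_max:
            entry.min_value = stats.min
            entry.max_value = stats.max
```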
### 5. Optional Filtering
If the user passes a column filter, the final list is narrowed to matching column names before rendering.
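In sketch form (exact-match semantics are an assumption; the real filter may match differently):

```python
if column_filter is not None:
    info.columns = [c for c in info.columns if c.column_name == column_filter]
```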
### 6. Presentation

Two output modes are supported:
- Rich mode: prints the metadata model, a formatted table, and codec summary
- JSON mode: serializes `metadata`, `columns`, and `compression_codecs`
## Column Metadata Builder Pattern

Although there is no formal builder class, iparq uses a practical builder pattern through staged mutation of `ParquetColumnInfo`:
```
empty ParquetColumnInfo
        |
        v
compression pass    -> create ColumnInfo records
        |
        v
bloom filter pass   -> update existing records
        |
        v
statistics pass     -> update existing records
        |
        v
final output-ready model
```
Why this works well:
- each pass has a single responsibility
- PyArrow feature access stays localized
- the final rendering code can assume a normalized structure
Tradeoff:
- later passes search for existing entries by row group and column index, which is simple and readable but not optimized for very large metadata graphs
For this project size and use case, readability wins.
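If that lookup ever became a bottleneck, a dictionary keyed by `(row_group, column_index)` would make it constant-time; a hypothetical sketch:

```python
index = {(c.row_group, c.column_index): c for c in info.columns}
entry = index[(rg, idx)]  # O(1) lookup instead of a linear scan
```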
## Presentation Layer

### Rich Output

`print_column_info_table()` builds a Rich Table with:
- row group
- column name
- column index
- compression
- bloom filter status
- encryption status
- min / max values
- exactness indicator
- optional size and compression ratio details
The table is optimized for interactive terminal use and quick human inspection.
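A minimal sketch of such a table with Rich (the column subset and field names are illustrative):

```python
from rich.console import Console
from rich.table import Table

def print_column_info_table(info: ParquetColumnInfo) -> None:
    table = Table(title="Parquet column chunks")
    for header in ("Row group", "Column", "Index", "Compression", "Bloom", "Min", "Max"):
        table.add_column(header)
    for c in info.columns:
        table.add_row(
            str(c.row_group), c.column_name, str(c.column_index),
            c.compression, "yes" if c.has_bloom_filter else "no",
            str(c.min_value), str(c.max_value),
        )
    Console().print(table)
```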
### JSON Output

`output_json()` converts the Pydantic models with `model_dump()` and prints a single JSON document.
This makes the CLI usable in scripts, automation, and downstream tooling.
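A sketch of that path (the top-level keys follow the description above; `default=str` is an assumption for values such as dates that are not JSON-native):

```python
import json

def output_json(metadata: ParquetMetaModel, info: ParquetColumnInfo, codecs: set[str]) -> None:
    document = {
        "metadata": metadata.model_dump(),
        "columns": [c.model_dump() for c in info.columns],
        "compression_codecs": sorted(codecs),
    }
    print(json.dumps(document, indent=2, default=str))
```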
## Error Handling Strategy
The project uses lightweight CLI-oriented error handling:
- file open or metadata read failures are surfaced at the file-processing stage
- metadata collection helpers catch exceptions and print styled error messages
- multi-file inspection continues after individual file failures
This favors resilience and good CLI ergonomics over strict fail-fast behavior.
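In sketch form, the multi-file loop favors continuing over aborting (styling and message text are illustrative):

```python
from rich.console import Console

console = Console()

for path in files:
    try:
        inspect_single_file(path)
    except Exception as exc:  # report the failure, keep processing other files
        console.print(f"[red]Failed to inspect {path}: {exc}[/red]")
```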
## Design Decisions

### Why a single-file implementation?
Benefits:
- very low cognitive overhead
- easy onboarding for contributors
- straightforward debugging
- minimal packaging complexity
Tradeoff:
- as features grow, `source.py` becomes a mixed home for CLI, domain models, orchestration, and presentation
For the current scope, the simplicity is intentional and reasonable.
### Why Pydantic?
Pydantic gives the project:
- explicit schemas
- typed fields
- predictable serialization
- a clean boundary between raw library objects and output contracts
That is especially useful for JSON output and for keeping the Rich and JSON paths aligned.
### Why PyArrow metadata access only?
Reading metadata without loading table contents keeps the tool:
- fast
- memory-efficient
- safe for large Parquet files
This matches the core mission of iparq: inspection, not data processing.
### Why a staged enrichment pipeline?
The column metadata is not assembled in one large function. Instead, each pass adds one concern:
- storage/compression
- bloom filter signals
- min/max statistics
This keeps feature growth manageable and reduces cross-coupling.
## Testing and Quality Gates

The test suite lives under `tests/` and focuses on CLI behavior and output. Supporting quality tooling is defined in `pyproject.toml` and exercised in GitHub Actions workflows.
The repository includes automation for:
- linting
- type checking
- tests
- coverage
- merge validation
- package publishing
This gives the project a small runtime surface area but a stronger delivery pipeline around it.
## Future Refactoring Boundaries
If the project grows, the current architecture naturally splits into these modules:
- `cli.py` - Typer commands and argument parsing
- `models.py` - Pydantic models and enums
- `metadata.py` - PyArrow access and metadata extraction
- `renderers.py` - Rich and JSON output
The present architecture keeps those seams implicit, which is acceptable for the current size.
## Summary
iparq is a deliberately compact CLI with a simple architecture:
- one main module
- one command-oriented execution flow
- typed output models
- staged metadata enrichment
- dual human/machine output modes
Its design prioritizes clarity, low overhead, and fast metadata inspection over abstraction-heavy structure.