Architecture - MiguelElGallo/iparq GitHub Wiki

Architecture

Back to Home.

Overview

iparq is a small, focused CLI for inspecting Parquet files without loading full datasets into memory. The project is intentionally compact: nearly all application logic lives in src/iparq/source.py, with Typer exposing a command-line interface, PyArrow reading Parquet metadata, Pydantic modeling the output structure, and Rich rendering terminal-friendly tables.

At a high level, the tool:

  1. Accepts one or more file paths or glob patterns.
  2. Expands and deduplicates the file list.
  3. Reads Parquet metadata through PyArrow.
  4. Builds a column-by-column view by enriching a shared ParquetColumnInfo model in stages.
  5. Emits either Rich terminal output or JSON.

Repository Structure

iparq/
├── src/iparq/
│   ├── __init__.py
│   ├── source.py
│   └── py.typed
├── tests/
│   ├── conftest.py
│   ├── test_cli.py
│   └── dummy.parquet
├── media/
│   └── iparq.png
├── .github/workflows/
│   ├── python-package.yml
│   ├── test.yml
│   ├── merge.yml
│   ├── python-publish.yml
│   └── copilot-setup-steps.yml
├── pyproject.toml
├── uv.lock
├── CONTRIBUTING.md
├── LICENSE
└── README.md

Architectural Style

The codebase follows a single-module CLI architecture:

  • One primary source file keeps the project easy to scan and maintain.
  • Functional orchestration is preferred over deep class hierarchies.
  • Typed data models define the internal data contract.
  • Progressive enrichment builds column metadata in multiple passes.

This is a good fit for iparq because the domain is narrow, the execution path is linear, and startup simplicity matters more than framework abstraction.

Technology Stack

| Concern | Technology | Role |
|---|---|---|
| CLI | Typer | Defines the iparq command and options |
| Metadata access | PyArrow | Reads Parquet metadata efficiently |
| Data modeling | Pydantic | Structures and validates file/column metadata |
| Terminal rendering | Rich | Prints styled tables and console output |
| Build backend | Hatchling | Packaging and distribution |
| Environment / dependency workflow | uv | Lockfile and dependency management |
| Linting | Ruff | Static lint checks |
| Type checking | mypy | Type analysis |
| Formatting | Black | Code formatting |
| Testing | pytest + pytest-cov | CLI and behavior validation |

Entry Point and CLI Wiring

The CLI entry point is the Typer application:

app = typer.Typer(...)

pyproject.toml exposes it as:

[project.scripts]
iparq = "iparq.source:app"

That means installing the package creates an iparq executable that dispatches into iparq.source.

The main user-facing command is inspect, which is also registered as the default command:

  • iparq ...
  • iparq inspect ...

Both invoke the same handler.
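A minimal sketch of the Typer wiring. The handler body and signature here are illustrative, not the real implementation in iparq.source:

```python
from typing import List

import typer

app = typer.Typer(help="Inspect Parquet file metadata.")


@app.command()
def inspect(filenames: List[str]) -> None:
    """Hypothetical handler body; the real logic lives in iparq.source."""
    for name in filenames:
        typer.echo(f"inspecting {name}")
```

With only one command registered, Typer dispatches arguments straight to the handler; the real app additionally makes inspect reachable by name so both invocation forms work.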

Core Domain Models

The project uses Pydantic models as the internal schema for output-ready data.

OutputFormat

An Enum that constrains output to:

  • rich
  • json

This keeps CLI option parsing explicit and avoids ad hoc string handling.
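A sketch of the enum, assuming the member names mirror the option strings:

```python
from enum import Enum


class OutputFormat(str, Enum):
    # The str mixin means members compare equal to their string values,
    # which lets CLI option parsing map "rich"/"json" directly to members.
    rich = "rich"
    json = "json"
```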

ParquetMetaModel

Represents file-level metadata:

  • created_by
  • num_columns
  • num_rows
  • num_row_groups
  • format_version
  • serialized_size

This model is the stable contract between raw PyArrow metadata and presentation.
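The field names above match attributes that PyArrow's FileMetaData exposes, so a sketch of the model (field types are assumptions) looks like:

```python
from pydantic import BaseModel


class ParquetMetaModel(BaseModel):
    # Populated from pyarrow FileMetaData attributes of the same names
    created_by: str
    num_columns: int
    num_rows: int
    num_row_groups: int
    format_version: str
    serialized_size: int
```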

ColumnInfo

Represents a single column chunk within a row group. It includes:

  • identity: row group, column name, column index
  • storage: compression type, compressed size, uncompressed size
  • statistics: min, max, exactness flags
  • feature flags: bloom filter presence, encryption status
  • cardinality-style information: number of values

Because Parquet metadata is row-group based, ColumnInfo represents a column chunk, not just a logical schema field.

ParquetColumnInfo

A container model holding List[ColumnInfo].

This acts as the shared mutable aggregate that the metadata collection functions progressively populate.
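A sketch of the two models, with hypothetical field names grouped as in the list above. Fields filled by later enrichment passes default to None:

```python
from typing import List, Optional

from pydantic import BaseModel


class ColumnInfo(BaseModel):
    # Identity (set by the first pass)
    row_group: int
    column_name: str
    column_index: int
    # Storage
    compression: str
    compressed_size: int
    uncompressed_size: int
    # Enriched by later passes, so they start unset
    has_bloom_filter: Optional[bool] = None
    is_encrypted: Optional[bool] = None
    min_value: Optional[str] = None
    max_value: Optional[str] = None
    is_min_max_exact: Optional[bool] = None
    num_values: Optional[int] = None


class ParquetColumnInfo(BaseModel):
    # Shared aggregate that the collection functions progressively fill
    columns: List[ColumnInfo] = []
```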

Module-Level Flow

CLI args
   |
   v
inspect(...)
   |
   +--> glob expansion + deduplication
   |
   +--> for each file
           |
           v
     inspect_single_file(...)
           |
           +--> read_parquet_metadata(...)
           |
           +--> build ParquetMetaModel
           |
           +--> create empty ParquetColumnInfo
           |
           +--> print_compression_types(...)
           +--> print_bloom_filter_info(...)
           +--> print_min_max_statistics(...)
           |
           +--> optional column filtering
           |
           +--> rich output OR JSON output

Detailed Execution Pipeline

1. Input Collection

inspect() accepts one or more filenames or glob patterns. It:

  • expands patterns with glob.glob
  • preserves literal filenames when no match is found
  • removes duplicates while preserving original order

This gives users flexible shell-like matching without requiring shell expansion.
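The expansion and deduplication steps can be sketched with the standard library (function name and exact ordering are assumptions):

```python
import glob


def expand_inputs(patterns: list[str]) -> list[str]:
    """Expand glob patterns, keep unmatched literals, dedupe in order."""
    matched: list[str] = []
    for pattern in patterns:
        hits = sorted(glob.glob(pattern))
        # A pattern with no matches is kept verbatim so the failure
        # surfaces later at file-open time instead of silently vanishing.
        matched.extend(hits if hits else [pattern])
    # dict.fromkeys drops duplicates while preserving first-seen order
    return list(dict.fromkeys(matched))
```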

2. File Metadata Read

read_parquet_metadata(filename) opens the file with pyarrow.parquet.ParquetFile and reads only metadata, not table data. It also scans row groups and columns to collect the set of compression codecs used in the file.

This is the key performance design choice of the project: inspect structure, not payload.

3. File-Level Model Construction

inspect_single_file() converts raw PyArrow metadata into a ParquetMetaModel.

This creates a clean separation:

  • PyArrow objects remain external implementation details.
  • Pydantic models become the app's internal representation.

4. Column Information Assembly

An empty ParquetColumnInfo instance is created and then filled in stages:

  1. print_compression_types() creates the initial ColumnInfo entries.
  2. print_bloom_filter_info() updates those entries with bloom filter flags.
  3. print_min_max_statistics() updates them again with statistics and exactness.

This is effectively a builder-style pipeline:

  • step 1 creates the base objects
  • later steps enrich the same objects
  • the final aggregate is presentation-ready
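The staged pipeline can be sketched as follows. Plain dataclasses stand in for the Pydantic models, and the pass inputs are simplified placeholders:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    # Simplified stand-in for ColumnInfo
    row_group: int
    column_index: int
    compression: str
    has_bloom_filter: Optional[bool] = None
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def compression_pass(raw_chunks) -> list[Column]:
    # Pass 1: create the base records
    return [Column(rg, idx, codec) for rg, idx, codec in raw_chunks]


def bloom_filter_pass(columns: list[Column], flags: dict) -> None:
    # Pass 2: enrich existing records, located by (row group, column index)
    for col in columns:
        col.has_bloom_filter = flags.get((col.row_group, col.column_index), False)


def statistics_pass(columns: list[Column], stats: dict) -> None:
    # Pass 3: enrich the same records with min/max statistics
    for col in columns:
        key = (col.row_group, col.column_index)
        if key in stats:
            col.min_value, col.max_value = stats[key]
```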

5. Optional Filtering

If the user passes a column filter, the final list is narrowed to matching column names before rendering.
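The narrowing step amounts to a simple comprehension over the assembled records (shown here on plain dicts for brevity):

```python
def filter_columns(columns: list[dict], wanted: set[str]) -> list[dict]:
    # Keep only entries whose column name matches the user-supplied filter
    return [c for c in columns if c["column_name"] in wanted]
```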

6. Presentation

Two output modes are supported:

  • Rich mode: prints the metadata model, a formatted table, and codec summary
  • JSON mode: serializes metadata, columns, and compression_codecs

Column Metadata Builder Pattern

Although there is no formal builder class, iparq uses a practical builder pattern through staged mutation of ParquetColumnInfo.

empty ParquetColumnInfo
   |
   v
compression pass
   -> create ColumnInfo records
   |
   v
bloom filter pass
   -> update existing records
   |
   v
statistics pass
   -> update existing records
   |
   v
final output-ready model

Why this works well:

  • each pass has a single responsibility
  • PyArrow feature access stays localized
  • the final rendering code can assume a normalized structure

Tradeoff:

  • later passes search for existing entries by row group and column index, which is simple and readable but not optimized for very large metadata graphs

For this project size and use case, readability wins.

Presentation Layer

Rich Output

print_column_info_table() builds a Rich Table with:

  • row group
  • column name
  • column index
  • compression
  • bloom filter status
  • encryption status
  • min / max values
  • exactness indicator
  • optional size and compression ratio details

The table is optimized for interactive terminal use and quick human inspection.

JSON Output

output_json() converts the Pydantic models with model_dump() and prints a single JSON document.

This makes the CLI usable in scripts, automation, and downstream tooling.
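A sketch of the JSON path, using a minimal stand-in model (the real document keys and model fields may differ):

```python
import json

from pydantic import BaseModel


class Meta(BaseModel):
    # Illustrative subset of ParquetMetaModel
    created_by: str
    num_rows: int


def output_json(metadata: Meta, columns: list[dict], codecs: set[str]) -> str:
    # model_dump() converts the Pydantic model to plain dicts for json.dumps
    document = {
        "metadata": metadata.model_dump(),
        "columns": columns,
        "compression_codecs": sorted(codecs),
    }
    return json.dumps(document, indent=2)
```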

Error Handling Strategy

The project uses lightweight CLI-oriented error handling:

  • file open or metadata read failures are surfaced at the file-processing stage
  • metadata collection helpers catch exceptions and print styled error messages
  • multi-file inspection continues after individual file failures

This favors resilience and good CLI ergonomics over strict fail-fast behavior.
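The continue-on-failure loop can be sketched like this; the helper names and exception types are assumptions:

```python
def inspect_one(name: str) -> str:
    # Hypothetical per-file handler used by the sketch below
    if not name.endswith(".parquet"):
        raise ValueError(f"not a parquet file: {name}")
    return "ok"


def inspect_many(filenames: list[str]) -> dict:
    """Process each file; record failures instead of aborting the run."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    for name in filenames:
        try:
            results[name] = inspect_one(name)
        except (OSError, ValueError) as exc:
            # Surface the error for this file, then keep going
            errors[name] = str(exc)
    return {"results": results, "errors": errors}
```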

Design Decisions

Why a single-file implementation?

Benefits:

  • very low cognitive overhead
  • easy onboarding for contributors
  • straightforward debugging
  • minimal packaging complexity

Tradeoff:

  • as features grow, source.py becomes a mixed home for CLI, domain models, orchestration, and presentation

For the current scope, the simplicity is intentional and reasonable.

Why Pydantic?

Pydantic gives the project:

  • explicit schemas
  • typed fields
  • predictable serialization
  • a clean boundary between raw library objects and output contracts

That is especially useful for JSON output and for keeping the Rich and JSON paths aligned.

Why PyArrow metadata access only?

Reading metadata without loading table contents keeps the tool:

  • fast
  • memory-efficient
  • safe for large Parquet files

This matches the core mission of iparq: inspection, not data processing.

Why a staged enrichment pipeline?

The column metadata is not assembled in one large function. Instead, each pass adds one concern:

  • storage/compression
  • bloom filter signals
  • min/max statistics

This keeps feature growth manageable and reduces cross-coupling.

Testing and Quality Gates

The test suite lives under tests/ and focuses on CLI behavior and output. Supporting quality tooling is defined in pyproject.toml and exercised in GitHub Actions workflows.

The repository includes automation for:

  • linting
  • type checking
  • tests
  • coverage
  • merge validation
  • package publishing

This gives the project a small runtime surface area with a comparatively strong delivery pipeline around it.

Future Refactoring Boundaries

If the project grows, the current architecture naturally splits into these modules:

  • cli.py - Typer commands and argument parsing
  • models.py - Pydantic models and enums
  • metadata.py - PyArrow access and metadata extraction
  • renderers.py - Rich and JSON output

The present architecture keeps those seams implicit, which is acceptable for the current size.

Summary

iparq is a deliberately compact CLI with a simple architecture:

  • one main module
  • one command-oriented execution flow
  • typed output models
  • staged metadata enrichment
  • dual human/machine output modes

Its design prioritizes clarity, low overhead, and fast metadata inspection over abstraction-heavy structure.

Back to Home.