Data Access Operations - jbrucker/home-log GitHub Wiki

  1. Should DAO operations return model objects or schema objects?

    • Design Decision (June 2025): DAO return SqlAlchemy model instances, not schema instances.
  2. Flexible get_entity_by method for DAO

    • How to write a single get_entities_by(session, filters, conditions) that accepts a variety of filters and conditions, instead of specific methods like get_entities_by_name, get_entities_by_owner, ...

Should DAO operations return model objects or schema objects?

Design Decision (June 2025): DAO return SqlAlchemy model instances, not schema instances.

Rationale:

  • Preserve Whole Object design principle
  • Separation of Concerns. DAO is concerned with retrieving data as models, not in how its presented (schema).
  • Other reasons explained below.

Consider these CRUD operations for User objects:

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from pydantic import EmailStr
from datetime import datetime, timezone
from . import schemas
from .models import User

async def get_user(session: AsyncSession, user_id: int) -> schemas.User | None:
    """Get a user by ID and return as Pydantic schema."""
    stmt = select(User).where(User.id == user_id)
    result = await session.execute(stmt)
    if user := result.scalar_one_or_none():
        return schemas.User.model_validate(user)
    return None

async def create_user(session: AsyncSession, user_data: schemas.UserCreate) -> schemas.User:
    """Create a new user from Pydantic schema."""
    user = User(
        email=user_data.email,
        username=user_data.username
    )
    session.add(user)
    await session.commit()
    await session.refresh(user)
    return schemas.User.model_validate(user)

Each of these CRUD operations return a Pydantic schema object instead of ORM model object.

Why?


According to Deepseek the rationale are:

1. Separation of Concerns

  • SQLAlchemy Models represent your database structure (data layer)
  • Pydantic Schemas represent your API contracts (presentation layer)

By returning schemas, you enforce a clean separation between how data is stored and how it's presented.

My View:

  • A better separation of concerns is for the REST API endpoints to be concerned with presentation, DAO to be concerned with data (model) persistence.

2. Security & Data Filtering

SQLAlchemy models contain all fields, including sensitive ones you might not want to expose:

# In your SQLAlchemy model
class User(Base):
    __tablename__ = "users"
    password_hash: Mapped[str] = mapped_column(String(255))  # Should NEVER be returned via API!

Pydantic schemas let you control exactly what gets exposed:

class UserPublic(BaseModel):
    id: int
    email: EmailStr
    username: Optional[str]
    # No password_hash here!

My View:

  • This is unnecessary. A REST endpoint serializes the result itself (typically to JSON), or converts it to a schema object and lets FastAPI do automatic serialization.
  • The serialization methods (provided by Pydantic) accept a SqlAlchemy model object as parameter.
  • Returning a model object from the DAO gives the REST endpoint a chance to catch validation errors itself, instead of doing it at the data access layer.

3. Serialization Safety

SQLAlchemy objects:

  • Contain circular relationships (e.g., user.posts[0].author)
  • Have lazy-loaded attributes that may trigger unexpected DB queries (wrong: they trigger exceptions)
  • Include SQLAlchemy-specific metadata that shouldn't be serialized

Pydantic schemas:

  • Provide predictable serialization
  • Explicitly define what can be serialized
  • Prevent accidental data leaks or N+1 query problems

My View:

  • If you need eager instantiation you should do it in the context of a session, that is, in the DAO.
  • If the session is closed before lazy attributes are instantiated it can cause an exception, esp. using async code.
  • The application layer would use Pydantic to perform serialization anyway, so there is no risk of leaking metadata or sensitive data.

4. Validation Guarantees

When you return a Pydantic schema:

@router.get("/users/{id}", response_model=schemas.User)
async def get_user(id: int):
    # The response will be automatically validated against the schema
    return await crud.get_user(id)

You get automatic validation that:

  • All required fields are present
  • Data types match what your API promises
  • No extra fields slip through

My View:

  • The application layer (FastAPI function) can perform serialization and validation itself. It is a better place to catch validation/serialization errors.
  • A validation error in converting model -> schema indicates the Pydantic schema is inconsistent with the database constraints (given in model definition). A programming error.

5. Versioning & API Stability

If you need to change your database structure:

  1. Your SQLAlchemy models can evolve independently
  2. Your API contracts (Pydantic schemas) remain stable
  3. You can support multiple schema versions simultaneously

My View:

  • #1 seems like a good feature to have
  • #3 seems like over future-proofing for most applications (YAGNI)

6. Performance Optimization

Pydantic schemas let you:

  • Compute derived fields (e.g., full_name = f"{first} {last}")
  • Apply different serialization rules (e.g., datetime formatting)
  • Exclude computationally expensive relationships when not needed

When Might You Return SQLAlchemy Models Directly?

There are rare cases where bypassing schemas makes sense:

  1. Internal Microservice Communication (when caller needs full ORM capabilities)
  2. Admin Interfaces (where you need access to everything)
  3. Performance-Critical Bulk Operations (though still risky)
  4. Want lazy instantiation of related objects, esp. 1-to-many relations.

Best Practice Implementation

The cleanest pattern is:

  1. Accept Pydantic schemas as input (validation)
  2. Work with SQLAlchemy models internally (business logic)
  3. Return Pydantic schemas as output (serialization)

This approach gives you strong guarantees about data integrity while maintaining clean layer separation.

# In your CRUD layer
async def create_user(db: AsyncSession, user_data: schemas.UserCreate) -> schemas.User:
    # Input validated by UserCreate schema
    db_user = User(**user_data.model_dump())  # Convert to SQLAlchemy
    db.add(db_user)
    await db.commit()
    return schemas.User.model_validate(db_user)  # Convert back to schema

References

These references that support the practice of returning Pydantic schemas instead of SQLAlchemy models in Python web applications:

1. Official FastAPI Documentation

  • "Separate Data Schemas (Pydantic) from DB Models"
    FastAPI explicitly recommends this separation for security and compatibility:
    FastAPI SQL (Relational) Databases Guide

    "By using Pydantic models/schemas, you can define the exact data structure you want to receive and return, independently of your database structure."

Not!

  • In this reference, the example code returns a database table model! The example uses SQLModel instead of SqlAlchemy, where a model class can be both a database table model and a Pydantic data model.

The example for a FastAPI endpoint uses a database model throughout the function. It then uses the FastAPI response_model= field on the decorator to automatically convert to a schema model on return. In this code Hero is the database model and contains all fields (including private data), HeroPublic is a schema model.

@app.get("/heroes/{hero_id}", response_model=HeroPublic)
def read_hero(hero_id: int, session: SessionDep):
    hero = session.get(Hero, hero_id)
    if not hero:
        raise HTTPException(status_code=404, detail="Hero not found")
    return hero

2. Pydantic Documentation

  • "ORM Mode (From_ORM)" โ†’ Now model_validate()
    Pydantic v2's official guidance on converting ORM objects:
    Pydantic ORM Mode Documentation

    "Used to create models from arbitrary class instances by reading instance attributes. Designed for compatibility with SQLAlchemy, Django ORM, etc."

3. SQLAlchemy Best Practices

  • "Donโ€™t expose models directly through APIs"
    From SQLAlchemy's creator in Essential SQLAlchemy (O'Reilly):

    "Your database models should be considered an implementation detail. Always transform them into API-specific schemas before serialization."

4. Microsoft REST API Guidelines

  • "Entity vs. API Resource Separation"
    Microsoft API Design Guidelines

    "The database representation (entity) and API resource representation should be decoupled to allow independent evolution."

5. Clean Architecture (Robert C. Martin)

  • "Interface Adapters Layer"
    The Clean Architecture principle that inspired this pattern:

    "Database models belong in the Enterprise Business Rules layer, while API schemas are Interface Adapters that translate between external agencies and internal systems."

6. Real-World Examples

  • FastAPI's Official SQL Tutorial
    Uses separate Pydantic schemas for all operations:
    FastAPI SQL Tutorial

    Not! This is the same as reference #1.

  • Django REST Framework's Serializers
    Analogous concept in Django:
    DRF Serializers Docs

    "Serializers allow complex data such as querysets and model instances to be converted to native Python datatypes that can then be easily rendered into JSON."

Key Technical Reasons Cited

  1. Security: OWASP recommends this pattern to prevent Mass Assignment Vulnerabilities.
    • My design uses Schema instances for data input parameters to DAO.
  2. Forward Compatibility: Google API Design Guide advises decoupling internal storage from external representations.
  3. Performance: Pydantic's compiled validation is faster than ad-hoc ORM serialization (benchmarked in Pydantic v2 vs. v1).

Flexible get_entity_by method for DAO

Using DataSources (model) as an example:

from sqlachemy import select, and_
from app.models import DataSource

async def get_data_sources_by(session: AsyncSession, *conditions, **filters) -> list[DataSource]:
    """Get all entities matching the specified conditions and filters.

       :param conditions: zero or more SqlAlchemy conditions 
       :param filters: named assignments of required attribute values, e.g. `owner_id=12`.
                       The name must by an attribute of models.DataSource
    """
    stmt = select(DataSource)
    all_conditions = list(conditions)
    if filters:
        all_conditions += [getattr(DataSource, k) == v for k, v in filters.items()]
    if all_conditions:
        stmt = stmt.where(and_(*all_conditions))
    result = await session.execute(stmt)
    return result.scalars().all()

Example usage:

from datetime import datetime, timezone
from app import models

# Find by owner and name contains some text
await get_data_sources_by(
    session,
    models.DataSource.owner_id == 11,
    models.DataSource.name.contains("Electric")
)

# Find by created_at <= some datetime
await get_data_sources_by(
    session,
    models.DataSource.created_at <= datetime(2024, 1, 1, tzinfo=timezone.utc)
)