M05C06: Pydantic v2 as Smart Constructors – Runtime Enforcement Without Losing ADTs

Progression Note

By the end of Module 5, you will model every domain concept as immutable algebraic data types (products and tagged sums), eliminating whole classes of runtime errors through exhaustive pattern matching, mypy-checked totality, and pure serialization contracts.

| Module | Focus | Key Outcomes |
|---|---|---|
| 4 | Safe Recursion & Error Handling | Stack-safe tree recursion, folds, Result/Option, streaming validation/retries |
| 5 | Advanced Type-Driven Design | ADTs, exhaustive pattern matching, total functions, refined types |
| 6 | Monadic Flows as Composable Pipelines | bind/and_then, Reader/State-like patterns, error-typed flows |

Core question
How do you use Pydantic v2 only at the edges as smart constructors — enforcing runtime invariants, providing stable serialization, and computing derived fields — while keeping the core domain as plain frozen dataclasses for maximum performance and purity?

Every production system eventually discovers the same painful truth:

“Our ‘simple’ JSON → dataclass pipeline silently accepts garbage data, crashes deep inside the embedding stage, and produces unversioned, unstable serialization that breaks every deployment.”

The naïve pattern everyone writes first:

# BEFORE – raw dict → core dataclass, silent garbage
chunk = make_chunk(**raw_json)   # accepts missing fields, wrong types, NaN embedding
serialized = json.dumps(asdict(chunk))   # order-unstable, no version, no validation on read

Garbage in, explosion later.
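
To make the failure mode concrete, here is a minimal sketch using a hypothetical stand-in `Chunk` dataclass. The real `make_chunk` signature is not shown in this section, so the fields below are assumptions. Dataclasses do not check annotations at runtime, so a wrong type and a NaN pass straight through construction and only fail later.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the core chunk type (fields are assumed)."""
    text: str
    embedding: tuple[float, ...] = ()


# Simulated payload after a careless decode: wrong type plus a NaN.
raw_json = {"text": None, "embedding": (float("nan"), 0.5)}

# Annotations are not enforced at runtime, so construction succeeds.
chunk = Chunk(**raw_json)

# The damage only surfaces later, far from the ingestion site.
has_nan = any(math.isnan(v) for v in chunk.embedding)

try:
    chunk.text.lower()  # the "explosion later"
    failure = None
except AttributeError as exc:
    failure = type(exc).__name__
```

Nothing in this snippet raises at construction time, which is exactly the problem the edge model is meant to solve.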

The production pattern: use Pydantic only at the edges (ingress/egress, config loading) to validate, compute derived fields, and serialize stably — then immediately bridge to pure frozen core ADTs for the rest of the pipeline.

# AFTER – safe at edge, pure in core
validated = ChunkModel.model_validate(raw_json)   # clear ValidationError early
core_chunk = to_core_chunk(validated)             # → frozen dataclass, zero runtime cost inside pipeline
serialized = validated.model_dump_json(by_alias=True)  # stable, versioned, reproducible

Validation happens once at the boundary. Core stays fast, pure, and type-checked.

Audience: Engineers who have ever debugged “why did this field become None?” hours after ingestion and want bulletproof I/O with zero runtime cost in hot paths.

Outcome:

  1. Every raw JSON/dict → validated Pydantic model → core frozen ADT.
  2. Runtime invariants enforced exactly once at the edge.
  3. Stable, versioned, round-trippable serialization forever.

Tiny Non-Domain Example – Production Config Loading

from pydantic import BaseModel, ConfigDict, Field, computed_field, model_validator

class ProdConfigModel(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True, populate_by_name=True)

    port: int = Field(ge=1, le=65535)
    host: str = Field(pattern=r"^[a-z0-9.-]+$")
    timeout_ms: int = Field(gt=0)
    debug: bool = False

    @model_validator(mode="after")
    def _no_localhost(self) -> "ProdConfigModel":
        if self.host in {"localhost", "127.0.0.1", "::1"}:
            raise ValueError("localhost disallowed in prod")
        return self

    @computed_field
    @property  # explicit @property keeps type checkers happy
    def timeout_seconds(self) -> float:
        return self.timeout_ms / 1000.0

# Usage
config_model = ProdConfigModel.model_validate(raw_dict)   # raises clear error if bad
core_config = CoreConfig(
    port=config_model.port,
    host=config_model.host,
    timeout_seconds=config_model.timeout_seconds,
    debug=config_model.debug,
)

All checks live in one place, the derived field comes for free, and the core stays a plain frozen dataclass.
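
The following sketch shows what "raises clear error if bad" looks like in practice. `CoreConfig` is not defined in this section, so the frozen dataclass below is an assumed shape, and `load_config` is a hypothetical helper name. The model itself mirrors the one above.

```python
from dataclasses import dataclass
from typing import Any

from pydantic import (
    BaseModel, ConfigDict, Field, ValidationError, computed_field, model_validator,
)


@dataclass(frozen=True)
class CoreConfig:
    """Assumed core config shape; the section does not show its definition."""
    port: int
    host: str
    timeout_seconds: float
    debug: bool


class ProdConfigModel(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True, populate_by_name=True)

    port: int = Field(ge=1, le=65535)
    host: str = Field(pattern=r"^[a-z0-9.-]+$")
    timeout_ms: int = Field(gt=0)
    debug: bool = False

    @model_validator(mode="after")
    def _no_localhost(self) -> "ProdConfigModel":
        if self.host in {"localhost", "127.0.0.1", "::1"}:
            raise ValueError("localhost disallowed in prod")
        return self

    @computed_field
    @property
    def timeout_seconds(self) -> float:
        return self.timeout_ms / 1000.0


def load_config(raw: dict[str, Any]) -> CoreConfig:
    """Hypothetical helper: validate at the edge, then bridge to the core type."""
    m = ProdConfigModel.model_validate(raw)
    return CoreConfig(m.port, m.host, m.timeout_seconds, m.debug)


good = load_config({"port": 8080, "host": "api.example.com", "timeout_ms": 2500})

try:
    load_config({"port": 70000, "host": "api.example.com", "timeout_ms": 0})
    problems = []
except ValidationError as exc:
    # Each error dict carries the field location and an error type.
    problems = [(err["loc"], err["type"]) for err in exc.errors()]

try:
    load_config({"port": 8080, "host": "localhost", "timeout_ms": 100})
    localhost_rejected = False
except ValidationError:
    localhost_rejected = True
```

Field-level failures are all reported together in one `ValidationError`. The `mode="after"` validator only runs once every field has passed, so cross-field rules never see half-valid data.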

Why Pydantic at the Edges Only? (Three bullets every engineer should internalise)

  • Runtime enforcement: model_validator, field constraints, discriminated unions → illegal states are rejected at the boundary.
  • Stable serialization: model_dump_json(by_alias=True) + discriminators → fields emitted in declaration order, tagged with a version, reproducible across runs.
  • Zero cost in core: Validate once at edge → convert to frozen dataclass → full speed + mypy totality inside pipeline.

Pydantic is only for I/O and config. Core domain stays pure frozen dataclasses.
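
As a small illustration of what "plain frozen dataclasses" means for the core, the sketch below uses a hypothetical `Chunk` with assumed fields. It relies only on the standard library: updates produce new values through `dataclasses.replace`, and in-place mutation fails.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Chunk:
    """Hypothetical core type; the real fields live in funcpipe_rag.fp.core."""
    text: str
    path: tuple[int, ...] = ()


def with_text(chunk: Chunk, text: str) -> Chunk:
    """Return an updated copy; the original is left untouched."""
    return replace(chunk, text=text)


original = Chunk(text="hello")
updated = with_text(original, "hello world")

try:
    original.text = "mutated"  # type checkers flag this; at runtime it raises
    mutated = True
except FrozenInstanceError:
    mutated = False
```

No validation logic appears here on purpose: by the time a value reaches the core it has already passed through the edge model.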

1. Laws & Invariants (machine-checked)

| Invariant | Description | Enforcement |
|---|---|---|
| Construction Invariant | Invalid input raises ValidationError early | Pydantic validation + tests |
| Round-Trip | deserialize_model(serialize_model(x)) == x (with exclude_unset=True and defaults) | Hypothesis property tests |
| Schema Stability | Schema changes explicit and reviewed | Snapshot tests |
| Discriminator Uniqueness | No ambiguous union parsing | Pydantic + tests |
| Computed Field Purity | Derived fields deterministic, no side effects | Reproducibility tests |
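
The Round-Trip row depends on how `exclude_unset=True` treats defaults. The sketch below uses a hypothetical `NoteModel`, not part of the module's API. A defaulted `version` field is only written to the payload when it was set explicitly, yet both payloads reload to an equal model.

```python
import json
from typing import Literal

from pydantic import BaseModel, ConfigDict, Field


class NoteModel(BaseModel):
    """Hypothetical model used only to illustrate the round-trip law."""
    model_config = ConfigDict(frozen=True, extra="forbid")

    version: Literal[1] = 1
    text: str = Field(min_length=1)


implicit = NoteModel(text="hi")             # version left at its default
explicit = NoteModel(version=1, text="hi")  # version set explicitly

implicit_json = implicit.model_dump_json(exclude_unset=True)
explicit_json = explicit.model_dump_json(exclude_unset=True)

# Decode the payloads to inspect which keys actually went over the wire.
implicit_payload = json.loads(implicit_json)
explicit_payload = json.loads(explicit_json)

# Both payloads reload to models equal to the originals.
reloaded_implicit = NoteModel.model_validate_json(implicit_json)
reloaded_explicit = NoteModel.model_validate_json(explicit_json)
```

If the payload must always carry its version tag, either set the field explicitly when building the model or drop `exclude_unset=True` for that serializer.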

2. Decision Table – Where to Use Pydantic

| Location | Need runtime validation? | Need stable serde? | Use Pydantic? |
|---|---|---|---|
| Ingress (JSON → domain) | Yes | Yes | Yes |
| Core pipeline | No (already validated) | No | No |
| Egress (domain → JSON) | No | Yes | Yes |
| Config loading | Yes | Yes | Yes |
| Hot loops | No | No | No |

3. Public API (boundaries/pydantic_edges.py – mypy --strict clean)

from __future__ import annotations

from typing import Any, Dict, List, Literal, TypeVar
from pydantic import BaseModel, Field, ConfigDict, model_validator, computed_field, TypeAdapter
import math

from funcpipe_rag.fp.core import Chunk, make_chunk  # plain frozen core

__all__ = [
    "ChunkModel",
    "to_core_chunk",
    "from_core_chunk",
    "serialize_model",
    "deserialize_model",
]

T = TypeVar("T")

StrictConfig = ConfigDict(
    strict=True,
    frozen=True,
    extra="forbid",
    populate_by_name=True,
)

class ChunkModel(BaseModel):
    model_config = StrictConfig

    version: Literal[1] = 1
    text: str = Field(min_length=1, max_length=200_000)
    metadata: Dict[str, Any] = Field(default_factory=dict)
    embedding: List[float] | None = None

    @model_validator(mode="after")
    def _validate_embedding(self) -> "ChunkModel":
        if self.embedding is None:
            return self
        if not self.embedding:
            raise ValueError("embedding must be non-empty if present")
        if len(self.embedding) > 8192:
            raise ValueError("embedding too long")
        for i, v in enumerate(self.embedding):
            if not math.isfinite(v):
                raise ValueError(f"embedding[{i}] not finite")
            if abs(v) > 100.0:
                raise ValueError(f"embedding[{i}] out of reasonable range")
        return self

    @computed_field
    @property  # explicit @property so mypy sees an attribute, not a method
    def length(self) -> int:
        return len(self.text)

def to_core_chunk(model: ChunkModel) -> Chunk:
    return make_chunk(
        text=model.text,
        path=(),
        metadata=model.metadata,
    )

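def from_core_chunk(core: Chunk) -> ChunkModel:
    # Set version explicitly: serialize_model uses exclude_unset=True,
    # which would otherwise omit the defaulted version tag from the JSON.
    return ChunkModel(
        version=1,
        text=core.text,
        metadata=core.metadata,
    )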

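def serialize_model(model: BaseModel) -> str:
    # round_trip=True leaves computed fields such as `length` out of the payload,
    # so a model with extra="forbid" can reload its own output.
    return model.model_dump_json(by_alias=True, exclude_unset=True, round_trip=True)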

def deserialize_model(json_str: str, typ: type[T]) -> T:
    return TypeAdapter(typ).validate_json(json_str)

3.1 Pattern: Discriminated Unions for Core ADTs (e.g. Result)

from typing import Annotated, Generic, Literal, TypeAlias, TypeVar, Union
from pydantic import BaseModel, ConfigDict, Field

StrictConfig = ConfigDict(
    strict=True,
    frozen=True,
    extra="forbid",
    populate_by_name=True,
)

T = TypeVar("T")

class ErrInfoModel(BaseModel):
    model_config = StrictConfig
    code: str
    msg: str

class OkModel(BaseModel, Generic[T]):
    model_config = StrictConfig
    kind: Literal["ok"] = "ok"
    value: T

class ErrModel(BaseModel):
    model_config = StrictConfig
    kind: Literal["err"] = "err"
    error: ErrInfoModel

ResultModel: TypeAlias = Annotated[Union[OkModel[T], ErrModel], Field(discriminator="kind")]
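
A short usage sketch for the pattern above: bind `T` to a concrete type, build one `TypeAdapter`, and let the `kind` tag select the variant. The names `IntResult`, `int_result_adapter`, and `EdgeConfig` are illustrative assumptions. The sketch omits `strict=True` to stay small and parses JSON text, as an ingress edge would.

```python
from typing import Annotated, Generic, Literal, TypeVar, Union

from pydantic import BaseModel, ConfigDict, Field, TypeAdapter, ValidationError

T = TypeVar("T")
EdgeConfig = ConfigDict(frozen=True, extra="forbid")  # illustrative config


class ErrInfoModel(BaseModel):
    model_config = EdgeConfig
    code: str
    msg: str


class OkModel(BaseModel, Generic[T]):
    model_config = EdgeConfig
    kind: Literal["ok"] = "ok"
    value: T


class ErrModel(BaseModel):
    model_config = EdgeConfig
    kind: Literal["err"] = "err"
    error: ErrInfoModel


# Bind T to int and build one reusable adapter for "Result of int".
IntResult = Annotated[Union[OkModel[int], ErrModel], Field(discriminator="kind")]
int_result_adapter = TypeAdapter(IntResult)

ok = int_result_adapter.validate_json('{"kind": "ok", "value": 3}')
err = int_result_adapter.validate_json(
    '{"kind": "err", "error": {"code": "E42", "msg": "boom"}}'
)

try:
    int_result_adapter.validate_json('{"kind": "maybe", "value": 3}')
    unknown_tag_error = None
except ValidationError as exc:
    # An unrecognised tag is reported as a single union-level error.
    unknown_tag_error = exc.errors()[0]["type"]
```

Because the tag picks exactly one variant, a payload is never tried against every member of the union in turn, which is what keeps parsing unambiguous.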

4. Reference Implementations

4.1 Before vs After – Chunk Ingestion

# BEFORE – raw dict → core dataclass, silent garbage
chunk = make_chunk(**raw_json)   # accepts negative length, NaN embedding, etc.

# AFTER – validated at edge, safe core
validated = ChunkModel.model_validate(raw_json)   # clear ValidationError if bad
core_chunk = to_core_chunk(validated)             # → pure frozen dataclass

4.2 RAG Integration – Safe Ingestion Pipeline

def ingest_raw_chunk(raw: dict[str, Any]) -> Chunk:
    validated = ChunkModel.model_validate(raw)
    return to_core_chunk(validated)

def persist_chunk(core: Chunk) -> str:
    model = from_core_chunk(core)
    return serialize_model(model)   # stable, versioned JSON
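
When the pipeline prefers error values over exceptions, as with the Result style from Module 4, the edge can translate `ValidationError` into data. The sketch below is one possible shape, not the module's API. `Rejected` and `try_ingest` are hypothetical names, and `Chunk` and `ChunkModel` are trimmed stand-ins.

```python
from dataclasses import dataclass
from typing import Any, Union

from pydantic import BaseModel, ConfigDict, Field, ValidationError


@dataclass(frozen=True)
class Chunk:
    """Trimmed stand-in for funcpipe_rag.fp.core.Chunk."""
    text: str


@dataclass(frozen=True)
class Rejected:
    """Hypothetical error value: one readable line per failed field."""
    reasons: tuple[str, ...]


class ChunkModel(BaseModel):
    """Trimmed edge model: only the text field is kept for this sketch."""
    model_config = ConfigDict(frozen=True, extra="forbid")
    text: str = Field(min_length=1, max_length=200_000)


def try_ingest(raw: dict[str, Any]) -> Union[Chunk, Rejected]:
    """Validate at the edge and return a value instead of raising."""
    try:
        validated = ChunkModel.model_validate(raw)
    except ValidationError as exc:
        reasons = tuple(
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        )
        return Rejected(reasons)
    return Chunk(text=validated.text)


accepted = try_ingest({"text": "hello"})
rejected = try_ingest({"text": "", "surprise": 1})
```

The second call fails twice, once for the empty `text` and once for the unexpected `surprise` key, and both reasons are reported together.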

5. Property-Based Proofs (tests/test_pydantic_edges.py)

import math
import pytest
from hypothesis import given, strategies as st
from funcpipe_rag.boundaries.pydantic_edges import ChunkModel, serialize_model, deserialize_model

nonfinite = st.sampled_from([float("nan"), float("inf"), float("-inf")])

@given(text=st.text(min_size=1, max_size=1000),
       metadata=st.dictionaries(st.text(), st.integers() | st.text()))
def test_chunk_roundtrip(text, metadata):
    model = ChunkModel(text=text, metadata=metadata)
    json_str = serialize_model(model)
    reloaded = deserialize_model(json_str, ChunkModel)
    assert model == reloaded
    assert reloaded.length == len(text)

@given(bad_emb=st.lists(nonfinite, min_size=1))
def test_nonfinite_embedding_rejected(bad_emb):
    with pytest.raises(ValueError):
        ChunkModel(text="x", embedding=bad_emb)

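@given(emb=st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1, max_size=64))
def test_large_embedding_rejected_if_out_of_range(emb):
    if any(abs(x) > 100 for x in emb):
        with pytest.raises(ValueError):
            ChunkModel(text="x", embedding=emb)
    else:
        # In-range embeddings must be accepted unchanged, so the test never passes vacuously.
        assert ChunkModel(text="x", embedding=emb).embedding == emb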

# `snapshot` is a fixture supplied by a pytest snapshot plugin (for example syrupy).
def test_schema_stable(snapshot):
    assert ChunkModel.model_json_schema() == snapshot
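
The idea behind the schema snapshot test can be sketched without any plugin. The models and the `schema_fingerprint` helper below are hypothetical. In a real suite the snapshot text would be read from a committed file instead of being computed in the same run.

```python
import json

from pydantic import BaseModel, ConfigDict, Field


class ChunkModelV1(BaseModel):
    # Same title on both models so only real field changes show up in the diff.
    model_config = ConfigDict(title="ChunkModel")
    text: str = Field(min_length=1)


class ChunkModelDrifted(BaseModel):
    model_config = ConfigDict(title="ChunkModel")
    text: str = Field(min_length=1)
    source: str = "unknown"  # an unreviewed addition


def schema_fingerprint(model: type[BaseModel]) -> str:
    """Canonical JSON text of the model's JSON Schema, suitable for diffing."""
    return json.dumps(model.model_json_schema(), sort_keys=True, indent=2)


# Stand-in for the committed snapshot file.
snapshot = schema_fingerprint(ChunkModelV1)

unchanged = schema_fingerprint(ChunkModelV1) == snapshot
drifted = schema_fingerprint(ChunkModelDrifted) == snapshot
```

Any change to fields, constraints, or defaults alters the fingerprint, so a schema change has to arrive together with a reviewed snapshot update.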

6. Big-O & Allocation Guarantees

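| Operation | Time | Heap | Notes |
|---|---|---|---|
| Validation | O(#fields + len(embedding)) | O(#fields + len(embedding)) | Once at edge; the embedding check visits every element |
| Serialization | O(#fields + len(embedding)) | O(#fields + len(embedding)) | Fields emitted in declaration order |
| computed_field | O(1) or O(N) | O(1) | Recomputed on access; keep pure & fast |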

7. Anti-Patterns & Immediate Fixes

| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Raw **kwargs → dataclass | Silent invalid states | Pydantic model_validate at edge |
| Manual JSON serde | Unstable order, no versioning | model_dump_json / validate_json |
| Pydantic in hot path | 10–100× slowdown | Validate once → convert to frozen core |
| Missing discriminator | Union parse ambiguity | Annotated[Union[...], Field(discriminator="kind")] |
| Mutable models in core | Accidental mutation | frozen=True + extra="forbid" |

8. Pre-Core Quiz

  1. Pydantic at edges for…? → Runtime validation + stable serde
  2. model_validator(mode="after") for…? → Cross-field checks
  3. Discriminated unions use…? → kind tag
  4. computed_field gives…? → Pure derived properties
  5. Core stays…? → Plain frozen dataclasses

9. Post-Core Exercise

  1. Wrap one core ADT in a Pydantic model → add model_validator + computed_field.
  2. Add discriminated union for a sum type → test parsing.
  3. Replace one raw JSON → dataclass with Pydantic edge + bridge.
  4. Add schema snapshot test for a model → verify stability.

Next: M05C07 – Pattern Matching in Python 3.10+ for ADTs.

You now have bulletproof I/O: every external payload is validated exactly once at the edge, serialized stably forever, and the core pipeline runs at full speed on pure frozen ADTs. The rest of Module 5 adds pattern matching for orchestration and final serialization contracts.