M05C06: Pydantic v2 as Smart Constructors – Runtime Enforcement Without Losing ADTs
Progression Note
By the end of Module 5, you will model every domain concept as immutable algebraic data types (products and tagged sums), eliminating whole classes of runtime errors through exhaustive pattern matching, mypy-checked totality, and pure serialization contracts.
| Module | Focus | Key Outcomes |
|---|---|---|
| 4 | Safe Recursion & Error Handling | Stack-safe tree recursion, folds, Result/Option, streaming validation/retries |
| 5 | Advanced Type-Driven Design | ADTs, exhaustive pattern matching, total functions, refined types |
| 6 | Monadic Flows as Composable Pipelines | bind/and_then, Reader/State-like patterns, error-typed flows |
Core question
How do you use Pydantic v2 only at the edges as smart constructors — enforcing runtime invariants, providing stable serialization, and computing derived fields — while keeping the core domain as plain frozen dataclasses for maximum performance and purity?
Every production system eventually discovers the same painful truth:
“Our ‘simple’ JSON → dataclass pipeline silently accepts garbage data, crashes deep inside the embedding stage, and produces unversioned, unstable serialization that breaks every deployment.”
The naïve pattern everyone writes first:
# BEFORE – raw dict → core dataclass, silent garbage
chunk = make_chunk(**raw_json) # accepts missing fields, wrong types, NaN embedding
serialized = json.dumps(asdict(chunk)) # order-unstable, no version, no validation on read
Garbage in, explosion later.
The production pattern: use Pydantic only at the edges (ingress/egress, config loading) to validate, compute derived fields, and serialize stably — then immediately bridge to pure frozen core ADTs for the rest of the pipeline.
# AFTER – safe at edge, pure in core
validated = ChunkModel.model_validate(raw_json) # clear ValidationError early
core_chunk = to_core_chunk(validated) # → frozen dataclass, zero runtime cost inside pipeline
serialized = validated.model_dump_json(by_alias=True) # stable, versioned, reproducible
Validation happens once at the boundary. Core stays fast, pure, and type-checked.
Audience: Engineers who have ever debugged “why did this field become None?” hours after ingestion and want bulletproof I/O with zero runtime cost in hot paths.
Outcome:
1. Every raw JSON/dict → validated Pydantic model → core frozen ADT.
2. Runtime invariants enforced exactly once at the edge.
3. Stable, versioned, round-trippable serialization forever.
Tiny Non-Domain Example – Production Config Loading
from pydantic import BaseModel, ConfigDict, Field, computed_field, model_validator

class ProdConfigModel(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True, populate_by_name=True)
    port: int = Field(ge=1, le=65535)
    host: str = Field(pattern=r"^[a-z0-9.-]+$")
    timeout_ms: int = Field(gt=0)
    debug: bool = False

    @model_validator(mode="after")
    def _no_localhost(self) -> "ProdConfigModel":
        # Whole-model check; runs only after every field has validated.
        if self.host in {"localhost", "127.0.0.1", "::1"}:
            raise ValueError("localhost disallowed in prod")
        return self

    @computed_field
    @property  # explicit @property is recommended by the Pydantic docs for IDEs and type checkers
    def timeout_seconds(self) -> float:
        return self.timeout_ms / 1000.0

# Usage (raw_dict is the parsed config payload; CoreConfig is the plain frozen core dataclass)
config_model = ProdConfigModel.model_validate(raw_dict)  # raises clear error if bad
core_config = CoreConfig(
    port=config_model.port,
    host=config_model.host,
    timeout_seconds=config_model.timeout_seconds,
    debug=config_model.debug,
)
All checks in one place, derived field free, core stays plain frozen dataclass.
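The snippet above leaves `CoreConfig` and `raw_dict` undefined, so here is one way the pieces could fit together end to end. The `CoreConfig` dataclass and the `load_config` and `error_count` helpers are assumptions made for illustration, not part of the original module.

```python
from dataclasses import dataclass

from pydantic import (
    BaseModel, ConfigDict, Field, ValidationError, computed_field, model_validator,
)


@dataclass(frozen=True)
class CoreConfig:  # hypothetical core type: plain, frozen, no Pydantic
    port: int
    host: str
    timeout_seconds: float
    debug: bool


class ProdConfigModel(BaseModel):
    model_config = ConfigDict(extra="forbid", frozen=True, populate_by_name=True)
    port: int = Field(ge=1, le=65535)
    host: str = Field(pattern=r"^[a-z0-9.-]+$")
    timeout_ms: int = Field(gt=0)
    debug: bool = False

    @model_validator(mode="after")
    def _no_localhost(self) -> "ProdConfigModel":
        if self.host in {"localhost", "127.0.0.1", "::1"}:
            raise ValueError("localhost disallowed in prod")
        return self

    @computed_field
    @property
    def timeout_seconds(self) -> float:
        return self.timeout_ms / 1000.0


def load_config(raw: dict[str, object]) -> CoreConfig:
    """Smart constructor: validate once at the edge, return the plain core value."""
    m = ProdConfigModel.model_validate(raw)
    return CoreConfig(m.port, m.host, m.timeout_seconds, m.debug)


def error_count(raw: dict[str, object]) -> int:
    """Number of validation errors reported for a payload (0 if it is valid)."""
    try:
        load_config(raw)
    except ValidationError as exc:
        return exc.error_count()
    return 0


good = load_config({"port": 8080, "host": "api.example.com", "timeout_ms": 2500})
```

Field-level failures are collected and reported together, while the `mode="after"` validator only runs once every field has passed, so a payload with a bad port and a bad timeout reports two errors and never reaches the localhost check.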
Why Pydantic at the Edges Only? (Three bullets every engineer should internalise)
- Runtime enforcement: model_validator, field constraints, discriminated unions → illegal states impossible at boundary.
- Stable serialization: model_dump_json(by_alias=True) + discriminators → order-independent, versioned, reproducible JSON forever.
- Zero cost in core: Validate once at edge → convert to frozen dataclass → full speed + mypy totality inside pipeline.
Pydantic is only for I/O and config. Core domain stays pure frozen dataclasses.
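As a rough illustration of what "pure frozen dataclasses" buys inside the pipeline, the sketch below uses a hypothetical `CoreChunk` (the real `Chunk` lives in `funcpipe_rag.fp.core` and is not shown here). Transformations return new values through `dataclasses.replace`, and accidental mutation fails immediately.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class CoreChunk:  # hypothetical; stands in for the real core Chunk
    text: str
    path: tuple[int, ...] = ()


def normalize(chunk: CoreChunk) -> CoreChunk:
    """Pure transformation: collapse whitespace and return a new value."""
    return replace(chunk, text=" ".join(chunk.text.split()))


def try_mutate(chunk: CoreChunk) -> bool:
    """Return True if in-place assignment succeeded (it should not)."""
    try:
        chunk.text = "oops"  # type: ignore[misc]
    except FrozenInstanceError:
        return False
    return True


original = CoreChunk(text="  hello   world ")
cleaned = normalize(original)
```

No validation runs here at all: the core trusts that the edge already rejected bad input.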
1. Laws & Invariants (machine-checked)
| Invariant | Description | Enforcement |
|---|---|---|
| Construction Invariant | Invalid input raises ValidationError early | Pydantic validation + tests |
| Round-Trip | deserialize_model(serialize_model(x), type(x)) == x (with exclude_unset=True and defaults) | Hypothesis property tests |
| Schema Stability | Schema changes explicit and reviewed | Snapshot tests |
| Discriminator Uniqueness | No ambiguous union parsing | Pydantic + tests |
| Computed Field Purity | Derived fields deterministic, no side effects | Reproducibility tests |
2. Decision Table – Where to Use Pydantic
| Location | Need runtime validation? | Need stable serde? | Use Pydantic? |
|---|---|---|---|
| Ingress (JSON → domain) | Yes | Yes | Yes |
| Core pipeline | No (already validated) | No | No |
| Egress (domain → JSON) | No | Yes | Yes |
| Config loading | Yes | Yes | Yes |
| Hot loops | No | No | No |
3. Public API (boundaries/pydantic_edges.py – mypy --strict clean)
from __future__ import annotations

import math
from typing import Any, Dict, List, Literal, TypeVar

from pydantic import BaseModel, ConfigDict, Field, TypeAdapter, computed_field, model_validator

from funcpipe_rag.fp.core import Chunk, make_chunk  # plain frozen core

__all__ = [
    "ChunkModel",
    "to_core_chunk",
    "from_core_chunk",
    "serialize_model",
    "deserialize_model",
]

T = TypeVar("T")

StrictConfig = ConfigDict(
    strict=True,
    frozen=True,
    extra="forbid",
    populate_by_name=True,
)

class ChunkModel(BaseModel):
    model_config = StrictConfig
    version: Literal[1] = 1
    text: str = Field(min_length=1, max_length=200_000)
    metadata: Dict[str, Any] = Field(default_factory=dict)
    embedding: List[float] | None = None

    @model_validator(mode="after")
    def _validate_embedding(self) -> "ChunkModel":
        if self.embedding is None:
            return self
        if not self.embedding:
            raise ValueError("embedding must be non-empty if present")
        if len(self.embedding) > 8192:
            raise ValueError("embedding too long")
        for i, v in enumerate(self.embedding):
            if not math.isfinite(v):
                raise ValueError(f"embedding[{i}] not finite")
            if abs(v) > 100.0:
                raise ValueError(f"embedding[{i}] out of reasonable range")
        return self

    @computed_field
    @property  # explicit @property is recommended by the Pydantic docs for IDEs and type checkers
    def length(self) -> int:
        return len(self.text)

def to_core_chunk(model: ChunkModel) -> Chunk:
    # NOTE: the validated embedding is not forwarded to the core chunk here.
    return make_chunk(
        text=model.text,
        path=(),
        metadata=model.metadata,
    )

def from_core_chunk(core: Chunk) -> ChunkModel:
    return ChunkModel(
        text=core.text,
        metadata=core.metadata,
    )

def serialize_model(model: BaseModel) -> str:
    # NOTE: exclude_unset=True omits defaulted fields such as `version` unless they were set explicitly.
    # NOTE: computed fields (e.g. `length`) are included in the dump; with extra="forbid"
    # they are rejected on reload unless excluded from the dump.
    return model.model_dump_json(by_alias=True, exclude_unset=True)

def deserialize_model(json_str: str, typ: type[T]) -> T:
    return TypeAdapter(typ).validate_json(json_str)
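Two details of this API are easy to miss and directly affect the round-trip law: `exclude_unset=True` drops defaulted fields such as `version`, and computed fields are written into the dump but count as extra input when read back under `extra="forbid"`. The sketch below uses a hypothetical `NoteModel` with the same configuration to make both behaviours visible. Excluding the computed field by name is only one possible way to restore the round trip.

```python
import json
from typing import Literal

from pydantic import BaseModel, ConfigDict, ValidationError, computed_field


class NoteModel(BaseModel):  # hypothetical model mirroring ChunkModel's config
    model_config = ConfigDict(strict=True, frozen=True, extra="forbid")
    version: Literal[1] = 1
    text: str

    @computed_field
    @property
    def length(self) -> int:
        return len(self.text)


def reloads(json_str: str) -> bool:
    """True if the JSON validates back into a NoteModel."""
    try:
        NoteModel.model_validate_json(json_str)
    except ValidationError:
        return False
    return True


note = NoteModel(text="hello")

default_dump = json.loads(note.model_dump_json())
unset_dump = json.loads(note.model_dump_json(exclude_unset=True))

# The default dump contains "length", which extra="forbid" rejects on the way back in.
naive_ok = reloads(note.model_dump_json())

# Leaving the derived field out of the dump makes the payload loadable again.
explicit_json = note.model_dump_json(exclude={"length"})
explicit_ok = reloads(explicit_json)
```

If the derived field must appear in egress JSON, one option is a separate read model that does not forbid extras. If the version tag must always be written, avoid `exclude_unset=True` or set `version` explicitly.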
3.1 Pattern: Discriminated Unions for Core ADTs (e.g. Result)
from typing import Annotated, Generic, Literal, TypeAlias, TypeVar, Union

from pydantic import BaseModel, ConfigDict, Field

StrictConfig = ConfigDict(
    strict=True,
    frozen=True,
    extra="forbid",
    populate_by_name=True,
)

T = TypeVar("T")

class ErrInfoModel(BaseModel):
    model_config = StrictConfig
    code: str
    msg: str

class OkModel(BaseModel, Generic[T]):
    model_config = StrictConfig
    kind: Literal["ok"] = "ok"
    value: T

class ErrModel(BaseModel):
    model_config = StrictConfig
    kind: Literal["err"] = "err"
    error: ErrInfoModel

ResultModel: TypeAlias = Annotated[Union[OkModel[T], ErrModel], Field(discriminator="kind")]
4. Reference Implementations (continued)
4.1 Before vs After – Chunk Ingestion
# BEFORE – raw dict → core dataclass, silent garbage
chunk = make_chunk(**raw_json) # accepts negative length, NaN embedding, etc.
# AFTER – validated at edge, safe core
validated = ChunkModel.model_validate(raw_json) # clear ValidationError if bad
core_chunk = to_core_chunk(validated) # → pure frozen dataclass
4.2 RAG Integration – Safe Ingestion Pipeline
def ingest_raw_chunk(raw: dict[str, Any]) -> Chunk:
    validated = ChunkModel.model_validate(raw)
    return to_core_chunk(validated)

def persist_chunk(core: Chunk) -> str:
    model = from_core_chunk(core)
    return serialize_model(model)  # stable, versioned JSON
5. Property-Based Proofs (tests/test_pydantic_edges.py)
import math

import pytest
from hypothesis import given, strategies as st

from funcpipe_rag.boundaries.pydantic_edges import ChunkModel, serialize_model, deserialize_model

nonfinite = st.sampled_from([float("nan"), float("inf"), float("-inf")])

@given(text=st.text(min_size=1, max_size=1000),
       metadata=st.dictionaries(st.text(), st.integers() | st.text()))
def test_chunk_roundtrip(text, metadata):
    model = ChunkModel(text=text, metadata=metadata)
    json_str = serialize_model(model)
    reloaded = deserialize_model(json_str, ChunkModel)
    assert model == reloaded
    assert reloaded.length == len(text)

@given(bad_emb=st.lists(nonfinite, min_size=1))
def test_nonfinite_embedding_rejected(bad_emb):
    # pydantic's ValidationError is a subclass of ValueError
    with pytest.raises(ValueError):
        ChunkModel(text="x", embedding=bad_emb)

@given(emb=st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
def test_large_embedding_rejected_if_out_of_range(emb):
    if any(abs(x) > 100 for x in emb):
        with pytest.raises(ValueError):
            ChunkModel(text="x", embedding=emb)

# `snapshot` is a fixture supplied by a snapshot-testing pytest plugin (not shown here)
def test_schema_stable(snapshot):
    assert ChunkModel.model_json_schema() == snapshot
6. Big-O & Allocation Guarantees
| Operation | Time | Heap | Notes |
|---|---|---|---|
| Validation | O(#fields + N) | O(#fields + N) | Once at edge; N = total size of text, metadata, and embedding |
| Serialization | O(#fields + N) | O(#fields + N) | Keys emitted in field declaration order |
| computed_field | O(1) or O(N) | O(1) | Recomputed on access; keep pure & fast |
7. Anti-Patterns & Immediate Fixes
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Raw **kwargs → dataclass | Silent invalid states | Pydantic model_validate at edge |
| Manual JSON serde | Unstable order, no versioning | model_dump_json / validate_json |
| Pydantic in hot path | 10–100× slowdown | Validate once → convert to frozen core |
| Missing discriminator | Union parse ambiguity | Annotated[Union[...], Field(discriminator="kind")] |
| Mutable models in core | Accidental mutation | frozen=True + extra="forbid" |
8. Pre-Core Quiz
- Pydantic at edges for…? → Runtime validation + stable serde
- model_validator(mode="after") for…? → Cross-field checks
- Discriminated unions use…? → kind tag
- computed_field gives…? → Pure derived properties
- Core stays…? → Plain frozen dataclasses
9. Post-Core Exercise
- Wrap one core ADT in a Pydantic model → add model_validator + computed_field.
- Add discriminated union for a sum type → test parsing.
- Replace one raw JSON → dataclass with Pydantic edge + bridge.
- Add schema snapshot test for a model → verify stability.
Next: M05C07 – Pattern Matching in Python 3.10+ for ADTs.
You now have bulletproof I/O: every external payload is validated exactly once at the edge, serialized stably forever, and the core pipeline runs at full speed on pure frozen ADTs. The rest of Module 5 adds pattern matching for orchestration and final serialization contracts.