Core 4: Property-Based Regression & Invariant Testing for Pure Pipelines (Hypothesis in CI)
Module 10
Core question:
How do you use property-based testing with Hypothesis to catch regressions and enforce invariants in pure functional pipelines, integrating these tests into CI for continuous verification of laws like equivalence, idempotence, and associativity?
In this core, we integrate property-based testing (PBT) into the FuncPipe RAG Builder (now at funcpipe-rag-10). PBT uses Hypothesis to generate diverse inputs, verifying properties (e.g., laws from prior modules) rather than specific examples. This catches regressions in pure pipelines, where traditional unit tests miss edge cases. We focus on invariants (e.g., idempotence), regressions (e.g., equivalence post-refactor), and CI integration (e.g., fail on property violations). Builds on Core 1's equivalence predicates.
Motivation Bug: Example-based tests miss rare failures (e.g., deep trees breaking recursion limits); PBT explores vast inputs, shrinking failures for minimal repros—essential for pure pipelines where bugs hide in compositions.
Delta from Core 3: Observable pipelines are debuggable; PBT automates invariant checks in CI.
PBT Protocol (Contract, Entry/Exit Criteria):
- Properties: Formal statements (e.g., "f(f(x)) == f(x)" for idempotence), derived from laws/invariants. Derive properties from specs plus stated preconditions (e.g., if a law assumes sorted input, the strategy must generate sorted data or use assume(...)). The strategy encodes the domain; assume is for rare conditional edges, not basic validity.
- Semantics: Tests generate and shrink inputs, fail on counterexamples, and are seeded for repro.
- Purity: Focus on pure funcs; mock effects with fakes.
- Error Model: Properties include failure paths (e.g., error propagation).
- Resource Safety: Bound generation (e.g., max_size); timeout per test.
- Integration: Add to RAG (test clean_doc idempotence); CI runs the full suite.
- Mypy Config: --strict; strategies typed.
- Exit: Key laws pass, diversity targets hit (via events/targets), runtime budget met, CI green. Code coverage >80% (via pytest-cov) as a side metric.
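The "strategy encodes domain" rule can be sketched as follows. This is a minimal illustration, not part of the RAG codebase: `merge_sorted` is a hypothetical function whose law assumes sorted input, so sortedness is built into the strategy, while `assume` only trims one rare edge.

```python
from hypothesis import assume, given, strategies as st


def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Hypothetical function under test: merge two already-sorted lists."""
    out: list[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


# The precondition "input is sorted" lives in the strategy, so every draw is valid.
sorted_ints = st.lists(st.integers(), max_size=30).map(sorted)


@given(a=sorted_ints, b=sorted_ints)
def test_merge_matches_sorted(a: list[int], b: list[int]) -> None:
    assert merge_sorted(a, b) == sorted(a + b)


@given(a=sorted_ints, b=sorted_ints)
def test_merge_min_is_first(a: list[int], b: list[int]) -> None:
    # assume() discards only the rare case where both lists are empty.
    assume(a or b)
    assert merge_sorted(a, b)[0] == min(a + b)
```

Filtering unsorted lists with `assume` instead would reject most draws and tends to trip Hypothesis's filter health check, which is why the precondition belongs in the strategy.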
Audience: Engineers ensuring pipeline reliability via automated laws.
Outcome:
1. Write PBT for invariants/regressions.
2. Integrate Hypothesis into CI.
3. Test RAG properties, catch bugs.
Note: Hypothesis "coverage" refers to input-space exploration (via events/targets), not path coverage (measured separately with pytest-cov).
Split PBT into unit (small sizes, strict deadlines for pure stages) and e2e (fewer examples, no deadlines, fixed fakes for pipelines).
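One way to express the unit/e2e split is with two reusable `settings` objects applied as decorators. The sketch below is illustrative only: the budgets are placeholder numbers, and `normalize` and `run_pipeline` are hypothetical stand-ins for a pure stage and a whole pipeline.

```python
from hypothesis import given, settings, strategies as st

# Hypothetical budgets; tune the numbers to your own runtime targets.
UNIT = settings(max_examples=200, deadline=200)  # deadline is in milliseconds per example
E2E = settings(max_examples=25, deadline=None)   # fewer examples, no deadline


def normalize(text: str) -> str:
    """Hypothetical pure stage: collapse all whitespace runs to single spaces."""
    return " ".join(text.split())


def run_pipeline(texts: list[str]) -> list[str]:
    """Hypothetical end-to-end stand-in: normalize, then drop empty results."""
    return [t for t in map(normalize, texts) if t]


@UNIT
@given(st.text(max_size=50))
def test_normalize_idempotent(text: str) -> None:
    assert normalize(normalize(text)) == normalize(text)


@E2E
@given(st.lists(st.text(max_size=50), max_size=20))
def test_pipeline_idempotent(texts: list[str]) -> None:
    once = run_pipeline(texts)
    assert run_pipeline(once) == once
```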
Property Taxonomy:
- Algebraic laws (idempotence, associativity).
- Metamorphic properties (e.g., adding noise preserves core outputs).
- Round-trips (serialize/deserialize == id).
- Observational equivalence (refactor matches baseline).
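Two of these categories are easy to show in miniature. The sketch below uses the standard `json` module for a round-trip and a hypothetical `tokens` function for a metamorphic property, where whitespace padding plays the role of noise. Neither function comes from the RAG codebase.

```python
import json

from hypothesis import given, strategies as st

# JSON-compatible flat documents: string keys, scalar values.
json_docs = st.dictionaries(
    st.text(max_size=10),
    st.one_of(st.none(), st.booleans(), st.integers(), st.text(max_size=20)),
    max_size=5,
)


@given(json_docs)
def test_json_round_trip(doc: dict) -> None:
    # Round-trip: deserialize(serialize(x)) == x
    assert json.loads(json.dumps(doc)) == doc


def tokens(text: str) -> list[str]:
    """Hypothetical tokenizer: split on whitespace."""
    return text.split()


@given(text=st.text(max_size=50), pad=st.text(alphabet=" \t\n", max_size=5))
def test_padding_is_noise(text: str, pad: str) -> None:
    # Metamorphic: surrounding whitespace must not change the tokens.
    assert tokens(pad + text + pad) == tokens(text)
```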
1. Laws & Invariants
| Invariant | Description | Enforcement |
|---|---|---|
| Property Inv | All stated properties hold for generated inputs; failures shrink to minimal. | Hypothesis runs |
| Equivalence Inv | Refactors preserve Core 1 predicates. | PBT equivalence |
| Idempotence Inv | If intended, f(f(x)) == f(x). | PBT idempotence |
| Associativity Inv | For monoids/folds, (a op b) op c == a op (b op c). | PBT associativity |
| Diversity Inv | Properties explore key input dimensions; measured via Hypothesis events/targets. | CI metrics |
These automate law enforcement.
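As a concrete instance of the associativity row, here is a minimal sketch over a hypothetical count-merging monoid (`merge_counts` with `{}` as identity). Integer counts are used on purpose: floating-point addition is not exactly associative and would need a tolerant comparison.

```python
from hypothesis import given, strategies as st

Counts = dict[str, int]


def merge_counts(a: Counts, b: Counts) -> Counts:
    """Hypothetical monoid operation: add counts key by key."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


counts = st.dictionaries(
    st.sampled_from(["a", "b", "c", "d", "e"]),
    st.integers(min_value=0, max_value=100),
    max_size=5,
)


@given(a=counts, b=counts, c=counts)
def test_merge_associative(a: Counts, b: Counts, c: Counts) -> None:
    assert merge_counts(merge_counts(a, b), c) == merge_counts(a, merge_counts(b, c))


@given(a=counts)
def test_merge_identity(a: Counts) -> None:
    assert merge_counts(a, {}) == a
    assert merge_counts({}, a) == a
```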
2. Decision Table
| Scenario | Need Diversity | CI Integration | Recommended |
|---|---|---|---|
| Check laws | High | Yes | PBT invariants |
| Regression guard | High | Yes | PBT equivalence |
| Edge cases | High | Yes | Custom strategies |
| Simple asserts | Low | No | Unit tests |
Use PBT for pure funcs; units for effects.
3. Public API (Strategies for RAG Domain)
Domain strategies for Hypothesis. Note: text strategies intentionally include diverse Unicode to stress handling; constrain alphabet if testing non-Unicode paths. Pipeline accepts raw string categories.
from random import Random

from hypothesis import strategies as st
from funcpipe_rag import RawDoc, CleanDoc, Chunk, clean_doc  # clean_doc: assumed to be exported alongside the doc types


def raw_doc_strategy() -> st.SearchStrategy[RawDoc]:
    return st.builds(
        RawDoc,
        # Constrain to letters/numbers for realistic IDs
        doc_id=st.text(min_size=1, max_size=20, alphabet=st.characters(categories=("Ll", "Lu", "Nd"))),
        title=st.text(max_size=50),
        abstract=st.text(min_size=1, max_size=500),
        categories=st.text(max_size=20),
    )


clean_doc_strategy = raw_doc_strategy().map(clean_doc)  # Only reachable CleanDocs
doc_list_strategy = st.lists(raw_doc_strategy(), max_size=50, unique_by=lambda d: d.doc_id)


@st.composite
def chunk_strategy(draw: st.DrawFn) -> Chunk:
    doc = draw(raw_doc_strategy())
    size = draw(st.integers(min_value=1, max_value=256))
    # abstract has min_size=1, so start is always a valid index and the slice is non-empty
    start = draw(st.integers(min_value=0, max_value=max(0, len(doc.abstract) - 1)))
    end = min(len(doc.abstract), start + size)
    return Chunk(doc_id=doc.doc_id, text=doc.abstract[start:end], start=start, end=end)
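For floating-point equivalence (e.g., embeddings), use a tolerant predicate. The version here compares position by position, so it assumes both lists are aligned in order and IDs and is only valid if stable ordering is guaranteed; otherwise, compare by ID.
import numpy as np


def eq_numeric(a: list, b: list, atol: float = 1e-5) -> bool:
    # Positional comparison: a[i] is compared with b[i]
    if len(a) != len(b):
        return False
    # Note: np.allclose also applies its default rtol=1e-5 on top of atol
    return all(np.allclose(np.array(va), np.array(vb), atol=atol) for va, vb in zip(a, b))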
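When stable ordering is not guaranteed, the comparison can be keyed by ID instead of position. The following is a sketch under assumptions: embeddings are represented as hypothetical `(id, vector)` pairs, and `math.isclose` from the standard library is swapped in for NumPy so the example stays dependency-free (pure absolute tolerance, no relative term).

```python
import math

Embedding = tuple[str, list[float]]  # hypothetical shape: (id, vector)


def eq_numeric_by_id(a: list[Embedding], b: list[Embedding], atol: float = 1e-5) -> bool:
    da, db = dict(a), dict(b)
    # Duplicate ids would make an id-keyed comparison ambiguous, so reject them.
    if len(da) != len(a) or len(db) != len(b):
        return False
    if da.keys() != db.keys():
        return False
    return all(
        len(da[key]) == len(db[key])
        and all(
            math.isclose(x, y, rel_tol=0.0, abs_tol=atol)
            for x, y in zip(da[key], db[key])
        )
        for key in da
    )


left = [("d1", [0.1, 0.2]), ("d2", [1.0])]
right = [("d2", [1.0 + 1e-7]), ("d1", [0.1, 0.2])]  # reordered, tiny numeric drift
print(eq_numeric_by_id(left, right))                # True
print(eq_numeric_by_id(left, [("d1", [0.1, 0.2]), ("d2", [2.0])]))  # False
```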
4. Reference Implementations
4.1 Invariant Properties (Idempotence)
Precondition: clean_doc is designed to be idempotent (e.g., normalization like lowercase/strip).
from hypothesis import given, settings, Phase, HealthCheck
from funcpipe_rag import clean_doc, eq_pure  # clean_doc: assumed to be exported by the package

# Phase.reuse replays saved failures and Phase.shrink minimizes new ones;
# leaving them out of `phases` would disable database replay and shrinking.
PHASES = (Phase.explicit, Phase.reuse, Phase.generate, Phase.target, Phase.shrink)


@given(raw=raw_doc_strategy())
@settings(max_examples=200, deadline=1000, phases=PHASES,  # deadline is in milliseconds
          suppress_health_check=[HealthCheck.filter_too_much])
def test_clean_idempotent(raw):
    once = clean_doc(raw)
    twice = clean_doc(once)
    assert eq_pure([once], [twice], key=lambda d: (d.doc_id, d.title, d.abstract, d.categories))


@given(clean=clean_doc_strategy)
@settings(max_examples=200, deadline=1000, phases=PHASES,
          suppress_health_check=[HealthCheck.filter_too_much])
def test_clean_fixpoint(clean):
    # Every reachable CleanDoc is a fixed point of clean_doc
    assert eq_pure([clean_doc(clean)], [clean], key=lambda d: (d.doc_id, d.title, d.abstract, d.categories))
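To see what an idempotence violation looks like once shrunk, here is a small sketch using Hypothesis's `find`, which searches for an input satisfying a predicate and minimizes it before returning. `clean_title` is a hypothetical cleaner with a deliberate bug (truncating after stripping can expose new trailing whitespace); it is not the project's `clean_doc`.

```python
from hypothesis import find, strategies as st


def clean_title(text: str) -> str:
    """Hypothetical buggy cleaner: strip, then truncate to 3 characters."""
    return text.strip()[:3]


def is_counterexample(text: str) -> bool:
    # True when applying the cleaner twice differs from applying it once.
    once = clean_title(text)
    return clean_title(once) != once


# A two-character alphabet keeps the search space small for the demo.
counterexample = find(st.text(alphabet=" a", max_size=8), is_counterexample)
print(repr(counterexample))
```

In a real suite the same bug would surface through an `@given` test as a falsifying example. `find` is used here only to make the shrunk input directly inspectable.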
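4.2 Regression Properties (Equivalence)
Control non-determinism with fixed seeds and fakes, and give each implementation its own RNG and storage so that no state is shared. Pass tuple(docs) to each side so neither can mutate a shared input list (this relies on the documents themselves being immutable). Compare outputs with multiset equality, which ignores order but still counts duplicates.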
from collections import Counter

# imperative_rag, fp_rag, and fake_storage are project helpers; their imports are not shown in this section.


@given(docs=doc_list_strategy, seed=st.integers(0, 2**32 - 1))
@settings(max_examples=200, deadline=1000, suppress_health_check=[HealthCheck.filter_too_much])
def test_refactor_equiv(docs, seed):
    # Separate RNGs with the same seed: identical streams, no shared state
    rng_imp = Random(seed)
    rng_fp = Random(seed)
    clock = lambda _: None  # Noop clock
    storage_imp = fake_storage()
    storage_fp = fake_storage()
    imp_input = tuple(docs)
    fp_input = tuple(docs)
    imp = imperative_rag(imp_input, rng=rng_imp, clock=clock, storage=storage_imp)
    fp = fp_rag(fp_input, rng=rng_fp, clock=clock, storage=storage_fp)
    # Multiset equality (order-insensitive)
    key = lambda c: (c.doc_id, c.start, c.end, c.text)
    assert Counter(map(key, imp)) == Counter(map(key, fp))
    # Observational equivalence on storage
    assert storage_imp.snapshot() == storage_fp.snapshot()
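The choice of `Counter` over `set` matters when a refactor silently drops or adds duplicates. A minimal standalone illustration, using hypothetical `(doc_id, start, end)` keys:

```python
from collections import Counter

ChunkKey = tuple[str, int, int]  # hypothetical key: (doc_id, start, end)


def same_multiset(a: list[ChunkKey], b: list[ChunkKey]) -> bool:
    """Order-insensitive equality that still counts duplicates."""
    return Counter(a) == Counter(b)


baseline = [("d1", 0, 5), ("d1", 0, 5), ("d2", 0, 3)]
reordered = [("d2", 0, 3), ("d1", 0, 5), ("d1", 0, 5)]
deduped = [("d1", 0, 5), ("d2", 0, 3)]

print(same_multiset(baseline, reordered))  # True: order is ignored
print(same_multiset(baseline, deduped))    # False: the dropped duplicate is detected
print(set(baseline) == set(deduped))       # True: set equality would miss it
```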
4.3 CI Integration Pattern
Cache dependencies, pin the Hypothesis profile, and upload failing examples as artifacts. Use a fixed seed on pull requests so failures reproduce; omit the seed in the nightly job for true stochastic exploration. Note: failing examples are stored only if the Hypothesis example database is enabled (it is by default).
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Cache deps
        uses: actions/cache@v3
        with:
          # Block mapping: ${{ ... }} expressions are not valid inside a { } flow mapping
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
      - name: Install deps
        run: pip install -r requirements.txt  # Includes hypothesis
      - name: Run PBT
        env:
          HYPOTHESIS_PROFILE: ci
        run: pytest --hypothesis-show-statistics --hypothesis-seed=0  # Seed for repro; stats for diversity
      - name: Artifact failing examples
        if: failure()
        uses: actions/upload-artifact@v3
        with:
          name: hypothesis-failures
          path: .hypothesis/examples/

# .github/workflows/nightly.yml (scheduled)
name: Nightly PBT
on:
  schedule:
    - cron: '0 0 * * *'  # Daily
jobs:
  test:
    # ... similar, but omit --hypothesis-seed=0 for stochastic search
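Once a PR or nightly run surfaces a failing input, one common way to keep it as a permanent regression guard is Hypothesis's `@example` decorator, which runs in the explicit phase on every invocation regardless of seed. The sketch below is illustrative: `collapse_spaces` and the pinned input are hypothetical, not taken from the RAG codebase.

```python
from hypothesis import example, given, strategies as st


def collapse_spaces(text: str) -> str:
    """Hypothetical cleaner: collapse runs of spaces to a single space."""
    while "  " in text:
        text = text.replace("  ", " ")
    return text


@given(st.text(alphabet=" ab", max_size=20))
@example("a   b")  # pinned regression input (hypothetical): defeats a single-pass replace
def test_collapse_spaces_idempotent(text: str) -> None:
    once = collapse_spaces(text)
    assert collapse_spaces(once) == once
```

Pinned examples complement the example database: the database is local state that a fresh CI checkout does not have, while `@example` lives in version control.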
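Profiles are registered in funcpipe_rag/hypothesis_profile.py and loaded explicitly in conftest.py. A profile must be registered before it can be loaded, so conftest.py imports the profile module first:
# funcpipe_rag/hypothesis_profile.py
from hypothesis import Phase, settings

# Keep Phase.reuse and Phase.shrink so saved failures replay and new failures are minimized
settings.register_profile(
    "ci",
    max_examples=1000,
    deadline=None,
    phases=(Phase.explicit, Phase.reuse, Phase.generate, Phase.target, Phase.shrink),
)
settings.register_profile("unit_pbt", max_examples=200, deadline=1000)

# tests/conftest.py
import os

import funcpipe_rag.hypothesis_profile  # noqa: F401  (importing registers the profiles)
from hypothesis import settings

profile = os.getenv("HYPOTHESIS_PROFILE") or ("ci" if os.getenv("CI") else "unit_pbt")
settings.load_profile(profile)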
4.4 Custom Strategies (Deep Trees)
Bound depth explicitly for recursion tests.
# TreeDoc, Leaf, and Node are project types; their imports are not shown in this section.
@st.composite
def bounded_tree_strategy(draw: st.DrawFn, max_depth: int = 5) -> TreeDoc:
    # Stop at the depth bound, or early on a coin flip
    if max_depth <= 0 or draw(st.booleans()):
        return draw(st.builds(Leaf, content=st.text()))
    children = draw(st.lists(bounded_tree_strategy(max_depth=max_depth - 1), min_size=1, max_size=3))
    return Node(children=children)
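To show a depth-bounded strategy driving an actual property, here is a self-contained sketch. The `Leaf` and `Node` dataclasses are hypothetical stand-ins for the project's tree types, and `depth` is written iteratively so that the check itself cannot hit the recursion limit.

```python
from dataclasses import dataclass
from typing import Union

from hypothesis import given, strategies as st


@dataclass(frozen=True)
class Leaf:
    content: str


@dataclass(frozen=True)
class Node:
    children: tuple  # tuple of Leaf or Node values


Tree = Union[Leaf, Node]


@st.composite
def trees(draw: st.DrawFn, max_depth: int = 5) -> Tree:
    # Stop at the depth bound, or early on a coin flip.
    if max_depth <= 0 or draw(st.booleans()):
        return Leaf(draw(st.text(max_size=10)))
    children = draw(st.lists(trees(max_depth=max_depth - 1), min_size=1, max_size=3))
    return Node(tuple(children))


def depth(tree: Tree) -> int:
    """Iterative depth: a Leaf has depth 0, a Node is one deeper than its deepest child."""
    deepest, stack = 0, [(tree, 0)]
    while stack:
        node, level = stack.pop()
        deepest = max(deepest, level)
        if isinstance(node, Node):
            stack.extend((child, level + 1) for child in node.children)
    return deepest


@given(trees(max_depth=4))
def test_depth_is_bounded(tree: Tree) -> None:
    assert depth(tree) <= 4
```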
4.5 Stateful Testing for Pipelines
For boundary state (caches/retries), use RuleBasedStateMachine.
from hypothesis import settings
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant


class RagPipelineMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.storage = fake_storage()
        self.docs = []

    @rule(doc=raw_doc_strategy())
    def add_doc(self, doc):
        self.docs.append(doc)
        fp_rag(self.docs, storage=self.storage)  # Run pipeline

    @invariant()
    def no_duplicate_chunks(self):
        chunks = self.storage.read_chunks()
        ids = [(c.doc_id, c.start, c.end) for c in chunks]
        assert len(ids) == len(set(ids))  # Example invariant: unique chunks


# @settings cannot decorate a TestCase subclass; assign settings on the generated TestCase instead.
RagPipelineMachine.TestCase.settings = settings(max_examples=100, stateful_step_count=50)
# Expose a single TestCase so the machine is collected and run only once.
TestRagPipeline = RagPipelineMachine.TestCase
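A self-contained variant of the same pattern can be run without the RAG codebase. In this sketch `ChunkStore` is a hypothetical in-memory store keyed by `(doc_id, start, end)`, and the machine checks it against a simple set-based model after every step.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule


class ChunkStore:
    """Hypothetical in-memory store keyed by (doc_id, start, end)."""

    def __init__(self) -> None:
        self._chunks: dict[tuple[str, int, int], str] = {}

    def write(self, doc_id: str, start: int, end: int, text: str) -> None:
        self._chunks[(doc_id, start, end)] = text  # rewriting a key overwrites it

    def keys(self) -> list[tuple[str, int, int]]:
        return list(self._chunks)


class ChunkStoreMachine(RuleBasedStateMachine):
    def __init__(self) -> None:
        super().__init__()
        self.store = ChunkStore()
        self.model: set[tuple[str, int, int]] = set()  # reference model of expected keys

    @rule(
        doc_id=st.sampled_from(["a", "b"]),
        start=st.integers(min_value=0, max_value=3),
        text=st.text(max_size=5),
    )
    def write_chunk(self, doc_id: str, start: int, text: str) -> None:
        end = start + len(text)
        self.store.write(doc_id, start, end, text)
        self.model.add((doc_id, start, end))

    @invariant()
    def keys_match_model(self) -> None:
        keys = self.store.keys()
        assert len(keys) == len(set(keys))  # no duplicate chunk keys
        assert set(keys) == self.model      # store agrees with the model


ChunkStoreMachine.TestCase.settings = settings(max_examples=50, stateful_step_count=20)
TestChunkStore = ChunkStoreMachine.TestCase
```

A small `sampled_from` pool for `doc_id` is deliberate: it makes repeated writes to the same key likely, which is where cache and retry bugs tend to hide.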
RAG Integration
Add PBT to tests/test_rag.py; CI runs suite with stats.
5. Property-Based Proofs (tests/test_module_10_core4.py)
Use events/targets for diversity.
from hypothesis import event, given, target


@given(docs=doc_list_strategy)
def test_rag_diversity(docs):
    event(f"n_docs={len(docs)}")  # Reported by --hypothesis-show-statistics
    target(len(docs))  # Steers generation toward larger inputs
    # ... property ...
6. Runtime Preservation Guarantee
PBT runs are bounded by max_examples and deadline; CI timeouts prevent hangs; unit and e2e tests use separate profiles.
7. Anti-Patterns & Immediate Fixes
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Only examples | Miss edges | Add PBT |
| Unbounded gen | Slow CI | Bound sizes/deadline |
| No shrinking | Hard repro | Use Hypothesis defaults |
| Ignore laws | Regressions | Derive from invariants |
8. Pre-Core Quiz
- PBT for…? → Laws/regressions
- Hypothesis generates…? → Diverse inputs
- Shrinking for…? → Minimal failures
- CI profile…? → More examples
- Strategies for…? → Domain data
9. Post-Core Exercise
- Write PBT for RAG idempotence.
- Add to CI; run.
- Fix a found bug.
Pipeline Usage (Idiomatic)
Next: core 5. Property-Based Testing for Async and Streaming Pipelines (Hypothesis Strategies and Faked I/O)