Core 4: Property-Based Regression & Invariant Testing for Pure Pipelines (Hypothesis in CI)

Module 10

Core question:
How do you use property-based testing with Hypothesis to catch regressions and enforce invariants in pure functional pipelines, integrating these tests into CI for continuous verification of laws like equivalence, idempotence, and associativity?

In this core, we integrate property-based testing (PBT) into the FuncPipe RAG Builder (now at funcpipe-rag-10). PBT uses Hypothesis to generate diverse inputs and verifies properties (e.g., laws from prior modules) rather than specific examples. This catches regressions in pure pipelines, where traditional unit tests miss edge cases. We focus on invariants (e.g., idempotence), regressions (e.g., equivalence after a refactor), and CI integration (e.g., failing the build on property violations). This core builds on Core 1's equivalence predicates.

Motivation Bug: Example-based tests miss rare failures (e.g., deep trees that exceed the recursion limit). PBT explores a much larger input space and shrinks each failure to a minimal reproduction, which matters for pure pipelines, where bugs tend to hide in compositions of stages.
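To make that failure mode concrete, here is a minimal sketch of a bug that hand-picked examples rarely hit. Leaf, Node, depth_recursive, depth_iterative, and chain are hypothetical names used only for illustration; they are not part of funcpipe_rag.

```python
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    content: str


@dataclass(frozen=True)
class Node:
    children: tuple


def depth_recursive(tree) -> int:
    # Straightforward recursion: correct on shallow trees, fails on very deep ones.
    if isinstance(tree, Leaf):
        return 1
    return 1 + max(depth_recursive(child) for child in tree.children)


def depth_iterative(tree) -> int:
    # Same result with an explicit stack, so depth is not limited by the call stack.
    best, stack = 0, [(tree, 1)]
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        if isinstance(node, Node):
            stack.extend((child, depth + 1) for child in node.children)
    return best


def chain(n: int):
    # Build a degenerate tree: a single leaf wrapped in n nested nodes.
    tree = Leaf("x")
    for _ in range(n):
        tree = Node((tree,))
    return tree
```

Both functions agree on small inputs, which is all a typical example-based test would check. The recursive version only raises RecursionError once the nesting exceeds the interpreter's recursion limit, which is the kind of input a size-targeted generator can reach.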

Delta from Core 3: Observable pipelines are debuggable; PBT automates invariant checks in CI.

PBT Protocol (Contract, Entry/Exit Criteria):
- Properties: Formal statements (e.g., "f(f(x)) == f(x)" for idempotence), derived from laws and invariants. Derive properties from specs plus stated preconditions (e.g., if a law assumes sorted input, the strategy must generate sorted data or use assume(...)). The strategy encodes the domain; assume is for rare conditional edges, not basic validity.
- Semantics: Tests generate and shrink inputs; fail on counterexamples; seed for repro.
- Purity: Focus on pure funcs; mock effects with fakes.
- Error Model: Properties include failure paths (e.g., error propagation).
- Resource Safety: Bound generation (e.g., max_size); timeout per test.
- Integration: Add to RAG (test clean_doc idempotence); CI runs full suite.
- Mypy Config: --strict; strategies typed.
- Exit: Key laws pass, diversity targets hit (via events/targets), runtime budget met, CI green. Code coverage >80% (via pytest-cov) as side metric.
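The distinction between encoding a precondition in the strategy and filtering with assume can be sketched as follows. insert_sorted and sorted_ints are hypothetical names for illustration; the Hypothesis calls (st.lists, .map, assume, given, settings) are the library's own.

```python
import bisect

from hypothesis import assume, given, settings, strategies as st

# Precondition encoded in the strategy: every generated list is already sorted,
# so no examples are thrown away to satisfy basic validity.
sorted_ints = st.lists(st.integers(), max_size=30).map(sorted)


def insert_sorted(xs: list, x: int) -> list:
    """Hypothetical pure stage: return a new sorted list that also contains x."""
    out = list(xs)
    bisect.insort(out, x)
    return out


@settings(max_examples=100, deadline=None)
@given(xs=sorted_ints, x=st.integers())
def test_insert_keeps_order(xs, x):
    out = insert_sorted(xs, x)
    assert out == sorted(out)
    assert len(out) == len(xs) + 1


@settings(max_examples=100, deadline=None)
@given(xs=sorted_ints, x=st.integers())
def test_fresh_element_appears_once(xs, x):
    # assume(...) handles a rare conditional edge (x already present),
    # not the basic validity of the input.
    assume(x not in xs)
    assert insert_sorted(xs, x).count(x) == 1
```

If the sortedness requirement were expressed as assume(xs == sorted(xs)) on unsorted lists instead, most generated examples would be rejected and Hypothesis would likely report a filter_too_much health check.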

Audience: Engineers ensuring pipeline reliability via automated laws.

Outcome:

  1. Write PBT for invariants/regressions.
  2. Integrate Hypothesis into CI.
  3. Test RAG properties, catch bugs.

Note: Hypothesis "coverage" refers to input-space exploration (via events/targets), not path coverage (measured separately with pytest-cov).

Split PBT into unit (small sizes, strict deadlines for pure stages) and e2e (fewer examples, no deadlines, fixed fakes for pipelines).
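One way to express this split is with two reusable settings objects applied as decorators. This is a sketch under assumptions: the names unit_pbt and e2e_pbt, the example counts, and the toy tokenize/tiny_pipeline functions are illustrative, not taken from funcpipe_rag.

```python
from hypothesis import given, settings, strategies as st

# Unit-level: many small examples with a per-example deadline (milliseconds).
unit_pbt = settings(max_examples=200, deadline=1000)
# End-to-end: fewer examples and no deadline, since whole-pipeline runs vary in time.
e2e_pbt = settings(max_examples=25, deadline=None)


def tokenize(text: str) -> list:
    """Hypothetical pure stage: split on whitespace."""
    return text.split()


def tiny_pipeline(texts: list) -> list:
    """Hypothetical pipeline: tokenize every text and concatenate the results."""
    return [tok for text in texts for tok in tokenize(text)]


@unit_pbt
@given(text=st.text(max_size=50))
def test_tokens_have_no_whitespace(text):
    assert all(tok and not any(ch.isspace() for ch in tok) for tok in tokenize(text))


@e2e_pbt
@given(texts=st.lists(st.text(max_size=50), max_size=20))
def test_pipeline_token_count(texts):
    assert len(tiny_pipeline(texts)) == sum(len(tokenize(t)) for t in texts)
```

Decorator-level settings like these take precedence over whichever profile is loaded, so they suit per-test tuning, while profiles set suite-wide defaults.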

Property Taxonomy:
- Algebraic laws (idempotence, associativity).
- Metamorphic properties (e.g., adding noise preserves core outputs).
- Round-trips (serialize/deserialize == id).
- Observational equivalence (refactor matches baseline).
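Two of these categories, round-trips and metamorphic properties, can be sketched in a few lines. The normalize function and the doc_dicts strategy are hypothetical stand-ins, not the real RAG stages.

```python
import json

from hypothesis import given, settings, strategies as st

doc_dicts = st.fixed_dictionaries(
    {"doc_id": st.text(min_size=1, max_size=20), "abstract": st.text(max_size=200)}
)


def normalize(text: str) -> str:
    """Hypothetical cleaning stage: collapse whitespace runs and lowercase."""
    return " ".join(text.split()).lower()


@settings(max_examples=100, deadline=None)
@given(doc=doc_dicts)
def test_json_round_trip(doc):
    # Round-trip: deserialize(serialize(x)) == x
    assert json.loads(json.dumps(doc)) == doc


@settings(max_examples=100, deadline=None)
@given(text=st.text(max_size=200), pad=st.text(alphabet=" \t\n", max_size=5))
def test_padding_is_noise(text, pad):
    # Metamorphic: surrounding whitespace is noise and must not change the output.
    assert normalize(pad + text + pad) == normalize(text)
```

The metamorphic test never states what the correct output is; it only relates two runs of the same function, which is useful when an exact oracle is expensive or unavailable.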


1. Laws & Invariants

| Invariant | Description | Enforcement |
|---|---|---|
| Property Inv | All stated properties hold for generated inputs; failures shrink to minimal. | Hypothesis runs |
| Equivalence Inv | Refactors preserve Core 1 predicates. | PBT equivalence |
| Idempotence Inv | If intended, f(f(x)) == f(x). | PBT idempotence |
| Associativity Inv | For monoids/folds, (a op b) op c == a op (b op c). | PBT associativity |
| Diversity Inv | Properties explore key input dimensions; measured via Hypothesis events/targets. | CI metrics |

These automate law enforcement.
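As an illustration of the associativity row, here is a minimal sketch of monoid laws for a count-merging fold. merge and the counts strategy are hypothetical; integer counts are used on purpose, because floating-point addition is not exactly associative and would need a tolerant comparison.

```python
from hypothesis import given, settings, strategies as st

# Term -> count maps with bounded size to keep generation cheap.
counts = st.dictionaries(
    st.text(max_size=5), st.integers(min_value=0, max_value=1000), max_size=10
)


def merge(a: dict, b: dict) -> dict:
    """Hypothetical monoid operation: sum counts key by key without mutating inputs."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


@settings(max_examples=100, deadline=None)
@given(a=counts, b=counts, c=counts)
def test_merge_associative(a, b, c):
    assert merge(merge(a, b), c) == merge(a, merge(b, c))


@settings(max_examples=100, deadline=None)
@given(a=counts)
def test_merge_identity(a):
    # The empty dict is the identity element on both sides.
    assert merge({}, a) == a == merge(a, {})
```

Associativity is what makes it safe to re-bracket a fold, for example when splitting a document list into batches and merging partial results.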


2. Decision Table

| Scenario | Need Diversity | CI Integration | Recommended |
|---|---|---|---|
| Check laws | High | Yes | PBT invariants |
| Regression guard | High | Yes | PBT equivalence |
| Edge cases | High | Yes | Custom strategies |
| Simple asserts | Low | No | Unit tests |

Use PBT for pure funcs; units for effects.


3. Public API (Strategies for RAG Domain)

Domain strategies for Hypothesis. Note: text strategies intentionally include diverse Unicode to stress handling; constrain alphabet if testing non-Unicode paths. Pipeline accepts raw string categories.

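from random import Random  # used by the equivalence test in 4.2

from hypothesis import strategies as st
# clean_doc is used below; it is assumed to be exported from funcpipe_rag alongside the types
from funcpipe_rag import RawDoc, CleanDoc, Chunk, clean_doc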


def raw_doc_strategy() -> st.SearchStrategy[RawDoc]:
    return st.builds(
        RawDoc,
        doc_id=st.text(min_size=1, max_size=20, alphabet=st.characters(categories=("Ll", "Lu", "Nd"))),
        # Constrain to letters/numbers for realistic IDs
        title=st.text(max_size=50),
        abstract=st.text(min_size=1, max_size=500),
        categories=st.text(max_size=20),
    )


clean_doc_strategy = raw_doc_strategy().map(clean_doc)  # Only reachable CleanDocs

doc_list_strategy = st.lists(raw_doc_strategy(), max_size=50, unique_by=lambda d: d.doc_id)


@st.composite
def chunk_strategy(draw) -> Chunk:
    doc = draw(raw_doc_strategy())
    size = draw(st.integers(min_value=1, max_value=256))
    start = draw(st.integers(min_value=0, max_value=max(0, len(doc.abstract) - 1)))
    end = min(len(doc.abstract), start + size)
    return Chunk(doc_id=doc.doc_id, text=doc.abstract[start:end], start=start, end=end)

For floating-point equivalence (e.g., embeddings), use a tolerant predicate. The predicate below assumes both lists are aligned in order and IDs, so it is only valid when stable ordering is guaranteed; otherwise, compare by ID.

import numpy as np


def eq_numeric(a: list, b: list, atol: float = 1e-5) -> bool:
    # Pairwise comparison of aligned vectors; np.allclose also applies its default rtol.
    if len(a) != len(b):
        return False
    return all(np.allclose(np.array(va), np.array(vb), atol=atol) for va, vb in zip(a, b))
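When stable ordering is not guaranteed, the comparison can be keyed by ID instead. The sketch below is an assumption about how that might look: eq_numeric_by_id is a hypothetical helper, inputs are taken to be (doc_id, vector) pairs, and math.isclose is swapped in for np.allclose to keep the example dependency-free.

```python
import math


def eq_numeric_by_id(a, b, atol: float = 1e-5) -> bool:
    """Order-insensitive comparison of (doc_id, vector) pairs within an absolute tolerance."""
    a, b = list(a), list(b)
    da, db = dict(a), dict(b)
    # Duplicate IDs would silently collapse in the dicts, so reject them up front.
    if len(da) != len(a) or len(db) != len(b):
        raise ValueError("duplicate ids are not supported")
    if da.keys() != db.keys():
        return False
    return all(
        len(da[key]) == len(db[key])
        and all(
            math.isclose(x, y, rel_tol=0.0, abs_tol=atol)
            for x, y in zip(da[key], db[key])
        )
        for key in da
    )
```

For example, two runs that emit the same embeddings in a different order compare equal, while a missing ID or a component that differs by more than atol does not.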

4. Reference Implementations

4.1 Invariant Properties (Idempotence)

Precondition: clean_doc is designed to be idempotent (e.g., normalization like lowercase/strip).

from hypothesis import given, settings, HealthCheck
from funcpipe_rag import clean_doc, eq_pure  # clean_doc assumed to be exported from funcpipe_rag


# Default phases are kept: overriding them without Phase.reuse and Phase.shrink
# would skip database replay and disable shrinking of failures.
@given(raw=raw_doc_strategy())
@settings(max_examples=200, deadline=1000,  # deadline is in milliseconds
          suppress_health_check=[HealthCheck.filter_too_much])
def test_clean_idempotent(raw):
    once = clean_doc(raw)
    twice = clean_doc(once)
    assert eq_pure([once], [twice], key=lambda d: (d.doc_id, d.title, d.abstract, d.categories))


@given(clean=clean_doc_strategy)
@settings(max_examples=200, deadline=1000,
          suppress_health_check=[HealthCheck.filter_too_much])
def test_clean_fixpoint(clean):
    # Every reachable CleanDoc is a fixed point of clean_doc
    assert eq_pure([clean_doc(clean)], [clean], key=lambda d: (d.doc_id, d.title, d.abstract, d.categories))

4.2 Regression Properties (Equivalence)

Control non-determinism with fixed seeds/mocks; use separate RNGs and storage to avoid state sharing. Use tuple(docs) if inputs are immutable. Use multiset equality (order-insensitive).

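from collections import Counter
from random import Random

from hypothesis import HealthCheck, given, settings, strategies as st

# End-to-end property: no deadline, matching the unit/e2e split described earlier
@given(docs=doc_list_strategy, seed=st.integers(0, 2**32-1))
@settings(max_examples=200, deadline=None, suppress_health_check=[HealthCheck.filter_too_much])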
def test_refactor_equiv(docs, seed):
    rng_imp = Random(seed)
    rng_fp = Random(seed)
    clock = lambda _: None  # Noop clock
    storage_imp = fake_storage()
    storage_fp = fake_storage()
    imp_input = tuple(docs)
    fp_input = tuple(docs)
    imp = imperative_rag(imp_input, rng=rng_imp, clock=clock, storage=storage_imp)
    fp = fp_rag(fp_input, rng=rng_fp, clock=clock, storage=storage_fp)
    # Multiset equality (order-insensitive)
    key = lambda c: (c.doc_id, c.start, c.end, c.text)
    assert Counter(map(key, imp)) == Counter(map(key, fp))
    # Observational equivalence on storage
    assert storage_imp.snapshot() == storage_fp.snapshot()
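The choice of Counter over set for the order-insensitive comparison matters when a refactor emits a chunk twice. A minimal sketch with hypothetical (doc_id, start, end) keys:

```python
from collections import Counter


def same_multiset(xs, ys) -> bool:
    """Order-insensitive equality that still counts how often each key occurs."""
    return Counter(xs) == Counter(ys)


baseline = [("d1", 0, 4), ("d2", 0, 4), ("d1", 4, 8)]
reordered = [("d1", 4, 8), ("d1", 0, 4), ("d2", 0, 4)]
duplicated = reordered + [("d1", 0, 4)]  # a refactor that emits one chunk twice

reorder_ok = same_multiset(baseline, reordered)          # ordering differences are ignored
set_misses_dup = set(baseline) == set(duplicated)        # set equality hides the duplicate
counter_sees_dup = not same_multiset(baseline, duplicated)
```

Keys must be hashable for Counter to work, which is why the test above projects each chunk to a tuple of plain fields first.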

4.3 CI Integration Pattern

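Cache dependencies, pin the Hypothesis profile, and upload failing examples as artifacts. Use a fixed seed on PR runs for reproducibility; omit the seed on nightly runs so the search is genuinely randomized. Note: failing examples are stored only if the Hypothesis example database is enabled. It is on by default for local runs, but recent Hypothesis versions may turn it off when they detect a CI environment, so set database explicitly in the profile if the artifact step depends on it.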

# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Cache deps
        uses: actions/cache@v3
        with:
          # Block style: "${{ ... }}" contains braces, which are not valid inside a YAML flow mapping
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
      - name: Install deps
        run: pip install -r requirements.txt  # Includes hypothesis
      - name: Run PBT
        env:
          HYPOTHESIS_PROFILE: ci
        run: pytest --hypothesis-show-statistics --hypothesis-seed=0  # Seed for repro; stats for diversity
      - name: Artifact failing examples
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: hypothesis-failures
          path: .hypothesis/examples/
          include-hidden-files: true  # .hypothesis is a hidden directory; v4 skips hidden files by default

# .github/workflows/nightly.yml (scheduled)
name: Nightly PBT
on:
  schedule:
    - cron: '0 0 * * *'  # Daily
jobs:
  test:
    # ... similar, but omit --hypothesis-seed=0 for stochastic search

Profiles are registered in funcpipe_rag/hypothesis_profile.py and loaded explicitly in conftest.py. Registration must run before load_profile, otherwise loading an unregistered profile raises an error:

# funcpipe_rag/hypothesis_profile.py
from hypothesis import settings

# Phases are left at their defaults so CI failures are still shrunk and replayed.
settings.register_profile("ci", max_examples=1000, deadline=None)
settings.register_profile("unit_pbt", max_examples=200, deadline=1000)

# tests/conftest.py
import os
from hypothesis import settings
import funcpipe_rag.hypothesis_profile  # noqa: F401  (importing registers the profiles)

profile = os.getenv("HYPOTHESIS_PROFILE") or ("ci" if os.getenv("CI") else "unit_pbt")
settings.load_profile(profile)
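Once CI reports a shrunk counterexample, one common way to keep it as a permanent regression guard is to pin it with @example, which runs in the explicit phase on every execution regardless of seed or database state. The sketch below uses a hypothetical normalize_ws stage and an invented pinned input purely for illustration.

```python
from hypothesis import example, given, settings, strategies as st


def normalize_ws(text: str) -> str:
    """Hypothetical pure stage: collapse any run of whitespace to a single space."""
    return " ".join(text.split())


@settings(max_examples=100, deadline=None)
@given(text=st.text(max_size=100))
# Pinned input (illustrative): non-breaking and em spaces mixed with ASCII space.
@example(text="\u00a0a\u2003b ")
def test_normalize_ws_idempotent(text):
    once = normalize_ws(text)
    assert normalize_ws(once) == once
```

Pinned examples complement the seed: the seed makes a whole run repeatable, while @example keeps one specific input in the suite even after strategies or Hypothesis versions change.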

4.4 Custom Strategies (Deep Trees)

Bound depth explicitly for recursion tests.

@st.composite
def bounded_tree_strategy(draw, max_depth=5) -> TreeDoc:
    # Emit a leaf once the depth budget is spent, or on a coin flip to vary tree shape.
    if max_depth <= 0 or draw(st.booleans()):
        return draw(st.builds(Leaf, content=st.text()))
    children = draw(st.lists(bounded_tree_strategy(max_depth=max_depth - 1), min_size=1, max_size=3))
    return Node(children=children)
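Hypothesis also ships st.recursive, which bounds tree size through max_leaves rather than an explicit depth counter. The following is a sketch of that alternative, not the module's implementation: Leaf and Node here are stand-in dataclasses, and leaves and map_tree are hypothetical helpers used to state a property.

```python
from dataclasses import dataclass

from hypothesis import given, settings, strategies as st


@dataclass(frozen=True)
class Leaf:
    content: str


@dataclass(frozen=True)
class Node:
    children: tuple


# Base case: leaves. Extension: nodes holding 1-3 subtrees drawn from the same strategy.
trees = st.recursive(
    st.builds(Leaf, content=st.text(max_size=10)),
    lambda kids: st.builds(
        Node, children=st.lists(kids, min_size=1, max_size=3).map(tuple)
    ),
    max_leaves=20,
)


def leaves(tree) -> list:
    """Collect leaf contents left to right."""
    if isinstance(tree, Leaf):
        return [tree.content]
    return [c for child in tree.children for c in leaves(child)]


def map_tree(f, tree):
    """Apply f to every leaf while preserving the tree shape."""
    if isinstance(tree, Leaf):
        return Leaf(f(tree.content))
    return Node(tuple(map_tree(f, child) for child in tree.children))


@settings(max_examples=100, deadline=None)
@given(tree=trees)
def test_map_commutes_with_flatten(tree):
    # Mapping then flattening equals flattening then mapping.
    assert leaves(map_tree(str.upper, tree)) == [c.upper() for c in leaves(tree)]
```

An explicit depth bound, as in the composite strategy above, is still the better fit when the property under test is specifically about recursion depth.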

4.5 Stateful Testing for Pipelines

For boundary state (caches/retries), use RuleBasedStateMachine.

from hypothesis.stateful import RuleBasedStateMachine, rule, invariant
from hypothesis import settings

class RagPipelineMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.storage = fake_storage()
        self.docs = []

    @rule(doc=raw_doc_strategy())
    def add_doc(self, doc):
        self.docs.append(doc)
        fp_rag(self.docs, storage=self.storage)  # Run pipeline

    @invariant()
    def no_duplicate_chunks(self):
        chunks = self.storage.read_chunks()
        ids = [(c.doc_id, c.start, c.end) for c in chunks]
        assert len(ids) == len(set(ids))  # Example invariant: unique chunks

# Settings attach to the generated TestCase; @settings cannot decorate a TestCase subclass.
RagPipelineMachine.TestCase.settings = settings(max_examples=100, stateful_step_count=50)
TestRagPipeline = RagPipelineMachine.TestCase

RAG Integration

Add PBT to tests/test_rag.py; CI runs suite with stats.


5. Property-Based Proofs (tests/test_module_10_core4.py)

Use events/targets for diversity.

from hypothesis import event, target

@given(docs=doc_list_strategy)
def test_rag_diversity(docs):
    event(f"n_docs={len(docs)}")
    target(len(docs))
    # ... property ...

6. Runtime Preservation Guarantee

PBT runs are bounded by max_examples and deadline; CI job timeouts prevent hangs; unit and e2e properties use separate profiles.


7. Anti-Patterns & Immediate Fixes

| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Only examples | Miss edges | Add PBT |
| Unbounded gen | Slow CI | Bound sizes/deadline |
| No shrinking | Hard repro | Use Hypothesis defaults |
| Ignore laws | Regressions | Derive from invariants |

8. Pre-Core Quiz

  1. PBT for…? → Laws/regressions
  2. Hypothesis generates…? → Diverse inputs
  3. Shrinking for…? → Minimal failures
  4. CI profile…? → More examples
  5. Strategies for…? → Domain data

9. Post-Core Exercise

  1. Write PBT for RAG idempotence.
  2. Add to CI; run.
  3. Fix a found bug.

Pipeline Usage (Idiomatic)

@given(docs=doc_list_strategy)
def test_rag_invariant(docs):
    ...  # property body elided

Next: core 5. Property-Based Testing for Async and Streaming Pipelines (Hypothesis Strategies and Faked I/O)