Module 2: First-Class Functions and Expressive Python

Progression Note

By the end of Module 2, you will be able to use first-class functions for configurability, write expression-oriented code, and add debugging taps. This prepares you for lazy iteration in Module 3. See the series progression map in the repo root for full details.

Here's a snippet from the progression map:

| Module | Focus | Key Outcomes |
| --- | --- | --- |
| 1: Foundational FP Concepts | Purity, contracts, refactoring | Spot impurities, write pure functions, prove equivalence with Hypothesis |
| 2: First-Class Functions & Expressive Python | Closures, partials, composable configurators | Configure pure pipelines without globals |
| 3: Lazy Iteration & Generators | Streaming/lazy pipelines | Efficient data processing without materializing everything |

M02C08 – Tiny Data-Driven DSLs (Using Frozen Data to Express Domain Rules)

Core question:
How do you replace sprawling if-else chains and hard-coded domain logic with tiny, composable data-driven DSLs—so rules become printable, testable, evolvable, and flow through M02C07 pipelines without scattering behaviour across the codebase?

This core introduces tiny data-driven DSLs in Python:
- Represent rules as frozen data (dataclasses with paths, operators, values).
- Use pure interpreters to evaluate rule data into behaviour.
- Compose via combinators like All/AnyOf/Not.
- Build on M02C06 config-as-data for rules as values, M02C07 combinators for orchestration.

We extend the running project from m02-rag.md—the FuncPipe RAG Builder—evolving from hard-coded rules to data-driven DSLs that preserve baseline equivalence for the chunk sequence.

Audience: Developers from M02C07 with combinator pipelines but still embedding domain logic in if-else or scattered predicates.
Outcome:
1. Identify rule smells (if-else sprawl, mutable flags) and explain their impact on evolvability.
2. Refactor domain logic to frozen rule data + pure interpreter.
3. Write Hypothesis properties proving DSL equivalence, with a shrinking example.

1. Conceptual Foundation

1.1 Tiny Data-Driven DSLs in One Precise Sentence

Tiny data-driven DSLs represent domain rules as immutable data (frozen dataclasses with paths and operators) evaluated by pure interpreters—ensuring rules are composable, testable, and flow like config through M02C07 pipelines.

1.2 The One-Sentence Rule

Represent domain rules as frozen data with paths and operators, evaluated by pure interpreters. Keep domain if-else chains and mutable flags out of the core, and pass rules like config.

1.3 Why This Matters Now

M02C07 gave combinators for pipelines, but hard-coded rules limit evolvability. Data-driven DSLs make rules data, enabling full M02C01–M02C07 power with printable, testable domain logic.

1.4 DSLs as Values in 5 Lines

Treating rules as first-class values lets you select and combine them dynamically:

from functools import partial

from funcpipe_rag import All, LenGt, Pred, StartsWith, eval_pred

rules: dict[str, Pred] = {
    "cs": StartsWith("categories", "cs."),
    "long": LenGt("abstract", 500),
}

keep_pred = All((rules["cs"], rules["long"]))
keep_fn = partial(eval_pred, pred=keep_pred)  # RawDoc -> bool

Because rules are plain data evaluated by a pure function, you can store them in dicts, compose them with M02C01 partial, and test them as values. The result is explicit and evolvable.
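Since rules are values in a dict, a list of rule names (for example from a config file) can select which rules apply. The following is a minimal, self-contained sketch of that idea. The classes and eval_pred are simplified stand-ins whose field names are assumptions, not the funcpipe_rag definitions, and build_keep is a hypothetical helper.

```python
from dataclasses import dataclass
from functools import partial


# Simplified stand-ins for the funcpipe_rag types; field names are assumptions.
@dataclass(frozen=True)
class RawDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str


@dataclass(frozen=True)
class StartsWith:
    path: str
    prefix: str


@dataclass(frozen=True)
class LenGt:
    path: str
    n: int


@dataclass(frozen=True)
class All:
    preds: tuple


def eval_pred(doc, pred):
    """Stand-in interpreter covering only the three node types above."""
    if isinstance(pred, StartsWith):
        return getattr(doc, pred.path).startswith(pred.prefix)
    if isinstance(pred, LenGt):
        return len(getattr(doc, pred.path)) > pred.n
    if isinstance(pred, All):
        return all(eval_pred(doc, p) for p in pred.preds)
    raise TypeError(f"unknown predicate: {pred!r}")


RULES = {
    "cs": StartsWith("categories", "cs."),
    "long": LenGt("abstract", 500),
}


def build_keep(names, registry=RULES):
    """Hypothetical helper: turn a sequence of rule names into RawDoc -> bool."""
    pred = All(tuple(registry[name] for name in names))
    return partial(eval_pred, pred=pred)


keep_cs_only = build_keep(["cs"])
keep_cs_long = build_keep(["cs", "long"])

short_cs_doc = RawDoc("id", "title", "short", "cs.AI")
print(keep_cs_only(short_cs_doc))  # True: only the category rule applies
print(keep_cs_long(short_cs_doc))  # False: the abstract is not longer than 500
```

Changing which rules apply is then a change to the list of names, not to any function body.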

2. Mental Model: If-Else Sprawl vs Data-Driven DSLs

2.1 One Picture

If-Else Sprawl (Messy)                     Data-Driven DSLs (Clean)
+------------------------------------+     +-------------------------------------------+
| if d.categories.startswith("cs."): |     | cs_rule = StartsWith("categories", "cs.") |
|     if len(d.abstract) > 500:      |     | long_rule = LenGt("abstract", 500)        |
|         return True                |     | rule = All((cs_rule, long_rule))          |
| ...                                |     | eval_pred(d, rule)                        |
+------------------------------------+     +-------------------------------------------+
   ↑ Hardcoded, Rigid                         ↑ Data, Composable

2.2 Contract Table

| Aspect | If-Else Sprawl | Data-Driven DSLs |
| --- | --- | --- |
| Evolvability | Code changes | Data changes |
| Testability | Mock contexts | Generate rules |
| Readability | Nested branches | Linear data |
| Composability | Manual nesting | All/AnyOf/Not |
| Auditing | Trace execution | Print rule/decision |
| Mutable defaults in partials | Breaks determinism | Use frozen dataclasses or immutable types for configs |

Note on If-Else Choice: Use if-else only for trivial logic; always prefer DSLs for domain rules.
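The "Print rule/decision" row can be made concrete: because a rule is a tree of data, a small function can walk it and report each sub-decision. The sketch below is one hypothetical way to do that. The explain helper is not part of funcpipe_rag, and the classes and eval_pred are simplified stand-ins with assumed field names.

```python
from dataclasses import dataclass


# Simplified stand-ins for the funcpipe_rag types; field names are assumptions.
@dataclass(frozen=True)
class RawDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str


@dataclass(frozen=True)
class StartsWith:
    path: str
    prefix: str


@dataclass(frozen=True)
class LenGt:
    path: str
    n: int


@dataclass(frozen=True)
class All:
    preds: tuple


def eval_pred(doc, pred):
    """Stand-in interpreter covering only the three node types above."""
    if isinstance(pred, StartsWith):
        return getattr(doc, pred.path).startswith(pred.prefix)
    if isinstance(pred, LenGt):
        return len(getattr(doc, pred.path)) > pred.n
    if isinstance(pred, All):
        return all(eval_pred(doc, p) for p in pred.preds)
    raise TypeError(f"unknown predicate: {pred!r}")


def explain(doc, pred, depth=0):
    """Hypothetical audit helper: one line per rule node with its decision."""
    pad = "  " * depth
    result = eval_pred(doc, pred)
    if isinstance(pred, All):
        lines = [f"{pad}All -> {result}"]
        for child in pred.preds:
            lines.extend(explain(doc, child, depth + 1))
        return lines
    # Leaf rules print themselves via the dataclass-generated repr.
    return [f"{pad}{pred!r} -> {result}"]


rule = All((StartsWith("categories", "cs."), LenGt("abstract", 500)))
doc = RawDoc("id", "title", "short", "cs.AI")
lines = explain(doc, rule)
print("\n".join(lines))
# All -> False
#   StartsWith(path='categories', prefix='cs.') -> True
#   LenGt(path='abstract', n=500) -> False
```

With hard-coded branches, the same audit would need a debugger or added logging inside the function body.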

3. Running Project: FuncPipe RAG Builder

We extend the FuncPipe RAG Builder from m02-rag.md:
- Dataset: 10k arXiv CS abstracts (arxiv_cs_abstracts_10k.csv).
- Goal: Turn hard-coded rules into data-driven DSL.
- Start: Hard-coded version (core8_start.py).
- End: DSL rules as data, preserving equivalence.

3.1 Types (Canonical, Used Throughout)

Use the project’s DSL types from src/funcpipe_rag/core/rules_pred.py (re-exported from funcpipe_rag):

from funcpipe_rag import All, CleanConfig, LenGt, RagConfig, RagEnv, RulesConfig, StartsWith

CS_RULE = StartsWith("categories", "cs.")
LONG_RULE = LenGt("abstract", 500)
KEEP_PRED = All((CS_RULE, LONG_RULE))
CS_LONG_RULES = RulesConfig(keep_pred=KEEP_PRED)

config = RagConfig(env=RagEnv(512), keep=CS_LONG_RULES, clean=CleanConfig())

Note: DEFAULT_RULES is RulesConfig(keep_pred=All(())) (no conditions ⇒ keep everything). Pass an explicit rules config like CS_LONG_RULES to actually filter.
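Why an empty All keeps everything follows from vacuous truth: a conjunction over zero conditions has nothing that can fail. The sketch below illustrates this under the assumption that the interpreter evaluates All with Python's built-in all(). The classes and eval_pred are simplified stand-ins, and the behaviour shown for an empty AnyOf is a property of this sketch only, not a claim about funcpipe_rag.

```python
from dataclasses import dataclass


# Simplified stand-ins; the real funcpipe_rag definitions may differ.
@dataclass(frozen=True)
class All:
    preds: tuple


@dataclass(frozen=True)
class AnyOf:
    preds: tuple


def eval_pred(doc, pred):
    """Stand-in interpreter for the two combinators above."""
    if isinstance(pred, All):
        # all() over an empty iterable is True: no condition can fail.
        return all(eval_pred(doc, p) for p in pred.preds)
    if isinstance(pred, AnyOf):
        # any() over an empty iterable is False: no condition can succeed.
        return any(eval_pred(doc, p) for p in pred.preds)
    raise TypeError(f"unknown predicate: {pred!r}")


keeps_everything = eval_pred("any document", All(()))
keeps_nothing = eval_pred("any document", AnyOf(()))
print(keeps_everything, keeps_nothing)  # True False
```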

3.2 Hard-Coded Start (Anti-Pattern)

from funcpipe_rag import RawDoc


def hard_keep(d: RawDoc) -> bool:
    # Hard-coded path ("categories") and values ("cs.", 500)
    return d.categories.startswith("cs.") and len(d.abstract) > 500

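Smells:
- Hard-coded path and value (the categories field and the "cs." prefix).
- Logic baked into code: every new rule grows the boolean expression or adds another branch.
- Magic number (500).
Problem: The rule cannot be printed, generated, or swapped without editing code, so it is hard to evolve and test, and such logic tends to scatter.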

4. Refactor to DSL: Data-Driven Rules + Interpreter

4.1 DSL Data (Frozen, Composable)

Rule data in config (as defined in §3.1: CS_RULE, LONG_RULE, KEEP_PRED, CS_LONG_RULES).

Properties:
- Frozen: Immutable.
- Composable: All/AnyOf/Not.
- In config: Flows like data.
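What "frozen" buys in practice can be checked with the standard library alone. The sketch below uses a stand-in LenGt whose field names are assumptions. It shows that assignment is rejected, that equal rule data compares and hashes equal, and that a variant is built with dataclasses.replace instead of mutation.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


# Stand-in rule type; the field names are assumptions for illustration.
@dataclass(frozen=True)
class LenGt:
    path: str
    n: int


rule = LenGt("abstract", 500)

# Mutation is rejected, so a rule cannot change after construction.
try:
    rule.n = 100
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# A variant is a new value; the original is untouched.
stricter = replace(rule, n=1000)

# Equal data means equal rules, and frozen dataclasses are hashable,
# so rules work as dict keys and set members.
unique_rules = {rule, LenGt("abstract", 500), stricter}

print(mutation_blocked)   # True
print(rule.n, stricter.n)  # 500 1000
print(len(unique_rules))   # 2
```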

4.1.1 Before-and-After Refactoring Snippet

To cement the transition from hard-coded logic to the DSL, here is a mini-example showing the "ugly before" (the hard-coded predicate from the anti-pattern code) and the "clean after" using DSL data plus the interpreter:

from functools import partial

from funcpipe_rag import All, LenGt, RawDoc, StartsWith, eval_pred


# Before: hard-coded predicate with the path, prefix, and threshold baked into code
def hard_keep(d: RawDoc) -> bool:
    return d.categories.startswith("cs.") and len(d.abstract) > 500


# After: Data-driven DSL + pure interpreter (`eval_pred`)
KEEP_PRED = All((StartsWith("categories", "cs."), LenGt("abstract", 500)))
dsl_keep = partial(eval_pred, pred=KEEP_PRED)  # RawDoc -> bool


assert dsl_keep(RawDoc("id", "title", "x" * 501, "cs.AI")) is True
assert dsl_keep(RawDoc("id", "title", "short", "cs.AI")) is False

This refactor eliminates hard-coded logic, making the rules data that is easy to test, evolve, and compose—same inputs always yield the same outputs.

4.2 Pure Interpreter (Evaluates Data)

The project’s pure interpreter is funcpipe_rag.eval_pred (implemented in src/funcpipe_rag/core/rules_pred.py). It only supports the known RawDoc paths (doc_id, title, abstract, categories).

from funcpipe_rag import eval_pred

Properties:
- Pure: Deterministic, no effects.
- Tied to data: Evaluates rule structures.
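To make the interpreter idea concrete, here is a minimal sketch of what a function shaped like eval_pred could look like. It is not the funcpipe_rag implementation: the rule field names, the _field helper, and the choice to raise ValueError on an unknown path are all assumptions made for illustration.

```python
from dataclasses import dataclass


# Stand-in types; the field names are assumptions, not the project's definitions.
@dataclass(frozen=True)
class RawDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str


@dataclass(frozen=True)
class StartsWith:
    path: str
    prefix: str


@dataclass(frozen=True)
class LenGt:
    path: str
    n: int


@dataclass(frozen=True)
class All:
    preds: tuple


@dataclass(frozen=True)
class AnyOf:
    preds: tuple


@dataclass(frozen=True)
class Not:
    pred: object


_PATHS = ("doc_id", "title", "abstract", "categories")


def _field(doc, path):
    """Hypothetical helper: look up one of the known RawDoc paths."""
    if path not in _PATHS:
        raise ValueError(f"unsupported path: {path!r}")
    return getattr(doc, path)


def eval_pred(doc, pred):
    """Pure interpreter: the result depends only on doc and pred."""
    if isinstance(pred, StartsWith):
        return _field(doc, pred.path).startswith(pred.prefix)
    if isinstance(pred, LenGt):
        return len(_field(doc, pred.path)) > pred.n
    if isinstance(pred, All):
        return all(eval_pred(doc, p) for p in pred.preds)
    if isinstance(pred, AnyOf):
        return any(eval_pred(doc, p) for p in pred.preds)
    if isinstance(pred, Not):
        return not eval_pred(doc, pred.pred)
    raise TypeError(f"unknown predicate: {pred!r}")


KEEP_PRED = All((StartsWith("categories", "cs."), LenGt("abstract", 500)))
print(eval_pred(RawDoc("id", "title", "x" * 501, "cs.AI"), KEEP_PRED))  # True
print(eval_pred(RawDoc("id", "title", "short", "cs.AI"), KEEP_PRED))    # False
```

The branching has not disappeared; it now lives in one place, keyed on the shape of the rule data rather than on domain values, so adding a domain rule means adding data, not another branch.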

4.3 Refactored Core (Uses DSL)

Updated core with DSL (building on M02C07 combinators as implemented in src/funcpipe_rag/fp.py):

from funcpipe_rag import (
    All,
    LenGt,
    RagConfig,
    RagEnv,
    RulesConfig,
    StartsWith,
    eval_pred,
    ffilter,
    flatmap,
    flow,
    fmap,
    gen_chunk_doc,
    get_deps,
    structural_dedup_chunks,
)

config = RagConfig(
    env=RagEnv(512),
    keep=RulesConfig(keep_pred=All((StartsWith("categories", "cs."), LenGt("abstract", 500)))),
)
deps = get_deps(config)

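def keep_rule(d):
    # Evaluate the rule data carried in config against one RawDoc.
    return eval_pred(d, config.keep.keep_pred)


pipeline = flow(
    lambda: docs,  # docs: an iterable of RawDoc, assumed to be defined by the caller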
    ffilter(keep_rule),
    fmap(deps.cleaner),
    flatmap(lambda cd: gen_chunk_doc(cd, config.env)),
    fmap(deps.embedder),
)

chunks = structural_dedup_chunks(pipeline())

Properties:
- Data-driven: Rules as data.
- Composable: Via M02C07 combinators.

4.4 Public API (Unchanged from M02C05–M02C07)

from funcpipe_rag import full_rag_api_docs, full_rag_api_path, get_deps

# docs and config are as in §4.3; boundary_deps is a RagBoundaryDeps (see §6.5 for a faked one).
chunks, obs = full_rag_api_docs(docs, config, get_deps(config))
res = full_rag_api_path("arxiv_cs_abstracts_10k.csv", config, boundary_deps)

Properties:
- Keeps Result; boundaries unchanged.

4.5 Configurator Tie-In (M02C01)

from funcpipe_rag import make_rag_fn

rag_fn = make_rag_fn(chunk_size=512, keep=CS_LONG_RULES)  # docs -> (chunks, obs); CS_LONG_RULES is defined in §3.1

Wins: DSL rules compose with M02C01 partial application to build pipeline variants. Note: RagConfig.keep defaults to DEFAULT_RULES (keep everything).

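5. Equational Reasoning: Substitution Exercise

Hand Exercise: Substitute in eval_pred.
1. Inline KEEP_PRED = All((CS_RULE, LONG_RULE)). The rule is fixed, immutable data.
2. Substitute it into eval_pred(d, KEEP_PRED). What remains is a bool-valued function of d alone.
3. Result: For fixed rule data, behaviour is fixed, because the data cannot change between calls.
Bug Hunt: In the hard-coded version, the rule is baked into the function body, so there is nothing to substitute.

Example:
- Hard-coded: d.categories.startswith("cs.") inside hard_keep is rigid; swapping the rule means editing code.
- DSL: eval_pred(d, KEEP_PRED) is data-driven; a test can substitute a fake rule without touching the interpreter.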
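The substitution steps can be written out and checked mechanically. The sketch below spells each rewriting step as its own function and confirms they agree on a few sample documents. It assumes an interpreter in which All means logical "and" over its children; the classes and eval_pred are simplified stand-ins with assumed field names, not the funcpipe_rag code.

```python
from dataclasses import dataclass


# Simplified stand-ins for the funcpipe_rag types; field names are assumptions.
@dataclass(frozen=True)
class RawDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str


@dataclass(frozen=True)
class StartsWith:
    path: str
    prefix: str


@dataclass(frozen=True)
class LenGt:
    path: str
    n: int


@dataclass(frozen=True)
class All:
    preds: tuple


def eval_pred(doc, pred):
    """Stand-in interpreter covering only the three node types above."""
    if isinstance(pred, StartsWith):
        return getattr(doc, pred.path).startswith(pred.prefix)
    if isinstance(pred, LenGt):
        return len(getattr(doc, pred.path)) > pred.n
    if isinstance(pred, All):
        return all(eval_pred(doc, p) for p in pred.preds)
    raise TypeError(f"unknown predicate: {pred!r}")


CS_RULE = StartsWith("categories", "cs.")
LONG_RULE = LenGt("abstract", 500)
KEEP_PRED = All((CS_RULE, LONG_RULE))


def step0(d):  # the original expression
    return eval_pred(d, KEEP_PRED)


def step1(d):  # inline the constant KEEP_PRED
    return eval_pred(d, All((CS_RULE, LONG_RULE)))


def step2(d):  # apply the All case of the interpreter
    return eval_pred(d, CS_RULE) and eval_pred(d, LONG_RULE)


def step3(d):  # apply the leaf cases: this is exactly hard_keep
    return d.categories.startswith("cs.") and len(d.abstract) > 500


SAMPLES = (
    RawDoc("1", "t", "x" * 501, "cs.AI"),
    RawDoc("2", "t", "short", "cs.AI"),
    RawDoc("3", "t", "x" * 501, "math.CO"),
    RawDoc("4", "t", "", ""),
)
steps_agree = all(step0(d) == step1(d) == step2(d) == step3(d) for d in SAMPLES)
print(steps_agree)  # True
```

Each step is valid only because KEEP_PRED is immutable; if the rule could change between calls, step0 and step1 would not be interchangeable.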

6. Property-Based Testing: Proving DSL Behaviour

Use Hypothesis to check that the refactor to data-driven rules preserves behaviour.

6.1 Custom Strategy

The strategies come from tests/conftest.py (for example, doc_list_strategy for lists of documents). Add a raw_doc_strategy there if you need single documents.

6.2 DSL Equivalence Property

# tests/test_rag_api.py (DSL equivalence)
from hypothesis import given

from funcpipe_rag import All, LenGt, RawDoc, StartsWith, eval_pred
from tests.conftest import doc_list_strategy

KEEP_PRED = All((StartsWith("categories", "cs."), LenGt("abstract", 500)))


def hard_keep(d: RawDoc) -> bool:
    return d.categories.startswith("cs.") and len(d.abstract) > 500


@given(docs=doc_list_strategy())
def test_dsl_matches_hard_keep(docs):
    dsl_kept = [d for d in docs if eval_pred(d, KEEP_PRED)]
    hard_kept = [d for d in docs if hard_keep(d)]
    assert dsl_kept == hard_kept

Note: Checks that the DSL predicate keeps exactly the documents that the hard-coded hard_keep keeps, in the same order.

6.3 DSL Rule Equality Property

# Continues the test module from §6.2 (reuses given, doc_list_strategy, KEEP_PRED, eval_pred).
from dataclasses import replace


@given(docs=doc_list_strategy())
def test_equal_rules_equal_behaviour(docs):
    rules1 = KEEP_PRED
    rules2 = replace(rules1)
    out1 = [d for d in docs if eval_pred(d, rules1)]
    out2 = [d for d in docs if eval_pred(d, rules2)]
    assert out1 == out2

Note: replace(rules1) builds an equal but distinct copy, so this checks that equal rule data yields equal behaviour.

6.4 DSL Algebraic Property

from hypothesis import given
import hypothesis.strategies as st

from funcpipe_rag import All, AnyOf, LenGt, Not, Pred, RawDoc, StartsWith, eval_pred
from tests.conftest import raw_doc_strategy

pred_strategy = st.recursive(
    st.one_of(
        st.builds(StartsWith, st.just("categories"), st.text(max_size=10)),
        st.builds(LenGt, st.just("abstract"), st.integers(min_value=0, max_value=1000)),
    ),
    lambda child: st.one_of(
        st.builds(All, st.tuples(child, child)),
        st.builds(AnyOf, st.tuples(child, child)),
        st.builds(Not, child),
    ),
    max_leaves=20,
)


@given(pred=pred_strategy, doc=raw_doc_strategy())
def test_dsl_double_negation(pred: Pred, doc: RawDoc):
    assert eval_pred(doc, pred) == eval_pred(doc, Not(Not(pred)))

Note: Checks an algebraic law of the DSL (double negation) against recursively generated predicates and generated documents.
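Other algebraic laws can be checked the same way. As a dependency-free illustration, the sketch below checks De Morgan's law, Not(All((a, b))) versus AnyOf((Not(a), Not(b))), by exhaustive enumeration over a small fixed set of documents and leaf rules. This is a standard-library substitute for a Hypothesis property, not a replacement for it, and the classes and eval_pred are simplified stand-ins with assumed field names.

```python
from dataclasses import dataclass
from itertools import product


# Simplified stand-ins for the funcpipe_rag types; field names are assumptions.
@dataclass(frozen=True)
class RawDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str


@dataclass(frozen=True)
class StartsWith:
    path: str
    prefix: str


@dataclass(frozen=True)
class LenGt:
    path: str
    n: int


@dataclass(frozen=True)
class All:
    preds: tuple


@dataclass(frozen=True)
class AnyOf:
    preds: tuple


@dataclass(frozen=True)
class Not:
    pred: object


def eval_pred(doc, pred):
    """Stand-in interpreter for the five node types above."""
    if isinstance(pred, StartsWith):
        return getattr(doc, pred.path).startswith(pred.prefix)
    if isinstance(pred, LenGt):
        return len(getattr(doc, pred.path)) > pred.n
    if isinstance(pred, All):
        return all(eval_pred(doc, p) for p in pred.preds)
    if isinstance(pred, AnyOf):
        return any(eval_pred(doc, p) for p in pred.preds)
    if isinstance(pred, Not):
        return not eval_pred(doc, pred.pred)
    raise TypeError(f"unknown predicate: {pred!r}")


LEAVES = (
    StartsWith("categories", "cs."),
    StartsWith("categories", "math."),
    LenGt("abstract", 0),
    LenGt("abstract", 500),
)
DOCS = tuple(
    RawDoc("id", "t", abstract, categories)
    for abstract, categories in product(("", "x", "x" * 501), ("cs.AI", "math.CO", ""))
)


def de_morgan_holds(doc, a, b):
    lhs = eval_pred(doc, Not(All((a, b))))
    rhs = eval_pred(doc, AnyOf((Not(a), Not(b))))
    return lhs == rhs


checked = [de_morgan_holds(d, a, b) for d, a, b in product(DOCS, LEAVES, LEAVES)]
print(all(checked), len(checked))  # True 144
```

Exhaustive enumeration only covers the cases listed; a Hypothesis strategy like pred_strategy explores nested rules and arbitrary documents that a fixed table would miss.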

6.5 Idempotence Property (DSL-Driven)

from hypothesis import given
import hypothesis.strategies as st

from funcpipe_rag import (
    All,
    LenGt,
    Ok,
    RagBoundaryDeps,
    RagConfig,
    RagEnv,
    RulesConfig,
    StartsWith,
    full_rag_api_path,
    get_deps,
)


class FakeReader:
    """In-memory reader that returns fixed docs instead of touching disk."""

    def __init__(self, docs):
        self._docs = docs

    def read_docs(self, path):
        _ = path  # The path is ignored; the docs were injected at construction.
        return Ok(self._docs)


@given(chunk_size=st.integers(128, 1024))
def test_rag_idempotence(chunk_size):
    keep = RulesConfig(keep_pred=All((StartsWith("categories", "cs."), LenGt("abstract", 500))))
    config = RagConfig(env=RagEnv(chunk_size), keep=keep)
    deps = RagBoundaryDeps(core=get_deps(config), reader=FakeReader([]))
    # Same path, config, and deps twice: the results must be equal.
    res1 = full_rag_api_path("fake_path", config, deps)
    res2 = full_rag_api_path("fake_path", config, deps)
    assert res1 == res2

Note: Ensures no hidden state with immutable DSL rules and faked deps (see tests/test_rag_api.py for a minimal FakeReader pattern).

6.6 Full Pipeline Equivalence Property

# tests/test_rag_api.py (baseline equivalence)
from hypothesis import given

from funcpipe_rag import (
    DEFAULT_RULES,
    RagConfig,
    clean_doc,
    embed_chunk,
    full_rag_api_docs,
    gen_chunk_doc,
    get_deps,
    structural_dedup_chunks,
)
from tests.conftest import doc_list_strategy, env_strategy


def _baseline_chunks(docs, env):
    cleaned = [clean_doc(d) for d in docs]
    embedded = [embed_chunk(c) for cd in cleaned for c in gen_chunk_doc(cd, env)]
    return structural_dedup_chunks(embedded)


@given(docs=doc_list_strategy(), env=env_strategy())
def test_full_rag_api_docs_matches_baseline(docs, env):
    config = RagConfig(env=env, keep=DEFAULT_RULES)
    deps = get_deps(config)
    chunks, obs = full_rag_api_docs(docs, config, deps)
    assert chunks == _baseline_chunks(docs, env)
    assert obs.total_docs == len(docs)

Note: Tests the full API matches a baseline built from the pure stages (with DEFAULT_RULES ⇒ keep everything).

6.7 Shrinking Demo: Catching a Leaky Bug

A bad wrapper around the interpreter that mutates a global predicate:

from funcpipe_rag import All, LenGt, Not, StartsWith, eval_pred

KEEP_PRED = All((StartsWith("categories", "cs."), LenGt("abstract", 500)))
MUTABLE_PRED = KEEP_PRED


def bad_keep(doc) -> bool:
    global MUTABLE_PRED
    MUTABLE_PRED = Not(MUTABLE_PRED)  # Leaky mutation
    return eval_pred(doc, MUTABLE_PRED)

Property:

from hypothesis import given

from tests.conftest import raw_doc_strategy


@given(doc=raw_doc_strategy())
def test_bad_dsl_is_not_idempotent(doc):
    global MUTABLE_PRED
    MUTABLE_PRED = KEEP_PRED
    out1 = bad_keep(doc)
    out2 = bad_keep(doc)
    assert out1 == out2

Failure Trace (Example):

Falsifying example: test_bad_dsl_is_not_idempotent(
    doc=RawDoc(doc_id='1', title='t', abstract='...', categories='cs.AI'),
)
AssertionError

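Analysis: bad_keep negates the global predicate on every call, so the first call evaluates Not(KEEP_PRED) and the second evaluates Not(Not(KEEP_PRED)). The two results differ for every document, so Hypothesis fails immediately and shrinks to a simple RawDoc. The property catches the mutation bug.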
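The fix is to stop reading the rule from mutable global state and pass it as a parameter. The sketch below contrasts the leaky wrapper with a pure one. The good_keep function is a hypothetical name, and the classes and eval_pred are simplified stand-ins with assumed field names, not the funcpipe_rag code.

```python
from dataclasses import dataclass


# Simplified stand-ins for the funcpipe_rag types; field names are assumptions.
@dataclass(frozen=True)
class RawDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str


@dataclass(frozen=True)
class StartsWith:
    path: str
    prefix: str


@dataclass(frozen=True)
class LenGt:
    path: str
    n: int


@dataclass(frozen=True)
class All:
    preds: tuple


@dataclass(frozen=True)
class Not:
    pred: object


def eval_pred(doc, pred):
    """Stand-in interpreter for the four node types above."""
    if isinstance(pred, StartsWith):
        return getattr(doc, pred.path).startswith(pred.prefix)
    if isinstance(pred, LenGt):
        return len(getattr(doc, pred.path)) > pred.n
    if isinstance(pred, All):
        return all(eval_pred(doc, p) for p in pred.preds)
    if isinstance(pred, Not):
        return not eval_pred(doc, pred.pred)
    raise TypeError(f"unknown predicate: {pred!r}")


KEEP_PRED = All((StartsWith("categories", "cs."), LenGt("abstract", 500)))
MUTABLE_PRED = KEEP_PRED


def bad_keep(doc):
    """Leaky: rebinds a global on every call, so results depend on call history."""
    global MUTABLE_PRED
    MUTABLE_PRED = Not(MUTABLE_PRED)
    return eval_pred(doc, MUTABLE_PRED)


def good_keep(doc, pred):
    """Pure: the rule arrives as an argument and nothing outside is touched."""
    return eval_pred(doc, pred)


doc = RawDoc("1", "t", "x" * 501, "cs.AI")
bad_results = (bad_keep(doc), bad_keep(doc))
good_results = (good_keep(doc, KEEP_PRED), good_keep(doc, KEEP_PRED))
print(bad_results)   # (False, True)
print(good_results)  # (True, True)
```

With good_keep, the idempotence property from the test above passes, because the frozen rule passed in is the only thing the result depends on besides the document.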

7. When DSLs Aren't Worth It

Use if-else only in:
- Trivial one-rule logic.
- Legacy code wrapping DSLs.
Guardrails: Isolate to <5 lines; prefer DSLs for domain rules.

Example:

# Trivial
if x > 0: print(x)  # OK for one-off

8. Pre-Core Quiz

If-else chain? → Hard-coded logic.
Mutable rule? → frozen=True.
Magic path? → LenGt("path", value).
Global rule? → Pass as param.
Prove rules? → Hypothesis with st.recursive strategies.

9. Post-Core Reflection & Exercise

Reflect: Find if-else domain logic. Refactor to frozen rule data + interpreter; add Hypothesis for equivalence/idempotence.
Project Exercise: Apply to RAG (e.g., keep as DSL); run properties.
- Did data reduce branches?
- Did interpreter enable tests?
- Did composability clarify logic?

Next: Core 9 – Debugging FP Code.

Verify all patterns with Hypothesis—examples provided show how to detect impurities like globals or non-determinism.

Further Reading: For more on closures in Python, see 'Fluent Python' by Luciano Ramalho. Explore toolz for advanced partials once comfortable.