
Module 2: First-Class Functions and Expressive Python

Progression Note

By the end of Module 2, you'll master first-class functions for configurability, expression-oriented code, and debugging taps. This prepares you for lazy iteration in Module 3. See the series progression map in the repo root for full details.

Here's a snippet from the progression map:

| Module | Focus | Key Outcomes |
| --- | --- | --- |
| 1: Foundational FP Concepts | Purity, contracts, refactoring | Spot impurities, write pure functions, prove equivalence with Hypothesis |
| 2: First-Class Functions & Expressive Python | Closures, partials, composable configurators | Configure pure pipelines without globals |
| 3: Lazy Iteration & Generators | Streaming/lazy pipelines | Efficient data processing without materializing everything |

M02C06 – Configuration-as-Data (dataclasses/dicts + Partial Application for Behaviour)

Core question:
How do you turn raw settings from env, files, or CLI into immutable, validated data (frozen dataclasses or dicts) that drive behaviour via partial and closures—so pipelines from M02C01–M02C05 are deterministic and testable?

This core introduces configuration-as-data in Python:
- Parse raw sources (env, files, CLI) into immutable data at M02C05 boundaries.
- Use frozen dataclasses for self-documenting config, bound via M02C01 partial/closures.
- Validate at edges, ensuring core sees only typed, complete data.

We extend the running project from m02-rag.md—the FuncPipe RAG Builder—evolving from leaky globals/env to validated immutable data that preserves Module 1 equivalence.

Audience: Developers from M02C05 with sealed boundaries but still using globals, env leaks, or mutable config that break determinism.
Outcome:
1. Identify config smells (globals, env leaks) and explain their impact on testing.
2. Refactor raw sources to validated immutable data + binding.
3. Write Hypothesis properties proving config-driven behaviour, with a shrinking example.


1. Conceptual Foundation

1.1 Configuration-as-Data in One Precise Sentence

Configuration-as-data parses raw sources into immutable, validated values (frozen dataclasses or dicts) at boundaries, bound via partial or closures—ensuring behaviour is explicit, deterministic, and composable.

1.2 The One-Sentence Rule

Parse raw config (env, files, CLI) into frozen data at M02C05 boundaries; bind via partial/closures and pass explicitly—never use globals or env in core.
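
As a minimal sketch of this rule, the boundary below reads a raw mapping once, produces a frozen value, and the core receives that value explicitly. The names `ToyPipelineConfig`, `parse_env`, and `chunk_text` are hypothetical and not part of the FuncPipe project; the parser raises for brevity, whereas the project's boundary returns a Result.

```python
from collections.abc import Callable, Mapping
from dataclasses import dataclass
from functools import partial


@dataclass(frozen=True)
class ToyPipelineConfig:
    """Hypothetical config; the field name is illustrative only."""
    chunk_size: int = 512


def parse_env(environ: Mapping[str, str]) -> ToyPipelineConfig:
    """Boundary: the only place that touches raw, untyped settings."""
    raw = environ.get("CHUNK_SIZE", "512")
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"CHUNK_SIZE must be an integer, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"CHUNK_SIZE must be positive, got {value}")
    return ToyPipelineConfig(chunk_size=value)


def chunk_text(text: str, *, cfg: ToyPipelineConfig) -> list[str]:
    """Core: pure, depends only on its arguments."""
    return [text[i:i + cfg.chunk_size] for i in range(0, len(text), cfg.chunk_size)]


# A plain dict stands in for os.environ here; only the caller at the edge
# would pass the real environment.
cfg = parse_env({"CHUNK_SIZE": "4"})
chunker: Callable[[str], list[str]] = partial(chunk_text, cfg=cfg)
chunks = chunker("abcdefghij")
```

Because `parse_env` takes a mapping rather than reading `os.environ` itself, tests can exercise it with ordinary dicts.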

1.3 Why This Matters Now

M02C05 sealed effects at boundaries, but globals or env leaks introduce hidden state. Config-as-data makes settings explicit values from raw sources, enabling full M02C01–M02C05 power with testable variants.

1.4 Config-as-Data as Values in 5 Lines

Treating config as a first-class value enables dynamic variants:

from dataclasses import dataclass
from collections.abc import Callable
from functools import partial


# Toy example. In the project, see `src/funcpipe_rag/api/clean_cfg.py`.
@dataclass(frozen=True)
class ToyCleanConfig:
    rule_names: tuple[str, ...] = ("strip", "lower")


RULES: dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lower": str.lower,
    "upper": str.upper,
}

configs: dict[str, ToyCleanConfig] = {
    "standard": ToyCleanConfig(),
    "minimal": ToyCleanConfig(rule_names=()),
}


def clean_abstract(text: str, cfg: ToyCleanConfig) -> str:
    for name in cfg.rule_names:
        text = RULES[name](text)
    return text


cleaners: dict[str, Callable[[str], str]] = {
    "standard": partial(clean_abstract, cfg=configs["standard"]),
    "minimal": partial(clean_abstract, cfg=configs["minimal"]),
}

Immutable config data, bound via partial, allows storage in dicts, composition with M02C01 partials, and testing as values—explicit and mutation-free.

Note: Raw dicts from env/CLI live only at the boundary; inside, configuration is always represented as frozen dataclasses (possibly stored in dict lookups). Configs should be serializable data (strings, ints, enums); functions come from registries, not from raw config.
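
One way to see why configs should hold only serializable data is a JSON round trip. The sketch below reuses the toy config from above; `to_json` and `from_json` are hypothetical helpers, not project functions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyCleanConfig:
    rule_names: tuple[str, ...] = ("strip", "lower")


def to_json(cfg: ToyCleanConfig) -> str:
    # asdict yields plain data; json.dumps writes the tuple as a JSON array.
    return json.dumps(asdict(cfg), sort_keys=True)


def from_json(text: str) -> ToyCleanConfig:
    raw = json.loads(text)
    # JSON arrays come back as lists; convert to a tuple so the config
    # stays immutable and hashable.
    return ToyCleanConfig(rule_names=tuple(raw["rule_names"]))


original = ToyCleanConfig(rule_names=("strip", "upper"))
restored = from_json(to_json(original))
```

The round trip works because the config names its rules as strings; had it stored the functions themselves, there would be nothing sensible to write to JSON.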


2. Mental Model: Leaky Globals vs Immutable Data

2.1 One Picture

Leaky Globals (Flaky)                   Immutable Data (Deterministic)
+-----------------------+               +------------------------------+
| global CHUNK_SIZE     |               | @dataclass(frozen=True)      |
| CHUNK_SIZE = 512      |               | class RagConfig:             |
| # Mutated elsewhere?  |               |     chunk_size: int = 512    |
| rag() # Hidden dep    |               | rag_fn = partial(rag,        |
+-----------------------+               |     cfg=RagConfig())         |
   ↑ Non-deterministic                  +------------------------------+
                                           ↑ Explicit, Testable

2.2 Contract Table

| Aspect | Leaky Globals | Immutable Data |
| --- | --- | --- |
| Dependencies | Hidden globals/env | Explicit dataclass params |
| Determinism | Breaks (mutations) | Safe (frozen) |
| Testing | Flaky (mock globals) | Safe (pass fake config) |
| Composability | Races / scattered | Flows like values |
| Validation | Scattered checks | At boundary only |
| Mutable Defaults in Partials | Breaks Determinism | Use frozen dataclasses or immutable types for configs |

Note on Leaky Choice: Use globals/mutables only in trivial scripts; always freeze for reuse.
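
The last row of the table can be illustrated with a small, self-contained sketch. `apply_rules` and the toy config are hypothetical stand-ins: a partial that binds a list keeps a reference to it, so later mutations change the bound function's behaviour, while a frozen dataclass holding a tuple cannot be changed after construction.

```python
from collections.abc import Iterable
from dataclasses import FrozenInstanceError, dataclass
from functools import partial


def apply_rules(text: str, rule_names: Iterable[str]) -> str:
    rules = {"strip": str.strip, "lower": str.lower, "upper": str.upper}
    for name in rule_names:
        text = rules[name](text)
    return text


# Leaky: the partial holds a reference to the list, not a snapshot of it.
mutable_rules = ["strip", "lower"]
leaky_cleaner = partial(apply_rules, rule_names=mutable_rules)
before = leaky_cleaner("  Hi ")
mutable_rules.append("upper")   # mutation somewhere else in the program
after = leaky_cleaner("  Hi ")  # same call, different result


@dataclass(frozen=True)
class ToyCleanConfig:
    rule_names: tuple[str, ...] = ("strip", "lower")


cfg = ToyCleanConfig()
safe_cleaner = partial(apply_rules, rule_names=cfg.rule_names)

try:
    cfg.rule_names = ("upper",)  # type: ignore[misc]
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```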


3. Running Project: FuncPipe RAG Builder

We extend the FuncPipe RAG Builder from m02-rag.md:
- Dataset: 10k arXiv CS abstracts (arxiv_cs_abstracts_10k.csv).
- Goal: Turn leaky globals/env into validated immutable config data.
- Start: Leaky version with globals/env (core6_start.py).
- End: Validated immutable data bound via partial, preserving equivalence.

3.1 Types (Canonical, Used Throughout)

Extend with config data:

from dataclasses import replace

from funcpipe_rag import (
    CleanConfig,
    Err,
    Ok,
    RagBoundaryDeps,
    RagConfig,
    RagCoreDeps,
    RagEnv,
    Reader,
    Result,
)

3.2 Leaky Start (Anti-Pattern)

# core6_start.py: Leaky config with globals/env
from dataclasses import replace
import os

from funcpipe_rag import Err, Ok, RagBoundaryDeps, RagEnv, boundary_rag_config, full_rag_api_path

GLOBAL_CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "512"))  # Leaky env
GLOBAL_CLEAN_RULES: list[str] = ["strip", "lower"]  # Mutable global


def leaky_full_rag_api_path(
    path: str,
    raw_cfg: dict[str, object],
    deps: RagBoundaryDeps,
):
    global GLOBAL_CLEAN_RULES
    GLOBAL_CLEAN_RULES.append("upper")  # Mutation!

    raw = dict(raw_cfg)
    raw["chunk_size"] = GLOBAL_CHUNK_SIZE
    raw["clean_rules"] = list(GLOBAL_CLEAN_RULES)

    cfg_res = boundary_rag_config(raw)
    if isinstance(cfg_res, Err):
        return cfg_res

    return full_rag_api_path(path, cfg_res.value, deps)

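Smells:
- Env leak (os.getenv): CHUNK_SIZE is read once at import time, so behaviour depends on the process environment rather than on arguments.
- Mutable list (append): every call appends "upper", so the rule list grows and results depend on how many times the function has already run.
- Global override: the caller's chunk_size and clean_rules in raw_cfg are silently replaced by the globals.
Problem: Breaks determinism; the same arguments can yield different results, which makes behaviour hard to trace and test.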
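
The drift is easy to reproduce without the project. The following toy stand-in mirrors only the raw-dict handling of the leaky function; the `TOY_` names and `leaky_build_raw` are hypothetical.

```python
TOY_GLOBAL_CHUNK_SIZE = 512
TOY_GLOBAL_CLEAN_RULES: list[str] = ["strip", "lower"]


def leaky_build_raw(raw_cfg: dict[str, object]) -> dict[str, object]:
    TOY_GLOBAL_CLEAN_RULES.append("upper")         # grows on every call
    raw = dict(raw_cfg)
    raw["chunk_size"] = TOY_GLOBAL_CHUNK_SIZE      # silently overrides the caller
    raw["clean_rules"] = list(TOY_GLOBAL_CLEAN_RULES)
    return raw


# Two calls with identical arguments.
first = leaky_build_raw({"chunk_size": 128})
second = leaky_build_raw({"chunk_size": 128})
```

Here `first` and `second` differ even though the arguments are identical, and the requested chunk size of 128 never reaches the result.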


4. Refactor to Config-as-Data: Validated Immutable Data + Binding

Before the project refactor, here is a concrete before/after example of redesigning an API that hides its configuration:

import os
from dataclasses import dataclass


# Before: Unfriendly API with implicit context
def foo(data: dict[str, int]) -> int:
    threshold = int(os.environ.get("THRESHOLD", "5"))  # Hidden env dep
    return sum(v for v in data.values() if v > threshold)  # Non-deterministic if env changes


# After: FP-friendly with explicit deps
@dataclass(frozen=True)
class FooConfig:
    threshold: int


def foo(data: dict[str, int], *, config: FooConfig) -> int:  # Replaces the "before" version
    return sum(v for v in data.values() if v > config.threshold)  # Pure: Depends only on inputs

This makes the function testable (pass any FooConfig value directly, with no environment patching) and composable, with no surprises from environment variables.
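
A short usage sketch of the "after" version shows what this buys in tests. The helper `foo_config_from_env` is a hypothetical boundary function added for illustration; it keeps the original default of 5.

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class FooConfig:
    threshold: int


def foo(data: dict[str, int], *, config: FooConfig) -> int:
    return sum(v for v in data.values() if v > config.threshold)


def foo_config_from_env(environ: Mapping[str, str]) -> FooConfig:
    """Hypothetical boundary helper: reads THRESHOLD once, defaulting to 5."""
    return FooConfig(threshold=int(environ.get("THRESHOLD", "5")))


data = {"a": 3, "b": 7, "c": 10}

# Tests pass configs as plain values; no environment patching is needed.
low = foo(data, config=FooConfig(threshold=0))        # 3 + 7 + 10
high = foo(data, config=FooConfig(threshold=8))       # only 10 passes
default = foo(data, config=foo_config_from_env({}))   # threshold 5: 7 + 10
```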

4.1 Parametric Core (Driven by Config Data)

Updated M02C05 core with config data:

from funcpipe_rag import RagConfig, RagEnv, get_deps, iter_rag_core

config = RagConfig(env=RagEnv(512))
deps = get_deps(config)
chunks_iter = iter_rag_core(docs, config, deps)  # streaming core (no boundary effects)

Properties:
- Parametric over immutable config data.
- Lazy: Builds on M02C03.

4.2 Post-Clean Sub-Core

Internal sub-core:

from funcpipe_rag import iter_chunks_from_cleaned

chunks_iter = iter_chunks_from_cleaned(cleaned_docs, config, deps.embedder)

Properties:
- Parametric over config.

4.3 Config Boundary (Parse Raw to Validated Data)

Parse raw to immutable config:

from funcpipe_rag import Err, Ok, boundary_rag_config

res = boundary_rag_config({"chunk_size": 512, "clean_rules": ["strip", "lower", "collapse_ws"]})
assert isinstance(res, Ok)

Properties:
- Thin: Validates at boundary (M02C05). Config parsing is the only place where we convert from untyped external data to typed internal data.
- Result: Explicit errors, consistent with M02C05.
- Frozen: Immutable output.
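
To make these properties concrete, here is a rough sketch of what such a boundary could look like. It is not the implementation of `boundary_rag_config`: the `ToyOk`/`ToyErr` types, the `KNOWN_RULES` set, the defaults, and the error wording are all assumptions made for illustration.

```python
from __future__ import annotations

from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRagConfig:
    chunk_size: int
    clean_rules: tuple[str, ...]


@dataclass(frozen=True)
class ToyOk:
    value: ToyRagConfig


@dataclass(frozen=True)
class ToyErr:
    error: str


# Assumed rule registry keys; the real project may define different ones.
KNOWN_RULES = frozenset({"strip", "lower", "upper", "collapse_ws"})


def toy_boundary_config(raw: Mapping[str, object]) -> ToyOk | ToyErr:
    """Convert untyped raw settings into a frozen, validated config."""
    chunk_size = raw.get("chunk_size", 512)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(chunk_size, bool) or not isinstance(chunk_size, int) or chunk_size <= 0:
        return ToyErr(f"Invalid config: chunk_size must be a positive int, got {chunk_size!r}")

    rules = raw.get("clean_rules", ("strip", "lower"))
    if not isinstance(rules, (list, tuple)):
        return ToyErr(f"Invalid config: clean_rules must be a list, got {rules!r}")
    unknown = [r for r in rules if not isinstance(r, str) or r not in KNOWN_RULES]
    if unknown:
        return ToyErr(f"Invalid config: unknown clean rules {unknown!r}")

    # The mutable list from the raw source becomes an immutable tuple.
    return ToyOk(ToyRagConfig(chunk_size=chunk_size, clean_rules=tuple(rules)))


ok = toy_boundary_config({"chunk_size": 512, "clean_rules": ["strip", "lower", "collapse_ws"]})
bad = toy_boundary_config({"chunk_size": "invalid"})
```

Returning an error value instead of raising keeps the failure visible in the function's return type, so callers have to handle it explicitly.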

4.4 Public API (Edge, Uses Config Data)

From M02C05, with config data:

from funcpipe_rag import full_rag_api_docs, full_rag_api_path

chunks, obs = full_rag_api_docs(docs, config, deps)
res = full_rag_api_path("arxiv_cs_abstracts_10k.csv", config, boundary_deps)

Properties:
- Passes immutable config data.
- Matches Module 1 with default config.

4.5 Configurator Tie-In (M02C01, with Closures)

from functools import partial
from funcpipe_rag import CleanConfig, RagConfig, RagEnv, full_rag_api_docs, get_deps, make_rag_fn

test_clean_cfg = CleanConfig(rule_names=("strip",))
config = RagConfig(env=RagEnv(512), clean=test_clean_cfg)
deps = get_deps(config)

# Bind a configured API shape (docs -> (chunks, obs))
rag_docs_fn = partial(full_rag_api_docs, config=config, deps=deps)

# Or use the canonical configurator helper (recommended)
rag_fn = make_rag_fn(chunk_size=512, clean_cfg=test_clean_cfg)

Wins: Config data bound via partial/closures; variants via replace. Composes with M02C01.
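
The "variants via replace" point can be sketched with toy types. `ToyRagConfig`, `ToyCleanConfig`, and `make_toy_rag_fn` are hypothetical and only imitate the shape of the project's configurator: `dataclasses.replace` builds a new frozen value from an existing one, and a closure captures it.

```python
from collections.abc import Callable
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyCleanConfig:
    rule_names: tuple[str, ...] = ("strip", "lower")


@dataclass(frozen=True)
class ToyRagConfig:
    chunk_size: int = 512
    clean: ToyCleanConfig = ToyCleanConfig()


RULES: dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lower": str.lower,
    "upper": str.upper,
}


def make_toy_rag_fn(cfg: ToyRagConfig) -> Callable[[str], list[str]]:
    """Closure-based configurator: cfg is captured once and cannot be mutated."""
    def rag_fn(text: str) -> list[str]:
        for name in cfg.clean.rule_names:
            text = RULES[name](text)
        return [text[i:i + cfg.chunk_size] for i in range(0, len(text), cfg.chunk_size)]
    return rag_fn


base = ToyRagConfig(chunk_size=4)
# replace returns a new frozen value; base is left untouched.
shouting = replace(base, clean=replace(base.clean, rule_names=("strip", "upper")))

base_fn = make_toy_rag_fn(base)
shouting_fn = make_toy_rag_fn(shouting)
base_out = base_fn("  Hello World ")
shouting_out = shouting_fn("  Hello World ")
```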

Advanced Note: If you know about Codensity, you’ll recognise the spirit here; otherwise ignore this.


5. Equational Reasoning: Substitution Exercise

Hand Exercise: Substitute in iter_rag_core.
1. At the boundary, inline cfg = CleanConfig(...) into make_cleaner(cfg) → fixed cleaning function.
2. In the core, inline cleaner = deps.cleaner and env = config.env → fixed functions/values.
3. Substitute into the generator → a stream fully determined by (docs, config, deps).
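Bug Hunt: In the leaky version, the global and env reads break substitution: the value behind a name can change between calls, so replacing the name with its value is no longer valid.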

Example:
- Leaky: GLOBAL_CLEAN_RULES.append → mutable, not substitutable.
- Data: config.clean.rule_names → immutable, substitutable.
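
The substitution steps can be replayed on a toy cleaner (hypothetical names, not the project's `make_cleaner`). Each step replaces a name with its value, and because the config is immutable every step must give the same result.

```python
from collections.abc import Callable
from dataclasses import dataclass
from functools import partial


@dataclass(frozen=True)
class ToyCleanConfig:
    rule_names: tuple[str, ...] = ("strip", "lower")


RULES: dict[str, Callable[[str], str]] = {"strip": str.strip, "lower": str.lower}


def clean(text: str, cfg: ToyCleanConfig) -> str:
    for name in cfg.rule_names:
        text = RULES[name](text)
    return text


cfg = ToyCleanConfig()
cleaner = partial(clean, cfg=cfg)
text = "  Mixed Case  "

step0 = cleaner(text)                          # the bound function
step1 = clean(text, cfg=ToyCleanConfig())      # inline cfg
step2 = RULES["lower"](RULES["strip"](text))   # unroll the loop over ("strip", "lower")
step3 = text.strip().lower()                   # inline the registry lookups
```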


6. Property-Based Testing: Proving Config-Driven Behaviour

Use Hypothesis to check that the refactor preserves determinism with config data.

6.1 Custom Strategy

From tests/conftest.py (as in Module 1).
See tests/test_rag_api.py for a minimal FakeReader used in boundary-shape tests.

6.2 Core Equivalence Property

# tests/test_rag_api.py
from hypothesis import given

from funcpipe_rag import Ok, RagBoundaryDeps, RagConfig, full_rag_api_docs, full_rag_api_path, get_deps
from tests.conftest import doc_list_strategy, env_strategy


class FakeReader:
    def __init__(self, docs):
        self._docs = docs

    def read_docs(self, path):
        _ = path
        return Ok(self._docs)


@given(docs=doc_list_strategy(), env=env_strategy())
def test_docs_api_matches_boundary_api(docs, env):
    config = RagConfig(env=env)
    deps = get_deps(config)
    chunks1, _ = full_rag_api_docs(docs, config, deps)

    boundary_deps = RagBoundaryDeps(core=deps, reader=FakeReader(docs))
    res = full_rag_api_path("fake.csv", config, boundary_deps)
    assert isinstance(res, Ok)
    chunks2, _ = res.value
    assert chunks1 == chunks2

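Note: Checks that the docs-level API and the path-level boundary API return the same chunks for any generated config, so the config-driven core stays equivalent to Module 1.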

6.3 Prefix Equivalence (Streaming with Config)

from itertools import islice

from hypothesis import given
import hypothesis.strategies as st

from funcpipe_rag import RagConfig, get_deps, iter_rag_core
from tests.conftest import doc_list_strategy, env_strategy


@given(docs=doc_list_strategy(), env=env_strategy(), k=st.integers(0, 50))
def test_iter_rag_core_prefix_equivalence(docs, env, k):
    config = RagConfig(env=env)
    deps = get_deps(config)
    eager = list(iter_rag_core(docs, config, deps))
    prefix = list(islice(iter_rag_core(docs, config, deps), k))
    assert prefix == eager[:k]

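Note: Verifies that the first k items of the lazy stream equal the first k items of the fully materialized result, so streaming with config data matches Module 1.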

6.4 Config Validation at Boundary

from funcpipe_rag import Err, boundary_rag_config


def test_boundary_validation():
    raw = {"chunk_size": "invalid"}  # not a valid chunk size
    res = boundary_rag_config(raw)
    assert isinstance(res, Err)
    assert "Invalid config" in res.error

Note: Checks that the boundary returns Err for invalid raw config.

6.5 Idempotence Property (Config-Driven)

@given(docs=doc_list_strategy(), env=env_strategy())
def test_full_rag_api_path_idempotent(docs, env):
    config = RagConfig(env=env)
    deps = RagBoundaryDeps(core=get_deps(config), reader=FakeReader(docs))
    res1 = full_rag_api_path("fake.csv", config, deps)
    res2 = full_rag_api_path("fake.csv", config, deps)
    assert res1 == res2

Note: Ensures no hidden state with immutable config and faked deps.

6.6 Config Equality Property

from dataclasses import replace


@given(docs=doc_list_strategy(), env=env_strategy())
def test_same_config_same_behaviour(docs, env):
    config1 = RagConfig(env=env)
    config2 = replace(config1)  # Structurally equal
    deps = get_deps(config1)
    out1, _ = full_rag_api_docs(docs, config1, deps)
    out2, _ = full_rag_api_docs(docs, config2, deps)
    assert out1 == out2

Note: Verifies config equality implies behaviour equality.
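
The property relies on frozen dataclasses comparing by value. A stdlib-only sketch with hypothetical toy types shows what `replace(config1)` produces: a distinct object that is equal to, and hashes the same as, the original.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyEnv:
    chunk_size: int = 512


@dataclass(frozen=True)
class ToyRagConfig:
    env: ToyEnv = ToyEnv()


config1 = ToyRagConfig(env=ToyEnv(256))
config2 = replace(config1)  # new object, same field values

same_value = config1 == config2
same_object = config1 is config2
same_hash = hash(config1) == hash(config2)

# Equal, hashable configs can key a dict of prebuilt pipelines or caches.
registry = {config1: "pipeline-256"}
looked_up = registry[config2]
```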

6.7 Shrinking Demo: Catching a Leaky Bug

A bad variant that routes the chunk size through a mutable global:

from dataclasses import replace

from funcpipe_rag import RagBoundaryDeps, RagConfig, RagEnv, Result, full_rag_api_path

MUTABLE_CFG = {"chunk_size": 512}

def bad_full_rag_api_path(
    path: str,
    config: RagConfig,
    deps: RagBoundaryDeps
) -> Result:
    global MUTABLE_CFG
    MUTABLE_CFG["chunk_size"] += 1  # Leaky mutation
    mutated_config = replace(config, env=RagEnv(MUTABLE_CFG["chunk_size"]))
    return full_rag_api_path(path, mutated_config, deps)

Property:

from hypothesis import given
import hypothesis.strategies as st

from funcpipe_rag import Ok, RagBoundaryDeps, RagConfig, RagEnv, get_deps


class FakeReader:
    def __init__(self, docs):
        self._docs = docs

    def read_docs(self, path):
        _ = path
        return Ok(self._docs)


@given(chunk_size=st.integers(128, 1024))
def test_bad_rag_idempotence(chunk_size):
    global MUTABLE_CFG
    MUTABLE_CFG = {"chunk_size": chunk_size}
    config = RagConfig(env=RagEnv(chunk_size))
    deps = RagBoundaryDeps(core=get_deps(config), reader=FakeReader([]))
    res1 = bad_full_rag_api_path("fake_path", config, deps)
    res2 = bad_full_rag_api_path("fake_path", config, deps)
    assert res1 == res2

Failure Trace (Example):

Falsifying example: test_bad_rag_idempotence(
    chunk_size=128,
)
AssertionError

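Analysis: Hypothesis shrinks chunk_size to the smallest value in the range (128). Each call increments MUTABLE_CFG["chunk_size"] before building the config, so the first call runs with 129 and the second with 130; the property fails as soon as the result reflects the chunk size.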


7. When Config-as-Data Isn't Worth It

Use globals/mutables only in:
- Trivial one-off scripts (no variants). Never in code that’s imported from elsewhere or that is part of the RAG library.
- Short-lived notebooks for exploration.
- Legacy adapters wrapping data config.
Guardrails: Isolate to <5 lines; prefer data for tests/reuse.

Example:

# Trivial script
print(512)  # OK for one-off constant

8. Pre-Core Quiz

  1. Mutable cfg.x = y? → Violates immutability.
  2. os.getenv in core? → Boundary parse.
  3. Magic 512? → cfg.chunk_size.
  4. Global CFG? → Pass as param.
  5. Prove that config controls behaviour? → Use Hypothesis properties over configs.

9. Post-Core Reflection & Exercise

Reflect: Find a global or mutable config. Refactor to frozen dataclass + partial; validate at boundary. Add Hypothesis for equivalence/idempotence.
Project Exercise: Apply to RAG (e.g., CleanConfig as data); run properties.
- Did immutability catch mutations?
- Did data enable easier tests?
- Did binding clarify behaviour?

Next: Core 7 – Callback Hell to Combinators.

Verify all patterns with Hypothesis—examples provided show how to detect impurities like globals or non-determinism.

Further Reading: For more on closures in Python, see 'Fluent Python' by Luciano Ramalho. Explore toolz for advanced partials once comfortable.