Core 1: FP in Stdlib – itertools, functools, operator, pathlib

Module 09

Core question:
How do you use Python's stdlib modules (itertools, functools, operator, and pathlib) to build functional pipelines in FuncPipe that are composable, pure, and expressive, without external dependencies?

In this core, we explore functional programming patterns in Python's standard library for the FuncPipe RAG Builder (now at funcpipe-rag-09). We cover four modules:

- itertools for iterator transformations (chain, groupby, tee, accumulate, filterfalse, compress, takewhile/dropwhile).
- functools for higher-order functions such as partial, reduce, and lru_cache, including their gotchas: partial shares any mutable object it binds, and lru_cache is only safe on pure functions with hashable arguments.
- operator for function forms of operators and accessors (itemgetter, attrgetter, methodcaller) that replace simple lambdas.
- pathlib for immutable path objects with functional operations (glob, read_text, and PurePath for path manipulation without I/O).

These tools enable pure, composable data flows and integrate with prior modules' lazy streams and monadic patterns. We refactor RAG stages to stdlib-only FP, verifying equivalence with the custom versions and laws such as fold associativity (which holds only when the operator itself is associative). No external libraries are used, which keeps the code portable. We also highlight sharp edges: groupby only groups consecutive items, so its input must be sorted by the grouping key, and tee buffers every item that one consumer has seen and another has not.

Motivation Bug: Relying on custom combinators or external libs like toolz leads to dependency bloat and portability issues; stdlib FP provides built-in, battle-tested alternatives for core patterns.

Delta from Module 08: Module 08's async flows added concurrency; this module grounds the same patterns in core Python so they apply more broadly.

Stdlib FP Protocol (Contract, Entry/Exit Criteria):

- Composability: Functions such as chain and partial return iterables or callables that compose without side effects.
- Purity: Pass pure functions to map, filter, and reduce, and never mutate inputs. functools.partial binds its arguments once, so bind immutable values (a mutable bound object is shared by every call). lru_cache preserves purity only for pure functions with hashable arguments; cached return values are shared between callers, so return immutable ones.
- Laziness: Prefer lazy iterators (islice, map) over lists. tee duplicates an iterator safely only if the copies are consumed at a similar pace (bounded skew), because it buffers every item one copy has yielded and another has not.
- Semantics: reduce is a left fold; regrouping it (for example, into a right fold) is valid only if the operator is associative. Properties are verified under that precondition.
- Integration: Replace custom combinators with stdlib equivalents in RAG.
- Mypy Config: --strict, with Callable typing.
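The purity rules above are easy to break by accident. The following is a minimal sketch, not code from the repo: `tag`, `shared_tags`, and `tokens` are hypothetical names chosen for illustration. It shows that partial binds the object itself rather than a copy, and that lru_cache hands every caller the same cached object.

```python
from collections.abc import Sequence
from functools import lru_cache, partial


def tag(doc_id: str, tags: Sequence[str]) -> tuple[str, tuple[str, ...]]:
    """Pure on its own: reads `tags` and never mutates it."""
    return doc_id, tuple(tags)


# Gotcha: partial stores a reference to the list, not a snapshot of it.
shared_tags = ["cs.AI"]
tag_shared = partial(tag, tags=shared_tags)
before = tag_shared("d1")
shared_tags.append("cs.LG")  # a mutation elsewhere changes later results
after = tag_shared("d1")

# Fix: bind an immutable value so the partial always behaves the same.
tag_frozen = partial(tag, tags=("cs.AI",))


@lru_cache(maxsize=128)
def tokens(text: str) -> tuple[str, ...]:
    """Hashable argument in, immutable value out, so caching is safe."""
    return tuple(text.split())


first = tokens("lazy pure streams")
second = tokens("lazy pure streams")  # served from the cache
```

Here `before` and `after` differ even though the call is identical, which is the loss of referential transparency the protocol rules out. `first is second` holds because the cache returns the stored tuple, which is harmless only because tuples cannot be mutated.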

Audience: Engineers building FP pipelines who want stdlib-only solutions for composability without external deps.

Outcome:

1. Use itertools/functools/operator/pathlib for FP patterns, handling gotchas.
2. Refactor RAG to stdlib FP.
3. Prove equivalence/laws with Hypothesis.

1. Laws & Invariants

Law	Description	Enforcement
Associativity Law	reduce(op, xs) is a left fold; it equals the right fold (or any other regrouping) only if op is associative (e.g., integer add, but not subtraction or float add).	Property tests
Purity Law	Ops depend only on inputs; no globals/mutation.	Code reviews
Laziness Invariant	Generators defer computation; no eager materialization.	Tests with islice
Equivalence Law	Stdlib refactors yield the same outputs as custom versions.	Hypothesis equivalence

Together, these laws are the checks a stdlib refactor must pass before it replaces a custom combinator.
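To make the associativity precondition concrete, here is a small sketch that compares a left fold with a right fold. `fold_left` and `fold_right` are hypothetical helper names introduced only for this illustration; both assume a non-empty sequence.

```python
import operator
from collections.abc import Callable, Sequence
from functools import reduce
from typing import TypeVar

T = TypeVar("T")


def fold_left(op: Callable[[T, T], T], xs: Sequence[T]) -> T:
    """((x1 op x2) op x3) ...: this is what functools.reduce computes."""
    return reduce(op, xs)


def fold_right(op: Callable[[T, T], T], xs: Sequence[T]) -> T:
    """x1 op (x2 op (x3 ...)): reduce over the reversed input with flipped arguments."""
    return reduce(lambda acc, x: op(x, acc), reversed(xs))


xs = [1, 2, 3]
add_agrees = fold_left(operator.add, xs) == fold_right(operator.add, xs)
sub_left = fold_left(operator.sub, xs)    # (1 - 2) - 3
sub_right = fold_right(operator.sub, xs)  # 1 - (2 - 3)
```

Addition gives the same result either way, while subtraction gives -4 and 2, so any property that regroups a fold has to state associativity as a precondition.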

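2. Decision Table

Scenario	Need Laziness	Gotchas	Recommended
Merge streams	Yes	None	itertools.chain
Configure funcs	No	Bound mutable arguments are shared across calls	functools.partial
Fold aggregations	No	Op must be associative for regrouping; empty input needs an initial value	functools.reduce
Extract attrs	No	None	operator.attrgetter
File glob/read	No	Filesystem is mutable; glob order is not guaranteed	Path.glob / Path.read_text

Choose based on the scenario, and check the gotchas first: for example, groupby only groups consecutive items, so its input must already be sorted by the grouping key.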
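The groupby gotcha is worth seeing once. The sketch below contrasts groupby on unsorted input, groupby after sorting, and a single-pass dict accumulator for streams that cannot be sorted. `group_consecutive`, `group_all`, and the sample pairs are hypothetical and exist only for this example.

```python
from collections import defaultdict
from collections.abc import Callable, Hashable, Iterable
from itertools import groupby
from operator import itemgetter
from typing import TypeVar

T = TypeVar("T")
K = TypeVar("K", bound=Hashable)


def group_consecutive(items: Iterable[T], key: Callable[[T], K]) -> list[tuple[K, list[T]]]:
    """What groupby actually does: one group per run of equal keys."""
    return [(k, list(run)) for k, run in groupby(items, key=key)]


def group_all(items: Iterable[T], key: Callable[[T], K]) -> dict[K, list[T]]:
    """Single-pass grouping for unsorted input, using a dict accumulator."""
    groups: defaultdict[K, list[T]] = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


# Hypothetical (category, doc_id) pairs, deliberately unsorted.
docs = [("cs.AI", "d1"), ("cs.LG", "d2"), ("cs.AI", "d3")]
category = itemgetter(0)

runs = group_consecutive(docs, category)  # cs.AI shows up as two separate groups
exact = group_consecutive(sorted(docs, key=category), category)
accumulated = group_all(docs, category)
```

Sorting costs an eager pass over the whole input, while the accumulator keeps every group in memory until the stream ends; which trade-off is acceptable depends on the stage.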

3. Public API (No Wrappers Needed; Direct Stdlib Use)

Repo alignment note (end-of-Module-09):

- This repo keeps the earlier Module 07–08 abstractions (ports/effects/async) intact.
- Module 09 does not delete them; it adds a stdlib-first layer for day-to-day pipeline work.
- For runnable examples, see src/funcpipe_rag/rag/stdlib_fp.py and src/funcpipe_rag/interop/stdlib_fp.py.

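4. Reference Implementations

4.1 Itertools for Stream Transforms

import operator
from collections.abc import Callable, Iterable, Iterator
from itertools import chain, groupby, tee, accumulate, filterfalse, compress, takewhile, dropwhile
from typing import TypeVar

T = TypeVar("T")  # RawDoc, CleanDoc, clean_doc, etc. are the project's existing types and functions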

# Chain for merge
def merge_streams(*streams: Iterable[T]) -> Iterator[T]:
    return chain(*streams)

# Groupby with sort precondition
def group_docs(docs: Iterable[RawDoc], key: Callable[[RawDoc], str]) -> Iterator[tuple[str, Iterator[RawDoc]]]:
    # groupby only groups consecutive items, so sort by the same key first.
    # The sort is eager; for streaming input, use a dict accumulator or accept per-run groups.
    # Consume each group before advancing: all groups share one underlying iterator.
    sorted_docs = sorted(docs, key=key)
    return groupby(sorted_docs, key=key)

# Tee for multicast (careful consumption)
def multicast_stream(stream: Iterable[T]) -> tuple[Iterator[T], Iterator[T]]:
    # tee buffers every item one copy has yielded and the other has not, so memory
    # grows with the skew between consumers. Do not touch `stream` after this call.
    left, right = tee(stream, 2)  # unpacking gives mypy an exact 2-tuple
    return left, right

# Accumulate for running folds
def running_sum(nums: Iterable[int]) -> Iterator[int]:
    return accumulate(nums, operator.add)

# Filterfalse for negation
def filter_non_ai(docs: Iterable[RawDoc]) -> Iterator[RawDoc]:
    return filterfalse(lambda d: 'cs.AI' in d.categories, docs)  # A lambda is clearest here: attrgetter can fetch d.categories but cannot express the membership test.

# Compress for masking
def masked_docs(docs: Iterable[RawDoc], mask: Iterable[bool]) -> Iterator[RawDoc]:
    return compress(docs, mask)

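# Takewhile/dropwhile for fencing
def take_short_abstracts(abstracts: Iterable[str]) -> Iterator[str]:
    # takewhile stops at the first abstract of 512+ characters: it cuts a prefix, it does not filter.
    # A lambda is needed because operator has no helper that expresses len(s) < 512.
    return takewhile(lambda s: len(s) < 512, abstracts)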
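Two of the sharp edges above can be checked directly. This sketch, with hypothetical names and data, shows a tee usage whose skew is bounded at one item, and the difference between takewhile (prefix cut) and filter (tests every item).

```python
from collections.abc import Iterable, Iterator
from itertools import takewhile, tee
from typing import TypeVar

T = TypeVar("T")


def adjacent_pairs(items: Iterable[T]) -> Iterator[tuple[T, T]]:
    """Yield (x0, x1), (x1, x2), ... from two tee copies that stay one item apart."""
    left, right = tee(items, 2)
    next(right, None)        # advance one copy by a single item
    return zip(left, right)  # zip pulls from both in lockstep, so the skew stays at 1


lengths = [100, 600, 200]  # hypothetical abstract lengths
prefix = list(takewhile(lambda n: n < 512, lengths))  # stops at 600
kept = list(filter(lambda n: n < 512, lengths))       # tests every item
pairs = list(adjacent_pairs(lengths))
```

On Python 3.10+, itertools.pairwise provides the same adjacent-pairs behavior; the hand-written version is shown only because it makes the bounded skew visible.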

4.2 Functools for Higher-Order

import operator
from collections.abc import Callable, Iterable
from functools import partial, reduce, lru_cache, singledispatch
from itertools import starmap

# Partial for configurators
def partial_clean(cfg: CleanCfg) -> Callable[[RawDoc], CleanDoc]:
    return partial(clean_doc, cfg=cfg)

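# Reduce for folds
def reduce_embeddings(embs: Iterable[tuple[float, ...]]) -> tuple[float, ...]:
    it = iter(embs)
    try:
        first = next(it)
    except StopIteration:
        # A bare StopIteration would be swallowed or turned into RuntimeError inside generators.
        raise ValueError("reduce_embeddings requires at least one embedding") from None
    # Element-wise sum. Precondition: equal lengths; zip(strict=True) enforces it (Python 3.10+).
    return reduce(lambda a, b: tuple(starmap(operator.add, zip(a, b, strict=True))), it, first)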

# Lru_cache for memoization
@lru_cache(maxsize=128)
def cached_embed(text: str) -> tuple[float, ...]:
    return embed_chunk(text)  # Safe only if embed_chunk is pure; the tuple result keeps the shared cached value immutable

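# Singledispatch for polymorphism (use at boundaries: the registry is mutable global state)
@singledispatch
def process_doc(doc: object) -> object:
    raise NotImplementedError(f"no process_doc handler for {type(doc).__name__}")

@process_doc.register(RawDoc)
def _(doc: RawDoc):
    return clean_doc(doc, cfg=CleanCfg())  # cfg passed explicitly, as in section 4.5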

@process_doc.register(CleanDoc)
def _(doc: CleanDoc):
    return chunk_doc(doc)
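Because the listing above depends on project types, here is a self-contained sketch of the same singledispatch pattern. The `RawDoc` and `CleanDoc` classes and both handler bodies are hypothetical stand-ins, not the project's implementations.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class RawDoc:  # hypothetical stand-in for the project's RawDoc
    text: str


@dataclass(frozen=True)
class CleanDoc:  # hypothetical stand-in for the project's CleanDoc
    text: str


@singledispatch
def process_doc(doc: object) -> object:
    """Fallback for types with no registered handler."""
    raise NotImplementedError(f"no handler for {type(doc).__name__}")


@process_doc.register
def _(doc: RawDoc) -> CleanDoc:
    # Cleaning step: collapse runs of whitespace.
    return CleanDoc(" ".join(doc.text.split()))


@process_doc.register
def _(doc: CleanDoc) -> tuple[str, ...]:
    # Chunking step: here, simply one chunk per word.
    return tuple(doc.text.split())


cleaned = process_doc(RawDoc("  lazy   pure streams "))
chunks = process_doc(cleaned)
```

When register is used without an argument, the dispatch type is taken from the first parameter's annotation. Registration mutates a module-level registry, so it is best done once at import time rather than inside pipeline stages.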

4.3 Operator for Lambda Avoidance

from operator import itemgetter, attrgetter, methodcaller

# Itemgetter for extraction: get_title(row) == row['title']
get_title = itemgetter('title')

# Attrgetter for attrs: get_doc_id(doc) == doc.doc_id
get_doc_id = attrgetter('doc_id')

# Methodcaller for methods: read_text(path) == path.read_text()
read_text = methodcaller('read_text')
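The getters above become more useful once they are passed to map and sorted. The following sketch uses hypothetical `Doc` and `Meta` records and sample rows to show multi-key itemgetter, dotted attrgetter, and methodcaller with arguments.

```python
from dataclasses import dataclass
from operator import attrgetter, itemgetter, methodcaller


@dataclass(frozen=True)
class Meta:  # hypothetical nested record
    source: str


@dataclass(frozen=True)
class Doc:  # hypothetical document type
    doc_id: str
    meta: Meta


rows = [{"title": "B", "year": 2021}, {"title": "A", "year": 2023}]
docs = [Doc("d2", Meta("arxiv")), Doc("d1", Meta("blog"))]

titles = list(map(itemgetter("title"), rows))
year_and_title = itemgetter("year", "title")(rows[0])  # several keys give a tuple
by_title = sorted(rows, key=itemgetter("title"))

sources = list(map(attrgetter("meta.source"), docs))  # dotted names traverse attributes
by_id = sorted(docs, key=attrgetter("doc_id"))

fields = list(map(methodcaller("split", ","), ["a,b", "c"]))  # extra args are forwarded
```

These callables are picklable and carry their intent in their name, which is the main advantage over an equivalent lambda.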

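4.4 Pathlib for Paths

from collections.abc import Iterator
from pathlib import Path, PurePath

# Glob and read
def read_csvs(dir_path: str) -> Iterator[str]:
    path = Path(dir_path)
    # glob order is not guaranteed, so sort for a deterministic stream.
    # Assumes UTF-8 files; an explicit encoding avoids depending on the platform default.
    for csv in sorted(path.glob('*.csv')):
        yield csv.read_text(encoding='utf-8')

# PurePath for no-IO
def pure_normalize(p: str) -> str:
    # Collapses repeated separators and '.' components without touching the filesystem; '..' is kept.
    return str(PurePath(p))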
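Pure paths support more than normalization. As a sketch, the hypothetical `output_path` helper below derives an output location from an input path without any I/O. It uses PurePosixPath so the results are the same on every platform.

```python
from pathlib import PurePosixPath


def output_path(src: PurePosixPath, out_dir: PurePosixPath) -> PurePosixPath:
    """Map an input file to its output location; pure, no filesystem access."""
    return out_dir / src.with_suffix(".jsonl").name


src = PurePosixPath("data//raw/./arxiv.csv")  # messy but equivalent spelling
dst = output_path(src, PurePosixPath("data/clean"))
unresolved = PurePosixPath("data/raw/../clean")  # '..' is kept: resolving it needs the filesystem
```

Every operation returns a new path object and leaves `src` untouched, which is what makes these values safe to share between pipeline stages.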

4.5 Integration in RAG

Refactor RAG cleaning to stdlib FP.

# RAG with stdlib
def rag_stdlib_clean(docs: Iterable[RawDoc]) -> Iterator[CleanDoc]:
    # map is lazy: each doc is cleaned only when the consumer pulls it.
    return map(partial(clean_doc, cfg=CleanCfg()), docs)
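The stage above relies on project definitions, so the following is a runnable sketch with stand-ins. The fields of `CleanCfg`, `RawDoc`, and `CleanDoc`, the body of `clean_doc`, and the extra `cfg` parameter are all assumptions made for illustration; the real ones may differ.

```python
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from functools import partial


@dataclass(frozen=True)
class CleanCfg:  # hypothetical config; the real CleanCfg may differ
    lowercase: bool = True


@dataclass(frozen=True)
class RawDoc:  # hypothetical stand-in
    doc_id: str
    abstract: str


@dataclass(frozen=True)
class CleanDoc:  # hypothetical stand-in
    doc_id: str
    abstract: str


def clean_doc(doc: RawDoc, *, cfg: CleanCfg) -> CleanDoc:
    """Assumed cleaning rule: collapse whitespace, optionally lowercase."""
    text = " ".join(doc.abstract.split())
    return CleanDoc(doc.doc_id, text.lower() if cfg.lowercase else text)


def rag_stdlib_clean(docs: Iterable[RawDoc], cfg: CleanCfg = CleanCfg()) -> Iterator[CleanDoc]:
    # The frozen default config is immutable, so sharing it across calls is safe.
    return map(partial(clean_doc, cfg=cfg), docs)


raw = [RawDoc("d1", "  Lazy   Streams "), RawDoc("d2", "Pure Functions")]
cleaned = list(rag_stdlib_clean(raw))
cased = list(rag_stdlib_clean(raw, CleanCfg(lowercase=False)))
```

Making the config a frozen dataclass is what allows it to be bound with partial without reintroducing shared mutable state.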

4.6 Before/After Refactor

# Before: custom merge (eager: materializes both inputs as lists)
def custom_merge(a, b):
    return list(a) + list(b)

# After: stdlib (lazy: yields from a, then from b, without copying)
def std_merge(a, b):
    return chain(a, b)
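The practical difference between the two versions is laziness, which can be tested with islice as the Laziness invariant suggests. In this sketch, `tracked` and `pulled` are hypothetical instrumentation; the list is mutated only to observe how many items were drawn.

```python
from collections.abc import Iterable, Iterator
from itertools import chain, count, islice

pulled: list[int] = []  # records every item actually drawn from a source


def tracked(items: Iterable[int]) -> Iterator[int]:
    """Pass items through unchanged while recording each pull (test-only side effect)."""
    for item in items:
        pulled.append(item)
        yield item


def std_merge(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    return chain(a, b)


# The second source is infinite; an eager merge would never return.
merged = std_merge(tracked([1, 2, 3]), tracked(count(100)))
first_four = list(islice(merged, 4))
```

Only four items are ever pulled, whereas `list(a) + list(b)` would try to materialize the infinite source before returning anything.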

5. Property-Based Proofs (repo tests)

Runnable properties live in tests/unit/interop/test_stdlib_fp.py and tests/unit/rag/test_stdlib_fp.py.

import operator
from functools import partial, reduce
from itertools import chain
from pathlib import PurePosixPath

from hypothesis import given, strategies as st

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_chain_equiv(a, b):
    assert list(chain(a, b)) == a + b

@given(st.integers(1, 10))
def test_partial_equiv(n):
    add_n = partial(operator.add, n)
    assert add_n(5) == n + 5

@given(st.lists(st.integers(), min_size=3))
def test_reduce_associativity(xs):
    a, b, c = xs[:3]  # min_size=3 guarantees three elements
    assert operator.add(operator.add(a, b), c) == operator.add(a, operator.add(b, c))
    assert reduce(operator.add, xs) == sum(xs)

@given(st.text(min_size=1, alphabet=st.characters(blacklist_categories=['Cc'])))  # Avoid nulls/control chars
def test_path_immutable(s):
    p = PurePosixPath(s)  # PurePosixPath keeps the property platform-independent
    before = str(p)
    _ = p / 'child'  # joining returns a new path
    assert str(p) == before  # the original is unchanged
    assert PurePosixPath(before) == p  # normalization is idempotent

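6. Runtime Preservation Guarantee

Stdlib FP adds little runtime overhead: in CPython, itertools, operator, and the functools helpers used here (partial, reduce, lru_cache) are implemented in C, so replacing hand-written loops with them should not slow a pipeline down. Verify this with benchmarks on representative inputs.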

7. Anti-Patterns & Immediate Fixes

Anti-Pattern	Symptom	Fix
List materialization	Memory blowup	Use generators/islice
groupby unsorted	Same key split across several groups	Sort input by the grouping key first
tee unbalanced	Memory exhaustion	Consume copies evenly
Mutable objects bound in partial	Shared state bugs	Bind immutable values (tuples, frozen dataclasses)

8. Pre-Core Quiz

itertools for…? → Iterator transforms
functools for…? → HOFs like partial
operator for…? → Lambda avoidance
pathlib for…? → Immutable paths
Why stdlib? → Portability, no deps

9. Post-Core Exercise

1. Refactor a custom combinator to its stdlib equivalent.
2. Prove equivalence with Hypothesis.

Pipeline Usage (Idiomatic)

merged = chain(stream1, stream2)

Next Core: 2. FP Helper Libraries – toolz / returns in Real Pipelines