Core 1: FP in Stdlib – itertools, functools, operator, pathlib

Module 09

Core question:
How do you use Python's stdlib modules like itertools, functools, operator, and pathlib to build functional pipelines, enhancing composability, purity, and expressiveness without external dependencies in FuncPipe?

In this core, we explore functional programming patterns in Python's standard library for the FuncPipe RAG Builder (now at funcpipe-rag-09). We focus on itertools for iterator transformations (e.g., chain, groupby, tee, accumulate, filterfalse, compress, takewhile/dropwhile), functools for higher-order functions like partial, reduce, and lru_cache (with gotchas like mutable defaults and cache purity), operator for functional operators (e.g., itemgetter, attrgetter, methodcaller to avoid lambdas), and pathlib for immutable path handling with functional ops (e.g., glob, read_text, PurePath for no-IO paths). These tools enable pure, composable data flows, integrating with prior modules' lazy streams and monadic patterns. We refactor RAG stages to use stdlib-only FP, verifying equivalence and laws like associativity for compositions (with preconditions for associative operators). No external libs are used, ensuring portability; we highlight sharp edges like groupby requiring sorted input and tee memory traps.

Motivation Bug: Relying on custom combinators or external libs like toolz leads to dependency bloat and portability issues; stdlib FP provides built-in, battle-tested alternatives for core patterns.

Delta from Module 08: Async flows add concurrency; stdlib FP grounds patterns in core Python for broader applicability.

Stdlib FP Protocol (Contract, Entry/Exit Criteria):
- Composability: Functions such as chain and partial return iterables or callables that compose without side effects.
- Purity: Pass pure functions to map, filter, and reduce, and never mutate inputs. functools.partial stores references to its bound arguments, so bind only immutable values; lru_cache is transparent only when the wrapped function is pure.
- Laziness: Prefer iterators (islice, map) over lists. tee duplicates an iterator safely only when the copies are consumed at a similar pace, because it buffers every item that one copy has read and the other has not (bounded skew is required to avoid memory exhaustion).
- Semantics: Laws such as reduce associativity hold only if the operator is associative; property tests state this as a precondition.
- Integration: Replace custom combinators with stdlib equivalents in RAG.
- Mypy Config: --strict, with Callable typing.
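The bounded-skew requirement for tee in the protocol above can be illustrated with a minimal sketch. The variable names are illustrative only and are not part of the FuncPipe codebase.

```python
from itertools import tee

source = iter(range(6))
left, right = tee(source, 2)  # do not touch `source` directly after this point

# Lockstep consumption: each item is buffered only until the other copy reads it.
pairs = list(zip(left, right))

# Skewed consumption: draining one copy first forces tee to buffer every item
# for the copy that lags behind.
fast, slow = tee(iter(range(6)), 2)
total = sum(fast)      # `slow` has not advanced, so all 6 items are buffered
replayed = list(slow)  # the buffered items are replayed here
```

With an unbounded stream, the second pattern is the one that exhausts memory; the first keeps the buffer at a constant size.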

Audience: Engineers building FP pipelines who want stdlib-only solutions for composability without external deps.

Outcome:

  1. Use itertools/functools/operator/pathlib for FP patterns, handling gotchas.
  2. Refactor RAG to stdlib FP.
  3. Prove equivalence/laws with Hypothesis.


1. Laws & Invariants

| Law | Description | Enforcement |
|---|---|---|
| Associativity Law | reduce(op, xs) == manual left/right folds if op associative (e.g., add); precondition: op associative. | Property tests |
| Purity Law | Ops depend only on inputs; no globals/mutation. | Code reviews |
| Laziness Inv | Generators defer computation; no eager materialization. | Tests with islice |
| Equivalence Law | Stdlib refactors yield same outputs as custom versions. | Hypothesis equivalence |

These laws ensure stdlib FP integrates reliably.
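To make the associativity precondition concrete, here is a minimal sketch that compares a left fold with a right fold. fold_left and fold_right are hypothetical helpers written for this illustration; only functools.reduce and the operator functions come from the stdlib.

```python
import operator
from functools import reduce

def fold_left(op, xs):
    """((x0 op x1) op x2) ...: the grouping functools.reduce uses."""
    return reduce(op, xs)

def fold_right(op, xs):
    """x0 op (x1 op (x2 ...)): hypothetical helper, for comparison only."""
    xs = list(xs)
    return reduce(lambda acc, x: op(x, acc), reversed(xs[:-1]), xs[-1])

nums = [10, 3, 2]

# Addition is associative, so both groupings agree.
add_agrees = fold_left(operator.add, nums) == fold_right(operator.add, nums)

# Subtraction is not: (10 - 3) - 2 differs from 10 - (3 - 2).
sub_agrees = fold_left(operator.sub, nums) == fold_right(operator.sub, nums)
```

This is why the law carries a precondition: regrouping a fold (for example, to parallelize it) is only safe for associative operators.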


2. Decision Table

| Scenario | Need Laziness | Gotchas | Recommended |
|---|---|---|---|
| Merge streams | Yes | None | itertools.chain |
| Configure funcs | No | Mutable defaults | functools.partial |
| Fold aggregations | No | Op must be associative | functools.reduce |
| Extract attrs | No | None | operator.attrgetter |
| File glob/read | No | FS mutable | pathlib.Path.glob / Path.read_text |

Choose based on needs; mind gotchas like groupby requiring sorted input.
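The groupby gotcha is easy to reproduce. The sketch below uses a made-up list of category strings; it shows that groupby merges only adjacent equal keys, so unsorted input yields fragmented groups.

```python
from itertools import groupby

categories = ["cs.AI", "cs.CL", "cs.AI"]

# Unsorted: "cs.AI" shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(categories)]

# Sorted first: exactly one group per distinct key.
sorted_groups = {key: len(list(group)) for key, group in groupby(sorted(categories))}
```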


3. Public API (No Wrappers Needed; Direct Stdlib Use)

Repo alignment note (end-of-Module-09):
- This repo keeps the earlier Module 07–08 abstractions (ports/effects/async) intact.
- Module 09 does not delete them; it adds a stdlib-first layer for day-to-day pipeline work.
- For runnable examples, see src/funcpipe_rag/rag/stdlib_fp.py and src/funcpipe_rag/interop/stdlib_fp.py.


4. Reference Implementations

4.1 Itertools for Stream Transforms

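import operator
from collections.abc import Callable, Iterable, Iterator
from itertools import chain, groupby, tee, accumulate, filterfalse, compress, takewhile, dropwhile
from typing import TypeVar

T = TypeVar("T")  # RawDoc, CleanDoc, and CleanCfg come from earlier modules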

# Chain for merge
def merge_streams(*streams: Iterable[T]) -> Iterator[T]:
    return chain(*streams)

# Groupby with sort precondition
def group_docs(docs: Iterable[RawDoc], key: Callable[[RawDoc], str]) -> Iterator[tuple[str, Iterator[RawDoc]]]:
    # groupby merges only adjacent items, so sort by the same key first.
    # sorted() is eager (O(n) memory); for unsorted streams, accumulate into a dict
    # or accept fragmented groups.
    # Consume each group before advancing: all groups share one underlying iterator.
    sorted_docs = sorted(docs, key=key)
    return groupby(sorted_docs, key=key)

# Tee for multicast (careful consumption)
def multicast_stream(stream: Iterable[T]) -> tuple[Iterator[T], Iterator[T]]:
    return tee(stream, 2)  # Memory grows with skew between consumers; consume copies evenly (bounded skew required).

# Accumulate for running folds
def running_sum(nums: Iterable[int]) -> Iterator[int]:
    return accumulate(nums, operator.add)

# Filterfalse for negation
def filter_non_ai(docs: Iterable[RawDoc]) -> Iterator[RawDoc]:
    return filterfalse(lambda d: 'cs.AI' in d.categories, docs)  # A lambda is clearest here; prefer attrgetter for plain attribute access.

# Compress for masking
def masked_docs(docs: Iterable[RawDoc], mask: Iterable[bool]) -> Iterator[RawDoc]:
    return compress(docs, mask)

# Takewhile/dropwhile for fencing
def take_short_abstracts(abstracts: Iterable[str]) -> Iterator[str]:
    return takewhile(lambda s: len(s) < 512, abstracts)  # Lambda is fine: methodcaller('__len__') returns a callable, so it cannot be compared with 512 directly.
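The group_docs comment mentions a dict accumulator as the alternative to sorting. A minimal sketch of that idea follows; group_unsorted is a hypothetical helper name, and the sample strings are made up. It makes one pass over the input without sorting, but it still holds every item in memory, so it is an alternative to sorting rather than a streaming solution.

```python
from collections import defaultdict
from collections.abc import Callable, Hashable, Iterable
from typing import TypeVar

T = TypeVar("T")
K = TypeVar("K", bound=Hashable)

def group_unsorted(items: Iterable[T], key: Callable[[T], K]) -> dict[K, list[T]]:
    """Group items by key in one pass; input order within each group is preserved."""
    groups: defaultdict[K, list[T]] = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)  # a plain dict, so missing keys raise KeyError again

grouped = group_unsorted(
    ["cs.AI/1", "cs.CL/2", "cs.AI/3"],
    key=lambda s: s.split("/")[0],
)
```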

4.2 Functools for Higher-Order

import operator
from collections.abc import Callable, Iterable
from functools import partial, reduce, lru_cache, singledispatch
from itertools import starmap

# Partial for configurators
def partial_clean(cfg: CleanCfg) -> Callable[[RawDoc], CleanDoc]:
    return partial(clean_doc, cfg=cfg)

# Reduce for folds
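def reduce_embeddings(embs: Iterable[tuple[float, ...]]) -> tuple[float, ...]:
    it = iter(embs)
    try:
        first = next(it)
    except StopIteration:
        # A bare StopIteration would leak out (or become RuntimeError inside a generator).
        raise ValueError("reduce_embeddings() requires at least one embedding") from None
    # Element-wise sum. Precondition: same length; zip(strict=True) enforces it (Python 3.10+).
    return reduce(lambda a, b: tuple(starmap(operator.add, zip(a, b, strict=True))), it, first)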

# Lru_cache for memoization
@lru_cache(maxsize=128)
def cached_embed(text: str) -> tuple[float, ...]:
    return embed_chunk(text)  # Safe only if embed_chunk is pure; arguments must be hashable, and the tuple result cannot be mutated by callers.

# Singledispatch for polymorphism (at boundary; mutable registry)
@singledispatch
def process_doc(doc):
    raise NotImplementedError

@process_doc.register(RawDoc)
def _(doc: RawDoc):
    return clean_doc(doc)

@process_doc.register(CleanDoc)
def _(doc: CleanDoc):
    return chunk_doc(doc)
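The lru_cache behavior can be checked with cache_info(). The sketch below uses fake_embed, a hypothetical stand-in for embed_chunk, and a global call counter that exists only to instrument the example.

```python
from functools import lru_cache

calls = 0  # instrumentation only; a real pure function would not touch globals

@lru_cache(maxsize=128)
def fake_embed(text: str) -> tuple[float, ...]:
    """Hypothetical stand-in for embed_chunk: deterministic and hashable input."""
    global calls
    calls += 1
    return (float(len(text)), float(text.count(" ")))

first = fake_embed("hello world")
second = fake_embed("hello world")  # served from the cache; the body does not run again
info = fake_embed.cache_info()

# Unhashable arguments are rejected, because the cache key is built by hashing them.
unhashable_rejected = False
try:
    fake_embed(["not", "hashable"])
except TypeError:
    unhashable_rejected = True
```

Returning a tuple rather than a list matters here: the cache hands out the same object on every hit, so a mutable result could be corrupted by one caller for all others.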

4.3 Operator for Lambda Avoidance

from operator import itemgetter, attrgetter, methodcaller

# Itemgetter for extraction
get_title = itemgetter('title')

# Attrgetter for attrs
get_doc_id = attrgetter('doc_id')

# Methodcaller for methods
read_text = methodcaller('read_text')
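These getters are most useful as key or mapping functions. A minimal usage sketch follows; Doc is a hypothetical stand-in for RawDoc, and the sample rows are made up.

```python
from dataclasses import dataclass
from operator import attrgetter, itemgetter, methodcaller

@dataclass(frozen=True)
class Doc:  # hypothetical stand-in for RawDoc
    doc_id: str
    title: str

docs = [Doc("b2", "  Beta "), Doc("a1", "Alpha")]
rows = [{"title": "Beta", "year": 2021}, {"title": "Alpha", "year": 2023}]

by_id = sorted(docs, key=attrgetter("doc_id"))             # sort key without a lambda
titles = list(map(itemgetter("title"), rows))              # single key -> plain values
title_year = list(map(itemgetter("title", "year"), rows))  # several keys -> tuples
stripped = list(map(methodcaller("strip"), map(attrgetter("title"), docs)))
```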

4.4 Pathlib for Paths

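from collections.abc import Iterator
from pathlib import Path, PurePath

# Glob and read (effectful: touches the filesystem, so keep it at the pipeline boundary)
def read_csvs(dir_path: str) -> Iterator[str]:
    path = Path(dir_path)
    for csv in sorted(path.glob('*.csv')):  # glob order is not guaranteed; sort for determinism
        yield csv.read_text(encoding='utf-8')  # assumes UTF-8 files; avoids locale-dependent decoding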

# PurePath for no-IO
def pure_normalize(p: str) -> str:
    return str(PurePath(p))
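Pure paths support derivation without IO or mutation. The sketch below uses PurePosixPath so the results are the same on every platform; the paths themselves are made up.

```python
from pathlib import PurePosixPath

src = PurePosixPath("data/raw/papers.csv")

# Every operation returns a new path object; `src` itself never changes.
dst = src.with_suffix(".jsonl")
moved = PurePosixPath("data/clean") / dst.name
```

Because pure paths are immutable and hashable, they can be used as dict keys or lru_cache arguments without surprises.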

4.5 Integration in RAG

Refactor RAG cleaning to stdlib FP.

# RAG with stdlib
def rag_stdlib_clean(docs: Iterator[RawDoc]) -> Iterator[CleanDoc]:
    return map(partial(clean_doc, cfg=CleanCfg()), docs)
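To see that the stage stays lazy, here is a self-contained sketch. RawDoc, CleanDoc, CleanCfg, and clean_doc below are simplified, hypothetical stand-ins for the real types from earlier modules, and this variant takes cfg as a parameter instead of constructing it inside the function.

```python
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from functools import partial
from itertools import count, islice

@dataclass(frozen=True)
class RawDoc:  # hypothetical stand-in
    doc_id: str
    abstract: str

@dataclass(frozen=True)
class CleanDoc:  # hypothetical stand-in
    doc_id: str
    abstract: str

@dataclass(frozen=True)
class CleanCfg:  # hypothetical stand-in
    lowercase: bool = True

def clean_doc(doc: RawDoc, *, cfg: CleanCfg) -> CleanDoc:
    text = " ".join(doc.abstract.split())  # collapse runs of whitespace
    return CleanDoc(doc.doc_id, text.lower() if cfg.lowercase else text)

def rag_stdlib_clean(docs: Iterable[RawDoc], cfg: CleanCfg) -> Iterator[CleanDoc]:
    return map(partial(clean_doc, cfg=cfg), docs)

# An endless input stream: only the two requested items are ever cleaned.
endless = (RawDoc(f"doc-{n}", "  Some   ABSTRACT ") for n in count())
first_two = list(islice(rag_stdlib_clean(endless, CleanCfg()), 2))
```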

4.6 Before/After Refactor

# Before: Custom merge
def custom_merge(a, b): return list(a) + list(b)
# After: Stdlib
def std_merge(a, b): return chain(a, b)
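The two versions agree on finite inputs but differ in two ways worth knowing: chain is lazy, so it accepts unbounded streams, and it returns a single-pass iterator rather than a list. A minimal sketch:

```python
from itertools import chain, count, islice

def custom_merge(a, b):
    return list(a) + list(b)

def std_merge(a, b):
    return chain(a, b)

# Same elements on finite inputs.
same = list(std_merge([1, 2], [3])) == custom_merge([1, 2], [3])

# Lazy: works on an infinite first stream (custom_merge would never return).
head = list(islice(std_merge(count(0), [99]), 3))

# Single-pass: a second traversal of the same chain object yields nothing.
merged = std_merge([1], [2])
first_pass = list(merged)
second_pass = list(merged)
```

Callers that need to traverse the result twice should materialize it explicitly with list(), which keeps the cost visible at the call site.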

5. Property-Based Proofs (repo tests)

Runnable properties live in tests/unit/interop/test_stdlib_fp.py and tests/unit/rag/test_stdlib_fp.py.

from hypothesis import assume, given, strategies as st
import operator
from functools import partial, reduce
from itertools import chain
from pathlib import PurePath

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_chain_equiv(a, b):
    assert list(chain(a, b)) == a + b

@given(st.integers(1, 10))
def test_partial_equiv(n):
    add_n = partial(operator.add, n)
    assert add_n(5) == n + 5

@given(st.lists(st.integers(), min_size=3))
def test_reduce_associativity(xs):
    assume(len(xs) >= 3)
    a, b, c = xs[:3]
    assert operator.add(operator.add(a, b), c) == operator.add(a, operator.add(b, c))
    assert reduce(operator.add, xs) == sum(xs)

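@given(st.text(min_size=1, alphabet=st.characters(blacklist_categories=['Cc'])))  # Exclude control characters (including NUL)
def test_path_immutable(s):
    p = PurePath(s)
    normalized = str(p)
    # PurePath normalizes its input (collapses repeated separators, drops "." segments
    # and trailing slashes), so str(p) == s does not hold in general. What should hold:
    assert str(PurePath(normalized)) == normalized  # normalization is idempotent
    assert p == PurePath(s) and hash(p) == hash(PurePath(s))  # value semantics, hashable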


6. Runtime Preservation Guarantee

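The stdlib refactors are not expected to cost runtime: in CPython, itertools and operator are implemented in C, and chain, map, and islice process one item at a time instead of building intermediate lists. This is verified via benchmarks.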


7. Anti-Patterns & Immediate Fixes

| Anti-Pattern | Symptom | Fix |
|---|---|---|
| List materialization | Memory blowup | Use generators/islice |
| groupby unsorted | Wrong groups | Sort input first |
| tee unbalanced | Memory exhaustion | Consume copies evenly |
| Mutable defaults | Shared state bugs | Avoid in partial |
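The mutable-defaults row can be reproduced in a few lines. In this sketch, add_tag and add_tag_pure are hypothetical functions; the point is that partial stores a reference to the bound object, so a mutable binding is shared across every call.

```python
from functools import partial

def add_tag(doc_id: str, tags: list[str]) -> list[str]:
    tags.append(doc_id)  # mutates its argument: the source of the bug
    return tags

shared = partial(add_tag, tags=[])  # one list, bound once, reused on every call
a = shared("d1")
b = shared("d2")  # sees the "d1" left behind by the previous call

def add_tag_pure(doc_id: str, tags: tuple[str, ...] = ()) -> tuple[str, ...]:
    return (*tags, doc_id)  # builds a new tuple; nothing is shared

safe = partial(add_tag_pure, tags=())
c = safe("d1")
d = safe("d2")
```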

8. Pre-Core Quiz

  1. itertools for…? → Iterator transforms
  2. functools for…? → HOFs like partial
  3. operator for…? → Lambda avoidance
  4. pathlib for…? → Immutable paths
  5. Why stdlib? → Portability, no deps

9. Post-Core Exercise

  1. Refactor custom to stdlib FP.
  2. Prove equivalence with Hypothesis.

Pipeline Usage (Idiomatic)

merged = chain(stream1, stream2)
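A slightly longer sketch of the same idiom, with made-up string streams, shows that downstream stages pull only what they need from the merged stream:

```python
from itertools import chain, islice

stream1 = iter(["a", "", "b"])
stream2 = iter(["", "c", "d"])

merged = chain(stream1, stream2)
non_empty = filter(None, merged)             # drop falsy items (empty strings)
first_three = list(islice(non_empty, 3))

# "d" was never pulled through the pipeline, so it is still in stream2.
leftover = next(stream2)
```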

Next Core: 2. FP Helper Libraries – toolz / returns in Real Pipelines