# M04C03: Memoization & Caching of Pure Functions (lru_cache, Custom & Persistent Caches)
## Progression Note
By the end of Module 4, you will master safe recursion over unpredictable tree-shaped data, monoidal folds as the universal recursion pattern, Result/Option for streaming error handling, validation aggregators, retries, and structured error reporting — all while preserving laziness, equational reasoning, and constant call-stack usage.
Here's a snippet from the progression map:
| Module | Focus | Key Outcomes |
|---|---|---|
| 3 | Lazy Iteration & Generators | Memory-efficient streaming, itertools mastery, short-circuiting, observability |
| 4 | Safe Recursion & Error Handling in Streams | Stack-safe tree recursion, folds, Result/Option, streaming validation/retries/reports |
| 5 | Advanced Type-Driven Design | ADTs, exhaustive pattern matching, total functions, refined types |
Core question:
How do you safely memoize pure but expensive computations (e.g., embedding, hashing, parsing) so that repeated or recursive calls on identical inputs become O(1), while guaranteeing the cache is observationally invisible and never breaks purity or determinism?
We now take the TreeDoc hierarchy from M04C01–C02 and ask the most common performance question in real RAG systems:
“We have 100 000 nodes but only ~8 000 unique text values scattered throughout the tree. Computing the embedding (or deterministic content hash) for each one costs 50–200 ms. How do we avoid doing the same work millions of times?”
The naïve solution is to just call the expensive pure function everywhere:
```python
def embed_all_naive(tree: TreeDoc) -> None:
    embedding = expensive_embed(tree.node.text)  # 50–200 ms
    for child in tree.children:
        embed_all_naive(child)
```
Total time → 100 000 calls × 50–200 ms ≈ 1.4–5.6 hours of wasted CPU.
The production solution must:
- be pure and deterministic,
- turn repeated calls on identical inputs into O(1) cache hits,
- support in-process LRU and optional disk persistence,
- remain observationally invisible (cached_f(x) == f(x) always),
- work seamlessly with folds and lazy pipelines.
This is exactly what safe memoization gives us.
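As a preview, here is a minimal sketch of the memoized traversal using only the stdlib decorator (assuming the same `TreeDoc` and `expensive_embed` names as the naïve example; the reusable, production-grade variants arrive in §4):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded is tolerable here only because unique texts ≈ 8 000
def embed_cached(text: str):
    return expensive_embed(text)  # pure, so caching is observationally invisible

def embed_all_memoized(tree: TreeDoc) -> None:
    embed_cached(tree.node.text)  # first occurrence computes; duplicates are O(1) hits
    for child in tree.children:
        embed_all_memoized(child)
```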
Audience: Engineers who have expensive pure functions (embedding models, parsers, hashers, normalizers) called repeatedly in recursive or streaming contexts and refuse to waste CPU on recomputation.
Outcome:
1. You will apply lru_cache_custom or memoize_keyed to any pure function and prove via instrumentation that recomputations drop to the number of unique inputs.
2. You will build bounded or persistent caches when the stdlib decorator is insufficient.
3. You will ship memoized operations that are observationally pure and integrate cleanly with folds and lazy streams.
We formalise exactly what we want from a correct, production-ready cache: observational purity, hit/miss correctness, bounded memory when required, and optional persistence.
## Concrete Motivating Example
Same deep Markdown-derived tree from M04C01–C02, now with heavy duplication (boilerplate footers, repeated section intros, identical tables):
```
Root (title) → unique
├── Section 1 → "Introduction..." (repeated 500× across docs)
│   ├── Subsection 1.1 → unique
│   └── Subsection 1.2 → "Introduction..." again
└── Section 2 → "Introduction..." again
    └── Subsection 2.1 → "Introduction..." again
        └── … (2000 levels deep) → every leaf has identical legal footer
```
Desired behaviour:
- First occurrence of a given normalized text → compute expensive embedding/hash.
- All subsequent identical texts → instant cache hit.
- Total unique texts ≈ 8 000 → compute time drops from hours (100 000 × 50–200 ms) to minutes (8 000 × 50–200 ms), roughly a 12× saving.
## 1. Laws & Invariants (machine-checked where possible)
| Law | Formal Statement | Enforcement |
|---|---|---|
| Observational Purity | cached_f(x) == uncached_f(x) for all x; caching introduces no observable side effects or nondeterminism. | Hypothesis test_observational_purity. |
| Hit/Miss Correctness | First call with key k → compute + store; subsequent calls with k → return cached value without recomputation. | Instrumented property test_hit_miss_behavior. |
| Thread-Safety (best-effort) | Concurrent calls on the same cache instance do not corrupt data (no lost updates, no exceptions). | Stress test with concurrent threads (omitted for brevity; present in repo). |
| Bounded Memory (LRU) | Cache never exceeds configured size; eviction preserves purity (recomputation on miss). | Property test fills cache → verifies size and correctness after eviction. |
| Persistence (DiskCache) | After process restart with identical namespace/version, previously stored values are retrievable atomically. | Property test writes → restarts process → reads (omitted for brevity; present in repo). |
These laws guarantee the cache is a pure optimisation — completely invisible to correct code.
## 2. Decision Table – Which Cache Do You Actually Use?
| Scenario | Keys Hashable? | Bounded Size? | Persistence? | Recommended Variant |
|---|---|---|---|---|
| Simple pure function, in-process | Yes | Yes | No | @lru_cache_custom(maxsize=...) |
| Complex/unhashable keys or custom logic | Via key_fn | Optional | No | memoize_keyed(key_fn, maxsize=...) (pure Python LRU or unbounded) |
| Need disk persistence | Yes | Unbounded on disk | Yes | DiskCache(dir, namespace, version) |
- Never memoize impure functions.
- Never use unbounded in-process caches on unbounded inputs.
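To make the key_fn row concrete, a minimal sketch using the `memoize_keyed` decorator defined in §4.2 below (the `join_normalized` function is illustrative):

```python
# Lists are unhashable, so @lru_cache_custom would raise TypeError;
# an explicit key_fn projects the argument onto a hashable key instead.
@memoize_keyed(lambda tokens: tuple(tokens), maxsize=1024)
def join_normalized(tokens: list[str]) -> str:
    return " ".join(t.strip().lower() for t in tokens)
```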
## 3. Public API Surface (end-of-Module-04 refactor note)
Refactor note: memoization utilities live in funcpipe_rag.policies.memo (src/funcpipe_rag/policies/memo.py).
They are also re-exported from funcpipe_rag.api.core and the top-level package funcpipe_rag.
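Per that note, all of the following imports should resolve to the same objects:

```python
from funcpipe_rag.policies.memo import lru_cache_custom, memoize_keyed, DiskCache
from funcpipe_rag.api.core import lru_cache_custom   # re-export
from funcpipe_rag import memoize_keyed               # top-level re-export
```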
## 4. Reference Implementations
### 4.1 Simple LRU (for directly hashable arguments)
```python
from functools import lru_cache as _lru_cache
from typing import Callable, Optional, TypeVar

V = TypeVar("V")

def lru_cache_custom(
    maxsize: Optional[int] = 512,
) -> Callable[[Callable[..., V]], Callable[..., V]]:
    """Drop-in replacement for functools.lru_cache with sensible defaults."""
    return _lru_cache(maxsize=maxsize)
```
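Usage is identical to `functools.lru_cache`, and `cache_info()` comes along for free (`slow_square` is an illustrative stand-in for real expensive work):

```python
@lru_cache_custom(maxsize=256)
def slow_square(x: int) -> int:
    return x * x  # stand-in for a 50–200 ms computation

slow_square(3)
slow_square(3)  # identical input → served from cache
print(slow_square.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=256, currsize=1)
```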
### 4.2 Flexible Keyed Memoizer (pure Python, correct semantics)
```python
from __future__ import annotations

import functools
import threading
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Callable, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")

@dataclass
class CacheInfo:
    hits: int = 0
    misses: int = 0
    evictions: int = 0  # only meaningful for the bounded variant

def memoize_keyed(
    key_fn: Callable[..., K],
    *,
    maxsize: Optional[int] = None,  # entry-bounded pure-Python LRU; None = unbounded
) -> Callable[[Callable[..., V]], Callable[..., V]]:
    """
    Memoize using an explicit key_fn.

    Calls fn(*args, **kwargs) while caching by key_fn(*args, **kwargs).
    Thread-safe. cache_info() returns best-effort metrics.
    """
    info = CacheInfo()
    lock = threading.RLock()

    if maxsize is not None:
        cache: OrderedDict[K, V] = OrderedDict()

        def decorator(fn: Callable[..., V]) -> Callable[..., V]:
            @functools.wraps(fn)
            def wrapped(*args: Any, **kwargs: Any) -> V:
                k = key_fn(*args, **kwargs)
                with lock:
                    if k in cache:
                        info.hits += 1
                        cache.move_to_end(k)  # mark as most recently used
                        return cache[k]
                    info.misses += 1
                v = fn(*args, **kwargs)  # compute outside the lock
                with lock:
                    if k not in cache and len(cache) >= maxsize:
                        cache.popitem(last=False)  # evict least recently used
                        info.evictions += 1
                    cache[k] = v
                    cache.move_to_end(k)
                return v

            wrapped.cache_info = lambda: info
            wrapped.cache_clear = cache.clear
            return wrapped

        return decorator
    else:
        cache = {}  # dict[K, V]; no eviction when unbounded

        def decorator(fn: Callable[..., V]) -> Callable[..., V]:
            @functools.wraps(fn)
            def wrapped(*args: Any, **kwargs: Any) -> V:
                k = key_fn(*args, **kwargs)
                with lock:
                    if k in cache:
                        info.hits += 1
                        return cache[k]
                v = fn(*args, **kwargs)  # compute outside the lock
                with lock:
                    if k not in cache:  # another thread may have raced us here
                        cache[k] = v
                        info.misses += 1
                return v

            wrapped.cache_info = lambda: info
            wrapped.cache_clear = cache.clear
            return wrapped

        return decorator
```
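A typical use on the TreeDoc workload keys by normalised text, so whitespace/casing variants of the same content still hit the cache (`expensive_embed` is the stand-in from the motivating example). Note this is only observationally pure if `expensive_embed` itself normalises its input the same way — exactly the sync requirement §4.4 warns about:

```python
@memoize_keyed(lambda text: " ".join(text.strip().lower().split()), maxsize=8192)
def embed_text(text: str):
    return expensive_embed(text)  # pure; 50–200 ms on a miss

embed_text("Introduction...")
embed_text("  INTRODUCTION...  ")  # same normalised key → cache hit
assert embed_text.cache_info().hits == 1
```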
### 4.3 Persistent Atomic Disk Cache
```python
import hashlib
import os
from pathlib import Path
from typing import Optional

class DiskCache:
    """
    Persistent, atomic, versioned disk cache.

    Keys are strings; values are arbitrary bytes (caller handles serialization).
    Bumping version is the official cache invalidation strategy.
    """

    def __init__(self, dirpath: str, namespace: str = "default", version: str = "v1"):
        self.dir = Path(dirpath)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.prefix = f"{namespace}-{version}-"

    def _path(self, key: str) -> Path:
        h = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.dir / f"{self.prefix}{h}.bin"

    def get(self, key: str) -> Optional[bytes]:
        p = self._path(key)
        return p.read_bytes() if p.exists() else None

    def set(self, key: str, value: bytes) -> None:
        p = self._path(key)
        tmp = p.with_suffix(".tmp")
        tmp.write_bytes(value)
        os.replace(tmp, p)  # atomic on POSIX/Windows
```
### 4.4 Deterministic Content-Based Key (must stay in sync with normalisation)
```python
import hashlib

def content_hash_key(
    chunk: ChunkWithoutEmbedding,
    *,
    norm_version: str = "v1",
) -> str:
    """
    Must be kept exactly in sync with the cleaning/normalisation pipeline
    (currently " ".join(text.strip().lower().split())).
    Changing normalisation without bumping norm_version will cause silent bugs.
    """
    norm_text = " ".join(chunk.text.strip().lower().split())
    h = hashlib.blake2b(digest_size=32)
    h.update(norm_version.encode())
    h.update(b"\x00")  # domain separator between version and content
    h.update(norm_text.encode("utf-8"))
    return h.hexdigest()
```
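Putting §4.3 and §4.4 together — a sketch of a restart-surviving embedding cache (`expensive_embed` is the stand-in from earlier; pickle is an illustrative serialisation choice, not a prescribed one):

```python
import pickle

disk = DiskCache("/tmp/funcpipe-cache", namespace="embeddings", version="v1")

def embed_with_disk_cache(chunk: ChunkWithoutEmbedding):
    key = content_hash_key(chunk, norm_version="v1")
    cached = disk.get(key)
    if cached is not None:
        return pickle.loads(cached)      # hit: survives process restarts
    value = expensive_embed(chunk.text)  # miss: compute exactly once per content hash
    disk.set(key, pickle.dumps(value))   # atomic write via os.replace
    return value
```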
## 5. Property-Based Proofs (tests/test_memo.py)
```python
import hashlib

from hypothesis import given, strategies as st

from funcpipe_rag import lru_cache_custom, memoize_keyed  # re-exported per §3
# tree_strategy, flatten, fold_tree come from the M04C01–C02 test helpers


@given(inputs=st.lists(st.integers(), min_size=100, max_size=1000, unique=False))
def test_observational_purity(inputs):
    def expensive(x: int) -> int:
        return x ** 3 + x ** 2 + x  # pure but "expensive"

    cached = lru_cache_custom(maxsize=None)(expensive)
    uncached_results = [expensive(x) for x in inputs]
    cached_results = [cached(x) for x in inputs]
    assert cached_results == uncached_results

    # Separate dedup check
    calls = 0

    def counted(x: int) -> int:
        nonlocal calls
        calls += 1
        return expensive(x)

    cached2 = lru_cache_custom(maxsize=None)(counted)
    [cached2(x) for x in inputs]
    assert calls == len(set(inputs))


@given(inputs=st.lists(st.text(), min_size=100, max_size=1000, unique=False))
def test_keyed_memo_hit_miss(inputs):
    calls = 0

    def expensive(s: str) -> int:
        nonlocal calls
        calls += 1
        return len(s) * 17

    memo = memoize_keyed(lambda s: s, maxsize=None)(expensive)
    [memo(s) for s in inputs]
    unique = len(set(inputs))
    assert calls == unique

    calls_before = calls
    [memo(s) for s in inputs]
    assert calls == calls_before


@given(tree=tree_strategy())
def test_memoized_fold_dedup(tree):
    calls = 0

    def expensive_hash(text: str) -> str:
        nonlocal calls
        calls += 1
        return hashlib.sha256(text.encode()).hexdigest()

    memo_hash = memoize_keyed(lambda t: t)(expensive_hash)
    unique_texts = len({n.node.text for n in flatten(tree)})
    fold_tree(tree, None, lambda _, n, d, p: memo_hash(n.node.text))
    assert calls == unique_texts
```
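The bounded-memory law from §1 can be checked the same way. A sketch of that property (the repo version may differ in detail; single-threaded, every miss on a full cache evicts exactly one entry):

```python
@given(inputs=st.lists(st.integers(min_value=0, max_value=50), min_size=200, max_size=500))
def test_bounded_lru_eviction(inputs):
    memo = memoize_keyed(lambda x: x, maxsize=8)(lambda x: x * 2)
    for x in inputs:
        assert memo(x) == x * 2  # correctness is preserved across evictions
    info = memo.cache_info()
    assert info.hits + info.misses == len(inputs)
    assert info.evictions == max(0, info.misses - 8)
```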
## 6. Big-O & Allocation Guarantees
| Variant | Lookup Time | Memory (peak) | Persistence | Notes |
|---|---|---|---|---|
| lru_cache_custom | O(1) | O(maxsize) | No | Fastest, requires hashable args |
| memoize_keyed (unbounded) | O(1) | O(unique keys) | No | Thread-safe |
| memoize_keyed (bounded) | O(1) | O(maxsize) | No | Pure Python OrderedDict LRU |
| DiskCache | O(1) + I/O | O(disk usage) | Yes | Atomic, versioned invalidation |
All variants are observationally pure.
## 7. Anti-Patterns & Immediate Fixes
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Memoizing impure functions | Nondeterministic results | Only memoize pure functions |
| Unbounded in-process cache | Memory leak | Use bounded variant |
| Mutable objects as keys | Hash/error bugs | Use explicit key_fn |
| No version in persistent keys | Stale cache after code change | Include version string in key |
## 8. Pre-Core Quiz
- Memoization requires…? → Pure, deterministic functions
- Observational purity means…? → cached_f(x) == f(x) always
- Custom key needed when…? → Args not hashable or need normalisation
- Persistent cache for…? → Cross-process or cross-run sharing
- Prove cache correctness? → Observational purity + hit/miss instrumentation
## 9. Post-Core Exercise
- Take your current embedding function → memoize with `lru_cache_custom` → add hit/miss logging → run on a real dataset → measure speedup.
- Memoize a fold over `TreeDoc` that computes per-node embeddings → prove recomputations == unique texts.
- Add `DiskCache` for content-based deduplication hashes that survive restarts.
- Find an expensive pure function in your codebase → memoize with `memoize_keyed` → add an equivalence property.
Next: M04C04 – Result/Option Types for Per-Record Failures (Simple FP Error Containers).
You now have the complete toolkit to make any pure computation effectively O(unique inputs) — even across process restarts. The rest of Module 4 is about doing it safely when things go wrong.