# M04C03: Memoization & Caching of Pure Functions (lru_cache, Custom & Persistent Caches)
## Progression Note
By the end of Module 4, you will master safe recursion over unpredictable tree-shaped data, monoidal folds as the universal recursion pattern, Result/Option for streaming error handling, validation aggregators, retries, and structured error reporting — all while preserving laziness, equational reasoning, and constant call-stack usage.
Here's a snippet from the progression map:
| Module | Focus | Key Outcomes |
|---|---|---|
| 3 | Lazy Iteration & Generators | Memory-efficient streaming, itertools mastery, short-circuiting, observability |
| 4 | Safe Recursion & Error Handling in Streams | Stack-safe tree recursion, folds, Result/Option, streaming validation/retries/reports |
| 5 | Advanced Type-Driven Design | ADTs, exhaustive pattern matching, total functions, refined types |
Core question:
How do you safely memoize pure but expensive computations (e.g., embedding, hashing, parsing) so that repeated or recursive calls on identical inputs become O(1), while guaranteeing the cache is observationally invisible and never breaks purity or determinism?
We now take the TreeDoc hierarchy from M04C01–C02 and ask the most common performance question in real RAG systems:
“We have 100 000 nodes but only ~8 000 unique text values scattered throughout the tree. Computing the embedding (or deterministic content hash) for each one costs 50–200 ms. How do we avoid doing the same work millions of times?”
The naïve solution is to just call the expensive pure function everywhere:
```python
def embed_all_naive(tree: TreeDoc) -> None:
    embedding = expensive_embed(tree.node.text)  # 50–200 ms
    for child in tree.children:
        embed_all_naive(child)
```
Total time → 100 000 calls × 50–200 ms ≈ 1.4–5.6 hours of wasted CPU.
The production solution must:
- be pure and deterministic,
- turn repeated calls on identical inputs into O(1) cache hits,
- support in-process LRU and optional disk persistence,
- remain observationally invisible (cached_f(x) == f(x) always),
- work seamlessly with folds and lazy pipelines.
This is exactly what safe memoization gives us.
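As a preview, here is a minimal sketch of the memoized traversal using only the stdlib decorator (assuming the same `TreeDoc` and `expensive_embed` names as the naïve example; the reusable, production-grade variants arrive in §4):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded is tolerable here only because unique texts ≈ 8 000
def embed_cached(text: str):
    return expensive_embed(text)  # pure, so caching is observationally invisible

def embed_all_memoized(tree: TreeDoc) -> None:
    embed_cached(tree.node.text)  # first occurrence computes; duplicates are O(1) hits
    for child in tree.children:
        embed_all_memoized(child)
```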
Audience: Engineers who have expensive pure functions (embedding models, parsers, hashers, normalizers) called repeatedly in recursive or streaming contexts and refuse to waste CPU on recomputation.
Outcome:
1. You will apply lru_cache_custom or memoize_keyed to any pure function and prove via instrumentation that recomputations drop to the number of unique inputs.
2. You will build bounded or persistent caches when the stdlib decorator is insufficient.
3. You will ship memoized operations that are observationally pure and integrate cleanly with folds and lazy streams.
We formalise exactly what we want from a correct, production-ready cache: observational purity, hit/miss correctness, bounded memory when required, and optional persistence.
## Concrete Motivating Example
Same deep Markdown-derived tree from M04C01–C02, now with heavy duplication (boilerplate footers, repeated section intros, identical tables):
```
Root (title) → unique
├── Section 1 → "Introduction..." (repeated 500× across docs)
│   ├── Subsection 1.1 → unique
│   └── Subsection 1.2 → "Introduction..." again
└── Section 2 → "Introduction..." again
    └── Subsection 2.1 → "Introduction..." again
        └── … (2000 levels deep) → every leaf has identical legal footer
```
Desired behaviour:
- First occurrence of a given normalized text → compute expensive embedding/hash.
- All subsequent identical texts → instant cache hit.
- Total unique texts ≈ 8 000 → compute time drops from hours (100 000 × 50–200 ms) to minutes (8 000 × 50–200 ms), roughly a 12× saving.
## 1. Laws & Invariants (machine-checked where possible)
| Law | Formal Statement | Enforcement |
|---|---|---|
| Observational Purity | cached_f(x) == uncached_f(x) for all x; caching introduces no observable side effects or nondeterminism. | Hypothesis test_observational_purity. |
| Hit/Miss Correctness | First call with key k → compute + store; subsequent calls with k → return cached value without recomputation. | Instrumented property test_hit_miss_behavior. |
| Thread-Safety (best-effort) | Concurrent calls on the same cache instance do not corrupt data (no lost updates, no exceptions). | Stress test with concurrent threads (omitted for brevity; present in repo). |
| Bounded Memory (LRU) | Cache never exceeds configured size; eviction preserves purity (recomputation on miss). | Property test fills cache → verifies size and correctness after eviction. |
| Persistence (DiskCache) | After process restart with identical namespace/version, previously stored values are retrievable atomically. | Property test writes → restarts process → reads (omitted for brevity; present in repo). |
These laws guarantee the cache is a pure optimisation — completely invisible to correct code.
## 2. Decision Table – Which Cache Do You Actually Use?
| Scenario | Keys Hashable? | Bounded Size? | Persistence? | Recommended Variant |
|---|---|---|---|---|
| Simple pure function, in-process | Yes | Yes | No | @lru_cache_custom(maxsize=...) |
| Complex/unhashable keys or custom logic | Via key_fn | Optional | No | memoize_keyed(key_fn, maxsize=...) (pure Python LRU or unbounded) |
| Need disk persistence | Yes | Unbounded on disk | Yes | DiskCache(dir, namespace, version) |
- Never memoize impure functions.
- Never use unbounded in-process caches on unbounded inputs.
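To make the key_fn row concrete, a minimal sketch using the `memoize_keyed` decorator defined in §4.2 below (the `join_normalized` function is illustrative):

```python
# Lists are unhashable, so @lru_cache_custom would raise TypeError;
# an explicit key_fn projects the argument onto a hashable key instead.
@memoize_keyed(lambda tokens: tuple(tokens), maxsize=1024)
def join_normalized(tokens: list[str]) -> str:
    return " ".join(t.strip().lower() for t in tokens)
```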
## 3. Public API Surface (end-of-Module-04 refactor note)
Refactor note: memoization utilities live in funcpipe_rag.policies.memo (src/funcpipe_rag/policies/memo.py).
They are also re-exported from funcpipe_rag.api.core and the top-level package funcpipe_rag.
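Per that note, all of the following imports should resolve to the same objects:

```python
from funcpipe_rag.policies.memo import lru_cache_custom, memoize_keyed, DiskCache
from funcpipe_rag.api.core import lru_cache_custom   # re-export
from funcpipe_rag import memoize_keyed               # top-level re-export
```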
## 4. Reference Implementations
### 4.1 Simple LRU (for directly hashable arguments)
```python
from functools import lru_cache as _lru_cache
from typing import Callable, Optional, TypeVar

V = TypeVar("V")

def lru_cache_custom(
    maxsize: Optional[int] = 512,
) -> Callable[[Callable[..., V]], Callable[..., V]]:
    """Drop-in replacement for functools.lru_cache with sensible defaults."""
    return _lru_cache(maxsize=maxsize)
```
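Usage is identical to `functools.lru_cache`, and `cache_info()` comes along for free (`slow_square` is an illustrative stand-in for real expensive work):

```python
@lru_cache_custom(maxsize=256)
def slow_square(x: int) -> int:
    return x * x  # stand-in for a 50–200 ms computation

slow_square(3)
slow_square(3)  # identical input → served from cache
print(slow_square.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=256, currsize=1)
```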
### 4.2 Flexible Keyed Memoizer (pure Python, correct semantics)
```python
from __future__ import annotations

import functools
import threading
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Callable, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")

@dataclass
class CacheInfo:
    hits: int = 0
    misses: int = 0
    evictions: int = 0  # only meaningful for the bounded variant

def memoize_keyed(
    key_fn: Callable[..., K],
    *,
    maxsize: Optional[int] = None,  # entry-bounded pure-Python LRU; None = unbounded
) -> Callable[[Callable[..., V]], Callable[..., V]]:
    """
    Memoize using an explicit key_fn.

    Calls fn(*args, **kwargs) while caching by key_fn(*args, **kwargs).
    Thread-safe. cache_info() returns best-effort metrics.
    """
    info = CacheInfo()
    lock = threading.RLock()

    if maxsize is not None:
        cache: OrderedDict[K, V] = OrderedDict()

        def decorator(fn: Callable[..., V]) -> Callable[..., V]:
            @functools.wraps(fn)
            def wrapped(*args: Any, **kwargs: Any) -> V:
                k = key_fn(*args, **kwargs)
                with lock:
                    if k in cache:
                        info.hits += 1
                        cache.move_to_end(k)  # mark as most recently used
                        return cache[k]
                    info.misses += 1
                v = fn(*args, **kwargs)  # compute outside the lock
                with lock:
                    if k not in cache and len(cache) >= maxsize:
                        cache.popitem(last=False)  # evict least recently used
                        info.evictions += 1
                    cache[k] = v
                    cache.move_to_end(k)
                return v

            wrapped.cache_info = lambda: info
            wrapped.cache_clear = cache.clear
            return wrapped

        return decorator
    else:
        cache = {}  # dict[K, V]; no eviction when unbounded

        def decorator(fn: Callable[..., V]) -> Callable[..., V]:
            @functools.wraps(fn)
            def wrapped(*args: Any, **kwargs: Any) -> V:
                k = key_fn(*args, **kwargs)
                with lock:
                    if k in cache:
                        info.hits += 1
                        return cache[k]
                v = fn(*args, **kwargs)  # compute outside the lock
                with lock:
                    if k not in cache:  # another thread may have raced us here
                        cache[k] = v
                        info.misses += 1
                return v

            wrapped.cache_info = lambda: info
            wrapped.cache_clear = cache.clear
            return wrapped

        return decorator
```
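A typical use on the TreeDoc workload keys by normalised text, so whitespace/casing variants of the same content still hit the cache (`expensive_embed` is the stand-in from the motivating example). Note this is only observationally pure if `expensive_embed` itself normalises its input the same way — exactly the sync requirement §4.4 warns about:

```python
@memoize_keyed(lambda text: " ".join(text.strip().lower().split()), maxsize=8192)
def embed_text(text: str):
    return expensive_embed(text)  # pure; 50–200 ms on a miss

embed_text("Introduction...")
embed_text("  INTRODUCTION...  ")  # same normalised key → cache hit
assert embed_text.cache_info().hits == 1
```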
### 4.3 Persistent Atomic Disk Cache
```python
import hashlib
import os
from pathlib import Path
from typing import Optional

class DiskCache:
    """
    Persistent, atomic, versioned disk cache.

    Keys are strings; values are arbitrary bytes (caller handles serialization).
    Bumping version is the official cache invalidation strategy.
    """

    def __init__(self, dirpath: str, namespace: str = "default", version: str = "v1"):
        self.dir = Path(dirpath)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.prefix = f"{namespace}-{version}-"

    def _path(self, key: str) -> Path:
        h = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.dir / f"{self.prefix}{h}.bin"

    def get(self, key: str) -> Optional[bytes]:
        p = self._path(key)
        return p.read_bytes() if p.exists() else None

    def set(self, key: str, value: bytes) -> None:
        p = self._path(key)
        tmp = p.with_suffix(".tmp")
        tmp.write_bytes(value)
        os.replace(tmp, p)  # atomic on POSIX/Windows
```
### 4.4 Deterministic Content-Based Key (must stay in sync with normalisation)
```python
import hashlib

def content_hash_key(
    chunk: ChunkWithoutEmbedding,
    *,
    norm_version: str = "v1",
) -> str:
    """
    Must be kept exactly in sync with the cleaning/normalisation pipeline
    (currently " ".join(text.strip().lower().split())).
    Changing normalisation without bumping norm_version will cause silent bugs.
    """
    norm_text = " ".join(chunk.text.strip().lower().split())
    h = hashlib.blake2b(digest_size=32)
    h.update(norm_version.encode())
    h.update(b"\x00")  # domain separator between version and content
    h.update(norm_text.encode("utf-8"))
    return h.hexdigest()
```
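Putting §4.3 and §4.4 together — a sketch of a restart-surviving embedding cache (`expensive_embed` is the stand-in from earlier; pickle is an illustrative serialisation choice, not a prescribed one):

```python
import pickle

disk = DiskCache("/tmp/funcpipe-cache", namespace="embeddings", version="v1")

def embed_with_disk_cache(chunk: ChunkWithoutEmbedding):
    key = content_hash_key(chunk, norm_version="v1")
    cached = disk.get(key)
    if cached is not None:
        return pickle.loads(cached)      # hit: survives process restarts
    value = expensive_embed(chunk.text)  # miss: compute exactly once per content hash
    disk.set(key, pickle.dumps(value))   # atomic write via os.replace
    return value
```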
## 5. Property-Based Proofs (tests/test_memo.py)
```python
import hashlib

from hypothesis import given, strategies as st

from funcpipe_rag import lru_cache_custom, memoize_keyed  # re-exported per §3
# tree_strategy, flatten, fold_tree come from the M04C01–C02 test helpers


@given(inputs=st.lists(st.integers(), min_size=100, max_size=1000, unique=False))
def test_observational_purity(inputs):
    def expensive(x: int) -> int:
        return x ** 3 + x ** 2 + x  # pure but "expensive"

    cached = lru_cache_custom(maxsize=None)(expensive)
    uncached_results = [expensive(x) for x in inputs]
    cached_results = [cached(x) for x in inputs]
    assert cached_results == uncached_results

    # Separate dedup check
    calls = 0

    def counted(x: int) -> int:
        nonlocal calls
        calls += 1
        return expensive(x)

    cached2 = lru_cache_custom(maxsize=None)(counted)
    [cached2(x) for x in inputs]
    assert calls == len(set(inputs))


@given(inputs=st.lists(st.text(), min_size=100, max_size=1000, unique=False))
def test_keyed_memo_hit_miss(inputs):
    calls = 0

    def expensive(s: str) -> int:
        nonlocal calls
        calls += 1
        return len(s) * 17

    memo = memoize_keyed(lambda s: s, maxsize=None)(expensive)
    [memo(s) for s in inputs]
    unique = len(set(inputs))
    assert calls == unique

    calls_before = calls
    [memo(s) for s in inputs]
    assert calls == calls_before


@given(tree=tree_strategy())
def test_memoized_fold_dedup(tree):
    calls = 0

    def expensive_hash(text: str) -> str:
        nonlocal calls
        calls += 1
        return hashlib.sha256(text.encode()).hexdigest()

    memo_hash = memoize_keyed(lambda t: t)(expensive_hash)
    unique_texts = len({n.node.text for n in flatten(tree)})
    fold_tree(tree, None, lambda _, n, d, p: memo_hash(n.node.text))
    assert calls == unique_texts
```
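The bounded-memory law from §1 can be checked the same way. A sketch of that property (the repo version may differ in detail; single-threaded, every miss on a full cache evicts exactly one entry):

```python
@given(inputs=st.lists(st.integers(min_value=0, max_value=50), min_size=200, max_size=500))
def test_bounded_lru_eviction(inputs):
    memo = memoize_keyed(lambda x: x, maxsize=8)(lambda x: x * 2)
    for x in inputs:
        assert memo(x) == x * 2  # correctness is preserved across evictions
    info = memo.cache_info()
    assert info.hits + info.misses == len(inputs)
    assert info.evictions == max(0, info.misses - 8)
```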
## 6. Big-O & Allocation Guarantees
| Variant | Lookup Time | Memory (peak) | Persistence | Notes |
|---|---|---|---|---|
| lru_cache_custom | O(1) | O(maxsize) | No | Fastest, requires hashable args |
| memoize_keyed (unbounded) | O(1) | O(unique keys) | No | Thread-safe |
| memoize_keyed (bounded) | O(1) | O(maxsize) | No | Pure Python OrderedDict LRU |
| DiskCache | O(1) + I/O | O(disk usage) | Yes | Atomic, versioned invalidation |
All variants are observationally pure.
## 7. Anti-Patterns & Immediate Fixes
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Memoizing impure functions | Nondeterministic results | Only memoize pure functions |
| Unbounded in-process cache | Memory leak | Use bounded variant |
| Mutable objects as keys | Hash/error bugs | Use explicit key_fn |
| No version in persistent keys | Stale cache after code change | Include version string in key |
## 8. Pre-Core Quiz
- Memoization requires…? → Pure, deterministic functions
- Observational purity means…? → cached_f(x) == f(x) always
- Custom key needed when…? → Args not hashable or need normalisation
- Persistent cache for…? → Cross-process or cross-run sharing
- Prove cache correctness? → Observational purity + hit/miss instrumentation
## 9. Post-Core Exercise
- Take your current embedding function → memoize with `lru_cache_custom` → add hit/miss logging → run on a real dataset → measure speedup.
- Memoize a fold over `TreeDoc` that computes per-node embeddings → prove recomputations == unique texts.
- Add `DiskCache` for content-based deduplication hashes that survive restarts.
- Find an expensive pure function in your codebase → memoize with `memoize_keyed` → add an equivalence property.
Next: M04C04 – Result/Option Types for Per-Record Failures (Simple FP Error Containers).
You now have the complete toolkit to make any pure computation effectively O(unique inputs) — even across process restarts. The rest of Module 4 is about doing it safely when things go wrong.