Module 1: Foundational FP Concepts¶

Progression Note¶

By the end of Module 1, you'll master purity laws, write pure functions, and refactor impure code using Hypothesis. This builds the foundation for lazy streams in Module 3. See the series progression map in the repo root for full details.

Here's a snippet from the progression map:

Module	Focus	Key Outcomes
1: Foundational FP Concepts	Purity, contracts, refactoring	Spot impurities, write pure functions, prove equivalence with Hypothesis
2: ...	...	...
...	...	...

M01C01: Imperative vs Functional – Purity & Referential Transparency¶

Core question:
How does hidden state prevent substitution, and how do you refactor imperative code to pure functions that enable equational reasoning?

This core introduces the functional mindset in Python:
- Treat code as substitutable expressions (referential transparency) rather than sequences of mutations.
- Default to pure functions (same inputs → same outputs, no side effects) for predictability and composability.
- Isolate impurities (globals, mutation, I/O) to thin edges.

We use a running project—refactoring the FuncPipe RAG Builder (documented in module-01/funcpipe-rag-01/README.md)—to ground every concept. This project evolves across all 10 cores: start with an impure, stateful version; end with a pure, composable pipeline.

Audience: Python developers writing imperative scripts (loops, shared mutation) who face bugs from "it worked last time" or struggle with parallel/testing due to the menace of parallelism and flaky tests.
Outcome:
1. Spot impurity in code and explain why it breaks substitution.
2. Refactor a 20–50 line impure function to pure using explicit state.
3. Write a Hypothesis property proving equivalence, including a shrinking example.

1. Conceptual Foundation¶

1.1 The One-Sentence Rule¶

Write pure functions by default: explicit inputs, new outputs, no hidden state or effects; mutate or I/O only in thin, explicit wrappers.

1.2 Referential Transparency in One Precise Sentence¶

An expression is referentially transparent if you can replace it with its value without changing program behavior—enabled by purity (no side effects, determinism).

In this series, a function is “pure” if: - It depends only on its arguments (no mutable globals, no I/O), - It does not mutate its arguments or any shared state, and - Given the same arguments, it always returns the same result or raises the same exception.

1.3 Why This Matters Now¶

Without referential transparency, you cannot reason locally about code; your tests and parallel execution become guesswork due to hidden dependencies. For example, a function that passes unit tests might fail in production under concurrent calls because of shared globals, leading to data races that are hard to reproduce.

1.4 Functions as Values in 5 Lines¶

Functions as first-class values enable dynamic strategies:

from collections.abc import Callable
from typing import Dict

def discount_price(price: float, tax_rate: float) -> float:
    return price * (1 - tax_rate)

strategies: Dict[str, Callable[[float, float], float]] = {
    "discount": discount_price,
    # Add more strategies
}

def run_job(strategy_key: str, price: float, tax_rate: float) -> float:
    return strategies[strategy_key](price, tax_rate)

Because discount_price is pure (no hidden state, no I/O), we can safely store it in a dict, pass it around, and test it in isolation — just like data.

2. Mental Model: Imperative vs Functional¶

2.1 One Picture¶

Imperative (Hidden State)               Functional (Explicit + Substitutable)
+-----------------------+               +------------------------------+
| globals / RNG / time  |               |   all inputs    + function   |
|        ↓              |               |        ↓              ↓      |
| mutate X → do Y → Z   |               |   output = f(all inputs)     |
| result = hidden deps  |               |   value substitutable        |
+-----------------------+               +------------------------------+
   ↑ Races / Heisenbugs                    ↑ Parallel-safe / Testable

2.2 Contract Table¶

Aspect	Imperative / Stateful	Functional / Pure
Inputs/Outputs	Hidden globals, mutation	Explicit, deterministic
Substitution	Breaks (effects differ)	Safe (expression = value)
Testing	Flaky (depends on order)	Local reasoning (properties hold)
Parallelism	Races	Freedom from data races

Note on Imperative Choice: Rarely, for profiled hot paths (e.g., inner loops), use imperative style hidden behind a pure API.

2.3 Purity Spectrum Table¶

Level	Description	Example
Fully Pure	Explicit inputs/outputs only	`def add(x: int, y: int) -> int: return x + y`
Impure	Globals/I/O/mutation	`def read_file(path: str) -> str: ...`

3. Running Project: FuncPipe RAG Builder¶

Our running project (from module-01/funcpipe-rag-01/README.md) is refactoring an impure RAG pipeline to pure.
- Dataset: 10k arXiv CS abstracts (arxiv_cs_abstracts_10k.csv).
- Goal: Clean, chunk, embed abstracts into Chunk list (including dedup).
- Start: Pedagogical impure version (module-01/funcpipe-rag-01/src/funcpipe_rag/full_rag.py).
- End (this core): Pure core with explicit env/state, preserving structural behavior for equivalence testing.

3.1 Types (Canonical, Used Throughout)¶

These live in module-01/funcpipe-rag-01/src/funcpipe_rag/rag_types.py and are imported as needed. Full code:

from dataclasses import dataclass
from typing import Tuple

@dataclass
class RawDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str

@dataclass(frozen=True)
class CleanDoc:
    doc_id: str
    title: str
    abstract: str
    categories: str

@dataclass(frozen=True)
class ChunkWithoutEmbedding:
    doc_id: str
    text: str
    start: int
    end: int

@dataclass(frozen=True, eq=True)
class Chunk(ChunkWithoutEmbedding):
    embedding: Tuple[float, ...]  # Length 16, [0.0, 1.0]

@dataclass(frozen=True)
class RagEnv:
    chunk_size: int

3.2 Impure Start (Anti-Pattern)¶

Full code:

# module-01/funcpipe-rag-01/src/funcpipe_rag/full_rag.py (legacy dict-based pipeline)
import hashlib
from typing import List, Dict, Tuple
from funcpipe_rag import RawDoc, RagEnv


def _stable_embedding(text: str) -> Tuple[float, ...]:
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()
    step = 4
    return tuple(int(h[i:i + step], 16) / (16 ** step - 1) for i in range(0, 64, step))


def impure_chunks(docs: List[RawDoc], env: RagEnv) -> List[Dict]:
    chunks = []
    for doc in docs:
        text = " ".join(doc.abstract.strip().lower().split())
        for i in range(0, len(text), env.chunk_size):
            chunk_text = text[i:i + env.chunk_size]
            embedding = _stable_embedding(chunk_text)
            chunk_id = f"{doc.doc_id}_{i}"
            chunks.append({
                "doc_id": chunk_id,
                "text": chunk_text,
                "start": i,
                "end": i + len(chunk_text),
                "embedding": embedding
            })
    return chunks

Smells: The shape is still ad-hoc (dictionaries with derived IDs instead of Chunk values), and it intertwines cleaning, chunking, and embedding in one loop. Even though the embedding is deterministic now (via a hash) for pedagogical tests, the legacy shape complicates equational reasoning and deduplication.

4. Refactor to Pure: Explicit Inputs, New Outputs¶

4.1 Pure Core¶

Pass all state explicitly; return new values. Full code:

# module-01/funcpipe-rag-01/src/funcpipe_rag/fp.py
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def fmap(f: Callable[[T], U]) -> Callable[[List[T]], List[U]]:
    # fmap obeys the usual functor laws for list: identity and composition
    def mapper(xs: List[T]) -> List[U]:
        return [f(x) for x in xs]
    return mapper

# flow is not used in Core 1; see later cores for composition

# module-01/funcpipe-rag-01/src/funcpipe_rag/pipeline_stages.py
import hashlib
from typing import List, Set, Tuple
from funcpipe_rag import RawDoc, CleanDoc, ChunkWithoutEmbedding, Chunk, RagEnv


def clean_doc(doc: RawDoc) -> CleanDoc:
    abstract = " ".join(doc.abstract.strip().lower().split())
    return CleanDoc(doc.doc_id, doc.title, abstract, doc.categories)


def chunk_doc(doc: CleanDoc, env: RagEnv) -> List[ChunkWithoutEmbedding]:
    text = doc.abstract
    return [
        ChunkWithoutEmbedding(doc.doc_id, text[i:i + env.chunk_size], i, i + len(text[i:i + env.chunk_size]))
        for i in range(0, len(text), env.chunk_size)
    ]


def embed_chunk(chunk: ChunkWithoutEmbedding) -> Chunk:
    h = hashlib.sha256(chunk.text.encode("utf-8")).hexdigest()
    step = 4  # Fixed for dimension 16
    vec = tuple(int(h[i:i + step], 16) / (16 ** step - 1) for i in range(0, 64, step))
    return Chunk(chunk.doc_id, chunk.text, chunk.start, chunk.end, vec)


def structural_dedup_chunks(chunks: List[Chunk]) -> List[Chunk]:
    seen: Set[Chunk] = set()
    result: List[Chunk] = []
    for c in chunks:
        if c not in seen:
            seen.add(c)
            result.append(c)
    return result

# module-01/funcpipe-rag-01/src/funcpipe_rag/full_rag.py
from typing import List
from funcpipe_rag import fmap
from funcpipe_rag import clean_doc, chunk_doc, embed_chunk, structural_dedup_chunks
from funcpipe_rag import RawDoc, Chunk, RagEnv


def docs_to_embedded(docs: List[RawDoc], env: RagEnv) -> List[Chunk]:
    cleaned = fmap(clean_doc)(docs)
    chunked = [c for doc in cleaned for c in chunk_doc(doc, env)]
    return fmap(embed_chunk)(chunked)


def full_rag(docs: List[RawDoc], env: RagEnv) -> List[Chunk]:
    return structural_dedup_chunks(docs_to_embedded(docs, env))

Wins: Every stage is independently testable, and full_rag simply wires them together—clean → chunk → embed → structural dedup. Because each helper is deterministic, full_rag(docs, env) is referentially transparent.

4.2 Impure Shell (Edge Only)¶

Full code:

# module-01/funcpipe-rag-01/src/funcpipe_rag/rag_shell.py
from dataclasses import asdict
import csv
import json
from typing import List
from funcpipe_rag import RawDoc, RagEnv
from funcpipe_rag import full_rag


def rag_shell(env: RagEnv, input_path: str, output_path: str) -> None:
    try:
        with open(input_path, encoding="utf-8") as f_in:
            docs: List[RawDoc] = [RawDoc(**row) for row in csv.DictReader(f_in)]
    except (UnicodeDecodeError, TypeError, ValueError) as exc:
        raise ValueError(f"Failed to parse {input_path}: {exc}") from exc

    chunks = full_rag(docs, env)
    with open(output_path, "w", encoding="utf-8") as f_out:
        for chunk in chunks:
            json.dump(asdict(chunk), f_out)
            f_out.write("\n")

Note: This is the only impurity; core remains pure. Use 'with' for resource safety in impure boundaries—full details in Module 7 (inspired by PEP 343). This acts as a shell/adapter layer.

5. Equational Reasoning: Substitution Exercise¶

Hand Exercise: Replace expressions in full_rag.
1. Inline clean_doc(d) → CleanDoc(...) with normalized abstract.
2. Substitute into chunk_doc → list of ChunkWithoutEmbedding.
3. Result: Entire call = fixed value for fixed inputs.
Bug Hunt: In a truly impure version (e.g., with randomness or timestamps), substitution fails due to varying results. In this core we keep the impure version deterministic already; the point here is shape + entanglement, not non-determinism.
Because full_rag is pure and referentially transparent, we can safely replace any call full_rag(docs, env) with its value, or inline clean_doc / chunk_doc step by step, without changing behavior. That is equational reasoning: treating equals as interchangeable in all contexts.

6. Property-Based Testing: Proving Equivalence (Advanced, Optional)¶

Use Hypothesis to prove refactor preserved structural behavior (text, positions).

You can safely skip this on a first read and still follow later cores—come back when you want to mechanically verify your own refactors.

To bridge theory and practice, here's a simple Hypothesis example illustrating impurity detection:

from hypothesis import given
import hypothesis.strategies as st

# Impure function (hidden state)
counter = 0
def impure_add(x: int) -> int:
    global counter
    counter += 1
    return x + counter

# Pure refactor
def pure_add(x: int, state: int) -> tuple[int, int]:  # Returns result and new state
    return x + state, state + 1

# Property test to catch impurity (falsifies because impure_add depends on call history, not on inputs alone)
@given(st.integers())
def test_impure_add_not_deterministic(x: int) -> None:
    # Two calls with same input must give same result for a pure function.
    first = impure_add(x)
    second = impure_add(x)
    assert first == second

# Run: hypothesis will find counterexample showing non-determinism

Hypothesis falsifies this by running multiple inputs, revealing the hidden state change.

6.1 Custom Strategy (RAG Domain)¶

Full code:

# module-01/funcpipe-rag-01/tests/conftest.py (excerpt)
import hypothesis.strategies as st
from funcpipe_rag import RawDoc, RagEnv


def raw_doc_strategy():
    return st.builds(
        RawDoc,
        doc_id=st.text(min_size=1, max_size=20, alphabet="abcdefghijklmnopqrstuvwxyz0123456789-_"),
        title=st.text(max_size=100),
        abstract=st.text(max_size=2000),
        categories=st.text(max_size=50),
    )


def doc_list_strategy():
    return st.lists(raw_doc_strategy(), max_size=50)


def env_strategy():
    return st.builds(RagEnv, chunk_size=st.integers(min_value=128, max_value=1024))

6.2 Equivalence Property¶

# module-01/funcpipe-rag-01/tests/test_laws.py (structure preservation example)
from hypothesis import given
from funcpipe_rag import docs_to_embedded  # impure_chunks from full_rag.py
from funcpipe_rag import impure_chunks
from .conftest import doc_list_strategy, env_strategy


@given(docs=doc_list_strategy(), env=env_strategy())
def test_refactor_preserves_structure(docs, env):
    old = sorted(
        (c["doc_id"], c["text"], c["start"], c["end"])
        for c in impure_chunks(docs, env)
    )
    new = sorted(
        (c.doc_id, c.text, c.start, c.end)
        for c in docs_to_embedded(docs, env)
    )
    assert old == new

Note: Property focuses on structural equivalence (text, positions); embeddings and side effects ignored. We ignore IDs here as they use different schemas.

6.3 Shrinking Demo: Catching a Bug¶

Bad refactor (forgot whitespace collapse in clean_doc):

from funcpipe_rag import RawDoc, CleanDoc, RagEnv, ChunkWithoutEmbedding, Chunk
from typing import List
from funcpipe_rag import chunk_doc, embed_chunk


def bad_clean_doc(doc: RawDoc) -> CleanDoc:
    return CleanDoc(doc.doc_id, doc.title, doc.abstract.strip().lower(), doc.categories)  # No split/join


def full_rag_bad(docs: List[RawDoc], env: RagEnv) -> List[Chunk]:
    cleaned = [bad_clean_doc(d) for d in docs]
    chunked = [c for doc in cleaned for c in chunk_doc(doc, env)]
    embedded = [embed_chunk(c) for c in chunked]
    return embedded

Property (swapped to full_rag_bad):

@given(docs=doc_list_strategy(), env=env_strategy())
def test_bad_rag(docs, env):
    impure = impure_chunks(docs, env)
    bad_pure = full_rag_bad(docs, env)
    old = sorted((c["text"], c["start"], c["end"]) for c in impure)
    new = sorted((c.text, c.start, c.end) for c in bad_pure)
    assert old == new  # Fails on extra spaces

Hypothesis failure trace (run to verify; example):

Falsifying example: test_bad_rag(
    docs=[RawDoc(doc_id='a', title='', abstract='a  b', categories='')], 
    env=RagEnv(chunk_size=128),
)
AssertionError

Initial failure on larger input; shrinks to minimal doc with extra space ("a b").
Shows length mismatch: impure collapses spaces (len=3), bad_pure doesn't (len=4).

This catches the bug via shrinking to the essence.

7. When Purity Isn't Worth It¶

Rarely, for profiled hot paths (e.g., image processing loops), use mutable arrays internally but expose pure API: def process_image(data: bytes) -> bytes: ....

8. Pre-Core Quiz¶

If full_rag were impure (e.g., using time), would full_rag(docs, env) == full_rag(docs, env) always hold? → No.
If embed_chunk used time/random, substituting the call with its value would break referential transparency.
chunks.sort() == sorted(chunks)? → No (mutates vs new). Run it!
Global in clean_doc? → Hidden input → impure.
Cache impure? → Lies about state.

9. Post-Core Reflection & Exercise¶

Reflect: In your code, find one impure func (global/shared mutation). Refactor to pure; add Hypothesis equiv.
Project Exercise: Download arxiv_cs_abstracts_10k.csv; run impure vs pure; verify equiv.

All claims (e.g., referential transparency) are verifiable via the provided Hypothesis examples—run them to confirm.

Further Reading: For more on purity pitfalls, see 'Fluent Python' Chapter on Functions as Objects. Check free resources like Python.org's FP section or Codecademy's Advanced Python course for basics.

Next: Core 2 – Pure Functions & Contracts. (Builds on this RAG pure core.)