Snakemake Deep Dive — Module 04¶

Scaling Patterns: Modularity, Interfaces, CI Gates, and Executor-Proof Semantics¶

Version & scope contract

Target: Snakemake 9.14.x (CLI reporting, profiles, modules, schema validation).

This module is about scaling correctness, not tuning a specific cluster (Module 03) and not checkpoints/dynamic DAGs (Module 02).

Verify:
snakemake --version
snakemake -h | sed -n '1,40p'

Orientation: scaling fails at boundaries¶

When a workflow “doesn’t scale”, it’s usually not CPU. It’s:

hidden coupling across files,
implicit file contracts,
nondeterministic outputs,
or resources that were never proved to the executor.

Unified scaling model¶

Total scaling failure ≈ hidden coupling + implicit contracts + entropy + unproven resources + upgrade drift

flowchart LR
  P[Provider module] -->|File API: paths + formats| C[Consumer workflow]
  C --> V["Schema validate (fail fast)"]
  C --> G[CI gates: lint + dryrun + tests + diffs]
  C --> D[Drift evidence: list-changes + summary]
  C --> R[Resource evidence: logs show threads/mem]
  V --> S[Scale without fear]
  G --> S
  D --> S
  R --> S

Runnable lab (single repo, two assemblies)¶

You will have:

Module assembly (real boundaries): Snakefile
Single-file sanity (fast baseline): Snakefile.reference

Golden layout¶

.
├── Snakefile
├── Snakefile.reference
├── config
│   ├── config.yaml
│   ├── config.schema.yaml
│   └── schemas
│       ├── ref.schema.yaml
│       └── broken.schema.yaml
├── data
│   ├── A.txt
│   └── B.txt
├── modules
│   └── provider
│       └── Snakefile
├── workflow
│   ├── contracts
│   │   └── FILE_API.md
│   └── rules
│       ├── consumer.smk
│       ├── entropy.smk
│       └── resources.smk
├── profiles
│   └── local
│       └── config.v9+.yaml
└── ci
    └── gate.sh

`profiles/local/config.v9+.yaml`¶

executor: local
cores: 2
printshellcmds: true
latency-wait: 5

`config/config.yaml`¶

results_prefix: "results/v1"
samples: ["A", "B"]

`config/config.schema.yaml`¶

type: object
required: [results_prefix, samples]
properties:
  results_prefix: {type: string, minLength: 1}
  samples:
    type: array
    minItems: 1
    items: {type: string, pattern: "^[A-Za-z0-9._-]+$"}
additionalProperties: false

`workflow/contracts/FILE_API.md`¶

# File API (v1)

Provider outputs:
- path: results/v1/provider/{sample}.upper.txt
- semantics: uppercase of data/{sample}.txt

Consumer outputs:
- path: results/v1/consumer/all.upper.txt
- semantics: concatenation of provider outputs in config.samples order

Breaking changes:
- Any change to output paths, patterns, or semantics => bump results_prefix to results/v2
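
The contract above can also be expressed as code, so that every path is derived from the same two config keys. The following is a minimal sketch, not part of the lab repo: `provider_path`, `consumer_path`, and `expected_outputs` are hypothetical helper names chosen for illustration.

```python
# Hypothetical helpers (not part of Snakemake or the lab repo):
# derive every File API path from the config keys in config/config.yaml.

def provider_path(results_prefix: str, sample: str) -> str:
    """Provider output path as written in FILE_API.md."""
    return f"{results_prefix}/provider/{sample}.upper.txt"


def consumer_path(results_prefix: str) -> str:
    """Consumer output path as written in FILE_API.md."""
    return f"{results_prefix}/consumer/all.upper.txt"


def expected_outputs(config: dict) -> list[str]:
    """All contract paths, with provider outputs in config['samples'] order."""
    prefix = config["results_prefix"]
    paths = [provider_path(prefix, sample) for sample in config["samples"]]
    paths.append(consumer_path(prefix))
    return paths


config = {"results_prefix": "results/v1", "samples": ["A", "B"]}
for path in expected_outputs(config):
    print(path)
```

Because the prefix is the only place the version appears, bumping results_prefix moves every contract path at once.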

`data/A.txt`, `data/B.txt`¶

printf "hello a\n" > data/A.txt
printf "hello b\n" > data/B.txt

Commissioning sequence (module assembly)¶

mkdir -p .proof  # the redirects below fail if this directory is missing
snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --list-rules
snakemake --profile profiles/local --rulegraph mermaid-js > .proof/rulegraph.mmd
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true

Expected invariants you can verify immediately:

results/v1/consumer/all.upper.txt exists and contains:

HELLO A
HELLO B

.proof/summary.txt begins with a header row covering these columns (exact labels may differ between Snakemake versions):

filename  modification time  rule version  status  plan

Core 1 — Modularity that scales: `include` vs `module` and real boundaries¶

Learning objectives¶

You will be able to:

Reproduce a silent correctness bug caused by include: namespace leakage.
Replace it with a module boundary + explicit use rule imports.
Prove the boundary using --list-rules and a stable file API.

Definition¶

include: merges Snakefiles into one namespace (globals can collide).
module loads a workflow into its own namespace; use rule ... from ... imports explicitly.

Semantics¶

flowchart TD
  I[include: shared namespace] --> Leak[globals collide]
  M[module namespace] --> Use[use rule imports only what you depend on]
  Use --> API[depend on file API, not provider globals]

Failure signatures¶

Dry-run target set changes with “no obvious reason”.
Consumer changes provider behavior without touching provider code.

Minimal repro (leak via `include`)¶

`modules/provider/Snakefile`¶

SAMPLES = ["A", "B"]

rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"

`workflow/rules/consumer.smk` (bug: overwrites provider global)¶

SAMPLES = ["A"]  # accidental narrowing

rule consumer_all:
    input:
        expand("results/v1/provider/{sample}.upper.txt", sample=SAMPLES)
    output:
        "results/v1/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

Top-level `Snakefile` (bad assembly)¶

include: "modules/provider/Snakefile"
include: "workflow/rules/consumer.smk"

rule all:
    input: "results/v1/consumer/all.upper.txt"

Run:

snakemake --profile profiles/local -n

Expected planning evidence (verbatim file set):

Planned input includes results/v1/provider/A.upper.txt
Does not include results/v1/provider/B.upper.txt
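
The narrowing can be mimicked in plain Python. This is an analogy only, assuming that `exec()` into a dictionary is a fair stand-in for a namespace; it does not reproduce how Snakemake implements include or module.

```python
# Analogy only: exec() into dicts stands in for Snakemake's namespaces.
PROVIDER_SRC = 'SAMPLES = ["A", "B"]'
CONSUMER_SRC = 'SAMPLES = ["A"]  # accidental narrowing'

# include-style: both files execute in one shared namespace, last writer wins.
shared = {}
exec(PROVIDER_SRC, shared)
exec(CONSUMER_SRC, shared)
print("shared:", shared["SAMPLES"])

# module-style: each file gets its own namespace, so nothing is overwritten.
provider_ns, consumer_ns = {}, {}
exec(PROVIDER_SRC, provider_ns)
exec(CONSUMER_SRC, consumer_ns)
print("provider:", provider_ns["SAMPLES"])
print("consumer:", consumer_ns["SAMPLES"])
```

In the shared case the provider's list is gone after the second file runs, which is the same "last writer wins" effect that shrinks the planned file set.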

Fix pattern (module boundary + config-derived list)¶

`workflow/rules/consumer.smk` (fixed: no globals)¶

rule consumer_all:
    input:
        expand(f"{config['results_prefix']}/provider/{{sample}}.upper.txt", sample=config["samples"])
    output:
        f"{config['results_prefix']}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

Top-level `Snakefile` (good assembly)¶

from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/config.schema.yaml")

module provider:
    snakefile: "modules/provider/Snakefile"

use rule provider_make from provider as provider_*

include: "workflow/rules/consumer.smk"

rule all:
    input: f"{config['results_prefix']}/consumer/all.upper.txt"
    default_target: True  # explicit, so imported or included rules cannot take over

Run:

snakemake --profile profiles/local -n

Expected planning evidence (verbatim file set):

Planned inputs include both:
results/v1/provider/A.upper.txt
results/v1/provider/B.upper.txt

Proof hook¶

Provide:

snakemake --profile profiles/local --list-rules output (before/after).
Two dry-runs showing the target set changed exactly as described.

Core 2 — Interface contracts: naming, schemas, versioned outputs, compatibility¶

Learning objectives¶

You will be able to:

Fail fast on bad config via schema validation.
Classify changes as breaking/non-breaking mechanically.
Force breaking changes to be explicit via results_prefix bump.

Definition¶

A contract is path + format + semantics, written down and enforced.

Semantics¶

The config schema makes typos impossible to ignore.
The file API doc makes upgrades reviewable.

Failure signatures¶

“It ran, but outputs are wrong” (schema drift or semantic drift).
Typos accepted (missing strict schema).

Minimal repro (schema failure must be immediate)¶

Break config/config.yaml:

results_prefix: "results/v1"
samplez: ["A", "B"]

Run:

snakemake --profile profiles/local -n

Expected failure evidence (the exact wording comes from the JSON Schema validator and can vary by version):

A validation error mentioning at least one of:
the missing required property samples
the unexpected additional property samplez

Fix pattern¶

Keep additionalProperties: false.
Put breaking changes behind results/v2/....
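
To make the strictness concrete, here is a hand-written mirror of the rules in config/config.schema.yaml using only the standard library. It is a sketch for illustration: `check_config` is a hypothetical helper, not the validator that `snakemake.utils.validate` uses, and its messages are not the ones Snakemake prints.

```python
import re

# Hand-written mirror of config/config.schema.yaml; `check_config` is a
# hypothetical helper, not the validator Snakemake uses.
REQUIRED = ("results_prefix", "samples")
SAMPLE_PATTERN = re.compile(r"^[A-Za-z0-9._-]+$")


def check_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config passes."""
    problems = []
    for key in REQUIRED:
        if key not in config:
            problems.append(f"missing required property {key}")
    for key in config:
        if key not in REQUIRED:  # mirrors additionalProperties: false
            problems.append(f"unexpected property {key}")
    prefix = config.get("results_prefix")
    if prefix is not None and (not isinstance(prefix, str) or len(prefix) < 1):
        problems.append("results_prefix must be a non-empty string")
    samples = config.get("samples")
    if samples is not None:
        if not isinstance(samples, list) or len(samples) < 1:
            problems.append("samples must be a non-empty array")
        else:
            for sample in samples:
                if not isinstance(sample, str) or not SAMPLE_PATTERN.search(sample):
                    problems.append(f"invalid sample name {sample!r}")
    return problems


good = {"results_prefix": "results/v1", "samples": ["A", "B"]}
typo = {"results_prefix": "results/v1", "samplez": ["A", "B"]}
print(check_config(good))
print(check_config(typo))
```

Without the additionalProperties check, the typo config would only report the missing key, and a config that contains both samples and samplez would pass silently.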

Proof hook¶

Provide:

The schema error excerpt.
The fixed config + a successful -n.

Core 3 — Determinism and drift control: CI as the correctness boundary¶

Learning objectives¶

You will be able to:

Demonstrate nondeterminism with a stable diff.
Enforce a CI gate that catches entropy and drift.
Produce drift artifacts (--summary, --list-changes) as PR evidence.

Definition¶

Determinism means:

stable plan for stable inputs,
stable outputs for stable inputs,
stable provenance signals.

Semantics¶

flowchart LR
  A[Run] --> B[Artifacts: outputs + summary + list-changes]
  C[Change code/params] --> D[list-changes flags impacted outputs]
  E[Entropy in outputs] --> F[diff fails => CI fails]

Failure signatures¶

“CI is flaky” (time/RNG/unordered globs in outputs).
Drift report is empty when it shouldn’t be (metadata dropped or bypassed).

Minimal repro (prove entropy)¶

`workflow/rules/entropy.smk`¶

# The redirect sits on the heredoc line, so Python's stdout is written to the output file.
rule entropy_bad:
    output: f"{config['results_prefix']}/entropy.txt"
    shell:
        "python - > {output} << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"

Run twice and diff:

snakemake --profile profiles/local -F results/v1/entropy.txt
cp results/v1/entropy.txt /tmp/e1.txt
snakemake --profile profiles/local -F results/v1/entropy.txt
diff -u /tmp/e1.txt results/v1/entropy.txt && echo OK || echo NONDETERMINISTIC

Expected last line of output (diff -u prints the differing lines before it):

NONDETERMINISTIC

Fix pattern¶

Entropy goes to logs, not semantic outputs.
If randomness is required, the seed comes from config and is recorded as provenance.
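
A minimal sketch of the seeded pattern, assuming a hypothetical `seed` config key (the lab's strict schema would need a matching property before this key is accepted) and a hypothetical helper `shuffled_samples`:

```python
import random

# Hypothetical config key `seed`; the lab's schema does not define it.
config = {"results_prefix": "results/v1", "samples": ["A", "B"], "seed": 1234}


def shuffled_samples(samples: list[str], seed: int) -> list[str]:
    """Shuffle with a private, seeded generator so reruns give the same order."""
    rng = random.Random(seed)
    result = list(samples)
    rng.shuffle(result)
    return result


first = shuffled_samples(["A", "B", "C", "D"], config["seed"])
second = shuffled_samples(["A", "B", "C", "D"], config["seed"])
print(first == second)  # same seed, same order
```

Using a private `random.Random` instance rather than the module-level functions keeps the result independent of any other code that touches the global generator.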

CI gate (minimal, enforceable)¶

`ci/gate.sh`¶

#!/usr/bin/env bash
set -euo pipefail

mkdir -p .proof  # evidence directory for the redirects below
snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true

Proof hook¶

Provide:

the failing diff (NONDETERMINISTIC) and the fixed diff (OK),
.proof/summary.txt and .proof/list-changes.code.txt.
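
The diff-based check can also be phrased as a content-hash comparison, which is convenient when outputs are large or binary. This is a sketch under stated assumptions: `sha256_of` and `same_content` are hypothetical helpers, and the demo uses temporary stand-in files rather than real workflow outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def same_content(first: Path, second: Path) -> bool:
    """True when two runs produced byte-identical outputs."""
    return sha256_of(first) == sha256_of(second)


# Self-contained demo with stand-in files instead of real workflow outputs.
with tempfile.TemporaryDirectory() as tmp:
    run1 = Path(tmp) / "run1.txt"
    run2 = Path(tmp) / "run2.txt"
    run1.write_text("HELLO A\nHELLO B\n")
    run2.write_text("HELLO A\nHELLO B\n")
    stable = same_content(run1, run2)
    run2.write_text("1700000000.123\n")  # stand-in for a timestamp leaking into output
    drifted = same_content(run1, run2)

print("OK" if stable else "NONDETERMINISTIC")
print("OK" if drifted else "NONDETERMINISTIC")
```

In a CI gate, a check like this would exit non-zero on the second case so that entropy in a semantic output fails the build.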

Core 4 — Resource semantics with evidence: prove what the workflow asked for¶

Learning objectives¶

You will be able to:

Resolve dynamic resources deterministically from explicit evidence (input size).
Prove resolved threads and mem_mb using log artifacts.
Detect oversubscription failures before cluster execution.

Definition¶

Resource correctness is not “I wrote resources:”. It is:

the workflow resolved concrete values,
evidence was produced,
and the executor could have enforced them.

Semantics (portable evidence)¶

flowchart TD
  I[input size] --> R[resolve mem_mb]
  R --> L[write evidence log]
  L --> A[audit in CI / review]

Failure signatures¶

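“Resources ignored” (no evidence exists; you’re guessing).
Oversubscription (threads > available cores): a local run scales the rule’s threads down to the available cores, so the evidence log shows fewer threads than declared; a cluster scheduler may instead stall the job in the queue or reject it.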

Minimal repro (resource evidence logs)¶

`workflow/rules/resources.smk`¶

def mem_mb_from_input(wc, input):
    # deterministic; scale with size for demonstration
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{config['results_prefix']}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """

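Make B large (this overwrites data/B.txt; rerun its printf command from the data section before any step that expects the hello b baseline):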

python - << 'PY'
with open("data/B.txt","w") as f:
    f.write("x" * 5_000_000 + "\n")
PY

Run:

snakemake --profile profiles/local results/v1/resources/A.done.txt results/v1/resources/B.done.txt
echo "=== A ==="; cat logs/resources/A.txt
echo "=== B ==="; cat logs/resources/B.txt

Expected log evidence (all four keys present in both logs; ... stands for per-sample values):

sample=...
threads=2
mem_mb=...
input=data/...

And B’s mem_mb must be > A’s.
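
The arithmetic behind that inequality can be checked without running the workflow. The sketch below mirrors the formula in mem_mb_from_input but takes a byte count directly; `mem_mb_for_bytes` is a hypothetical name, and the conversion of one MB to 1024 * 1024 bytes is an assumption that should be checked against how your Snakemake version defines input.size_mb.

```python
# Mirrors the formula in mem_mb_from_input, but takes a byte count directly.
# Assumption: one MB is 1024 * 1024 bytes.

def mem_mb_for_bytes(size_bytes: int) -> int:
    size_mb = size_bytes / (1024 * 1024)
    return max(200, int(2 * size_mb) + 200)


size_a = len("hello a\n")  # 8 bytes, the original data/A.txt
size_b = 5_000_000 + 1     # the generated data/B.txt: 5,000,000 'x' plus a newline
print(mem_mb_for_bytes(size_a))  # 200 under the assumption above
print(mem_mb_for_bytes(size_b))  # 209 under the assumption above
```

The gap is small because the demo formula adds only 2 MB per MB of input; the point is that the resolved value is a pure function of input size, so the same inputs always yield the same request.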

Fix pattern¶

Defaults belong in profile; rule-level resources are exceptions.
Any “special” rule must emit an evidence log of resolved resources.

Proof hook¶

Provide:

logs/resources/A.txt and logs/resources/B.txt,
and one sentence: “B mem_mb > A mem_mb (evidence above).”
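
That sentence can be turned into an automated check over the two evidence logs. A minimal sketch, assuming the key=value format written by the resource_probe rule; `parse_evidence` and `b_exceeds_a` are hypothetical helpers, and the log contents below are illustrative placeholders rather than captured output.

```python
def parse_evidence(text: str) -> dict[str, str]:
    """Parse key=value lines as written by the resource_probe rule."""
    evidence = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            evidence[key.strip()] = value.strip()
    return evidence


def b_exceeds_a(log_a: str, log_b: str) -> bool:
    """True when B's resolved mem_mb is strictly greater than A's."""
    a, b = parse_evidence(log_a), parse_evidence(log_b)
    return int(b["mem_mb"]) > int(a["mem_mb"])


# Illustrative log contents; the mem_mb numbers are placeholders, not measured output.
log_a = "sample=A\nthreads=2\nmem_mb=200\ninput=data/A.txt\n"
log_b = "sample=B\nthreads=2\nmem_mb=209\ninput=data/B.txt\n"
print(b_exceeds_a(log_a, log_b))
```

In practice the two strings would be read from logs/resources/A.txt and logs/resources/B.txt.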

Core 5 — Workflow as a product: distribution, pinning, upgrade paths, team practice¶

Learning objectives¶

You will be able to:

Make a breaking change that is mechanically explicit (v1 → v2).
Prove a non-breaking refactor does not perturb consumers.
Encode team review rules that prevent silent breakage.

Definition¶

A workflow is a versioned product:

stable file API,
pinned dependencies,
explicit upgrade paths.

Semantics (breaking change demo, isolated and concrete)¶

Step 1: baseline (v1)¶

Confirm:

snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary | grep -E "results/v1/provider|results/v1/consumer" | head

Step 2: introduce a breaking change (format semantics)¶

Change the provider to copy its input unchanged, so the output is lowercase instead of uppercase. This breaks the semantics, but the path stays the same (this is the mistake).

Edit modules/provider/Snakefile:

rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "cat {input} > {output}"  # now wrong semantics for v1

Run:

snakemake --profile profiles/local -F --cores 2
head -n 2 results/v1/consumer/all.upper.txt

Expected evidence (verbatim):

hello a
hello b

This proves that a breaking semantic change shipped silently under the v1 path.

Step 3: correct governance (bump to v2)¶

Fix by bumping results_prefix in config/config.yaml:

results_prefix: "results/v2"
samples: ["A", "B"]

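Because modules/provider/Snakefile hardcodes results/v1/ in its output path, change that path to results/v2/provider/{sample}.upper.txt as well; otherwise no rule produces the consumer's v2 inputs. Update workflow/contracts/FILE_API.md to “File API (v2)” and describe the new semantics. Run: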

snakemake --profile profiles/local --cores 2
ls -1 results/v2/provider | head

Expected evidence (verbatim path change):

outputs now exist under results/v2/...
v1 remains intact unless explicitly cleaned
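
The rule in FILE_API.md (any change to paths, patterns, or semantics requires a prefix bump) can be checked mechanically during review. The sketch below is one possible encoding, not an existing tool: the dictionary layout and the helpers `is_breaking` and `review` are hypothetical.

```python
# Hypothetical representation of a File API entry: path pattern plus semantics.
V1 = {
    "provider": ("{prefix}/provider/{sample}.upper.txt", "uppercase of data/{sample}.txt"),
    "consumer": ("{prefix}/consumer/all.upper.txt", "concatenation in config.samples order"),
}


def is_breaking(old: dict, new: dict) -> bool:
    """Breaking if any existing entry was removed or its path/semantics changed."""
    return any(name not in new or new[name] != entry for name, entry in old.items())


def review(old: dict, new: dict, old_prefix: str, new_prefix: str) -> str:
    """Reject a breaking change that keeps the old results_prefix."""
    if is_breaking(old, new) and old_prefix == new_prefix:
        return "REJECT: breaking change without results_prefix bump"
    return "ACCEPT"


# Same path, different semantics: the mistake from Step 2.
changed = dict(V1, provider=(V1["provider"][0], "copy of data/{sample}.txt"))
print(review(V1, changed, "results/v1", "results/v1"))
print(review(V1, changed, "results/v1", "results/v2"))
print(review(V1, dict(V1), "results/v1", "results/v1"))
```

Adding a new entry is treated as non-breaking here, since existing consumers are unaffected; a team could choose a stricter rule.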

Fix pattern (team checklist)¶

A PR is not reviewable without:

snakemake --lint
snakemake -n
.proof/summary.txt
.proof/list-changes.*.txt
FILE_API.md diff if anything about outputs changed

Proof hook¶

Provide:

the head output showing silent semantic break under v1,
the config + contract bump to v2,
and a directory listing proving v2 outputs exist.

Debugging playbook: scaling boundary failures¶

| Symptom | Command | Evidence | Likely cause | First fix |
| --- | --- | --- | --- | --- |
| Targets shrink/expand | `snakemake -n` | planned file set differs | include leakage | module boundary + config-derived lists |
| Config typos accepted | `snakemake -n` | no validation error | missing strict schema | `validate(config, schema)` |
| CI flaky | `diff` | NONDETERMINISTIC | entropy in outputs | move entropy to logs; seed via config |
| Unsure what changed | `--list-changes code` | impacted outputs listed | drift | attach drift artifact; rerun targeted |
| Resource claims untrusted | `cat logs/resources/*.txt` | threads/mem logged | unproved resources | evidence logs per rule |
| Module import surprises | `--list-rules` | unexpected rules present | wildcard import | import only required rules |

`Snakefile` (module assembly, final)¶

from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/config.schema.yaml")

include: "workflow/rules/consumer.smk"
include: "workflow/rules/entropy.smk"
include: "workflow/rules/resources.smk"

module provider:
    snakefile: "modules/provider/Snakefile"

use rule provider_make from provider as provider_*

rule all:
    input:
        f"{config['results_prefix']}/consumer/all.upper.txt",
        f"{config['results_prefix']}/entropy.txt",
        f"{config['results_prefix']}/resources/A.done.txt",
        f"{config['results_prefix']}/resources/B.done.txt"
    default_target: True  # explicit, so imported or included rules cannot take over

`Snakefile.reference` (single-file sanity, final)¶

from snakemake.utils import validate

configfile: "config/config.yaml"
validate(config, "config/config.schema.yaml")

SAMPLES = config["samples"]
P = config["results_prefix"]

rule all:
    input:
        f"{P}/consumer/all.upper.txt",
        f"{P}/entropy.txt",
        f"{P}/resources/A.done.txt",
        f"{P}/resources/B.done.txt"

rule provider_make:
    input: "data/{sample}.txt"
    output: f"{P}/provider/{{sample}}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"

rule consumer_all:
    input:
        expand(f"{P}/provider/{{sample}}.upper.txt", sample=SAMPLES)
    output:
        f"{P}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

# The redirect sits on the heredoc line, so Python's stdout is written to the output file.
rule entropy_bad:
    output: f"{P}/entropy.txt"
    shell:
        "python - > {output} << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"

def mem_mb_from_input(wc, input):
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{P}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """

Run sanity:

snakemake --profile profiles/local -s Snakefile.reference --cores 2

Closing recap¶

Scaling Snakemake is boundary engineering:

Modules enforce explicit dependencies; includes invite silent coupling.
Schemas + file API docs turn correctness into something you can prove.
CI gates kill entropy early.
Resources must produce evidence artifacts; otherwise they are folklore.
Breaking changes must be path/version changes, not “refactors”.
