Snakemake Deep Dive — Module 04¶
Scaling Patterns: Modularity, Interfaces, CI Gates, and Executor-Proof Semantics¶
Version & scope contract
- Target: Snakemake 9.14.x (CLI reporting, profiles, modules, schema validation).
- This module is about scaling correctness, not tuning a specific cluster (Module 03) and not checkpoints/dynamic DAGs (Module 02).
- Verify:
Orientation: scaling fails at boundaries¶
When a workflow “doesn’t scale”, it’s usually not CPU. It’s:
- hidden coupling across files,
- implicit file contracts,
- nondeterministic outputs,
- or resources that were never proved to the executor.
Unified scaling model¶
Total scaling failure ≈ hidden coupling + implicit contracts + entropy + unproven resources + upgrade drift
flowchart LR
P[Provider module] -->|File API: paths + formats| C[Consumer workflow]
C --> V["Schema validate (fail fast)"]
C --> G[CI gates: lint + dryrun + tests + diffs]
C --> D[Drift evidence: list-changes + summary]
C --> R[Resource evidence: logs show threads/mem]
V --> S[Scale without fear]
G --> S
D --> S
R --> S
Runnable lab (single repo, two assemblies)¶
You will have:
- Module assembly (real boundaries): Snakefile
- Single-file sanity (fast baseline): Snakefile.reference
Golden layout¶
.
├── Snakefile
├── Snakefile.reference
├── config
│   ├── config.yaml
│   ├── config.schema.yaml
│   └── schemas
│       ├── ref.schema.yaml
│       └── broken.schema.yaml
├── data
│   ├── A.txt
│   └── B.txt
├── modules
│   └── provider
│       └── Snakefile
├── workflow
│   ├── contracts
│   │   └── FILE_API.md
│   └── rules
│       ├── consumer.smk
│       ├── entropy.smk
│       └── resources.smk
├── profiles
│   └── local
│       └── config.v9+.yaml
└── ci
    └── gate.sh
profiles/local/config.v9+.yaml¶
config/config.yaml¶
config/config.schema.yaml¶
type: object
required: [results_prefix, samples]
properties:
  results_prefix: {type: string, minLength: 1}
  samples:
    type: array
    minItems: 1
    items: {type: string, pattern: "^[A-Za-z0-9._-]+$"}
additionalProperties: false
workflow/contracts/FILE_API.md¶
# File API (v1)
Provider outputs:
- path: results/v1/provider/{sample}.upper.txt
- semantics: uppercase of data/{sample}.txt
Consumer outputs:
- path: results/v1/consumer/all.upper.txt
- semantics: concatenation of provider outputs in config.samples order
Breaking changes:
- Any change to output paths, patterns, or semantics => bump results_prefix to results/v2
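The breaking-change rule above can be applied mechanically. The following is a minimal sketch in plain Python, not part of Snakemake: the `Contract` class and the `is_breaking` and `bump_prefix` helpers are hypothetical names introduced here, and the "verbatim copy" semantics is an invented example of a changed contract.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class Contract:
    """One File API entry: where the file lives and what it means."""
    path: str        # path pattern relative to results_prefix
    semantics: str   # human-readable meaning of the content


def is_breaking(old: Contract, new: Contract) -> bool:
    # Per the File API rule: any change to path, pattern, or semantics is breaking.
    return old != new


def bump_prefix(prefix: str) -> str:
    # "results/v1" -> "results/v2"; assumes the prefix ends in v<integer>.
    match = re.fullmatch(r"(.*/v)(\d+)", prefix)
    if match is None:
        raise ValueError(f"prefix has no trailing version: {prefix!r}")
    return f"{match.group(1)}{int(match.group(2)) + 1}"


old = Contract("provider/{sample}.upper.txt", "uppercase of data/{sample}.txt")
new = Contract("provider/{sample}.upper.txt", "verbatim copy of data/{sample}.txt")

prefix = "results/v1"
if is_breaking(old, new):
    prefix = bump_prefix(prefix)
print(prefix)  # results/v2
```

Comparing whole contract records keeps the decision binary: a reviewer does not have to judge whether a semantic change is "small enough" to stay under v1.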
data/A.txt, data/B.txt¶
Commissioning sequence (module assembly)¶
snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --list-rules
mkdir -p .proof
snakemake --profile profiles/local --rulegraph mermaid-js > .proof/rulegraph.mmd
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true
Expected invariants you can verify immediately:
- results/v1/consumer/all.upper.txt exists and contains:
- .proof/summary.txt contains a header with the columns:
Core 1 — Modularity that scales: include vs module and real boundaries¶
Learning objectives¶
You will be able to:
- Reproduce a silent correctness bug caused by include: namespace leakage.
- Replace it with a module boundary + explicit use rule imports.
- Prove the boundary using --list-rules and a stable file API.
Definition¶
- include: merges Snakefiles into one namespace (globals can collide).
- module loads a workflow into its own namespace; use rule ... from ... imports explicitly.
Semantics¶
flowchart TD
I[include: shared namespace] --> Leak[globals collide]
M[module namespace] --> Use[use rule imports only what you depend on]
Use --> API[depend on file API, not provider globals]
Failure signatures¶
- Dry-run target set changes with “no obvious reason”.
- Consumer changes provider behavior without touching provider code.
Minimal repro (leak via include)¶
modules/provider/Snakefile¶
SAMPLES = ["A", "B"]
rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"
workflow/rules/consumer.smk (bug: overwrites provider global)¶
SAMPLES = ["A"] # accidental narrowing
rule consumer_all:
    input:
        expand("results/v1/provider/{sample}.upper.txt", sample=SAMPLES)
    output:
        "results/v1/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"
Top-level Snakefile (bad assembly)¶
include: "modules/provider/Snakefile"
include: "workflow/rules/consumer.smk"
rule all:
    input: "results/v1/consumer/all.upper.txt"
Run:
Expected planning evidence (verbatim file set):
- Planned input includes results/v1/provider/A.upper.txt
- Does not include results/v1/provider/B.upper.txt
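The mechanism behind the shrunken target set can be reproduced without Snakemake. The sketch below uses plain `exec` as an analogy for a shared versus a separate namespace; it is not how Snakemake implements `include:` or `module`, and the two source strings simply mirror the `SAMPLES` assignments from the repro.

```python
PROVIDER_SRC = 'SAMPLES = ["A", "B"]'
CONSUMER_SRC = 'SAMPLES = ["A"]  # accidental narrowing'

# include-style: both sources execute in one shared namespace,
# so the later assignment silently replaces the earlier one.
shared = {}
exec(PROVIDER_SRC, shared)
exec(CONSUMER_SRC, shared)
leaked = shared["SAMPLES"]

# module-style: each source gets its own namespace,
# so the provider's list is untouched by the consumer.
provider_ns, consumer_ns = {}, {}
exec(PROVIDER_SRC, provider_ns)
exec(CONSUMER_SRC, consumer_ns)
isolated = provider_ns["SAMPLES"]

print(leaked)    # ['A']
print(isolated)  # ['A', 'B']
```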
Fix pattern (module boundary + config-derived list)¶
workflow/rules/consumer.smk (fixed: no globals)¶
rule consumer_all:
    input:
        # doubled braces survive the f-string as the literal placeholder {sample}
        expand(f"{config['results_prefix']}/provider/{{sample}}.upper.txt", sample=config["samples"])
    output:
        f"{config['results_prefix']}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"
Top-level Snakefile (good assembly)¶
from snakemake.utils import validate
configfile: "config/config.yaml"
validate(config, "config/config.schema.yaml")
module provider:
    snakefile: "modules/provider/Snakefile"
use rule provider_make from provider as provider_*
include: "workflow/rules/consumer.smk"
rule all:
    input: f"{config['results_prefix']}/consumer/all.upper.txt"
Run:
Expected planning evidence (verbatim file set):
- Planned inputs include both:
  - results/v1/provider/A.upper.txt
  - results/v1/provider/B.upper.txt
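How the config-derived target list is built can be checked in isolation. This is a minimal sketch in which plain `str.format` stands in for Snakemake's `expand`, and the config values (`results/v1`, samples `A` and `B`) are assumed from the paths used throughout this module rather than read from config/config.yaml.

```python
# Assumed config values; the real ones live in config/config.yaml.
config = {"results_prefix": "results/v1", "samples": ["A", "B"]}

# Doubled braces survive the f-string as a literal {sample} placeholder.
pattern = f"{config['results_prefix']}/provider/{{sample}}.upper.txt"

# str.format stands in for expand() in this sketch.
targets = [pattern.format(sample=s) for s in config["samples"]]
print(pattern)
print(targets)

# Whitespace inside the braces becomes part of the field name for str.format,
# so "{ sample }" no longer matches the keyword "sample".
spaced = f"{config['results_prefix']}/provider/{{ sample }}.upper.txt"
try:
    spaced.format(sample="A")
    spaced_ok = True
except KeyError:
    spaced_ok = False
print(spaced_ok)
```

Whether `expand` itself tolerates the spaced form is not verified here, so writing the placeholder without inner whitespace is the safer spelling.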
Proof hook¶
Provide:
- snakemake --profile profiles/local --list-rules output (before/after).
- Two dry-runs showing the target set changed exactly as described.
Core 2 — Interface contracts: naming, schemas, versioned outputs, compatibility¶
Learning objectives¶
You will be able to:
- Fail fast on bad config via schema validation.
- Classify changes as breaking/non-breaking mechanically.
- Force breaking changes to be explicit via a results_prefix bump.
Definition¶
A contract is path + format + semantics, written down and enforced.
Semantics¶
- The config schema makes typos impossible to ignore.
- The file API doc makes upgrades reviewable.
Failure signatures¶
- “It ran, but outputs are wrong” (schema drift or semantic drift).
- Typos accepted (missing strict schema).
Minimal repro (schema failure must be immediate)¶
Break config/config.yaml:
Run:
Expected failure evidence (verbatim fragment):
- A validation error mentioning:
  - missing required property samples
  - unexpected property samplez
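To see why the typo cannot slip through, the schema's rules can be restated as a hand-written check. This is a stdlib-only sketch of what config/config.schema.yaml enforces; `check_config` is a hypothetical helper, and the real error text comes from `snakemake.utils.validate` and will be worded differently.

```python
import re

SAMPLE_PATTERN = re.compile(r"[A-Za-z0-9._-]+")
ALLOWED_KEYS = {"results_prefix", "samples"}  # both are also required


def check_config(config: dict) -> list:
    """Return human-readable violations; an empty list means the config passes."""
    errors = []
    for key in sorted(ALLOWED_KEYS - set(config)):
        errors.append(f"missing required property: {key}")
    for key in sorted(set(config) - ALLOWED_KEYS):
        # additionalProperties: false is what turns a typo into an error
        errors.append(f"unexpected property: {key}")
    if "results_prefix" in config:
        prefix = config["results_prefix"]
        if not (isinstance(prefix, str) and prefix):
            errors.append("results_prefix must be a non-empty string")
    if "samples" in config:
        samples = config["samples"]
        if not isinstance(samples, list) or not samples:
            errors.append("samples must be a non-empty list")
        else:
            for s in samples:
                if not isinstance(s, str) or not SAMPLE_PATTERN.fullmatch(s):
                    errors.append(f"invalid sample name: {s!r}")
    return errors


broken = {"results_prefix": "results/v1", "samplez": ["A", "B"]}  # typo: samplez
print(check_config(broken))
```

Without the additional-properties rule, only the missing `samples` key would be reported, and a config that carried both `samples` and a misspelled extra key would pass silently.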
Fix pattern¶
- Keep additionalProperties: false.
- Put breaking changes behind results/v2/....
Proof hook¶
Provide:
- The schema error excerpt.
- The fixed config + a successful -n.
Core 3 — Determinism and drift control: CI as the correctness boundary¶
Learning objectives¶
You will be able to:
- Demonstrate nondeterminism with a stable diff.
- Enforce a CI gate that catches entropy and drift.
- Produce drift artifacts (--summary, --list-changes) as PR evidence.
Definition¶
Determinism means:
- stable plan for stable inputs,
- stable outputs for stable inputs,
- stable provenance signals.
Semantics¶
flowchart LR
A[Run] --> B[Artifacts: outputs + summary + list-changes]
C[Change code/params] --> D[list-changes flags impacted outputs]
E[Entropy in outputs] --> F[diff fails => CI fails]
Failure signatures¶
- “CI is flaky” (time/RNG/unordered globs in outputs).
- Drift report is empty when it shouldn’t be (metadata dropped or bypassed).
Minimal repro (prove entropy)¶
workflow/rules/entropy.smk¶
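rule entropy_bad:
    output: f"{config['results_prefix']}/entropy.txt"
    shell:
        # redirect on the python command itself; a bare "> {output}" after the
        # heredoc would only create an empty file and hide the entropy
        "python - > {output} << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"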
Run twice and diff:
snakemake --profile profiles/local -F results/v1/entropy.txt
cp results/v1/entropy.txt /tmp/e1.txt
snakemake --profile profiles/local -F results/v1/entropy.txt
diff -u /tmp/e1.txt results/v1/entropy.txt && echo OK || echo NONDETERMINISTIC
Expected output (verbatim):
Fix pattern¶
- Entropy goes to logs, not semantic outputs.
- If randomness is required, seed is config and becomes provenance.
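The second bullet can be made concrete with a seeded generator. This is a minimal sketch: the `seed` config key is hypothetical (the schema in this module would need a matching property before it could be used), and `render_output` stands in for whatever a rule writes to its semantic output.

```python
import hashlib
import random


def render_output(seed: int, n: int = 5) -> str:
    # A private Random instance keeps the draw independent of global RNG state.
    rng = random.Random(seed)
    return "\n".join(str(rng.randint(0, 999)) for _ in range(n)) + "\n"


def digest(text: str) -> str:
    # A content hash is a compact stand-in for the diff used in the repro.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


config = {"seed": 42}  # hypothetical config key, recorded as provenance
first = digest(render_output(config["seed"]))
second = digest(render_output(config["seed"]))
print("OK" if first == second else "NONDETERMINISTIC")  # OK
```

Because the seed lives in config, changing it is a visible, reviewable change rather than hidden entropy.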
CI gate (minimal, enforceable)¶
ci/gate.sh¶
#!/usr/bin/env bash
set -euo pipefail
mkdir -p .proof  # the redirects below fail if this directory is missing
snakemake --profile profiles/local --lint
snakemake --profile profiles/local -n
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary > .proof/summary.txt
snakemake --profile profiles/local --list-changes code > .proof/list-changes.code.txt || true
Proof hook¶
Provide:
- the failing diff (NONDETERMINISTIC) and the fixed diff (OK),
- .proof/summary.txt and .proof/list-changes.code.txt.
Core 4 — Resource semantics with evidence: prove what the workflow asked for¶
Learning objectives¶
You will be able to:
- Resolve dynamic resources deterministically from explicit evidence (input size).
- Prove resolved threads and mem_mb using log artifacts.
- Detect oversubscription failures before cluster execution.
Definition¶
Resource correctness is not “I wrote resources:”. It is:
- the workflow resolved concrete values,
- evidence was produced,
- and the executor could have enforced them.
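"Resolved concrete values" can be checked outside a workflow run. The sketch below exercises the same formula this module uses in workflow/rules/resources.smk against a stub; `FakeInput` is a hypothetical stand-in that models only the `size_mb` attribute, and the two sizes are arbitrary illustration values.

```python
from dataclasses import dataclass


@dataclass
class FakeInput:
    """Hypothetical stand-in for a rule's input object; only size_mb is modelled."""
    size_mb: float


def mem_mb_from_input(wc, input):
    # A 200 MB base plus twice the input size; never less than 200.
    return max(200, int(2 * input.size_mb) + 200)


small = mem_mb_from_input(None, FakeInput(size_mb=0.001))
large = mem_mb_from_input(None, FakeInput(size_mb=50.0))
print(small, large)  # 200 300
```

Because the function depends only on input size, the same inputs always resolve to the same request, which is what makes the later log evidence comparable across runs.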
Semantics (portable evidence)¶
flowchart TD
I[input size] --> R[resolve mem_mb]
R --> L[write evidence log]
L --> A[audit in CI / review]
Failure signatures¶
- “Resources ignored” (no evidence exists; you’re guessing).
- Oversubscription (threads > available cores) causes stalls/rejections.
Minimal repro (resource evidence logs)¶
workflow/rules/resources.smk¶
def mem_mb_from_input(wc, input):
    # deterministic; scale with size for demonstration
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{config['results_prefix']}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        # the log records the resolved values, not the expressions that produced them
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """
Make B large:
Run:
snakemake --profile profiles/local results/v1/resources/A.done.txt results/v1/resources/B.done.txt
echo "=== A ==="; cat logs/resources/A.txt
echo "=== B ==="; cat logs/resources/B.txt
Expected log evidence (verbatim lines present in both):
And B’s mem_mb must be > A’s.
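The comparison between the two logs can be automated for CI or review. This is a minimal sketch: `parse_evidence` is a hypothetical helper for the key=value lines the rule's printf writes, and the two log strings carry made-up mem_mb values rather than output from a real run.

```python
def parse_evidence(text: str) -> dict:
    """Parse key=value lines as written by the resource_probe rule's printf."""
    record = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            record[key.strip()] = value.strip()
    return record


# Illustrative log contents; the mem_mb values are invented for this sketch.
log_a = "sample=A\nthreads=2\nmem_mb=200\ninput=data/A.txt\n"
log_b = "sample=B\nthreads=2\nmem_mb=300\ninput=data/B.txt\n"

a, b = parse_evidence(log_a), parse_evidence(log_b)
b_exceeds_a = int(b["mem_mb"]) > int(a["mem_mb"])
print(b_exceeds_a)  # True
```

In a real gate the strings would be read from logs/resources/A.txt and logs/resources/B.txt.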
Fix pattern¶
- Defaults belong in profile; rule-level resources are exceptions.
- Any “special” rule must emit an evidence log of resolved resources.
Proof hook¶
Provide:
- logs/resources/A.txt and logs/resources/B.txt,
- and one sentence: “B mem_mb > A mem_mb (evidence above).”
Core 5 — Workflow as a product: distribution, pinning, upgrade paths, team practice¶
Learning objectives¶
You will be able to:
- Make a breaking change that is mechanically explicit (v1 → v2).
- Prove a non-breaking refactor does not perturb consumers.
- Encode team review rules that prevent silent breakage.
Definition¶
A workflow is a versioned product:
- stable file API,
- pinned dependencies,
- explicit upgrade paths.
Semantics (breaking change demo, isolated and concrete)¶
Step 1: baseline (v1)¶
Confirm:
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --summary | grep -E "results/v1/provider|results/v1/consumer" | head
Step 2: introduce a breaking change (format semantics)¶
Change the provider to copy its input unchanged instead of uppercasing it (a breaking change in semantics), but keep the output path the same (this is the mistake).
Edit modules/provider/Snakefile:
rule provider_make:
    input: "data/{sample}.txt"
    output: "results/v1/provider/{sample}.upper.txt"
    shell: "cat {input} > {output}" # now wrong semantics for v1
Run:
Expected evidence (verbatim):
This proves: semantic breaking change silently shipped under v1 path.
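A silent break like this is catchable with a contract test on content rather than on paths. The sketch below assumes ASCII input, where Python's `str.upper` agrees with the `tr '[:lower:]' '[:upper:]'` call in the provider; `provider_output_is_valid` is a hypothetical helper and the sample text is invented.

```python
def provider_output_is_valid(input_text: str, output_text: str) -> bool:
    # v1 contract: the output is the uppercase of the input.
    return output_text == input_text.upper()


source = "hello a\n"           # hypothetical content of data/A.txt
v1_output = source.upper()     # what the original tr-based rule produces
broken_output = source         # what the cat-based rule produces

print(provider_output_is_valid(source, v1_output))      # True
print(provider_output_is_valid(source, broken_output))  # False
```

The path check alone would pass in both cases, which is why the contract lists semantics next to the path.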
Step 3: correct governance (bump to v2)¶
Fix by bumping results_prefix in config/config.yaml:
Update workflow/contracts/FILE_API.md to “File API (v2)” and describe the new semantics. Run:
Expected evidence (verbatim path change):
- outputs now exist under results/v2/...
- v1 remains intact unless explicitly cleaned
Fix pattern (team checklist)¶
A PR is not reviewable without:
- snakemake --lint
- snakemake -n
- .proof/summary.txt
- .proof/list-changes.*.txt
- FILE_API.md diff if anything about outputs changed
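The artifact half of this checklist can be checked by a script before a human looks at the PR. This is a minimal sketch under the assumption that evidence lives in a `.proof` directory at the repository root, as in the commands above; `missing_evidence` is a hypothetical helper, and the temporary directory stands in for a checkout.

```python
from pathlib import Path
import tempfile


def missing_evidence(root: Path) -> list:
    """Return the checklist artifacts that are absent under root."""
    proof = root / ".proof"
    missing = []
    if not (proof / "summary.txt").is_file():
        missing.append(".proof/summary.txt")
    if not (proof.is_dir() and any(proof.glob("list-changes.*.txt"))):
        missing.append(".proof/list-changes.*.txt")
    return missing


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = missing_evidence(root)
    # Create placeholder artifacts the way the commissioning commands would.
    (root / ".proof").mkdir()
    (root / ".proof" / "summary.txt").write_text("placeholder\n")
    (root / ".proof" / "list-changes.code.txt").write_text("")
    after = missing_evidence(root)

print(before)  # ['.proof/summary.txt', '.proof/list-changes.*.txt']
print(after)   # []
```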
Proof hook¶
Provide:
- the head output showing silent semantic break under v1,
- the config + contract bump to v2,
- and a directory listing proving v2 outputs exist.
Debugging playbook: scaling boundary failures¶
| Symptom | Command | Evidence | Likely cause | First fix |
|---|---|---|---|---|
| Targets shrink/expand | snakemake -n | planned file set differs | include leakage | module boundary + config-derived lists |
| Config typos accepted | snakemake -n | no validation error | missing strict schema | validate(config, schema) |
| CI flaky | diff | NONDETERMINISTIC | entropy in outputs | move entropy to logs; seed via config |
| Unsure what changed | --list-changes code | impacted outputs listed | drift | attach drift artifact; rerun targeted |
| Resource claims untrusted | cat logs/resources/*.txt | threads/mem logged | unproved resources | evidence logs per rule |
| Module import surprises | --list-rules | unexpected rules present | wildcard import | import only required rules |
Snakefile (module assembly, final)¶
from snakemake.utils import validate
configfile: "config/config.yaml"
validate(config, "config/config.schema.yaml")
include: "workflow/rules/consumer.smk"
include: "workflow/rules/entropy.smk"
include: "workflow/rules/resources.smk"
module provider:
    snakefile: "modules/provider/Snakefile"
use rule provider_make from provider as provider_*
rule all:
    input:
        f"{config['results_prefix']}/consumer/all.upper.txt",
        f"{config['results_prefix']}/entropy.txt",
        f"{config['results_prefix']}/resources/A.done.txt",
        f"{config['results_prefix']}/resources/B.done.txt"
Snakefile.reference (single-file sanity, final)¶
from snakemake.utils import validate
configfile: "config/config.yaml"
validate(config, "config/config.schema.yaml")
SAMPLES = config["samples"]
P = config["results_prefix"]
rule all:
    input:
        f"{P}/consumer/all.upper.txt",
        f"{P}/entropy.txt",
        f"{P}/resources/A.done.txt",
        f"{P}/resources/B.done.txt"

rule provider_make:
    input: "data/{sample}.txt"
    output: f"{P}/provider/{{sample}}.upper.txt"
    shell: "tr '[:lower:]' '[:upper:]' < {input} > {output}"

rule consumer_all:
    input:
        expand(f"{P}/provider/{{sample}}.upper.txt", sample=SAMPLES)
    output:
        f"{P}/consumer/all.upper.txt"
    shell:
        "cat {input} > {output}"

rule entropy_bad:
    output: f"{P}/entropy.txt"
    shell:
        # redirect on the python command itself so the timestamp reaches the file
        "python - > {output} << 'PY'\n"
        "import time\n"
        "print(time.time())\n"
        "PY\n"

def mem_mb_from_input(wc, input):
    return max(200, int(2 * input.size_mb) + 200)

rule resource_probe:
    input: "data/{sample}.txt"
    output: f"{P}/resources/{{sample}}.done.txt"
    log: "logs/resources/{sample}.txt"
    threads: 2
    resources:
        mem_mb=mem_mb_from_input
    shell:
        r"""
        printf "sample={wildcards.sample}\nthreads={threads}\nmem_mb={resources.mem_mb}\ninput={input}\n" > {log}
        echo OK > {output}
        """
Run sanity:
Closing recap¶
Scaling Snakemake is boundary engineering:
- Modules enforce explicit dependencies; includes invite silent coupling.
- Schemas + file API docs turn correctness into something you can prove.
- CI gates kill entropy early.
- Resources must produce evidence artifacts; otherwise they are folklore.
- Breaking changes must be path/version changes, not “refactors”.