Snakemake Deep Dive — Module 03¶
Production Snakemake: HPC/Cloud Execution, Error Handling, Data Locality, Governance¶
Version & scope contract
Target: Snakemake 9.14.x (this module relies on profile version files like config.vX+.yaml, plugin catalog executors/storage, and the current unit-test generator behavior). Verify:
snakemake --version
snakemake --help | sed -n '1,40p'
- In scope: profiles as policy, executor/storage plugins, retries + incomplete semantics, staging/data locality, CI testing, governance/drift.
- Out of scope: authoring fundamentals (Module 01), checkpoints/wildcard expansion theory (Module 02).
Orientation: production is “policy + plugins + proofs”¶
Production Snakemake means you stop relying on “tribal CLI invocations” and you make execution reproducible by encoding policy in a profile, capabilities in plugins, and correctness via proof artifacts (logs, change reports, tests). Profiles in 9.x are explicitly version-scoped (config.vX+.yaml) and can set any CLI option by YAML key. (Snakemake)
Unified cost model¶
Total pain ≈ scheduler friction + FS latency + staging mistakes + poison artifacts + provenance loss
| What hurts | What you see | Dominant cause | First fix |
|---|---|---|---|
| Scheduler friction | too many tiny jobs | DAG granularity + submit overhead | group/merge jobs; cap submit rates |
| FS latency | “output missing” after job finished | shared FS lag | raise --latency-wait (Snakemake) |
| Staging mistakes | outputs “disappear” / land in wrong place | wrong prefixes / shared-fs-usage lies | make shared-fs-usage explicit + stage to scratch (Snakemake) |
| Poison artifacts | partial outputs break downstream | non-atomic writes + failure | atomic publish + strict incomplete policy (Snakemake) |
| Provenance loss | change reports empty | --drop-metadata | never drop metadata in prod (Snakemake) |
Minimal production lab (runnable baseline)¶
This module uses a tiny workflow that exercises: profiles, executor plugin wiring, retries, incomplete outputs, staging knobs, unit-test generation, and drift reporting.
Golden layout (pre-run)¶
.
├── Snakefile
├── profiles
│ ├── local
│ │ └── config.v9+.yaml
│ └── slurm
│ └── config.v9+.yaml
├── scripts
│ ├── flaky_once.py
│ ├── poison.py
│ └── atomic_writer.py
└── results
Golden “commissioning” command sequence¶
snakemake --profile profiles/local -n
snakemake --profile profiles/local --cores 2
snakemake --profile profiles/local --retries 1 results/flaky_once.txt
snakemake --profile profiles/local --generate-unit-tests
pytest .tests/unit/
snakemake --profile profiles/local --list-changes code
--generate-unit-tests and .tests/unit + pytest invocation are official behavior. (Snakemake) --list-changes is the official drift report for changed input|code|params. (Snakemake)
Core 1 — Execution backends via profiles (cluster-first by construction)¶
Learning objectives¶
You will be able to:
- Encode execution policy in a version-scoped profile and prove it’s applied.
- Switch local ↔ SLURM without editing workflow code.
- Predict and fix “profile not applied” failures using evidence.
Definition¶
A profile is a directory containing config.vX+.yaml (preferred) or config.yaml (fallback). Each CLI flag --foo-bar becomes YAML key foo-bar:; profiles can also include auxiliary files. (Snakemake)
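The flag-to-key mapping can be illustrated with a small sketch. This is not Snakemake's own parser; `flags_to_profile` is a hypothetical helper that only shows how long options such as `--foo-bar value` line up with profile keys such as `foo-bar: value`. Values stay as strings, and short options like `-n` are not handled.

```python
def flags_to_profile(argv):
    """Map long CLI flags to profile-style keys (illustration only)."""
    profile = {}
    key = None
    for token in argv:
        if token.startswith("--"):
            key = token[2:]
            profile[key] = True  # a bare flag acts as a boolean switch
        elif key is not None:
            current = profile[key]
            if current is True:
                profile[key] = token  # first value for this flag
            elif isinstance(current, list):
                current.append(token)  # third and later values
            else:
                profile[key] = [current, token]  # second value: promote to list
    return profile


example = flags_to_profile(
    ["--executor", "slurm", "--jobs", "50", "--printshellcmds",
     "--shared-fs-usage", "persistence", "sources"]
)
print(example)
```

Dumping the resulting dictionary as YAML gives the shape of a profile file; the real profile loader may coerce types differently.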
Semantics¶
- Profiles are policy. Workflow code describes the DAG; profile describes how/where it runs. (Snakemake)
- The SLURM executor is a plugin; it can be set via profile with executor: slurm. (Snakemake)
flowchart LR
A[Snakefile] --> B[Compile DAG]
C[Profile] --> D[Executor choice]
D --> E[local jobs]
D --> F[slurm jobs]
B --> D
Failure signatures¶
- Runs locally despite “cluster intent” → wrong profile path or wrong filename (config.v9+.yaml missing).
- Unknown executor → SLURM plugin not installed on the submission host.
- Logs missing → SLURM plugin defaults delete successful logs unless configured. (Snakemake)
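When debugging a "profile not applied" incident, it helps to reason about which file in the profile directory would be picked up. The sketch below is an assumption-laden model, not Snakemake's actual resolution code: it assumes that config.vX+.yaml means "minimum major version X", that the highest compatible X wins, and that config.yaml is the fallback. `pick_profile_file` is a hypothetical helper.

```python
import re

# Matches version-scoped profile files such as config.v9+.yaml.
_VERSIONED = re.compile(r"^config\.v(\d+)\+\.yaml$")


def pick_profile_file(filenames, running_major):
    """Return the profile file this model would select, or None."""
    candidates = []
    for name in filenames:
        match = _VERSIONED.match(name)
        if match and int(match.group(1)) <= running_major:
            candidates.append((int(match.group(1)), name))
    if candidates:
        return max(candidates)[1]  # highest compatible minimum version
    return "config.yaml" if "config.yaml" in filenames else None


print(pick_profile_file(["config.v9+.yaml", "config.yaml"], 9))
print(pick_profile_file(["config.v9.yaml"], 9))  # misnamed file, missing "+"
```

Under this model a misnamed file such as config.v9.yaml is never selected, which matches the "wrong filename" failure signature above.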
Minimal repro (complete)¶
1) Two profiles¶
profiles/local/config.v9+.yaml
profiles/slurm/config.v9+.yaml
executor: slurm
jobs: 50
printshellcmds: true
latency-wait: 30
slurm-logdir: logs/slurm
slurm-keep-successful-logs: true
- latency-wait waits for outputs after job completion to tolerate FS latency. (Snakemake)
- SLURM plugin settings --slurm-logdir and --slurm-keep-successful-logs are documented; successful logs are deleted by default unless retention is enabled. (Snakemake)
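What latency-wait tolerates can be pictured as a bounded polling loop: the job has finished, but the output is not yet visible on the shared filesystem. The following is a minimal sketch of that idea, not Snakemake's implementation; `wait_for_output` and `poll_interval` are names introduced here for illustration.

```python
import time
from pathlib import Path


def wait_for_output(path, latency_wait=30.0, poll_interval=0.5):
    """Poll until path exists or latency_wait seconds have elapsed."""
    deadline = time.monotonic() + latency_wait
    target = Path(path)
    while True:
        if target.exists():
            return True  # output became visible in time
        if time.monotonic() >= deadline:
            return False  # treat as "output missing"
        time.sleep(poll_interval)
```

A larger wait hides filesystem lag at the cost of slower failure detection when an output is truly missing, which is why the value belongs in the profile rather than in ad-hoc invocations.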
2) Prove the profile is applied¶
mkdir -p .proof
snakemake --profile profiles/local -n --print-compilation > .proof/local.compile.txt
snakemake --profile profiles/slurm -n --print-compilation > .proof/slurm.compile.txt
--print-compilation is an official CLI flag for printing the workflow’s Python representation. (Snakemake)
Expected evidence (stable invariants):
- Both outputs contain Building DAG of jobs...
- The SLURM run shows an executor configured as slurm in the compilation output (search within the file for slurm).
Fix pattern¶
- Put everything operational into the profile: executor, job caps, log retention, latency wait.
- Treat ad-hoc CLI flags as incident response only; if it matters, it belongs in versioned profile files (config.vX+.yaml). (Snakemake)
Proof hook¶
Attach:
- .proof/slurm.compile.txt containing “Building DAG” and at least one occurrence of slurm.
- The exact profile file content you used.
Core 2 — Robustness: atomicity, retries, incomplete semantics¶
Learning objectives¶
You will be able to:
- Create failure modes that produce poison outputs, then eliminate them.
- Use --retries, --keep-incomplete, and --rerun-incomplete correctly.
- Explain why --drop-metadata destroys governance tools and refuse it in production.
Definition¶
Robustness is enforcing a strict output contract:
- outputs are either complete and correct, or absent / marked incomplete and rerunnable.
Key CLI:
- --retries restarts failing jobs. (Snakemake)
- --keep-incomplete keeps failed-job partial outputs. (Snakemake)
- --rerun-incomplete reruns jobs whose outputs are recognized as incomplete. (Snakemake)
Semantics¶
- --retries N restarts a job N times; the attempt counter exists to scale resources across retries. (Snakemake)
- --keep-incomplete is for forensics; it keeps poison outputs around (dangerous unless paired with strict reruns). (Snakemake)
- --drop-metadata makes provenance-based tools like --list-changes empty or incomplete; this is explicitly documented. (Snakemake)
flowchart TD
A[Job starts] --> B[Writes temp output]
B -->|success| C[Atomic rename -> final output]
B -->|fail| D[Temp stays / marked incomplete]
D -->|--rerun-incomplete| A
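The retry semantics and the role of the attempt counter can be sketched in plain Python. This models the behaviour described above (N retries means up to N + 1 attempts, with resources scaled by attempt); it is not Snakemake code. `mem_mb_for`, `run_with_retries`, `flaky_job`, and the 2000 MB base are hypothetical.

```python
def mem_mb_for(attempt, base_mb=2000):
    """Scale the memory request linearly with the attempt number."""
    return base_mb * attempt


def run_with_retries(job, retries):
    """Call job(attempt) up to 1 + retries times; return (attempt, result)."""
    last_error = None
    for attempt in range(1, retries + 2):
        try:
            return attempt, job(attempt)
        except RuntimeError as err:
            last_error = err  # remember the failure and try again
    raise last_error


def flaky_job(attempt):
    """Pretend the job needs at least 4000 MB to succeed."""
    if mem_mb_for(attempt) < 4000:
        raise RuntimeError(f"out of memory at attempt {attempt}")
    return f"ok with {mem_mb_for(attempt)} MB"


print(run_with_retries(flaky_job, retries=1))
```

In a Snakefile the same scaling is usually written as a resource callable that receives `attempt`; the sketch only makes the control flow visible.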
Failure signatures¶
- “Downstream consumed garbage” → non-atomic writer produced plausible partial output.
- “Works after rerun” → transient failure; you lacked retries.
- “Drift reports show nothing” → metadata was dropped. (Snakemake)
Minimal repro (complete)¶
Repro A — flaky once + retries¶
scripts/flaky_once.py
import os
import sys
from pathlib import Path

# Attempt number, read from the environment; falls back to 1 when unset.
attempt = int(os.environ.get("SNAKEMAKE_ATTEMPT", "1"))
out = Path(sys.argv[1])
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(f"attempt={attempt}\n")

# Fail deliberately on the first attempt so that --retries has work to do.
if attempt == 1:
    print("Failing on attempt 1 (intentional).", file=sys.stderr)
    sys.exit(42)

print("Succeeded on attempt >=2.", file=sys.stderr)
Run:
snakemake --profile profiles/local results/flaky_once.txt || true
snakemake --profile profiles/local --retries 1 results/flaky_once.txt
cat results/flaky_once.txt
Expected output (verbatim, file content):
attempt=2
Repro B — poison output + incomplete discipline¶
scripts/poison.py
import sys
from pathlib import Path
out = Path(sys.argv[1])
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text("PARTIAL\n")
print("Wrote PARTIAL then crashing.", file=sys.stderr)
sys.exit(13)
Run:
snakemake --profile profiles/local results/poison.txt || true
test -e results/poison.txt && echo "UNSAFE: poison remained" || echo "OK: removed"
snakemake --profile profiles/local --keep-incomplete results/poison.txt || true
printf "poison file content:\n"; cat results/poison.txt
Expected output (verbatim fragments):
- After default failure:
OK: removed
- With --keep-incomplete:
poison file content:
PARTIAL
--keep-incomplete behavior is explicitly defined. (Snakemake)
Repro C — atomic writer (the fix)¶
scripts/atomic_writer.py
import sys
from pathlib import Path
final = Path(sys.argv[1])
tmp = final.with_suffix(final.suffix + ".tmp")
final.parent.mkdir(parents=True, exist_ok=True)
tmp.write_text("COMPLETE\n")
tmp.replace(final) # atomic rename on same filesystem
Rule uses atomic writer (rule atomic_ok in the appendix Snakefile):
- If the job fails before replace(), the final output never appears.
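The "final output never appears on failure" property can be checked with a slightly more defensive variant of the writer. This is a sketch, not the module's script: `atomic_write_text` is a hypothetical helper that creates the temporary file in the destination directory (so the rename stays on one filesystem) and removes it if content production fails.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(final_path, produce):
    """Write produce() to a temp file, then rename it onto final_path."""
    final = Path(final_path)
    final.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=final.parent, prefix=final.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(produce())
            handle.flush()
            os.fsync(handle.fileno())  # push data to disk before publishing
        os.replace(tmp_name, final)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # never leave a partial temp file behind
        raise
```

If `produce` raises, neither the final path nor the temporary file remains, so downstream rules cannot consume a partial result.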
Fix pattern¶
- Never write final outputs “in place” unless the write is atomic by construction.
- Use --keep-incomplete only during triage; otherwise you risk poisoning future DAG runs.
- Hard rule: do not use --drop-metadata in production because it invalidates --list-changes and provenance reports. (Snakemake)
Proof hook¶
Submit:
- cat results/flaky_once.txt showing attempt=2.
- Evidence that poison file contains PARTIAL only when run with --keep-incomplete. (Snakemake)
Core 3 — Data locality and staging: storage plugins + explicit prefixes¶
Learning objectives¶
You will be able to:
- Configure staging to local scratch via --default-storage-provider, --local-storage-prefix, --remote-job-local-storage-prefix, and --shared-fs-usage.
- Demonstrate staging with filesystem evidence, and demonstrate a staging failure that proves misconfiguration.
- Encode staging in the profile instead of relying on per-run CLI.
Definition¶
Snakemake can map inputs/outputs to storage providers implemented as plugins. (Snakemake) The fs storage plugin uses rsync to read/write from a locally mounted filesystem and is specifically motivated by avoiding harmful parallel IO patterns on NFS. (Snakemake)
Semantics¶
The fs plugin documentation gives a canonical staging configuration:
- --default-storage-provider fs
- --local-storage-prefix /local/work/$USER
- --shared-fs-usage persistence software-deployment sources source-cache
…and shows how to set remote-job-local-storage-prefix for job-specific scratch. (Snakemake)
It also explicitly notes you still need a non-remote local storage prefix because some jobs may execute without remote submission. (Snakemake)
flowchart LR
N["Shared FS (NFS/Lustre)"] -->|stage in| S[Scratch prefix]
S --> J[Job executes]
J -->|stage out| N
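The stage-in, execute, stage-out cycle in the diagram can be sketched with plain file copies. This is only a conceptual model: the fs plugin uses rsync and its own path layout, whereas `run_staged` and `uppercase` below are hypothetical helpers built on `shutil`.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(shared_input, shared_output, scratch_root, work):
    """Stage in, run work(local_in, local_out) on scratch, stage out, clean up."""
    Path(scratch_root).mkdir(parents=True, exist_ok=True)
    scratch = Path(tempfile.mkdtemp(dir=scratch_root))
    try:
        local_in = scratch / "in" / Path(shared_input).name
        local_out = scratch / "out" / Path(shared_output).name
        local_in.parent.mkdir()
        local_out.parent.mkdir()
        shutil.copy2(shared_input, local_in)  # stage in
        work(local_in, local_out)  # the job only touches scratch paths
        Path(shared_output).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(local_out, shared_output)  # stage out
    finally:
        shutil.rmtree(scratch, ignore_errors=True)  # always release scratch


def uppercase(src, dst):
    """Toy job: copy the input to the output in upper case."""
    dst.write_text(src.read_text().upper())
```

The design point is that the job body sees only scratch paths; the shared filesystem is touched twice, once per copy, instead of on every read and write.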
Failure signatures¶
- Scratch directory stays empty → storage plugin not active (missing plugin install or flags/profile).
- rsync / permission error → scratch prefix not writable (most common real incident).
- Outputs appear locally but not on shared FS → shared-fs-usage / prefix mismatch.
Minimal repro (complete)¶
Repro A — staging success with explicit scratch evidence¶
Install plugin (once, on the submission host):
pip install snakemake-storage-plugin-fs
Installation is documented in the plugin catalog. (Snakemake)
Run with a visible scratch prefix:
rm -rf .scratch .snakemake/storage results/staged_demo.txt
snakemake --profile profiles/local -F results/staged_demo.txt \
--default-storage-provider fs \
--shared-fs-usage persistence software-deployment sources source-cache \
--local-storage-prefix .scratch/$USER
This exact pattern is recommended by the fs plugin docs (with a scratch path). (Snakemake)
Inspect:
find .scratch/$USER -type f | head -n 10
Expected output (example shape): a non-empty file list under .scratch/$USER; exact paths vary.
Repro B — staging failure (misconfigured scratch prefix)¶
Force a non-writable scratch prefix:
snakemake --profile profiles/local -F results/staged_demo.txt \
--default-storage-provider fs \
--shared-fs-usage persistence software-deployment sources source-cache \
--local-storage-prefix /root/forbidden_scratch
Expected failure (verbatim fragment):
- A permission error writing into /root/forbidden_scratch (either from Snakemake or rsync).
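A non-writable scratch prefix can be caught before any job is submitted. The preflight below is a sketch under the assumption that creating a temporary file is a sufficient writability probe; `scratch_is_writable` is a hypothetical helper and not part of Snakemake or the fs plugin.

```python
import os
import tempfile


def scratch_is_writable(prefix):
    """Return True if prefix can be created and a file written inside it."""
    try:
        os.makedirs(prefix, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=prefix):
            pass  # creating and deleting a temp file is the whole probe
        return True
    except OSError:
        return False  # permission denied, read-only FS, bad path, ...
```

Running such a check on the submission host and inside a trivial cluster job separates "prefix wrong everywhere" from "prefix missing on compute nodes".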
Fix pattern¶
- Treat staging configuration as policy: move it into the profile once it works.
- Encode both prefixes, because Snakemake may execute some jobs without remote submission (Snakemake):
  - local-storage-prefix (for local jobs)
  - remote-job-local-storage-prefix (for cluster jobs)
Proof hook¶
Provide:
- Output of find .scratch/$USER -type f | head -n 10
- Your profile snippet (or CLI) showing default-storage-provider: fs and local-storage-prefix: ... (Snakemake)
Core 4 — Testing and CI/CD: generate unit tests, then gate¶
Learning objectives¶
You will be able to:
- Generate unit tests with --generate-unit-tests.
- Run pytest and interpret failures as workflow regressions (not “pytest problems”).
- Keep unit tests small and deterministic.
Definition¶
Snakemake can generate unit tests from a successful run by copying representative job inputs into .tests/unit and producing pytest tests. (Snakemake)
Semantics¶
- Generate: snakemake --generate-unit-tests (Snakemake)
- Run: pytest .tests/unit/ (Snakemake)
- Each test file is .tests/unit/test_<rulename>.py and compares outputs to the “known-good” results; default comparison is byte-by-byte via cmp/zcmp/bzcmp/xzcmp. (Snakemake)
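The byte-by-byte comparison can be approximated in Python to see what a generated test is asserting. This sketch is not the generated test code; `same_bytes` is a hypothetical helper that mirrors cmp for plain files and, like zcmp, compares decompressed content for .gz files.

```python
import filecmp
import gzip


def same_bytes(generated, expected):
    """Compare two files by content; .gz files are compared decompressed."""
    if str(expected).endswith(".gz"):
        with gzip.open(generated, "rb") as left, gzip.open(expected, "rb") as right:
            return left.read() == right.read()
    # shallow=False forces a real content comparison instead of a stat check
    return filecmp.cmp(generated, expected, shallow=False)
```

Because the comparison is exact, any timestamp, random seed, or ordering difference in an output shows up as a test failure.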
flowchart TD
A[Successful run] --> B[--generate-unit-tests]
B --> C[.tests/unit + fixtures]
C --> D[pytest gate]
D -->|fail| E[workflow regression]
D -->|pass| F[ship]
Failure signatures¶
- “skipped job” warning during generation → representative job inputs not present. (Snakemake)
- pytest fails after legitimate change → you changed a contract; update golden outputs intentionally (and bump version).
- pytest flaky → workflow nondeterminism (random seeds, timestamps, unstable discovery).
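Nondeterminism is easiest to confirm by producing the same output several times and comparing digests. The sketch below assumes outputs fit in memory; `is_deterministic`, `seeded_sample`, and `unseeded_sample` are hypothetical names used only for illustration.

```python
import hashlib
import os
import random


def digest_of(produce):
    """Hash the bytes returned by one call to produce()."""
    return hashlib.sha256(produce()).hexdigest()


def is_deterministic(produce, runs=3):
    """True if repeated calls to produce() yield identical bytes."""
    return len({digest_of(produce) for _ in range(runs)}) == 1


def seeded_sample():
    """Deterministic: a fresh generator with a fixed seed on every call."""
    rng = random.Random(42)
    return bytes(rng.randrange(256) for _ in range(16))


def unseeded_sample():
    """Nondeterministic: fresh entropy from the operating system."""
    return os.urandom(16)


print(is_deterministic(seeded_sample), is_deterministic(unseeded_sample))
```

The same check applied to a rule's output file (run twice into different directories, compare digests) tells you whether byte-exact unit tests are viable for that rule.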
Minimal repro (complete)¶
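Run once:
snakemake --profile profiles/local --cores 2
Generate tests:
snakemake --profile profiles/local --generate-unit-tests
Inspect one generated test file (this is the verbatim evidence):
sed -n '1,80p' .tests/unit/test_atomic_ok.py
Run pytest:
pytest .tests/unit/
Expected pytest tail (verbatim shape): summary lines showing the collected tests and their pass/fail counts.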
Fix pattern¶
- Generate tests only from a small dummy dataset; the docs explicitly warn against generating tests from big data. (Snakemake)
- CI gates (minimum):
  - snakemake --lint (Snakemake)
  - pytest .tests/unit/ (Snakemake)
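The two gates can be wired into a single fail-fast step. This is a sketch of one possible CI entry point, not a prescribed layout; `run_gates` is a hypothetical helper, and the injectable `runner` argument exists only so the logic can be exercised without Snakemake installed.

```python
import subprocess

# The minimum gates named above, in the order they should run.
GATES = [
    ["snakemake", "--lint"],
    ["pytest", ".tests/unit/"],
]


def run_gates(gates=GATES, runner=subprocess.run):
    """Run each gate in order; return the first failing command, or None."""
    for command in gates:
        result = runner(command)
        if result.returncode != 0:
            return command  # stop at the first failure
    return None
```

A CI script would call `run_gates()` and exit non-zero when the return value is not None, so a lint failure blocks the merge before the unit tests even start.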
Proof hook¶
Provide:
- First 30–80 lines of one generated .tests/unit/test_<rulename>.py file (via sed).
- The pytest summary lines showing collection and pass/fail.
Core 5 — Maintainability and governance: drift reports, contracts, versioning¶
Learning objectives¶
You will be able to:
- Detect drift with --list-changes and explain what changed.
- Prove that dropping metadata breaks governance tools (and refuse it).
- Adopt a review checklist that prevents interface breakage.
Definition¶
Governance means: stable interfaces + explicit change control + auditable provenance.
Snakemake provides drift tools:
- --list-changes {input,code,params} lists output files whose specified items changed since creation. (Snakemake)
- --drop-metadata makes provenance-based reports (including --list_x_changes) empty or incomplete. (Snakemake)
Semantics¶
- --list-changes code is your “what did we invalidate?” query after editing scripts/rules. (Snakemake)
- If metadata is dropped, governance fails by definition. This is not a “maybe”; it is stated explicitly. (Snakemake)
flowchart LR
A[Run] --> B[Metadata tracked]
B --> C[--list-changes]
D[--drop-metadata] --> E[Reports empty/incomplete]
E --> F[Governance failure]
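Why a drift report depends on tracked metadata can be shown with a toy model: a digest of the code is recorded when an output is created and compared later. This is a conceptual sketch, not how Snakemake stores or compares metadata; `record`, `list_code_changes`, and the dictionary layout are hypothetical.

```python
import hashlib


def code_digest(source):
    """Stable fingerprint of a rule's code."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def record(metadata, output, source):
    """Store the code fingerprint at output creation time."""
    metadata[output] = code_digest(source)


def list_code_changes(metadata, current_sources):
    """Outputs whose recorded fingerprint differs from the current code."""
    changed = []
    for output, source in current_sources.items():
        recorded = metadata.get(output)
        if recorded is not None and recorded != code_digest(source):
            changed.append(output)
    return sorted(changed)


tracked = {}
record(tracked, "results/atomic_ok.txt", "python scripts/atomic_writer.py")
edited = {"results/atomic_ok.txt": "python scripts/atomic_writer.py --v2"}
print(list_code_changes(tracked, edited))  # drift is visible
print(list_code_changes({}, edited))       # no metadata, nothing to report
```

With an empty metadata store the report is empty even though the code changed, which is the governance failure the diagram describes.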
Failure signatures¶
- “Why did this rerun?” cannot be answered → metadata missing.
- “We changed code but nothing is flagged” → --drop-metadata was used, or outputs were recreated without tracking.
- Downstream consumers break → contracts were implicit, not versioned.
Minimal repro (complete)¶
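- Converge:
snakemake --profile profiles/local --cores 2
- Edit a script (e.g., append a harmless comment to scripts/atomic_writer.py).
- Ask Snakemake to enumerate invalidated outputs:
snakemake --profile profiles/local --list-changes code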
Expected behavior: at least one output is listed as impacted by code drift (exact formatting varies). (Snakemake)
- Demonstrate governance failure explicitly:
snakemake --profile profiles/local --drop-metadata --cores 2
snakemake --profile profiles/local --list-changes code
Expected behavior: the second --list-changes becomes empty or incomplete specifically because metadata was dropped (this is the documented effect). (Snakemake)
Fix pattern¶
Adopt three hard artifacts:
- workflow/CONTRACT.md: file naming + formats + schema expectations.
- workflow/VERSION: semantic version (bump on contract changes).
- workflow/REVIEW.md: checklist requiring:
  - snakemake --lint (Snakemake)
  - snakemake -n --summary --reason
  - snakemake --list-changes code|params|input evidence (Snakemake)
  - “No --drop-metadata” attestation (Snakemake)
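The "bump on contract changes" rule can be enforced mechanically in review. The sketch below assumes a plain MAJOR.MINOR.PATCH string in workflow/VERSION and treats a contract change as requiring a major bump; that policy and the helper names `parse_version` and `bump_is_sufficient` are assumptions, not something the module prescribes.

```python
import re

_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    match = _SEMVER.match(text.strip())
    if not match:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(int(part) for part in match.groups())


def bump_is_sufficient(old, new, contract_changed):
    """Contract changes need a major bump; otherwise never go backwards."""
    old_v, new_v = parse_version(old), parse_version(new)
    if contract_changed:
        return new_v[0] > old_v[0]
    return new_v >= old_v


print(bump_is_sufficient("1.2.3", "2.0.0", contract_changed=True))
print(bump_is_sufficient("1.2.3", "1.3.0", contract_changed=True))
```

A reviewer answers "contract changed? yes/no" once, and the check turns that answer into a pass or fail on the VERSION diff.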
Proof hook¶
Provide:
- The exact output of snakemake --list-changes code before and after --drop-metadata.
- Your workflow/VERSION and a short note: “contract changed? yes/no”.
Appendix — Consolidated reference Snakefile (single-file, end-to-end)¶
Snakefile
rule all:
    input:
        "results/staged_demo.txt",
        "results/flaky_once.txt",
        "results/atomic_ok.txt",

rule staged_demo:
    output:
        "results/staged_demo.txt"
    shell:
        "printf 'staged_demo=ok\\n' > {output}"

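rule flaky_once:
    output:
        "results/flaky_once.txt"
    # Assumption: Snakemake does not export SNAKEMAKE_ATTEMPT on its own,
    # so the attempt number is passed to the script explicitly.
    resources:
        attempt=lambda wildcards, attempt: attempt
    shell:
        "SNAKEMAKE_ATTEMPT={resources.attempt} python scripts/flaky_once.py {output}"
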
rule poison:
    output:
        "results/poison.txt"
    shell:
        "python scripts/poison.py {output}"

rule atomic_ok:
    output:
        "results/atomic_ok.txt"
    shell:
        "python scripts/atomic_writer.py {output}"
Closing recap¶
If you want production-grade Snakemake, stop optimizing rules first. Instead:
- Profiles are policy, version-scoped (config.vX+.yaml), and they must fully encode how the DAG is executed. (Snakemake)
- Robustness is atomic outputs + strict incomplete semantics + retries; poison artifacts are a correctness bug, not an inconvenience. (Snakemake)
- Data locality is explicit: staging to scratch must be configured and proven with filesystem evidence; the fs plugin gives canonical patterns. (Snakemake)
- CI is real only when it runs workflow-derived tests (--generate-unit-tests + pytest) and gates merges. (Snakemake)
- Governance requires metadata and drift reports; --drop-metadata is operational malpractice in production because it breaks those tools by design. (Snakemake)