Snakemake Deep Dive
Module 01: First Principles — The File-DAG Contract
01. Mental Model: Rules → Files → DAG → Jobs
- Snakemake is a file-driven DAG engine, not a scripting framework
- Inputs/outputs define state; re-runs via timestamps/checksums and parameter changes
- Reproducibility and provenance as non-negotiable constraints
- Diagram: Rules + file patterns → DAG → scheduled jobs
02. Rule Anatomy: Wildcards, Resources, and Safe File Semantics
- Inputs/outputs/params/log/benchmark; wildcards and constraints to prevent ambiguity
threads vs resources (time/mem/partition/GPU): correctness and scheduling implications temp(), protected(), ancient(), touch()—when they’re safe vs when they create lies shadow: per-job working directories to reduce NFS contention and isolate temp files
03. Configuration Discipline: Config-as-Data + Profiles
config.yaml as the single source of truth; no hidden globals - Profiles as “site policy” (cluster args, default resources, latency-wait, retries)
- Validation patterns: schemas, defensive defaults, and fail-fast config checks
04. Modularity: Includes, Modules, Interfaces, and Boundaries
include: vs module vs subworkflows; when each is justified - Split by stage with explicit contracts (file formats, naming, metadata)
- Shared utilities without state leakage; keep “library code” pure and testable
05. Observability: Debugging the DAG, Not the Symptoms
-n, --reason, --summary, --dag, --rulegraph, --lint - Why a rule re-runs: inputs/code/params/env changes; diagnosing with evidence
- Failure triage: isolate minimal repro, logs per rule, strict output discipline
- Sample sheets/metadata →
expand, glob_wildcards, structured target lists - Preventing accidental cartesian products; constraints as correctness tools
- Validation checkpoints for “are all samples present?” before full expansion
02. Checkpoints: Dynamic DAGs Done Safely (and When They’re a Smell)
- Two-phase execution model; reading discovered outputs correctly
- Discovery traps: nondeterminism, “moving target” outputs, hidden dependencies
- Strong pattern: external discovery logic is deterministic; checkpoint validates/stabilizes
03. Data Integrity and Provenance as First-Class Outputs
- Logs/benchmarks/reports per rule; structured artifact layout
- Provenance: tool versions, params, config snapshots,
--report discipline - Rerun semantics beyond timestamps: parameter drift and environment changes
04. Environments and Containers: Reproducibility Without Slowness
- Per-rule
conda: (or container) as default; pinning versions and channels - Singularity/Apptainer realities on HPC (binds, caches, performance)
- Avoiding “env churn”: caching, reuse strategy, and minimizing solver pain
group / localrules to control scheduler overhead and job granularity - Scatter/gather patterns; staging temp intermediates; compression tradeoffs
- Filesystem constraints: metadata storms, small-file pathologies, and mitigation
Module 03: Production Snakemake — HPC/Cloud Execution, Error Handling, Data Locality, Governance
01. Execution Backends: Cluster-First Operation via Profiles
- Local vs cluster vs cloud backends; what changes in failure and latency behavior
- SLURM essentials through profiles; resource modeling as policy (not ad hoc flags)
- Custom jobscript templates: pre/post hooks, modules, scratch setup, advanced allocation
02. Robustness: Atomic Outputs, Exit Codes, and Recovery Semantics
- Atomic writes, temp staging, and safe cleanup; resumability by construction
- Error handling: exit codes, optional failures, and explicit “fail vs continue” policy
- Partial outputs,
--rerun-incomplete, checkpoint recovery, retries/backoff discipline
03. Scaling + Data Locality: Remote Files and Explicit Staging
- Remote inputs/outputs:
remote(), --default-remote-provider (S3/GS/HTTP) and caching - Data locality on HPC:
shadow + explicit stage-in/stage-out to node-local scratch - Controlling DAG width and batching to keep the scheduler and filesystem stable
04. Testing and CI/CD for Workflows (Real, Not Cosmetic)
- Rule-level tests with minimal fixtures; integration tests for critical DAG paths
- Linting, pinned profiles, reproducible env builds in CI
- Regression tests: outputs + metadata/provenance checks, not just file existence
05. Maintainability: Contracts, Versioning, Workflow Catalogues, Team Practice
- Stable interfaces between modules (formats, schemas, naming, directory conventions)
- Versioned configs/workflows; change control; review checklists for correctness/perf
- Workflow catalogues/registries: reusable modules with explicit semantic versioning
Module 04: Scaling Patterns — Modularity, Interfaces, CI Gates, and Executor-Proof Semantics
01. Modularity That Scales: include, module, and Real Boundaries
- Why “split files” is not modularity: interfaces are the unit of reuse
include: for organization; module for reusable components with pinned versions - Avoiding state leakage: keep shared code pure; no hidden globals, no implicit IO
- Failure modes: circular includes, wildcard drift, hidden cross-module dependencies
- Proof hooks:
--list-rules, --rulegraph, --dag, “consumer stays stable while provider internals change”
02. Interface Contracts: Naming, Schemas, Versioned Outputs, and Compatibility
- Files as APIs: explicit output layout, naming invariants, and format guarantees
- Config + metadata schemas (fail fast): required fields, allowed values, cross-field constraints
- Versioning strategy:
results/v1/... vs results/v2/... and when to use rule version: - Compatibility rules: what changes are non-breaking vs breaking (and how to force reruns correctly)
- Failure signatures: silent schema drift, ambiguous targets, “it ran but outputs are wrong”
- Proof hooks: schema break → hard failure; compatible change → no rerun; breaking change → forced rerun + version bump
03. Determinism and Drift Control: CI as a Correctness Boundary
- “Ran once” is not correctness: plan stability + output stability + provenance stability
- CI gates:
--lint, --dry-run, rule/unit tests with minimal fixtures, golden outputs - Detecting hidden entropy: timestamps, random seeds, external state, non-pinned envs
- Provenance diffs as regressions: stable params/config snapshots,
--list-changes discipline - Failure signatures: flaky tests, nondeterministic outputs, “works locally” drift
- Proof hooks: intentional nondeterminism → CI fails; deterministic fix → CI passes with stable diffs
04. Resource Semantics With Evidence: Dynamic Resources That Map to Executors
- Dynamic
threads/resources from wildcards/input sizes (done safely, reproducibly) - Default-resources as policy; per-rule overrides as exceptions with justification
- Mapping proof:
threads/resources → executor constraints → rendered jobscript/log evidence - Scheduler load controls: grouping, job sizing, and explicit batching policies
- Failure signatures: oversubscription, queue rejection, “resources ignored”, latency-induced flakiness
- Proof hooks: rendered jobscript contains expected directives; intentional under-allocation fails; corrected resources succeed
05. Workflow as a Product: Distribution, Pinning, Upgrade Paths, and Team Practice
- Reusable workflow modules with pinned revisions (commit/tag), explicit interfaces, and changelogs
- Repo layout conventions for scale:
workflow/, profiles/, envs/, scripts/, tests/, schemas/ - Upgrade discipline: contract-compatible refactors vs breaking interface changes
- Review checklist for teams: interface changes, resource changes, provenance changes, determinism risks
- Proof hooks: consumer imports pinned provider; provider refactor does not break consumer; breaking change requires explicit version bump + consumer update