Snakemake Capstone — Reference Workflow¶
A compact, end-to-end Snakemake workflow that demonstrates rigorous engineering practices on toy FASTQ inputs. The biological analysis is intentionally minimal; the emphasis is on workflow correctness, reproducibility, and maintainability: explicit contracts, safe dynamic DAGs, governed configuration, versioned publishing, artifact verification, and execution across contexts (local, CI, cluster, Docker).
This project is designed to be executed, studied, and extended as a reference implementation.
The CI workflow executes full confirmation runs, including workflow execution and artifact validation.
Purpose¶
This capstone provides a practical reference for constructing Snakemake workflows that remain reliable at scale. It addresses common failure modes—implicit dependencies, unsafe dynamic behavior, partial outputs, configuration drift, and reproducibility gaps—through disciplined patterns:
- Early and explicit failure detection
- Clear input/output contracts and deterministic paths
- Safe handling of dynamic DAGs via checkpoints
- Strict configuration validation and modular composition
- Versioned, integrity-checked publishing
- Comprehensive testing and verification gates
The workflow is deliberately small yet complete, allowing full execution and inspection on any machine.
Key Concepts Demonstrated¶
Workflow Design¶
- Explicit rule contracts with stable filenames and directories
- Protected and temporary outputs to prevent partial artifacts
- Safe dynamic DAGs using checkpoints and proper re-evaluation
- Scatter/gather patterns for per-sample and run-level processing
- Modular rule organization with namespacing
- Clear separation of internal intermediates (
results/) and published deliverables (publish/vN/)
Operational Practices¶
- Execution profiles for environment-specific settings
- Structured, per-rule logging and benchmarking
- Resource declarations with resolved-value traceability
- Docker execution surface for consistent runtime
Quality Assurance¶
- Schema-validated configuration as a required step
- Unit tests for pure Python components
- Artifact verification (parsing and sanity checks)
- Clean-room confirmation via Make targets
Pipeline Overview¶
High-Level Stages¶
-
Sample Discovery (checkpoint)
Scansdata/raw/for FASTQ files and produces a sample registry. -
Per-Sample Processing
- Raw FASTQ quality control
- Adapter trimming
- Trimmed quality control
- Deduplication
- k-mer profiling
-
Reference panel screening
-
Run-Level Aggregation
- Consolidation of per-sample results into summary tables
-
HTML report generation
-
Publishing
- Emission of versioned outputs to
publish/v1/ - Generation of a checksummed manifest for integrity verification
Published Artifacts (Stable Interface)¶
Published outputs reside in publish/v1/ and form a stable, externally consumable contract:
discovered_samples.jsonprovenance.jsonsummary.jsonandsummary.tsvreport/index.htmlmanifest.json(inventory with SHA-256 checksums)
Detailed specifications are provided in FILE_API.md.
Internal directories (results/, logs/, benchmarks/, .snakemake/) are not part of the stable interface.
Repository Layout¶
.
├── Snakefile # Top-level orchestration and targets
├── workflow/ # Rule modules and publishing logic
├── src/ # Python package for workflow steps
├── tests/ # Unit tests for Python components
├── config/ # config.yaml + schema.yaml
├── profiles/ # Execution profiles (local, CI, cluster patterns)
├── data/ # Toy inputs (FASTQs, reference panel)
├── FILE_API.md # Published artifact contract
└── Makefile # Quality gates and entrypoints
Runtime-generated: - results/, logs/, benchmarks/, publish/, .snakemake/
Requirements¶
Host execution - Python 3.11+ - Snakemake (system or virtual environment)
Docker execution - Docker daemon available
(Designed to eliminate host Conda dependencies)
Quick Start¶
Full Clean-Room Execution with Verification¶
Executes the workflow from scratch and validates published artifacts (parsing, sanity, manifest integrity).
Dry-Run Preview¶
Displays planned jobs and commands without execution.
Execution via Profiles¶
Profiles separate workflow logic from execution context.
Add -p to print commands as they would be executed.
Note on Checkpoints: The sample discovery checkpoint triggers DAG re-evaluation—an expected and intentional behavior visible in dry-runs.
Makefile Targets (Primary Interface)¶
| Category | Target | Purpose |
|---|---|---|
| Cleanup | make clean | Remove all generated state and outputs |
| Formatting | make fmt, make fmt-check | Format and validate code formatting |
| Linting | make lint, make check | Static analysis and composite checks |
| Testing | make test, make ci | Unit tests and CI-style gate |
| Workflow | make wf-lint | Snakemake lint |
make wf-dryrun | Preview execution plan | |
make run | Execute workflow | |
make dag / make rulegraph | Generate visualizations | |
| Validation | make validate-config | Schema validation |
make verify-artifacts | Parse and sanity-check published outputs | |
make verify | Full run + artifact verification | |
make confirm | Strongest gate: clean + checks + tests + lint + dry-run + run + verify | |
| Docker | make docker-build | Build container image |
make docker-run | Execute workflow in container |
Recommendation: Use make confirm for confidence before commits or releases.
Extending the Workflow¶
Preserve these invariants when adding functionality:
- Deterministic outputs for identical inputs and configuration
- Explicit input/output declarations
- Temporary writes moved atomically
- Stable publish boundary (
publish/vN/) - Validation coverage for new artifacts
To modify the published contract, increment the version directory (e.g., v2) and update FILE_API.md.
License¶
MIT — see the top-level LICENSE. © 2025 Bijan Mousavi.