
Health, Readiness, and Drain

Atlas exposes separate signals that operators should not collapse into one boolean:

  • health
  • readiness
  • overload or drain state

Endpoint Model

flowchart LR
    Runtime[Atlas runtime] --> Health[Health route]
    Runtime --> Ready[Readiness route]
    Runtime --> Overload[Overload route]
    Runtime --> Live[Liveness route]

The endpoint model guards against a common operator mistake: treating every probe as if it answered the same operational question.

Why the Distinction Matters

flowchart TD
    Healthy[Process is alive] --> NotReady[May still be unready]
    Ready[Can accept traffic] --> Draining[May later drain traffic]
    Overloaded[Overload state] --> Traffic[Traffic shaping decisions]

The diagram shows why Atlas exposes multiple routes: a runtime can be alive, unready, or intentionally shedding work in different combinations, and traffic policy should respond to each combination rather than to a single pass or fail result.

Health answers “is the process alive enough to answer basic liveness checks?”

Readiness answers “should this instance currently receive normal traffic?”

Drain or overload state answers “is the instance reducing or refusing certain work classes?”

Operators get into trouble when they collapse these three answers into a single success signal. Atlas exposes separate endpoints because a process can be alive, not yet ready, and already overloaded in meaningfully different combinations.
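As a rough illustration of what gets lost, the sketch below enumerates every combination of the three signals and compares them with a single collapsed flag. The names `SIGNALS`, `collapse`, and `hidden` are illustrative assumptions and are not part of Atlas.

```python
from itertools import product

# Hypothetical signal names, used only for this illustration.
SIGNALS = ("alive", "ready", "overloaded")


def collapse(alive: bool, ready: bool, overloaded: bool) -> bool:
    """The single success flag that the text warns against."""
    return alive and ready and not overloaded


# All eight combinations of the three signals.
combos = list(product([False, True], repeat=len(SIGNALS)))

# Every state the collapsed flag reports as "failing", even though
# these states call for different operator responses.
hidden = [dict(zip(SIGNALS, c)) for c in combos if not collapse(*c)]
```

Under these assumptions the collapsed flag is true for exactly one combination, so the other seven all look identical to anything that consumes only that flag.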

Operational Usage

  • use liveness checks to detect dead processes
  • use readiness checks to gate traffic
  • use overload or drain signals to avoid making a bad situation worse
  • decide traffic routing from readiness and overload, not from liveness alone
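The usage rules above can be sketched as a small decision function. This is a minimal sketch, not Atlas behavior: the `Action` names, the `plan` function, and the order in which the signals are checked are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical operator responses; the names are illustrative only."""
    RESTART = "process looks dead: restart or replace it"
    HOLD = "alive but unready: keep out of normal rotation"
    SHED = "ready but overloaded: do not add load"
    ROUTE = "ready and not overloaded: route normal traffic"


def plan(live_ok: bool, ready_ok: bool, overloaded: bool) -> Action:
    """Pick a response from all three signals instead of liveness alone."""
    if not live_ok:
        # Liveness is only used to detect dead processes.
        return Action.RESTART
    if not ready_ok:
        # Readiness gates traffic, regardless of liveness.
        return Action.HOLD
    if overloaded:
        # Overload or drain state prevents making a bad situation worse.
        return Action.SHED
    return Action.ROUTE
```

Note that a passing liveness check never returns `Action.ROUTE` on its own; both readiness and overload state must agree first.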

Practical Checks

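# health: is the process alive enough to answer basic liveness checks?
curl -s http://127.0.0.1:8080/healthz
# readiness: should this instance currently receive normal traffic?
curl -s http://127.0.0.1:8080/readyz
# overload or drain: is the instance reducing or refusing certain work classes?
curl -s http://127.0.0.1:8080/healthz/overload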
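For scripted checks, the same three requests can be made from Python with the standard library. The sketch below only records the HTTP status code per route and deliberately keeps the answers separate. The paths come from the curl checks above; the helper names `probe` and `probe_all` are hypothetical, and the source does not specify which status codes Atlas returns for each state.

```python
import urllib.error
import urllib.request

# Paths taken from the curl checks; nothing else about the routes is assumed.
PATHS = ("/healthz", "/readyz", "/healthz/overload")


def probe(base_url: str, path: str, timeout: float = 2.0):
    """Return the HTTP status code for one route, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code worth recording.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None


def probe_all(base_url: str, fetch=probe) -> dict:
    """Probe every route separately so the answers are never merged."""
    return {path: fetch(base_url, path) for path in PATHS}


# Example usage against a local instance (address as in the curl checks):
#   for path, status in probe_all("http://127.0.0.1:8080").items():
#       print(path, status if status is not None else "unreachable")
```

The `fetch` parameter exists so the routing logic can be exercised with a stub instead of a live instance.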

Operator Advice

  • do not route normal traffic based only on liveness
  • treat readiness regression as a first-class operational signal
  • observe overload behavior under stress before calling a deployment “ready for production”
  • do not declare an incident resolved just because /healthz came back
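One way to apply the last two points is to require several consecutive good probe rounds before calling an instance recovered, and to flag the moment readiness drops. The sketch below assumes a hypothetical `Sample` record per probe round and an arbitrary default of three rounds; neither comes from Atlas.

```python
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    """One probe round; field names are hypothetical."""
    live_ok: bool
    ready_ok: bool
    overloaded: bool


def looks_recovered(history: Sequence[Sample], required: int = 3) -> bool:
    """True only if the last `required` rounds were live, ready, and not overloaded."""
    if required <= 0 or len(history) < required:
        return False
    return all(
        s.live_ok and s.ready_ok and not s.overloaded
        for s in history[-required:]
    )


def readiness_regressed(history: Sequence[Sample]) -> bool:
    """True if the newest round is unready right after a ready round."""
    return (
        len(history) >= 2
        and history[-2].ready_ok
        and not history[-1].ready_ok
    )
```

With this shape, a history in which only `live_ok` has returned to true never counts as recovered, which matches the advice about /healthz.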

What a Healthy Probe Story Looks Like

  • liveness stays boring and stable
  • readiness reflects whether the instance should receive normal traffic
  • overload and drain signals help prevent healthy-looking saturation failures

Purpose

This page explains how Atlas separates health, readiness, and overload or drain signals, and points readers to the canonical checked-in workflow or boundary for this topic.

Stability

This page is part of the canonical Atlas docs spine. Keep it aligned with the current repository behavior and adjacent contract pages.