rollout

Introduction

A high-performance, multi-node reinforcement-learning framework for large language models. Written in Rust. Pluggable in Python.

rollout is a Rust-core reinforcement-learning framework for large language models. It supports PPO, GRPO, DPO/IPO/KTO, SFT, and reward-model training across training, batch inference, and online inference modes, with multi-node distribution from day one. AWS and GCP are first-class infra targets; vLLM is the default inference backend; plugins can be authored in Python or Rust.

LayerWhat it owns
AlgorithmsSFT · RM · PPO · GRPO · DPO / IPO / KTO
SubstrateCoordinator + workers + plugin host (PyO3 / sidecar RPC)
Storage / CloudEmbedded · Postgres · S3 · GCS — AWS/GCP behind a layered trait

See Architecture for the full layered breakdown.

Architecture

See the canonical architecture doc: ARCHITECTURE.md.

Substrate (Phase 2)

Phase 2 lights up the local substrate — the layer that lets a worker start, store state, talk to a peer, load a plugin, and shut down cleanly without touching any cloud. The per-crate chapters land in plan 02-07; this page is the section landing.

What ships in Phase 2

  • rollout-proto — owns transport.proto (heartbeat / control / work) and plugin.proto (sidecar). tonic-build runs in this crate's build.rs; every other crate consumes the generated code.
  • rollout-storage — implements Storage + StorageTxn on top of redb. In-process tokio::sync::broadcast per-prefix for watch(). Always-fsync. Default path ./data/rollout.db. postcard value encoding.
  • rollout-cloud-local — implements ObjectStore (content-addressed sharded FS under ./data/object-store/), Queue (RAM hot path + spill to rollout-storage for restart replay), SecretStore (env-var allowlist, read-only), and ComputeHint (Linux full via /proc + nvml-wrapper; macOS minimal stub via sysinfo).
  • rollout-transport — implements gRPC with three logical channels (heartbeat / control / work). HTTP/2 + rustls is the plan-of-record; QUIC via tonic-h3 is feature-flagged EXPERIMENTAL.
  • rollout-plugin-host — implements PluginHost with all three modes wired: Rust cdylib, PyO3 in-process, Python sidecar (gRPC-over-UDS).
  • rollout-coordinator — minimal binary that registers workers, accepts heartbeats, persists worker registry + heartbeat ledger to Storage, and marks workers failed via deadline-based scan.
  • Smoke testmake smoke + scripts/smoke.sh spawn 1 coordinator + 2 workers + 1 cdylib + 1 Python sidecar, kill w1, and assert deadline detection within 2 × heartbeat_interval.

Plan-of-record vs stretch

HTTP/2 over tonic + rustls is the plan-of-record transport. QUIC via tonic-h3 v0.0.x is feature-flagged EXPERIMENTAL (it is currently pre-1.0 and bidi-streaming is not documented as production-ready). The same .proto schema covers both; the swap-to-QUIC is a single-config change in a later phase.

Trait surface

All trait contracts live in rollout-core. The Wave-0 extensions (plan 02-00) align the rollout-core traits with the spec versions:

  • Storage / StorageTxn — see spec 04 §2. Phase 2 simplifies scan to return owned Vec instead of BoxStream (object-safety + async_trait constraint).
  • PluginHost — see spec 03 §4–§5. Phase 2 uses Vec<u8> payloads in call(); richer typed-payload helpers ship in later phases.
  • Worker / Coordinator — see spec 01 §2. Worker::init / ready land in Phase 2; WorkerContext stays a unit struct until Phase 6.
  • ObjectStore / Queue / SecretStore / ComputeHint — see spec 06 §3. Queue::ack / nack, ObjectStore::exists, ComputeHint::preemption_signal, and SecretStore::put ship in rollout-core in Phase 2.
  • EventEmitter — see spec 09 §2. The trait lives in rollout-core (plan 02-00); the StdoutJsonEmitter impl lands in rollout-coordinator (plan 02-06).

Per-crate chapters

Preflight

Before running make smoke, run bash scripts/preflight.sh. The script checks that cargo, make, and python3 ≥ 3.11 are on PATH and notes whether protoc is available (tonic-build bundles a copy when missing).

rollout-proto — wire-format crate

rollout-proto owns the workspace's .proto files and runs tonic-build exactly once. Downstream crates (rollout-transport, rollout-plugin-host, rollout-coordinator) depend on rollout-proto, not on .proto regeneration.

Why a dedicated crate

Per CONTEXT D-PROTO-01, the workspace has a single tonic-build invocation site. Centralising wire-format ownership means:

  • The HTTP/2 ↔ QUIC swap (CONTEXT D-TRANS-01 fallback) shares the same source-of-truth proto schema.
  • The sidecar IPC protocol (UDS) and the network transport carry the same message types — no drift between in-process and out-of-process plugin calls.
  • Build-time codegen happens once per workspace build; downstream crates compile faster.

Wire-format ownership

FilePackageOwns
proto/transport.protorollout.transport.v1Worker ↔ Coordinator transport
proto/plugin.protorollout.plugin.v1Host ↔ sidecar Plugin protocol

Both are committed to the repo. The build script (build.rs) compiles them via tonic-prost-build on every workspace build; rerun-if-changed keeps incremental builds fast.

Three transport channels

transport.proto defines the three logical channels from spec 05 §3:

ChannelRPC kindCadencePurpose
Heartbeatunary Beatevery 500 msWorker liveness; carries due_at for deadline-based health
Controlserver-stream Subscribeas-neededCoordinator pushes drain / snapshot / cancel
Workbidi StreamburstyPull/submit work items (full semantics land in Phase 6)

All three multiplex over one connection — HTTP/2 today, QUIC behind a feature flag in a later phase.

Plugin sidecar protocol

plugin.proto defines the five sidecar RPCs from spec 03 §3.3 + §4:

  • Init(InitRequest) -> InitResponse — one-time setup per worker.
  • Preflight(PreflightRequest) -> PreflightResponse — optional, after Init.
  • Call(CallRequest) -> CallResponse — generic typed entry point; payload is opaque bytes (JSON or postcard, plugin-defined).
  • Reload(ReloadRequest) -> ReloadResponse — hot reload signal; the sidecar may refuse.
  • Shutdown(ShutdownRequest) -> ShutdownResponse — clean exit with grace period.

The Python sidecar host implementation lands in rollout-plugin-host (plan 02-05).

Python stubs

The in-tree Python sample sidecar (python/examples/sample_sidecar/) uses stdlib length-prefixed JSON framing over UDS — per AGENTS.md §7, every plugin must be testable locally without pip install grpcio or other external services. See 02-RESEARCH.md §"Python sidecar IPC: avoid pip" for the rationale.

If you are authoring a Python plugin that does want real gRPC, opt in via make protos (which shells to cargo xtask gen-protos). Output goes to python/rollout/_proto/. The xtask gracefully degrades when grpcio-tools is missing — it prints a helpful message and exits 0; CI does not require it.

pip install grpcio-tools     # opt-in
make protos                  # writes python/rollout/_proto/*_pb2.py

Versioning

Both protos use a v1 package suffix (rollout.transport.v1, rollout.plugin.v1). Bumping to v2 is a future spec edit; the v1 modules will continue to compile alongside any future versions so in-flight Phase 2 deployments do not break.

Cross-references

Storage

rollout-storage is the Phase-2 embedded Storage + StorageTxn impl, backed by redb 2.5. It is the default backend in v1 per CONTEXT D-STO-01. The Postgres backend lives in Phase 4 (TRAIN-04).

Backend choice

Spec 04 §3.1 prefers redb: pure-Rust, single-file MVCC, copy-on-write, no compaction stalls. The async Storage trait wraps the sync redb API via tokio::task::spawn_blocking.

Table layout

Each StorageKey.namespace maps to its own redb TableDefinition<&[u8], &[u8]>. The Phase-2 namespaces are:

TablePurpose
runsPer-run metadata
workersWorker registry (coordinator-owned)
heartbeatsHeartbeat ledger; coordinator scans this on due_at
queueGeneric queue spill
pluginsPlugin manifest cache
cloudlocal_queuerollout-cloud-local queue restart-replay mirror

Adding a new namespace = adding a const TableDefinition in embedded::tables.rs and a match arm in table_for. No migration; redb opens new tables lazily on first write.

Key encoding

Namespace selects the table, so the redb key bytes only encode (run_id, path). The encoding is a single postcard::to_allocvec(&(&run_id, &path)) so decoding is unambiguous regardless of any 0x00 bytes inside Option<RunId>. See crates/rollout-storage/src/encoding.rs.

Value encoding

postcard per D-STO-04 — compact, schemafull, deterministic, serde-native. The Storage trait is bytes-only (get_bytes / put_bytes / cas_bytes); downstream crates layer postcard over the trait via free helpers when typed payloads are needed.

Durability

Always-fsync per D-STO-03. Each WriteTransaction calls set_durability(Durability::Immediate) before staging writes; redb's commit() blocks until fsync completes. Default DB path is ./data/rollout.db, overridable via [storage.embedded] path = ... in the run config.

watch() semantics

Storage::watch(prefix) -> tokio::sync::broadcast::Receiver<StorageEvent> returns an in-process subscriber to commits whose key extends prefix. The WatchRouter lives inside EmbeddedStorage; per-prefix broadcast::Senders are created on first subscribe.

The critical invariant (RESEARCH §"Pattern 2"): redb has no post-commit hook. The EmbeddedTxn buffers StorageEvents inside the in-flight txn; commit() only fans them out to the router after WriteTransaction::commit() returns Ok. On abort() (explicit or via drop), the buffered events are discarded — subscribers never observe rolled-back writes.

begin()  → buffer = []
put_bytes(K, V) → buffer.push(Put{K})
commit() → spawn_blocking(commit) → on Ok: for evt in buffer { watch.publish(evt) }
abort()  → drop(txn); buffer is dropped

Phase-2 scan semantics

Storage::scan_bytes(KeyRange { prefix, limit }) returns an owned Vec<(StorageKey, Vec<u8>)> rather than the BoxStream shown in spec 04 §2. The async-trait + object-safety constraint on stable Rust forbids stream-returning methods on dyn Storage; the simplification is documented in spec 04 §1a. Phase-2 callers (coordinator deadline scan, cloud-local restart replay) work with small per-namespace working sets where the owned-vec cost is negligible. A future StorageStream newtype can lift this restriction.

The scan iterates the namespace table and decodes each key on the way out; namespace already partitions the data, so tables stay small. limit short-circuits the iteration.

Postgres path constraint (ASCII-printable)

The Postgres backend stores StorageKey.path as TEXT[]. Path components must be ASCII-printable (bytes 0x200x7E); non-printable / NUL / high-bit bytes cannot round-trip through TEXT[] and would silently diverge from redb's byte-lex prefix scan (PITFALLS.md §17). Every Postgres CRUD and scan entry-point calls StorageKey::validate_for_postgres() and rejects offending keys with Fatal(ConfigInvalid) before any SQL runs. For binary identifiers in path components, use hex::encode at the StorageKey construction site. A 256-case proptest (tests/postgres_scan_bytes_parity.rs) witnesses byte-parity between redb and Postgres over the printable-ASCII range.

Cross-process watch

NOT supported. The broadcast channel lives in process memory; another EmbeddedStorage instance pointing at the same file will not receive events. Cross-process watch arrives with the Postgres backend in Phase 4 (LISTEN/NOTIFY).

Crash safety

The on-disk format is crash-safe at the redb commit boundary: a successful commit() implies fsync; anything before that is invisible on reopen. The active test in tests/crash_safety.rs exercises the abort-style variant (drop txn without commit, reopen, assert no keys visible). A "true SIGKILL between put and commit" variant is #[ignore]d for now — it needs a helper-binary harness with raw signals; Phase-6 DIST-03 (restart-from-storage) will land that.

Tests

FileScope
tests/crud.rsput/get/delete/scan/get_many/ping
tests/tables.rsper-namespace isolation; six-namespace reopen
tests/txn.rscommit/abort/cas (insert-only / CAS / delete-if-equal)
tests/watch.rspublish-after-commit; abort suppresses; multi-subscriber; prefix isolation
tests/crash_safety.rsdrop-without-commit reopen; SIGKILL variant #[ignore]d

Cloud-local substrate

rollout-cloud-local ships the Phase-2 Layer-1 implementations so the rest of the stack has a real ObjectStore / Queue / SecretStore / ComputeHint to target with zero cloud creds. Per CONTEXT D-LOCAL-01..05.

What ships

CapabilityTypeImplementation
Blob storageFsObjectStoreContent-addressed two-level sharded FS under ./data/object-store/; sibling <hex>.meta.json per blob.
Work queueInMemQueuetokio::sync::Mutex<VecDeque<_>> hot path + spill to rollout-storage (cloudlocal_queue namespace).
SecretsEnvSecretStoreRead-only env-var allowlist (ROLLOUT_SECRET_<NAME>); put() returns Fatal(ConfigInvalid) by design.
Compute hinthints::*Linux full (/proc/cpuinfo + /proc/meminfo, optional NVML feature); macOS minimal (sysinfo cpu+memory).

What's deferred

  • BlockStore — D-LOCAL-05: declared in rollout-core, not implemented here; opt-in for clouds that need it.
  • Sandboxing beyond network allowlist — Phase 7, when the tool harness lands and untrusted-code isolation matters (cgroups + seccomp + FD limits + fs write restrictions).
  • Real cloud backends (rollout-cloud-aws, rollout-cloud-gcp) — Phase 5.

FsObjectStore layout (D-LOCAL-01)

./data/object-store/
├── ab/                       <- hex[0..2]
│   └── cd/                   <- hex[2..4]
│       ├── abcd…fullhash     <- blob
│       └── abcd…fullhash.meta.json
└── ...

Writes are tmp-then-rename for atomicity. Idempotent for repeated puts of the same bytes (same ContentId, no double-write). get_bytes on a missing id returns Fatal(Internal("object not found: …")) — choosing Fatal because a missing content hash signals an upstream contract violation, not a transient I/O fault.

Queue restart semantics (D-LOCAL-02)

Every enqueue writes through a Storage transaction under cloudlocal_queue/<ulid> BEFORE the item is pushed onto the in-memory deque. ack deletes the storage entry; nack re-pushes the item to the front of the deque without touching storage so the next restart still replays it.

On InMemQueue::open(storage), the queue scans cloudlocal_queue/*, decodes each QueueItemId(Ulid) from the path segment, sorts by ULID (which is k-sortable so this recovers enqueue order), and rebuilds the deque. This honors the spirit of DIST-03 (restart replay) for the local backend; the full DIST-01..05 fault-tolerance work lands in Phase 6.

Secret allowlist (D-LOCAL-03)

EnvSecretStore::new(allowlist) accepts a list of secret names (without the ROLLOUT_SECRET_ prefix). At get(name) time:

ConditionResult
name not in allowlistErr(Fatal(ConfigInvalid("not in allowlist")))
name allowed, ROLLOUT_SECRET_<name> setOk(value)
name allowed, env var unsetErr(Recoverable(Transient, RetryHint::Never))
put(name, value) — ALWAYSErr(Fatal(ConfigInvalid("read-only")))

The "allowed but unset" case is recoverable rather than fatal because the operator can provision the variable without changing config; subsequent calls will succeed.

GPU inventory (D-LOCAL-04)

GPU enumeration is opt-in behind a nvml Cargo feature. When the feature is off (default), LinuxComputeHint::inventory().gpus is always empty. When the feature is on but libnvidia-ml.so is missing or NVML init fails, the inventory still returns an empty gpus vector — never errors. This keeps local dev machines and CI runners (no GPU) functional without conditional code on the caller.

macOS skips GPU inventory entirely.

Tests

FileCoverage
tests/object_store.rs (6)Round-trip, sharded layout, meta sidecar, exists, idempotency, fatal-on-missing
tests/secrets.rs (4)Allowlist read, outside-allowlist fatal, unset transient, put fatal
tests/queue_replay.rs (5)FIFO ULID order, nack-to-front, restart replay, ack removes, nack keeps
tests/hints_macos.rs (2, cfg=macos)CPU + memory present, preemption signal None
tests/hints_linux.rs (3, cfg=linux)/proc/cpuinfo parse, /proc/meminfo parse, no GPUs without nvml feature

Linux-only tests are #[cfg(target_os = "linux")] so workspace cargo test --tests stays green on every host. The NVML integration test is additionally #[ignore]d behind --ignored since it requires a live libnvml.

rollout-transport — gRPC plane with mTLS

rollout-transport is the worker ↔ coordinator gRPC plane for Phase 2. It ships HTTP/2 + rustls + mTLS as the plan-of-record. QUIC support is experimental and lives behind a Cargo feature flag.

Plan-of-record: HTTP/2 + rustls

Phase 2 research (.planning/phases/02-local-substrate/02-RESEARCH.md, "Transport stack") found tonic-h3 v0.0.5 (latest at planning time, 2025-11-01) was explicitly labelled experimental, with bidirectional streaming not documented as supported. The upstream hyperium/tonic#339 gRPC-over-HTTP/3 issue remains open. Shipping a Work channel (bidi-streaming) on top of tonic-h3 would risk runtime hangs under load.

Therefore plan 02-04 ships HTTP/2 tonic 0.14 + rustls 0.23 + rcgen 0.13. The same .proto schema is forward-compatible with QUIC; the swap will become a single Cargo-feature flip in a later phase once tonic-h3 (or its successor) ships documented bidi support.

Three logical channels

All three channels are multiplexed over one HTTP/2 connection per (coordinator, worker) pair, per spec 05-distribution.md §3.

ChannelRPC kindPurpose
Heartbeatunary, frequentWorker's "I'm alive" ping; carries due_at
Controlserver-streamingCoordinator pushes drain / snapshot / cancel
WorkbidirectionalPhase-2 stub; Phase 6 wires pull/submit

The Work channel ships as a wired-but-stub WorkServiceImpl that echoes a heartbeat marker back on every received frame. Real pull/submit semantics arrive with DIST-01..02 in Phase 6.

mTLS auto-bootstrap (D-TRANS-02)

On first run the transport invokes tls::ensure_dev_ca(tls_dir). This:

  1. Generates a self-signed CA via rcgen 0.13.
  2. Writes ca.pem (public certificate) and ca.key.pem (private key).
  3. Sets ca.key.pem permissions to 0o600 on Unix.

Subsequent calls are idempotent — the existing files are read back.

tls::issue_server_cert(ca_cert, ca_key, dns_names) issues server certs (EKU ServerAuth). tls::issue_client_cert(ca_cert, ca_key, names) issues client certs (EKU ClientAuth). Both are signed by the dev CA.

Filenames live under ./data/tls/ (gitignored). No manual openssl steps are required; rcgen produces pure-Rust output with no openssl-sys dependency (banned by deny.toml).

Deadline-based health (D-TIME-01..02)

Spec 05-distribution.md §6 mandates deadline-based health, not polling. Workers publish due_at = now + heartbeat_interval × 2 on every Beat; coordinators scan worker state and mark failure only when:

elapsed_past_due > clock_skew_budget  AND  elapsed_past_due > coordinator_failure_timeout

Helpers health::next_due_at and health::is_failed encode the formulas above. They take SystemTime directly so callers can swap a test clock trivially.

Default constants (D-TIME-01):

ConstantDefault
heartbeat_interval500 ms
worker_self_fence_timeout4 s
coordinator_failure_timeout5 s
clock_skew_budget250 ms

Config invariants enforced at plan time (D-TIME-02)

TransportConfig::validate_cross_fields enforces two split-brain prevention rules at plan time, never at runtime:

  1. worker_self_fence_timeout < coordinator_failure_timeout — a worker that fails to self-fence before the coordinator times it out causes classic split-brain (two writers think they own the same lease).
  2. clock_skew_budget < heartbeat_interval × 2 — a clock-skew budget greater than two periods would make deadline-based failure detection meaningless.

rollout plan calls validate_cross_fields before any worker starts; violations return Fatal(ConfigInvalid) with a human-readable message identifying the offending field.

QUIC feature flag (EXPERIMENTAL)

The quic Cargo feature pulls in quinn, tonic-h3, h3, and h3-quinn:

[features]
default = ["h2"]
h2 = []
quic = ["dep:quinn", "dep:tonic-h3", "dep:h3", "dep:h3-quinn"]

At plan-02-04 execution time (2026-05-20), cargo build -p rollout-transport --features quic fails to compile because h3-quinn 0.0.7 references quinn::StreamId internals that became private in quinn 0.11.x. This confirms the RESEARCH §"Pitfall 2" assessment: the QUIC stack is not production-ready.

When tonic-h3 (or a successor) ships documented bidi-streaming and publishes a quinn 0.11-compatible release, the swap is:

  1. Verify cargo build --features quic succeeds.
  2. Implement server::serve_quic (currently returns Fatal(Internal) with an EXPERIMENTAL message).
  3. Add a matching client::build_quic_channel.
  4. Flip the default feature, or expose a CLI flag.

The proto schema does not change.

Channel: Work (Phase-2 stub)

work::WorkServiceImpl::stream accepts a bidi stream and echoes a "heartbeat" marker per inbound frame. This is enough for the Phase-2 smoke test in plan 02-07 to verify that the bidi pipe is wired end-to-end. Real pull/submit semantics — assignment, ack, work-stealing, restart-from-storage — arrive with DIST-01..02 in Phase 6.

Observability (principle #10)

Every server handler is wrapped in a tracing::instrument span with a channel field. Emitted events include:

  • transport_server_starting (with %addr)
  • mtls_handshake_bootstrap (when CA is generated)
  • heartbeat_received (worker_id, run_id, state)
  • control_subscribed (worker_id, run_id)

Binary crates configure the subscriber; library crates emit only — per D-OBSERVE-01.

Cross-references

Plugin host

rollout-plugin-host is the Phase-2 implementation of rollout_core::PluginHost. It wires three loading modes against the trait surface delivered in Wave-0 and persists manifests through rollout-storage when constructed via PluginHostImpl::with_storage.

Three modes

ModeCrate featureWhere it runsHot reload
Rust cdylib (PluginMode::RustCdylib)cdylib (default)In-process, native codeUnsupported. Returns Fatal(PluginContract) per spec 03 §7.
PyO3 in-process (PluginMode::Pyo3)pyo3 (default)Dedicated Python OS threadimportlib.reload (requires dev-hot-reload).
Python sidecar (PluginMode::Sidecar)sidecar (default)Subprocess over Unix Domain SocketSIGTERM + respawn (requires dev-hot-reload).

The host exposes one type — PluginHostImpl — that dispatches on manifest.mode at load() time. Each mode owns a small state struct kept in a parallel HashMap<PluginId, HandleState> so the public PluginHandle remains Clone + Send + Sync POD.

Manifest

Plugins ship a rollout-plugin.toml parsed by parse_manifest. The schema matches rollout_core::PluginManifest:

name             = "sample-inproc"
version          = "0.1.0"
kind             = "env-harness"           # PluginKind variant (kebab-case)
trait_id         = "rollout_core::Plugin"
mode             = "pyo3"                  # pyo3 | sidecar | rust-cdylib
network_allowlist = []

[runtime]
python_min = "3.11"      # required for pyo3
gpu        = false
memory_mib = 64

[entry.pyo3]
module  = "sample_inproc.plugin"
factory = "create_plugin"

validate_manifest runs cheap plan-time checks: pyo3 plugins must declare python_min >= 3.11 (stdlib tomllib + PyO3 abi3-py311), names/versions must be non-empty, and the manifest must parse against serde(rename_all = "kebab-case").

C-ABI shim (cdylib mode)

Phase 2 keeps the shim as an internal module (src/modes/abi.rs) per RESEARCH Open Question 3 — it graduates to a standalone rollout-plugin-abi crate when an external plugin ecosystem emerges (likely Phase 7+).

The contract:

#![allow(unused)]
fn main() {
#[repr(C)]
pub struct Buf { pub ptr: *mut u8, pub len: usize, pub cap: usize }

#[repr(C)]
pub struct RolloutPluginVtable {
    pub abi_version: u32,                  // must equal ABI_VERSION (1)
    pub name: *const c_char,
    pub call: extern "C" fn(
        method: *const u8, method_len: usize,
        payload: *const u8, payload_len: usize,
        out: *mut Buf
    ) -> i32,
    pub free_buf: extern "C" fn(buf: Buf),
}

#[no_mangle]
pub extern "C" fn rollout_plugin_factory() -> *mut RolloutPluginVtable;
}

The host copies the returned Buf before calling free_buf so allocator mismatches across the cdylib boundary cannot corrupt anything. ABI mismatches surface as Fatal(PluginContract { msg: "cdylib ABI mismatch: got N, expected 1" }).

The in-tree example lives at tests/smoke/plugins/rust_cdylib_sample/. It's an out-of-workspace cdylib crate (its own [workspace]) so cargo build --workspace doesn't churn on it; the smoke driver (plan 02-07) builds it with cargo build --manifest-path … --release.

Sidecar IPC

The in-tree sample uses stdlib-only length-prefixed JSON over AF_UNIX per RESEARCH Pitfall 9 + AGENTS.md §7 ("every plugin testable locally — without cloud creds / GPUs / external services"). The wire format is intentionally tiny:

request:  [u32 BE length][utf-8 JSON {"method": str, "payload": str}]
response: [u32 BE length][utf-8 JSON {...}]

This avoids forcing pip install grpcio into the cargo test path. Real production sidecars that need gRPC can ingest the same rollout-proto::plugin::v1 service stubs (make protos regenerates the Python side); the host supports both via the SidecarProtocol enum (FramedJsonUds and GrpcUds). Phase 2 only ships the FramedJsonUds code path; GrpcUds lands when a sidecar consumer needs it.

UDS paths default to ./data/sidecars/<plugin>-<pid>.sock and are chmod 600 on the Python side. The host removes the socket on unload + on respawn.

Hot reload

Behind the dev-hot-reload Cargo feature only:

  • PyO3. The dedicated Python thread runs importlib.reload(module) and re-invokes factory() on the same channel. In-flight calls running on the old object complete naturally; subsequent calls hit the new object.
  • Sidecar. SIGTERM the child (via nix::sys::signal::kill), wait 2 s, SIGKILL on holdout, then respawn with the same argv on a fresh UDS path.
  • cdylib. Returns Fatal(PluginContract { msg: "cdylib reload unsupported per spec 03 §7" }) — Rust has no stable ABI; dlclose while another task holds a Box<dyn Plugin> is UB. Production should not reload cdylibs.

The feature is off by default. Per spec 03 §7, production deployments ignore reload flags entirely (with a warning) — that surfacing lands with the coordinator in plan 02-06.

Sandboxing (Phase 2 scope)

Network allowlist only (D-SANDBOX-01). cgroups + seccomp + FD limits + fs write restrictions are tracked as TODOs referencing Phase 7. The allowlist surfaces as PluginManifest::network_allowlist; the egress proxy lives in the worker / cloud-local layer and lands when the tool harness needs adversarial isolation (Phase 7).

Dependency direction

rollout-plugin-host does not depend on rollout-transport. The Wave-0 dep-direction lint enforces this — sidecar IPC uses UDS framing via rollout-proto, not the QUIC/HTTP/2 transport from rollout-transport. Forgetting this rule means the layered architecture (AGENTS.md §9) drifts; the integration test in crates/rollout-core/tests/dependency_direction.rs catches drift on every PR.

Observability

Every public op emits a structured event with target = "plugin_host" and a plugin_id field:

EventFieldsWhen
plugin_loadedplugin_id, mode, path/module/socketload() succeeds
plugin_reloadedplugin_id, reasonreload() succeeds (or is rejected)
plugin_call (span)plugin_id, methodwraps every call()
plugin_call_errorplugin_id, errorcall returns Err

Subscribers configure formatting via tracing-subscriber (e.g., the stdout-JSON sink that the coordinator binary ships in plan 02-06).

Python bridge (PyO3)

Phase 2 pins pyo3 = 0.28 + pyo3-async-runtimes = 0.28 in lockstep. Both crates moved under the PyO3 organisation in 2025 (the predecessor pyo3-asyncio is archived).

Version pin rationale

CratePinned versionWhy
pyo30.28First release with the new Python::attach / Python::try_attach API (replaces Python::with_gil); stable abi3 support; MSRV 1.83 satisfies our 1.88 toolchain.
pyo3-async-runtimes0.28Mandatory lockstep with pyo3 — the FFI surface is private and tied to the patch version. Successor to pyo3-asyncio (archived).

Both versions are declared in [workspace.dependencies] so every consumer (currently just rollout-plugin-host) inherits the same patch range.

abi3-py311 strategy

We enable pyo3/abi3-py311. This produces a single Python extension binary that runs on any CPython 3.11+ instead of building one wheel per minor (3.11 → 3.12 → 3.13). The trade-off:

  • ✅ Smaller CI matrix; one wheel ships everywhere.
  • ✅ Stdlib tomllib is available (used by sidecar samples to parse pyproject.toml if needed).
  • ✅ pyo3 abi3 was stabilised circa 0.20 and is well-exercised.
  • ❌ 3.10 builds are rejected at link time with cannot set a minimum Python version 3.11 higher than the interpreter version 3.10. Local dev machines that default to python3.10 must export PYENV_VERSION=3.11.x (or similar) before cargo build.
  • ❌ Some C-extension features (e.g. private CPython internals) aren't accessible through abi3.

scripts/preflight.sh (added in Wave-0 plan 02-00) verifies python3 >= 3.11 before make smoke does anything destructive.

Dedicated Python OS thread

Per RESEARCH Pitfall 3, mixing PyO3 calls onto random Tokio worker threads deadlocks under contention because the GIL ends up held across .await points. The host avoids this by owning one OS thread per PluginHostImpl:

                 ┌────────────────────┐
   Tokio task ───► mpsc::Sender ─────►│  rollout-py-* OS thread
                 │ (PyTask enum)      │  ───────────────────────
                 │                    │  Python::attach(...) {
                 │                    │      plugin.call(...)
                 │ ◄─── oneshot ──────┤  }
                 └────────────────────┘

The worker thread is created with std::thread::Builder::new().name("rollout-py-<plugin>") so debuggers and tracing spans can distinguish per-plugin Python contexts. With pyo3/auto-initialize the interpreter spins up on first Python::attach; we do not call the (removed-in-0.28) prepare_freethreaded_python.

Heavy-CPU Python code should release the GIL via Py::allow_threads per spec 03 §3.2 — Phase 2 in-tree samples are tiny enough not to need this.

In-tree samples and the no-pip rule

Per AGENTS.md §7, every in-tree sample must run without pip install. Phase 2 ships three:

User plugins are free to bring their own virtualenv with grpcio, numpy, etc. — the no-pip rule applies only to the in-tree samples that gate cargo test and make smoke.

Coordinator

rollout-coordinator is the Phase-2 minimal control plane. It registers workers, accepts heartbeats, persists the worker registry and heartbeat ledger to local EmbeddedStorage, and surfaces deadline-detected failures.

Scope

Phase 2 explicitly ships only the heartbeat-receiver slice:

In scope (Phase 2)Out of scope (Phase 6 DIST-01..05)
Register worker, accept heartbeatWork distribution / pull / submit
Persist workers/* + heartbeats/* to StorageCoordinator lease / CAS / HA
Deadline-based failure scan + tracing eventsMulti-coordinator handoff
Mount the three rollout-transport servicesRestart-from-storage 4-node test

The same binary scales to the full distribution story later — Phase 6 adds work-stealing on top of the existing transport + storage wiring.

Storage layout

Two redb tables (rollout-storage provides them; this crate only writes):

  • Namespace workers, path [<worker_id>] → postcard-encoded WorkerRegistryEntry
  • Namespace heartbeats, path [<worker_id>] → postcard-encoded HeartbeatRecord

Heartbeats are overwrite-on-write in Phase 2 — only the latest beat is kept. Phase 6 may bolt on a ledger if work-stealing needs history.

Failure-detection formula

A worker is marked failed iff

elapsed_past_due = now - due_at
elapsed_past_due > clock_skew_budget
elapsed_past_due > coordinator_failure_timeout

Both thresholds must trip (spec 05 §6). The defaults (CONTEXT D-TIME-01):

ConstantValue
heartbeat_interval500 ms
worker_self_fence_timeout4 s
coordinator_failure_timeout5 s
clock_skew_budget250 ms

Plan-time invariants enforced by TransportConfig::validate_cross_fields:

  1. worker_self_fence_timeout < coordinator_failure_timeout (split-brain prevention).
  2. clock_skew_budget < 2 × heartbeat_interval.

Failure-scan loop

The scan ticks every heartbeat_interval / 2 (250 ms by default) so a single missed beat is detected within 2 × heartbeat_interval — the SUBSTR-02 acceptance criterion #3. Each tick scans the heartbeats/* namespace, decodes via postcard, and emits two outputs per overdue worker:

  • A tracing::warn! line with target = "coordinator" and worker_id = <id> + due_at_ms = <ms> fields.
  • An Event { kind: EventKind::Domain { topic: "worker_failed" }, level: Warn, … } via the injected EventEmitter (see D-OBSERVE-01 below).

Already-failed workers are tracked in an in-memory HashSet so the loop emits exactly one failure event per worker per coordinator lifetime.

Observability (D-OBSERVE-01)

StdoutJsonEmitter is the Phase-2 sink for rollout_core::EventEmitter:

  • Holds a Mutex<tokio::io::Stdout> so concurrent emits don't interleave.
  • Writes one NDJSON line per event using serde_json.
  • Flushes after each event.

CoordinatorImpl::new(storage, run_id, emitter) takes the emitter as an Arc<dyn EventEmitter> so non-stdout backends drop in without code change in later phases. The coordinator emits:

  • worker_registered — on the first register() (or first heartbeat from an unknown worker; see CLI section).
  • worker_heartbeat — on every accepted beat.
  • worker_deregistered — on graceful drain.
  • worker_failed — from the failure-scan loop.

Tests use NoopEmitter (also in emitter.rs) which discards every event.

CLI

rollout coordinator run

rollout coordinator run --config <path>

Boots the coordinator from a TOML file:

run_id = "01JZ..."        # ULID
[storage]
path = "./data/rollout.db"
[transport]
listen_addr = "127.0.0.1:50051"
tls_dir = "./data/tls"

The same logic ships as a standalone binary rollout-coordinator (Cargo [[bin]]) so the smoke-test wrapper can invoke it directly without rollout-cli in the path.

rollout worker run

rollout worker run --config <path> [--worker-id <ulid>] [--plugin <manifest.toml> ...] [--hot-reload]

The Phase-2 worker:

  1. Opens its own EmbeddedStorage.
  2. Builds a PluginHostImpl::with_storage(...) and load()s each --plugin manifest.
  3. Dials the coordinator over mTLS using an ephemeral client cert issued from the dev CA at ./data/tls/.
  4. Sends Beat(state=Init) immediately; the coordinator auto-registers the worker on first heartbeat (the proto has no separate register RPC).
  5. Beats every heartbeat_interval after that, advancing state -> Ready.
  6. On SIGTERM: sends Beat(state=Draining) and exits 0.

First-run UX

On first boot, the coordinator generates data/tls/ca.pem + data/tls/ca.key.pem (chmod 600 on the key) and prints:

Generated dev CA at ./data/tls/ca.pem

Subsequent runs are idempotent (read-through). The CA + per-host certs follow rcgen 0.13 defaults — adequate for dev; production should swap in a real CA in a later phase.

Smoke test

make smoke is the SUBSTR-02 / SUBSTR-03 / SUBSTR-04 acceptance gate. It runs the live substrate end-to-end on a clean checkout with no cloud credentials and proves that deadline-based failure detection works on the wire.

What it proves

The smoke test boots the full Phase-2 substrate and exercises every concern in one run:

  • rollout-storage opens three independent redb files (one per coordinator / w1 / w2) with always-fsync durability.
  • rollout-transport brings up the HTTP/2 listener with an auto-generated dev CA + per-host mTLS certificates.
  • rollout-plugin-host loads two plugins per worker — a Rust cdylib and a Python sidecar (stdlib-only framed-JSON over UDS).
  • rollout-coordinator persists the worker registry + heartbeat ledger and runs the deadline-based failure scan that emits worker_failed for any worker overdue by more than coordinator_failure_timeout + clock_skew_budget.

When the script kills w1 with SIGKILL, the coordinator must observe and emit the failure event for w1's ULID inside the deadline budget — that is the contract the substrate ships.

Topology

                        coord.log (NDJSON events)
                          ▲
                          │
    ┌─────────────────────┴─────────────────────┐
    │            rollout-coordinator            │
    │  ./data/smoke/coord.db   tls/ca.pem       │
    │  listen 127.0.0.1:50051 (mTLS, H/2)       │
    └───────────────┬──────────────┬────────────┘
                    │              │
              heartbeat 500ms  heartbeat 500ms
                    │              │
    ┌───────────────▼──┐     ┌────▼─────────────┐
    │  rollout worker w1│     │ rollout worker w2 │
    │  w1.db            │     │ w2.db             │
    │  plugins:         │     │ plugins:          │
    │    cdylib sample  │     │   cdylib sample   │
    │    sidecar sample │     │   sidecar sample  │
    └───────────────────┘     └───────────────────┘

Both workers load both plugin samples; the cdylib path is built on demand by the script (cargo build --manifest-path .../rust_cdylib_sample/Cargo.toml) and the Python sidecar resolves via PYTHONPATH=python/examples.

Timeline

StepWall-clock budgetAsserted invariant
Build binaries + cdylib~30 s (cold) / instant (cached)exit 0
Spawn coordinator< 1 sport 50051 up
Spawn w1 + w2< 1 sboth PIDs alive
Wait heartbeat-stable< 5 sworker_heartbeat for both ULIDs in coord.log
kill -KILL w1instant
Detect worker_failed(w1)< 8 s (D-TIME-01 floor: 2 × heartbeat_interval + skew)worker_failed topic + w1 ULID in coord.log

The 8-second detection deadline is well inside the spec contract: at the locked Phase-2 defaults (heartbeat_interval = 500 ms, coordinator_failure_timeout = 5 s, clock_skew_budget = 250 ms), the failure-scan loop ticks at 250 ms and fires the event within coordinator_failure_timeout + clock_skew_budget ≈ 5.25 s of the missed beat — observed locally at ~5.2 s.

Where logs go

All smoke artifacts land under ./data/smoke/ (gitignored):

PathPurpose
data/smoke/coord.dbCoordinator embedded storage
data/smoke/w1.db, w2.dbPer-worker embedded storage
data/smoke/tls/Auto-generated dev CA + per-host certs
data/smoke/sidecars/Per-plugin UDS sockets
data/smoke/logs/coord.logCoordinator stdout (NDJSON spec-09 events + tracing)
data/smoke/logs/w1.log, w2.logWorker stdout (tracing)
data/smoke/logs/w1.toml, w2.tomlPer-worker TOML (sed-rewritten from the shared fixture so each worker has its own db)
data/smoke/logs/w1.id, w2.idPre-allocated worker ULIDs (smoke driver writes these so the grep step is deterministic)

Searching coord.log for worker_failed.*$W1_ULID is the assertion the script makes.

Configuration fixtures

The committed fixtures at tests/smoke/coordinator.toml and tests/smoke/worker.toml use the D-TIME-01 timing defaults verbatim. The worker TOML carries a single storage path; the smoke driver derives data/smoke/w1.db and data/smoke/w2.db at runtime with sed because redb takes an exclusive file lock per Database::create and two worker processes sharing the same file would conflict.

CI integration

.github/workflows/ci.yml includes a smoke job on ubuntu-latest that runs after the standard test job. The job installs netcat-openbsd for the port-poll, runs bash scripts/preflight.sh, then make smoke. On failure, all data/smoke/logs/ contents are uploaded as a smoke-logs artifact for post-mortem.

The pre-existing 11 CI jobs (lint, test, deny, commitlint, schema-drift, architecture-lint, unused-deps, rustdoc-check, docs-build, docs-deploy, docs-test-policy) are untouched — no continue-on-error, no skipped steps. The architecture-lint job automatically picks up the 4th invariant added in dependency_direction.rs (rollout-coordinator must not depend on rollout-plugin-host or any cloud-layer crate) without any YAML changes.

First-run UX

On first invocation the coordinator emits

Generated dev CA at ./data/smoke/tls/ca.pem

to stderr and proceeds. No manual openssl steps are required; rcgen mints the CA + per-host certs in-process. The tls/ directory is gitignored along with the rest of data/.

How to extend

  • Add a plugin: drop a manifest TOML under tests/smoke/plugins/, then add a --plugin <path> flag to the worker spawn lines in scripts/smoke.sh.
  • Change timings: edit tests/smoke/coordinator.toml and tests/smoke/worker.toml; the cross-field invariants (worker_self_fence_timeout < coordinator_failure_timeout, clock_skew_budget < heartbeat_interval × 2) are enforced at validate_cross_fields time and failing configs early-exit.
  • Add a worker: copy the spawn_worker block; allocate a new ULID; add the per-worker sed-rewrite line for the storage path.

Reproducing locally

bash scripts/preflight.sh   # verifies cargo + make + python3 >= 3.11
make smoke

Total wall-clock is dominated by the cold cargo build (~30 s on a warm laptop); the actual integration test runs in ~7 s once binaries exist.

If python3 --version reports < 3.11, install a newer interpreter (brew install python@3.13, pyenv install 3.13.3, or your distro's python3.11 package) and re-run; the preflight check guards against the runtime-incompatible interpreter.

Inference

Phase 3 delivers rollout's first end-user-visible building block: a batch-inference pipeline that loads a model into vLLM via PyO3, pulls content-addressed prompts off the Phase-2 in-memory queue, and writes JSONL completions to disk. The pipeline is resumablerollout infer batch --resume <run_id> scans the per-sample state in rollout-storage and only enqueues outstanding work, producing zero duplicates on restart.

Phase 3 ships two new crates plus a CLI subcommand:

CrateLayerPhase-3 responsibility
rollout-backend-vllm2InferenceBackend impl over vLLM's AsyncLLMEngine via PyO3 (in-process).
rollout-runtime-batch3CAS sample-state machine, JSONL I/O, plan-time validation, mock backend.
rollout-cli (extended)4rollout infer batch --config <toml> [--resume <run_id>] subcommand.

Per-component chapters land in plan 03-05 (smoke + docs + bench). For now, see the Wave-0 trait extension in spec 02 §2a and the WorkerRole addition in spec 01 §3a.

TODO (plan 03-05): vLLM backend chapter · batch-runtime chapter · CPU-mode chapter · macOS Docker dev workflow · resume semantics walkthrough.

vLLM backend (rollout-backend-vllm)

rollout-backend-vllm implements the Phase-3 InferenceBackend trait (init / generate / model_id / shutdown) over vLLM's AsyncLLMEngine via PyO3 in-process. The crate is the second user of the dedicated Python OS-thread pattern hardened in plan 02-05; the first was rollout-plugin-host. The architecture is documented in spec 03-plugin-system §3.2 and the trait surface in spec 02-algorithms §2 / §2a.

Status (plan 03-03): Wave-3 lives. The PyO3 dedicated thread now imports rollout.backends.vllm.engine, which wraps vllm.AsyncLLMEngine; generate drives the engine through a fresh asyncio event loop on the worker thread via pyo3_async_runtimes::tokio::run_until_complete. The default-features (no-vllm) build keeps the Wave-2 stub worker so cargo test -p rollout-backend-vllm still runs without Python / vLLM.

Architecture

+--------------------+ tokio::sync::mpsc::Sender<VllmTask>  +--------------------+
|  Tokio runtime     | -----------------------------------> |  rollout-py-vllm-  |
|  VllmBackend       |                                      |     <engine_id>    |
|  (async impl       | <----------------------------------- |  Python::attach    |
|   InferenceBackend)|       oneshot::Sender<Result<...>>   |  rt.block_on(...)  |
+--------------------+                                      +--------------------+
                                                                       |
                                                                       v
                                                    rollout.backends.vllm.engine
                                                    (Wave 3: AsyncLLMEngine)
  • The Tokio side never touches Python. Every call hops through one OS thread named rollout-py-vllm-<engine_id> that owns the interpreter for the backend's lifetime.
  • The thread imports rollout.backends.vllm.engine once at startup, then loops on mpsc::Receiver<VllmTask> until VllmTask::Shutdown arrives or the channel closes.
  • Each Generate task uses a oneshot::channel for its reply so multiple concurrent prompts can be in flight without head-of-line blocking on the channel (Tokio side dispatches them sequentially in Wave 2; Wave 3 will span them concurrently to let vLLM's continuous batcher do the work).

Cargo feature gating

[features]
default = []       # no PyO3 link; tests use the stub worker
vllm    = []       # imports rollout.backends.vllm.engine on the dedicated thread

With --features vllm off (default), the crate builds without invoking pyo3 at runtime. The dispatch path still exists; Generate returns the Wave-2 stub error from a pure-Rust worker. This honors AGENTS.md §7 (every plugin testable locally without GPU) — no Python interpreter, no vLLM install, no CUDA required for cargo test -p rollout-backend-vllm.

With --features vllm on, the worker thread calls Python::attach(|py| py.import("rollout.backends.vllm.engine")) once at startup. The Python module top-imports vllm.AsyncLLMEngine (plan 03-03).

vLLM version pin: vllm>=0.10,<0.22. The lower bound is the first release where AsyncLLMEngine is an alias for the new v1 engine (vllm.v1.engine.async_llm.AsyncLLM); the upper bound guards against future-version drops of the alias. The pin is documented but NOT enforced in Cargo.toml — vLLM is a Python install. The engine module's import falls back to from vllm.engine.async_llm_engine import … if the top-level alias is removed in a future version.

Pitfall 10: env-write before import

vLLM imports huggingface_hub which lazily reads os.environ.get("HF_TOKEN") when downloading gated models. The dedicated Python thread therefore writes HF_TOKEN into its own os.environ before the py.import("rollout…") call:

#![allow(unused)]
fn main() {
Python::attach(|py| {
    if let Some(token) = &secret_token {
        let os = py.import("os")?;
        let environ: Bound<'_, PyDict> = os.getattr("environ")?.cast_into()?;
        environ.set_item("HF_TOKEN", token)?;
    }
    py.import("rollout.backends.vllm.engine")?;
    Ok(())
});
}

The token flows through the VllmEngine::spawn(engine_id, secret_token) constructor; the SecretStore consumer is wired in plan 03-03 alongside the live engine.

Wave 2 vs Wave 3 split

ConcernWave 2 (plan 03-01)Wave 3 (plan 03-03)
Cargo crate + vllm feature flagshippedinherited
PyO3 dedicated thread (rollout-py-vllm-…)shippedinherited
VllmTask enum + mpsc::Sender dispatchshippedinherited
VllmBackend: InferenceBackend impl shapeshipped (stub)live engine
python/rollout/backends/vllm/engine.pystubreal AsyncLLMEngine
AsyncLLMEngine.from_engine_args wiringshipped
pyo3_async_runtimes::tokio::run_until_completeshipped
Explicit torch.cuda.is_available() device probeshipped
HF_TOKEN env-write before import vllmwired (passthrough)exercised
Content-addressed model_id from HF repo SHAshipped
criterion throughput bench + raw-vllm baselineplaceholdershipped

Bridging the asyncio ↔ Tokio gap (RESEARCH Pitfall 2)

vllm.AsyncLLMEngine.generate(prompt, sampling, request_id) returns an async generator that vLLM's own scheduler drives. To consume it from Rust, we:

  1. Build a coroutine on the GIL by calling engine.generate_one(...) (the Python module's wrapper that drives the async-for loop to completion and returns the final RequestOutput as a dict).
  2. Create a fresh asyncio event loop on the worker thread.
  3. Hand the coroutine to pyo3_async_runtimes::tokio::run_until_complete(event_loop, async move { into_future(coro).await }). The Python C-level event_loop.run_until_complete releases the GIL whenever the loop has nothing to do — which is exactly when our Rust await yields. That is what lets vLLM's background scheduler tasks (also on this asyncio event loop) make progress.

The contract is verified on every CI build by tests/pyo3_bridge_smoke.rs: a Python async def smoke() spawns a background threading.Thread that polls a flag; the assertion fails if the background thread does not see the flag set, proving the GIL would have been held across the await (Pitfall 2 regression).

Pitfall 9: explicit device probe

vLLM's EngineArgs(device="auto") is unreliable on partially-installed CUDA hosts (silent CPU fallback). The Python glue probes torch.cuda.is_available() and passes an explicit device="cuda" or device="cpu" to AsyncEngineArgs. CONTEXT D-VLLM-04's earlier "auto" guidance is superseded by this probe.

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
args = AsyncEngineArgs(model=model_uri, device=device, ...)

Live integration tests (gated)

Two #[ignore]'d integration tests under crates/rollout-backend-vllm/tests/ exercise the live engine:

  • vllm_init.rs — bring up facebook/opt-125m; assert backend.model_id() is content-addressed and stable across two init() calls of the same URI.
  • vllm_generate.rs — bring up Qwen/Qwen2.5-0.5B-Instruct and run a 1 prompt × 8 tokens round-trip with a 300 s timeout (RESEARCH Pitfall 8 CPU mode is slow).

Both tests skip unless ROLLOUT_VLLM_AVAILABLE=1 is set in the environment. Run them with:

ROLLOUT_VLLM_AVAILABLE=1 cargo test \
    -p rollout-backend-vllm --features vllm \
    --test vllm_init --test vllm_generate -- --include-ignored

The default workspace cargo test skips them.

Benchmark methodology

crates/rollout-backend-vllm/benches/throughput.rs runs a 64-prompt × 64-token criterion bench against facebook/opt-125m; the companion scripts/raw_vllm_baseline.py drives the same prompt set through raw vllm.LLM (sync API). Compare the two tokens/sec numbers to verify the BACKEND-02 <10% overhead exit criterion — a ratio ≥ 0.9 passes.

The bench is gated behind --features vllm and ROLLOUT_VLLM_AVAILABLE=1. CI does not run it by default; the perf check lives on the self-hosted GPU runner per CONTEXT D-CLI-05.

# rollout side:
ROLLOUT_VLLM_AVAILABLE=1 cargo bench \
    -p rollout-backend-vllm --features vllm --bench throughput

# baseline side:
python scripts/raw_vllm_baseline.py facebook/opt-125m

Cross-references

rollout-runtime-batch

The runtime-glue crate behind rollout infer batch. Owns the CAS sample-state machine, the queue management around the Phase-2 InMemQueue, JSONL I/O, the InferBatchConfig TOML schema, and the test-only MockBackend that lets us exercise the full coordinator/worker path without spinning up a real vLLM engine.

Why a separate crate?

rollout-backend-vllm (Layer 2) depends only on rollout-core + pyo3 — it has no business touching rollout-storage / rollout-cloud-local / rollout-transport. The dep-direction lint (invariants #5 and #6 from plan 03-00) enforces this. The Phase-3 work that does need storage + queue + object-store — the sample-state CAS machine, the resume scan, the JSONL reader — lives here in rollout-runtime-batch (Layer 3). The runtime composes any Arc<dyn InferenceBackend>; the backend stays cloud-agnostic.

+------------------+        rollout-cloud-local::InMemQueue
|  rollout-cli     |---┐    rollout-cloud-local::FsObjectStore
|  infer batch     |   |    rollout-storage::EmbeddedStorage
+------------------+   ▼
                  +-------------------------+      +-----------------------+
                  | rollout-runtime-batch   |◀────▶| Arc<dyn               |
                  |                         |      |  InferenceBackend>    |
                  | • BatchCoordinator      |      |                       |
                  | • BatchWorker           |      | • VllmBackend (prod)  |
                  | • SampleRecord + CAS    |      | • MockBackend (test)  |
                  | • JSONL I/O             |      +-----------------------+
                  | • InferBatchConfig      |
                  +-------------------------+

SAMPLING_PARAMS_SCHEMA_VERSION

Per RESEARCH §"Pitfall 1", the deterministic sample_id() hash is brittle against any change to SamplingParams's wire shape. Postcard is not self-describing — adding a field rewrites every byte of to_stdvec(&params), which would invalidate every outstanding Pending / Running sample-ID in storage.

The defence is a schema-version byte:

#![allow(unused)]
fn main() {
const SAMPLING_PARAMS_SCHEMA_VERSION: u8 = 1;

fn sample_id(model: &ContentId, prompt: &str, params: &SamplingParams, idx: u64) -> ContentId {
    let mut h = blake3::Hasher::new();
    h.update(&[SAMPLING_PARAMS_SCHEMA_VERSION]);  // FIRST byte
    h.update(&model.0);
    h.update(prompt.as_bytes());
    h.update(&postcard::to_stdvec(params).unwrap());
    h.update(&idx.to_le_bytes());
    ContentId(*h.finalize().as_bytes())
}
}

When you add a field to SamplingParams, bump SAMPLING_PARAMS_SCHEMA_VERSION. That invalidates outstanding IDs by design — drain the in-flight batch under the old version before deploying the new schema, or re-run any orphaned samples.

Tests:

  • content_id_derivation.rs::sample_id_matches_locked_hex_for_default_params — locks the hex digest for the default-params case so any silent change to hasher input order trips immediately.
  • content_id_derivation.rs::schema_version_byte_is_first — explicit regression catch for Pitfall 1.
  • Six proptest property tests prove that every component of the input (prompt, idx, model, temperature, max_tokens, seed) participates in the hash.

CAS state machine

SampleRecord { id, prompt_blob, state, created_at_ms, input_idx } lives at infer/<run_id>/samples/<sample_id_hex> (storage namespace infer, table T_INFER).

              try_claim                      try_complete
   Pending  ──────────────▶ Running ────────────────────▶ Done
     ▲                       │                            ▲
     │                       │                            │ (terminal)
     │ try_repending         │ try_fail
     │  (stale only)         ▼
     └──────────────────── Failed
                            (transient errors;
                             coordinator re-enqueues)

Helpers (in src/state.rs):

  • try_claim(txn, record, run_id, worker_id, now_ms, stale_after_ms) → Result<bool> CAS-swaps Pending (or stale Running) → Running. Returns false on race-loss.
  • try_complete(txn, running_record, run_id, completion_blob, finished_at_ms)
  • try_fail(txn, running_record, run_id, reason, failed_at_ms)
  • try_repending(txn, running_record, run_id) — for the resume path.

All helpers take a &mut Box<dyn StorageTxn> so the caller controls the commit/abort lifecycle.

Resume semantics

BatchCoordinator::scan_and_enqueue(inputs, model_content_id, sampling) is idempotent. For each input it derives sample_id(model, prompt, sampling, idx), looks up the existing record at sample_key(run_id, sid), and:

Current stateAction
(absent)write Pending + prompt blob + enqueue
Pendingenqueue (worker will claim via CAS)
Failedenqueue (retry)
Running (fresh)skip — live owner
Running (stale)CAS RunningPending; enqueue on success
Doneskip — terminal

"Stale" defaults to 5 minutes (DEFAULT_STALE_AFTER_MS = 5 * 60_000). Override per-coordinator via .with_stale_after_ms(ms). The 5-minute window follows RESEARCH §"Pitfall 5" — short enough to recover from a SIGKILL'd worker, long enough to absorb a slow generation pass.

The stale-Running re-Pending step is itself a CAS so two coordinators racing on the same orphaned sample don't double-enqueue. The test resume_skips_done.rs::scan_enqueues_only_non_terminal_samples covers all four transition cases in a single integration run.

BatchWorker flow

#![allow(unused)]
fn main() {
pub async fn run_loop(&self) -> Result<usize, CoreError> {
    loop {
        match self.run_one().await? {
            RunOutcome::Drained => return Ok(completed),
            RunOutcome::Completed => completed += 1,
            RunOutcome::Failed | RunOutcome::Skipped => { /* keep going */ }
        }
    }
}
}

run_one() dequeues a sample-id, loads the SampleRecord, CAS-claims it, fetches the prompt blob, calls backend.generate(&[Prompt], &sampling), writes the completion to the object store, and CAS-transitions Running → Done. Backend errors translate to Running → Failed via try_fail; the queue item is ack'd either way (the coordinator re-enqueues Failed rows on the next scan_and_enqueue).

Concurrency: each BatchWorker is one Tokio task; spawn N of them sharing the same Arc<dyn Queue> / Arc<dyn Storage> / Arc<dyn ObjectStore> and the CAS dance guarantees at-most-once completion per sample.

MockBackend

Gated by the test-mock-backend Cargo feature. Returns Completion { text: format!("MOCK:{}", p.0), finish_reason: "stop", … } after an optional tokio::time::sleep. Used by worker_happy_path.rs and (in Wave-4) by the restart-no-duplicates integration test to exercise the full pull-loop without a Python / vLLM dependency.

The contract: MockBackend impls the same InferenceBackend trait that VllmBackend does, so it composes against BatchWorker identically. CLI code in Wave 3 (plan 03-04) wires VllmBackend; nothing in BatchWorker / BatchCoordinator knows the difference.

InferBatchConfig

The TOML schema for rollout infer batch --config <path> lives here in src/config.rs (NOT in rollout-cli, per WARN-5). rollout-cli imports it via use rollout_runtime_batch::config::InferBatchConfig;.

[model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[sampling]
temperature = 0.7
top_p       = 0.9
max_tokens  = 64
seed        = 42

[input]
glob = "data/prompts/*.jsonl"

[output]
dir = "data/completions"

[workers]
count = 1

#[serde(deny_unknown_fields)] on every block per spec 11 — typos in the config file fail at plan-time, not minute 47 of a run.

JSONL I/O

read_jsonl + write_jsonl use stdlib tokio::io::BufReader::lines() + serde_json::from_str — no serde_jsonlines dep. Arbitrary extra fields on the input row are preserved verbatim via #[serde(flatten)] and round-trip to the output row. write_jsonl writes one JSON object per line; the CLI sorts on SampleRecord.input_idx before calling write_jsonl so output order matches input order regardless of which worker finished which sample first.

Phase 4 — TrainableBackend impl on MockBackend

Beyond the Phase-3 InferenceBackend impl, MockBackend ships a TrainableBackend impl gated behind the same test-mock-backend feature. The training extension drives plan 04-02's LOAD-BEARING snapshot_resume.rs byte-compare test (TRAIN-03) — no Python, no GPU, runs on every CI build.

  • MockBackend::new_train(seed) initialises an ndarray::Array1<f32> of length 8 with every element set to (seed as f32) / 1000.0.
  • forward_with_loss returns loss = 0.5 and a GradHandle { step: prev + 1 }.
  • optimizer_step applies a deterministic SGD delta (seed + grad_handle.step) * lr to every weight element.
  • save_weights returns ContentId::of(postcard::to_stdvec(&weights)).
  • load_weights is a no-op; new_train_with_weights(seed, weights) is the test-side restore hook so the byte-compare assertion in snapshot_resume.rs is meaningful.

Determinism contract: two MockBackends constructed with the same seed produce byte-equal weights after K identical optimizer_step calls. Plan 04-02's test proves bit-identical resume by running 10 uninterrupted steps versus 5-snapshot-5 and asserting the two final weight vectors are equal.

See also

  • RESEARCH §"Pitfall 1" — schema-version byte rationale.
  • RESEARCH §"Pitfall 5" — stale-Running re-Pending CAS dance.
  • docs/specs/02-algorithms.md §2a — InferenceBackend extension shape.
  • docs/specs/04-storage-snapshots.md §2 — Storage + StorageTxn + CAS.
  • docs/specs/06-cloud-layer.md §3 — Queue + ObjectStore.
  • Plan 04-02 — SFT skeleton + the snapshot-resume byte-compare proof.

rollout infer batch CLI

The first user-visible building block of rollout: a resumable batch-inference pipeline driven by a single TOML config. Bridges the Phase-3 substrate (rollout-backend-vllm + rollout-runtime-batch) behind a clap subcommand on rollout-cli.

Invocation

rollout infer batch \
    --config examples/batch-tiny.toml \
    [--resume <run_id>] \
    [--workers N] \
    [--dry-run]
FlagDefaultPurpose
--configrequiredPath to the TOML config (schema below).
--resumeimplicit from output dirOverride the <output.dir>/run-id lookup with an explicit ULID.
--workers[workers].countOverride the worker pool size for this invocation.
--dry-runfalseValidate config + probe inputs + check HF_TOKEN allowlist; never call the backend.

TOML schema

The schema lives in rollout-runtime-batch::config::InferBatchConfig (the CLI imports it; spec 11 single-source-of-truth). Every block is #[serde(deny_unknown_fields)] — typos are caught at load time.

[model]
uri       = "Qwen/Qwen2.5-0.5B-Instruct"   # HF repo, local path, or object-store URI
tokenizer = "..."                          # optional override

[sampling]
temperature = 0.7
top_p       = 0.9
top_k       = -1     # -1 disables
max_tokens  = 64
seed        = 42     # optional; deterministic when set
stop        = []
stream      = false  # MUST be false in Phase 3 (D-BACKEND-03)

[input]
glob = "data/prompts/*.jsonl"

[output]
dir = "data/completions"

[workers]
count = 1            # >= 1

JSONL input contract

One JSON object per line. Required: prompt. Optional: id (defaults to the deterministic sample-id from blake3(model || prompt || params) per rollout-runtime-batch::sample_id). Extra fields are preserved and round-tripped to output.

{ "prompt": "Translate to French: Hello.", "id": "p-001", "tag": "demo" }

JSONL output contract (D-CLI-03)

Each row:

{
  "id":                "<content-addressed sample id>",
  "prompt":            "<original prompt text>",
  "completion":        "<generated text>",
  "sampling_params":   { ... },
  "model_uri":         "Qwen/Qwen2.5-0.5B-Instruct",
  "finish_reason":     "stop",
  "model_content_id":  "<blake3-hex of resolved model SHA>",
  "completion_blob_id":"<blake3-hex of completion bytes in the object-store>",
  "generated_at":      "2026-05-20T22:31:09.123Z"
}

Order matches input file order regardless of worker concurrency — the CLI calls BatchCoordinator::collect_done_records() (sorted by input_idx) and emits in that order.

run_id lifecycle (BLOCKER 6)

Three-tier resolution:

SourceWhen applied
--resume <ULID>Always honored if present.
<output.dir>/run-id (file)Re-attach if the file exists and --resume is absent.
Freshly minted ULIDFirst run; written atomically (tmp + rename).

The file is single-line UTF-8 (ULID Crockford form). Plan 03-05's restart_no_duplicates test reads this file between phases to obtain the ULID for the explicit --resume flag.

--dry-run semantics

Performs (in order):

  1. Parse TOML against InferBatchConfig (deny_unknown_fields).
  2. Validate sampling.stream == false, sampling.max_tokens > 0, workers.count >= 1, input.glob non-empty.
  3. Resolve input.glob and read every JSONL file end-to-end.
  4. Probe whether the model URI is on a known-gated prefix (meta-llama/*, mistralai/*) and look up ROLLOUT_SECRET_HF_TOKEN (best-effort).
  5. Print dry-run OK: model=… inputs=… workers=… and exit 0.

Crucially, the backend is never constructed--dry-run works on a build with neither --features vllm nor --features test-mock-backend.

Backend selection

The CLI picks one backend at runtime based on Cargo features + env:

OrderConditionBackend
1--features test-mock-backend + ROLLOUT_TEST_MOCK_BACKEND=1rollout_runtime_batch::MockBackend
2--features vllmrollout_backend_vllm::VllmBackend
3nonefast-fail with Fatal(ConfigInvalid) at full-run; dry-run still works.

Observability

RUST_LOG=info rollout infer batch … produces structured tracing events. Key fields: run_id, enqueued, total, completed. Worker spans carry worker = <ulid>.

Exit codes

CodeMeaning
0Success (or successful dry-run).
2Config-invalid, infrastructure error, or backend error.

CPU-mode caveat

vLLM CPU inference on the test model (Qwen2.5-0.5B-Instruct) is dramatically slower than CUDA: expect ~1–5 tokens/sec. On macOS Apple-Silicon, vLLM has no PyPI wheels (RESEARCH Pitfall 3) — run via Docker. CI's infer-smoke job is Linux-only and gated on ROLLOUT_VLLM_AVAILABLE=1.

CPU mode

vLLM ships with first-class CUDA support, but for AGENTS.md §7 ("every plugin testable locally without GPU") rollout must also work on a plain CPU. This chapter documents the CPU-mode contract for rollout-backend-vllm, the dev-loop reality of macOS Apple-Silicon, and the smoke-test posture that lets default CI stay green without any GPU.

Where CPU mode is selected

Per Phase 3 CONTEXT decision D-VLLM-04 (as overridden by RESEARCH §"Pitfall 9"), the Python-side glue in python/rollout/backends/vllm/engine.py performs an explicit torch.cuda.is_available() probe and passes device="cuda" or device="cpu" to AsyncEngineArgs — never device="auto". The auto-detect path was rejected because vLLM silently falls back to CPU when CUDA libraries are partially installed (driver present, runtime missing, etc.), and a silent fallback would mask configuration mistakes at runtime instead of failing at plan time.

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
engine_args = AsyncEngineArgs(
    model=model_uri,
    device=device,           # explicit; not "auto"
    disable_log_stats=True,
    disable_log_requests=True,
)

rollout-cloud-local::ComputeHint::inventory() still informs observability events (gpu_inventory_collected) and worker-config decisions, but it is no longer the source-of-truth for the engine device kwarg.

Expected CPU throughput

vLLM's CPU backend is functional but not fast. Approximate single-stream throughput for the canonical Phase-3 test model (Qwen/Qwen2.5-0.5B-Instruct, 16 max-tokens):

Hosttokens/sec
Apple M1 Pro (8-core perf)~3–6
Linux x86_64 (16-core)~2–4
Generic CI runner (4-core)~1–2

A 4-prompt × 16-token smoke run finishes in well under 60 s on any of these. Anything longer than the canonical examples/batch-tiny.toml shape should target a CUDA host.

macOS Apple-Silicon

vLLM has no Apple-Silicon wheel as of Phase 3. pip install vllm on macOS produces ERROR: No matching distribution found for vllm. The two paths forward:

  1. Build-from-source (slow, brittle). VLLM_TARGET_DEVICE=cpu pip install -e . against a freshly cloned vLLM repo. Compilation takes 10–30 min and depends on Apple-clang versions in ways that drift between vLLM releases. Documented but not recommended for routine dev work.
  2. Docker (recommended). See dev-on-macos.md. A linux/amd64 (or linux/arm64 if available) container with vllm>=0.10 pre-installed lets the rollout binaries run identically to CI.

Either way, the Rust-side test surface (~80 % of Phase 3's automated tests — SamplingParams postcard determinism, sample_id derivation, CAS state-machine transitions, JSONL round-trip, the MockBackend-driven restart_no_duplicates test) runs natively on macOS without any vLLM installed. Only the live-engine integration tests (vllm_init.rs, vllm_generate.rs) and the make infer-smoke script require a real vLLM.

CI posture

  • Default CI (public runners): infer-smoke now runs on every PR and merge — no ROLLOUT_VLLM_AVAILABLE gate. It installs the vllm-cpu PyPI wheel (~101 MB unified CPU wheel, AVX2 fallback) instead of the ~10 GB CUDA wheel; the torch.cuda.is_available() probe in engine.py selects device="cpu" automatically. The job downloads Qwen2.5-0.5B-Instruct (cached under ~/.cache/huggingface), runs rollout infer batch --config examples/batch-tiny.toml, and asserts 4 non-empty completion rows — all on the free 4-vCPU ubuntu-latest runner in well under 60 s of inference. train-smoke is likewise always-on, installing CPU torch + transformers + accelerate and running the examples/sft-tiny.toml SFT (max_steps = 2) on CPU. pip + HuggingFace caches keep both jobs fast on repeat runs.
  • MockBackend proofs unchanged: the load-bearing restart_no_duplicates (BACKEND-02 exit (b)) and bit-identical-resume (TRAIN-03) proofs still run in the standard test job via MockBackend — no GPU/vLLM/transformers required there.
  • Local dev: make infer-smoke after pip install 'vllm-cpu>=0.17'. On Apple-Silicon, prefer the Docker path documented in dev-on-macos.md.

Failure modes

FailureSurfaceDiagnosis
import torch failsPython ImportError at engine initActive venv missing torch — pip install torch first
import vllm failsPython ImportError at engine initActive venv missing vllm — for the CPU/CI path pip install 'vllm-cpu>=0.17'; on a CUDA host install the matching CUDA wheel
torch.cuda.is_available() == False on a GPU hostengine boots in CPU mode silentlyNVIDIA driver/runtime mismatch — install matching CUDA runtime; the explicit probe surfaces this rather than masking it
vllm import succeeds but AsyncLLMEngine.from_engine_args panics with device="cpu" not supportedvLLM version too oldupgrade to vllm>=0.10
make infer-smoke times out (>300 s) on a CPU hostmodel larger than Qwen2.5-0.5B-Instructuse the canonical examples/batch-tiny.toml model; do not run multi-billion-param models on CPU

Resume: zero-duplicate batch restart

Phase 3's rollout infer batch is resumable. If the worker is killed mid-batch, restarting with --resume <run_id> (or with the same [output] dir = and no explicit flag) continues from the last persisted sample with zero duplicates and no skipped inputs. This is one of the three ROADMAP Phase-3 exit criteria.

The lifecycle

Every rollout infer batch invocation has exactly one run_id — a 26-character ULID. The CLI resolves it via a three-tier precedence:

  1. --resume <id> explicit override. Use this when you know the run you want to reattach.
  2. <output.dir>/run-id file. If --resume is absent but the output directory already has a run-id file (from a prior run), the CLI reads it and continues that run. This is the common case for "kill + restart with the same config."
  3. Mint a fresh ULID. No override, no prior file → generate a new RunId, write it atomically (tempfile + rename) to <output.dir>/run-id, and proceed as a brand-new run.

run-id files are single-line UTF-8 ULIDs. They are safe to delete (forces a fresh run on next invocation) but should not be edited by hand.

How resumability is implemented

Three subsystems collaborate (each documented in its own chapter):

  • rollout-storage (the redb-backed EmbeddedStorage from Phase 2) holds the per-sample state KV under namespace infer. Each sample's SampleRecord carries one of four SampleState variants — Pending, Running, Done { completion_blob }, or Failed { reason } — and transitions atomically via Storage::cas_bytes.
  • rollout-cloud-local::InMemQueue (also Phase 2) holds the work queue with Storage spill: every enqueue/ack mirrors to redb under cloudlocal_queue, so a coordinator restart replays whatever was unacknowledged.
  • rollout-cloud-local::FsObjectStore stores the actual completion blobs at content-addressed paths under <output.dir>/object-store/<sha256[0..2]>/<sha256[2..4]>/<sha256>. The SampleRecord::state = Done { completion_blob: ContentId } variant carries the blob's ContentId; the actual completion text is read from the object store at output-writing time.

At plan time, BatchCoordinator::scan_and_enqueue iterates the existing samples namespace and:

  • skips records with state = Done (already complete; their completion blobs survive in the object store);
  • re-enqueues records with state = Pending or state = Failed;
  • treats records with state = Running and a started_at older than stale_after_ms (default 60 000) as stale — CAS Running → Pending and re-enqueue (per RESEARCH Pitfall 5: covers the kill-mid-flight case where a worker crashed without finishing).

Only the new and re-enqueued samples flow through the queue. Workers process them via the same BatchWorker::run_loop as a fresh run.

The deterministic test

crates/rollout-cli/tests/restart_no_duplicates.rs is the load-bearing proof. It runs on every CI build — no GPU, no vLLM, no real model — because it uses the MockBackend shipped by rollout-runtime-batch behind the test-mock-backend Cargo feature.

The test follows RESEARCH §"Restart-resume test design" verbatim:

  1. Spawn rollout infer batch --config <tmp> as a subprocess via tokio::process::Command::new(env!("CARGO_BIN_EXE_rollout")) with ROLLOUT_TEST_MOCK_BACKEND=1.
  2. Stream stdout, count sample_completed events.
  3. After 3 completions, child.start_kill() (SIGKILL).
  4. Read <output.dir>/run-id.
  5. Spawn a second subprocess with --resume <run_id> and the same env.
  6. Wait for exit 0.
  7. Assert the final completions.jsonl has exactly N=8 lines, all unique sample_ids, and every input prompt is represented once.

Run locally:

cargo test -p rollout-cli --features test-mock-backend --test restart_no_duplicates

Total wall-clock: ~1.5 s on a modern dev box. The MockBackend completes each sample in 50 ms; the test deliberately spawns subprocesses to exercise the real CLI resume code path including --resume parsing, run-id file I/O, and BatchCoordinator::scan_and_enqueue.

Content-addressed sample IDs

A sample's identity is derived deterministically from (model, prompt, sampling_params, input_index) — see crates/rollout-runtime-batch/src/state.rs::sample_id. Per CONTEXT D-RESUME-01 (extended by RESEARCH Pitfall 1):

#![allow(unused)]
fn main() {
fn sample_id(model: &ContentId, prompt: &str, params: &SamplingParams, idx: u64) -> ContentId {
    let mut h = blake3::Hasher::new();
    h.update(&[SAMPLING_PARAMS_SCHEMA_VERSION]);  // = 1 in Phase 3
    h.update(model.as_bytes());
    h.update(prompt.as_bytes());
    h.update(&postcard::to_stdvec(params).unwrap());
    h.update(&idx.to_le_bytes());
    ContentId::from(h.finalize())
}
}

The leading SAMPLING_PARAMS_SCHEMA_VERSION byte is the maintenance hook: bumping the constant invalidates every outstanding sample-id, which is the desired behavior whenever the SamplingParams struct gains, loses, or reorders fields. Without it, a future serde-evolution of SamplingParams could silently produce different postcard bytes for the same logical configuration and break resume across versions.

What you cannot resume

  • Cross-model resume. A run whose <output.dir>/run-id was minted under model A cannot resume against model B — the per-sample model_id is part of the content-addressed sample ID; mixing models triggers Fatal(ConfigInvalid) at scan time.
  • Cross-machine resume (Phase 3). The EmbeddedStorage redb file is local to the host. Phase 5 (CLOUD-03) introduces object-store-backed snapshot storage that will allow cross-machine resume; until then, resume is single-host.
  • Streaming runs. Phase 3 rejects sampling.stream = true at plan time (D-BACKEND-03); streaming is Phase 8 (INFER-01).

See also

  • cli.md — full CLI flag reference for --resume.
  • cpu-mode.md — where the MockBackend-driven test runs vs. live vLLM.
  • batch-runtime.md — the BatchCoordinator and BatchWorker internals.
  • vllm-backend.md — Pitfall-2 GIL bridge details that the live engine relies on.

Dev loop on macOS (Apple Silicon)

vLLM has no Apple-Silicon wheel as of Phase 3 (see cpu-mode.md). That makes the full rollout infer batch smoke loop unavailable natively on macOS dev boxes. This chapter documents the two supported workarounds and the substantial default-CI surface that does run on macOS so you don't need a Linux box for routine work.

What runs natively on macOS

The following all pass on darwin-aarch64 with PYO3_PYTHON=/opt/homebrew/bin/python3.13 (or a 3.11+ python on PATH) — no vLLM required:

cargo test --workspace --tests              # 100+ tests across Phase 2 + Phase 3
cargo test -p rollout-cli --features test-mock-backend --test restart_no_duplicates
make smoke                                  # Phase 2 substrate smoke
cargo clippy --workspace --all-targets -- -D warnings
cargo doc   --workspace --no-deps           # rustdoc gate (without --all-features; see ci.yml)
mdbook build docs/book

What does not run natively:

make infer-smoke ROLLOUT_VLLM_AVAILABLE=1   # needs `pip install vllm`, which has no aarch64-darwin wheel
cargo test -p rollout-backend-vllm --features vllm --test vllm_init -- --include-ignored
cargo bench -p rollout-backend-vllm --bench throughput

For the load-bearing exit-criterion-(b) proof (zero-duplicate restart), the MockBackend-driven test runs natively in ~1.5 s. Live vLLM is only needed for exit criterion (a) (the canonical rollout infer batch --config examples/batch-tiny.toml) and (c) (the <10 % overhead benchmark).

Run rollout inside a Linux container that has vLLM pre-installed. The repo is mounted via a bind volume so edits flow through immediately.

Sample Dockerfile.devcontainer:

FROM rust:1.88-slim-bookworm

RUN apt-get update && apt-get install -y --no-install-recommends \
      python3.11 python3-pip git make pkg-config libssl-dev protobuf-compiler \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --break-system-packages 'vllm>=0.10,<0.22'

WORKDIR /workspace
CMD ["bash"]
docker build -t rollout-dev -f Dockerfile.devcontainer .
docker run --rm -it \
  -v "$PWD":/workspace \
  -v "$HOME/.cargo/registry":/root/.cargo/registry \
  -v "$HOME/.cache/huggingface":/root/.cache/huggingface \
  rollout-dev

Inside the container:

cargo test --workspace --tests
ROLLOUT_VLLM_AVAILABLE=1 make infer-smoke
cargo run -p rollout-cli --features vllm -- infer batch --config examples/batch-tiny.toml

Caveats:

  • linux/arm64 and linux/amd64 images both work; linux/arm64 is faster on M-series silicon.
  • The first run downloads Qwen/Qwen2.5-0.5B-Instruct (~1 GiB). The ~/.cache/huggingface bind volume above persists it across container restarts.
  • A pure-CPU Docker run will not exercise CUDA-specific code paths. For those, use Workaround 2 or a cloud GPU runner.

Workaround 2: Cloud GPU runner

For exit criterion (c) (the <10 % overhead benchmark) you need a real CUDA GPU. The repo's CI infer-smoke job is gated on vars.ROLLOUT_VLLM_AVAILABLE == '1' and needs: test — set the repo variable on a self-hosted runner with a GPU, then push, and the bench captures throughput against python scripts/raw_vllm_baseline.py automatically.

For local development without committing, a Lambda Labs / Runpod / Vast.ai box rented by the hour works well:

# On the GPU box:
git clone <this-repo>
cd rollout
pip install 'vllm>=0.10,<0.22'
ROLLOUT_VLLM_AVAILABLE=1 make infer-smoke
cargo bench -p rollout-backend-vllm --bench throughput

VLLM_TARGET_DEVICE=cpu pip install -e . against a freshly cloned vllm-project/vllm repo works on macOS in principle. In practice the build takes 10–30 min, brittle-fails on Apple-clang version mismatches in ways that drift between vLLM releases, and produces a vllm binding ~3× slower than the Linux CPU wheel. Prefer Docker.

What CI tests on macOS

The lint and test workflow jobs run on macos-14 (the GitHub macos-14 runner is Apple-Silicon). They cover the entire Rust workspace surface and the MockBackend-driven restart_no_duplicates test. The infer-smoke job runs on ubuntu-latest and is opt-in; the lint job uses default features only (no quic, no vllm) per the .github/workflows/ci.yml comment.

See also

  • cpu-mode.md — where CPU mode is selected and the expected throughput numbers.
  • cli.md — full CLI reference.
  • resume.md — how MockBackend proves the resume contract without vLLM.

Training

Phase 4 lands the first end-to-end training story: supervised fine-tuning + Bradley-Terry reward-model training + bit-identical-resume training-state snapshots + the Postgres Storage backend.

What's here

  • SFT (Supervised Fine-Tuning)rollout-algo-sft; TRAIN-01.
  • RM (Reward Model)rollout-algo-rm with Bradley-Terry pairwise loss; TRAIN-02.
  • Snapshotsrollout-snapshots, SnapshotKind::TrainState, tar + blake3 + restore; TRAIN-03.
  • Postgres backendrollout-storage[postgres]; testcontainers CI; TRAIN-04.
  • Determinismaccelerate.save_state + CUDA / CPU caveats.
  • CLIrollout train sft|rm and rollout snapshot list|show|prune.
  • CPU mode — what to expect on macOS / Apple Silicon development boxes.

Quickstart

# Dry-run validation (works without Python deps).
cargo run -p rollout-cli -- train sft --config examples/sft-tiny.toml --dry-run

# Live run (requires transformers + accelerate + torch; ~3-5 min CPU on M-series).
pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke

Phase 4 exit criteria

CriterionWhere it's proven
rollout train sft --config examples/sft-tiny.toml completes on a small modelmake train-smoke (gated on ROLLOUT_TRANSFORMERS_AVAILABLE=1)
Snapshot + restart produces bit-identical weights for next K stepscrates/rollout-algo-sft/tests/snapshot_resume.rs (default-fire) + crates/rollout-backend-vllm/tests/snapshot_resume_live.rs (gated)
Postgres backend CI-tested via containerized integration testcrates/rollout-storage/tests/postgres_integration.rs via the postgres-integration CI job

What's NOT here (deferred)

  • PPO / GRPO / DPO / IPO / KTO — Phases 9 / 10.
  • Buffer / Process / EpisodicMemory snapshot kinds — Phases 9 / 11 / 8.
  • Cloud object stores for snapshot blobs — Phase 5.
  • HuggingFace datasets Hub integration — Phase 7.
  • Multi-node distributed training — Phase 6.

SFT — Supervised Fine-Tuning

Phase 4 plan 04-02 ships rollout-algo-sft: a PolicyAlgorithm skeleton driven by a deterministic MockBackend, with the load-bearing TRAIN-03 byte-compare resume proof. This chapter covers the architecture, the SFT settings shape, the JSONL data contract, and how snapshot_save / snapshot_restore participate in the TRAIN-03 round-trip.

The HF transformers + accelerate path lands in plan 04-05; this skeleton intentionally has zero Python / GPU dependencies so the snapshot resume contract is exercised on every CI build.

Architecture

   ┌─────────────────────────┐         ┌────────────────────────────┐
   │ SftSettings (TOML)      │         │ AlgoDependencies           │
   │   base_model, optimizer │         │   backend  : Arc<dyn TB>   │
   │   budget, dataset       │         │   storage  : Arc<dyn S>    │
   │   packing, loss_on, …   │         │   object   : Arc<dyn O>    │
   └────────────┬────────────┘         │   snapshots: Arc<dyn Sn>   │
                │ from_settings        │   events   : Arc<dyn Em>   │
                ▼                      └─────────────┬──────────────┘
       ┌─────────────────────┐                       │
       │     SftAlgo         │ ◀─────────────────────┘
       │  (PolicyAlgorithm)  │
       │   step: u64         │     run() loop, bounded by budget.max_steps
       └──────────┬──────────┘
                  │ step_once()
                  ▼
       forward_with_loss  →  optimizer_step  →  step += 1
                  │
                  ▼
     snapshot_save / snapshot_restore (algo meta only — weights via TrainableBackend::save_weights)

SftAlgo holds an Arc<dyn TrainableBackend> and a step counter. Each step_once() synthesises a single-row TrainBatch, calls forward_with_loss (which returns a constant loss=0.5 plus an opaque GradHandle), then calls optimizer_step. The trait's optimizer_step takes &self (interior mutability) so the algo can step through the Arc<dyn …> without unique ownership.

Plan 04-05 replaces the synthetic batch with a real tokenized chunk read from the dataset.

SftSettings (TOML shape)

[algorithm]
kind = "sft"

[algorithm.sft]
minibatch_size = 8
gradient_accumulation = 1
loss_on = { kind = "assistant_only" }

[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[algorithm.sft.dataset]
kind = "jsonl_path"
path = "data/sft-tiny.jsonl"

[algorithm.sft.optimizer]
kind = "sgd"
lr = 1.0e-3
weight_decay = 0.0
betas = [0.9, 0.999]
eps = 1.0e-8
warmup_steps = 0
schedule = "constant"

[algorithm.sft.budget]
max_steps = 1000

[algorithm.sft.packing]
kind = "off"
max_seq_len = 2048

The JSON schema is generated by cargo xtask schema-gen from rollout_core::config::training::SftSettings.

JSONL data contract (D-DATA-01)

load_jsonl accepts two shapes per row:

ShapeJSONDataRow
Prompt / completion{"prompt":"Q","completion":"A"}{ prompt: "Q", assistant: "A" }
Chat messages{"messages":[{"role":"user","content":"Q"},{"role":"assistant","content":"A"}]}{ prompt: "[user] Q", assistant: "A" }

Phase-4 restrictions:

  • At most one "role": "assistant" turn per row (multi-turn lands when the harness work in Phase 7 needs it).
  • Empty lines are skipped.
  • Malformed lines (neither shape, missing assistant in messages, unparseable JSON) produce Fatal(ConfigInvalid) with the file path and line number — <path>:<lineno>: <reason>. Easy to grep, easy to fix.

The Phase-7 work (HARNESS-*) extends this to DatasetRef::Other(...) for harness-driven datasets; until then, Other is a config error.

validate_plan errors

LocatorReason
algorithm.sft.minibatch_sizemust be ≥ 1
algorithm.sft.optimizer.lrmust be > 0

These fire at plan time so the CLI rejects bad configs before any backend is constructed.

snapshot_save / snapshot_restore (D-DETERM-05)

SftAlgo::snapshot_save builds a Snapshot row with:

  • meta = { step: <u64>, weights_id: "<hex>" } — algorithm-internal extras, free-form serde_json::Value.
  • parts = [{ role: "weights", content: <ContentId> }] — points at the bytes returned by TrainableBackend::save_weights.
  • kind = SnapshotKind::TrainState.

SftAlgo::snapshot_restore reads meta.step and resets the algo's step counter. The backend's weights are restored separately — production backends call TrainableBackend::load_weights(&weights_id); the Phase-4 snapshot_resume.rs test rebuilds MockBackend directly from a captured weights_snapshot() (the byte-compare assertion would be meaningless if it went through load_weights, which is a no-op for the mock).

The full production save path (tar of the accelerate dir, blake3 over the tar bytes, content-addressed put on the object store) lives in SnapshotterImpl::save_train_state (plan 04-01) and runs alongside the algo-level snapshot_save call.

TRAIN-03 byte-compare proof

tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the LOAD-BEARING proof for TRAIN-03. It runs on every CI build with no GPU and no HF transformers — exercising the resume contract on the MockBackend path. The flow:

  1. Run A: fresh MockBackend::new_train(42) → 10 step_once iterations → capture weights_a.
  2. Run B (phase 1): fresh MockBackend::new_train(42) → 5 step_once iterations → capture weights_after_5algo_b1.snapshot_save() → drop algo + backend.
  3. Run B (phase 2): MockBackend::new_train_with_weights(42, weights_after_5) → push step counter to 5 via test helper set_step → algo snapshot_restore → 5 more step_once iterations → capture weights_b.
  4. Assert weights_a == weights_b (byte-equal).

The set_step(5) helper is a MockBackend-only affordance because the algo sees only Arc<dyn TrainableBackend> and load_weights is a no-op on the mock. Production backends restore the optimizer step counter via their own checkpoint format inside load_weights.

Running the example

The smallest possible SFT run lives at examples/sft-tiny.toml + examples/sft-tiny.jsonl (4 chat rows; Qwen2.5-0.5B-Instruct; max_steps = 2). Two ways to exercise it:

Dry-run (works without Python deps; validates config + dataset path + algorithm shape):

cargo run -p rollout-cli -- train sft \
  --config examples/sft-tiny.toml --dry-run

Live run (requires transformers + accelerate + torch; ~3-5 min M-series CPU):

pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke

make train-smoke invokes scripts/train-smoke.sh, which dry-runs first then runs the full SFT path against Qwen/Qwen2.5-0.5B-Instruct on CPU. See CLI for the full subcommand surface.

Next steps

  • Plan 04-05 (backend-vllm-train) swaps the synthetic batch + deterministic-SGD path for real tokenization through HF transformers
    • accelerate; the same PolicyAlgorithm surface drives it.
  • Plan 04-06 (CLI) mounts rollout train sft --config <toml> on top of SftAlgo::run.
  • Plan 04-07 polishes the docs and ships the v1 SFT smoke recipe.

Reward-model training (RM)

rollout-algo-rm implements the Bradley-Terry reward-model training algorithm (TRAIN-02). It mirrors the SFT algorithm's structure — PolicyAlgorithm impl driven by a TrainableBackend, JSONL data loader, and a TRAIN-03 byte-compare resume proof — but consumes pairwise preferences instead of single sequences.

Overview

A reward model learns to score responses on a scalar "quality" axis. Training data is a stream of preference pairs (prompt, chosen, rejected): the model should learn to rank chosen higher than rejected for the given prompt.

The Bradley-Terry objective formalizes this as a pairwise logistic regression on the reward gap:

L = -E[ ln σ(r_chosen - r_rejected) ]

where σ is the logistic function and r_* are the scalar reward outputs. Spec 02 §7 carries the contract.

RmSettings (TOML)

[algorithm.rm]
base_model = "Qwen/Qwen2.5-0.5B-Instruct"
head = "bradley_terry"          # Phase 4 supports BradleyTerry only
minibatch_size = 8

[algorithm.rm.optimizer]
kind = "sgd"
lr = 1.0e-5

[algorithm.rm.budget]
max_steps = 100

[algorithm.rm.dataset]
type = "jsonl_path"
path = "examples/data/pairs.jsonl"

Other RmSettings fields (base_model, optimizer, budget, dataset) mirror SftSettings. Head selection is bradley_terry only in Phase 4; pairwise_logistic is a Fatal(ConfigInvalid) with a Phase 9 sentinel until the RL pipeline lands.

Bradley-Terry loss math

Implemented in crates/rollout-algo-rm/src/loss.rs:

  • logsigmoid(x) = ln σ(x). Numerically stable via the softplus trick — logsigmoid(50) and logsigmoid(-50) both return finite values within 1e-4 of the true asymptote.
  • bradley_terry_loss(r_chosen, r_rejected) = -logsigmoid(r_chosen - r_rejected).
  • bradley_terry_batch_mean(pairs) — mean over a slice of (r_chosen, r_rejected) pairs. Returns 0.0 for empty batches; callers should validate non-empty upstream when needed.

Pinned golden values (tests/bradley_terry_loss.rs):

CaseInputsExpected
Zero diff(1.0, 1.0)ln 2 ≈ 0.6931
Strong preference(5.0, -5.0)near 0 (≪ 1e-3)
Inverted preference(-5.0, 5.0)≈ 10.0 (± 1e-3)
Mixed batch[(2,1), (1,2)]mean ≈ 0.8133
Empty batch&[]exactly 0.0
Numerical stabilitylogsigmoid(±50)finite; asymptotic value within 1e-4

JSONL data shape (D-DATA-01)

Phase 4 supports one row shape:

{"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
{"prompt": "Capital of France?", "chosen": "Paris", "rejected": "London"}

load_pairs(&path) parses line-by-line, skipping blank lines and rejecting malformed rows with Fatal(ConfigInvalid) prefixed <file>:<lineno>:. A row missing any of the three fields is malformed.

PolicyAlgorithm surface

MethodBehavior
id()AlgorithmId("rm")
Settingsrollout_core::config::training::RmSettings
from_settingsclones deps.backend into the algo; step = 0
required_rolesvec![WorkerRole::LearnerWorker]
validate_planrejects RmHeadKind::PairwiseLogistic (Phase 9); rejects minibatch_size == 0; lr <= 0
runloads pairs once; loops step_once up to budget.max_steps, honoring ctx.cancel
snapshot_savemeta = {step, weights_id}; one SnapshotPart { role: "weights" }
snapshot_restorerestores self.step from meta.step; backend weights restored separately

step_once synthesizes a 2-row TrainBatch (one row per side of a pair) and drives forward_with_lossoptimizer_step. In the Phase-4 MockBackend test path the loss is a constant; the real Bradley-Terry loss fires under plan 04-05's HF transformers integration.

TRAIN-03 second-witness — byte-compare resume

tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the Bradley-Terry twin of the SFT byte-compare proof. Structure:

  1. Run A. 10 step_once iterations with seed = 42; capture weights.
  2. Run B Phase 1. 5 steps; capture mid-run weights; snapshot_save().
  3. Run B Phase 2. Rebuild MockBackend::new_train_with_weights(42, …); push step counter to 5; restore algo step from snapshot meta; 5 more steps.
  4. Assert. weights_a == weights_b byte-for-byte.

This is the second-witness for TRAIN-03 (the SFT proof is the first witness); together they discharge the "deterministic resume" exit criterion across both Phase-4 algorithms.

Content-addressed final checkpoint

tests/checkpoint_roundtrip.rs proves that TrainableBackend::save_weights returns a ContentId that is stable when the backend is idle (two calls → identical hash) and different after a non-trivial optimizer_step. This matches the TRAIN-02 contract: the final checkpoint is content-addressed by the blake3 hash of the postcard-encoded weights.

Phase 4 head support

Only RmHeadKind::BradleyTerry is wired in Phase 4. PairwiseLogistic exists in the enum so the config schema can be cross-validated end-to-end, but selecting it returns a Fatal(ConfigInvalid) with the string Phase 9 in the message — Phase 9 lands the full RL pipeline including alternate preference heads.

What lands later

  • Plan 04-05 swaps MockBackend for the real HF transformers / accelerate training loop on Qwen/Qwen2.5-0.5B-Instruct (CPU), wiring the Python-side F.logsigmoid(r_chosen - r_rejected).neg().mean() and producing real reward models.
  • Plan 04-06 mounts rollout train rm --config <toml> on RmAlgo::run.
  • Phase 9 lands PairwiseLogistic and the RL-* algorithms (PPO/GRPO) that consume reward models trained here.

Running the example

The smallest possible RM run lives at examples/rm-tiny.toml + examples/rm-tiny.jsonl (4 preference pairs; Qwen2.5-0.5B-Instruct base; BradleyTerry head; max_steps = 2).

Dry-run (works without Python deps):

cargo run -p rollout-cli -- train rm \
  --config examples/rm-tiny.toml --dry-run

The live rollout train rm path through --features train is wired identically to SFT; the Phase-4 smoke recipe (make train-smoke) exercises SFT specifically — the RM pipeline is dry-run-validated here and lands under the Phase-9 RL recipe alongside PPO / GRPO. See CLI for the full subcommand surface.

See also

  • SnapshotsSnapshotterImpl / SnapshotKind::TrainState
  • SFT — sibling algorithm sharing the same trait surface
  • Spec 02 §7 — RM contract

Snapshots

rollout-snapshots ships SnapshotterImpl, the Phase-4 implementation of the rollout_core::Snapshotter trait. It owns the persistence path for SnapshotKind::TrainState — the only kind implemented in v1's training story. Three other kinds (Buffer, Process, EpisodicMemory) compile but return Fatal { PluginContract, msg: "Phase N: <kind>" } until their owning phases land (9 / 11 / 8 respectively).

See docs/specs/04-storage-snapshots.md for the authoritative contract and crates/rollout-snapshots/ for the implementation.

Architecture

PolicyAlgorithm                  SnapshotterImpl                  Storage + ObjectStore
─────────────────────            ───────────────────              ──────────────────────
algo.snapshot_save  ──▶  save_train_state(request, dir)
                                │
                                ▼
                         build_deterministic_tar(dir)             (no I/O on substrate yet)
                                │
                                ▼
                         ContentId::of(tar_bytes) = blake3
                                │
                ┌───────────────┴────────────────┐
                ▼                                ▼
   ObjectStore::put_bytes(tar)          Storage::begin().txn
        (returns same ContentId)        put_bytes(snapshot_key, json(Snapshot))
                │                                │
                └────────► tar blob              └────────► snapshot row (namespace=snapshots)

Save returns a Snapshot whose parts[0].content == ContentId::of(tar); restore inverts the pipeline (fetch → blake3-verify → extract).

Metadata layout

A Snapshot row lives at StorageKey { namespace = "snapshots", run_id = Some(run_id), path = [hex(snapshot_id)] } and is JSON-encoded. Spec-04 §5.1 lists the fields:

FieldTypePurpose
idSnapshotIdContentId of the tar bytes (Phase 4 = single part)
kindSnapshotKindTrainState in Phase 4; others reserved
run_idRunIdOwning run
created_atDateTime<Utc>RFC3339 wire form
labelOption<SmolStr>Optional human-readable label (CLI --label)
partsVec<SnapshotPart>One per blob; Phase 4 ships exactly one (role="tar")
algorithm_idAlgorithmIdProducing algorithm ("sft", "rm", ...)
metaserde_json::ValueAlgorithm-internal extras (D-DETERM-05; opaque to core)

JSON (not postcard) is the on-disk encoding because serde_json::Value is a self-describing format that postcard intentionally refuses to encode. The small row size and infrequent writes make the choice cheap, and the row is human-readable on disk for debugging.

Tar reproducibility contract (Pitfall 9)

build_deterministic_tar(&Path) -> Vec<u8> is the byte-stable tar builder. It must produce identical bytes for identical input directories across runs, machines, and tar versions — otherwise the blake3 hash drifts and resumes fail to find their predecessor blob.

The invariants enforced:

  • Sort entries by path before writing (file-system iteration order is non-deterministic).
  • HeaderMode::Deterministic zeroes mtime/uid/gid metadata at write time but does not zero mode bits. The mode bits are platform-dependent (macOS gives regular files 0o755 by default).
  • Explicit header.set_mode(0o644) for files and 0o755 for dirs.
  • Explicit set_mtime(0), set_uid(0), set_gid(0) on top of HeaderMode::Deterministic.
  • No compression — gzip / zstd byte streams drift across versions.
  • GNU header format (Header::new_gnu) — stable across tar releases.

crates/rollout-snapshots/tests/deterministic_tar.rs proves Pitfall 9 holds by parsing every entry header and asserting mode = 0o644 | 0o755.

Restore semantics

restore_train_state(&snapshot, dst_dir) fetches the tar blob via ObjectStore::get_bytes(parts[0].content), verifies blake3(bytes) == parts[0].content, and extracts to dst_dir. A mismatch returns Fatal { PluginContract, msg: "blake3 mismatch on restore: ..." }.

The bare Snapshotter::restore trait method takes a RestoreTarget enum:

VariantPhase 4 behavior
SameRunReturns Fatal { PluginContract } — the trait method has no dst_dir; callers use restore_train_state(snapshot, dst_dir) directly.
ForkReturns Fatal { PluginContract, msg: "Phase 9: Fork restore (new_run_id=...)" }
WorkerReturns Fatal { PluginContract, msg: "Phase 6: Worker restore (worker_id=...)" }

This is intentional: the Phase-4 surface optimizes for the SFT/RM training loop, which holds the destination directory locally and drives restore_train_state directly. Phase-9 PPO actor-swap adds the multi-worker restore plumbing.

List + prune surface

Snapshotter::list(SnapshotFilter) scans namespace="snapshots", optionally filters by run_id / kind / label_contains, sorts newest-first (created_at descending), and caps by limit. The scan is O(snapshots in namespace); the secondary SnapshotId → key index is deferred to Phase 9.

Snapshotter::prune(PrunePolicy { run_id, retention }) enforces RetentionPolicy:

  • keep_last: u32 — keep the N most recent snapshots regardless of label.
  • keep_labeled: bool — labeled snapshots are immune to pruning when true.
  • max_age: Option<Duration> — anything older than max_age is eligible.

Returns the count of metadata rows deleted. The underlying tar blobs are not deleted in Phase 4 — ObjectStore has no delete method yet (Phase-5 addition). Blobs are content-addressed and idempotent; orphaned ones cost storage but never corrupt a restore.

Algorithm-internal state — meta: serde_json::Value

Snapshot.meta is an opaque JSON blob owned by the producing algorithm (D-DETERM-05). The framework never inspects it. SFT might store { "step": 5, "curriculum_cursor": 12 }; RM might store { "epoch": 2, "best_loss": 0.42 }. This keeps algorithm-specific resume state out of the framework-owned snapshot row.

Determinism caveats

  • CPU mode: byte-identical on the same toolchain across machines.
  • CUDA same-SM: weights+optimizer bit-identical when the producing and restoring GPU expose the same compute capability. Cross-SM is best-effort.
  • Cross-machine: best-effort. The tar is reproducible; PyTorch / accelerate restore is not always bit-identical across GPU generations.

Plan 04-05 (backend-vllm-train) lands the determinism CI gate and CPU-mode fallback for environments without CUDA.

Pointers

  • crates/rollout-snapshots/src/lib.rsSnapshotterImpl + trait impl.
  • crates/rollout-snapshots/src/tar_build.rs — deterministic tar builder.
  • crates/rollout-snapshots/src/kind/train_state.rs — save + restore pipeline.
  • crates/rollout-snapshots/src/policy.rsSnapshotter::prune retention enforcement.
  • crates/rollout-snapshots/src/key.rsStorageKey helpers for namespace="snapshots".
  • crates/rollout-storage/src/embedded/tables.rsT_SNAPSHOTS table definition.
  • docs/specs/04-storage-snapshots.md — authoritative contract (especially §5 + §5a + §7).

Postgres backend

Phase 4 / TRAIN-04: a Storage impl backed by Postgres 16 alongside the default embedded redb store. Same trait, same semantics — choose at config time. The crate-level Cargo feature postgres on rollout-storage gates the dep set (sqlx 0.8 + uuid + ulid + async-stream); default builds remain sqlx-free.

Why Postgres alongside embedded

The embedded backend (redb) is local-process only. Multi-process or multi-host runs need a shared store that fan-outs change notifications across processes — that's what Postgres gives us via LISTEN/NOTIFY. The trait surface stays uniform: EmbeddedStorage and PostgresStorage both implement Storage::watch_stream, wrapping their respective notification mechanisms in a BoxStream<StorageEvent>.

MethodEmbeddedPostgres
begin / put / get / casredb txn over local fssqlx txn over network
watch (broadcast)tokio::sync::broadcast::Receiverunsupported — returns Fatal(PluginContract)
watch_streamwraps the broadcast in BroadcastStreamPgListener over LISTEN/NOTIFY

Cross-process subscribers MUST use watch_stream. The embedded backend implements watch_stream by wrapping its in-process broadcast — handy when some callers need a stream-shaped surface even on a local run.

Schema

Two migrations under database/migrations/ are embedded at build time via sqlx::migrate!():

  • 0001_init.sql — the kv table that backs all StorageKey rows.
  • 0002_snapshots.sqlsnapshots + events tables consumed by rollout-snapshots (plan 04-01) and the spec-09 observability ledger.

The runs / workers tables defer to Phase 6 (multi-node distribution).

kv table

CREATE TABLE kv (
    namespace   TEXT NOT NULL,
    run_id      UUID,                       -- ULID-as-UUID; NULL for global rows
    path        TEXT[] NOT NULL,
    value       BYTEA NOT NULL,
    version     BIGINT NOT NULL DEFAULT 0,
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (namespace, run_id, path)
);

run_id is stored as UUID (16 bytes) — the same byte layout as the underlying ULID. RunId(Ulid) round-trips through Uuid::from_bytes and back without loss.

LISTEN / NOTIFY contract

A row-level trigger on kv fires pg_notify(channel, payload) after every INSERT/UPDATE/DELETE:

  • channel: rollout_watch_<namespace> (max 63 chars; rollout_watch_ is 14, leaving 49 for the namespace).
  • payload: <run_id_uuid_or_empty>|<path_parts_joined_by_slash>, truncated to 7999 bytes per Pitfall 5 (Postgres caps pg_notify payloads at 8000).

PgListener consumers parse the payload, filter by prefix (matching the caller's prefix.run_id + prefix.path), and emit a StorageEvent::Put per notification. Put vs delete is not distinguished in the payload format shipped here; Phase 9 may extend the trigger to include a +/- prefix if downstream needs demand it.

PgListener reconnects transparently on connection drop; the watch_stream loop logs the failure and continues.

Trait surface

Storage::watch (broadcast) is intentionally not implemented by the Postgres backend — broadcast is an in-process abstraction. Callers that need a tokio::sync::broadcast::Receiver must run against EmbeddedStorage or implement their own fan-out on top of watch_stream. The Postgres impl returns:

Fatal(PluginContract { plugin: "PostgresStorage",
    msg: "PostgresStorage does not support in-process broadcast watch;
          use watch_stream for cross-process notification" })

Migrations

Migrations are forward-only, sequentially numbered (0001_*.sql, 0002_*.sql, ...). PostgresStorage::new runs them once via sqlx::migrate!("../../database/migrations").run(&pool). The macro embeds every SQL file into the binary at compile time, so the pool only needs network access — no on-disk migration directory in production.

Running new() twice on the same pool is a no-op: sqlx::migrate records applied versions in _sqlx_migrations.

Adding a migration

  1. Write database/migrations/NNNN_<name>.sql.
  2. Start a local Postgres: docker run --rm -e POSTGRES_PASSWORD=pw -p 5432:5432 postgres:16.
  3. (Optional, when we switch to compile-time query! macros) regenerate the .sqlx/ offline cache: DATABASE_URL=postgres://postgres:pw@localhost/postgres SQLX_OFFLINE=false cargo sqlx prepare --workspace -- --features postgres.
  4. Commit the migration AND the regenerated .sqlx/ files (currently the directory only carries a .gitkeep; this plan ships runtime-checked SQL via sqlx::query).

Offline mode

SQLX_OFFLINE=true lives in .cargo/config.toml (not .env) per Pitfall 4 — sqlx-cli reads .env at startup and refuses to talk to the DB during cargo sqlx prepare if SQLX_OFFLINE=true is set there. Putting it in .cargo/config.toml keeps cargo build --features postgres happy everywhere except when explicitly running sqlx prepare.

Phase 4 ships runtime-checked sqlx::query calls; the .sqlx/ directory is reserved (.gitkeep) for a future switch to compile-time query! macros once the surface stabilizes.

Pool sizing

PostgresStorage::new(url, pool_size) opens two pools:

Poolminmaxacquireidle
Write pool0pool_size30 s10 min
Watch pool0430 s(default)

pool_size = 16 is a reasonable default for a Phase-6 worker; tests use 4. The watch pool is small and dedicated because each PgListener holds a connection for the lifetime of its watch_stream; you don't want them crowding out write capacity.

Testcontainers CI

crates/rollout-storage/tests/postgres_integration.rs runs against a disposable Postgres 16 container via testcontainers-modules. Six tests cover CRUD, CAS atomicity, watch_stream delivery, migration idempotency, pool reuse under load, and prefix scan.

Each test carries #[ignore = "requires Docker / testcontainers"] so the default cargo test --workspace --tests flow (macOS dev loop, no Docker guaranteed) stays green. CI opts in via:

cargo test -p rollout-storage --features postgres \
  --test postgres_integration -- --include-ignored --test-threads=1

--test-threads=1 because the test starts a fresh container per test function; running six simultaneously is wasteful (and slower than serial under most CI runners' Docker capacity).

The retry loop on PostgresStorage::new (30 attempts × 2 s) handles Pitfall 6: the container reports "running" before Postgres accepts connections.

Local dev

make postgres-test

The target checks docker info first and fails fast with a helpful message if Docker isn't running.

Limitations

  • Put-only events from pg_notify. The trigger emits a single notification format for both INSERT/UPDATE/DELETE; consumers see StorageEvent::Put only. Phase 9 may extend the trigger payload to carry a +/- operation byte if downstream callers need to distinguish deletes.
  • get_many_bytes is sequential. A future optimization could batch via ANY($1) array binding; not on the Phase-4 critical path.
  • No streaming scan_bytes. Phase-2 simplification carries forward; prefix scans return owned Vec rows. Hot prefixes (millions of rows) will need a streaming variant in Phase 6.
  • No query! macro yet. Runtime-checked SQL keeps the build hermetic without a .sqlx/ cache; revisit once the schema stabilizes.

Determinism

Phase 4 of rollout commits to deterministic training-state snapshots (D-DETERM-01): a Snapshot.parts[0].content is the same ContentId whenever the same code drives the same model with the same RNG seed and the same input stream. This page documents the contract, the platform caveats, and every mitigation that lands in rollout-backend-vllm --features train.

The determinism stack

LayerMitigationWhere
Process envCUBLAS_WORKSPACE_CONFIG=:4096:8, PYTHONHASHSEED=0train.py preamble
RNG seedsrandom + numpy + torch + torch.cuda.manual_seed_all_set_determinism_flags
PyTorch deterministictorch.use_deterministic_algorithms(True)_set_determinism_flags
cuDNNcudnn.deterministic = True, cudnn.benchmark = False_set_determinism_flags
Matmul precisiontorch.set_float32_matmul_precision("highest")_set_determinism_flags
Chat template{% generation %} markers via GENERATION_MARKED_QWEN25_TEMPLATEqwen25_chat_template.py
Dataloadertorchdata.StatefulDataLoader if available; step-replay fallbackinit_train
Acceleratoraccelerator.prepare(model, optimizer, scheduler)init_train
LR schedulerregister_for_checkpointing (Pitfall 10 fallback)init_train
Tarbyte-stable mode bits + sorted walkdir + mtime=0 (Pitfall 9)rollout-snapshots::build_deterministic_tar

Why preamble ordering matters (Pitfall 2)

CUBLAS_WORKSPACE_CONFIG and PYTHONHASHSEED are read by torch on first import. If import torch runs before they are set, the values are baked in and later os.environ mutations do not change behavior. The Rust side enforces ordering by writing the env vars on the worker thread before calling py.import("rollout.backends.vllm.train") — see rollout-backend-vllm/src/train.rs::import_train_module.

train.py repeats os.environ.setdefault(...) at the top of the module defensively: if someone imports it directly from a Python REPL with torch already loaded, the user gets non-determinism, but the source still documents the contract.

cuDNN.benchmark is the silent killer (Pitfall 8)

cuDNN auto-tunes kernel selection by running candidate kernels and picking the fastest. This is non-deterministic — re-running the same workload picks a different kernel under load. cudnn.benchmark = False is required, even when cudnn.deterministic = True. The Phase-4 implementation sets both explicitly in _set_determinism_flags.

LR scheduler must be in the save-state capture path (Pitfall 10)

accelerator.save_state captures the model, optimizer, RNG, and any object explicitly registered for checkpointing. The Phase-4 path prefers passing the scheduler through accelerator.prepare(model, optimizer, scheduler); if that fails (e.g., the scheduler shape isn't supported by accelerate's prepare), the code falls back to accelerator.register_for_checkpointing(scheduler). Without one of these, the LR resumes from lr_start after a snapshot restore instead of continuing the schedule — silent non-determinism.

CPU vs CUDA contract

HardwareBit-identical?
Same CPU model, same OS, same envYes (the load-bearing CI target)
Same SM (sm_80 vs sm_80), same cuDNNYes, with deterministic + !benchmark
Different SM (sm_80 vs sm_90)No — kernels differ
Different cuDNN minor versionNo — algorithm shortlist may differ
Cross-machine (CPU A vs CPU B)Best-effort; FP rounding differs

The snapshot_resume_live.rs live witness exercises the same-CPU case on the dev box. CI runs the MockBackend variant from plan 04-02 (snapshot_resume.rs::bit_identical_resume_at_step_5) for the unconditional green signal.

accelerate.save_state captures

ObjectSourceRestored on load_state?
Model weightsaccelerator.prepare(model)Yes
Optimizeraccelerator.prepare(optimizer)Yes
RNG (torch)implicit via torch.random.get_rng_stateYes
RNG (cuda)implicit if CUDA is initializedYes
LR schedulerprepare(scheduler) OR register_for_checkpointingYes
DataloaderDataLoaderConfiguration(use_stateful_dataloader=True)If torchdata installed

The Phase-4 contract: anything that affects subsequent step output must be in this list. If a future plan adds custom state (e.g., reward-model normalizer running stats), it MUST register via register_for_checkpointing to keep the TRAIN-03 byte-compare proof intact.

torchdata stateful dataloader (Pitfall 3)

init_train probes for torchdata. If present, it constructs a DataLoaderConfiguration(use_stateful_dataloader=True) so the dataloader's position is checkpointed by save_state. Without torchdata, the runtime falls back to step-replay: the resume side re-reads from the JSONL head and skips step rows. Both modes preserve the TRAIN-03 byte-compare invariant on deterministic input streams.

Pitfall 7 — Accelerator singleton

Accelerator() is a singleton per Python process; constructing two of them in the same interpreter raises. init_train is idempotent — the second call returns the cached _STATE dict. teardown_train flushes the singleton via del + gc.collect() + torch.cuda.empty_cache() so the next init_train call (in a subsequent run, or after a mid-process swap-back to vLLM inference) can construct a fresh accelerator.

A bidirectional mid-process swap (training ↔ inference under the same OS thread) is Phase 9 — see the deferral note in rollout-backend-vllm/src/train.rs::run_set_train_mode.

MockBackend vs live HF path

BackendCPU bit-identical?When to use
MockBackendYes (Array1 SGD)CI; every PR; algo-side tests
Live HFYes on identical CPU; same-SM only on CUDADev-box live witness; nightly

The MockBackend path is the load-bearing CI proof for TRAIN-03. The live HF path is the gated witness behind ROLLOUT_TRANSFORMERS_AVAILABLE=1.

Where the code lives

  • python/rollout/backends/vllm/train.py — determinism preamble + Accelerator construction.
  • python/rollout/backends/vllm/qwen25_chat_template.py — generation-marked chat template.
  • crates/rollout-backend-vllm/src/train.rs — env-write-before-import enforcer; py.detach wrappers.
  • crates/rollout-snapshots/src/tar_build.rs — Pitfall 9 deterministic tar.
  • Snapshots — Phase-4 snapshot pipeline.
  • SFT — SFT algorithm + TRAIN-03 byte-compare proof.
  • CPU mode — running Phase-4 training on CPU only.

rollout train + rollout snapshot CLI

Phase-4 user-facing entry points for supervised fine-tuning, reward-model training, and snapshot management. Mirrors the Phase-3 rollout infer batch clap derive shape, run_id lifecycle, and backend-selection precedence.

Subcommand overview

SubcommandPurpose
rollout train sftSupervised fine-tuning. Validates + runs an SftAlgo budget.
rollout train rmReward-model (Bradley-Terry) training via RmAlgo.
rollout snapshot listList snapshots (optionally filtered by run / kind).
rollout snapshot showPrint one snapshot's metadata by content-id.
rollout snapshot pruneDelete snapshots per a retention policy.

rollout train sft

rollout train sft \
    --config examples/sft-tiny.toml \
    [--resume <snapshot_id>] \
    [--dry-run]
FlagDefaultPurpose
--configrequiredPath to the run TOML (schema below).
--resumenoneSnapshot content-id to restore from before the algorithm runs.
--dry-runfalseValidate config + dataset path; never construct backend. Works with no backend feature.

SFT TOML schema

schema_version = 1

[storage]
backend = "embedded"
[storage.embedded]
path = "./rollout.db"

[algorithm]
kind = "sft"

[algorithm.sft]
minibatch_size       = 1
gradient_accumulation = 1

[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[algorithm.sft.optimizer]
kind = "sgd"
lr   = 0.01

[algorithm.sft.budget]
max_steps = 10

[algorithm.sft.dataset]
kind = "jsonl_path"
path = "./data/sft.jsonl"

[algorithm.sft.packing]
kind        = "off"
max_seq_len = 64

[algorithm.sft.loss_on]
kind = "full"

The schema is owned by rollout-core::config::RunConfig; the CLI is just a TOML-loader + clap surface (spec 11 single-source-of-truth). Unknown fields fail load with a deterministic locator.

rollout train rm

rollout train rm \
    --config examples/rm-tiny.toml \
    [--resume <snapshot_id>] \
    [--dry-run]

Same flags as sft. The TOML differs only in the algorithm block:

schema_version = 1
[storage]
backend = "embedded"
[storage.embedded]
path = "./rollout.db"

[algorithm]
kind = "rm"

[algorithm.rm]
minibatch_size = 1

[algorithm.rm.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[algorithm.rm.optimizer]
kind = "sgd"
lr   = 0.01

[algorithm.rm.budget]
max_steps = 10

[algorithm.rm.dataset]
kind = "jsonl_path"
path = "./data/pairs.jsonl"

[algorithm.rm.head]
kind = "bradley_terry"

JSONL input contract: each line { "prompt", "chosen", "rejected" }.

--dry-run semantics

In order, with the backend NEVER constructed:

  1. Read + parse TOML against RunConfig (deny_unknown_fields).
  2. Match [algorithm].kind against the subcommand (sft vs rm).
  3. Validate minibatch_size >= 1 and optimizer.lr > 0.
  4. Confirm dataset.path exists on disk (for jsonl_path form).
  5. Print dry-run OK: algorithm=<sft|rm> model=… minibatch=… dataset=… and exit 0.

Because --dry-run short-circuits before select_backend, it runs cleanly on builds with NEITHER --features train NOR --features test-mock-backend.

--resume <snapshot_id> lifecycle

If --resume <hex> is set, the CLI scans the local rollout.db for a Snapshot whose id matches the supplied content-id (32-byte blake3, 64-char hex). The matching row is fed into algorithm.snapshot_restore(snap) before algorithm.run(&ctx) begins. Determinism then follows the per-algorithm contract: SftAlgo proves byte-identical resume via tests/snapshot_resume.rs::bit_identical_resume_at_step_5; RmAlgo proves parity via tests/snapshot_resume.rs::rm_bit_identical_resume.

Missing snapshot → Fatal(ConfigInvalid) with the failed hex echoed back.

Backend selection

rollout-cli builds with at most one training backend at a time, by Cargo feature. Selection at runtime is precedence-ordered:

OrderBuild flagsEnvBackend
1--features test-mock-backendROLLOUT_TEST_MOCK_BACKEND=1rollout_runtime_batch::MockBackend (deterministic, no GPU/Python).
2--features vllm,train(any)rollout_backend_vllm::VllmBackend in train mode (live HF / accelerate).
3none(any)Fatal(ConfigInvalid) with a build-mode hint; --dry-run still works.

Build recipes:

# CI / tests (no Python, no GPU)
cargo build -p rollout-cli --features test-mock-backend

# Production (live HuggingFace + accelerate over Python)
cargo build -p rollout-cli --features vllm,train

# Both — vllm takes precedence at runtime unless ROLLOUT_TEST_MOCK_BACKEND=1.
cargo build -p rollout-cli --features test-mock-backend,vllm,train

The train feature implies vllm per Cargo.toml; the two are kept distinct so the existing Phase-3 infer batch users get vllm without pulling the training Python deps.

rollout snapshot list

rollout snapshot list \
    [--storage-path ./rollout.db] \
    [--object-path ./object-store] \
    [--run-id <ULID>] \
    [--kind train_state|buffer|process|episodic_memory] \
    [--limit N]

Output: pretty-printed JSON array of Snapshot rows from rollout-core. Sort order is newest-first by created_at.

FlagDefaultNotes
--storage-path./rollout.dbOpens an EmbeddedStorage read-write.
--object-path./object-storeOpens a FsObjectStore (read-only for list, but consistently surfaced).
--run-idnoneCrockford ULID; restrict to one run. Without it, every run's snapshots are scanned.
--kindnonesnake_case match against SnapshotKind variants.
--limitnoneCap result length.

rollout snapshot show <snapshot_id>

rollout snapshot show \
    [--storage-path ./rollout.db] \
    [--object-path ./object-store] \
    <SNAPSHOT_ID>

SNAPSHOT_ID is the 64-char hex blake3 digest from Snapshot.id. Prints the full Snapshot row as pretty-printed JSON. Missing id → exit 2 with snapshot not found: <id>.

rollout snapshot prune

rollout snapshot prune \
    --run-id <ULID> \
    [--storage-path ./rollout.db] \
    [--object-path ./object-store] \
    [--keep-last N=3] \
    [--keep-labeled]

Applies a RetentionPolicy scoped to a single run (--run-id is required to avoid cross-run accidents). --keep-last N retains the N newest snapshots regardless of label; --keep-labeled (default true) further retains every labeled snapshot. Metadata rows are deleted; blob bytes stay in the object store (the Phase-2 ObjectStore trait has no delete — pending Phase-5). Prints pruned <N> snapshots.

Storage path conventions

PathOwnerPurpose
./rollout.dbEmbeddedStorageredb on-disk DB (always-fsync).
./object-store/FsObjectStoreContent-addressed two-level sharded FS.
<config_dir>/rollout.dbtraining runs`train sft
<config_dir>/object-store/training runsSame sibling-of-config convention.

snapshot list|show|prune accept explicit --storage-path / --object-path so out-of-band tooling and tests can target any directory pair.

Exit codes

CodeMeaning
0Success (or successful --dry-run).
2Config-invalid, missing dataset / snapshot, substrate error, or algorithm error.

Observability

RUST_LOG=info rollout train sft … produces structured tracing events. Each SFT / RM run emits at minimum train_start, train_step, train_end; the exact fields are pinned by the per-algorithm chapters.

CPU mode

The Phase-4 training surface runs on CPU end-to-end. This is the integration test path on dev boxes (including Apple Silicon) and the smoke recipe target in plan 04-07.

When to use

  • Local dev loop on a laptop without CUDA.
  • CI smoke that exercises the full HF transformers + accelerate path against a tiny model (Qwen/Qwen2.5-0.5B-Instruct).
  • Reproducing CUDA bugs that turn out to be deterministic-flag misconfiguration.

Expected throughput

ModelHardwareSteps/sec
Qwen/Qwen2.5-0.5B-InstructApple M2 Max (CPU)~0.1–0.3
Qwen/Qwen2.5-0.5B-InstructLinux x86_64 16-core~0.3–1.0

Roughly one to ten seconds per step for the 0.5B model. Anything larger is impractical on CPU; the per-token cost grows superlinearly. CPU mode exists to prove the pipeline, not to train.

Required env

None beyond default. The Phase-4 determinism preamble (CUBLAS_WORKSPACE_CONFIG, PYTHONHASHSEED) is written by the Rust side before import torch; CPU runs ignore CUBLAS settings without complaint.

The live tests gate on ROLLOUT_TRANSFORMERS_AVAILABLE=1:

pip install transformers>=4.45 accelerate>=0.34 torch>=2.4
ROLLOUT_TRANSFORMERS_AVAILABLE=1 \
  cargo test -p rollout-backend-vllm --features train \
  --test snapshot_resume_live -- --ignored --nocapture

Performance caveats

  • No streaming. Phase 4 rejects sampling.stream = true at the boundary (D-BACKEND-03); training has no streaming surface.
  • No multi-GPU. CPU mode is single-process. The FSDP plugin in init_train only activates when torch.cuda.device_count() >= 2.
  • Slow. The 0.5B model at one step per ~5 seconds on M-series silicon means a 10-step smoke takes a minute. Plan accordingly.
  • Determinism still holds. Two CPU runs with the same seed produce byte-identical accelerate.save_state output. The MockBackend variant in rollout-algo-sft::tests::snapshot_resume::bit_identical_resume_at_step_5 proves the Phase-4 contract holds on CPU without HF transformers installed.

Smoke recipe (plan 04-07)

make train-smoke (lands in plan 04-07) runs the live witness on dev boxes where ROLLOUT_TRANSFORMERS_AVAILABLE=1 is set. CI does not install transformers/accelerate; the MockBackend test is the unconditional gate.

  • Determinism — the determinism contract Phase-4 commits to.
  • SFT — algorithm-side overview.
  • Snapshots — snapshot pipeline.

Cloud

The cloud layer lifts rollout's local substrate to real object stores and queues (AWS S3 + SQS, GCP GCS + Pub/Sub) behind the same rollout-core traits. Algorithm crates never see a cloud SDK — a hard dependency-direction lint plus two CI gates (public-api-cloud-leak, forbidden-patterns) keep SDK types out of rollout-core's public API and keep raw metadata URLs / shell=True / libc::fork out of the tree.

Streaming + lease trait methods

Phase 5 extends two rollout-core cloud traits with four new methods. All four ship with backward-compatible default implementations so every v1.0 caller and impl compiles unchanged — only the streaming methods carry a #[deprecated] tag whose sole purpose is to nudge cloud backends into overriding them.

ObjectStore::put_stream / get_stream

#![allow(unused)]
fn main() {
#[deprecated(note = "Cloud impls MUST override; default buffers entire stream into RAM")]
async fn put_stream(
    &self,
    stream: Pin<Box<dyn AsyncRead + Send>>,
    hint: PutHint,
) -> Result<ContentId, CoreError>;

#[deprecated(note = "Cloud impls MUST override; default buffers entire blob into RAM")]
async fn get_stream(&self, id: &ContentId) -> Result<Pin<Box<dyn AsyncRead + Send>>, CoreError>;
}

The default put_stream reads the whole stream into a Vec<u8> and calls put_bytes; the default get_stream fetches via get_bytes and hands back a Cursor. That is fine for small blobs but OOMs on multi-GiB snapshots (Pitfall 16 / D-SNAP-04), which is why the #[deprecated] warning fires at any call site that did not override the method. Cloud backends (S3 multipart, GCS resumable) and the local FS store override both with true streaming, so the warning never fires from a correct impl.

Queue::dequeue_with_lease / extend_lease

#![allow(unused)]
fn main() {
async fn dequeue_with_lease(
    &self,
    lease: Duration,
) -> Result<Option<(QueueItemId, Vec<u8>, LeaseToken)>, CoreError>;

async fn extend_lease(
    &self,
    id: QueueItemId,
    token: LeaseToken,
    extend_by: Duration,
) -> Result<(), CoreError>;
}

LeaseToken(Vec<u8>) is an opaque per-backend handle: SQS ReceiptHandle bytes, Pub/Sub ack_id bytes, or — for the in-memory queue — the QueueItemId bytes. The default dequeue_with_lease ignores the lease duration and synthesizes a token from the item id; the default extend_lease returns Recoverable::Transient { hint: Never } so a backend that forgot to override it fails loudly rather than silently dropping the extension. These two are not #[deprecated] — the default fallback is a correct (if conservative) behavior for queues without visibility timeouts.

AWS

rollout-cloud-aws implements the four cloud traits against AWS: S3 (ObjectStore), SQS (Queue), Secrets Manager (SecretStore), and EC2 instance metadata (ComputeHint, IMDSv2-only). It is built behind a Cargo feature so non-AWS builds pull no AWS SDK crates.

Build

cargo build -p rollout-cli --features aws

The aws feature is default-off. A binary built without it that loads an [cloud] provider = "aws" config returns a fatal error telling you to rebuild with --features aws.

Config

[cloud]
provider = "aws"
region   = "us-west-2"

[cloud.aws.s3]
bucket                = "my-rollout-bucket"
prefix                = "runs/"          # optional
multipart_chunk_bytes = 16777216         # 16 MiB; S3 minimum is 5 MiB
max_snapshot_part_bytes = 5368709120     # 5 GiB; 10 GiB hard cap

[cloud.aws.sqs]
queue_url              = "https://sqs.us-west-2.amazonaws.com/123456789012/rollout"
visibility_timeout_secs = 300

[cloud.aws.secrets]
allowlist = ["rollout/hf-token"]

Cross-cloud is structurally impossible: the provider tag is an enum, so a single config cannot name both AWS and GCP.

IAM permissions

S3 object + multipart operations, SQS send/receive/delete/visibility, and Secrets Manager GetSecretValue are required. The full IAM matrix and the mandatory AbortIncompleteMultipartUpload lifecycle rule live in crates/rollout-cloud-aws/docs/bucket-setup.md.

Testing matrix

TierWhat runsWhen
uniterror mapping + key layout + IMDSv2 mock handshakeevery cargo test (no Docker)
emulator (cloud-emulator-aws)full S3 + SQS + Secrets Manager conformance against localstackevery PR — always-on
live (cloud-live-aws)same suite against real AWS via OIDCnightly (cron)

Locally, bring up the emulator and run the ignored conformance suite:

docker compose -f docker-compose.test.yml up -d
LOCALSTACK_ENDPOINT=http://localhost:4566 \
AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test AWS_REGION=us-east-1 \
cargo test -p rollout-cloud-aws --features aws --tests -- --include-ignored --test-threads=1

Common errors

  • Throttled / SlowDown / 503 — surfaces as Recoverable::Throttled with a backoff RetryHint; the SDK retries internally and the snapshotter retries on top. Streaming ContentId is stable across retries because chunks are hashed before each UploadPart.
  • IMDSv1 disabled (HttpTokens=required) — handled transparently: the SDK performs the IMDSv2 PUT /latest/api/token handshake. No raw metadata IP is ever used.
  • Secret not in allowlistFatal::ConfigInvalid; add the name to [cloud.aws.secrets].allowlist. SecretStore::put is read-only in v1.1.
  • Orphan multipart uploads — best-effort aborted on drop; the bucket lifecycle rule reclaims any that leak on SIGKILL.

GCP

rollout-cloud-gcp implements the four cloud traits against GCP: Cloud Storage (ObjectStore), Pub/Sub (Queue), Secret Manager (SecretStore, read-only), and the GCE metadata server (ComputeHint). It is built behind a Cargo feature so non-GCP builds pull no GCP SDK crates.

Build

cargo build -p rollout-cli --features gcp

The gcp feature is default-off. A binary built without it that loads a [cloud] provider = "gcp" config returns a fatal error telling you to rebuild with --features gcp. The aws and gcp features compose — --features aws,gcp builds a binary that dispatches on the TOML [cloud].provider value.

Config

[cloud]
provider = "gcp"
project  = "my-gcp-project"

[cloud.gcp.gcs]
bucket                  = "my-rollout-bucket"
prefix                  = "runs/"          # optional
resumable_chunk_bytes   = 16777216         # 16 MiB; 5 MiB minimum
max_snapshot_part_bytes = 5368709120       # 5 GiB; 10 GiB hard cap

[cloud.gcp.pubsub]
topic            = "rollout-work"
subscription     = "rollout-workers"
ack_deadline_secs = 30

[cloud.gcp.secrets]
allowlist = ["hf-token"]

Cross-cloud is structurally impossible: the provider tag is an enum, so a single config cannot name both AWS and GCP.

IAM / Workload Identity Federation

GCS object admin, Pub/Sub publisher + subscriber, and Secret Manager secretAccessor are required; GCE compute.viewer is needed only when running on GCE/GKE. The full role matrix and the (informational) lifecycle policy live in crates/rollout-cloud-gcp/docs/bucket-setup.md. The cloud-live-gcp CI job authenticates via Workload Identity Federation (WIF) — no long-lived service-account keys.

Testing matrix

TierWhat runsWhen
uniterror mapping + key layout + Secret Manager allowlist + MDS mockevery cargo test (no Docker)
emulator (cloud-emulator-gcp)GCS + Pub/Sub conformance against fake-gcs-server + pubsub-emulator + in-test Secret Manager mockevery PR — always-on
live (cloud-live-gcp)same suite against real GCP via WIFnightly (cron)

Locally, bring up the emulators and run the ignored conformance suite:

docker compose -f docker-compose.test.yml up -d
STORAGE_EMULATOR_HOST=http://localhost:4443 \
PUBSUB_EMULATOR_HOST=localhost:8085 PUBSUB_PROJECT_ID=rollout-test \
cargo test -p rollout-cloud-gcp --features gcp --tests -- --include-ignored --test-threads=1

Emulator delta

Some production semantics are not reproducible on the emulators (resumable upload status, time-based Pub/Sub redelivery, real fault injection). Those tests are #[ignore]d locally and run in cloud-live-gcp. There is no first-party Secret Manager emulator; the Secret Manager conformance tests use an in-test hyper mock and run Docker-free on every build. See the crate README for the full delta table.

Common errors

  • Throttled / ResourceExhausted / 429 / 503 — surfaces as Recoverable::Throttled with a backoff RetryHint. The streaming ContentId is stable across retries because chunks are hashed before each resumable upload chunk.
  • Spot / preemptible reclaimComputeHint::preemption_signal reads instance/preempted from the GCE metadata server and reports a ~30s lead.
  • Secret not in allowlistFatal::ConfigInvalid; add the name to [cloud.gcp.secrets].allowlist. SecretStore::put is read-only in v1.1.
  • Orphan resumable sessions — never persisted across processes (Pitfall #5); GCS auto-expires incomplete sessions after 7 days, and a preempted worker re-uploads from byte 0 idempotently.

Cloud-backed snapshots

Training-state snapshots stream to whichever object store your [cloud] block selects. rollout-snapshots takes an injected Arc<dyn ObjectStore>, so the same SnapshotterImpl works unchanged over the local filesystem, S3, or GCS — only the injected store differs (CLOUD-03).

Configuration

CloudConfig is a #[serde(tag = "provider")] enum, so the provider's fields live directly under [cloud]. A single TOML cannot name two providers — cross-cloud single-run is structurally impossible (D-XPROV-02).

See examples/sft-tiny-aws.toml and examples/sft-tiny-gcp.toml for the minimal [cloud] flip from examples/sft-tiny.toml:

# AWS
[cloud]
provider = "aws"
region = "us-west-2"

[cloud.s3]
bucket = "rollout-snapshots-prod"
prefix = "sft-tiny/"
# GCP
[cloud]
provider = "gcp"
project = "rollout-prod-123"

[cloud.gcs]
bucket = "rollout-snapshots-prod"
prefix = "sft-tiny/"

Streaming semantics

A snapshot is a deterministic tar of the accelerate-style state directory (weights + optimizer + RNG + step), content-addressed by blake3. The upload path:

  1. Builds the tar deterministically (stable file order + zeroed mtime) so the same state always produces the same bytes.
  2. Hashes each chunk with blake3::Hasher before the SDK call, so the resulting ContentId is stable across SDK retries (S3 multipart / GCS resumable — Pitfall #16).
  3. Uploads to a temp/pending-<ulid> key (S3 multipart upload / GCS resumable session).
  4. Server-side copies the temp object to the sharded content-addressed key <prefix>cas/<ab>/<cd>/<hex> (identical layout on FS, S3, and GCS), then deletes the temp.
  5. On failure the temp upload is aborted (S3 MultipartGuard Drop) or expires via the bucket's 7-day lifecycle rule (GCS); no orphaned partial blob is ever read.

Restore fetches the blob by ContentId, re-verifies blake3, and extracts the tar — a mismatch is a hard Fatal error, never a silent partial restore.

Byte-identical resume

The CLOUD-03 acceptance criterion — byte-identical SFT resume holds over the cloud streaming path — is witnessed by two always-on tests:

  • bit_identical_resume_at_step_5_via_s3 (localstack-backed S3ObjectStore),
  • bit_identical_resume_at_step_5_via_gcs (fake-gcs-server-backed GcsObjectStore).

Each snapshots a MockBackend SFT run at step 5, restores off the cloud round-trip, runs five more steps, and asserts the final weights are byte-equal to a ten-step uninterrupted run. They run on every CI PR via the cloud-emulator-aws / cloud-emulator-gcp jobs — no GPU, no live cloud creds.

Cross-provider portability

Snapshots are content-addressed by blake3, so the same bytes produce the same ContentId on any provider. To migrate a snapshot from S3 to GCS, an operator copies the blob (rollout does not automate cross-provider transfer in v1.1):

# Operator-managed transfer between buckets:
aws s3 cp s3://aws-bucket/cas/ab/cd/<hex> /tmp/blob
gsutil cp /tmp/blob gs://gcs-bucket/cas/ab/cd/<hex>

The restore code path on either provider takes a SnapshotId and reads by ContentId; the provider is whichever ObjectStore is injected per [cloud].provider. The runnable witness is crates/rollout-snapshots/tests/snapshot_resume_s3_to_gcs_via_manual_copy.rs: it saves via S3, copies each blob into a GCS bucket asserting the ContentId is identical across providers, then restores + resumes on GCS byte-for-byte (D-XPROV-01).

Active-active cross-cloud single run is out of scope in v1.1 (PROJECT.md); the tagged-enum CloudConfig makes a config naming both [cloud.s3] and [cloud.gcs] structurally un-representable.

rollout cloud doctor

Operator pre-flight tool that exercises all four cloud traits (object store, queue, secret store, compute hint) against either AWS or GCP before a real training job runs. Addresses CLOUD-04 (D-DOCTOR-01..04).

Usage

# Build with the provider feature(s) you need.
cargo run -p rollout-cli --features aws -- cloud doctor --provider aws --config examples/sft-tiny-aws.toml
cargo run -p rollout-cli --features gcp -- cloud doctor --provider gcp --config examples/sft-tiny-gcp.toml --format json

Config source is the TOML [cloud] block only (D-DOCTOR-04) — there are no --bucket/--queue/--secret-id flag overrides in v1.1. The --provider flag MUST match the [cloud].provider in the TOML, or doctor exits 2.

Checks (in order)

  1. reachability — TCP + TLS handshake to the service endpoint (s3.<region>.amazonaws.com / storage.googleapis.com). Surfaces DNS / firewall issues distinctly from auth failures.
  2. auth — credential-chain probe (a cheap metadata read that requires the resolved credentials). Catches broken AWS credential chains / GCP ADC.
  3. object_store — small payload PUT + GET roundtrip on the configured bucket.
  4. queueenqueuedequeue_with_lease(30s)ack on the configured queue.
  5. secret_store — read the FIRST allowlisted secret ([cloud.*.secrets].allowlist). An empty allowlist is reported as a failure with remediation guidance.
  6. compute_hintinventory() + preemption_signal() probe (returns Ok(None) off a cloud instance).
  7. content_id_roundtrip — a 64 MiB random buffer through put_stream + get_stream + blake3 verify. Forces the multipart / resumable path; catches blake3-streaming bugs (Pitfall 16 / D-SNAP-04).

Wall-time target: ~5-10s on a healthy environment.

Exit codes (D-DOCTOR-03)

  • 0 — all checks pass.
  • 1 — at least one check failed (use --format json to see which).
  • 2 — invocation / config error (provider mismatch, missing TOML, malformed schema).

Plays well with shell &&:

rollout cloud doctor --provider aws --config production.toml && \
  rollout train sft --config production.toml

Output formats (D-DOCTOR-02)

  • --format human (default): colored steps with / icons + per-check latency + a N pass, M fail — total <ms> summary line.

  • --format json: machine-readable; matches the schema in crates/rollout-cli/src/commands/cloud/doctor/output/json.rs:

    {
      "checks": [
        { "name": "reachability", "status": "pass", "latency_ms": 142 },
        { "name": "queue", "status": "fail", "latency_ms": 31, "error": "enqueue: ..." }
      ],
      "summary": { "pass_count": 6, "fail_count": 1, "total_latency_ms": 5443 }
    }
    

    error is omitted on passing checks.

Limitations (v1.1)

  • Config-file-only (D-DOCTOR-04); no --bucket/--queue/--secret-id overrides.
  • One comprehensive mode (D-DOCTOR-01); no --quick/--deep tiers.
  • Cross-cloud (both [cloud.aws] and [cloud.gcp] in one TOML) is structurally impossible — CloudConfig is a #[serde(tag = "provider")] enum (D-XPROV-02).

CI coverage

The doctor_smoke integration tests run on every PR:

  • cloud-emulator-aws runs doctor_smoke_aws_* against localstack with pre-created bucket / queue / secret (exit 0 all-pass, exit 1 unreachable, human + JSON shape).
  • cloud-emulator-gcp runs doctor_smoke_gcp_* against fake-gcs-server + pubsub-emulator with pre-created bucket / topic / subscription.
  • The config-layer tests (provider mismatch → exit 2, malformed config → exit 2) and the --help golden run Docker-free on every PR.

Distribution

How rollout runs a single run across many machines: the single-coordinator lease, work-stealing, stateless-replayer restart, spot-drain, and split-brain fencing.

Multi-node distribution

rollout runs a single training/inference run across many machines from one coordinator. This chapter is the operator's view of how that works: the lease that keeps exactly one coordinator alive, the work-stealing that keeps workers busy, how a coordinator restart is invisible to progress, and how a spot preemption drains gracefully. The implementation lives in rollout-coordinator; the contract is spec 05 — Distribution.

Coordinator + worker model

A run has exactly one coordinator at any time and N workers. The coordinator owns the work ledger and brokers all coordination; workers never talk peer-to-peer (one mutual-TLS edge per worker instead of N²). Workers long-poll the coordinator for work, run it on their backend (vLLM in production, a mock backend in the smoke), and submit content-addressed results. Submission is idempotent on the content-addressed WorkItem.id, so a retried item never produces duplicate output.

Single-row CAS lease + monotonic epoch

"Exactly one coordinator per run" is enforced by a single-row compare-and-swap lease (StorageLease) over the coordinator_lease storage row. One impl rides the dual-backed StorageTxn::cas_bytes, so the SAME lease semantics hold on both the embedded redb backend and Postgres (D-LEASE-01) — the Postgres path is proven in the postgres-integration CI lane (postgres_lease.rs).

  • Acquire — a fresh coordinator CASes the empty row to epoch = 0.
  • Steal-on-expiry — if the lease TTL has passed, a new coordinator CASes the exact prior (expired) bytes to epoch + 1. The epoch is monotonic: it only ever advances.
  • Renew — the incumbent heartbeats by CASing its own bytes forward, keeping the epoch constant. A renew that finds the epoch has advanced under it returns false — the incumbent has been fenced.

The lease TTL equals coordinator_failure_timeout; the renew cadence is strictly shorter than the TTL. Every winning claim stamps the authoritative epoch row in the same transaction, so the lease epoch and the ledger epoch never diverge.

Work-stealing

When the global queue is empty but a peer is still busy, an idle worker steals work — coordinator-mediated, never peer-to-peer (rollout-coordinator::{ledger, steal}):

  • Trigger — a worker steals only when its local backlog drains to empty (backlog(thief) == 0); a non-idle thief's request is a no-op.
  • Victim — the coordinator picks the busiest peer by Running backlog.
  • Batch — it reassigns ceil(victim_backlog / 2) items, capped at MAX_STEAL_BATCH = 32.
  • Reassign — each item is moved via a two-step CAS over the same prior Running(victim) bytes: try_repending (Running → Pending) then try_claim (Pending → Running(thief)).

Stolen-then-reclaimed items never double-execute. If the victim's ack races the steal, both CAS the same expected bytes and exactly one wins — the loser sees stale bytes and is dropped. This single-winner property is witnessed every commit, Docker-free, by concurrent_ack_and_steal_no_double_execute.

Coordinator restart (stateless replayer)

The coordinator holds no run state in memory that it cannot rebuild from storage. On boot it: wins the lease, adopts the advanced epoch, then scan_byteses the work ledger and reconstructs its in-flight assignment map:

  • Running{worker} rows are reconstructed in-flight — NOT requeued, because the worker may still hold the item (only the failure-scan stale path re-pends after coordinator_failure_timeout);
  • Pending rows go back onto the dispatch queue;
  • Done / Failed rows are terminal and skipped — replayed acks are idempotent.

So a coordinator restart is invisible to progress: a fresh coordinator boots over the same storage and the run completes with zero duplicate sample IDs. This is witnessed every commit by coord_restart_no_duplicates (SC2).

Spot-drain (graceful preemption)

When a worker's cloud reports a spot preemption, the drain state machine (rollout-coordinator::drain) runs within a conservative deadline: set a stop-pull flag → nack each in-flight item (back to Pending for another worker) → opportunistically snapshot TrainState only if the remaining budget covers the cost → deregister → exit cleanly.

Two numbers, kept distinct (D-SPOT-01/04):

ProviderNotice lead (cloud gives)Drain deadline (we target)
AWS120s60s
GCP30s15s

The notice lead is what the cloud promises; the drain deadline is the budget the state machine targets, leaving margin. The preemption signal is consumed only through the ComputeHint trait — the coordinator never imports a cloud SDK (coord ↛ cloud; the dependency-direction lint enforces it). Witnessed every commit by spot_drain_completes_within_lead_time (SC3), for both AWS and GCP.

Split-brain fencing

If a coordinator is deposed (its lease was stolen and the epoch advanced), it self-fences: it emits exactly one coordinator_fenced observability event (never a shared-store write) and then std::process::abort()s within 5s. Workers reject any RPC response carrying a stale coord_epoch (EpochGuard), so a deposed coordinator's late replies can never corrupt the run. The abort edge is the hidden rollout-coordinator test-fence subcommand, driven by the SC4 subprocess witness fence_aborts_within_5s.

Operator recipe: the 3-node smoke

make smoke-3node-aws / make smoke-3node-gcp boot 1 coordinator + 3 workers (mock backend, no GPU, no Docker) over an auto-generated dev CA, drive the work ledger through dispatch + a real steal, and assert the run reports done within 30s with a steal observed:

make smoke-3node-aws      # local-transport wiring run (free, Docker-free)
make smoke-3node-gcp

By default the smoke runs over the local mTLS transport so it is reproducible on a free runner. To run the same topology over the real cloud transport (operator path; needs real AWS/GCP credentials and ~4 hosts):

ROLLOUT_SMOKE_CLOUD=1 make smoke-3node-aws

The live-cloud run is the operator-optional path; the every-commit named witnesses (coord_restart_no_duplicates, concurrent_ack_and_steal_no_double_execute, split_brain_old_coord_self_fences, spot_drain_completes_within_lead_time) are the load-bearing gate and run Docker-free on every commit:

cargo test -p rollout-coordinator

Examples

This page is reserved for the v1 working-model recipe (SHIP-03 hardened).

v1 cannot ship without at least one end-to-end recipe (make example or cargo run --example) that takes a real small open-weights model, runs SFT or PPO, completes on commodity hardware, is exercised by nightly CI, and is documented here. See AGENTS.md §9.4.

The recipe lands progressively: Phase 4 (SFT stub) → Phase 9 (real recipe) → Phase 12 (polished docs).