Introduction

A high-performance, multi-node reinforcement-learning framework for large language models. Written in Rust. Pluggable in Python.

rollout is a Rust-core reinforcement-learning framework for large language models. It supports PPO, GRPO, DPO/IPO/KTO, SFT, and reward-model training across training, batch inference, and online inference modes, with multi-node distribution from day one. AWS and GCP are first-class infra targets; vLLM is the default inference backend; plugins can be authored in Python or Rust.

Layer	What it owns
Algorithms	SFT · RM · PPO · GRPO · DPO / IPO / KTO
Substrate	Coordinator + workers + plugin host (PyO3 / sidecar RPC)
Storage / Cloud	Embedded · Postgres · S3 · GCS — AWS/GCP behind a layered trait

See Architecture for the full layered breakdown.

Architecture

See the canonical architecture doc: ARCHITECTURE.md.

Substrate (Phase 2)

Phase 2 lights up the local substrate — the layer that lets a worker start, store state, talk to a peer, load a plugin, and shut down cleanly without touching any cloud. The per-crate chapters land in plan 02-07; this page is the section landing.

What ships in Phase 2

rollout-proto — owns transport.proto (heartbeat / control / work) and plugin.proto (sidecar). tonic-build runs in this crate's build.rs; every other crate consumes the generated code.
rollout-storage — implements Storage + StorageTxn on top of redb. In-process tokio::sync::broadcast per-prefix for watch(). Always-fsync. Default path ./data/rollout.db. postcard value encoding.
rollout-cloud-local — implements ObjectStore (content-addressed sharded FS under ./data/object-store/), Queue (RAM hot path + spill to rollout-storage for restart replay), SecretStore (env-var allowlist, read-only), and ComputeHint (Linux full via /proc + nvml-wrapper; macOS minimal stub via sysinfo).
rollout-transport — implements gRPC with three logical channels (heartbeat / control / work). HTTP/2 + rustls is the plan-of-record; QUIC via tonic-h3 is feature-flagged EXPERIMENTAL.
rollout-plugin-host — implements PluginHost with all three modes wired: Rust cdylib, PyO3 in-process, Python sidecar (gRPC-over-UDS).
rollout-coordinator — minimal binary that registers workers, accepts heartbeats, persists worker registry + heartbeat ledger to Storage, and marks workers failed via deadline-based scan.
Smoke test — make smoke + scripts/smoke.sh spawn 1 coordinator + 2 workers + 1 cdylib + 1 Python sidecar, kill w1, and assert deadline detection within 2 × heartbeat_interval.

Plan-of-record vs stretch

HTTP/2 over tonic + rustls is the plan-of-record transport. QUIC via tonic-h3 v0.0.x is feature-flagged EXPERIMENTAL (it is currently pre-1.0 and bidi-streaming is not documented as production-ready). The same .proto schema covers both; the swap-to-QUIC is a single-config change in a later phase.

Trait surface

All trait contracts live in rollout-core. The Wave-0 extensions (plan 02-00) align the rollout-core traits with the spec versions:

Storage / StorageTxn — see spec 04 §2. Phase 2 simplifies scan to return owned Vec instead of BoxStream (object-safety + async_trait constraint).
PluginHost — see spec 03 §4–§5. Phase 2 uses Vec<u8> payloads in call(); richer typed-payload helpers ship in later phases.
Worker / Coordinator — see spec 01 §2. Worker::init / ready land in Phase 2; WorkerContext stays a unit struct until Phase 6.
ObjectStore / Queue / SecretStore / ComputeHint — see spec 06 §3. Queue::ack / nack, ObjectStore::exists, ComputeHint::preemption_signal, and SecretStore::put ship in rollout-core in Phase 2.
EventEmitter — see spec 09 §2. The trait lives in rollout-core (plan 02-00); the StdoutJsonEmitter impl lands in rollout-coordinator (plan 02-06).

Per-crate chapters

Proto crate
Storage
Cloud-local
Transport
Plugin host
Python bridge — PyO3 + pyo3-async-runtimes pin rationale
Coordinator
Smoke test — the make smoke SUBSTR-02 acceptance gate

Preflight

Before running make smoke, run bash scripts/preflight.sh. The script checks that cargo, make, and python3 ≥ 3.11 are on PATH and notes whether protoc is available (tonic-build bundles a copy when missing).

`rollout-proto` — wire-format crate

rollout-proto owns the workspace's .proto files and runs tonic-build exactly once. Downstream crates (rollout-transport, rollout-plugin-host, rollout-coordinator) depend on rollout-proto, not on .proto regeneration.

Why a dedicated crate

Per CONTEXT D-PROTO-01, the workspace has a single tonic-build invocation site. Centralising wire-format ownership means:

The HTTP/2 ↔ QUIC swap (CONTEXT D-TRANS-01 fallback) shares the same source-of-truth proto schema.
The sidecar IPC protocol (UDS) and the network transport carry the same message types — no drift between in-process and out-of-process plugin calls.
Build-time codegen happens once per workspace build; downstream crates compile faster.

Wire-format ownership

File	Package	Owns
`proto/transport.proto`	`rollout.transport.v1`	Worker ↔ Coordinator transport
`proto/plugin.proto`	`rollout.plugin.v1`	Host ↔ sidecar Plugin protocol

Both are committed to the repo. The build script (build.rs) compiles them via tonic-prost-build on every workspace build; rerun-if-changed keeps incremental builds fast.

Three transport channels

transport.proto defines the three logical channels from spec 05 §3:

Channel	RPC kind	Cadence	Purpose
`Heartbeat`	unary `Beat`	every 500 ms	Worker liveness; carries `due_at` for deadline-based health
`Control`	server-stream `Subscribe`	as-needed	Coordinator pushes drain / snapshot / cancel
`Work`	bidi `Stream`	bursty	Pull/submit work items (full semantics land in Phase 6)

All three multiplex over one connection — HTTP/2 today, QUIC behind a feature flag in a later phase.

Plugin sidecar protocol

plugin.proto defines the five sidecar RPCs from spec 03 §3.3 + §4:

Init(InitRequest) -> InitResponse — one-time setup per worker.
Preflight(PreflightRequest) -> PreflightResponse — optional, after Init.
Call(CallRequest) -> CallResponse — generic typed entry point; payload is opaque bytes (JSON or postcard, plugin-defined).
Reload(ReloadRequest) -> ReloadResponse — hot reload signal; the sidecar may refuse.
Shutdown(ShutdownRequest) -> ShutdownResponse — clean exit with grace period.

The Python sidecar host implementation lands in rollout-plugin-host (plan 02-05).

Python stubs

The in-tree Python sample sidecar (python/examples/sample_sidecar/) uses stdlib length-prefixed JSON framing over UDS — per AGENTS.md §7, every plugin must be testable locally without pip install grpcio or other external services. See 02-RESEARCH.md §"Python sidecar IPC: avoid pip" for the rationale.

If you are authoring a Python plugin that does want real gRPC, opt in via make protos (which shells to cargo xtask gen-protos). Output goes to python/rollout/_proto/. The xtask gracefully degrades when grpcio-tools is missing — it prints a helpful message and exits 0; CI does not require it.

pip install grpcio-tools     # opt-in
make protos                  # writes python/rollout/_proto/*_pb2.py

Versioning

Both protos use a v1 package suffix (rollout.transport.v1, rollout.plugin.v1). Bumping to v2 is a future spec edit; the v1 modules will continue to compile alongside any future versions so in-flight Phase 2 deployments do not break.

Cross-references

spec 03 — plugin system §3.3, §4
spec 05 — distribution / transport §3, §6
CONTEXT D-PROTO-01 for the "single tonic-build site" decision

Storage

rollout-storage is the Phase-2 embedded Storage + StorageTxn impl, backed by redb 2.5. It is the default backend in v1 per CONTEXT D-STO-01. The Postgres backend lives in Phase 4 (TRAIN-04).

Backend choice

Spec 04 §3.1 prefers redb: pure-Rust, single-file MVCC, copy-on-write, no compaction stalls. The async Storage trait wraps the sync redb API via tokio::task::spawn_blocking.

Table layout

Each StorageKey.namespace maps to its own redb TableDefinition<&[u8], &[u8]>. The Phase-2 namespaces are:

Table	Purpose
`runs`	Per-run metadata
`workers`	Worker registry (coordinator-owned)
`heartbeats`	Heartbeat ledger; coordinator scans this on `due_at`
`queue`	Generic queue spill
`plugins`	Plugin manifest cache
`cloudlocal_queue`	`rollout-cloud-local` queue restart-replay mirror

Adding a new namespace = adding a const TableDefinition in embedded::tables.rs and a match arm in table_for. No migration; redb opens new tables lazily on first write.

Key encoding

Namespace selects the table, so the redb key bytes only encode (run_id, path). The encoding is a single postcard::to_allocvec(&(&run_id, &path)) so decoding is unambiguous regardless of any 0x00 bytes inside Option<RunId>. See crates/rollout-storage/src/encoding.rs.

Value encoding

postcard per D-STO-04 — compact, schemafull, deterministic, serde-native. The Storage trait is bytes-only (get_bytes / put_bytes / cas_bytes); downstream crates layer postcard over the trait via free helpers when typed payloads are needed.

Durability

Always-fsync per D-STO-03. Each WriteTransaction calls set_durability(Durability::Immediate) before staging writes; redb's commit() blocks until fsync completes. Default DB path is ./data/rollout.db, overridable via [storage.embedded] path = ... in the run config.

watch() semantics

Storage::watch(prefix) -> tokio::sync::broadcast::Receiver<StorageEvent> returns an in-process subscriber to commits whose key extends prefix. The WatchRouter lives inside EmbeddedStorage; per-prefix broadcast::Senders are created on first subscribe.

The critical invariant (RESEARCH §"Pattern 2"): redb has no post-commit hook. The EmbeddedTxn buffers StorageEvents inside the in-flight txn; commit() only fans them out to the router after WriteTransaction::commit() returns Ok. On abort() (explicit or via drop), the buffered events are discarded — subscribers never observe rolled-back writes.

begin()  → buffer = []
put_bytes(K, V) → buffer.push(Put{K})
commit() → spawn_blocking(commit) → on Ok: for evt in buffer { watch.publish(evt) }
abort()  → drop(txn); buffer is dropped

Phase-2 scan semantics

Storage::scan_bytes(KeyRange { prefix, limit }) returns an owned Vec<(StorageKey, Vec<u8>)> rather than the BoxStream shown in spec 04 §2. The async-trait + object-safety constraint on stable Rust forbids stream-returning methods on dyn Storage; the simplification is documented in spec 04 §1a. Phase-2 callers (coordinator deadline scan, cloud-local restart replay) work with small per-namespace working sets where the owned-vec cost is negligible. A future StorageStream newtype can lift this restriction.

The scan iterates the namespace table and decodes each key on the way out; namespace already partitions the data, so tables stay small. limit short-circuits the iteration.

Postgres path constraint (ASCII-printable)

The Postgres backend stores StorageKey.path as TEXT[]. Path components must be ASCII-printable (bytes 0x20–0x7E); non-printable / NUL / high-bit bytes cannot round-trip through TEXT[] and would silently diverge from redb's byte-lex prefix scan (PITFALLS.md §17). Every Postgres CRUD and scan entry-point calls StorageKey::validate_for_postgres() and rejects offending keys with Fatal(ConfigInvalid) before any SQL runs. For binary identifiers in path components, use hex::encode at the StorageKey construction site. A 256-case proptest (tests/postgres_scan_bytes_parity.rs) witnesses byte-parity between redb and Postgres over the printable-ASCII range.

Cross-process watch

NOT supported. The broadcast channel lives in process memory; another EmbeddedStorage instance pointing at the same file will not receive events. Cross-process watch arrives with the Postgres backend in Phase 4 (LISTEN/NOTIFY).

Crash safety

The on-disk format is crash-safe at the redb commit boundary: a successful commit() implies fsync; anything before that is invisible on reopen. The active test in tests/crash_safety.rs exercises the abort-style variant (drop txn without commit, reopen, assert no keys visible). A "true SIGKILL between put and commit" variant is #[ignore]d for now — it needs a helper-binary harness with raw signals; Phase-6 DIST-03 (restart-from-storage) will land that.

Tests

File	Scope
`tests/crud.rs`	put/get/delete/scan/get_many/ping
`tests/tables.rs`	per-namespace isolation; six-namespace reopen
`tests/txn.rs`	commit/abort/cas (insert-only / CAS / delete-if-equal)
`tests/watch.rs`	publish-after-commit; abort suppresses; multi-subscriber; prefix isolation
`tests/crash_safety.rs`	drop-without-commit reopen; SIGKILL variant `#[ignore]`d

Cloud-local substrate

rollout-cloud-local ships the Phase-2 Layer-1 implementations so the rest of the stack has a real ObjectStore / Queue / SecretStore / ComputeHint to target with zero cloud creds. Per CONTEXT D-LOCAL-01..05.

What ships

Capability	Type	Implementation
Blob storage	`FsObjectStore`	Content-addressed two-level sharded FS under `./data/object-store/`; sibling `<hex>.meta.json` per blob.
Work queue	`InMemQueue`	`tokio::sync::Mutex<VecDeque<_>>` hot path + spill to `rollout-storage` (`cloudlocal_queue` namespace).
Secrets	`EnvSecretStore`	Read-only env-var allowlist (`ROLLOUT_SECRET_<NAME>`); `put()` returns `Fatal(ConfigInvalid)` by design.
Compute hint	`hints::*`	Linux full (`/proc/cpuinfo` + `/proc/meminfo`, optional NVML feature); macOS minimal (sysinfo cpu+memory).

What's deferred

BlockStore — D-LOCAL-05: declared in rollout-core, not implemented here; opt-in for clouds that need it.
Sandboxing beyond network allowlist — Phase 7, when the tool harness lands and untrusted-code isolation matters (cgroups + seccomp + FD limits + fs write restrictions).
Real cloud backends (rollout-cloud-aws, rollout-cloud-gcp) — Phase 5.

`FsObjectStore` layout (D-LOCAL-01)

./data/object-store/
├── ab/                       <- hex[0..2]
│   └── cd/                   <- hex[2..4]
│       ├── abcd…fullhash     <- blob
│       └── abcd…fullhash.meta.json
└── ...

Writes are tmp-then-rename for atomicity. Idempotent for repeated puts of the same bytes (same ContentId, no double-write). get_bytes on a missing id returns Fatal(Internal("object not found: …")) — choosing Fatal because a missing content hash signals an upstream contract violation, not a transient I/O fault.

Queue restart semantics (D-LOCAL-02)

Every enqueue writes through a Storage transaction under cloudlocal_queue/<ulid> BEFORE the item is pushed onto the in-memory deque. ack deletes the storage entry; nack re-pushes the item to the front of the deque without touching storage so the next restart still replays it.

On InMemQueue::open(storage), the queue scans cloudlocal_queue/*, decodes each QueueItemId(Ulid) from the path segment, sorts by ULID (which is k-sortable so this recovers enqueue order), and rebuilds the deque. This honors the spirit of DIST-03 (restart replay) for the local backend; the full DIST-01..05 fault-tolerance work lands in Phase 6.

Secret allowlist (D-LOCAL-03)

EnvSecretStore::new(allowlist) accepts a list of secret names (without the ROLLOUT_SECRET_ prefix). At get(name) time:

Condition	Result
`name` not in allowlist	`Err(Fatal(ConfigInvalid("not in allowlist")))`
`name` allowed, `ROLLOUT_SECRET_<name>` set	`Ok(value)`
`name` allowed, env var unset	`Err(Recoverable(Transient, RetryHint::Never))`
`put(name, value)` — ALWAYS	`Err(Fatal(ConfigInvalid("read-only")))`

The "allowed but unset" case is recoverable rather than fatal because the operator can provision the variable without changing config; subsequent calls will succeed.

GPU inventory (D-LOCAL-04)

GPU enumeration is opt-in behind a nvml Cargo feature. When the feature is off (default), LinuxComputeHint::inventory().gpus is always empty. When the feature is on but libnvidia-ml.so is missing or NVML init fails, the inventory still returns an empty gpus vector — never errors. This keeps local dev machines and CI runners (no GPU) functional without conditional code on the caller.

macOS skips GPU inventory entirely.

Tests

File	Coverage
`tests/object_store.rs` (6)	Round-trip, sharded layout, meta sidecar, exists, idempotency, fatal-on-missing
`tests/secrets.rs` (4)	Allowlist read, outside-allowlist fatal, unset transient, put fatal
`tests/queue_replay.rs` (5)	FIFO ULID order, nack-to-front, restart replay, ack removes, nack keeps
`tests/hints_macos.rs` (2, `cfg=macos`)	CPU + memory present, preemption signal `None`
`tests/hints_linux.rs` (3, `cfg=linux`)	`/proc/cpuinfo` parse, `/proc/meminfo` parse, no GPUs without `nvml` feature

Linux-only tests are #[cfg(target_os = "linux")] so workspace cargo test --tests stays green on every host. The NVML integration test is additionally #[ignore]d behind --ignored since it requires a live libnvml.

`rollout-transport` — gRPC plane with mTLS

rollout-transport is the worker ↔ coordinator gRPC plane for Phase 2. It ships HTTP/2 + rustls + mTLS as the plan-of-record. QUIC support is experimental and lives behind a Cargo feature flag.

Plan-of-record: HTTP/2 + rustls

Phase 2 research (.planning/phases/02-local-substrate/02-RESEARCH.md, "Transport stack") found tonic-h3 v0.0.5 (latest at planning time, 2025-11-01) was explicitly labelled experimental, with bidirectional streaming not documented as supported. The upstream hyperium/tonic#339 gRPC-over-HTTP/3 issue remains open. Shipping a Work channel (bidi-streaming) on top of tonic-h3 would risk runtime hangs under load.

Therefore plan 02-04 ships HTTP/2 tonic 0.14 + rustls 0.23 + rcgen 0.13. The same .proto schema is forward-compatible with QUIC; the swap will become a single Cargo-feature flip in a later phase once tonic-h3 (or its successor) ships documented bidi support.

Three logical channels

All three channels are multiplexed over one HTTP/2 connection per (coordinator, worker) pair, per spec 05-distribution.md §3.

Channel	RPC kind	Purpose
Heartbeat	unary, frequent	Worker's "I'm alive" ping; carries `due_at`
Control	server-streaming	Coordinator pushes drain / snapshot / cancel
Work	bidirectional	Phase-2 stub; Phase 6 wires pull/submit

The Work channel ships as a wired-but-stub WorkServiceImpl that echoes a heartbeat marker back on every received frame. Real pull/submit semantics arrive with DIST-01..02 in Phase 6.

mTLS auto-bootstrap (D-TRANS-02)

On first run the transport invokes tls::ensure_dev_ca(tls_dir). This:

Generates a self-signed CA via rcgen 0.13.
Writes ca.pem (public certificate) and ca.key.pem (private key).
Sets ca.key.pem permissions to 0o600 on Unix.

Subsequent calls are idempotent — the existing files are read back.

tls::issue_server_cert(ca_cert, ca_key, dns_names) issues server certs (EKU ServerAuth). tls::issue_client_cert(ca_cert, ca_key, names) issues client certs (EKU ClientAuth). Both are signed by the dev CA.

Filenames live under ./data/tls/ (gitignored). No manual openssl steps are required; rcgen produces pure-Rust output with no openssl-sys dependency (banned by deny.toml).

Deadline-based health (D-TIME-01..02)

Spec 05-distribution.md §6 mandates deadline-based health, not polling. Workers publish due_at = now + heartbeat_interval × 2 on every Beat; coordinators scan worker state and mark failure only when:

elapsed_past_due > clock_skew_budget  AND  elapsed_past_due > coordinator_failure_timeout

Helpers health::next_due_at and health::is_failed encode the formulas above. They take SystemTime directly so callers can swap a test clock trivially.

Default constants (D-TIME-01):

Constant	Default
`heartbeat_interval`	500 ms
`worker_self_fence_timeout`	4 s
`coordinator_failure_timeout`	5 s
`clock_skew_budget`	250 ms

Config invariants enforced at plan time (D-TIME-02)

TransportConfig::validate_cross_fields enforces two split-brain prevention rules at plan time, never at runtime:

worker_self_fence_timeout < coordinator_failure_timeout — a worker that fails to self-fence before the coordinator times it out causes classic split-brain (two writers think they own the same lease).
clock_skew_budget < heartbeat_interval × 2 — a clock-skew budget greater than two periods would make deadline-based failure detection meaningless.

rollout plan calls validate_cross_fields before any worker starts; violations return Fatal(ConfigInvalid) with a human-readable message identifying the offending field.

QUIC feature flag (EXPERIMENTAL)

The quic Cargo feature pulls in quinn, tonic-h3, h3, and h3-quinn:

[features]
default = ["h2"]
h2 = []
quic = ["dep:quinn", "dep:tonic-h3", "dep:h3", "dep:h3-quinn"]

At plan-02-04 execution time (2026-05-20), cargo build -p rollout-transport --features quic fails to compile because h3-quinn 0.0.7 references quinn::StreamId internals that became private in quinn 0.11.x. This confirms the RESEARCH §"Pitfall 2" assessment: the QUIC stack is not production-ready.

When tonic-h3 (or a successor) ships documented bidi-streaming and publishes a quinn 0.11-compatible release, the swap is:

Verify cargo build --features quic succeeds.
Implement server::serve_quic (currently returns Fatal(Internal) with an EXPERIMENTAL message).
Add a matching client::build_quic_channel.
Flip the default feature, or expose a CLI flag.

The proto schema does not change.

Channel: Work (Phase-2 stub)

work::WorkServiceImpl::stream accepts a bidi stream and echoes a "heartbeat" marker per inbound frame. This is enough for the Phase-2 smoke test in plan 02-07 to verify that the bidi pipe is wired end-to-end. Real pull/submit semantics — assignment, ack, work-stealing, restart-from-storage — arrive with DIST-01..02 in Phase 6.

Observability (principle #10)

Every server handler is wrapped in a tracing::instrument span with a channel field. Emitted events include:

transport_server_starting (with %addr)
mtls_handshake_bootstrap (when CA is generated)
heartbeat_received (worker_id, run_id, state)
control_subscribed (worker_id, run_id)

Binary crates configure the subscriber; library crates emit only — per D-OBSERVE-01.

Cross-references

docs/specs/05-distribution.md §3, §6
crates/rollout-proto/proto/transport.proto
.planning/phases/02-local-substrate/02-CONTEXT.md D-TRANS-01..03, D-TIME-01..02
.planning/phases/02-local-substrate/02-RESEARCH.md §"Transport stack", Pitfall 2

Plugin host

rollout-plugin-host is the Phase-2 implementation of rollout_core::PluginHost. It wires three loading modes against the trait surface delivered in Wave-0 and persists manifests through rollout-storage when constructed via PluginHostImpl::with_storage.

Three modes

Mode	Crate feature	Where it runs	Hot reload
Rust cdylib (`PluginMode::RustCdylib`)	`cdylib` (default)	In-process, native code	Unsupported. Returns `Fatal(PluginContract)` per spec 03 §7.
PyO3 in-process (`PluginMode::Pyo3`)	`pyo3` (default)	Dedicated Python OS thread	`importlib.reload` (requires `dev-hot-reload`).
Python sidecar (`PluginMode::Sidecar`)	`sidecar` (default)	Subprocess over Unix Domain Socket	SIGTERM + respawn (requires `dev-hot-reload`).

The host exposes one type — PluginHostImpl — that dispatches on manifest.mode at load() time. Each mode owns a small state struct kept in a parallel HashMap<PluginId, HandleState> so the public PluginHandle remains Clone + Send + Sync POD.

Manifest

Plugins ship a rollout-plugin.toml parsed by parse_manifest. The schema matches rollout_core::PluginManifest:

name             = "sample-inproc"
version          = "0.1.0"
kind             = "env-harness"           # PluginKind variant (kebab-case)
trait_id         = "rollout_core::Plugin"
mode             = "pyo3"                  # pyo3 | sidecar | rust-cdylib
network_allowlist = []

[runtime]
python_min = "3.11"      # required for pyo3
gpu        = false
memory_mib = 64

[entry.pyo3]
module  = "sample_inproc.plugin"
factory = "create_plugin"

validate_manifest runs cheap plan-time checks: pyo3 plugins must declare python_min >= 3.11 (stdlib tomllib + PyO3 abi3-py311), names/versions must be non-empty, and the manifest must parse against serde(rename_all = "kebab-case").

C-ABI shim (cdylib mode)

Phase 2 keeps the shim as an internal module (src/modes/abi.rs) per RESEARCH Open Question 3 — it graduates to a standalone rollout-plugin-abi crate when an external plugin ecosystem emerges (likely Phase 7+).

The contract:

#![allow(unused)]
fn main() {
#[repr(C)]
pub struct Buf { pub ptr: *mut u8, pub len: usize, pub cap: usize }

#[repr(C)]
pub struct RolloutPluginVtable {
    pub abi_version: u32,                  // must equal ABI_VERSION (1)
    pub name: *const c_char,
    pub call: extern "C" fn(
        method: *const u8, method_len: usize,
        payload: *const u8, payload_len: usize,
        out: *mut Buf
    ) -> i32,
    pub free_buf: extern "C" fn(buf: Buf),
}

#[no_mangle]
pub extern "C" fn rollout_plugin_factory() -> *mut RolloutPluginVtable;
}

The host copies the returned Buf before calling free_buf so allocator mismatches across the cdylib boundary cannot corrupt anything. ABI mismatches surface as Fatal(PluginContract { msg: "cdylib ABI mismatch: got N, expected 1" }).

The in-tree example lives at tests/smoke/plugins/rust_cdylib_sample/. It's an out-of-workspace cdylib crate (its own [workspace]) so cargo build --workspace doesn't churn on it; the smoke driver (plan 02-07) builds it with cargo build --manifest-path … --release.

Sidecar IPC

The in-tree sample uses stdlib-only length-prefixed JSON over AF_UNIX per RESEARCH Pitfall 9 + AGENTS.md §7 ("every plugin testable locally — without cloud creds / GPUs / external services"). The wire format is intentionally tiny:

request:  [u32 BE length][utf-8 JSON {"method": str, "payload": str}]
response: [u32 BE length][utf-8 JSON {...}]

This avoids forcing pip install grpcio into the cargo test path. Real production sidecars that need gRPC can ingest the same rollout-proto::plugin::v1 service stubs (make protos regenerates the Python side); the host supports both via the SidecarProtocol enum (FramedJsonUds and GrpcUds). Phase 2 only ships the FramedJsonUds code path; GrpcUds lands when a sidecar consumer needs it.

UDS paths default to ./data/sidecars/<plugin>-<pid>.sock and are chmod 600 on the Python side. The host removes the socket on unload + on respawn.

Hot reload

Behind the dev-hot-reload Cargo feature only:

PyO3. The dedicated Python thread runs importlib.reload(module) and re-invokes factory() on the same channel. In-flight calls running on the old object complete naturally; subsequent calls hit the new object.
Sidecar. SIGTERM the child (via nix::sys::signal::kill), wait 2 s, SIGKILL on holdout, then respawn with the same argv on a fresh UDS path.
cdylib. Returns Fatal(PluginContract { msg: "cdylib reload unsupported per spec 03 §7" }) — Rust has no stable ABI; dlclose while another task holds a Box<dyn Plugin> is UB. Production should not reload cdylibs.

The feature is off by default. Per spec 03 §7, production deployments ignore reload flags entirely (with a warning) — that surfacing lands with the coordinator in plan 02-06.

Sandboxing (Phase 2 scope)

Network allowlist only (D-SANDBOX-01). cgroups + seccomp + FD limits + fs write restrictions are tracked as TODOs referencing Phase 7. The allowlist surfaces as PluginManifest::network_allowlist; the egress proxy lives in the worker / cloud-local layer and lands when the tool harness needs adversarial isolation (Phase 7).

Dependency direction

rollout-plugin-host does not depend on rollout-transport. The Wave-0 dep-direction lint enforces this — sidecar IPC uses UDS framing via rollout-proto, not the QUIC/HTTP/2 transport from rollout-transport. Forgetting this rule means the layered architecture (AGENTS.md §9) drifts; the integration test in crates/rollout-core/tests/dependency_direction.rs catches drift on every PR.

Observability

Every public op emits a structured event with target = "plugin_host" and a plugin_id field:

Event	Fields	When
`plugin_loaded`	`plugin_id`, `mode`, `path`/`module`/`socket`	`load()` succeeds
`plugin_reloaded`	`plugin_id`, `reason`	`reload()` succeeds (or is rejected)
`plugin_call` (span)	`plugin_id`, `method`	wraps every `call()`
`plugin_call_error`	`plugin_id`, `error`	call returns Err

Subscribers configure formatting via tracing-subscriber (e.g., the stdout-JSON sink that the coordinator binary ships in plan 02-06).

Python bridge (PyO3)

Phase 2 pins pyo3 = 0.28 + pyo3-async-runtimes = 0.28 in lockstep. Both crates moved under the PyO3 organisation in 2025 (the predecessor pyo3-asyncio is archived).

Version pin rationale

Crate	Pinned version	Why
`pyo3`	`0.28`	First release with the new `Python::attach` / `Python::try_attach` API (replaces `Python::with_gil`); stable abi3 support; MSRV 1.83 satisfies our 1.88 toolchain.
`pyo3-async-runtimes`	`0.28`	Mandatory lockstep with `pyo3` — the FFI surface is private and tied to the patch version. Successor to `pyo3-asyncio` (archived).

Both versions are declared in [workspace.dependencies] so every consumer (currently just rollout-plugin-host) inherits the same patch range.

`abi3-py311` strategy

We enable pyo3/abi3-py311. This produces a single Python extension binary that runs on any CPython 3.11+ instead of building one wheel per minor (3.11 → 3.12 → 3.13). The trade-off:

✅ Smaller CI matrix; one wheel ships everywhere.
✅ Stdlib tomllib is available (used by sidecar samples to parse pyproject.toml if needed).
✅ pyo3 abi3 was stabilised circa 0.20 and is well-exercised.
❌ 3.10 builds are rejected at link time with cannot set a minimum Python version 3.11 higher than the interpreter version 3.10. Local dev machines that default to python3.10 must export PYENV_VERSION=3.11.x (or similar) before cargo build.
❌ Some C-extension features (e.g. private CPython internals) aren't accessible through abi3.

scripts/preflight.sh (added in Wave-0 plan 02-00) verifies python3 >= 3.11 before make smoke does anything destructive.

Dedicated Python OS thread

Per RESEARCH Pitfall 3, mixing PyO3 calls onto random Tokio worker threads deadlocks under contention because the GIL ends up held across .await points. The host avoids this by owning one OS thread per PluginHostImpl:

                 ┌────────────────────┐
   Tokio task ───► mpsc::Sender ─────►│  rollout-py-* OS thread
                 │ (PyTask enum)      │  ───────────────────────
                 │                    │  Python::attach(...) {
                 │                    │      plugin.call(...)
                 │ ◄─── oneshot ──────┤  }
                 └────────────────────┘

The worker thread is created with std::thread::Builder::new().name("rollout-py-<plugin>") so debuggers and tracing spans can distinguish per-plugin Python contexts. With pyo3/auto-initialize the interpreter spins up on first Python::attach; we do not call the (removed-in-0.28) prepare_freethreaded_python.

Heavy-CPU Python code should release the GIL via Py::allow_threads per spec 03 §3.2 — Phase 2 in-tree samples are tiny enough not to need this.

In-tree samples and the no-pip rule

Per AGENTS.md §7, every in-tree sample must run without pip install. Phase 2 ships three:

python/examples/sample_inproc/ — PyO3 in-process. create_plugin().call(method, payload) echoes payload or returns b"pong".
python/examples/sample_sidecar/ — Python sidecar over stdlib JSON framing. Runs as python -m sample_sidecar <socket_path>.
tests/smoke/plugins/rust_cdylib_sample/ — Rust cdylib implementing ABI v1.

User plugins are free to bring their own virtualenv with grpcio, numpy, etc. — the no-pip rule applies only to the in-tree samples that gate cargo test and make smoke.

Coordinator

rollout-coordinator is the Phase-2 minimal control plane. It registers workers, accepts heartbeats, persists the worker registry and heartbeat ledger to local EmbeddedStorage, and surfaces deadline-detected failures.

Scope

Phase 2 explicitly ships only the heartbeat-receiver slice:

In scope (Phase 2)	Out of scope (Phase 6 `DIST-01..05`)
Register worker, accept heartbeat	Work distribution / pull / submit
Persist `workers/` + `heartbeats/` to Storage	Coordinator lease / CAS / HA
Deadline-based failure scan + tracing events	Multi-coordinator handoff
Mount the three `rollout-transport` services	Restart-from-storage 4-node test

The same binary scales to the full distribution story later — Phase 6 adds work-stealing on top of the existing transport + storage wiring.

Storage layout

Two redb tables (rollout-storage provides them; this crate only writes):

Namespace workers, path [<worker_id>] → postcard-encoded WorkerRegistryEntry
Namespace heartbeats, path [<worker_id>] → postcard-encoded HeartbeatRecord

Heartbeats are overwrite-on-write in Phase 2 — only the latest beat is kept. Phase 6 may bolt on a ledger if work-stealing needs history.

Failure-detection formula

A worker is marked failed iff

elapsed_past_due = now - due_at
elapsed_past_due > clock_skew_budget
elapsed_past_due > coordinator_failure_timeout

Both thresholds must trip (spec 05 §6). The defaults (CONTEXT D-TIME-01):

Constant	Value
`heartbeat_interval`	500 ms
`worker_self_fence_timeout`	4 s
`coordinator_failure_timeout`	5 s
`clock_skew_budget`	250 ms

Plan-time invariants enforced by TransportConfig::validate_cross_fields:

worker_self_fence_timeout < coordinator_failure_timeout (split-brain prevention).
clock_skew_budget < 2 × heartbeat_interval.

Failure-scan loop

The scan ticks every heartbeat_interval / 2 (250 ms by default) so a single missed beat is detected within 2 × heartbeat_interval — the SUBSTR-02 acceptance criterion #3. Each tick scans the heartbeats/* namespace, decodes via postcard, and emits two outputs per overdue worker:

A tracing::warn! line with target = "coordinator" and worker_id = <id> + due_at_ms = <ms> fields.
An Event { kind: EventKind::Domain { topic: "worker_failed" }, level: Warn, … } via the injected EventEmitter (see D-OBSERVE-01 below).

Already-failed workers are tracked in an in-memory HashSet so the loop emits exactly one failure event per worker per coordinator lifetime.

Observability (D-OBSERVE-01)

StdoutJsonEmitter is the Phase-2 sink for rollout_core::EventEmitter:

Holds a Mutex<tokio::io::Stdout> so concurrent emits don't interleave.
Writes one NDJSON line per event using serde_json.
Flushes after each event.

CoordinatorImpl::new(storage, run_id, emitter) takes the emitter as an Arc<dyn EventEmitter> so non-stdout backends drop in without code change in later phases. The coordinator emits:

worker_registered — on the first register() (or first heartbeat from an unknown worker; see CLI section).
worker_heartbeat — on every accepted beat.
worker_deregistered — on graceful drain.
worker_failed — from the failure-scan loop.

Tests use NoopEmitter (also in emitter.rs) which discards every event.

CLI

`rollout coordinator run`

rollout coordinator run --config <path>

Boots the coordinator from a TOML file:

run_id = "01JZ..."        # ULID
[storage]
path = "./data/rollout.db"
[transport]
listen_addr = "127.0.0.1:50051"
tls_dir = "./data/tls"

The same logic ships as a standalone binary rollout-coordinator (Cargo [[bin]]) so the smoke-test wrapper can invoke it directly without rollout-cli in the path.

`rollout worker run`

rollout worker run --config <path> [--worker-id <ulid>] [--plugin <manifest.toml> ...] [--hot-reload]

The Phase-2 worker:

Opens its own EmbeddedStorage.
Builds a PluginHostImpl::with_storage(...) and load()s each --plugin manifest.
Dials the coordinator over mTLS using an ephemeral client cert issued from the dev CA at ./data/tls/.
Sends Beat(state=Init) immediately; the coordinator auto-registers the worker on first heartbeat (the proto has no separate register RPC).
Beats every heartbeat_interval after that, advancing state -> Ready.
On SIGTERM: sends Beat(state=Draining) and exits 0.

First-run UX

On first boot, the coordinator generates data/tls/ca.pem + data/tls/ca.key.pem (chmod 600 on the key) and prints:

Generated dev CA at ./data/tls/ca.pem

Subsequent runs are idempotent (read-through). The CA + per-host certs follow rcgen 0.13 defaults — adequate for dev; production should swap in a real CA in a later phase.

Smoke test

make smoke is the SUBSTR-02 / SUBSTR-03 / SUBSTR-04 acceptance gate. It runs the live substrate end-to-end on a clean checkout with no cloud credentials and proves that deadline-based failure detection works on the wire.

What it proves

The smoke test boots the full Phase-2 substrate and exercises every concern in one run:

rollout-storage opens three independent redb files (one per coordinator / w1 / w2) with always-fsync durability.
rollout-transport brings up the HTTP/2 listener with an auto-generated dev CA + per-host mTLS certificates.
rollout-plugin-host loads two plugins per worker — a Rust cdylib and a Python sidecar (stdlib-only framed-JSON over UDS).
rollout-coordinator persists the worker registry + heartbeat ledger and runs the deadline-based failure scan that emits worker_failed for any worker overdue by more than coordinator_failure_timeout + clock_skew_budget.

When the script kills w1 with SIGKILL, the coordinator must observe and emit the failure event for w1's ULID inside the deadline budget — that is the contract the substrate ships.

Topology

                        coord.log (NDJSON events)
                          ▲
                          │
    ┌─────────────────────┴─────────────────────┐
    │            rollout-coordinator            │
    │  ./data/smoke/coord.db   tls/ca.pem       │
    │  listen 127.0.0.1:50051 (mTLS, H/2)       │
    └───────────────┬──────────────┬────────────┘
                    │              │
              heartbeat 500ms  heartbeat 500ms
                    │              │
    ┌───────────────▼──┐     ┌────▼─────────────┐
    │  rollout worker w1│     │ rollout worker w2 │
    │  w1.db            │     │ w2.db             │
    │  plugins:         │     │ plugins:          │
    │    cdylib sample  │     │   cdylib sample   │
    │    sidecar sample │     │   sidecar sample  │
    └───────────────────┘     └───────────────────┘

Both workers load both plugin samples; the cdylib path is built on demand by the script (cargo build --manifest-path .../rust_cdylib_sample/Cargo.toml) and the Python sidecar resolves via PYTHONPATH=python/examples.

Timeline

Step	Wall-clock budget	Asserted invariant
Build binaries + cdylib	~30 s (cold) / instant (cached)	exit 0
Spawn coordinator	< 1 s	port 50051 up
Spawn w1 + w2	< 1 s	both PIDs alive
Wait heartbeat-stable	< 5 s	`worker_heartbeat` for both ULIDs in coord.log
`kill -KILL w1`	instant	—
Detect `worker_failed(w1)`	< 8 s (D-TIME-01 floor: `2 × heartbeat_interval + skew`)	`worker_failed` topic + w1 ULID in coord.log

The 8-second detection deadline is well inside the spec contract: at the locked Phase-2 defaults (heartbeat_interval = 500 ms, coordinator_failure_timeout = 5 s, clock_skew_budget = 250 ms), the failure-scan loop ticks at 250 ms and fires the event within coordinator_failure_timeout + clock_skew_budget ≈ 5.25 s of the missed beat — observed locally at ~5.2 s.

Where logs go

All smoke artifacts land under ./data/smoke/ (gitignored):

Path	Purpose
`data/smoke/coord.db`	Coordinator embedded storage
`data/smoke/w1.db`, `w2.db`	Per-worker embedded storage
`data/smoke/tls/`	Auto-generated dev CA + per-host certs
`data/smoke/sidecars/`	Per-plugin UDS sockets
`data/smoke/logs/coord.log`	Coordinator stdout (NDJSON spec-09 events + tracing)
`data/smoke/logs/w1.log`, `w2.log`	Worker stdout (tracing)
`data/smoke/logs/w1.toml`, `w2.toml`	Per-worker TOML (sed-rewritten from the shared fixture so each worker has its own db)
`data/smoke/logs/w1.id`, `w2.id`	Pre-allocated worker ULIDs (smoke driver writes these so the grep step is deterministic)

Searching coord.log for worker_failed.*$W1_ULID is the assertion the script makes.

Configuration fixtures

The committed fixtures at tests/smoke/coordinator.toml and tests/smoke/worker.toml use the D-TIME-01 timing defaults verbatim. The worker TOML carries a single storage path; the smoke driver derives data/smoke/w1.db and data/smoke/w2.db at runtime with sed because redb takes an exclusive file lock per Database::create and two worker processes sharing the same file would conflict.

CI integration

.github/workflows/ci.yml includes a smoke job on ubuntu-latest that runs after the standard test job. The job installs netcat-openbsd for the port-poll, runs bash scripts/preflight.sh, then make smoke. On failure, all data/smoke/logs/ contents are uploaded as a smoke-logs artifact for post-mortem.

The pre-existing 11 CI jobs (lint, test, deny, commitlint, schema-drift, architecture-lint, unused-deps, rustdoc-check, docs-build, docs-deploy, docs-test-policy) are untouched — no continue-on-error, no skipped steps. The architecture-lint job automatically picks up the 4th invariant added in dependency_direction.rs (rollout-coordinator must not depend on rollout-plugin-host or any cloud-layer crate) without any YAML changes.

First-run UX

On first invocation the coordinator emits

Generated dev CA at ./data/smoke/tls/ca.pem

to stderr and proceeds. No manual openssl steps are required; rcgen mints the CA + per-host certs in-process. The tls/ directory is gitignored along with the rest of data/.

How to extend

Add a plugin: drop a manifest TOML under tests/smoke/plugins/, then add a --plugin <path> flag to the worker spawn lines in scripts/smoke.sh.
Change timings: edit tests/smoke/coordinator.toml and tests/smoke/worker.toml; the cross-field invariants (worker_self_fence_timeout < coordinator_failure_timeout, clock_skew_budget < heartbeat_interval × 2) are enforced at validate_cross_fields time and failing configs early-exit.
Add a worker: copy the spawn_worker block; allocate a new ULID; add the per-worker sed-rewrite line for the storage path.

Reproducing locally

bash scripts/preflight.sh   # verifies cargo + make + python3 >= 3.11
make smoke

Total wall-clock is dominated by the cold cargo build (~30 s on a warm laptop); the actual integration test runs in ~7 s once binaries exist.

If python3 --version reports < 3.11, install a newer interpreter (brew install python@3.13, pyenv install 3.13.3, or your distro's python3.11 package) and re-run; the preflight check guards against the runtime-incompatible interpreter.

Inference

Phase 3 delivers rollout's first end-user-visible building block: a batch-inference pipeline that loads a model into vLLM via PyO3, pulls content-addressed prompts off the Phase-2 in-memory queue, and writes JSONL completions to disk. The pipeline is resumable — rollout infer batch --resume <run_id> scans the per-sample state in rollout-storage and only enqueues outstanding work, producing zero duplicates on restart.

Phase 3 ships two new crates plus a CLI subcommand:

Crate	Layer	Phase-3 responsibility
`rollout-backend-vllm`	2	`InferenceBackend` impl over vLLM's `AsyncLLMEngine` via PyO3 (in-process).
`rollout-runtime-batch`	3	CAS sample-state machine, JSONL I/O, plan-time validation, mock backend.
`rollout-cli` (extended)	4	`rollout infer batch --config <toml> [--resume <run_id>]` subcommand.

Per-component chapters land in plan 03-05 (smoke + docs + bench). For now, see the Wave-0 trait extension in spec 02 §2a and the WorkerRole addition in spec 01 §3a.

TODO (plan 03-05): vLLM backend chapter · batch-runtime chapter · CPU-mode chapter · macOS Docker dev workflow · resume semantics walkthrough.

vLLM backend (`rollout-backend-vllm`)

rollout-backend-vllm implements the Phase-3 InferenceBackend trait (init / generate / model_id / shutdown) over vLLM's AsyncLLMEngine via PyO3 in-process. The crate is the second user of the dedicated Python OS-thread pattern hardened in plan 02-05; the first was rollout-plugin-host. The architecture is documented in spec 03-plugin-system §3.2 and the trait surface in spec 02-algorithms §2 / §2a.

Status (plan 03-03): Wave-3 lives. The PyO3 dedicated thread now imports rollout.backends.vllm.engine, which wraps vllm.AsyncLLMEngine; generate drives the engine through a fresh asyncio event loop on the worker thread via pyo3_async_runtimes::tokio::run_until_complete. The default-features (no-vllm) build keeps the Wave-2 stub worker so cargo test -p rollout-backend-vllm still runs without Python / vLLM.

Architecture

+--------------------+ tokio::sync::mpsc::Sender<VllmTask>  +--------------------+
|  Tokio runtime     | -----------------------------------> |  rollout-py-vllm-  |
|  VllmBackend       |                                      |     <engine_id>    |
|  (async impl       | <----------------------------------- |  Python::attach    |
|   InferenceBackend)|       oneshot::Sender<Result<...>>   |  rt.block_on(...)  |
+--------------------+                                      +--------------------+
                                                                       |
                                                                       v
                                                    rollout.backends.vllm.engine
                                                    (Wave 3: AsyncLLMEngine)

The Tokio side never touches Python. Every call hops through one OS thread named rollout-py-vllm-<engine_id> that owns the interpreter for the backend's lifetime.
The thread imports rollout.backends.vllm.engine once at startup, then loops on mpsc::Receiver<VllmTask> until VllmTask::Shutdown arrives or the channel closes.
Each Generate task uses a oneshot::channel for its reply so multiple concurrent prompts can be in flight without head-of-line blocking on the channel (Tokio side dispatches them sequentially in Wave 2; Wave 3 will span them concurrently to let vLLM's continuous batcher do the work).

Cargo feature gating

[features]
default = []       # no PyO3 link; tests use the stub worker
vllm    = []       # imports rollout.backends.vllm.engine on the dedicated thread

With --features vllm off (default), the crate builds without invoking pyo3 at runtime. The dispatch path still exists; Generate returns the Wave-2 stub error from a pure-Rust worker. This honors AGENTS.md §7 (every plugin testable locally without GPU) — no Python interpreter, no vLLM install, no CUDA required for cargo test -p rollout-backend-vllm.

With --features vllm on, the worker thread calls Python::attach(|py| py.import("rollout.backends.vllm.engine")) once at startup. The Python module top-imports vllm.AsyncLLMEngine (plan 03-03).

vLLM version pin: vllm>=0.10,<0.22. The lower bound is the first release where AsyncLLMEngine is an alias for the new v1 engine (vllm.v1.engine.async_llm.AsyncLLM); the upper bound guards against future-version drops of the alias. The pin is documented but NOT enforced in Cargo.toml — vLLM is a Python install. The engine module's import falls back to from vllm.engine.async_llm_engine import … if the top-level alias is removed in a future version.

Pitfall 10: env-write before import

vLLM imports huggingface_hub which lazily reads os.environ.get("HF_TOKEN") when downloading gated models. The dedicated Python thread therefore writes HF_TOKEN into its own os.environ before the py.import("rollout…") call:

#![allow(unused)]
fn main() {
Python::attach(|py| {
    if let Some(token) = &secret_token {
        let os = py.import("os")?;
        let environ: Bound<'_, PyDict> = os.getattr("environ")?.cast_into()?;
        environ.set_item("HF_TOKEN", token)?;
    }
    py.import("rollout.backends.vllm.engine")?;
    Ok(())
});
}

The token flows through the VllmEngine::spawn(engine_id, secret_token) constructor; the SecretStore consumer is wired in plan 03-03 alongside the live engine.

Wave 2 vs Wave 3 split

Concern	Wave 2 (plan 03-01)	Wave 3 (plan 03-03)
Cargo crate + `vllm` feature flag	shipped	inherited
PyO3 dedicated thread (`rollout-py-vllm-…`)	shipped	inherited
`VllmTask` enum + `mpsc::Sender` dispatch	shipped	inherited
`VllmBackend: InferenceBackend` impl shape	shipped (stub)	live engine
`python/rollout/backends/vllm/engine.py`	stub	real `AsyncLLMEngine`
`AsyncLLMEngine.from_engine_args` wiring	—	shipped
`pyo3_async_runtimes::tokio::run_until_complete`	—	shipped
Explicit `torch.cuda.is_available()` device probe	—	shipped
`HF_TOKEN` env-write before `import vllm`	wired (passthrough)	exercised
Content-addressed `model_id` from HF repo SHA	—	shipped
`criterion` throughput bench + raw-vllm baseline	placeholder	shipped

Bridging the asyncio ↔ Tokio gap (RESEARCH Pitfall 2)

vllm.AsyncLLMEngine.generate(prompt, sampling, request_id) returns an async generator that vLLM's own scheduler drives. To consume it from Rust, we:

Build a coroutine on the GIL by calling engine.generate_one(...) (the Python module's wrapper that drives the async-for loop to completion and returns the final RequestOutput as a dict).
Create a fresh asyncio event loop on the worker thread.
Hand the coroutine to pyo3_async_runtimes::tokio::run_until_complete(event_loop, async move { into_future(coro).await }). The Python C-level event_loop.run_until_complete releases the GIL whenever the loop has nothing to do — which is exactly when our Rust await yields. That is what lets vLLM's background scheduler tasks (also on this asyncio event loop) make progress.

The contract is verified on every CI build by tests/pyo3_bridge_smoke.rs: a Python async def smoke() spawns a background threading.Thread that polls a flag; the assertion fails if the background thread does not see the flag set, proving the GIL would have been held across the await (Pitfall 2 regression).

Pitfall 9: explicit device probe

vLLM's EngineArgs(device="auto") is unreliable on partially-installed CUDA hosts (silent CPU fallback). The Python glue probes torch.cuda.is_available() and passes an explicit device="cuda" or device="cpu" to AsyncEngineArgs. CONTEXT D-VLLM-04's earlier "auto" guidance is superseded by this probe.

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
args = AsyncEngineArgs(model=model_uri, device=device, ...)

Live integration tests (gated)

Two #[ignore]'d integration tests under crates/rollout-backend-vllm/tests/ exercise the live engine:

vllm_init.rs — bring up facebook/opt-125m; assert backend.model_id() is content-addressed and stable across two init() calls of the same URI.
vllm_generate.rs — bring up Qwen/Qwen2.5-0.5B-Instruct and run a 1 prompt × 8 tokens round-trip with a 300 s timeout (RESEARCH Pitfall 8 CPU mode is slow).

Both tests skip unless ROLLOUT_VLLM_AVAILABLE=1 is set in the environment. Run them with:

ROLLOUT_VLLM_AVAILABLE=1 cargo test \
    -p rollout-backend-vllm --features vllm \
    --test vllm_init --test vllm_generate -- --include-ignored

The default workspace cargo test skips them.

Benchmark methodology

crates/rollout-backend-vllm/benches/throughput.rs runs a 64-prompt × 64-token criterion bench against facebook/opt-125m; the companion scripts/raw_vllm_baseline.py drives the same prompt set through raw vllm.LLM (sync API). Compare the two tokens/sec numbers to verify the BACKEND-02 <10% overhead exit criterion — a ratio ≥ 0.9 passes.

The bench is gated behind --features vllm and ROLLOUT_VLLM_AVAILABLE=1. CI does not run it by default; the perf check lives on the self-hosted GPU runner per CONTEXT D-CLI-05.

# rollout side:
ROLLOUT_VLLM_AVAILABLE=1 cargo bench \
    -p rollout-backend-vllm --features vllm --bench throughput

# baseline side:
python scripts/raw_vllm_baseline.py facebook/opt-125m

Cross-references

RESEARCH §"Pattern 1" — the PyO3 dedicated-thread shape this crate mirrors from plan 02-05.
RESEARCH §"Common Pitfalls" — Pitfalls 2 (GIL deadlock), 9 (device="auto" not reliable), 10 (env-write before import).
spec 02 §2a — the locked Phase-3 trait surface.

`rollout-runtime-batch`

The runtime-glue crate behind rollout infer batch. Owns the CAS sample-state machine, the queue management around the Phase-2 InMemQueue, JSONL I/O, the InferBatchConfig TOML schema, and the test-only MockBackend that lets us exercise the full coordinator/worker path without spinning up a real vLLM engine.

Why a separate crate?

rollout-backend-vllm (Layer 2) depends only on rollout-core + pyo3 — it has no business touching rollout-storage / rollout-cloud-local / rollout-transport. The dep-direction lint (invariants #5 and #6 from plan 03-00) enforces this. The Phase-3 work that does need storage + queue + object-store — the sample-state CAS machine, the resume scan, the JSONL reader — lives here in rollout-runtime-batch (Layer 3). The runtime composes any Arc<dyn InferenceBackend>; the backend stays cloud-agnostic.

+------------------+        rollout-cloud-local::InMemQueue
|  rollout-cli     |---┐    rollout-cloud-local::FsObjectStore
|  infer batch     |   |    rollout-storage::EmbeddedStorage
+------------------+   ▼
                  +-------------------------+      +-----------------------+
                  | rollout-runtime-batch   |◀────▶| Arc<dyn               |
                  |                         |      |  InferenceBackend>    |
                  | • BatchCoordinator      |      |                       |
                  | • BatchWorker           |      | • VllmBackend (prod)  |
                  | • SampleRecord + CAS    |      | • MockBackend (test)  |
                  | • JSONL I/O             |      +-----------------------+
                  | • InferBatchConfig      |
                  +-------------------------+

`SAMPLING_PARAMS_SCHEMA_VERSION`

Per RESEARCH §"Pitfall 1", the deterministic sample_id() hash is brittle against any change to SamplingParams's wire shape. Postcard is not self-describing — adding a field rewrites every byte of to_stdvec(&params), which would invalidate every outstanding Pending / Running sample-ID in storage.

The defence is a schema-version byte:

#![allow(unused)]
fn main() {
const SAMPLING_PARAMS_SCHEMA_VERSION: u8 = 1;

fn sample_id(model: &ContentId, prompt: &str, params: &SamplingParams, idx: u64) -> ContentId {
    let mut h = blake3::Hasher::new();
    h.update(&[SAMPLING_PARAMS_SCHEMA_VERSION]);  // FIRST byte
    h.update(&model.0);
    h.update(prompt.as_bytes());
    h.update(&postcard::to_stdvec(params).unwrap());
    h.update(&idx.to_le_bytes());
    ContentId(*h.finalize().as_bytes())
}
}

When you add a field to SamplingParams, bump SAMPLING_PARAMS_SCHEMA_VERSION. That invalidates outstanding IDs by design — drain the in-flight batch under the old version before deploying the new schema, or re-run any orphaned samples.

Tests:

content_id_derivation.rs::sample_id_matches_locked_hex_for_default_params — locks the hex digest for the default-params case so any silent change to hasher input order trips immediately.
content_id_derivation.rs::schema_version_byte_is_first — explicit regression catch for Pitfall 1.
Six proptest property tests prove that every component of the input (prompt, idx, model, temperature, max_tokens, seed) participates in the hash.

CAS state machine

SampleRecord { id, prompt_blob, state, created_at_ms, input_idx } lives at infer/<run_id>/samples/<sample_id_hex> (storage namespace infer, table T_INFER).

              try_claim                      try_complete
   Pending  ──────────────▶ Running ────────────────────▶ Done
     ▲                       │                            ▲
     │                       │                            │ (terminal)
     │ try_repending         │ try_fail
     │  (stale only)         ▼
     └──────────────────── Failed
                            (transient errors;
                             coordinator re-enqueues)

Helpers (in src/state.rs):

try_claim(txn, record, run_id, worker_id, now_ms, stale_after_ms) → Result<bool> CAS-swaps Pending (or stale Running) → Running. Returns false on race-loss.
try_complete(txn, running_record, run_id, completion_blob, finished_at_ms)
try_fail(txn, running_record, run_id, reason, failed_at_ms)
try_repending(txn, running_record, run_id) — for the resume path.

All helpers take a &mut Box<dyn StorageTxn> so the caller controls the commit/abort lifecycle.

Resume semantics

BatchCoordinator::scan_and_enqueue(inputs, model_content_id, sampling) is idempotent. For each input it derives sample_id(model, prompt, sampling, idx), looks up the existing record at sample_key(run_id, sid), and:

Current state	Action
(absent)	write `Pending` + prompt blob + enqueue
`Pending`	enqueue (worker will claim via CAS)
`Failed`	enqueue (retry)
`Running` (fresh)	skip — live owner
`Running` (stale)	CAS `Running` → `Pending`; enqueue on success
`Done`	skip — terminal

"Stale" defaults to 5 minutes (DEFAULT_STALE_AFTER_MS = 5 * 60_000). Override per-coordinator via .with_stale_after_ms(ms). The 5-minute window follows RESEARCH §"Pitfall 5" — short enough to recover from a SIGKILL'd worker, long enough to absorb a slow generation pass.

The stale-Running re-Pending step is itself a CAS so two coordinators racing on the same orphaned sample don't double-enqueue. The test resume_skips_done.rs::scan_enqueues_only_non_terminal_samples covers all four transition cases in a single integration run.

`BatchWorker` flow

#![allow(unused)]
fn main() {
pub async fn run_loop(&self) -> Result<usize, CoreError> {
    loop {
        match self.run_one().await? {
            RunOutcome::Drained => return Ok(completed),
            RunOutcome::Completed => completed += 1,
            RunOutcome::Failed | RunOutcome::Skipped => { /* keep going */ }
        }
    }
}
}

run_one() dequeues a sample-id, loads the SampleRecord, CAS-claims it, fetches the prompt blob, calls backend.generate(&[Prompt], &sampling), writes the completion to the object store, and CAS-transitions Running → Done. Backend errors translate to Running → Failed via try_fail; the queue item is ack'd either way (the coordinator re-enqueues Failed rows on the next scan_and_enqueue).

Concurrency: each BatchWorker is one Tokio task; spawn N of them sharing the same Arc<dyn Queue> / Arc<dyn Storage> / Arc<dyn ObjectStore> and the CAS dance guarantees at-most-once completion per sample.

`MockBackend`

Gated by the test-mock-backend Cargo feature. Returns Completion { text: format!("MOCK:{}", p.0), finish_reason: "stop", … } after an optional tokio::time::sleep. Used by worker_happy_path.rs and (in Wave-4) by the restart-no-duplicates integration test to exercise the full pull-loop without a Python / vLLM dependency.

The contract: MockBackend impls the same InferenceBackend trait that VllmBackend does, so it composes against BatchWorker identically. CLI code in Wave 3 (plan 03-04) wires VllmBackend; nothing in BatchWorker / BatchCoordinator knows the difference.

`InferBatchConfig`

The TOML schema for rollout infer batch --config <path> lives here in src/config.rs (NOT in rollout-cli, per WARN-5). rollout-cli imports it via use rollout_runtime_batch::config::InferBatchConfig;.

[model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[sampling]
temperature = 0.7
top_p       = 0.9
max_tokens  = 64
seed        = 42

[input]
glob = "data/prompts/*.jsonl"

[output]
dir = "data/completions"

[workers]
count = 1

#[serde(deny_unknown_fields)] on every block per spec 11 — typos in the config file fail at plan-time, not minute 47 of a run.

JSONL I/O

read_jsonl + write_jsonl use stdlib tokio::io::BufReader::lines() + serde_json::from_str — no serde_jsonlines dep. Arbitrary extra fields on the input row are preserved verbatim via #[serde(flatten)] and round-trip to the output row. write_jsonl writes one JSON object per line; the CLI sorts on SampleRecord.input_idx before calling write_jsonl so output order matches input order regardless of which worker finished which sample first.

Phase 4 — `TrainableBackend` impl on `MockBackend`

Beyond the Phase-3 InferenceBackend impl, MockBackend ships a TrainableBackend impl gated behind the same test-mock-backend feature. The training extension drives plan 04-02's LOAD-BEARING snapshot_resume.rs byte-compare test (TRAIN-03) — no Python, no GPU, runs on every CI build.

MockBackend::new_train(seed) initialises an ndarray::Array1<f32> of length 8 with every element set to (seed as f32) / 1000.0.
forward_with_loss returns loss = 0.5 and a GradHandle { step: prev + 1 }.
optimizer_step applies a deterministic SGD delta (seed + grad_handle.step) * lr to every weight element.
save_weights returns ContentId::of(postcard::to_stdvec(&weights)).
load_weights is a no-op; new_train_with_weights(seed, weights) is the test-side restore hook so the byte-compare assertion in snapshot_resume.rs is meaningful.

Determinism contract: two MockBackends constructed with the same seed produce byte-equal weights after K identical optimizer_step calls. Plan 04-02's test proves bit-identical resume by running 10 uninterrupted steps versus 5-snapshot-5 and asserting the two final weight vectors are equal.

`rollout infer batch` CLI

The first user-visible building block of rollout: a resumable batch-inference pipeline driven by a single TOML config. Bridges the Phase-3 substrate (rollout-backend-vllm + rollout-runtime-batch) behind a clap subcommand on rollout-cli.

Invocation

rollout infer batch \
    --config examples/batch-tiny.toml \
    [--resume <run_id>] \
    [--workers N] \
    [--dry-run]

Flag	Default	Purpose
`--config`	required	Path to the TOML config (schema below).
`--resume`	implicit from output dir	Override the `<output.dir>/run-id` lookup with an explicit ULID.
`--workers`	`[workers].count`	Override the worker pool size for this invocation.
`--dry-run`	`false`	Validate config + probe inputs + check `HF_TOKEN` allowlist; never call the backend.

TOML schema

The schema lives in rollout-runtime-batch::config::InferBatchConfig (the CLI imports it; spec 11 single-source-of-truth). Every block is #[serde(deny_unknown_fields)] — typos are caught at load time.

[model]
uri       = "Qwen/Qwen2.5-0.5B-Instruct"   # HF repo, local path, or object-store URI
tokenizer = "..."                          # optional override

[sampling]
temperature = 0.7
top_p       = 0.9
top_k       = -1     # -1 disables
max_tokens  = 64
seed        = 42     # optional; deterministic when set
stop        = []
stream      = false  # MUST be false in Phase 3 (D-BACKEND-03)

[input]
glob = "data/prompts/*.jsonl"

[output]
dir = "data/completions"

[workers]
count = 1            # >= 1

JSONL input contract

One JSON object per line. Required: prompt. Optional: id (defaults to the deterministic sample-id from blake3(model || prompt || params) per rollout-runtime-batch::sample_id). Extra fields are preserved and round-tripped to output.

{ "prompt": "Translate to French: Hello.", "id": "p-001", "tag": "demo" }

JSONL output contract (D-CLI-03)

Each row:

{
  "id":                "<content-addressed sample id>",
  "prompt":            "<original prompt text>",
  "completion":        "<generated text>",
  "sampling_params":   { ... },
  "model_uri":         "Qwen/Qwen2.5-0.5B-Instruct",
  "finish_reason":     "stop",
  "model_content_id":  "<blake3-hex of resolved model SHA>",
  "completion_blob_id":"<blake3-hex of completion bytes in the object-store>",
  "generated_at":      "2026-05-20T22:31:09.123Z"
}

Order matches input file order regardless of worker concurrency — the CLI calls BatchCoordinator::collect_done_records() (sorted by input_idx) and emits in that order.

`run_id` lifecycle (BLOCKER 6)

Three-tier resolution:

Source	When applied
`--resume <ULID>`	Always honored if present.
`<output.dir>/run-id` (file)	Re-attach if the file exists and `--resume` is absent.
Freshly minted ULID	First run; written atomically (tmp + rename).

The file is single-line UTF-8 (ULID Crockford form). Plan 03-05's restart_no_duplicates test reads this file between phases to obtain the ULID for the explicit --resume flag.

`--dry-run` semantics

Performs (in order):

Parse TOML against InferBatchConfig (deny_unknown_fields).
Validate sampling.stream == false, sampling.max_tokens > 0, workers.count >= 1, input.glob non-empty.
Resolve input.glob and read every JSONL file end-to-end.
Probe whether the model URI is on a known-gated prefix (meta-llama/*, mistralai/*) and look up ROLLOUT_SECRET_HF_TOKEN (best-effort).
Print dry-run OK: model=… inputs=… workers=… and exit 0.

Crucially, the backend is never constructed — --dry-run works on a build with neither --features vllm nor --features test-mock-backend.

Backend selection

The CLI picks one backend at runtime based on Cargo features + env:

Order	Condition	Backend
1	`--features test-mock-backend` + `ROLLOUT_TEST_MOCK_BACKEND=1`	`rollout_runtime_batch::MockBackend`
2	`--features vllm`	`rollout_backend_vllm::VllmBackend`
3	none	fast-fail with `Fatal(ConfigInvalid)` at full-run; dry-run still works.

Observability

RUST_LOG=info rollout infer batch … produces structured tracing events. Key fields: run_id, enqueued, total, completed. Worker spans carry worker = <ulid>.

Exit codes

Code	Meaning
0	Success (or successful dry-run).
2	Config-invalid, infrastructure error, or backend error.

CPU-mode caveat

vLLM CPU inference on the test model (Qwen2.5-0.5B-Instruct) is dramatically slower than CUDA: expect ~1–5 tokens/sec. On macOS Apple-Silicon, vLLM has no PyPI wheels (RESEARCH Pitfall 3) — run via Docker. CI's infer-smoke job is Linux-only and gated on ROLLOUT_VLLM_AVAILABLE=1.

CPU mode

vLLM ships with first-class CUDA support, but for AGENTS.md §7 ("every plugin testable locally without GPU") rollout must also work on a plain CPU. This chapter documents the CPU-mode contract for rollout-backend-vllm, the dev-loop reality of macOS Apple-Silicon, and the smoke-test posture that lets default CI stay green without any GPU.

Where CPU mode is selected

Per Phase 3 CONTEXT decision D-VLLM-04 (as overridden by RESEARCH §"Pitfall 9"), the Python-side glue in python/rollout/backends/vllm/engine.py performs an explicit torch.cuda.is_available() probe and passes device="cuda" or device="cpu" to AsyncEngineArgs — never device="auto". The auto-detect path was rejected because vLLM silently falls back to CPU when CUDA libraries are partially installed (driver present, runtime missing, etc.), and a silent fallback would mask configuration mistakes at runtime instead of failing at plan time.

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
engine_args = AsyncEngineArgs(
    model=model_uri,
    device=device,           # explicit; not "auto"
    disable_log_stats=True,
    disable_log_requests=True,
)

rollout-cloud-local::ComputeHint::inventory() still informs observability events (gpu_inventory_collected) and worker-config decisions, but it is no longer the source-of-truth for the engine device kwarg.

Expected CPU throughput

vLLM's CPU backend is functional but not fast. Approximate single-stream throughput for the canonical Phase-3 test model (Qwen/Qwen2.5-0.5B-Instruct, 16 max-tokens):

Host	tokens/sec
Apple M1 Pro (8-core perf)	~3–6
Linux x86_64 (16-core)	~2–4
Generic CI runner (4-core)	~1–2

A 4-prompt × 16-token smoke run finishes in well under 60 s on any of these. Anything longer than the canonical examples/batch-tiny.toml shape should target a CUDA host.

macOS Apple-Silicon

vLLM has no Apple-Silicon wheel as of Phase 3. pip install vllm on macOS produces ERROR: No matching distribution found for vllm. The two paths forward:

Build-from-source (slow, brittle). VLLM_TARGET_DEVICE=cpu pip install -e . against a freshly cloned vLLM repo. Compilation takes 10–30 min and depends on Apple-clang versions in ways that drift between vLLM releases. Documented but not recommended for routine dev work.
Docker (recommended). See dev-on-macos.md. A linux/amd64 (or linux/arm64 if available) container with vllm>=0.10 pre-installed lets the rollout binaries run identically to CI.

Either way, the Rust-side test surface (~80 % of Phase 3's automated tests — SamplingParams postcard determinism, sample_id derivation, CAS state-machine transitions, JSONL round-trip, the MockBackend-driven restart_no_duplicates test) runs natively on macOS without any vLLM installed. Only the live-engine integration tests (vllm_init.rs, vllm_generate.rs) and the make infer-smoke script require a real vLLM.

CI posture

Default CI (public runners): infer-smoke now runs on every PR and merge — no ROLLOUT_VLLM_AVAILABLE gate. It installs the vllm-cpu PyPI wheel (~101 MB unified CPU wheel, AVX2 fallback) instead of the ~10 GB CUDA wheel; the torch.cuda.is_available() probe in engine.py selects device="cpu" automatically. The job downloads Qwen2.5-0.5B-Instruct (cached under ~/.cache/huggingface), runs rollout infer batch --config examples/batch-tiny.toml, and asserts 4 non-empty completion rows — all on the free 4-vCPU ubuntu-latest runner in well under 60 s of inference. train-smoke is likewise always-on, installing CPU torch + transformers + accelerate and running the examples/sft-tiny.toml SFT (max_steps = 2) on CPU. pip + HuggingFace caches keep both jobs fast on repeat runs.
MockBackend proofs unchanged: the load-bearing restart_no_duplicates (BACKEND-02 exit (b)) and bit-identical-resume (TRAIN-03) proofs still run in the standard test job via MockBackend — no GPU/vLLM/transformers required there.
Local dev: make infer-smoke after pip install 'vllm-cpu>=0.17'. On Apple-Silicon, prefer the Docker path documented in dev-on-macos.md.

Failure modes

Failure	Surface	Diagnosis
`import torch` fails	Python ImportError at engine init	Active venv missing torch — `pip install torch` first
`import vllm` fails	Python ImportError at engine init	Active venv missing vllm — for the CPU/CI path `pip install 'vllm-cpu>=0.17'`; on a CUDA host install the matching CUDA wheel
`torch.cuda.is_available() == False` on a GPU host	engine boots in CPU mode silently	NVIDIA driver/runtime mismatch — install matching CUDA runtime; the explicit probe surfaces this rather than masking it
`vllm` import succeeds but `AsyncLLMEngine.from_engine_args` panics with `device="cpu"` not supported	vLLM version too old	upgrade to `vllm>=0.10`
`make infer-smoke` times out (>300 s) on a CPU host	model larger than `Qwen2.5-0.5B-Instruct`	use the canonical `examples/batch-tiny.toml` model; do not run multi-billion-param models on CPU

Resume: zero-duplicate batch restart

Phase 3's rollout infer batch is resumable. If the worker is killed mid-batch, restarting with --resume <run_id> (or with the same [output] dir = and no explicit flag) continues from the last persisted sample with zero duplicates and no skipped inputs. This is one of the three ROADMAP Phase-3 exit criteria.

The lifecycle

Every rollout infer batch invocation has exactly one run_id — a 26-character ULID. The CLI resolves it via a three-tier precedence:

--resume <id> explicit override. Use this when you know the run you want to reattach.
<output.dir>/run-id file. If --resume is absent but the output directory already has a run-id file (from a prior run), the CLI reads it and continues that run. This is the common case for "kill + restart with the same config."
Mint a fresh ULID. No override, no prior file → generate a new RunId, write it atomically (tempfile + rename) to <output.dir>/run-id, and proceed as a brand-new run.

run-id files are single-line UTF-8 ULIDs. They are safe to delete (forces a fresh run on next invocation) but should not be edited by hand.

How resumability is implemented

Three subsystems collaborate (each documented in its own chapter):

rollout-storage (the redb-backed EmbeddedStorage from Phase 2) holds the per-sample state KV under namespace infer. Each sample's SampleRecord carries one of four SampleState variants — Pending, Running, Done { completion_blob }, or Failed { reason } — and transitions atomically via Storage::cas_bytes.
rollout-cloud-local::InMemQueue (also Phase 2) holds the work queue with Storage spill: every enqueue/ack mirrors to redb under cloudlocal_queue, so a coordinator restart replays whatever was unacknowledged.
rollout-cloud-local::FsObjectStore stores the actual completion blobs at content-addressed paths under <output.dir>/object-store/<sha256[0..2]>/<sha256[2..4]>/<sha256>. The SampleRecord::state = Done { completion_blob: ContentId } variant carries the blob's ContentId; the actual completion text is read from the object store at output-writing time.

At plan time, BatchCoordinator::scan_and_enqueue iterates the existing samples namespace and:

skips records with state = Done (already complete; their completion blobs survive in the object store);
re-enqueues records with state = Pending or state = Failed;
treats records with state = Running and a started_at older than stale_after_ms (default 60 000) as stale — CAS Running → Pending and re-enqueue (per RESEARCH Pitfall 5: covers the kill-mid-flight case where a worker crashed without finishing).

Only the new and re-enqueued samples flow through the queue. Workers process them via the same BatchWorker::run_loop as a fresh run.

The deterministic test

crates/rollout-cli/tests/restart_no_duplicates.rs is the load-bearing proof. It runs on every CI build — no GPU, no vLLM, no real model — because it uses the MockBackend shipped by rollout-runtime-batch behind the test-mock-backend Cargo feature.

The test follows RESEARCH §"Restart-resume test design" verbatim:

Spawn rollout infer batch --config <tmp> as a subprocess via tokio::process::Command::new(env!("CARGO_BIN_EXE_rollout")) with ROLLOUT_TEST_MOCK_BACKEND=1.
Stream stdout, count sample_completed events.
After 3 completions, child.start_kill() (SIGKILL).
Read <output.dir>/run-id.
Spawn a second subprocess with --resume <run_id> and the same env.
Wait for exit 0.
Assert the final completions.jsonl has exactly N=8 lines, all unique sample_ids, and every input prompt is represented once.

Run locally:

cargo test -p rollout-cli --features test-mock-backend --test restart_no_duplicates

Total wall-clock: ~1.5 s on a modern dev box. The MockBackend completes each sample in 50 ms; the test deliberately spawns subprocesses to exercise the real CLI resume code path including --resume parsing, run-id file I/O, and BatchCoordinator::scan_and_enqueue.

Content-addressed sample IDs

A sample's identity is derived deterministically from (model, prompt, sampling_params, input_index) — see crates/rollout-runtime-batch/src/state.rs::sample_id. Per CONTEXT D-RESUME-01 (extended by RESEARCH Pitfall 1):

#![allow(unused)]
fn main() {
fn sample_id(model: &ContentId, prompt: &str, params: &SamplingParams, idx: u64) -> ContentId {
    let mut h = blake3::Hasher::new();
    h.update(&[SAMPLING_PARAMS_SCHEMA_VERSION]);  // = 1 in Phase 3
    h.update(model.as_bytes());
    h.update(prompt.as_bytes());
    h.update(&postcard::to_stdvec(params).unwrap());
    h.update(&idx.to_le_bytes());
    ContentId::from(h.finalize())
}
}

The leading SAMPLING_PARAMS_SCHEMA_VERSION byte is the maintenance hook: bumping the constant invalidates every outstanding sample-id, which is the desired behavior whenever the SamplingParams struct gains, loses, or reorders fields. Without it, a future serde-evolution of SamplingParams could silently produce different postcard bytes for the same logical configuration and break resume across versions.

What you cannot resume

Cross-model resume. A run whose <output.dir>/run-id was minted under model A cannot resume against model B — the per-sample model_id is part of the content-addressed sample ID; mixing models triggers Fatal(ConfigInvalid) at scan time.
Cross-machine resume (Phase 3). The EmbeddedStorage redb file is local to the host. Phase 5 (CLOUD-03) introduces object-store-backed snapshot storage that will allow cross-machine resume; until then, resume is single-host.
Streaming runs. Phase 3 rejects sampling.stream = true at plan time (D-BACKEND-03); streaming is Phase 8 (INFER-01).

Dev loop on macOS (Apple Silicon)

vLLM has no Apple-Silicon wheel as of Phase 3 (see cpu-mode.md). That makes the full rollout infer batch smoke loop unavailable natively on macOS dev boxes. This chapter documents the two supported workarounds and the substantial default-CI surface that does run on macOS so you don't need a Linux box for routine work.

What runs natively on macOS

The following all pass on darwin-aarch64 with PYO3_PYTHON=/opt/homebrew/bin/python3.13 (or a 3.11+ python on PATH) — no vLLM required:

cargo test --workspace --tests              # 100+ tests across Phase 2 + Phase 3
cargo test -p rollout-cli --features test-mock-backend --test restart_no_duplicates
make smoke                                  # Phase 2 substrate smoke
cargo clippy --workspace --all-targets -- -D warnings
cargo doc   --workspace --no-deps           # rustdoc gate (without --all-features; see ci.yml)
mdbook build docs/book

What does not run natively:

make infer-smoke ROLLOUT_VLLM_AVAILABLE=1   # needs `pip install vllm`, which has no aarch64-darwin wheel
cargo test -p rollout-backend-vllm --features vllm --test vllm_init -- --include-ignored
cargo bench -p rollout-backend-vllm --bench throughput

For the load-bearing exit-criterion-(b) proof (zero-duplicate restart), the MockBackend-driven test runs natively in ~1.5 s. Live vLLM is only needed for exit criterion (a) (the canonical rollout infer batch --config examples/batch-tiny.toml) and (c) (the <10 % overhead benchmark).

Workaround 1: Docker (recommended)

Run rollout inside a Linux container that has vLLM pre-installed. The repo is mounted via a bind volume so edits flow through immediately.

Sample Dockerfile.devcontainer:

FROM rust:1.88-slim-bookworm

RUN apt-get update && apt-get install -y --no-install-recommends \
      python3.11 python3-pip git make pkg-config libssl-dev protobuf-compiler \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install --break-system-packages 'vllm>=0.10,<0.22'

WORKDIR /workspace
CMD ["bash"]

docker build -t rollout-dev -f Dockerfile.devcontainer .
docker run --rm -it \
  -v "$PWD":/workspace \
  -v "$HOME/.cargo/registry":/root/.cargo/registry \
  -v "$HOME/.cache/huggingface":/root/.cache/huggingface \
  rollout-dev

Inside the container:

cargo test --workspace --tests
ROLLOUT_VLLM_AVAILABLE=1 make infer-smoke
cargo run -p rollout-cli --features vllm -- infer batch --config examples/batch-tiny.toml

Caveats:

linux/arm64 and linux/amd64 images both work; linux/arm64 is faster on M-series silicon.
The first run downloads Qwen/Qwen2.5-0.5B-Instruct (~1 GiB). The ~/.cache/huggingface bind volume above persists it across container restarts.
A pure-CPU Docker run will not exercise CUDA-specific code paths. For those, use Workaround 2 or a cloud GPU runner.

Workaround 2: Cloud GPU runner

For exit criterion (c) (the <10 % overhead benchmark) you need a real CUDA GPU. The repo's CI infer-smoke job is gated on vars.ROLLOUT_VLLM_AVAILABLE == '1' and needs: test — set the repo variable on a self-hosted runner with a GPU, then push, and the bench captures throughput against python scripts/raw_vllm_baseline.py automatically.

For local development without committing, a Lambda Labs / Runpod / Vast.ai box rented by the hour works well:

# On the GPU box:
git clone <this-repo>
cd rollout
pip install 'vllm>=0.10,<0.22'
ROLLOUT_VLLM_AVAILABLE=1 make infer-smoke
cargo bench -p rollout-backend-vllm --bench throughput

Workaround 3: Source-built vLLM (not recommended)

VLLM_TARGET_DEVICE=cpu pip install -e . against a freshly cloned vllm-project/vllm repo works on macOS in principle. In practice the build takes 10–30 min, brittle-fails on Apple-clang version mismatches in ways that drift between vLLM releases, and produces a vllm binding ~3× slower than the Linux CPU wheel. Prefer Docker.

What CI tests on macOS

The lint and test workflow jobs run on macos-14 (the GitHub macos-14 runner is Apple-Silicon). They cover the entire Rust workspace surface and the MockBackend-driven restart_no_duplicates test. The infer-smoke job runs on ubuntu-latest and is opt-in; the lint job uses default features only (no quic, no vllm) per the .github/workflows/ci.yml comment.

Training

Phase 4 lands the first end-to-end training story: supervised fine-tuning + Bradley-Terry reward-model training + bit-identical-resume training-state snapshots + the Postgres Storage backend.

What's here

SFT (Supervised Fine-Tuning) — rollout-algo-sft; TRAIN-01.
RM (Reward Model) — rollout-algo-rm with Bradley-Terry pairwise loss; TRAIN-02.
Snapshots — rollout-snapshots, SnapshotKind::TrainState, tar + blake3 + restore; TRAIN-03.
Postgres backend — rollout-storage[postgres]; testcontainers CI; TRAIN-04.
Determinism — accelerate.save_state + CUDA / CPU caveats.
CLI — rollout train sft|rm and rollout snapshot list|show|prune.
CPU mode — what to expect on macOS / Apple Silicon development boxes.

Quickstart

# Dry-run validation (works without Python deps).
cargo run -p rollout-cli -- train sft --config examples/sft-tiny.toml --dry-run

# Live run (requires transformers + accelerate + torch; ~3-5 min CPU on M-series).
pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke

Phase 4 exit criteria

Criterion	Where it's proven
`rollout train sft --config examples/sft-tiny.toml` completes on a small model	`make train-smoke` (gated on `ROLLOUT_TRANSFORMERS_AVAILABLE=1`)
Snapshot + restart produces bit-identical weights for next K steps	`crates/rollout-algo-sft/tests/snapshot_resume.rs` (default-fire) + `crates/rollout-backend-vllm/tests/snapshot_resume_live.rs` (gated)
Postgres backend CI-tested via containerized integration test	`crates/rollout-storage/tests/postgres_integration.rs` via the `postgres-integration` CI job

What's NOT here (deferred)

PPO / GRPO / DPO / IPO / KTO — Phases 9 / 10.
Buffer / Process / EpisodicMemory snapshot kinds — Phases 9 / 11 / 8.
Cloud object stores for snapshot blobs — Phase 5.
HuggingFace datasets Hub integration — Phase 7.
Multi-node distributed training — Phase 6.

SFT — Supervised Fine-Tuning

Phase 4 plan 04-02 ships rollout-algo-sft: a PolicyAlgorithm skeleton driven by a deterministic MockBackend, with the load-bearing TRAIN-03 byte-compare resume proof. This chapter covers the architecture, the SFT settings shape, the JSONL data contract, and how snapshot_save / snapshot_restore participate in the TRAIN-03 round-trip.

The HF transformers + accelerate path lands in plan 04-05; this skeleton intentionally has zero Python / GPU dependencies so the snapshot resume contract is exercised on every CI build.

Architecture

   ┌─────────────────────────┐         ┌────────────────────────────┐
   │ SftSettings (TOML)      │         │ AlgoDependencies           │
   │   base_model, optimizer │         │   backend  : Arc<dyn TB>   │
   │   budget, dataset       │         │   storage  : Arc<dyn S>    │
   │   packing, loss_on, …   │         │   object   : Arc<dyn O>    │
   └────────────┬────────────┘         │   snapshots: Arc<dyn Sn>   │
                │ from_settings        │   events   : Arc<dyn Em>   │
                ▼                      └─────────────┬──────────────┘
       ┌─────────────────────┐                       │
       │     SftAlgo         │ ◀─────────────────────┘
       │  (PolicyAlgorithm)  │
       │   step: u64         │     run() loop, bounded by budget.max_steps
       └──────────┬──────────┘
                  │ step_once()
                  ▼
       forward_with_loss  →  optimizer_step  →  step += 1
                  │
                  ▼
     snapshot_save / snapshot_restore (algo meta only — weights via TrainableBackend::save_weights)

SftAlgo holds an Arc<dyn TrainableBackend> and a step counter. Each step_once() synthesises a single-row TrainBatch, calls forward_with_loss (which returns a constant loss=0.5 plus an opaque GradHandle), then calls optimizer_step. The trait's optimizer_step takes &self (interior mutability) so the algo can step through the Arc<dyn …> without unique ownership.

Plan 04-05 replaces the synthetic batch with a real tokenized chunk read from the dataset.

`SftSettings` (TOML shape)

[algorithm]
kind = "sft"

[algorithm.sft]
minibatch_size = 8
gradient_accumulation = 1
loss_on = { kind = "assistant_only" }

[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[algorithm.sft.dataset]
kind = "jsonl_path"
path = "data/sft-tiny.jsonl"

[algorithm.sft.optimizer]
kind = "sgd"
lr = 1.0e-3
weight_decay = 0.0
betas = [0.9, 0.999]
eps = 1.0e-8
warmup_steps = 0
schedule = "constant"

[algorithm.sft.budget]
max_steps = 1000

[algorithm.sft.packing]
kind = "off"
max_seq_len = 2048

The JSON schema is generated by cargo xtask schema-gen from rollout_core::config::training::SftSettings.

JSONL data contract (D-DATA-01)

load_jsonl accepts two shapes per row:

Shape	JSON	`DataRow`
Prompt / completion	`{"prompt":"Q","completion":"A"}`	`{ prompt: "Q", assistant: "A" }`
Chat messages	`{"messages":[{"role":"user","content":"Q"},{"role":"assistant","content":"A"}]}`	`{ prompt: "[user] Q", assistant: "A" }`

Phase-4 restrictions:

At most one "role": "assistant" turn per row (multi-turn lands when the harness work in Phase 7 needs it).
Empty lines are skipped.
Malformed lines (neither shape, missing assistant in messages, unparseable JSON) produce Fatal(ConfigInvalid) with the file path and line number — <path>:<lineno>: <reason>. Easy to grep, easy to fix.

The Phase-7 work (HARNESS-*) extends this to DatasetRef::Other(...) for harness-driven datasets; until then, Other is a config error.

`validate_plan` errors

Locator	Reason
`algorithm.sft.minibatch_size`	must be ≥ 1
`algorithm.sft.optimizer.lr`	must be > 0

These fire at plan time so the CLI rejects bad configs before any backend is constructed.

`snapshot_save` / `snapshot_restore` (D-DETERM-05)

SftAlgo::snapshot_save builds a Snapshot row with:

meta = { step: <u64>, weights_id: "<hex>" } — algorithm-internal extras, free-form serde_json::Value.
parts = [{ role: "weights", content: <ContentId> }] — points at the bytes returned by TrainableBackend::save_weights.
kind = SnapshotKind::TrainState.

SftAlgo::snapshot_restore reads meta.step and resets the algo's step counter. The backend's weights are restored separately — production backends call TrainableBackend::load_weights(&weights_id); the Phase-4 snapshot_resume.rs test rebuilds MockBackend directly from a captured weights_snapshot() (the byte-compare assertion would be meaningless if it went through load_weights, which is a no-op for the mock).

The full production save path (tar of the accelerate dir, blake3 over the tar bytes, content-addressed put on the object store) lives in SnapshotterImpl::save_train_state (plan 04-01) and runs alongside the algo-level snapshot_save call.

TRAIN-03 byte-compare proof

tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the LOAD-BEARING proof for TRAIN-03. It runs on every CI build with no GPU and no HF transformers — exercising the resume contract on the MockBackend path. The flow:

Run A: fresh MockBackend::new_train(42) → 10 step_once iterations → capture weights_a.
Run B (phase 1): fresh MockBackend::new_train(42) → 5 step_once iterations → capture weights_after_5 → algo_b1.snapshot_save() → drop algo + backend.
Run B (phase 2): MockBackend::new_train_with_weights(42, weights_after_5) → push step counter to 5 via test helper set_step → algo snapshot_restore → 5 more step_once iterations → capture weights_b.
Assert weights_a == weights_b (byte-equal).

The set_step(5) helper is a MockBackend-only affordance because the algo sees only Arc<dyn TrainableBackend> and load_weights is a no-op on the mock. Production backends restore the optimizer step counter via their own checkpoint format inside load_weights.

Running the example

The smallest possible SFT run lives at examples/sft-tiny.toml + examples/sft-tiny.jsonl (4 chat rows; Qwen2.5-0.5B-Instruct; max_steps = 2). Two ways to exercise it:

Dry-run (works without Python deps; validates config + dataset path + algorithm shape):

cargo run -p rollout-cli -- train sft \
  --config examples/sft-tiny.toml --dry-run

Live run (requires transformers + accelerate + torch; ~3-5 min M-series CPU):

pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke

make train-smoke invokes scripts/train-smoke.sh, which dry-runs first then runs the full SFT path against Qwen/Qwen2.5-0.5B-Instruct on CPU. See CLI for the full subcommand surface.

Next steps

Plan 04-05 (backend-vllm-train) swaps the synthetic batch + deterministic-SGD path for real tokenization through HF transformers
- accelerate; the same PolicyAlgorithm surface drives it.
Plan 04-06 (CLI) mounts rollout train sft --config <toml> on top of SftAlgo::run.
Plan 04-07 polishes the docs and ships the v1 SFT smoke recipe.

Reward-model training (RM)

rollout-algo-rm implements the Bradley-Terry reward-model training algorithm (TRAIN-02). It mirrors the SFT algorithm's structure — PolicyAlgorithm impl driven by a TrainableBackend, JSONL data loader, and a TRAIN-03 byte-compare resume proof — but consumes pairwise preferences instead of single sequences.

Overview

A reward model learns to score responses on a scalar "quality" axis. Training data is a stream of preference pairs (prompt, chosen, rejected): the model should learn to rank chosen higher than rejected for the given prompt.

The Bradley-Terry objective formalizes this as a pairwise logistic regression on the reward gap:

L = -E[ ln σ(r_chosen - r_rejected) ]

where σ is the logistic function and r_* are the scalar reward outputs. Spec 02 §7 carries the contract.

`RmSettings` (TOML)

[algorithm.rm]
base_model = "Qwen/Qwen2.5-0.5B-Instruct"
head = "bradley_terry"          # Phase 4 supports BradleyTerry only
minibatch_size = 8

[algorithm.rm.optimizer]
kind = "sgd"
lr = 1.0e-5

[algorithm.rm.budget]
max_steps = 100

[algorithm.rm.dataset]
type = "jsonl_path"
path = "examples/data/pairs.jsonl"

Other RmSettings fields (base_model, optimizer, budget, dataset) mirror SftSettings. Head selection is bradley_terry only in Phase 4; pairwise_logistic is a Fatal(ConfigInvalid) with a Phase 9 sentinel until the RL pipeline lands.

Bradley-Terry loss math

Implemented in crates/rollout-algo-rm/src/loss.rs:

logsigmoid(x) = ln σ(x). Numerically stable via the softplus trick — logsigmoid(50) and logsigmoid(-50) both return finite values within 1e-4 of the true asymptote.
bradley_terry_loss(r_chosen, r_rejected) = -logsigmoid(r_chosen - r_rejected).
bradley_terry_batch_mean(pairs) — mean over a slice of (r_chosen, r_rejected) pairs. Returns 0.0 for empty batches; callers should validate non-empty upstream when needed.

Pinned golden values (tests/bradley_terry_loss.rs):

Case	Inputs	Expected
Zero diff	`(1.0, 1.0)`	`ln 2 ≈ 0.6931`
Strong preference	`(5.0, -5.0)`	near 0 (≪ 1e-3)
Inverted preference	`(-5.0, 5.0)`	≈ 10.0 (± 1e-3)
Mixed batch	`[(2,1), (1,2)]`	mean ≈ 0.8133
Empty batch	`&[]`	exactly 0.0
Numerical stability	`logsigmoid(±50)`	finite; asymptotic value within 1e-4

JSONL data shape (D-DATA-01)

Phase 4 supports one row shape:

{"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
{"prompt": "Capital of France?", "chosen": "Paris", "rejected": "London"}

load_pairs(&path) parses line-by-line, skipping blank lines and rejecting malformed rows with Fatal(ConfigInvalid) prefixed <file>:<lineno>:. A row missing any of the three fields is malformed.

`PolicyAlgorithm` surface

Method	Behavior
`id()`	`AlgorithmId("rm")`
`Settings`	`rollout_core::config::training::RmSettings`
`from_settings`	clones `deps.backend` into the algo; `step = 0`
`required_roles`	`vec![WorkerRole::LearnerWorker]`
`validate_plan`	rejects `RmHeadKind::PairwiseLogistic` (Phase 9); rejects `minibatch_size == 0`; `lr <= 0`
`run`	loads pairs once; loops `step_once` up to `budget.max_steps`, honoring `ctx.cancel`
`snapshot_save`	meta = `{step, weights_id}`; one `SnapshotPart { role: "weights" }`
`snapshot_restore`	restores `self.step` from `meta.step`; backend weights restored separately

step_once synthesizes a 2-row TrainBatch (one row per side of a pair) and drives forward_with_loss → optimizer_step. In the Phase-4 MockBackend test path the loss is a constant; the real Bradley-Terry loss fires under plan 04-05's HF transformers integration.

TRAIN-03 second-witness — byte-compare resume

tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the Bradley-Terry twin of the SFT byte-compare proof. Structure:

Run A. 10 step_once iterations with seed = 42; capture weights.
Run B Phase 1. 5 steps; capture mid-run weights; snapshot_save().
Run B Phase 2. Rebuild MockBackend::new_train_with_weights(42, …); push step counter to 5; restore algo step from snapshot meta; 5 more steps.
Assert. weights_a == weights_b byte-for-byte.

This is the second-witness for TRAIN-03 (the SFT proof is the first witness); together they discharge the "deterministic resume" exit criterion across both Phase-4 algorithms.

Content-addressed final checkpoint

tests/checkpoint_roundtrip.rs proves that TrainableBackend::save_weights returns a ContentId that is stable when the backend is idle (two calls → identical hash) and different after a non-trivial optimizer_step. This matches the TRAIN-02 contract: the final checkpoint is content-addressed by the blake3 hash of the postcard-encoded weights.

Phase 4 head support

Only RmHeadKind::BradleyTerry is wired in Phase 4. PairwiseLogistic exists in the enum so the config schema can be cross-validated end-to-end, but selecting it returns a Fatal(ConfigInvalid) with the string Phase 9 in the message — Phase 9 lands the full RL pipeline including alternate preference heads.

What lands later

Plan 04-05 swaps MockBackend for the real HF transformers / accelerate training loop on Qwen/Qwen2.5-0.5B-Instruct (CPU), wiring the Python-side F.logsigmoid(r_chosen - r_rejected).neg().mean() and producing real reward models.
Plan 04-06 mounts rollout train rm --config <toml> on RmAlgo::run.
Phase 9 lands PairwiseLogistic and the RL-* algorithms (PPO/GRPO) that consume reward models trained here.

Running the example

The smallest possible RM run lives at examples/rm-tiny.toml + examples/rm-tiny.jsonl (4 preference pairs; Qwen2.5-0.5B-Instruct base; BradleyTerry head; max_steps = 2).

Dry-run (works without Python deps):

cargo run -p rollout-cli -- train rm \
  --config examples/rm-tiny.toml --dry-run

The live rollout train rm path through --features train is wired identically to SFT; the Phase-4 smoke recipe (make train-smoke) exercises SFT specifically — the RM pipeline is dry-run-validated here and lands under the Phase-9 RL recipe alongside PPO / GRPO. See CLI for the full subcommand surface.

Snapshots

rollout-snapshots ships SnapshotterImpl, the Phase-4 implementation of the rollout_core::Snapshotter trait. It owns the persistence path for SnapshotKind::TrainState — the only kind implemented in v1's training story. Three other kinds (Buffer, Process, EpisodicMemory) compile but return Fatal { PluginContract, msg: "Phase N: <kind>" } until their owning phases land (9 / 11 / 8 respectively).

See docs/specs/04-storage-snapshots.md for the authoritative contract and crates/rollout-snapshots/ for the implementation.

Architecture

PolicyAlgorithm                  SnapshotterImpl                  Storage + ObjectStore
─────────────────────            ───────────────────              ──────────────────────
algo.snapshot_save  ──▶  save_train_state(request, dir)
                                │
                                ▼
                         build_deterministic_tar(dir)             (no I/O on substrate yet)
                                │
                                ▼
                         ContentId::of(tar_bytes) = blake3
                                │
                ┌───────────────┴────────────────┐
                ▼                                ▼
   ObjectStore::put_bytes(tar)          Storage::begin().txn
        (returns same ContentId)        put_bytes(snapshot_key, json(Snapshot))
                │                                │
                └────────► tar blob              └────────► snapshot row (namespace=snapshots)

Save returns a Snapshot whose parts[0].content == ContentId::of(tar); restore inverts the pipeline (fetch → blake3-verify → extract).

Metadata layout

A Snapshot row lives at StorageKey { namespace = "snapshots", run_id = Some(run_id), path = [hex(snapshot_id)] } and is JSON-encoded. Spec-04 §5.1 lists the fields:

Field	Type	Purpose
`id`	`SnapshotId`	`ContentId` of the tar bytes (Phase 4 = single part)
`kind`	`SnapshotKind`	`TrainState` in Phase 4; others reserved
`run_id`	`RunId`	Owning run
`created_at`	`DateTime<Utc>`	RFC3339 wire form
`label`	`Option<SmolStr>`	Optional human-readable label (CLI `--label`)
`parts`	`Vec<SnapshotPart>`	One per blob; Phase 4 ships exactly one (`role="tar"`)
`algorithm_id`	`AlgorithmId`	Producing algorithm (`"sft"`, `"rm"`, ...)
`meta`	`serde_json::Value`	Algorithm-internal extras (D-DETERM-05; opaque to core)

JSON (not postcard) is the on-disk encoding because serde_json::Value is a self-describing format that postcard intentionally refuses to encode. The small row size and infrequent writes make the choice cheap, and the row is human-readable on disk for debugging.

Tar reproducibility contract (Pitfall 9)

build_deterministic_tar(&Path) -> Vec<u8> is the byte-stable tar builder. It must produce identical bytes for identical input directories across runs, machines, and tar versions — otherwise the blake3 hash drifts and resumes fail to find their predecessor blob.

The invariants enforced:

Sort entries by path before writing (file-system iteration order is non-deterministic).
HeaderMode::Deterministic zeroes mtime/uid/gid metadata at write time but does not zero mode bits. The mode bits are platform-dependent (macOS gives regular files 0o755 by default).
Explicit header.set_mode(0o644) for files and 0o755 for dirs.
Explicit set_mtime(0), set_uid(0), set_gid(0) on top of HeaderMode::Deterministic.
No compression — gzip / zstd byte streams drift across versions.
GNU header format (Header::new_gnu) — stable across tar releases.

crates/rollout-snapshots/tests/deterministic_tar.rs proves Pitfall 9 holds by parsing every entry header and asserting mode = 0o644 | 0o755.

Restore semantics

restore_train_state(&snapshot, dst_dir) fetches the tar blob via ObjectStore::get_bytes(parts[0].content), verifies blake3(bytes) == parts[0].content, and extracts to dst_dir. A mismatch returns Fatal { PluginContract, msg: "blake3 mismatch on restore: ..." }.

The bare Snapshotter::restore trait method takes a RestoreTarget enum:

Variant	Phase 4 behavior
`SameRun`	Returns `Fatal { PluginContract }` — the trait method has no `dst_dir`; callers use `restore_train_state(snapshot, dst_dir)` directly.
`Fork`	Returns `Fatal { PluginContract, msg: "Phase 9: Fork restore (new_run_id=...)" }`
`Worker`	Returns `Fatal { PluginContract, msg: "Phase 6: Worker restore (worker_id=...)" }`

This is intentional: the Phase-4 surface optimizes for the SFT/RM training loop, which holds the destination directory locally and drives restore_train_state directly. Phase-9 PPO actor-swap adds the multi-worker restore plumbing.

List + prune surface

Snapshotter::list(SnapshotFilter) scans namespace="snapshots", optionally filters by run_id / kind / label_contains, sorts newest-first (created_at descending), and caps by limit. The scan is O(snapshots in namespace); the secondary SnapshotId → key index is deferred to Phase 9.

Snapshotter::prune(PrunePolicy { run_id, retention }) enforces RetentionPolicy:

keep_last: u32 — keep the N most recent snapshots regardless of label.
keep_labeled: bool — labeled snapshots are immune to pruning when true.
max_age: Option<Duration> — anything older than max_age is eligible.

Returns the count of metadata rows deleted. The underlying tar blobs are not deleted in Phase 4 — ObjectStore has no delete method yet (Phase-5 addition). Blobs are content-addressed and idempotent; orphaned ones cost storage but never corrupt a restore.

Algorithm-internal state — `meta: serde_json::Value`

Snapshot.meta is an opaque JSON blob owned by the producing algorithm (D-DETERM-05). The framework never inspects it. SFT might store { "step": 5, "curriculum_cursor": 12 }; RM might store { "epoch": 2, "best_loss": 0.42 }. This keeps algorithm-specific resume state out of the framework-owned snapshot row.

Determinism caveats

CPU mode: byte-identical on the same toolchain across machines.
CUDA same-SM: weights+optimizer bit-identical when the producing and restoring GPU expose the same compute capability. Cross-SM is best-effort.
Cross-machine: best-effort. The tar is reproducible; PyTorch / accelerate restore is not always bit-identical across GPU generations.

Plan 04-05 (backend-vllm-train) lands the determinism CI gate and CPU-mode fallback for environments without CUDA.

Pointers

crates/rollout-snapshots/src/lib.rs — SnapshotterImpl + trait impl.
crates/rollout-snapshots/src/tar_build.rs — deterministic tar builder.
crates/rollout-snapshots/src/kind/train_state.rs — save + restore pipeline.
crates/rollout-snapshots/src/policy.rs — Snapshotter::prune retention enforcement.
crates/rollout-snapshots/src/key.rs — StorageKey helpers for namespace="snapshots".
crates/rollout-storage/src/embedded/tables.rs — T_SNAPSHOTS table definition.
docs/specs/04-storage-snapshots.md — authoritative contract (especially §5 + §5a + §7).

Postgres backend

Phase 4 / TRAIN-04: a Storage impl backed by Postgres 16 alongside the default embedded redb store. Same trait, same semantics — choose at config time. The crate-level Cargo feature postgres on rollout-storage gates the dep set (sqlx 0.8 + uuid + ulid + async-stream); default builds remain sqlx-free.

Why Postgres alongside embedded

The embedded backend (redb) is local-process only. Multi-process or multi-host runs need a shared store that fan-outs change notifications across processes — that's what Postgres gives us via LISTEN/NOTIFY. The trait surface stays uniform: EmbeddedStorage and PostgresStorage both implement Storage::watch_stream, wrapping their respective notification mechanisms in a BoxStream<StorageEvent>.

Method	Embedded	Postgres
`begin / put / get / cas`	redb txn over local fs	sqlx txn over network
`watch` (broadcast)	`tokio::sync::broadcast::Receiver`	unsupported — returns `Fatal(PluginContract)`
`watch_stream`	wraps the broadcast in `BroadcastStream`	`PgListener` over `LISTEN/NOTIFY`

Cross-process subscribers MUST use watch_stream. The embedded backend implements watch_stream by wrapping its in-process broadcast — handy when some callers need a stream-shaped surface even on a local run.

Schema

Two migrations under database/migrations/ are embedded at build time via sqlx::migrate!():

0001_init.sql — the kv table that backs all StorageKey rows.
0002_snapshots.sql — snapshots + events tables consumed by rollout-snapshots (plan 04-01) and the spec-09 observability ledger.

The runs / workers tables defer to Phase 6 (multi-node distribution).

`kv` table

CREATE TABLE kv (
    namespace   TEXT NOT NULL,
    run_id      UUID,                       -- ULID-as-UUID; NULL for global rows
    path        TEXT[] NOT NULL,
    value       BYTEA NOT NULL,
    version     BIGINT NOT NULL DEFAULT 0,
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (namespace, run_id, path)
);

run_id is stored as UUID (16 bytes) — the same byte layout as the underlying ULID. RunId(Ulid) round-trips through Uuid::from_bytes and back without loss.

LISTEN / NOTIFY contract

A row-level trigger on kv fires pg_notify(channel, payload) after every INSERT/UPDATE/DELETE:

channel: rollout_watch_<namespace> (max 63 chars; rollout_watch_ is 14, leaving 49 for the namespace).
payload: <run_id_uuid_or_empty>|<path_parts_joined_by_slash>, truncated to 7999 bytes per Pitfall 5 (Postgres caps pg_notify payloads at 8000).

PgListener consumers parse the payload, filter by prefix (matching the caller's prefix.run_id + prefix.path), and emit a StorageEvent::Put per notification. Put vs delete is not distinguished in the payload format shipped here; Phase 9 may extend the trigger to include a +/- prefix if downstream needs demand it.

PgListener reconnects transparently on connection drop; the watch_stream loop logs the failure and continues.

Trait surface

Storage::watch (broadcast) is intentionally not implemented by the Postgres backend — broadcast is an in-process abstraction. Callers that need a tokio::sync::broadcast::Receiver must run against EmbeddedStorage or implement their own fan-out on top of watch_stream. The Postgres impl returns:

Fatal(PluginContract { plugin: "PostgresStorage",
    msg: "PostgresStorage does not support in-process broadcast watch;
          use watch_stream for cross-process notification" })

Migrations

Migrations are forward-only, sequentially numbered (0001_*.sql, 0002_*.sql, ...). PostgresStorage::new runs them once via sqlx::migrate!("../../database/migrations").run(&pool). The macro embeds every SQL file into the binary at compile time, so the pool only needs network access — no on-disk migration directory in production.

Running new() twice on the same pool is a no-op: sqlx::migrate records applied versions in _sqlx_migrations.

Adding a migration

Write database/migrations/NNNN_<name>.sql.
Start a local Postgres: docker run --rm -e POSTGRES_PASSWORD=pw -p 5432:5432 postgres:16.
(Optional, when we switch to compile-time query! macros) regenerate the .sqlx/ offline cache: DATABASE_URL=postgres://postgres:pw@localhost/postgres SQLX_OFFLINE=false cargo sqlx prepare --workspace -- --features postgres.
Commit the migration AND the regenerated .sqlx/ files (currently the directory only carries a .gitkeep; this plan ships runtime-checked SQL via sqlx::query).

Offline mode

SQLX_OFFLINE=true lives in .cargo/config.toml (not .env) per Pitfall 4 — sqlx-cli reads .env at startup and refuses to talk to the DB during cargo sqlx prepare if SQLX_OFFLINE=true is set there. Putting it in .cargo/config.toml keeps cargo build --features postgres happy everywhere except when explicitly running sqlx prepare.

Phase 4 ships runtime-checked sqlx::query calls; the .sqlx/ directory is reserved (.gitkeep) for a future switch to compile-time query! macros once the surface stabilizes.

Pool sizing

PostgresStorage::new(url, pool_size) opens two pools:

Pool	min	max	acquire	idle
Write pool	0	`pool_size`	30 s	10 min
Watch pool	0	4	30 s	(default)

pool_size = 16 is a reasonable default for a Phase-6 worker; tests use 4. The watch pool is small and dedicated because each PgListener holds a connection for the lifetime of its watch_stream; you don't want them crowding out write capacity.

Testcontainers CI

crates/rollout-storage/tests/postgres_integration.rs runs against a disposable Postgres 16 container via testcontainers-modules. Six tests cover CRUD, CAS atomicity, watch_stream delivery, migration idempotency, pool reuse under load, and prefix scan.

Each test carries #[ignore = "requires Docker / testcontainers"] so the default cargo test --workspace --tests flow (macOS dev loop, no Docker guaranteed) stays green. CI opts in via:

cargo test -p rollout-storage --features postgres \
  --test postgres_integration -- --include-ignored --test-threads=1

--test-threads=1 because the test starts a fresh container per test function; running six simultaneously is wasteful (and slower than serial under most CI runners' Docker capacity).

The retry loop on PostgresStorage::new (30 attempts × 2 s) handles Pitfall 6: the container reports "running" before Postgres accepts connections.

Local dev

make postgres-test

The target checks docker info first and fails fast with a helpful message if Docker isn't running.

Limitations

Put-only events from pg_notify. The trigger emits a single notification format for both INSERT/UPDATE/DELETE; consumers see StorageEvent::Put only. Phase 9 may extend the trigger payload to carry a +/- operation byte if downstream callers need to distinguish deletes.
get_many_bytes is sequential. A future optimization could batch via ANY($1) array binding; not on the Phase-4 critical path.
No streaming scan_bytes. Phase-2 simplification carries forward; prefix scans return owned Vec rows. Hot prefixes (millions of rows) will need a streaming variant in Phase 6.
No query! macro yet. Runtime-checked SQL keeps the build hermetic without a .sqlx/ cache; revisit once the schema stabilizes.

Determinism

Phase 4 of rollout commits to deterministic training-state snapshots (D-DETERM-01): a Snapshot.parts[0].content is the same ContentId whenever the same code drives the same model with the same RNG seed and the same input stream. This page documents the contract, the platform caveats, and every mitigation that lands in rollout-backend-vllm --features train.

The determinism stack

Layer	Mitigation	Where
Process env	`CUBLAS_WORKSPACE_CONFIG=:4096:8`, `PYTHONHASHSEED=0`	`train.py` preamble
RNG seeds	`random` + `numpy` + `torch` + `torch.cuda.manual_seed_all`	`_set_determinism_flags`
PyTorch deterministic	`torch.use_deterministic_algorithms(True)`	`_set_determinism_flags`
cuDNN	`cudnn.deterministic = True`, `cudnn.benchmark = False`	`_set_determinism_flags`
Matmul precision	`torch.set_float32_matmul_precision("highest")`	`_set_determinism_flags`
Chat template	`{% generation %}` markers via `GENERATION_MARKED_QWEN25_TEMPLATE`	`qwen25_chat_template.py`
Dataloader	`torchdata.StatefulDataLoader` if available; step-replay fallback	`init_train`
Accelerator	`accelerator.prepare(model, optimizer, scheduler)`	`init_train`
LR scheduler	`register_for_checkpointing` (Pitfall 10 fallback)	`init_train`
Tar	byte-stable mode bits + sorted walkdir + mtime=0 (Pitfall 9)	`rollout-snapshots::build_deterministic_tar`

Why preamble ordering matters (Pitfall 2)

CUBLAS_WORKSPACE_CONFIG and PYTHONHASHSEED are read by torch on first import. If import torch runs before they are set, the values are baked in and later os.environ mutations do not change behavior. The Rust side enforces ordering by writing the env vars on the worker thread before calling py.import("rollout.backends.vllm.train") — see rollout-backend-vllm/src/train.rs::import_train_module.

train.py repeats os.environ.setdefault(...) at the top of the module defensively: if someone imports it directly from a Python REPL with torch already loaded, the user gets non-determinism, but the source still documents the contract.

cuDNN.benchmark is the silent killer (Pitfall 8)

cuDNN auto-tunes kernel selection by running candidate kernels and picking the fastest. This is non-deterministic — re-running the same workload picks a different kernel under load. cudnn.benchmark = False is required, even when cudnn.deterministic = True. The Phase-4 implementation sets both explicitly in _set_determinism_flags.

LR scheduler must be in the save-state capture path (Pitfall 10)

accelerator.save_state captures the model, optimizer, RNG, and any object explicitly registered for checkpointing. The Phase-4 path prefers passing the scheduler through accelerator.prepare(model, optimizer, scheduler); if that fails (e.g., the scheduler shape isn't supported by accelerate's prepare), the code falls back to accelerator.register_for_checkpointing(scheduler). Without one of these, the LR resumes from lr_start after a snapshot restore instead of continuing the schedule — silent non-determinism.

CPU vs CUDA contract

Hardware	Bit-identical?
Same CPU model, same OS, same env	Yes (the load-bearing CI target)
Same SM (sm_80 vs sm_80), same cuDNN	Yes, with `deterministic + !benchmark`
Different SM (sm_80 vs sm_90)	No — kernels differ
Different cuDNN minor version	No — algorithm shortlist may differ
Cross-machine (CPU A vs CPU B)	Best-effort; FP rounding differs

The snapshot_resume_live.rs live witness exercises the same-CPU case on the dev box. CI runs the MockBackend variant from plan 04-02 (snapshot_resume.rs::bit_identical_resume_at_step_5) for the unconditional green signal.

accelerate.save_state captures

Object	Source	Restored on `load_state`?
Model weights	`accelerator.prepare(model)`	Yes
Optimizer	`accelerator.prepare(optimizer)`	Yes
RNG (torch)	implicit via `torch.random.get_rng_state`	Yes
RNG (cuda)	implicit if CUDA is initialized	Yes
LR scheduler	`prepare(scheduler)` OR `register_for_checkpointing`	Yes
Dataloader	`DataLoaderConfiguration(use_stateful_dataloader=True)`	If torchdata installed

The Phase-4 contract: anything that affects subsequent step output must be in this list. If a future plan adds custom state (e.g., reward-model normalizer running stats), it MUST register via register_for_checkpointing to keep the TRAIN-03 byte-compare proof intact.

torchdata stateful dataloader (Pitfall 3)

init_train probes for torchdata. If present, it constructs a DataLoaderConfiguration(use_stateful_dataloader=True) so the dataloader's position is checkpointed by save_state. Without torchdata, the runtime falls back to step-replay: the resume side re-reads from the JSONL head and skips step rows. Both modes preserve the TRAIN-03 byte-compare invariant on deterministic input streams.

Pitfall 7 — Accelerator singleton

Accelerator() is a singleton per Python process; constructing two of them in the same interpreter raises. init_train is idempotent — the second call returns the cached _STATE dict. teardown_train flushes the singleton via del + gc.collect() + torch.cuda.empty_cache() so the next init_train call (in a subsequent run, or after a mid-process swap-back to vLLM inference) can construct a fresh accelerator.

A bidirectional mid-process swap (training ↔ inference under the same OS thread) is Phase 9 — see the deferral note in rollout-backend-vllm/src/train.rs::run_set_train_mode.

MockBackend vs live HF path

Backend	CPU bit-identical?	When to use
`MockBackend`	Yes (Array1 SGD)	CI; every PR; algo-side tests
Live HF	Yes on identical CPU; same-SM only on CUDA	Dev-box live witness; nightly

The MockBackend path is the load-bearing CI proof for TRAIN-03. The live HF path is the gated witness behind ROLLOUT_TRANSFORMERS_AVAILABLE=1.

Where the code lives

python/rollout/backends/vllm/train.py — determinism preamble + Accelerator construction.
python/rollout/backends/vllm/qwen25_chat_template.py — generation-marked chat template.
crates/rollout-backend-vllm/src/train.rs — env-write-before-import enforcer; py.detach wrappers.
crates/rollout-snapshots/src/tar_build.rs — Pitfall 9 deterministic tar.

Snapshots — Phase-4 snapshot pipeline.
SFT — SFT algorithm + TRAIN-03 byte-compare proof.
CPU mode — running Phase-4 training on CPU only.

`rollout train` + `rollout snapshot` CLI

Phase-4 user-facing entry points for supervised fine-tuning, reward-model training, and snapshot management. Mirrors the Phase-3 rollout infer batch clap derive shape, run_id lifecycle, and backend-selection precedence.

Subcommand overview

Subcommand	Purpose
`rollout train sft`	Supervised fine-tuning. Validates + runs an `SftAlgo` budget.
`rollout train rm`	Reward-model (Bradley-Terry) training via `RmAlgo`.
`rollout snapshot list`	List snapshots (optionally filtered by run / kind).
`rollout snapshot show`	Print one snapshot's metadata by content-id.
`rollout snapshot prune`	Delete snapshots per a retention policy.

`rollout train sft`

rollout train sft \
    --config examples/sft-tiny.toml \
    [--resume <snapshot_id>] \
    [--dry-run]

Flag	Default	Purpose
`--config`	required	Path to the run TOML (schema below).
`--resume`	none	Snapshot content-id to restore from before the algorithm runs.
`--dry-run`	`false`	Validate config + dataset path; never construct backend. Works with no backend feature.

SFT TOML schema

schema_version = 1

[storage]
backend = "embedded"
[storage.embedded]
path = "./rollout.db"

[algorithm]
kind = "sft"

[algorithm.sft]
minibatch_size       = 1
gradient_accumulation = 1

[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[algorithm.sft.optimizer]
kind = "sgd"
lr   = 0.01

[algorithm.sft.budget]
max_steps = 10

[algorithm.sft.dataset]
kind = "jsonl_path"
path = "./data/sft.jsonl"

[algorithm.sft.packing]
kind        = "off"
max_seq_len = 64

[algorithm.sft.loss_on]
kind = "full"

The schema is owned by rollout-core::config::RunConfig; the CLI is just a TOML-loader + clap surface (spec 11 single-source-of-truth). Unknown fields fail load with a deterministic locator.

`rollout train rm`

rollout train rm \
    --config examples/rm-tiny.toml \
    [--resume <snapshot_id>] \
    [--dry-run]

Same flags as sft. The TOML differs only in the algorithm block:

schema_version = 1
[storage]
backend = "embedded"
[storage.embedded]
path = "./rollout.db"

[algorithm]
kind = "rm"

[algorithm.rm]
minibatch_size = 1

[algorithm.rm.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[algorithm.rm.optimizer]
kind = "sgd"
lr   = 0.01

[algorithm.rm.budget]
max_steps = 10

[algorithm.rm.dataset]
kind = "jsonl_path"
path = "./data/pairs.jsonl"

[algorithm.rm.head]
kind = "bradley_terry"

JSONL input contract: each line { "prompt", "chosen", "rejected" }.

`--dry-run` semantics

In order, with the backend NEVER constructed:

Read + parse TOML against RunConfig (deny_unknown_fields).
Match [algorithm].kind against the subcommand (sft vs rm).
Validate minibatch_size >= 1 and optimizer.lr > 0.
Confirm dataset.path exists on disk (for jsonl_path form).
Print dry-run OK: algorithm=<sft|rm> model=… minibatch=… dataset=… and exit 0.

Because --dry-run short-circuits before select_backend, it runs cleanly on builds with NEITHER --features train NOR --features test-mock-backend.

`--resume <snapshot_id>` lifecycle

If --resume <hex> is set, the CLI scans the local rollout.db for a Snapshot whose id matches the supplied content-id (32-byte blake3, 64-char hex). The matching row is fed into algorithm.snapshot_restore(snap) before algorithm.run(&ctx) begins. Determinism then follows the per-algorithm contract: SftAlgo proves byte-identical resume via tests/snapshot_resume.rs::bit_identical_resume_at_step_5; RmAlgo proves parity via tests/snapshot_resume.rs::rm_bit_identical_resume.

Missing snapshot → Fatal(ConfigInvalid) with the failed hex echoed back.

Backend selection

rollout-cli builds with at most one training backend at a time, by Cargo feature. Selection at runtime is precedence-ordered:

Order	Build flags	Env	Backend
1	`--features test-mock-backend`	`ROLLOUT_TEST_MOCK_BACKEND=1`	`rollout_runtime_batch::MockBackend` (deterministic, no GPU/Python).
2	`--features vllm,train`	(any)	`rollout_backend_vllm::VllmBackend` in train mode (live HF / accelerate).
3	none	(any)	`Fatal(ConfigInvalid)` with a build-mode hint; `--dry-run` still works.

Build recipes:

# CI / tests (no Python, no GPU)
cargo build -p rollout-cli --features test-mock-backend

# Production (live HuggingFace + accelerate over Python)
cargo build -p rollout-cli --features vllm,train

# Both — vllm takes precedence at runtime unless ROLLOUT_TEST_MOCK_BACKEND=1.
cargo build -p rollout-cli --features test-mock-backend,vllm,train

The train feature implies vllm per Cargo.toml; the two are kept distinct so the existing Phase-3 infer batch users get vllm without pulling the training Python deps.

`rollout snapshot list`

rollout snapshot list \
    [--storage-path ./rollout.db] \
    [--object-path ./object-store] \
    [--run-id <ULID>] \
    [--kind train_state|buffer|process|episodic_memory] \
    [--limit N]

Output: pretty-printed JSON array of Snapshot rows from rollout-core. Sort order is newest-first by created_at.

Flag	Default	Notes
`--storage-path`	`./rollout.db`	Opens an `EmbeddedStorage` read-write.
`--object-path`	`./object-store`	Opens a `FsObjectStore` (read-only for `list`, but consistently surfaced).
`--run-id`	none	Crockford ULID; restrict to one run. Without it, every run's snapshots are scanned.
`--kind`	none	snake_case match against `SnapshotKind` variants.
`--limit`	none	Cap result length.

`rollout snapshot show <snapshot_id>`

rollout snapshot show \
    [--storage-path ./rollout.db] \
    [--object-path ./object-store] \
    <SNAPSHOT_ID>

SNAPSHOT_ID is the 64-char hex blake3 digest from Snapshot.id. Prints the full Snapshot row as pretty-printed JSON. Missing id → exit 2 with snapshot not found: <id>.

`rollout snapshot prune`

rollout snapshot prune \
    --run-id <ULID> \
    [--storage-path ./rollout.db] \
    [--object-path ./object-store] \
    [--keep-last N=3] \
    [--keep-labeled]

Applies a RetentionPolicy scoped to a single run (--run-id is required to avoid cross-run accidents). --keep-last N retains the N newest snapshots regardless of label; --keep-labeled (default true) further retains every labeled snapshot. Metadata rows are deleted; blob bytes stay in the object store (the Phase-2 ObjectStore trait has no delete — pending Phase-5). Prints pruned <N> snapshots.

Storage path conventions

Path	Owner	Purpose
`./rollout.db`	`EmbeddedStorage`	redb on-disk DB (always-fsync).
`./object-store/`	`FsObjectStore`	Content-addressed two-level sharded FS.
`<config_dir>/rollout.db`	training runs	`train sft
`<config_dir>/object-store/`	training runs	Same sibling-of-config convention.

snapshot list|show|prune accept explicit --storage-path / --object-path so out-of-band tooling and tests can target any directory pair.

Exit codes

Code	Meaning
0	Success (or successful `--dry-run`).
2	Config-invalid, missing dataset / snapshot, substrate error, or algorithm error.

Observability

RUST_LOG=info rollout train sft … produces structured tracing events. Each SFT / RM run emits at minimum train_start, train_step, train_end; the exact fields are pinned by the per-algorithm chapters.

CPU mode

The Phase-4 training surface runs on CPU end-to-end. This is the integration test path on dev boxes (including Apple Silicon) and the smoke recipe target in plan 04-07.

When to use

Local dev loop on a laptop without CUDA.
CI smoke that exercises the full HF transformers + accelerate path against a tiny model (Qwen/Qwen2.5-0.5B-Instruct).
Reproducing CUDA bugs that turn out to be deterministic-flag misconfiguration.

Expected throughput

Model	Hardware	Steps/sec
`Qwen/Qwen2.5-0.5B-Instruct`	Apple M2 Max (CPU)	~0.1–0.3
`Qwen/Qwen2.5-0.5B-Instruct`	Linux x86_64 16-core	~0.3–1.0

Roughly one to ten seconds per step for the 0.5B model. Anything larger is impractical on CPU; the per-token cost grows superlinearly. CPU mode exists to prove the pipeline, not to train.

Required env

None beyond default. The Phase-4 determinism preamble (CUBLAS_WORKSPACE_CONFIG, PYTHONHASHSEED) is written by the Rust side before import torch; CPU runs ignore CUBLAS settings without complaint.

The live tests gate on ROLLOUT_TRANSFORMERS_AVAILABLE=1:

pip install transformers>=4.45 accelerate>=0.34 torch>=2.4
ROLLOUT_TRANSFORMERS_AVAILABLE=1 \
  cargo test -p rollout-backend-vllm --features train \
  --test snapshot_resume_live -- --ignored --nocapture

Performance caveats

No streaming. Phase 4 rejects sampling.stream = true at the boundary (D-BACKEND-03); training has no streaming surface.
No multi-GPU. CPU mode is single-process. The FSDP plugin in init_train only activates when torch.cuda.device_count() >= 2.
Slow. The 0.5B model at one step per ~5 seconds on M-series silicon means a 10-step smoke takes a minute. Plan accordingly.
Determinism still holds. Two CPU runs with the same seed produce byte-identical accelerate.save_state output. The MockBackend variant in rollout-algo-sft::tests::snapshot_resume::bit_identical_resume_at_step_5 proves the Phase-4 contract holds on CPU without HF transformers installed.

Smoke recipe (plan 04-07)

make train-smoke (lands in plan 04-07) runs the live witness on dev boxes where ROLLOUT_TRANSFORMERS_AVAILABLE=1 is set. CI does not install transformers/accelerate; the MockBackend test is the unconditional gate.

Determinism — the determinism contract Phase-4 commits to.
SFT — algorithm-side overview.
Snapshots — snapshot pipeline.

Cloud

The cloud layer lifts rollout's local substrate to real object stores and queues (AWS S3 + SQS, GCP GCS + Pub/Sub) behind the same rollout-core traits. Algorithm crates never see a cloud SDK — a hard dependency-direction lint plus two CI gates (public-api-cloud-leak, forbidden-patterns) keep SDK types out of rollout-core's public API and keep raw metadata URLs / shell=True / libc::fork out of the tree.

Streaming + lease trait methods

Streaming + lease trait methods

Phase 5 extends two rollout-core cloud traits with four new methods. All four ship with backward-compatible default implementations so every v1.0 caller and impl compiles unchanged — only the streaming methods carry a #[deprecated] tag whose sole purpose is to nudge cloud backends into overriding them.

`ObjectStore::put_stream` / `get_stream`

#![allow(unused)]
fn main() {
#[deprecated(note = "Cloud impls MUST override; default buffers entire stream into RAM")]
async fn put_stream(
    &self,
    stream: Pin<Box<dyn AsyncRead + Send>>,
    hint: PutHint,
) -> Result<ContentId, CoreError>;

#[deprecated(note = "Cloud impls MUST override; default buffers entire blob into RAM")]
async fn get_stream(&self, id: &ContentId) -> Result<Pin<Box<dyn AsyncRead + Send>>, CoreError>;
}

The default put_stream reads the whole stream into a Vec<u8> and calls put_bytes; the default get_stream fetches via get_bytes and hands back a Cursor. That is fine for small blobs but OOMs on multi-GiB snapshots (Pitfall 16 / D-SNAP-04), which is why the #[deprecated] warning fires at any call site that did not override the method. Cloud backends (S3 multipart, GCS resumable) and the local FS store override both with true streaming, so the warning never fires from a correct impl.

`Queue::dequeue_with_lease` / `extend_lease`

#![allow(unused)]
fn main() {
async fn dequeue_with_lease(
    &self,
    lease: Duration,
) -> Result<Option<(QueueItemId, Vec<u8>, LeaseToken)>, CoreError>;

async fn extend_lease(
    &self,
    id: QueueItemId,
    token: LeaseToken,
    extend_by: Duration,
) -> Result<(), CoreError>;
}

LeaseToken(Vec<u8>) is an opaque per-backend handle: SQS ReceiptHandle bytes, Pub/Sub ack_id bytes, or — for the in-memory queue — the QueueItemId bytes. The default dequeue_with_lease ignores the lease duration and synthesizes a token from the item id; the default extend_lease returns Recoverable::Transient { hint: Never } so a backend that forgot to override it fails loudly rather than silently dropping the extension. These two are not #[deprecated] — the default fallback is a correct (if conservative) behavior for queues without visibility timeouts.

AWS

rollout-cloud-aws implements the four cloud traits against AWS: S3 (ObjectStore), SQS (Queue), Secrets Manager (SecretStore), and EC2 instance metadata (ComputeHint, IMDSv2-only). It is built behind a Cargo feature so non-AWS builds pull no AWS SDK crates.

Build

cargo build -p rollout-cli --features aws

The aws feature is default-off. A binary built without it that loads an [cloud] provider = "aws" config returns a fatal error telling you to rebuild with --features aws.

Config

[cloud]
provider = "aws"
region   = "us-west-2"

[cloud.aws.s3]
bucket                = "my-rollout-bucket"
prefix                = "runs/"          # optional
multipart_chunk_bytes = 16777216         # 16 MiB; S3 minimum is 5 MiB
max_snapshot_part_bytes = 5368709120     # 5 GiB; 10 GiB hard cap

[cloud.aws.sqs]
queue_url              = "https://sqs.us-west-2.amazonaws.com/123456789012/rollout"
visibility_timeout_secs = 300

[cloud.aws.secrets]
allowlist = ["rollout/hf-token"]

Cross-cloud is structurally impossible: the provider tag is an enum, so a single config cannot name both AWS and GCP.

IAM permissions

S3 object + multipart operations, SQS send/receive/delete/visibility, and Secrets Manager GetSecretValue are required. The full IAM matrix and the mandatory AbortIncompleteMultipartUpload lifecycle rule live in crates/rollout-cloud-aws/docs/bucket-setup.md.

Testing matrix

Tier	What runs	When
unit	error mapping + key layout + IMDSv2 mock handshake	every `cargo test` (no Docker)
emulator (`cloud-emulator-aws`)	full S3 + SQS + Secrets Manager conformance against localstack	every PR — always-on
live (`cloud-live-aws`)	same suite against real AWS via OIDC	nightly (cron)

Locally, bring up the emulator and run the ignored conformance suite:

docker compose -f docker-compose.test.yml up -d
LOCALSTACK_ENDPOINT=http://localhost:4566 \
AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test AWS_REGION=us-east-1 \
cargo test -p rollout-cloud-aws --features aws --tests -- --include-ignored --test-threads=1

Common errors

Throttled / SlowDown / 503 — surfaces as Recoverable::Throttled with a backoff RetryHint; the SDK retries internally and the snapshotter retries on top. Streaming ContentId is stable across retries because chunks are hashed before each UploadPart.
IMDSv1 disabled (HttpTokens=required) — handled transparently: the SDK performs the IMDSv2 PUT /latest/api/token handshake. No raw metadata IP is ever used.
Secret not in allowlist — Fatal::ConfigInvalid; add the name to [cloud.aws.secrets].allowlist. SecretStore::put is read-only in v1.1.
Orphan multipart uploads — best-effort aborted on drop; the bucket lifecycle rule reclaims any that leak on SIGKILL.

GCP

rollout-cloud-gcp implements the four cloud traits against GCP: Cloud Storage (ObjectStore), Pub/Sub (Queue), Secret Manager (SecretStore, read-only), and the GCE metadata server (ComputeHint). It is built behind a Cargo feature so non-GCP builds pull no GCP SDK crates.

Build

cargo build -p rollout-cli --features gcp

The gcp feature is default-off. A binary built without it that loads a [cloud] provider = "gcp" config returns a fatal error telling you to rebuild with --features gcp. The aws and gcp features compose — --features aws,gcp builds a binary that dispatches on the TOML [cloud].provider value.

Config

[cloud]
provider = "gcp"
project  = "my-gcp-project"

[cloud.gcp.gcs]
bucket                  = "my-rollout-bucket"
prefix                  = "runs/"          # optional
resumable_chunk_bytes   = 16777216         # 16 MiB; 5 MiB minimum
max_snapshot_part_bytes = 5368709120       # 5 GiB; 10 GiB hard cap

[cloud.gcp.pubsub]
topic            = "rollout-work"
subscription     = "rollout-workers"
ack_deadline_secs = 30

[cloud.gcp.secrets]
allowlist = ["hf-token"]

Cross-cloud is structurally impossible: the provider tag is an enum, so a single config cannot name both AWS and GCP.

IAM / Workload Identity Federation

GCS object admin, Pub/Sub publisher + subscriber, and Secret Manager secretAccessor are required; GCE compute.viewer is needed only when running on GCE/GKE. The full role matrix and the (informational) lifecycle policy live in crates/rollout-cloud-gcp/docs/bucket-setup.md. The cloud-live-gcp CI job authenticates via Workload Identity Federation (WIF) — no long-lived service-account keys.

Testing matrix

Tier	What runs	When
unit	error mapping + key layout + Secret Manager allowlist + MDS mock	every `cargo test` (no Docker)
emulator (`cloud-emulator-gcp`)	GCS + Pub/Sub conformance against fake-gcs-server + pubsub-emulator + in-test Secret Manager mock	every PR — always-on
live (`cloud-live-gcp`)	same suite against real GCP via WIF	nightly (cron)

Locally, bring up the emulators and run the ignored conformance suite:

docker compose -f docker-compose.test.yml up -d
STORAGE_EMULATOR_HOST=http://localhost:4443 \
PUBSUB_EMULATOR_HOST=localhost:8085 PUBSUB_PROJECT_ID=rollout-test \
cargo test -p rollout-cloud-gcp --features gcp --tests -- --include-ignored --test-threads=1

Emulator delta

Some production semantics are not reproducible on the emulators (resumable upload status, time-based Pub/Sub redelivery, real fault injection). Those tests are #[ignore]d locally and run in cloud-live-gcp. There is no first-party Secret Manager emulator; the Secret Manager conformance tests use an in-test hyper mock and run Docker-free on every build. See the crate README for the full delta table.

Common errors

Throttled / ResourceExhausted / 429 / 503 — surfaces as Recoverable::Throttled with a backoff RetryHint. The streaming ContentId is stable across retries because chunks are hashed before each resumable upload chunk.
Spot / preemptible reclaim — ComputeHint::preemption_signal reads instance/preempted from the GCE metadata server and reports a ~30s lead.
Secret not in allowlist — Fatal::ConfigInvalid; add the name to [cloud.gcp.secrets].allowlist. SecretStore::put is read-only in v1.1.
Orphan resumable sessions — never persisted across processes (Pitfall #5); GCS auto-expires incomplete sessions after 7 days, and a preempted worker re-uploads from byte 0 idempotently.

Cloud-backed snapshots

Training-state snapshots stream to whichever object store your [cloud] block selects. rollout-snapshots takes an injected Arc<dyn ObjectStore>, so the same SnapshotterImpl works unchanged over the local filesystem, S3, or GCS — only the injected store differs (CLOUD-03).

Configuration

CloudConfig is a #[serde(tag = "provider")] enum, so the provider's fields live directly under [cloud]. A single TOML cannot name two providers — cross-cloud single-run is structurally impossible (D-XPROV-02).

See examples/sft-tiny-aws.toml and examples/sft-tiny-gcp.toml for the minimal [cloud] flip from examples/sft-tiny.toml:

# AWS
[cloud]
provider = "aws"
region = "us-west-2"

[cloud.s3]
bucket = "rollout-snapshots-prod"
prefix = "sft-tiny/"

# GCP
[cloud]
provider = "gcp"
project = "rollout-prod-123"

[cloud.gcs]
bucket = "rollout-snapshots-prod"
prefix = "sft-tiny/"

Streaming semantics

A snapshot is a deterministic tar of the accelerate-style state directory (weights + optimizer + RNG + step), content-addressed by blake3. The upload path:

Builds the tar deterministically (stable file order + zeroed mtime) so the same state always produces the same bytes.
Hashes each chunk with blake3::Hasher before the SDK call, so the resulting ContentId is stable across SDK retries (S3 multipart / GCS resumable — Pitfall #16).
Uploads to a temp/pending-<ulid> key (S3 multipart upload / GCS resumable session).
Server-side copies the temp object to the sharded content-addressed key <prefix>cas/<ab>/<cd>/<hex> (identical layout on FS, S3, and GCS), then deletes the temp.
On failure the temp upload is aborted (S3 MultipartGuard Drop) or expires via the bucket's 7-day lifecycle rule (GCS); no orphaned partial blob is ever read.

Restore fetches the blob by ContentId, re-verifies blake3, and extracts the tar — a mismatch is a hard Fatal error, never a silent partial restore.

Byte-identical resume

The CLOUD-03 acceptance criterion — byte-identical SFT resume holds over the cloud streaming path — is witnessed by two always-on tests:

bit_identical_resume_at_step_5_via_s3 (localstack-backed S3ObjectStore),
bit_identical_resume_at_step_5_via_gcs (fake-gcs-server-backed GcsObjectStore).

Each snapshots a MockBackend SFT run at step 5, restores off the cloud round-trip, runs five more steps, and asserts the final weights are byte-equal to a ten-step uninterrupted run. They run on every CI PR via the cloud-emulator-aws / cloud-emulator-gcp jobs — no GPU, no live cloud creds.

Cross-provider portability

Snapshots are content-addressed by blake3, so the same bytes produce the same ContentId on any provider. To migrate a snapshot from S3 to GCS, an operator copies the blob (rollout does not automate cross-provider transfer in v1.1):

# Operator-managed transfer between buckets:
aws s3 cp s3://aws-bucket/cas/ab/cd/<hex> /tmp/blob
gsutil cp /tmp/blob gs://gcs-bucket/cas/ab/cd/<hex>

The restore code path on either provider takes a SnapshotId and reads by ContentId; the provider is whichever ObjectStore is injected per [cloud].provider. The runnable witness is crates/rollout-snapshots/tests/snapshot_resume_s3_to_gcs_via_manual_copy.rs: it saves via S3, copies each blob into a GCS bucket asserting the ContentId is identical across providers, then restores + resumes on GCS byte-for-byte (D-XPROV-01).

Active-active cross-cloud single run is out of scope in v1.1 (PROJECT.md); the tagged-enum CloudConfig makes a config naming both [cloud.s3] and [cloud.gcs] structurally un-representable.

rollout cloud doctor

Operator pre-flight tool that exercises all four cloud traits (object store, queue, secret store, compute hint) against either AWS or GCP before a real training job runs. Addresses CLOUD-04 (D-DOCTOR-01..04).

Usage

# Build with the provider feature(s) you need.
cargo run -p rollout-cli --features aws -- cloud doctor --provider aws --config examples/sft-tiny-aws.toml
cargo run -p rollout-cli --features gcp -- cloud doctor --provider gcp --config examples/sft-tiny-gcp.toml --format json

Config source is the TOML [cloud] block only (D-DOCTOR-04) — there are no --bucket/--queue/--secret-id flag overrides in v1.1. The --provider flag MUST match the [cloud].provider in the TOML, or doctor exits 2.

Checks (in order)

reachability — TCP + TLS handshake to the service endpoint (s3.<region>.amazonaws.com / storage.googleapis.com). Surfaces DNS / firewall issues distinctly from auth failures.
auth — credential-chain probe (a cheap metadata read that requires the resolved credentials). Catches broken AWS credential chains / GCP ADC.
object_store — small payload PUT + GET roundtrip on the configured bucket.
queue — enqueue → dequeue_with_lease(30s) → ack on the configured queue.
secret_store — read the FIRST allowlisted secret ([cloud.*.secrets].allowlist). An empty allowlist is reported as a failure with remediation guidance.
compute_hint — inventory() + preemption_signal() probe (returns Ok(None) off a cloud instance).
content_id_roundtrip — a 64 MiB random buffer through put_stream + get_stream + blake3 verify. Forces the multipart / resumable path; catches blake3-streaming bugs (Pitfall 16 / D-SNAP-04).

Wall-time target: ~5-10s on a healthy environment.

Exit codes (D-DOCTOR-03)

0 — all checks pass.
1 — at least one check failed (use --format json to see which).
2 — invocation / config error (provider mismatch, missing TOML, malformed schema).

Plays well with shell &&:

rollout cloud doctor --provider aws --config production.toml && \
  rollout train sft --config production.toml

Output formats (D-DOCTOR-02)

--format human (default): colored steps with ✓/✗ icons + per-check latency + a N pass, M fail — total <ms> summary line.

--format json: machine-readable; matches the schema in crates/rollout-cli/src/commands/cloud/doctor/output/json.rs:

{
  "checks": [
    { "name": "reachability", "status": "pass", "latency_ms": 142 },
    { "name": "queue", "status": "fail", "latency_ms": 31, "error": "enqueue: ..." }
  ],
  "summary": { "pass_count": 6, "fail_count": 1, "total_latency_ms": 5443 }
}

error is omitted on passing checks.

Limitations (v1.1)

Config-file-only (D-DOCTOR-04); no --bucket/--queue/--secret-id overrides.
One comprehensive mode (D-DOCTOR-01); no --quick/--deep tiers.
Cross-cloud (both [cloud.aws] and [cloud.gcp] in one TOML) is structurally impossible — CloudConfig is a #[serde(tag = "provider")] enum (D-XPROV-02).

CI coverage

The doctor_smoke integration tests run on every PR:

cloud-emulator-aws runs doctor_smoke_aws_* against localstack with pre-created bucket / queue / secret (exit 0 all-pass, exit 1 unreachable, human + JSON shape).
cloud-emulator-gcp runs doctor_smoke_gcp_* against fake-gcs-server + pubsub-emulator with pre-created bucket / topic / subscription.
The config-layer tests (provider mismatch → exit 2, malformed config → exit 2) and the --help golden run Docker-free on every PR.

Distribution

How rollout runs a single run across many machines: the single-coordinator lease, work-stealing, stateless-replayer restart, spot-drain, and split-brain fencing.

Multi-node

Multi-node distribution

rollout runs a single training/inference run across many machines from one coordinator. This chapter is the operator's view of how that works: the lease that keeps exactly one coordinator alive, the work-stealing that keeps workers busy, how a coordinator restart is invisible to progress, and how a spot preemption drains gracefully. The implementation lives in rollout-coordinator; the contract is spec 05 — Distribution.

Coordinator + worker model

A run has exactly one coordinator at any time and N workers. The coordinator owns the work ledger and brokers all coordination; workers never talk peer-to-peer (one mutual-TLS edge per worker instead of N²). Workers long-poll the coordinator for work, run it on their backend (vLLM in production, a mock backend in the smoke), and submit content-addressed results. Submission is idempotent on the content-addressed WorkItem.id, so a retried item never produces duplicate output.

Single-row CAS lease + monotonic epoch

"Exactly one coordinator per run" is enforced by a single-row compare-and-swap lease (StorageLease) over the coordinator_lease storage row. One impl rides the dual-backed StorageTxn::cas_bytes, so the SAME lease semantics hold on both the embedded redb backend and Postgres (D-LEASE-01) — the Postgres path is proven in the postgres-integration CI lane (postgres_lease.rs).

Acquire — a fresh coordinator CASes the empty row to epoch = 0.
Steal-on-expiry — if the lease TTL has passed, a new coordinator CASes the exact prior (expired) bytes to epoch + 1. The epoch is monotonic: it only ever advances.
Renew — the incumbent heartbeats by CASing its own bytes forward, keeping the epoch constant. A renew that finds the epoch has advanced under it returns false — the incumbent has been fenced.

The lease TTL equals coordinator_failure_timeout; the renew cadence is strictly shorter than the TTL. Every winning claim stamps the authoritative epoch row in the same transaction, so the lease epoch and the ledger epoch never diverge.

Work-stealing

When the global queue is empty but a peer is still busy, an idle worker steals work — coordinator-mediated, never peer-to-peer (rollout-coordinator::{ledger, steal}):

Trigger — a worker steals only when its local backlog drains to empty (backlog(thief) == 0); a non-idle thief's request is a no-op.
Victim — the coordinator picks the busiest peer by Running backlog.
Batch — it reassigns ceil(victim_backlog / 2) items, capped at MAX_STEAL_BATCH = 32.
Reassign — each item is moved via a two-step CAS over the same prior Running(victim) bytes: try_repending (Running → Pending) then try_claim (Pending → Running(thief)).

Stolen-then-reclaimed items never double-execute. If the victim's ack races the steal, both CAS the same expected bytes and exactly one wins — the loser sees stale bytes and is dropped. This single-winner property is witnessed every commit, Docker-free, by concurrent_ack_and_steal_no_double_execute.

Coordinator restart (stateless replayer)

The coordinator holds no run state in memory that it cannot rebuild from storage. On boot it: wins the lease, adopts the advanced epoch, then scan_byteses the work ledger and reconstructs its in-flight assignment map:

Running{worker} rows are reconstructed in-flight — NOT requeued, because the worker may still hold the item (only the failure-scan stale path re-pends after coordinator_failure_timeout);
Pending rows go back onto the dispatch queue;
Done / Failed rows are terminal and skipped — replayed acks are idempotent.

So a coordinator restart is invisible to progress: a fresh coordinator boots over the same storage and the run completes with zero duplicate sample IDs. This is witnessed every commit by coord_restart_no_duplicates (SC2).

Spot-drain (graceful preemption)

When a worker's cloud reports a spot preemption, the drain state machine (rollout-coordinator::drain) runs within a conservative deadline: set a stop-pull flag → nack each in-flight item (back to Pending for another worker) → opportunistically snapshot TrainState only if the remaining budget covers the cost → deregister → exit cleanly.

Two numbers, kept distinct (D-SPOT-01/04):

Provider	Notice lead (cloud gives)	Drain deadline (we target)
AWS	120s	60s
GCP	30s	15s

The notice lead is what the cloud promises; the drain deadline is the budget the state machine targets, leaving margin. The preemption signal is consumed only through the ComputeHint trait — the coordinator never imports a cloud SDK (coord ↛ cloud; the dependency-direction lint enforces it). Witnessed every commit by spot_drain_completes_within_lead_time (SC3), for both AWS and GCP.

Split-brain fencing

If a coordinator is deposed (its lease was stolen and the epoch advanced), it self-fences: it emits exactly one coordinator_fenced observability event (never a shared-store write) and then std::process::abort()s within 5s. Workers reject any RPC response carrying a stale coord_epoch (EpochGuard), so a deposed coordinator's late replies can never corrupt the run. The abort edge is the hidden rollout-coordinator test-fence subcommand, driven by the SC4 subprocess witness fence_aborts_within_5s.

Operator recipe: the 3-node smoke

make smoke-3node-aws / make smoke-3node-gcp boot 1 coordinator + 3 workers (mock backend, no GPU, no Docker) over an auto-generated dev CA, drive the work ledger through dispatch + a real steal, and assert the run reports done within 30s with a steal observed:

make smoke-3node-aws      # local-transport wiring run (free, Docker-free)
make smoke-3node-gcp

By default the smoke runs over the local mTLS transport so it is reproducible on a free runner. To run the same topology over the real cloud transport (operator path; needs real AWS/GCP credentials and ~4 hosts):

ROLLOUT_SMOKE_CLOUD=1 make smoke-3node-aws

The live-cloud run is the operator-optional path; the every-commit named witnesses (coord_restart_no_duplicates, concurrent_ack_and_steal_no_double_execute, split_brain_old_coord_self_fences, spot_drain_completes_within_lead_time) are the load-bearing gate and run Docker-free on every commit:

cargo test -p rollout-coordinator

Harnesses

Three algo-layer crates that v1.2 PPO/GRPO consume via trait objects:

rollout-harness-text (HARNESS-01) — a text-completion RL environment.
rollout-harness-tool (HARNESS-02) — six sandboxed tools behind a layered best-effort Linux sandbox.
rollout-harness-eval (HARNESS-03) — bundled MMLU / IFEval / GSM8K scorers mirroring lm-evaluation-harness, driven by the rollout eval CLI.

Three traits, one principle (spec 07 §1)

The three harness kinds share one design rule: every method is batched. EnvHarness::reset/step/close, ToolHarness::invoke, and EvalHarness::run all take and return vectors so the GPU stays saturated — the harness never forces a one-at-a-time round trip.

Each trait is constructed the same way: from_settings(settings, deps) where deps: HarnessDependencies injects the substrate handles (plugin host, object store, storage, queue, event emitter, clock). The struct is the stable seam that keeps the traits unchanged as later phases add capabilities.

Dependency direction

The harness crates are algo-layer: they depend down onto rollout-core traits and the shipped substrate, never up or sideways onto the cloud / transport crates. The dep_direction_invariants_hold lint (14 invariants) enforces this — rollout-harness-{text,tool,eval} are in the ALGO_AND_ABOVE set that may not reach the cloud crates.

Local-first constraint

Every harness is testable locally without cloud credentials or a GPU:

the env reward path runs against a mock plugin host;
the tool sandbox compiles to a macOS dev stub and enforces on the Linux lane;
the eval suites run against offline SHA-pinned fixtures (HF_OFFLINE=1) with a GPU-free MockEvalBackend.
Env harness
Tool sandbox
Eval suites
CLI: rollout eval

Env harness (HARNESS-01)

rollout-harness-text is the bundled EnvHarness: a text-in / text-out environment where Observation = prompt and Action = completion.

Batched lifecycle

reset(Vec<Prompt>)      -> Vec<Episode>       # open episodes
step(Vec<EpisodeStep>)  -> Vec<StepResult>    # one turn each
close(Vec<EpisodeId>)   -> ()                 # release episode state

All three methods are batched (spec 07 principle 2). Episode state lives in an in-memory store keyed by EpisodeId (a fresh ULID per reset).

Multi-turn capable (D-ENV-01)

The step loop supports N steps per episode. The bundled text env is single-turn text completion, but the contract carries EpisodeStep + per-episode state across multiple steps so v1.2 conversational / tool-using environments need no contract change.

Reward via the plugin host (D-ENV-03)

There is no built-in reward trait. On step, if a reward plugin is configured, the env encodes a postcard RewardInput { prompt, completion }, calls PluginHost::call(handle, "score", payload), and decodes a Reward. A decode failure surfaces as a typed Fatal(PluginContract).

This reuses the Phase-2 rollout-plugin-host (cdylib / PyO3 / sidecar) — the env crate never embeds scoring logic.

No trajectory persistence in v1.1 (D-ENV-02)

step returns StepResults in memory. There is no ObjectStore persistence of trajectories in v1.1; the content-addressed Trajectory type + serializer lands with RL-03 in v1.2.

Witnesses

EchoEnv — canned reward, no plugin (the trivial happy path).
MockRewardEnv — wires a deterministic mock plugin handle to exercise the plugin-host reward path end-to-end.
env_deterministic_replay — a seeded SplitMix64 RNG (no rand dep) folds a nonce into StepResult.info so the seed materially changes the trajectory; same seed reproduces the trajectory (observation, reward, done, info), a different seed diverges. (episode_id is normalized out of the equality check since it is a fresh ULID per reset.)

All three are GPU-free and need no cloud credentials.

Tool sandbox (HARNESS-02)

rollout-harness-tool exposes six tools — python_exec, shell, file_read, file_write (Linux-enforced) and http_get, http_post (platform-independent, SSRF-filtered) — behind a best-effort layered Linux sandbox.

The crate README.md is the single source of truth for the matrix below; this chapter summarizes it.

Threat-model boundary (D-TOOL-08)

Tool harnesses defend against accidental damage; they are NOT a security perimeter for actively malicious code.

The sandbox is process-isolated, NOT VM-isolated. It is a real boundary against accidental escape and run-away resource use — a buggy tool, a runaway script, a path-traversal mistake, an SSRF to cloud metadata. It is not a production-grade defense against an adversary actively trying to break out. gVisor / Firecracker microVM isolation is a v1.2+ capability, explicitly out of scope. To run actively-hostile code, run this harness inside a microVM or a throwaway host.

Sandbox-depth matrix

Layer	Mechanism	Linux	macOS	Defeats
Namespaces	`rustix`/`unshare` user+pid+net+mount+uts+ipc	yes (best-effort)	stub	ambient process/network visibility; net ns = default-deny egress for exec tools
Resource limits	`rustix::process::setrlimit` (CPU/AS/NOFILE/NPROC)	yes (always)	stub	CPU spin, memory blow-up, fd/pid exhaustion
cgroups v2	`memory.max` + `pids.max` in a delegated subtree	yes, degrade-with-warning	stub	memory/pid exhaustion (defense-in-depth over rlimits)
Filesystem	`landlock` kernel FS allowlist	yes, fail-closed on kernel ≥ 5.13	stub	reads/writes outside the per-invocation tempdir
Syscalls	`seccompiler` deny-default-`EPERM` curated allowlist, installed LAST	yes (always)	stub	`mount`/`keyctl`/`bpf`/`ptrace`/`socket` + `clone*`/`unshare` with `CLONE_NEWUSER`
File-tool FS root	`cap-std::fs::Dir` rooted at the tempdir	yes	yes	path traversal / TOCTOU for `file_read`/`file_write`
HTTP egress	SSRF-filtered hyper connector (post-DNS IP filter + IP pin + per-redirect re-filter)	yes	yes	SSRF / DNS-rebinding / redirect-to-IMDS / RFC1918 / loopback

Layer order (exec tools)

After unshare, the child applies, in order: setrlimit → landlock → seccomp LAST → execve. seccomp is last because its own setup syscalls (landlock_*, prctl, setrlimit) must be permitted; landlock precedes seccomp because landlock enforcement needs PR_SET_NO_NEW_PRIVS + landlock_restrict_self.

The curated seccomp allowlist is deny-by-default with action ERRNO(EPERM) (not KILL) so a blocked tool returns a typed error and the harness can emit a seccomp-violation event rather than segfaulting. The allowlist is post-2020 syscall-aware (clone3, openat2, faccessat2, rseq, arch_prctl, the rt_sig* family) — validated on the Linux CI lane against a real strace -c baseline plus the seccomp_python_runs positive witness.

Fail-closed kernel gate (D-TOOL-02)

require_landlock = true by default. On kernel < 5.13 the harness refuses to start: Fatal::ConfigInvalid("landlock requires kernel >= 5.13, found {n}"). RHEL 8 (4.18) / Amazon Linux 2 (5.10) must set require_landlock = false to run with reduced isolation (rlimits + seccomp + namespaces still apply). No silent fallback.

macOS dev stub (D-TOOL-05)

On macOS the exec/file sandbox compiles to a stub; a sandboxed invoke returns Fatal::ConfigInvalid("sandbox unavailable on macOS — dev stub"). Linux is the only enforced surface. The HTTP tools DO run on macOS — their SSRF defense is an in-process hyper connector, not a syscall sandbox, so the http_tool_blocks_* witnesses run on both lanes.

HTTP SSRF connector

http_get / http_post run in-process with a custom hyper connector (not reqwest, whose redirect-follow gives no per-hop injection point). The connector resolves DNS itself, rejects private/link-local/loopback/IMDS IPs, pins the chosen IP for the connection (defeating DNS rebinding), and re-applies the filter on every redirect (defeating redirect-to-IMDS). Exec tools run in a deny-all net namespace with socket blocked — the network surface is deliberately separate.

Eval suites (HARNESS-03)

rollout-harness-eval bundles three scorers that mirror lm-evaluation-harness at a pinned version (LM_EVAL_VERSION = "lm-eval-harness-v0.4.9", cited in the scorer rustdoc). The pin removes scoring ambiguity: published MMLU/GSM8K conventions diverge, so the authoritative convention is declared and version-pinned.

MMLU — report both `acc` and `acc_norm` (D-EVAL-03)

acc — predicted answer = argmax over the four choice continuations' total log-likelihood; correct iff it matches the gold letter (A–D). Raw exact-match, temperature = 0.
acc_norm — same argmax, but each choice's log-likelihood is length-normalized before the argmax.

Both are reported (lm-eval's headline pair). Prompt format follows lm-eval's mmlu task default (A./B./C./D. + Answer:).

IFEval — strict, non-language (D-EVAL-04)

Pure-Rust strict checks for the verifiable non-language constraints (word / sentence count, JSON format, bullet count, keyword existence/frequency, case, placeholders). Language-detection constraints (language:*) are SKIPPED — no langdetect/langid dependency — and a load-time warning fires: "IFEval: skipping N language-detection constraints (unsupported in v1.1)". A prompt whose instructions are all skipped is excluded from both strict denominators; a partially-skipped prompt is scored on the remainder. Reported metrics: inst_strict_acc, prompt_strict_acc.

GSM8K — `####` numeric extraction (D-EVAL-04)

Gold answer = the number after the final #### (strip ,/$). Model answer = the last match of the lm-eval gsm8k filter regex (-?[$0-9.,]{2,})|(-?[0-9]+) in the generation. Score = exact numeric equivalence, temperature = 0.

Datasets: offline-default (D-EVAL-01)

HF_OFFLINE=1 is the test default. Loaders read vendored SHA-pinned 10-row parquet fixtures under crates/rollout-harness-eval/tests/fixtures/, so eval_score_matches_lm_eval_harness (≤1% parity) and eval_loader_works_with_no_network run on every commit with zero network. Each fixture has a hardcoded blake3 constant that fails loudly on drift.

Online runs (HF_OFFLINE=0) download full splits via hf-hub (pure-Rust, rustls — never openssl) and persist to the v1.0 ObjectStore under a ContentId hash-checked cache. google/IFEval has stricter anonymous rate limits — set HF_TOKEN for full-split runs.

Eval as a WorkQueue job (D-EVAL-05)

One example = one queue item, reusing the Phase-6 rollout-coordinator::work_item CAS state machine (Pending → Running → Done{result_id}, single-winner via try_claim/try_complete). The item id is blake3(suite, version, idx, model_id) — idempotent, so re-enqueue is a no-op. Per-task determinism comes from a fixed seeded order + temperature = 0; MockEvalBackend makes the whole path GPU-free.

The aggregate EvalReport is persisted to the eval_reports storage namespace (postcard row) and content-addressed as a full blob in the object store.

This is eval execution as a job — NOT the eval gate (pause training → eval → continue/stop), which lands with HARNESS-04 in v1.2.

CLI: rollout eval

rollout eval is a top-level subcommand (D-EVAL-02), sibling to infer / train / snapshot. It runs a bundled eval suite against a checkpoint and emits per-task scores plus aggregate metrics.

rollout eval --suite mmlu  --checkpoint <snapshot-id>
rollout eval --suite gsm8k --checkpoint <snapshot-id> --format json
rollout eval --suite ifeval --checkpoint <snapshot-id> --dry-run

Flags

Flag	Meaning
`--suite <mmlu\|ifeval\|gsm8k>`	which bundled suite to run
`--checkpoint <snapshot-id>`	resolved from local storage to a `ModelRef`; if no snapshot row matches, treated as a direct content-id pin
`--config <toml>`	optional (reserved for future eval settings)
`--storage-path` / `--object-path`	local embedded storage + object store roots
`--seed`	deterministic sampling/task-order seed
`--dry-run`	validate args + resolve the checkpoint, but construct no backend
`--format json`	pretty-printed `EvalReport` (default)

Checkpoint resolution

--checkpoint parses as a ContentId. If a Snapshot row in local storage has that id, its "tar" part's content-id pins ModelRef.content_id (spec 04). Otherwise the value is used directly as a content-id pin. --dry-run short-circuits before any backend is built, so it works on a build with neither backend feature.

Spec-08 reconciliation (D-EVAL-02)

Earlier drafts of docs/specs/08-cli.md listed rollout infer eval. Phase 7 reconciled the spec: that form is removed and rollout eval is documented as a top-level subcommand. The spec's rollout plan "Harness DAG: acyclic, 3 nodes" line is footnoted as a v1.2 concern (D-CORE-02) — DAG validation is not implemented in v1.1, which ships the three harnesses standalone.

Backend selection

Mirrors rollout infer batch: with the test-mock-backend feature + ROLLOUT_TEST_MOCK_BACKEND=1 the GPU-free MockEvalBackend runs the offline fixtures; the live backend path lands with the full-split download wiring.

Examples

This page is reserved for the v1 working-model recipe (SHIP-03 hardened).

v1 cannot ship without at least one end-to-end recipe (make example or cargo run --example) that takes a real small open-weights model, runs SFT or PPO, completes on commodity hardware, is exercised by nightly CI, and is documented here. See AGENTS.md §9.4.

The recipe lands progressively: Phase 4 (SFT stub) → Phase 9 (real recipe) → Phase 12 (polished docs).