Introduction
A high-performance, multi-node reinforcement-learning framework for large language models. Written in Rust. Pluggable in Python.
rollout is a Rust-core reinforcement-learning framework for large language models. It supports PPO, GRPO, DPO/IPO/KTO, SFT, and reward-model training across training, batch inference, and online inference modes, with multi-node distribution from day one. AWS and GCP are first-class infra targets; vLLM is the default inference backend; plugins can be authored in Python or Rust.
| Layer | What it owns |
|---|---|
| Algorithms | SFT · RM · PPO · GRPO · DPO / IPO / KTO |
| Substrate | Coordinator + workers + plugin host (PyO3 / sidecar RPC) |
| Storage / Cloud | Embedded · Postgres · S3 · GCS — AWS/GCP behind a layered trait |
See Architecture for the full layered breakdown.
Architecture
See the canonical architecture doc: ARCHITECTURE.md.
Substrate (Phase 2)
Phase 2 lights up the local substrate — the layer that lets a worker start, store state, talk to a peer, load a plugin, and shut down cleanly without touching any cloud. The per-crate chapters land in plan 02-07; this page is the section landing.
What ships in Phase 2
rollout-proto— ownstransport.proto(heartbeat / control / work) andplugin.proto(sidecar).tonic-buildruns in this crate'sbuild.rs; every other crate consumes the generated code.rollout-storage— implementsStorage+StorageTxnon top ofredb. In-processtokio::sync::broadcastper-prefix forwatch(). Always-fsync. Default path./data/rollout.db.postcardvalue encoding.rollout-cloud-local— implementsObjectStore(content-addressed sharded FS under./data/object-store/),Queue(RAM hot path + spill torollout-storagefor restart replay),SecretStore(env-var allowlist, read-only), andComputeHint(Linux full via/proc+nvml-wrapper; macOS minimal stub viasysinfo).rollout-transport— implements gRPC with three logical channels (heartbeat / control / work). HTTP/2 +rustlsis the plan-of-record; QUIC viatonic-h3is feature-flagged EXPERIMENTAL.rollout-plugin-host— implementsPluginHostwith all three modes wired: Rust cdylib, PyO3 in-process, Python sidecar (gRPC-over-UDS).rollout-coordinator— minimal binary that registers workers, accepts heartbeats, persists worker registry + heartbeat ledger toStorage, and marks workers failed via deadline-based scan.- Smoke test —
make smoke+scripts/smoke.shspawn 1 coordinator + 2 workers + 1 cdylib + 1 Python sidecar, killw1, and assert deadline detection within2 × heartbeat_interval.
Plan-of-record vs stretch
HTTP/2 over tonic + rustls is the plan-of-record transport. QUIC via
tonic-h3 v0.0.x is feature-flagged
EXPERIMENTAL (it is currently pre-1.0 and bidi-streaming is not documented as
production-ready). The same .proto schema covers both; the swap-to-QUIC is a
single-config change in a later phase.
Trait surface
All trait contracts live in rollout-core. The Wave-0 extensions
(plan 02-00) align the rollout-core traits with the spec versions:
Storage/StorageTxn— see spec 04 §2. Phase 2 simplifiesscanto return ownedVecinstead ofBoxStream(object-safety +async_traitconstraint).PluginHost— see spec 03 §4–§5. Phase 2 usesVec<u8>payloads incall(); richer typed-payload helpers ship in later phases.Worker/Coordinator— see spec 01 §2.Worker::init/readyland in Phase 2;WorkerContextstays a unit struct until Phase 6.ObjectStore/Queue/SecretStore/ComputeHint— see spec 06 §3.Queue::ack/nack,ObjectStore::exists,ComputeHint::preemption_signal, andSecretStore::putship inrollout-corein Phase 2.EventEmitter— see spec 09 §2. The trait lives inrollout-core(plan 02-00); theStdoutJsonEmitterimpl lands inrollout-coordinator(plan 02-06).
Per-crate chapters
- Proto crate
- Storage
- Cloud-local
- Transport
- Plugin host
- Python bridge — PyO3 +
pyo3-async-runtimespin rationale - Coordinator
- Smoke test — the
make smokeSUBSTR-02 acceptance gate
Preflight
Before running make smoke, run bash scripts/preflight.sh. The script
checks that cargo, make, and python3 ≥ 3.11 are on PATH and notes
whether protoc is available (tonic-build bundles a copy when missing).
rollout-proto — wire-format crate
rollout-proto owns the workspace's .proto files and runs tonic-build exactly
once. Downstream crates (rollout-transport, rollout-plugin-host,
rollout-coordinator) depend on rollout-proto, not on .proto regeneration.
Why a dedicated crate
Per CONTEXT D-PROTO-01,
the workspace has a single tonic-build invocation site. Centralising wire-format
ownership means:
- The HTTP/2 ↔ QUIC swap (CONTEXT D-TRANS-01 fallback) shares the same source-of-truth proto schema.
- The sidecar IPC protocol (UDS) and the network transport carry the same message types — no drift between in-process and out-of-process plugin calls.
- Build-time codegen happens once per workspace build; downstream crates compile faster.
Wire-format ownership
| File | Package | Owns |
|---|---|---|
proto/transport.proto | rollout.transport.v1 | Worker ↔ Coordinator transport |
proto/plugin.proto | rollout.plugin.v1 | Host ↔ sidecar Plugin protocol |
Both are committed to the repo. The build script (build.rs) compiles them via
tonic-prost-build on every workspace build; rerun-if-changed keeps incremental
builds fast.
Three transport channels
transport.proto defines the three logical channels from
spec 05 §3:
| Channel | RPC kind | Cadence | Purpose |
|---|---|---|---|
Heartbeat | unary Beat | every 500 ms | Worker liveness; carries due_at for deadline-based health |
Control | server-stream Subscribe | as-needed | Coordinator pushes drain / snapshot / cancel |
Work | bidi Stream | bursty | Pull/submit work items (full semantics land in Phase 6) |
All three multiplex over one connection — HTTP/2 today, QUIC behind a feature flag in a later phase.
Plugin sidecar protocol
plugin.proto defines the five sidecar RPCs from
spec 03 §3.3 + §4:
Init(InitRequest) -> InitResponse— one-time setup per worker.Preflight(PreflightRequest) -> PreflightResponse— optional, afterInit.Call(CallRequest) -> CallResponse— generic typed entry point;payloadis opaque bytes (JSON or postcard, plugin-defined).Reload(ReloadRequest) -> ReloadResponse— hot reload signal; the sidecar may refuse.Shutdown(ShutdownRequest) -> ShutdownResponse— clean exit with grace period.
The Python sidecar host implementation lands in
rollout-plugin-host (plan 02-05).
Python stubs
The in-tree Python sample sidecar
(python/examples/sample_sidecar/) uses stdlib length-prefixed JSON framing
over UDS — per AGENTS.md §7, every plugin must be testable locally without
pip install grpcio or other external services. See
02-RESEARCH.md §"Python sidecar IPC: avoid pip"
for the rationale.
If you are authoring a Python plugin that does want real gRPC, opt in via
make protos (which shells to cargo xtask gen-protos). Output goes to
python/rollout/_proto/. The xtask gracefully degrades when grpcio-tools is
missing — it prints a helpful message and exits 0; CI does not require it.
pip install grpcio-tools # opt-in
make protos # writes python/rollout/_proto/*_pb2.py
Versioning
Both protos use a v1 package suffix
(rollout.transport.v1, rollout.plugin.v1). Bumping to v2 is a future spec
edit; the v1 modules will continue to compile alongside any future versions so
in-flight Phase 2 deployments do not break.
Cross-references
- spec 03 — plugin system §3.3, §4
- spec 05 — distribution / transport §3, §6
- CONTEXT D-PROTO-01 for the "single tonic-build site" decision
Storage
rollout-storage is the Phase-2 embedded Storage + StorageTxn impl, backed by redb 2.5. It is the default backend in v1 per CONTEXT D-STO-01. The Postgres backend lives in Phase 4 (TRAIN-04).
Backend choice
Spec 04 §3.1 prefers redb: pure-Rust, single-file MVCC, copy-on-write, no compaction stalls. The async Storage trait wraps the sync redb API via tokio::task::spawn_blocking.
Table layout
Each StorageKey.namespace maps to its own redb TableDefinition<&[u8], &[u8]>. The Phase-2 namespaces are:
| Table | Purpose |
|---|---|
runs | Per-run metadata |
workers | Worker registry (coordinator-owned) |
heartbeats | Heartbeat ledger; coordinator scans this on due_at |
queue | Generic queue spill |
plugins | Plugin manifest cache |
cloudlocal_queue | rollout-cloud-local queue restart-replay mirror |
Adding a new namespace = adding a const TableDefinition in embedded::tables.rs and a match arm in table_for. No migration; redb opens new tables lazily on first write.
Key encoding
Namespace selects the table, so the redb key bytes only encode (run_id, path). The encoding is a single postcard::to_allocvec(&(&run_id, &path)) so decoding is unambiguous regardless of any 0x00 bytes inside Option<RunId>. See crates/rollout-storage/src/encoding.rs.
Value encoding
postcard per D-STO-04 — compact, schemafull, deterministic, serde-native. The Storage trait is bytes-only (get_bytes / put_bytes / cas_bytes); downstream crates layer postcard over the trait via free helpers when typed payloads are needed.
Durability
Always-fsync per D-STO-03. Each WriteTransaction calls set_durability(Durability::Immediate) before staging writes; redb's commit() blocks until fsync completes. Default DB path is ./data/rollout.db, overridable via [storage.embedded] path = ... in the run config.
watch() semantics
Storage::watch(prefix) -> tokio::sync::broadcast::Receiver<StorageEvent> returns an in-process subscriber to commits whose key extends prefix. The WatchRouter lives inside EmbeddedStorage; per-prefix broadcast::Senders are created on first subscribe.
The critical invariant (RESEARCH §"Pattern 2"): redb has no post-commit hook. The EmbeddedTxn buffers StorageEvents inside the in-flight txn; commit() only fans them out to the router after WriteTransaction::commit() returns Ok. On abort() (explicit or via drop), the buffered events are discarded — subscribers never observe rolled-back writes.
begin() → buffer = []
put_bytes(K, V) → buffer.push(Put{K})
commit() → spawn_blocking(commit) → on Ok: for evt in buffer { watch.publish(evt) }
abort() → drop(txn); buffer is dropped
Phase-2 scan semantics
Storage::scan_bytes(KeyRange { prefix, limit }) returns an owned Vec<(StorageKey, Vec<u8>)> rather than the BoxStream shown in spec 04 §2. The async-trait + object-safety constraint on stable Rust forbids stream-returning methods on dyn Storage; the simplification is documented in spec 04 §1a. Phase-2 callers (coordinator deadline scan, cloud-local restart replay) work with small per-namespace working sets where the owned-vec cost is negligible. A future StorageStream newtype can lift this restriction.
The scan iterates the namespace table and decodes each key on the way out; namespace already partitions the data, so tables stay small. limit short-circuits the iteration.
Postgres path constraint (ASCII-printable)
The Postgres backend stores StorageKey.path as TEXT[]. Path components must be ASCII-printable (bytes 0x20–0x7E); non-printable / NUL / high-bit bytes cannot round-trip through TEXT[] and would silently diverge from redb's byte-lex prefix scan (PITFALLS.md §17). Every Postgres CRUD and scan entry-point calls StorageKey::validate_for_postgres() and rejects offending keys with Fatal(ConfigInvalid) before any SQL runs. For binary identifiers in path components, use hex::encode at the StorageKey construction site. A 256-case proptest (tests/postgres_scan_bytes_parity.rs) witnesses byte-parity between redb and Postgres over the printable-ASCII range.
Cross-process watch
NOT supported. The broadcast channel lives in process memory; another EmbeddedStorage instance pointing at the same file will not receive events. Cross-process watch arrives with the Postgres backend in Phase 4 (LISTEN/NOTIFY).
Crash safety
The on-disk format is crash-safe at the redb commit boundary: a successful commit() implies fsync; anything before that is invisible on reopen. The active test in tests/crash_safety.rs exercises the abort-style variant (drop txn without commit, reopen, assert no keys visible). A "true SIGKILL between put and commit" variant is #[ignore]d for now — it needs a helper-binary harness with raw signals; Phase-6 DIST-03 (restart-from-storage) will land that.
Tests
| File | Scope |
|---|---|
tests/crud.rs | put/get/delete/scan/get_many/ping |
tests/tables.rs | per-namespace isolation; six-namespace reopen |
tests/txn.rs | commit/abort/cas (insert-only / CAS / delete-if-equal) |
tests/watch.rs | publish-after-commit; abort suppresses; multi-subscriber; prefix isolation |
tests/crash_safety.rs | drop-without-commit reopen; SIGKILL variant #[ignore]d |
Cloud-local substrate
rollout-cloud-local ships the Phase-2 Layer-1 implementations so the rest of
the stack has a real ObjectStore / Queue / SecretStore / ComputeHint to
target with zero cloud creds. Per CONTEXT D-LOCAL-01..05.
What ships
| Capability | Type | Implementation |
|---|---|---|
| Blob storage | FsObjectStore | Content-addressed two-level sharded FS under ./data/object-store/; sibling <hex>.meta.json per blob. |
| Work queue | InMemQueue | tokio::sync::Mutex<VecDeque<_>> hot path + spill to rollout-storage (cloudlocal_queue namespace). |
| Secrets | EnvSecretStore | Read-only env-var allowlist (ROLLOUT_SECRET_<NAME>); put() returns Fatal(ConfigInvalid) by design. |
| Compute hint | hints::* | Linux full (/proc/cpuinfo + /proc/meminfo, optional NVML feature); macOS minimal (sysinfo cpu+memory). |
What's deferred
BlockStore— D-LOCAL-05: declared inrollout-core, not implemented here; opt-in for clouds that need it.- Sandboxing beyond network allowlist — Phase 7, when the tool harness lands and untrusted-code isolation matters (cgroups + seccomp + FD limits + fs write restrictions).
- Real cloud backends (
rollout-cloud-aws,rollout-cloud-gcp) — Phase 5.
FsObjectStore layout (D-LOCAL-01)
./data/object-store/
├── ab/ <- hex[0..2]
│ └── cd/ <- hex[2..4]
│ ├── abcd…fullhash <- blob
│ └── abcd…fullhash.meta.json
└── ...
Writes are tmp-then-rename for atomicity. Idempotent for repeated puts of the
same bytes (same ContentId, no double-write). get_bytes on a missing id
returns Fatal(Internal("object not found: …")) — choosing Fatal because a
missing content hash signals an upstream contract violation, not a transient
I/O fault.
Queue restart semantics (D-LOCAL-02)
Every enqueue writes through a Storage transaction under
cloudlocal_queue/<ulid> BEFORE the item is pushed onto the in-memory deque.
ack deletes the storage entry; nack re-pushes the item to the front of the
deque without touching storage so the next restart still replays it.
On InMemQueue::open(storage), the queue scans cloudlocal_queue/*, decodes
each QueueItemId(Ulid) from the path segment, sorts by ULID (which is
k-sortable so this recovers enqueue order), and rebuilds the deque. This honors
the spirit of DIST-03 (restart replay) for the local backend; the full
DIST-01..05 fault-tolerance work lands in Phase 6.
Secret allowlist (D-LOCAL-03)
EnvSecretStore::new(allowlist) accepts a list of secret names (without the
ROLLOUT_SECRET_ prefix). At get(name) time:
| Condition | Result |
|---|---|
name not in allowlist | Err(Fatal(ConfigInvalid("not in allowlist"))) |
name allowed, ROLLOUT_SECRET_<name> set | Ok(value) |
name allowed, env var unset | Err(Recoverable(Transient, RetryHint::Never)) |
put(name, value) — ALWAYS | Err(Fatal(ConfigInvalid("read-only"))) |
The "allowed but unset" case is recoverable rather than fatal because the operator can provision the variable without changing config; subsequent calls will succeed.
GPU inventory (D-LOCAL-04)
GPU enumeration is opt-in behind a nvml Cargo feature. When the feature
is off (default), LinuxComputeHint::inventory().gpus is always empty. When
the feature is on but libnvidia-ml.so is missing or NVML init fails, the
inventory still returns an empty gpus vector — never errors. This keeps
local dev machines and CI runners (no GPU) functional without conditional code
on the caller.
macOS skips GPU inventory entirely.
Tests
| File | Coverage |
|---|---|
tests/object_store.rs (6) | Round-trip, sharded layout, meta sidecar, exists, idempotency, fatal-on-missing |
tests/secrets.rs (4) | Allowlist read, outside-allowlist fatal, unset transient, put fatal |
tests/queue_replay.rs (5) | FIFO ULID order, nack-to-front, restart replay, ack removes, nack keeps |
tests/hints_macos.rs (2, cfg=macos) | CPU + memory present, preemption signal None |
tests/hints_linux.rs (3, cfg=linux) | /proc/cpuinfo parse, /proc/meminfo parse, no GPUs without nvml feature |
Linux-only tests are #[cfg(target_os = "linux")] so workspace cargo test --tests stays green on every host. The NVML integration test is additionally
#[ignore]d behind --ignored since it requires a live libnvml.
rollout-transport — gRPC plane with mTLS
rollout-transport is the worker ↔ coordinator gRPC plane for Phase 2. It
ships HTTP/2 + rustls + mTLS as the plan-of-record. QUIC support is
experimental and lives behind a Cargo feature flag.
Plan-of-record: HTTP/2 + rustls
Phase 2 research (.planning/phases/02-local-substrate/02-RESEARCH.md,
"Transport stack") found tonic-h3 v0.0.5 (latest at planning time,
2025-11-01) was explicitly labelled experimental, with bidirectional
streaming not documented as supported. The upstream
hyperium/tonic#339 gRPC-over-HTTP/3
issue remains open. Shipping a Work channel (bidi-streaming) on top of
tonic-h3 would risk runtime hangs under load.
Therefore plan 02-04 ships HTTP/2 tonic 0.14 + rustls 0.23 + rcgen 0.13.
The same .proto schema is forward-compatible with QUIC; the swap will
become a single Cargo-feature flip in a later phase once tonic-h3 (or
its successor) ships documented bidi support.
Three logical channels
All three channels are multiplexed over one HTTP/2 connection per
(coordinator, worker) pair, per spec
05-distribution.md §3.
| Channel | RPC kind | Purpose |
|---|---|---|
| Heartbeat | unary, frequent | Worker's "I'm alive" ping; carries due_at |
| Control | server-streaming | Coordinator pushes drain / snapshot / cancel |
| Work | bidirectional | Phase-2 stub; Phase 6 wires pull/submit |
The Work channel ships as a wired-but-stub WorkServiceImpl that echoes a
heartbeat marker back on every received frame. Real pull/submit semantics
arrive with DIST-01..02 in Phase 6.
mTLS auto-bootstrap (D-TRANS-02)
On first run the transport invokes tls::ensure_dev_ca(tls_dir). This:
- Generates a self-signed CA via
rcgen 0.13. - Writes
ca.pem(public certificate) andca.key.pem(private key). - Sets
ca.key.pempermissions to0o600on Unix.
Subsequent calls are idempotent — the existing files are read back.
tls::issue_server_cert(ca_cert, ca_key, dns_names) issues server certs
(EKU ServerAuth). tls::issue_client_cert(ca_cert, ca_key, names)
issues client certs (EKU ClientAuth). Both are signed by the dev CA.
Filenames live under ./data/tls/ (gitignored). No manual openssl
steps are required; rcgen produces pure-Rust output with no
openssl-sys dependency (banned by deny.toml).
Deadline-based health (D-TIME-01..02)
Spec
05-distribution.md §6
mandates deadline-based health, not polling. Workers publish
due_at = now + heartbeat_interval × 2 on every Beat; coordinators scan
worker state and mark failure only when:
elapsed_past_due > clock_skew_budget AND elapsed_past_due > coordinator_failure_timeout
Helpers health::next_due_at and health::is_failed encode the formulas
above. They take SystemTime directly so callers can swap a test clock
trivially.
Default constants (D-TIME-01):
| Constant | Default |
|---|---|
heartbeat_interval | 500 ms |
worker_self_fence_timeout | 4 s |
coordinator_failure_timeout | 5 s |
clock_skew_budget | 250 ms |
Config invariants enforced at plan time (D-TIME-02)
TransportConfig::validate_cross_fields enforces two split-brain
prevention rules at plan time, never at runtime:
worker_self_fence_timeout < coordinator_failure_timeout— a worker that fails to self-fence before the coordinator times it out causes classic split-brain (two writers think they own the same lease).clock_skew_budget < heartbeat_interval × 2— a clock-skew budget greater than two periods would make deadline-based failure detection meaningless.
rollout plan calls validate_cross_fields before any worker starts;
violations return Fatal(ConfigInvalid) with a human-readable message
identifying the offending field.
QUIC feature flag (EXPERIMENTAL)
The quic Cargo feature pulls in quinn, tonic-h3, h3, and
h3-quinn:
[features]
default = ["h2"]
h2 = []
quic = ["dep:quinn", "dep:tonic-h3", "dep:h3", "dep:h3-quinn"]
At plan-02-04 execution time (2026-05-20), cargo build -p rollout-transport --features quic fails to compile because h3-quinn 0.0.7 references quinn::StreamId internals that became private in
quinn 0.11.x. This confirms the RESEARCH §"Pitfall 2" assessment:
the QUIC stack is not production-ready.
When tonic-h3 (or a successor) ships documented bidi-streaming and
publishes a quinn 0.11-compatible release, the swap is:
- Verify
cargo build --features quicsucceeds. - Implement
server::serve_quic(currently returnsFatal(Internal)with an EXPERIMENTAL message). - Add a matching
client::build_quic_channel. - Flip the default feature, or expose a CLI flag.
The proto schema does not change.
Channel: Work (Phase-2 stub)
work::WorkServiceImpl::stream accepts a bidi stream and echoes a
"heartbeat" marker per inbound frame. This is enough for the Phase-2
smoke test in plan 02-07 to verify that the bidi pipe is wired
end-to-end. Real pull/submit semantics — assignment, ack, work-stealing,
restart-from-storage — arrive with DIST-01..02 in Phase 6.
Observability (principle #10)
Every server handler is wrapped in a tracing::instrument span with a
channel field. Emitted events include:
transport_server_starting(with%addr)mtls_handshake_bootstrap(when CA is generated)heartbeat_received(worker_id, run_id, state)control_subscribed(worker_id, run_id)
Binary crates configure the subscriber; library crates emit only — per D-OBSERVE-01.
Cross-references
docs/specs/05-distribution.md§3, §6crates/rollout-proto/proto/transport.proto.planning/phases/02-local-substrate/02-CONTEXT.mdD-TRANS-01..03, D-TIME-01..02.planning/phases/02-local-substrate/02-RESEARCH.md§"Transport stack", Pitfall 2
Plugin host
rollout-plugin-host is the Phase-2 implementation of rollout_core::PluginHost. It wires three loading modes against the trait surface delivered in Wave-0 and persists manifests through rollout-storage when constructed via PluginHostImpl::with_storage.
Three modes
| Mode | Crate feature | Where it runs | Hot reload |
|---|---|---|---|
Rust cdylib (PluginMode::RustCdylib) | cdylib (default) | In-process, native code | Unsupported. Returns Fatal(PluginContract) per spec 03 §7. |
PyO3 in-process (PluginMode::Pyo3) | pyo3 (default) | Dedicated Python OS thread | importlib.reload (requires dev-hot-reload). |
Python sidecar (PluginMode::Sidecar) | sidecar (default) | Subprocess over Unix Domain Socket | SIGTERM + respawn (requires dev-hot-reload). |
The host exposes one type — PluginHostImpl — that dispatches on manifest.mode at load() time. Each mode owns a small state struct kept in a parallel HashMap<PluginId, HandleState> so the public PluginHandle remains Clone + Send + Sync POD.
Manifest
Plugins ship a rollout-plugin.toml parsed by parse_manifest. The schema matches rollout_core::PluginManifest:
name = "sample-inproc"
version = "0.1.0"
kind = "env-harness" # PluginKind variant (kebab-case)
trait_id = "rollout_core::Plugin"
mode = "pyo3" # pyo3 | sidecar | rust-cdylib
network_allowlist = []
[runtime]
python_min = "3.11" # required for pyo3
gpu = false
memory_mib = 64
[entry.pyo3]
module = "sample_inproc.plugin"
factory = "create_plugin"
validate_manifest runs cheap plan-time checks: pyo3 plugins must declare python_min >= 3.11 (stdlib tomllib + PyO3 abi3-py311), names/versions must be non-empty, and the manifest must parse against serde(rename_all = "kebab-case").
C-ABI shim (cdylib mode)
Phase 2 keeps the shim as an internal module (src/modes/abi.rs) per RESEARCH Open Question 3 — it graduates to a standalone rollout-plugin-abi crate when an external plugin ecosystem emerges (likely Phase 7+).
The contract:
#![allow(unused)] fn main() { #[repr(C)] pub struct Buf { pub ptr: *mut u8, pub len: usize, pub cap: usize } #[repr(C)] pub struct RolloutPluginVtable { pub abi_version: u32, // must equal ABI_VERSION (1) pub name: *const c_char, pub call: extern "C" fn( method: *const u8, method_len: usize, payload: *const u8, payload_len: usize, out: *mut Buf ) -> i32, pub free_buf: extern "C" fn(buf: Buf), } #[no_mangle] pub extern "C" fn rollout_plugin_factory() -> *mut RolloutPluginVtable; }
The host copies the returned Buf before calling free_buf so allocator mismatches across the cdylib boundary cannot corrupt anything. ABI mismatches surface as Fatal(PluginContract { msg: "cdylib ABI mismatch: got N, expected 1" }).
The in-tree example lives at tests/smoke/plugins/rust_cdylib_sample/. It's an out-of-workspace cdylib crate (its own [workspace]) so cargo build --workspace doesn't churn on it; the smoke driver (plan 02-07) builds it with cargo build --manifest-path … --release.
Sidecar IPC
The in-tree sample uses stdlib-only length-prefixed JSON over AF_UNIX per RESEARCH Pitfall 9 + AGENTS.md §7 ("every plugin testable locally — without cloud creds / GPUs / external services"). The wire format is intentionally tiny:
request: [u32 BE length][utf-8 JSON {"method": str, "payload": str}]
response: [u32 BE length][utf-8 JSON {...}]
This avoids forcing pip install grpcio into the cargo test path. Real production sidecars that need gRPC can ingest the same rollout-proto::plugin::v1 service stubs (make protos regenerates the Python side); the host supports both via the SidecarProtocol enum (FramedJsonUds and GrpcUds). Phase 2 only ships the FramedJsonUds code path; GrpcUds lands when a sidecar consumer needs it.
UDS paths default to ./data/sidecars/<plugin>-<pid>.sock and are chmod 600 on the Python side. The host removes the socket on unload + on respawn.
Hot reload
Behind the dev-hot-reload Cargo feature only:
- PyO3. The dedicated Python thread runs
importlib.reload(module)and re-invokesfactory()on the same channel. In-flight calls running on the old object complete naturally; subsequent calls hit the new object. - Sidecar. SIGTERM the child (via
nix::sys::signal::kill), wait 2 s, SIGKILL on holdout, then respawn with the same argv on a fresh UDS path. - cdylib. Returns
Fatal(PluginContract { msg: "cdylib reload unsupported per spec 03 §7" })— Rust has no stable ABI;dlclosewhile another task holds aBox<dyn Plugin>is UB. Production should not reload cdylibs.
The feature is off by default. Per spec 03 §7, production deployments ignore reload flags entirely (with a warning) — that surfacing lands with the coordinator in plan 02-06.
Sandboxing (Phase 2 scope)
Network allowlist only (D-SANDBOX-01). cgroups + seccomp + FD limits + fs write restrictions are tracked as TODOs referencing Phase 7. The allowlist surfaces as PluginManifest::network_allowlist; the egress proxy lives in the worker / cloud-local layer and lands when the tool harness needs adversarial isolation (Phase 7).
Dependency direction
rollout-plugin-host does not depend on rollout-transport. The Wave-0 dep-direction lint enforces this — sidecar IPC uses UDS framing via rollout-proto, not the QUIC/HTTP/2 transport from rollout-transport. Forgetting this rule means the layered architecture (AGENTS.md §9) drifts; the integration test in crates/rollout-core/tests/dependency_direction.rs catches drift on every PR.
Observability
Every public op emits a structured event with target = "plugin_host" and a plugin_id field:
| Event | Fields | When |
|---|---|---|
plugin_loaded | plugin_id, mode, path/module/socket | load() succeeds |
plugin_reloaded | plugin_id, reason | reload() succeeds (or is rejected) |
plugin_call (span) | plugin_id, method | wraps every call() |
plugin_call_error | plugin_id, error | call returns Err |
Subscribers configure formatting via tracing-subscriber (e.g., the stdout-JSON sink that the coordinator binary ships in plan 02-06).
Python bridge (PyO3)
Phase 2 pins pyo3 = 0.28 + pyo3-async-runtimes = 0.28 in lockstep. Both crates moved under the PyO3 organisation in 2025 (the predecessor pyo3-asyncio is archived).
Version pin rationale
| Crate | Pinned version | Why |
|---|---|---|
pyo3 | 0.28 | First release with the new Python::attach / Python::try_attach API (replaces Python::with_gil); stable abi3 support; MSRV 1.83 satisfies our 1.88 toolchain. |
pyo3-async-runtimes | 0.28 | Mandatory lockstep with pyo3 — the FFI surface is private and tied to the patch version. Successor to pyo3-asyncio (archived). |
Both versions are declared in [workspace.dependencies] so every consumer (currently just rollout-plugin-host) inherits the same patch range.
abi3-py311 strategy
We enable pyo3/abi3-py311. This produces a single Python extension binary that runs on any CPython 3.11+ instead of building one wheel per minor (3.11 → 3.12 → 3.13). The trade-off:
- ✅ Smaller CI matrix; one wheel ships everywhere.
- ✅ Stdlib
tomllibis available (used by sidecar samples to parsepyproject.tomlif needed). - ✅ pyo3 abi3 was stabilised circa 0.20 and is well-exercised.
- ❌ 3.10 builds are rejected at link time with
cannot set a minimum Python version 3.11 higher than the interpreter version 3.10. Local dev machines that default topython3.10mustexport PYENV_VERSION=3.11.x(or similar) beforecargo build. - ❌ Some C-extension features (e.g. private CPython internals) aren't accessible through abi3.
scripts/preflight.sh (added in Wave-0 plan 02-00) verifies python3 >= 3.11 before make smoke does anything destructive.
Dedicated Python OS thread
Per RESEARCH Pitfall 3, mixing PyO3 calls onto random Tokio worker threads deadlocks under contention because the GIL ends up held across .await points. The host avoids this by owning one OS thread per PluginHostImpl:
┌────────────────────┐
Tokio task ───► mpsc::Sender ─────►│ rollout-py-* OS thread
│ (PyTask enum) │ ───────────────────────
│ │ Python::attach(...) {
│ │ plugin.call(...)
│ ◄─── oneshot ──────┤ }
└────────────────────┘
The worker thread is created with std::thread::Builder::new().name("rollout-py-<plugin>") so debuggers and tracing spans can distinguish per-plugin Python contexts. With pyo3/auto-initialize the interpreter spins up on first Python::attach; we do not call the (removed-in-0.28) prepare_freethreaded_python.
Heavy-CPU Python code should release the GIL via Py::allow_threads per spec 03 §3.2 — Phase 2 in-tree samples are tiny enough not to need this.
In-tree samples and the no-pip rule
Per AGENTS.md §7, every in-tree sample must run without pip install. Phase 2 ships three:
python/examples/sample_inproc/— PyO3 in-process.create_plugin().call(method, payload)echoes payload or returnsb"pong".python/examples/sample_sidecar/— Python sidecar over stdlib JSON framing. Runs aspython -m sample_sidecar <socket_path>.tests/smoke/plugins/rust_cdylib_sample/— Rust cdylib implementing ABI v1.
User plugins are free to bring their own virtualenv with grpcio, numpy, etc. — the no-pip rule applies only to the in-tree samples that gate cargo test and make smoke.
Coordinator
rollout-coordinator is the Phase-2 minimal control plane. It registers
workers, accepts heartbeats, persists the worker registry and heartbeat ledger
to local EmbeddedStorage, and surfaces deadline-detected failures.
Scope
Phase 2 explicitly ships only the heartbeat-receiver slice:
| In scope (Phase 2) | Out of scope (Phase 6 DIST-01..05) |
|---|---|
| Register worker, accept heartbeat | Work distribution / pull / submit |
Persist workers/* + heartbeats/* to Storage | Coordinator lease / CAS / HA |
| Deadline-based failure scan + tracing events | Multi-coordinator handoff |
Mount the three rollout-transport services | Restart-from-storage 4-node test |
The same binary scales to the full distribution story later — Phase 6 adds work-stealing on top of the existing transport + storage wiring.
Storage layout
Two redb tables (rollout-storage provides them; this crate only writes):
- Namespace
workers, path[<worker_id>]→ postcard-encodedWorkerRegistryEntry - Namespace
heartbeats, path[<worker_id>]→ postcard-encodedHeartbeatRecord
Heartbeats are overwrite-on-write in Phase 2 — only the latest beat is kept. Phase 6 may bolt on a ledger if work-stealing needs history.
Failure-detection formula
A worker is marked failed iff
elapsed_past_due = now - due_at
elapsed_past_due > clock_skew_budget
elapsed_past_due > coordinator_failure_timeout
Both thresholds must trip (spec 05 §6). The defaults (CONTEXT D-TIME-01):
| Constant | Value |
|---|---|
heartbeat_interval | 500 ms |
worker_self_fence_timeout | 4 s |
coordinator_failure_timeout | 5 s |
clock_skew_budget | 250 ms |
Plan-time invariants enforced by TransportConfig::validate_cross_fields:
worker_self_fence_timeout < coordinator_failure_timeout(split-brain prevention).clock_skew_budget < 2 × heartbeat_interval.
Failure-scan loop
The scan ticks every heartbeat_interval / 2 (250 ms by default) so a single
missed beat is detected within 2 × heartbeat_interval — the SUBSTR-02
acceptance criterion #3. Each tick scans the heartbeats/* namespace,
decodes via postcard, and emits two outputs per overdue worker:
- A
tracing::warn!line withtarget = "coordinator"andworker_id = <id>+due_at_ms = <ms>fields. - An
Event { kind: EventKind::Domain { topic: "worker_failed" }, level: Warn, … }via the injectedEventEmitter(see D-OBSERVE-01 below).
Already-failed workers are tracked in an in-memory HashSet so the loop
emits exactly one failure event per worker per coordinator lifetime.
Observability (D-OBSERVE-01)
StdoutJsonEmitter is the Phase-2 sink for rollout_core::EventEmitter:
- Holds a
Mutex<tokio::io::Stdout>so concurrent emits don't interleave. - Writes one NDJSON line per event using
serde_json. - Flushes after each event.
CoordinatorImpl::new(storage, run_id, emitter) takes the emitter as an
Arc<dyn EventEmitter> so non-stdout backends drop in without code change
in later phases. The coordinator emits:
worker_registered— on the firstregister()(or first heartbeat from an unknown worker; see CLI section).worker_heartbeat— on every accepted beat.worker_deregistered— on graceful drain.worker_failed— from the failure-scan loop.
Tests use NoopEmitter (also in emitter.rs) which discards every event.
CLI
rollout coordinator run
rollout coordinator run --config <path>
Boots the coordinator from a TOML file:
run_id = "01JZ..." # ULID
[storage]
path = "./data/rollout.db"
[transport]
listen_addr = "127.0.0.1:50051"
tls_dir = "./data/tls"
The same logic ships as a standalone binary rollout-coordinator (Cargo
[[bin]]) so the smoke-test wrapper can invoke it directly without
rollout-cli in the path.
rollout worker run
rollout worker run --config <path> [--worker-id <ulid>] [--plugin <manifest.toml> ...] [--hot-reload]
The Phase-2 worker:
- Opens its own
EmbeddedStorage. - Builds a
PluginHostImpl::with_storage(...)andload()s each--pluginmanifest. - Dials the coordinator over mTLS using an ephemeral client cert issued
from the dev CA at
./data/tls/. - Sends
Beat(state=Init)immediately; the coordinator auto-registers the worker on first heartbeat (the proto has no separateregisterRPC). - Beats every
heartbeat_intervalafter that, advancingstate -> Ready. - On SIGTERM: sends
Beat(state=Draining)and exits 0.
First-run UX
On first boot, the coordinator generates data/tls/ca.pem +
data/tls/ca.key.pem (chmod 600 on the key) and prints:
Generated dev CA at ./data/tls/ca.pem
Subsequent runs are idempotent (read-through). The CA + per-host certs
follow rcgen 0.13 defaults — adequate for dev; production should swap
in a real CA in a later phase.
Smoke test
make smoke is the SUBSTR-02 / SUBSTR-03 / SUBSTR-04 acceptance gate. It runs
the live substrate end-to-end on a clean checkout with no cloud credentials
and proves that deadline-based failure detection works on the wire.
What it proves
The smoke test boots the full Phase-2 substrate and exercises every concern in one run:
rollout-storageopens three independent redb files (one per coordinator / w1 / w2) with always-fsync durability.rollout-transportbrings up the HTTP/2 listener with an auto-generated dev CA + per-host mTLS certificates.rollout-plugin-hostloads two plugins per worker — a Rust cdylib and a Python sidecar (stdlib-only framed-JSON over UDS).rollout-coordinatorpersists the worker registry + heartbeat ledger and runs the deadline-based failure scan that emitsworker_failedfor any worker overdue by more thancoordinator_failure_timeout + clock_skew_budget.
When the script kills w1 with SIGKILL, the coordinator must observe and
emit the failure event for w1's ULID inside the deadline budget — that is
the contract the substrate ships.
Topology
coord.log (NDJSON events)
▲
│
┌─────────────────────┴─────────────────────┐
│ rollout-coordinator │
│ ./data/smoke/coord.db tls/ca.pem │
│ listen 127.0.0.1:50051 (mTLS, H/2) │
└───────────────┬──────────────┬────────────┘
│ │
heartbeat 500ms heartbeat 500ms
│ │
┌───────────────▼──┐ ┌────▼─────────────┐
│ rollout worker w1│ │ rollout worker w2 │
│ w1.db │ │ w2.db │
│ plugins: │ │ plugins: │
│ cdylib sample │ │ cdylib sample │
│ sidecar sample │ │ sidecar sample │
└───────────────────┘ └───────────────────┘
Both workers load both plugin samples; the cdylib path is built on demand by
the script (cargo build --manifest-path .../rust_cdylib_sample/Cargo.toml)
and the Python sidecar resolves via PYTHONPATH=python/examples.
Timeline
| Step | Wall-clock budget | Asserted invariant |
|---|---|---|
| Build binaries + cdylib | ~30 s (cold) / instant (cached) | exit 0 |
| Spawn coordinator | < 1 s | port 50051 up |
| Spawn w1 + w2 | < 1 s | both PIDs alive |
| Wait heartbeat-stable | < 5 s | worker_heartbeat for both ULIDs in coord.log |
kill -KILL w1 | instant | — |
Detect worker_failed(w1) | < 8 s (D-TIME-01 floor: 2 × heartbeat_interval + skew) | worker_failed topic + w1 ULID in coord.log |
The 8-second detection deadline is well inside the spec contract: at the
locked Phase-2 defaults (heartbeat_interval = 500 ms,
coordinator_failure_timeout = 5 s, clock_skew_budget = 250 ms), the
failure-scan loop ticks at 250 ms and fires the event within
coordinator_failure_timeout + clock_skew_budget ≈ 5.25 s of the missed
beat — observed locally at ~5.2 s.
Where logs go
All smoke artifacts land under ./data/smoke/ (gitignored):
| Path | Purpose |
|---|---|
data/smoke/coord.db | Coordinator embedded storage |
data/smoke/w1.db, w2.db | Per-worker embedded storage |
data/smoke/tls/ | Auto-generated dev CA + per-host certs |
data/smoke/sidecars/ | Per-plugin UDS sockets |
data/smoke/logs/coord.log | Coordinator stdout (NDJSON spec-09 events + tracing) |
data/smoke/logs/w1.log, w2.log | Worker stdout (tracing) |
data/smoke/logs/w1.toml, w2.toml | Per-worker TOML (sed-rewritten from the shared fixture so each worker has its own db) |
data/smoke/logs/w1.id, w2.id | Pre-allocated worker ULIDs (smoke driver writes these so the grep step is deterministic) |
Searching coord.log for worker_failed.*$W1_ULID is the assertion the
script makes.
Configuration fixtures
The committed fixtures at tests/smoke/coordinator.toml and
tests/smoke/worker.toml use the D-TIME-01 timing defaults verbatim. The
worker TOML carries a single storage path; the smoke driver derives
data/smoke/w1.db and data/smoke/w2.db at runtime with sed because redb
takes an exclusive file lock per Database::create and two worker processes
sharing the same file would conflict.
CI integration
.github/workflows/ci.yml includes a smoke job on ubuntu-latest that
runs after the standard test job. The job installs netcat-openbsd for the
port-poll, runs bash scripts/preflight.sh, then make smoke. On failure,
all data/smoke/logs/ contents are uploaded as a smoke-logs artifact for
post-mortem.
The pre-existing 11 CI jobs (lint, test, deny, commitlint,
schema-drift, architecture-lint, unused-deps, rustdoc-check,
docs-build, docs-deploy, docs-test-policy) are untouched — no
continue-on-error, no skipped steps. The architecture-lint job
automatically picks up the 4th invariant added in dependency_direction.rs
(rollout-coordinator must not depend on rollout-plugin-host or any
cloud-layer crate) without any YAML changes.
First-run UX
On first invocation the coordinator emits
Generated dev CA at ./data/smoke/tls/ca.pem
to stderr and proceeds. No manual openssl steps are required; rcgen
mints the CA + per-host certs in-process. The tls/ directory is gitignored
along with the rest of data/.
How to extend
- Add a plugin: drop a manifest TOML under
tests/smoke/plugins/, then add a--plugin <path>flag to the worker spawn lines inscripts/smoke.sh. - Change timings: edit
tests/smoke/coordinator.tomlandtests/smoke/worker.toml; the cross-field invariants (worker_self_fence_timeout < coordinator_failure_timeout,clock_skew_budget < heartbeat_interval × 2) are enforced atvalidate_cross_fieldstime and failing configs early-exit. - Add a worker: copy the
spawn_workerblock; allocate a new ULID; add the per-workersed-rewrite line for the storage path.
Reproducing locally
bash scripts/preflight.sh # verifies cargo + make + python3 >= 3.11
make smoke
Total wall-clock is dominated by the cold cargo build (~30 s on a warm laptop); the actual integration test runs in ~7 s once binaries exist.
If python3 --version reports < 3.11, install a newer interpreter
(brew install python@3.13, pyenv install 3.13.3, or your distro's
python3.11 package) and re-run; the preflight check guards against the
runtime-incompatible interpreter.
Inference
Phase 3 delivers rollout's first end-user-visible building block: a batch-inference
pipeline that loads a model into vLLM via PyO3, pulls
content-addressed prompts off the Phase-2 in-memory queue, and writes JSONL
completions to disk. The pipeline is resumable — rollout infer batch --resume <run_id> scans the per-sample state in rollout-storage and only
enqueues outstanding work, producing zero duplicates on restart.
Phase 3 ships two new crates plus a CLI subcommand:
| Crate | Layer | Phase-3 responsibility |
|---|---|---|
rollout-backend-vllm | 2 | InferenceBackend impl over vLLM's AsyncLLMEngine via PyO3 (in-process). |
rollout-runtime-batch | 3 | CAS sample-state machine, JSONL I/O, plan-time validation, mock backend. |
rollout-cli (extended) | 4 | rollout infer batch --config <toml> [--resume <run_id>] subcommand. |
Per-component chapters land in plan 03-05 (smoke + docs + bench). For now, see
the Wave-0 trait extension in spec 02 §2a and
the WorkerRole addition in spec 01 §3a.
TODO (plan 03-05): vLLM backend chapter · batch-runtime chapter · CPU-mode chapter · macOS Docker dev workflow · resume semantics walkthrough.
vLLM backend (rollout-backend-vllm)
rollout-backend-vllm implements the Phase-3 InferenceBackend trait
(init / generate / model_id / shutdown) over vLLM's AsyncLLMEngine
via PyO3 in-process. The crate is the second user of the dedicated
Python OS-thread pattern hardened in plan 02-05; the first was
rollout-plugin-host. The architecture is documented in spec
03-plugin-system §3.2 and the trait surface in spec
02-algorithms §2 / §2a.
Status (plan 03-03): Wave-3 lives. The PyO3 dedicated thread now imports
rollout.backends.vllm.engine, which wrapsvllm.AsyncLLMEngine;generatedrives the engine through a fresh asyncio event loop on the worker thread viapyo3_async_runtimes::tokio::run_until_complete. The default-features (no-vllm) build keeps the Wave-2 stub worker socargo test -p rollout-backend-vllmstill runs without Python / vLLM.
Architecture
+--------------------+ tokio::sync::mpsc::Sender<VllmTask> +--------------------+
| Tokio runtime | -----------------------------------> | rollout-py-vllm- |
| VllmBackend | | <engine_id> |
| (async impl | <----------------------------------- | Python::attach |
| InferenceBackend)| oneshot::Sender<Result<...>> | rt.block_on(...) |
+--------------------+ +--------------------+
|
v
rollout.backends.vllm.engine
(Wave 3: AsyncLLMEngine)
- The Tokio side never touches Python. Every call hops through one OS thread
named
rollout-py-vllm-<engine_id>that owns the interpreter for the backend's lifetime. - The thread imports
rollout.backends.vllm.engineonce at startup, then loops onmpsc::Receiver<VllmTask>untilVllmTask::Shutdownarrives or the channel closes. - Each
Generatetask uses aoneshot::channelfor its reply so multiple concurrent prompts can be in flight without head-of-line blocking on the channel (Tokio side dispatches them sequentially in Wave 2; Wave 3 will span them concurrently to let vLLM's continuous batcher do the work).
Cargo feature gating
[features]
default = [] # no PyO3 link; tests use the stub worker
vllm = [] # imports rollout.backends.vllm.engine on the dedicated thread
With --features vllm off (default), the crate builds without invoking
pyo3 at runtime. The dispatch path still exists; Generate returns the
Wave-2 stub error from a pure-Rust worker. This honors AGENTS.md §7
(every plugin testable locally without GPU) — no Python interpreter, no vLLM
install, no CUDA required for cargo test -p rollout-backend-vllm.
With --features vllm on, the worker thread calls
Python::attach(|py| py.import("rollout.backends.vllm.engine")) once at
startup. The Python module top-imports vllm.AsyncLLMEngine (plan 03-03).
vLLM version pin: vllm>=0.10,<0.22. The lower bound is the first
release where AsyncLLMEngine is an alias for the new v1 engine
(vllm.v1.engine.async_llm.AsyncLLM); the upper bound guards against
future-version drops of the alias. The pin is documented but NOT enforced
in Cargo.toml — vLLM is a Python install. The engine module's import
falls back to from vllm.engine.async_llm_engine import … if the
top-level alias is removed in a future version.
Pitfall 10: env-write before import
vLLM imports huggingface_hub which lazily reads os.environ.get("HF_TOKEN")
when downloading gated models. The dedicated Python thread therefore writes
HF_TOKEN into its own os.environ before the py.import("rollout…")
call:
#![allow(unused)] fn main() { Python::attach(|py| { if let Some(token) = &secret_token { let os = py.import("os")?; let environ: Bound<'_, PyDict> = os.getattr("environ")?.cast_into()?; environ.set_item("HF_TOKEN", token)?; } py.import("rollout.backends.vllm.engine")?; Ok(()) }); }
The token flows through the VllmEngine::spawn(engine_id, secret_token)
constructor; the SecretStore consumer is wired in plan 03-03 alongside the
live engine.
Wave 2 vs Wave 3 split
| Concern | Wave 2 (plan 03-01) | Wave 3 (plan 03-03) |
|---|---|---|
Cargo crate + vllm feature flag | shipped | inherited |
PyO3 dedicated thread (rollout-py-vllm-…) | shipped | inherited |
VllmTask enum + mpsc::Sender dispatch | shipped | inherited |
VllmBackend: InferenceBackend impl shape | shipped (stub) | live engine |
python/rollout/backends/vllm/engine.py | stub | real AsyncLLMEngine |
AsyncLLMEngine.from_engine_args wiring | — | shipped |
pyo3_async_runtimes::tokio::run_until_complete | — | shipped |
Explicit torch.cuda.is_available() device probe | — | shipped |
HF_TOKEN env-write before import vllm | wired (passthrough) | exercised |
Content-addressed model_id from HF repo SHA | — | shipped |
criterion throughput bench + raw-vllm baseline | placeholder | shipped |
Bridging the asyncio ↔ Tokio gap (RESEARCH Pitfall 2)
vllm.AsyncLLMEngine.generate(prompt, sampling, request_id) returns an
async generator that vLLM's own scheduler drives. To consume it from Rust,
we:
- Build a coroutine on the GIL by calling
engine.generate_one(...)(the Python module's wrapper that drives the async-for loop to completion and returns the finalRequestOutputas a dict). - Create a fresh
asyncioevent loop on the worker thread. - Hand the coroutine to
pyo3_async_runtimes::tokio::run_until_complete(event_loop, async move { into_future(coro).await }). The Python C-levelevent_loop.run_until_completereleases the GIL whenever the loop has nothing to do — which is exactly when our Rustawaityields. That is what lets vLLM's background scheduler tasks (also on this asyncio event loop) make progress.
The contract is verified on every CI build by
tests/pyo3_bridge_smoke.rs: a Python async def smoke() spawns a
background threading.Thread that polls a flag; the assertion fails if the
background thread does not see the flag set, proving the GIL would have
been held across the await (Pitfall 2 regression).
Pitfall 9: explicit device probe
vLLM's EngineArgs(device="auto") is unreliable on partially-installed CUDA
hosts (silent CPU fallback). The Python glue probes
torch.cuda.is_available() and passes an explicit device="cuda" or
device="cpu" to AsyncEngineArgs. CONTEXT D-VLLM-04's earlier "auto"
guidance is superseded by this probe.
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
args = AsyncEngineArgs(model=model_uri, device=device, ...)
Live integration tests (gated)
Two #[ignore]'d integration tests under
crates/rollout-backend-vllm/tests/ exercise the live engine:
vllm_init.rs— bring upfacebook/opt-125m; assertbackend.model_id()is content-addressed and stable across twoinit()calls of the same URI.vllm_generate.rs— bring upQwen/Qwen2.5-0.5B-Instructand run a 1 prompt × 8 tokens round-trip with a 300 s timeout (RESEARCH Pitfall 8 CPU mode is slow).
Both tests skip unless ROLLOUT_VLLM_AVAILABLE=1 is set in the
environment. Run them with:
ROLLOUT_VLLM_AVAILABLE=1 cargo test \
-p rollout-backend-vllm --features vllm \
--test vllm_init --test vllm_generate -- --include-ignored
The default workspace cargo test skips them.
Benchmark methodology
crates/rollout-backend-vllm/benches/throughput.rs runs a 64-prompt ×
64-token criterion bench against facebook/opt-125m; the companion
scripts/raw_vllm_baseline.py drives the same prompt set through raw
vllm.LLM (sync API). Compare the two tokens/sec numbers to verify the
BACKEND-02 <10% overhead exit criterion — a ratio ≥ 0.9 passes.
The bench is gated behind --features vllm and ROLLOUT_VLLM_AVAILABLE=1.
CI does not run it by default; the perf check lives on the self-hosted GPU
runner per CONTEXT D-CLI-05.
# rollout side:
ROLLOUT_VLLM_AVAILABLE=1 cargo bench \
-p rollout-backend-vllm --features vllm --bench throughput
# baseline side:
python scripts/raw_vllm_baseline.py facebook/opt-125m
Cross-references
- RESEARCH §"Pattern 1" — the PyO3 dedicated-thread shape this crate mirrors from plan 02-05.
- RESEARCH §"Common Pitfalls" — Pitfalls 2 (GIL deadlock),
9 (
device="auto"not reliable), 10 (env-write before import). - spec 02 §2a — the locked Phase-3 trait surface.
rollout-runtime-batch
The runtime-glue crate behind rollout infer batch. Owns the CAS sample-state
machine, the queue management around the Phase-2 InMemQueue, JSONL I/O, the
InferBatchConfig TOML schema, and the test-only MockBackend that lets us
exercise the full coordinator/worker path without spinning up a real vLLM
engine.
Why a separate crate?
rollout-backend-vllm (Layer 2) depends only on rollout-core + pyo3 — it
has no business touching rollout-storage / rollout-cloud-local /
rollout-transport. The dep-direction lint (invariants #5 and #6 from plan
03-00) enforces this. The Phase-3 work that does need storage + queue +
object-store — the sample-state CAS machine, the resume scan, the JSONL
reader — lives here in rollout-runtime-batch (Layer 3). The runtime composes
any Arc<dyn InferenceBackend>; the backend stays cloud-agnostic.
+------------------+ rollout-cloud-local::InMemQueue
| rollout-cli |---┐ rollout-cloud-local::FsObjectStore
| infer batch | | rollout-storage::EmbeddedStorage
+------------------+ ▼
+-------------------------+ +-----------------------+
| rollout-runtime-batch |◀────▶| Arc<dyn |
| | | InferenceBackend> |
| • BatchCoordinator | | |
| • BatchWorker | | • VllmBackend (prod) |
| • SampleRecord + CAS | | • MockBackend (test) |
| • JSONL I/O | +-----------------------+
| • InferBatchConfig |
+-------------------------+
SAMPLING_PARAMS_SCHEMA_VERSION
Per RESEARCH §"Pitfall 1", the deterministic sample_id() hash is brittle
against any change to SamplingParams's wire shape. Postcard is not
self-describing — adding a field rewrites every byte of to_stdvec(¶ms),
which would invalidate every outstanding Pending / Running sample-ID in
storage.
The defence is a schema-version byte:
#![allow(unused)] fn main() { const SAMPLING_PARAMS_SCHEMA_VERSION: u8 = 1; fn sample_id(model: &ContentId, prompt: &str, params: &SamplingParams, idx: u64) -> ContentId { let mut h = blake3::Hasher::new(); h.update(&[SAMPLING_PARAMS_SCHEMA_VERSION]); // FIRST byte h.update(&model.0); h.update(prompt.as_bytes()); h.update(&postcard::to_stdvec(params).unwrap()); h.update(&idx.to_le_bytes()); ContentId(*h.finalize().as_bytes()) } }
When you add a field to SamplingParams, bump SAMPLING_PARAMS_SCHEMA_VERSION.
That invalidates outstanding IDs by design — drain the in-flight batch under
the old version before deploying the new schema, or re-run any orphaned
samples.
Tests:
content_id_derivation.rs::sample_id_matches_locked_hex_for_default_params— locks the hex digest for the default-params case so any silent change to hasher input order trips immediately.content_id_derivation.rs::schema_version_byte_is_first— explicit regression catch for Pitfall 1.- Six proptest property tests prove that every component of the input
(
prompt,idx,model,temperature,max_tokens,seed) participates in the hash.
CAS state machine
SampleRecord { id, prompt_blob, state, created_at_ms, input_idx } lives at
infer/<run_id>/samples/<sample_id_hex> (storage namespace infer,
table T_INFER).
try_claim try_complete
Pending ──────────────▶ Running ────────────────────▶ Done
▲ │ ▲
│ │ │ (terminal)
│ try_repending │ try_fail
│ (stale only) ▼
└──────────────────── Failed
(transient errors;
coordinator re-enqueues)
Helpers (in src/state.rs):
try_claim(txn, record, run_id, worker_id, now_ms, stale_after_ms) → Result<bool>CAS-swapsPending(or staleRunning) →Running. Returnsfalseon race-loss.try_complete(txn, running_record, run_id, completion_blob, finished_at_ms)try_fail(txn, running_record, run_id, reason, failed_at_ms)try_repending(txn, running_record, run_id)— for the resume path.
All helpers take a &mut Box<dyn StorageTxn> so the caller controls the
commit/abort lifecycle.
Resume semantics
BatchCoordinator::scan_and_enqueue(inputs, model_content_id, sampling) is
idempotent. For each input it derives sample_id(model, prompt, sampling, idx),
looks up the existing record at sample_key(run_id, sid), and:
| Current state | Action |
|---|---|
| (absent) | write Pending + prompt blob + enqueue |
Pending | enqueue (worker will claim via CAS) |
Failed | enqueue (retry) |
Running (fresh) | skip — live owner |
Running (stale) | CAS Running → Pending; enqueue on success |
Done | skip — terminal |
"Stale" defaults to 5 minutes (DEFAULT_STALE_AFTER_MS = 5 * 60_000).
Override per-coordinator via .with_stale_after_ms(ms). The 5-minute window
follows RESEARCH §"Pitfall 5" — short enough to recover from a SIGKILL'd
worker, long enough to absorb a slow generation pass.
The stale-Running re-Pending step is itself a CAS so two coordinators racing
on the same orphaned sample don't double-enqueue. The test
resume_skips_done.rs::scan_enqueues_only_non_terminal_samples covers all four
transition cases in a single integration run.
BatchWorker flow
#![allow(unused)] fn main() { pub async fn run_loop(&self) -> Result<usize, CoreError> { loop { match self.run_one().await? { RunOutcome::Drained => return Ok(completed), RunOutcome::Completed => completed += 1, RunOutcome::Failed | RunOutcome::Skipped => { /* keep going */ } } } } }
run_one() dequeues a sample-id, loads the SampleRecord, CAS-claims it,
fetches the prompt blob, calls backend.generate(&[Prompt], &sampling), writes
the completion to the object store, and CAS-transitions Running → Done.
Backend errors translate to Running → Failed via try_fail; the queue item
is ack'd either way (the coordinator re-enqueues Failed rows on the next
scan_and_enqueue).
Concurrency: each BatchWorker is one Tokio task; spawn N of them sharing
the same Arc<dyn Queue> / Arc<dyn Storage> / Arc<dyn ObjectStore> and
the CAS dance guarantees at-most-once completion per sample.
MockBackend
Gated by the test-mock-backend Cargo feature. Returns
Completion { text: format!("MOCK:{}", p.0), finish_reason: "stop", … } after
an optional tokio::time::sleep. Used by worker_happy_path.rs and (in
Wave-4) by the restart-no-duplicates integration test to exercise the full
pull-loop without a Python / vLLM dependency.
The contract: MockBackend impls the same InferenceBackend trait that
VllmBackend does, so it composes against BatchWorker identically. CLI code
in Wave 3 (plan 03-04) wires VllmBackend; nothing in BatchWorker /
BatchCoordinator knows the difference.
InferBatchConfig
The TOML schema for rollout infer batch --config <path> lives here in
src/config.rs (NOT in rollout-cli, per WARN-5). rollout-cli imports it
via use rollout_runtime_batch::config::InferBatchConfig;.
[model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"
[sampling]
temperature = 0.7
top_p = 0.9
max_tokens = 64
seed = 42
[input]
glob = "data/prompts/*.jsonl"
[output]
dir = "data/completions"
[workers]
count = 1
#[serde(deny_unknown_fields)] on every block per spec 11 — typos in the
config file fail at plan-time, not minute 47 of a run.
JSONL I/O
read_jsonl + write_jsonl use stdlib tokio::io::BufReader::lines() +
serde_json::from_str — no serde_jsonlines dep. Arbitrary extra fields on
the input row are preserved verbatim via #[serde(flatten)] and round-trip
to the output row. write_jsonl writes one JSON object per line; the CLI
sorts on SampleRecord.input_idx before calling write_jsonl so output order
matches input order regardless of which worker finished which sample first.
Phase 4 — TrainableBackend impl on MockBackend
Beyond the Phase-3 InferenceBackend impl, MockBackend ships a
TrainableBackend impl gated behind the same test-mock-backend feature.
The training extension drives plan 04-02's LOAD-BEARING snapshot_resume.rs
byte-compare test (TRAIN-03) — no Python, no GPU, runs on every CI build.
MockBackend::new_train(seed)initialises anndarray::Array1<f32>of length 8 with every element set to(seed as f32) / 1000.0.forward_with_lossreturnsloss = 0.5and aGradHandle { step: prev + 1 }.optimizer_stepapplies a deterministic SGD delta(seed + grad_handle.step) * lrto every weight element.save_weightsreturnsContentId::of(postcard::to_stdvec(&weights)).load_weightsis a no-op;new_train_with_weights(seed, weights)is the test-side restore hook so the byte-compare assertion insnapshot_resume.rsis meaningful.
Determinism contract: two MockBackends constructed with the same seed produce
byte-equal weights after K identical optimizer_step calls. Plan 04-02's test
proves bit-identical resume by running 10 uninterrupted steps versus 5-snapshot-5
and asserting the two final weight vectors are equal.
See also
- RESEARCH §"Pitfall 1" — schema-version byte rationale.
- RESEARCH §"Pitfall 5" — stale-Running re-Pending CAS dance.
docs/specs/02-algorithms.md§2a —InferenceBackendextension shape.docs/specs/04-storage-snapshots.md§2 —Storage+StorageTxn+ CAS.docs/specs/06-cloud-layer.md§3 —Queue+ObjectStore.- Plan 04-02 — SFT skeleton + the snapshot-resume byte-compare proof.
rollout infer batch CLI
The first user-visible building block of rollout: a resumable batch-inference
pipeline driven by a single TOML config. Bridges the Phase-3 substrate
(rollout-backend-vllm + rollout-runtime-batch) behind a clap subcommand on
rollout-cli.
Invocation
rollout infer batch \
--config examples/batch-tiny.toml \
[--resume <run_id>] \
[--workers N] \
[--dry-run]
| Flag | Default | Purpose |
|---|---|---|
--config | required | Path to the TOML config (schema below). |
--resume | implicit from output dir | Override the <output.dir>/run-id lookup with an explicit ULID. |
--workers | [workers].count | Override the worker pool size for this invocation. |
--dry-run | false | Validate config + probe inputs + check HF_TOKEN allowlist; never call the backend. |
TOML schema
The schema lives in rollout-runtime-batch::config::InferBatchConfig (the CLI
imports it; spec 11 single-source-of-truth). Every block is
#[serde(deny_unknown_fields)] — typos are caught at load time.
[model]
uri = "Qwen/Qwen2.5-0.5B-Instruct" # HF repo, local path, or object-store URI
tokenizer = "..." # optional override
[sampling]
temperature = 0.7
top_p = 0.9
top_k = -1 # -1 disables
max_tokens = 64
seed = 42 # optional; deterministic when set
stop = []
stream = false # MUST be false in Phase 3 (D-BACKEND-03)
[input]
glob = "data/prompts/*.jsonl"
[output]
dir = "data/completions"
[workers]
count = 1 # >= 1
JSONL input contract
One JSON object per line. Required: prompt. Optional: id (defaults to the
deterministic sample-id from blake3(model || prompt || params) per
rollout-runtime-batch::sample_id). Extra fields are preserved and
round-tripped to output.
{ "prompt": "Translate to French: Hello.", "id": "p-001", "tag": "demo" }
JSONL output contract (D-CLI-03)
Each row:
{
"id": "<content-addressed sample id>",
"prompt": "<original prompt text>",
"completion": "<generated text>",
"sampling_params": { ... },
"model_uri": "Qwen/Qwen2.5-0.5B-Instruct",
"finish_reason": "stop",
"model_content_id": "<blake3-hex of resolved model SHA>",
"completion_blob_id":"<blake3-hex of completion bytes in the object-store>",
"generated_at": "2026-05-20T22:31:09.123Z"
}
Order matches input file order regardless of worker concurrency — the CLI
calls BatchCoordinator::collect_done_records() (sorted by input_idx) and
emits in that order.
run_id lifecycle (BLOCKER 6)
Three-tier resolution:
| Source | When applied |
|---|---|
--resume <ULID> | Always honored if present. |
<output.dir>/run-id (file) | Re-attach if the file exists and --resume is absent. |
| Freshly minted ULID | First run; written atomically (tmp + rename). |
The file is single-line UTF-8 (ULID Crockford form). Plan 03-05's
restart_no_duplicates test reads this file between phases to obtain the ULID
for the explicit --resume flag.
--dry-run semantics
Performs (in order):
- Parse TOML against
InferBatchConfig(deny_unknown_fields). - Validate
sampling.stream == false,sampling.max_tokens > 0,workers.count >= 1,input.globnon-empty. - Resolve
input.globand read every JSONL file end-to-end. - Probe whether the model URI is on a known-gated prefix
(
meta-llama/*,mistralai/*) and look upROLLOUT_SECRET_HF_TOKEN(best-effort). - Print
dry-run OK: model=… inputs=… workers=…and exit0.
Crucially, the backend is never constructed — --dry-run works on a build
with neither --features vllm nor --features test-mock-backend.
Backend selection
The CLI picks one backend at runtime based on Cargo features + env:
| Order | Condition | Backend |
|---|---|---|
| 1 | --features test-mock-backend + ROLLOUT_TEST_MOCK_BACKEND=1 | rollout_runtime_batch::MockBackend |
| 2 | --features vllm | rollout_backend_vllm::VllmBackend |
| 3 | none | fast-fail with Fatal(ConfigInvalid) at full-run; dry-run still works. |
Observability
RUST_LOG=info rollout infer batch … produces structured tracing events.
Key fields: run_id, enqueued, total, completed. Worker spans carry
worker = <ulid>.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success (or successful dry-run). |
| 2 | Config-invalid, infrastructure error, or backend error. |
CPU-mode caveat
vLLM CPU inference on the test model (Qwen2.5-0.5B-Instruct) is dramatically
slower than CUDA: expect ~1–5 tokens/sec. On macOS Apple-Silicon, vLLM has no
PyPI wheels (RESEARCH Pitfall 3) — run via Docker. CI's infer-smoke job is
Linux-only and gated on ROLLOUT_VLLM_AVAILABLE=1.
CPU mode
vLLM ships with first-class CUDA support, but for AGENTS.md §7 ("every plugin testable locally without GPU") rollout must also work on a plain CPU. This chapter documents the CPU-mode contract for rollout-backend-vllm, the dev-loop reality of macOS Apple-Silicon, and the smoke-test posture that lets default CI stay green without any GPU.
Where CPU mode is selected
Per Phase 3 CONTEXT decision D-VLLM-04 (as overridden by RESEARCH §"Pitfall 9"), the Python-side glue in python/rollout/backends/vllm/engine.py performs an explicit torch.cuda.is_available() probe and passes device="cuda" or device="cpu" to AsyncEngineArgs — never device="auto". The auto-detect path was rejected because vLLM silently falls back to CPU when CUDA libraries are partially installed (driver present, runtime missing, etc.), and a silent fallback would mask configuration mistakes at runtime instead of failing at plan time.
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
engine_args = AsyncEngineArgs(
model=model_uri,
device=device, # explicit; not "auto"
disable_log_stats=True,
disable_log_requests=True,
)
rollout-cloud-local::ComputeHint::inventory() still informs observability events (gpu_inventory_collected) and worker-config decisions, but it is no longer the source-of-truth for the engine device kwarg.
Expected CPU throughput
vLLM's CPU backend is functional but not fast. Approximate single-stream throughput for the canonical Phase-3 test model (Qwen/Qwen2.5-0.5B-Instruct, 16 max-tokens):
| Host | tokens/sec |
|---|---|
| Apple M1 Pro (8-core perf) | ~3–6 |
| Linux x86_64 (16-core) | ~2–4 |
| Generic CI runner (4-core) | ~1–2 |
A 4-prompt × 16-token smoke run finishes in well under 60 s on any of these. Anything longer than the canonical examples/batch-tiny.toml shape should target a CUDA host.
macOS Apple-Silicon
vLLM has no Apple-Silicon wheel as of Phase 3. pip install vllm on macOS produces ERROR: No matching distribution found for vllm. The two paths forward:
- Build-from-source (slow, brittle).
VLLM_TARGET_DEVICE=cpu pip install -e .against a freshly cloned vLLM repo. Compilation takes 10–30 min and depends on Apple-clang versions in ways that drift between vLLM releases. Documented but not recommended for routine dev work. - Docker (recommended). See
dev-on-macos.md. Alinux/amd64(orlinux/arm64if available) container withvllm>=0.10pre-installed lets the rollout binaries run identically to CI.
Either way, the Rust-side test surface (~80 % of Phase 3's automated tests — SamplingParams postcard determinism, sample_id derivation, CAS state-machine transitions, JSONL round-trip, the MockBackend-driven restart_no_duplicates test) runs natively on macOS without any vLLM installed. Only the live-engine integration tests (vllm_init.rs, vllm_generate.rs) and the make infer-smoke script require a real vLLM.
CI posture
- Default CI (public runners):
infer-smokenow runs on every PR and merge — noROLLOUT_VLLM_AVAILABLEgate. It installs thevllm-cpuPyPI wheel (~101 MB unified CPU wheel, AVX2 fallback) instead of the ~10 GB CUDA wheel; thetorch.cuda.is_available()probe inengine.pyselectsdevice="cpu"automatically. The job downloadsQwen2.5-0.5B-Instruct(cached under~/.cache/huggingface), runsrollout infer batch --config examples/batch-tiny.toml, and asserts 4 non-empty completion rows — all on the free 4-vCPUubuntu-latestrunner in well under 60 s of inference.train-smokeis likewise always-on, installing CPU torch + transformers + accelerate and running theexamples/sft-tiny.tomlSFT (max_steps = 2) on CPU. pip + HuggingFace caches keep both jobs fast on repeat runs. - MockBackend proofs unchanged: the load-bearing
restart_no_duplicates(BACKEND-02 exit (b)) and bit-identical-resume (TRAIN-03) proofs still run in the standardtestjob via MockBackend — no GPU/vLLM/transformers required there. - Local dev:
make infer-smokeafterpip install 'vllm-cpu>=0.17'. On Apple-Silicon, prefer the Docker path documented indev-on-macos.md.
Failure modes
| Failure | Surface | Diagnosis |
|---|---|---|
import torch fails | Python ImportError at engine init | Active venv missing torch — pip install torch first |
import vllm fails | Python ImportError at engine init | Active venv missing vllm — for the CPU/CI path pip install 'vllm-cpu>=0.17'; on a CUDA host install the matching CUDA wheel |
torch.cuda.is_available() == False on a GPU host | engine boots in CPU mode silently | NVIDIA driver/runtime mismatch — install matching CUDA runtime; the explicit probe surfaces this rather than masking it |
vllm import succeeds but AsyncLLMEngine.from_engine_args panics with device="cpu" not supported | vLLM version too old | upgrade to vllm>=0.10 |
make infer-smoke times out (>300 s) on a CPU host | model larger than Qwen2.5-0.5B-Instruct | use the canonical examples/batch-tiny.toml model; do not run multi-billion-param models on CPU |
Resume: zero-duplicate batch restart
Phase 3's rollout infer batch is resumable. If the worker is killed mid-batch, restarting with --resume <run_id> (or with the same [output] dir = and no explicit flag) continues from the last persisted sample with zero duplicates and no skipped inputs. This is one of the three ROADMAP Phase-3 exit criteria.
The lifecycle
Every rollout infer batch invocation has exactly one run_id — a 26-character ULID. The CLI resolves it via a three-tier precedence:
--resume <id>explicit override. Use this when you know the run you want to reattach.<output.dir>/run-idfile. If--resumeis absent but the output directory already has arun-idfile (from a prior run), the CLI reads it and continues that run. This is the common case for "kill + restart with the same config."- Mint a fresh ULID. No override, no prior file → generate a new
RunId, write it atomically (tempfile + rename) to<output.dir>/run-id, and proceed as a brand-new run.
run-id files are single-line UTF-8 ULIDs. They are safe to delete (forces a fresh run on next invocation) but should not be edited by hand.
How resumability is implemented
Three subsystems collaborate (each documented in its own chapter):
rollout-storage(the redb-backedEmbeddedStoragefrom Phase 2) holds the per-sample state KV under namespaceinfer. Each sample'sSampleRecordcarries one of fourSampleStatevariants —Pending,Running,Done { completion_blob }, orFailed { reason }— and transitions atomically viaStorage::cas_bytes.rollout-cloud-local::InMemQueue(also Phase 2) holds the work queue with Storage spill: every enqueue/ack mirrors to redb undercloudlocal_queue, so a coordinator restart replays whatever was unacknowledged.rollout-cloud-local::FsObjectStorestores the actual completion blobs at content-addressed paths under<output.dir>/object-store/<sha256[0..2]>/<sha256[2..4]>/<sha256>. TheSampleRecord::state = Done { completion_blob: ContentId }variant carries the blob'sContentId; the actual completion text is read from the object store at output-writing time.
At plan time, BatchCoordinator::scan_and_enqueue iterates the existing samples namespace and:
- skips records with
state = Done(already complete; their completion blobs survive in the object store); - re-enqueues records with
state = Pendingorstate = Failed; - treats records with
state = Runningand astarted_atolder thanstale_after_ms(default 60 000) as stale — CASRunning → Pendingand re-enqueue (per RESEARCH Pitfall 5: covers the kill-mid-flight case where a worker crashed without finishing).
Only the new and re-enqueued samples flow through the queue. Workers process them via the same BatchWorker::run_loop as a fresh run.
The deterministic test
crates/rollout-cli/tests/restart_no_duplicates.rs is the load-bearing proof. It runs on every CI build — no GPU, no vLLM, no real model — because it uses the MockBackend shipped by rollout-runtime-batch behind the test-mock-backend Cargo feature.
The test follows RESEARCH §"Restart-resume test design" verbatim:
- Spawn
rollout infer batch --config <tmp>as a subprocess viatokio::process::Command::new(env!("CARGO_BIN_EXE_rollout"))withROLLOUT_TEST_MOCK_BACKEND=1. - Stream stdout, count
sample_completedevents. - After 3 completions,
child.start_kill()(SIGKILL). - Read
<output.dir>/run-id. - Spawn a second subprocess with
--resume <run_id>and the same env. - Wait for exit 0.
- Assert the final
completions.jsonlhas exactly N=8 lines, all uniquesample_ids, and every input prompt is represented once.
Run locally:
cargo test -p rollout-cli --features test-mock-backend --test restart_no_duplicates
Total wall-clock: ~1.5 s on a modern dev box. The MockBackend completes each sample in 50 ms; the test deliberately spawns subprocesses to exercise the real CLI resume code path including --resume parsing, run-id file I/O, and BatchCoordinator::scan_and_enqueue.
Content-addressed sample IDs
A sample's identity is derived deterministically from (model, prompt, sampling_params, input_index) — see crates/rollout-runtime-batch/src/state.rs::sample_id. Per CONTEXT D-RESUME-01 (extended by RESEARCH Pitfall 1):
#![allow(unused)] fn main() { fn sample_id(model: &ContentId, prompt: &str, params: &SamplingParams, idx: u64) -> ContentId { let mut h = blake3::Hasher::new(); h.update(&[SAMPLING_PARAMS_SCHEMA_VERSION]); // = 1 in Phase 3 h.update(model.as_bytes()); h.update(prompt.as_bytes()); h.update(&postcard::to_stdvec(params).unwrap()); h.update(&idx.to_le_bytes()); ContentId::from(h.finalize()) } }
The leading SAMPLING_PARAMS_SCHEMA_VERSION byte is the maintenance hook: bumping the constant invalidates every outstanding sample-id, which is the desired behavior whenever the SamplingParams struct gains, loses, or reorders fields. Without it, a future serde-evolution of SamplingParams could silently produce different postcard bytes for the same logical configuration and break resume across versions.
What you cannot resume
- Cross-model resume. A run whose
<output.dir>/run-idwas minted under model A cannot resume against model B — the per-samplemodel_idis part of the content-addressed sample ID; mixing models triggersFatal(ConfigInvalid)at scan time. - Cross-machine resume (Phase 3). The
EmbeddedStorageredb file is local to the host. Phase 5 (CLOUD-03) introduces object-store-backed snapshot storage that will allow cross-machine resume; until then, resume is single-host. - Streaming runs. Phase 3 rejects
sampling.stream = trueat plan time (D-BACKEND-03); streaming is Phase 8 (INFER-01).
See also
cli.md— full CLI flag reference for--resume.cpu-mode.md— where the MockBackend-driven test runs vs. live vLLM.batch-runtime.md— theBatchCoordinatorandBatchWorkerinternals.vllm-backend.md— Pitfall-2 GIL bridge details that the live engine relies on.
Dev loop on macOS (Apple Silicon)
vLLM has no Apple-Silicon wheel as of Phase 3 (see cpu-mode.md). That makes the full rollout infer batch smoke loop unavailable natively on macOS dev boxes. This chapter documents the two supported workarounds and the substantial default-CI surface that does run on macOS so you don't need a Linux box for routine work.
What runs natively on macOS
The following all pass on darwin-aarch64 with PYO3_PYTHON=/opt/homebrew/bin/python3.13 (or a 3.11+ python on PATH) — no vLLM required:
cargo test --workspace --tests # 100+ tests across Phase 2 + Phase 3
cargo test -p rollout-cli --features test-mock-backend --test restart_no_duplicates
make smoke # Phase 2 substrate smoke
cargo clippy --workspace --all-targets -- -D warnings
cargo doc --workspace --no-deps # rustdoc gate (without --all-features; see ci.yml)
mdbook build docs/book
What does not run natively:
make infer-smoke ROLLOUT_VLLM_AVAILABLE=1 # needs `pip install vllm`, which has no aarch64-darwin wheel
cargo test -p rollout-backend-vllm --features vllm --test vllm_init -- --include-ignored
cargo bench -p rollout-backend-vllm --bench throughput
For the load-bearing exit-criterion-(b) proof (zero-duplicate restart), the MockBackend-driven test runs natively in ~1.5 s. Live vLLM is only needed for exit criterion (a) (the canonical rollout infer batch --config examples/batch-tiny.toml) and (c) (the <10 % overhead benchmark).
Workaround 1: Docker (recommended)
Run rollout inside a Linux container that has vLLM pre-installed. The repo is mounted via a bind volume so edits flow through immediately.
Sample Dockerfile.devcontainer:
FROM rust:1.88-slim-bookworm
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.11 python3-pip git make pkg-config libssl-dev protobuf-compiler \
&& rm -rf /var/lib/apt/lists/*
RUN pip3 install --break-system-packages 'vllm>=0.10,<0.22'
WORKDIR /workspace
CMD ["bash"]
docker build -t rollout-dev -f Dockerfile.devcontainer .
docker run --rm -it \
-v "$PWD":/workspace \
-v "$HOME/.cargo/registry":/root/.cargo/registry \
-v "$HOME/.cache/huggingface":/root/.cache/huggingface \
rollout-dev
Inside the container:
cargo test --workspace --tests
ROLLOUT_VLLM_AVAILABLE=1 make infer-smoke
cargo run -p rollout-cli --features vllm -- infer batch --config examples/batch-tiny.toml
Caveats:
linux/arm64andlinux/amd64images both work;linux/arm64is faster on M-series silicon.- The first run downloads
Qwen/Qwen2.5-0.5B-Instruct(~1 GiB). The~/.cache/huggingfacebind volume above persists it across container restarts. - A pure-CPU Docker run will not exercise CUDA-specific code paths. For those, use Workaround 2 or a cloud GPU runner.
Workaround 2: Cloud GPU runner
For exit criterion (c) (the <10 % overhead benchmark) you need a real CUDA GPU. The repo's CI infer-smoke job is gated on vars.ROLLOUT_VLLM_AVAILABLE == '1' and needs: test — set the repo variable on a self-hosted runner with a GPU, then push, and the bench captures throughput against python scripts/raw_vllm_baseline.py automatically.
For local development without committing, a Lambda Labs / Runpod / Vast.ai box rented by the hour works well:
# On the GPU box:
git clone <this-repo>
cd rollout
pip install 'vllm>=0.10,<0.22'
ROLLOUT_VLLM_AVAILABLE=1 make infer-smoke
cargo bench -p rollout-backend-vllm --bench throughput
Workaround 3: Source-built vLLM (not recommended)
VLLM_TARGET_DEVICE=cpu pip install -e . against a freshly cloned vllm-project/vllm repo works on macOS in principle. In practice the build takes 10–30 min, brittle-fails on Apple-clang version mismatches in ways that drift between vLLM releases, and produces a vllm binding ~3× slower than the Linux CPU wheel. Prefer Docker.
What CI tests on macOS
The lint and test workflow jobs run on macos-14 (the GitHub macos-14 runner is Apple-Silicon). They cover the entire Rust workspace surface and the MockBackend-driven restart_no_duplicates test. The infer-smoke job runs on ubuntu-latest and is opt-in; the lint job uses default features only (no quic, no vllm) per the .github/workflows/ci.yml comment.
See also
cpu-mode.md— where CPU mode is selected and the expected throughput numbers.cli.md— full CLI reference.resume.md— howMockBackendproves the resume contract without vLLM.
Training
Phase 4 lands the first end-to-end training story: supervised fine-tuning + Bradley-Terry reward-model training + bit-identical-resume training-state snapshots + the Postgres Storage backend.
What's here
- SFT (Supervised Fine-Tuning) —
rollout-algo-sft; TRAIN-01. - RM (Reward Model) —
rollout-algo-rmwith Bradley-Terry pairwise loss; TRAIN-02. - Snapshots —
rollout-snapshots,SnapshotKind::TrainState, tar + blake3 + restore; TRAIN-03. - Postgres backend —
rollout-storage[postgres]; testcontainers CI; TRAIN-04. - Determinism —
accelerate.save_state+ CUDA / CPU caveats. - CLI —
rollout train sft|rmandrollout snapshot list|show|prune. - CPU mode — what to expect on macOS / Apple Silicon development boxes.
Quickstart
# Dry-run validation (works without Python deps).
cargo run -p rollout-cli -- train sft --config examples/sft-tiny.toml --dry-run
# Live run (requires transformers + accelerate + torch; ~3-5 min CPU on M-series).
pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke
Phase 4 exit criteria
| Criterion | Where it's proven |
|---|---|
rollout train sft --config examples/sft-tiny.toml completes on a small model | make train-smoke (gated on ROLLOUT_TRANSFORMERS_AVAILABLE=1) |
| Snapshot + restart produces bit-identical weights for next K steps | crates/rollout-algo-sft/tests/snapshot_resume.rs (default-fire) + crates/rollout-backend-vllm/tests/snapshot_resume_live.rs (gated) |
| Postgres backend CI-tested via containerized integration test | crates/rollout-storage/tests/postgres_integration.rs via the postgres-integration CI job |
What's NOT here (deferred)
- PPO / GRPO / DPO / IPO / KTO — Phases 9 / 10.
- Buffer / Process / EpisodicMemory snapshot kinds — Phases 9 / 11 / 8.
- Cloud object stores for snapshot blobs — Phase 5.
- HuggingFace datasets Hub integration — Phase 7.
- Multi-node distributed training — Phase 6.
SFT — Supervised Fine-Tuning
Phase 4 plan 04-02 ships rollout-algo-sft: a PolicyAlgorithm skeleton
driven by a deterministic MockBackend, with the load-bearing TRAIN-03
byte-compare resume proof. This chapter covers the architecture, the SFT
settings shape, the JSONL data contract, and how snapshot_save /
snapshot_restore participate in the TRAIN-03 round-trip.
The HF transformers + accelerate path lands in plan 04-05; this skeleton intentionally has zero Python / GPU dependencies so the snapshot resume contract is exercised on every CI build.
Architecture
┌─────────────────────────┐ ┌────────────────────────────┐
│ SftSettings (TOML) │ │ AlgoDependencies │
│ base_model, optimizer │ │ backend : Arc<dyn TB> │
│ budget, dataset │ │ storage : Arc<dyn S> │
│ packing, loss_on, … │ │ object : Arc<dyn O> │
└────────────┬────────────┘ │ snapshots: Arc<dyn Sn> │
│ from_settings │ events : Arc<dyn Em> │
▼ └─────────────┬──────────────┘
┌─────────────────────┐ │
│ SftAlgo │ ◀─────────────────────┘
│ (PolicyAlgorithm) │
│ step: u64 │ run() loop, bounded by budget.max_steps
└──────────┬──────────┘
│ step_once()
▼
forward_with_loss → optimizer_step → step += 1
│
▼
snapshot_save / snapshot_restore (algo meta only — weights via TrainableBackend::save_weights)
SftAlgo holds an Arc<dyn TrainableBackend> and a step counter. Each
step_once() synthesises a single-row TrainBatch, calls
forward_with_loss (which returns a constant loss=0.5 plus an opaque
GradHandle), then calls optimizer_step. The trait's optimizer_step
takes &self (interior mutability) so the algo can step through the
Arc<dyn …> without unique ownership.
Plan 04-05 replaces the synthetic batch with a real tokenized chunk read from the dataset.
SftSettings (TOML shape)
[algorithm]
kind = "sft"
[algorithm.sft]
minibatch_size = 8
gradient_accumulation = 1
loss_on = { kind = "assistant_only" }
[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"
[algorithm.sft.dataset]
kind = "jsonl_path"
path = "data/sft-tiny.jsonl"
[algorithm.sft.optimizer]
kind = "sgd"
lr = 1.0e-3
weight_decay = 0.0
betas = [0.9, 0.999]
eps = 1.0e-8
warmup_steps = 0
schedule = "constant"
[algorithm.sft.budget]
max_steps = 1000
[algorithm.sft.packing]
kind = "off"
max_seq_len = 2048
The JSON schema is generated by cargo xtask schema-gen from
rollout_core::config::training::SftSettings.
JSONL data contract (D-DATA-01)
load_jsonl accepts two shapes per row:
| Shape | JSON | DataRow |
|---|---|---|
| Prompt / completion | {"prompt":"Q","completion":"A"} | { prompt: "Q", assistant: "A" } |
| Chat messages | {"messages":[{"role":"user","content":"Q"},{"role":"assistant","content":"A"}]} | { prompt: "[user] Q", assistant: "A" } |
Phase-4 restrictions:
- At most one
"role": "assistant"turn per row (multi-turn lands when the harness work in Phase 7 needs it). - Empty lines are skipped.
- Malformed lines (neither shape, missing assistant in messages,
unparseable JSON) produce
Fatal(ConfigInvalid)with the file path and line number —<path>:<lineno>: <reason>. Easy to grep, easy to fix.
The Phase-7 work (HARNESS-*) extends this to DatasetRef::Other(...)
for harness-driven datasets; until then, Other is a config error.
validate_plan errors
| Locator | Reason |
|---|---|
algorithm.sft.minibatch_size | must be ≥ 1 |
algorithm.sft.optimizer.lr | must be > 0 |
These fire at plan time so the CLI rejects bad configs before any backend is constructed.
snapshot_save / snapshot_restore (D-DETERM-05)
SftAlgo::snapshot_save builds a Snapshot row with:
meta = { step: <u64>, weights_id: "<hex>" }— algorithm-internal extras, free-formserde_json::Value.parts = [{ role: "weights", content: <ContentId> }]— points at the bytes returned byTrainableBackend::save_weights.kind = SnapshotKind::TrainState.
SftAlgo::snapshot_restore reads meta.step and resets the algo's
step counter. The backend's weights are restored separately — production
backends call TrainableBackend::load_weights(&weights_id); the Phase-4
snapshot_resume.rs test rebuilds MockBackend directly from a
captured weights_snapshot() (the byte-compare assertion would be
meaningless if it went through load_weights, which is a no-op for the
mock).
The full production save path (tar of the accelerate dir, blake3 over
the tar bytes, content-addressed put on the object store) lives in
SnapshotterImpl::save_train_state (plan 04-01) and runs alongside the
algo-level snapshot_save call.
TRAIN-03 byte-compare proof
tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the
LOAD-BEARING proof for TRAIN-03. It runs on every CI build with no GPU
and no HF transformers — exercising the resume contract on the
MockBackend path. The flow:
- Run A: fresh
MockBackend::new_train(42)→ 10step_onceiterations → captureweights_a. - Run B (phase 1): fresh
MockBackend::new_train(42)→ 5step_onceiterations → captureweights_after_5→algo_b1.snapshot_save()→ drop algo + backend. - Run B (phase 2):
MockBackend::new_train_with_weights(42, weights_after_5)→ push step counter to 5 via test helperset_step→ algosnapshot_restore→ 5 morestep_onceiterations → captureweights_b. - Assert
weights_a == weights_b(byte-equal).
The set_step(5) helper is a MockBackend-only affordance because the
algo sees only Arc<dyn TrainableBackend> and load_weights is a no-op
on the mock. Production backends restore the optimizer step counter via
their own checkpoint format inside load_weights.
Running the example
The smallest possible SFT run lives at examples/sft-tiny.toml +
examples/sft-tiny.jsonl (4 chat rows; Qwen2.5-0.5B-Instruct; max_steps = 2). Two ways to exercise it:
Dry-run (works without Python deps; validates config + dataset path + algorithm shape):
cargo run -p rollout-cli -- train sft \
--config examples/sft-tiny.toml --dry-run
Live run (requires transformers + accelerate + torch; ~3-5 min M-series CPU):
pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke
make train-smoke invokes scripts/train-smoke.sh, which dry-runs first then
runs the full SFT path against Qwen/Qwen2.5-0.5B-Instruct on CPU. See
CLI for the full subcommand surface.
Next steps
- Plan 04-05 (
backend-vllm-train) swaps the synthetic batch + deterministic-SGD path for real tokenization through HF transformers- accelerate; the same
PolicyAlgorithmsurface drives it.
- accelerate; the same
- Plan 04-06 (CLI) mounts
rollout train sft --config <toml>on top ofSftAlgo::run. - Plan 04-07 polishes the docs and ships the v1 SFT smoke recipe.
Reward-model training (RM)
rollout-algo-rm implements the Bradley-Terry reward-model training algorithm
(TRAIN-02). It mirrors the SFT algorithm's structure — PolicyAlgorithm impl
driven by a TrainableBackend, JSONL data loader, and a TRAIN-03
byte-compare resume proof — but consumes pairwise preferences instead of
single sequences.
Overview
A reward model learns to score responses on a scalar "quality" axis. Training
data is a stream of preference pairs (prompt, chosen, rejected): the model
should learn to rank chosen higher than rejected for the given prompt.
The Bradley-Terry objective formalizes this as a pairwise logistic regression on the reward gap:
L = -E[ ln σ(r_chosen - r_rejected) ]
where σ is the logistic function and r_* are the scalar reward outputs.
Spec 02 §7 carries the contract.
RmSettings (TOML)
[algorithm.rm]
base_model = "Qwen/Qwen2.5-0.5B-Instruct"
head = "bradley_terry" # Phase 4 supports BradleyTerry only
minibatch_size = 8
[algorithm.rm.optimizer]
kind = "sgd"
lr = 1.0e-5
[algorithm.rm.budget]
max_steps = 100
[algorithm.rm.dataset]
type = "jsonl_path"
path = "examples/data/pairs.jsonl"
Other RmSettings fields (base_model, optimizer, budget, dataset)
mirror SftSettings. Head selection is bradley_terry only in Phase 4;
pairwise_logistic is a Fatal(ConfigInvalid) with a Phase 9 sentinel
until the RL pipeline lands.
Bradley-Terry loss math
Implemented in crates/rollout-algo-rm/src/loss.rs:
logsigmoid(x) = ln σ(x). Numerically stable via the softplus trick —logsigmoid(50)andlogsigmoid(-50)both return finite values within1e-4of the true asymptote.bradley_terry_loss(r_chosen, r_rejected) = -logsigmoid(r_chosen - r_rejected).bradley_terry_batch_mean(pairs)— mean over a slice of(r_chosen, r_rejected)pairs. Returns0.0for empty batches; callers should validate non-empty upstream when needed.
Pinned golden values (tests/bradley_terry_loss.rs):
| Case | Inputs | Expected |
|---|---|---|
| Zero diff | (1.0, 1.0) | ln 2 ≈ 0.6931 |
| Strong preference | (5.0, -5.0) | near 0 (≪ 1e-3) |
| Inverted preference | (-5.0, 5.0) | ≈ 10.0 (± 1e-3) |
| Mixed batch | [(2,1), (1,2)] | mean ≈ 0.8133 |
| Empty batch | &[] | exactly 0.0 |
| Numerical stability | logsigmoid(±50) | finite; asymptotic value within 1e-4 |
JSONL data shape (D-DATA-01)
Phase 4 supports one row shape:
{"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
{"prompt": "Capital of France?", "chosen": "Paris", "rejected": "London"}
load_pairs(&path) parses line-by-line, skipping blank lines and rejecting
malformed rows with Fatal(ConfigInvalid) prefixed <file>:<lineno>:. A row
missing any of the three fields is malformed.
PolicyAlgorithm surface
| Method | Behavior |
|---|---|
id() | AlgorithmId("rm") |
Settings | rollout_core::config::training::RmSettings |
from_settings | clones deps.backend into the algo; step = 0 |
required_roles | vec![WorkerRole::LearnerWorker] |
validate_plan | rejects RmHeadKind::PairwiseLogistic (Phase 9); rejects minibatch_size == 0; lr <= 0 |
run | loads pairs once; loops step_once up to budget.max_steps, honoring ctx.cancel |
snapshot_save | meta = {step, weights_id}; one SnapshotPart { role: "weights" } |
snapshot_restore | restores self.step from meta.step; backend weights restored separately |
step_once synthesizes a 2-row TrainBatch (one row per side of a pair) and
drives forward_with_loss → optimizer_step. In the Phase-4 MockBackend
test path the loss is a constant; the real Bradley-Terry loss fires under
plan 04-05's HF transformers integration.
TRAIN-03 second-witness — byte-compare resume
tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the Bradley-Terry
twin of the SFT byte-compare proof. Structure:
- Run A. 10
step_onceiterations withseed = 42; capture weights. - Run B Phase 1. 5 steps; capture mid-run weights;
snapshot_save(). - Run B Phase 2. Rebuild
MockBackend::new_train_with_weights(42, …); push step counter to 5; restore algo step from snapshot meta; 5 more steps. - Assert.
weights_a == weights_bbyte-for-byte.
This is the second-witness for TRAIN-03 (the SFT proof is the first witness); together they discharge the "deterministic resume" exit criterion across both Phase-4 algorithms.
Content-addressed final checkpoint
tests/checkpoint_roundtrip.rs proves that TrainableBackend::save_weights
returns a ContentId that is stable when the backend is idle (two calls →
identical hash) and different after a non-trivial optimizer_step. This
matches the TRAIN-02 contract: the final checkpoint is content-addressed by
the blake3 hash of the postcard-encoded weights.
Phase 4 head support
Only RmHeadKind::BradleyTerry is wired in Phase 4. PairwiseLogistic exists
in the enum so the config schema can be cross-validated end-to-end, but
selecting it returns a Fatal(ConfigInvalid) with the string Phase 9 in the
message — Phase 9 lands the full RL pipeline including alternate preference
heads.
What lands later
- Plan 04-05 swaps
MockBackendfor the real HF transformers / accelerate training loop onQwen/Qwen2.5-0.5B-Instruct(CPU), wiring the Python-sideF.logsigmoid(r_chosen - r_rejected).neg().mean()and producing real reward models. - Plan 04-06 mounts
rollout train rm --config <toml>onRmAlgo::run. - Phase 9 lands
PairwiseLogisticand the RL-* algorithms (PPO/GRPO) that consume reward models trained here.
Running the example
The smallest possible RM run lives at examples/rm-tiny.toml +
examples/rm-tiny.jsonl (4 preference pairs; Qwen2.5-0.5B-Instruct base;
BradleyTerry head; max_steps = 2).
Dry-run (works without Python deps):
cargo run -p rollout-cli -- train rm \
--config examples/rm-tiny.toml --dry-run
The live rollout train rm path through --features train is wired
identically to SFT; the Phase-4 smoke recipe (make train-smoke) exercises
SFT specifically — the RM pipeline is dry-run-validated here and lands
under the Phase-9 RL recipe alongside PPO / GRPO. See CLI for
the full subcommand surface.
See also
- Snapshots —
SnapshotterImpl/SnapshotKind::TrainState - SFT — sibling algorithm sharing the same trait surface
- Spec 02 §7 — RM contract
Snapshots
rollout-snapshots ships SnapshotterImpl, the Phase-4 implementation of the
rollout_core::Snapshotter trait. It owns the persistence path for
SnapshotKind::TrainState — the only kind implemented in v1's training story.
Three other kinds (Buffer, Process, EpisodicMemory) compile but return
Fatal { PluginContract, msg: "Phase N: <kind>" } until their owning phases
land (9 / 11 / 8 respectively).
See docs/specs/04-storage-snapshots.md for the authoritative contract and
crates/rollout-snapshots/ for the implementation.
Architecture
PolicyAlgorithm SnapshotterImpl Storage + ObjectStore
───────────────────── ─────────────────── ──────────────────────
algo.snapshot_save ──▶ save_train_state(request, dir)
│
▼
build_deterministic_tar(dir) (no I/O on substrate yet)
│
▼
ContentId::of(tar_bytes) = blake3
│
┌───────────────┴────────────────┐
▼ ▼
ObjectStore::put_bytes(tar) Storage::begin().txn
(returns same ContentId) put_bytes(snapshot_key, json(Snapshot))
│ │
└────────► tar blob └────────► snapshot row (namespace=snapshots)
Save returns a Snapshot whose parts[0].content == ContentId::of(tar);
restore inverts the pipeline (fetch → blake3-verify → extract).
Metadata layout
A Snapshot row lives at
StorageKey { namespace = "snapshots", run_id = Some(run_id), path = [hex(snapshot_id)] }
and is JSON-encoded. Spec-04 §5.1 lists the fields:
| Field | Type | Purpose |
|---|---|---|
id | SnapshotId | ContentId of the tar bytes (Phase 4 = single part) |
kind | SnapshotKind | TrainState in Phase 4; others reserved |
run_id | RunId | Owning run |
created_at | DateTime<Utc> | RFC3339 wire form |
label | Option<SmolStr> | Optional human-readable label (CLI --label) |
parts | Vec<SnapshotPart> | One per blob; Phase 4 ships exactly one (role="tar") |
algorithm_id | AlgorithmId | Producing algorithm ("sft", "rm", ...) |
meta | serde_json::Value | Algorithm-internal extras (D-DETERM-05; opaque to core) |
JSON (not postcard) is the on-disk encoding because serde_json::Value is a
self-describing format that postcard intentionally refuses to encode. The
small row size and infrequent writes make the choice cheap, and the row is
human-readable on disk for debugging.
Tar reproducibility contract (Pitfall 9)
build_deterministic_tar(&Path) -> Vec<u8> is the byte-stable tar builder.
It must produce identical bytes for identical input directories across runs,
machines, and tar versions — otherwise the blake3 hash drifts and resumes
fail to find their predecessor blob.
The invariants enforced:
- Sort entries by path before writing (file-system iteration order is non-deterministic).
HeaderMode::Deterministiczeroes mtime/uid/gid metadata at write time but does not zero mode bits. The mode bits are platform-dependent (macOS gives regular files0o755by default).- Explicit
header.set_mode(0o644)for files and0o755for dirs. - Explicit
set_mtime(0),set_uid(0),set_gid(0)on top ofHeaderMode::Deterministic. - No compression — gzip / zstd byte streams drift across versions.
- GNU header format (
Header::new_gnu) — stable across tar releases.
crates/rollout-snapshots/tests/deterministic_tar.rs proves Pitfall 9 holds
by parsing every entry header and asserting mode = 0o644 | 0o755.
Restore semantics
restore_train_state(&snapshot, dst_dir) fetches the tar blob via
ObjectStore::get_bytes(parts[0].content), verifies
blake3(bytes) == parts[0].content, and extracts to dst_dir. A mismatch
returns Fatal { PluginContract, msg: "blake3 mismatch on restore: ..." }.
The bare Snapshotter::restore trait method takes a RestoreTarget enum:
| Variant | Phase 4 behavior |
|---|---|
SameRun | Returns Fatal { PluginContract } — the trait method has no dst_dir; callers use restore_train_state(snapshot, dst_dir) directly. |
Fork | Returns Fatal { PluginContract, msg: "Phase 9: Fork restore (new_run_id=...)" } |
Worker | Returns Fatal { PluginContract, msg: "Phase 6: Worker restore (worker_id=...)" } |
This is intentional: the Phase-4 surface optimizes for the SFT/RM training
loop, which holds the destination directory locally and drives
restore_train_state directly. Phase-9 PPO actor-swap adds the multi-worker
restore plumbing.
List + prune surface
Snapshotter::list(SnapshotFilter) scans namespace="snapshots", optionally
filters by run_id / kind / label_contains, sorts newest-first
(created_at descending), and caps by limit. The scan is O(snapshots in
namespace); the secondary SnapshotId → key index is deferred to Phase 9.
Snapshotter::prune(PrunePolicy { run_id, retention }) enforces
RetentionPolicy:
keep_last: u32— keep the N most recent snapshots regardless of label.keep_labeled: bool— labeled snapshots are immune to pruning when true.max_age: Option<Duration>— anything older thanmax_ageis eligible.
Returns the count of metadata rows deleted. The underlying tar blobs are
not deleted in Phase 4 — ObjectStore has no delete method yet
(Phase-5 addition). Blobs are content-addressed and idempotent; orphaned
ones cost storage but never corrupt a restore.
Algorithm-internal state — meta: serde_json::Value
Snapshot.meta is an opaque JSON blob owned by the producing algorithm
(D-DETERM-05). The framework never inspects it. SFT might store
{ "step": 5, "curriculum_cursor": 12 }; RM might store
{ "epoch": 2, "best_loss": 0.42 }. This keeps algorithm-specific resume
state out of the framework-owned snapshot row.
Determinism caveats
- CPU mode: byte-identical on the same toolchain across machines.
- CUDA same-SM: weights+optimizer bit-identical when the producing and restoring GPU expose the same compute capability. Cross-SM is best-effort.
- Cross-machine: best-effort. The tar is reproducible; PyTorch / accelerate restore is not always bit-identical across GPU generations.
Plan 04-05 (backend-vllm-train) lands the determinism CI gate and CPU-mode
fallback for environments without CUDA.
Pointers
crates/rollout-snapshots/src/lib.rs—SnapshotterImpl+ trait impl.crates/rollout-snapshots/src/tar_build.rs— deterministic tar builder.crates/rollout-snapshots/src/kind/train_state.rs— save + restore pipeline.crates/rollout-snapshots/src/policy.rs—Snapshotter::pruneretention enforcement.crates/rollout-snapshots/src/key.rs—StorageKeyhelpers fornamespace="snapshots".crates/rollout-storage/src/embedded/tables.rs—T_SNAPSHOTStable definition.docs/specs/04-storage-snapshots.md— authoritative contract (especially §5 + §5a + §7).
Postgres backend
Phase 4 / TRAIN-04: a Storage impl backed by Postgres 16 alongside the
default embedded redb store. Same trait, same semantics — choose at config
time. The crate-level Cargo feature postgres on rollout-storage gates
the dep set (sqlx 0.8 + uuid + ulid + async-stream); default builds remain
sqlx-free.
Why Postgres alongside embedded
The embedded backend (redb) is local-process only. Multi-process or
multi-host runs need a shared store that fan-outs change notifications
across processes — that's what Postgres gives us via LISTEN/NOTIFY. The
trait surface stays uniform: EmbeddedStorage and PostgresStorage both
implement Storage::watch_stream, wrapping their respective notification
mechanisms in a BoxStream<StorageEvent>.
| Method | Embedded | Postgres |
|---|---|---|
begin / put / get / cas | redb txn over local fs | sqlx txn over network |
watch (broadcast) | tokio::sync::broadcast::Receiver | unsupported — returns Fatal(PluginContract) |
watch_stream | wraps the broadcast in BroadcastStream | PgListener over LISTEN/NOTIFY |
Cross-process subscribers MUST use watch_stream. The embedded backend
implements watch_stream by wrapping its in-process broadcast — handy when
some callers need a stream-shaped surface even on a local run.
Schema
Two migrations under database/migrations/ are embedded at build time via
sqlx::migrate!():
0001_init.sql— thekvtable that backs allStorageKeyrows.0002_snapshots.sql—snapshots+eventstables consumed byrollout-snapshots(plan 04-01) and the spec-09 observability ledger.
The runs / workers tables defer to Phase 6 (multi-node distribution).
kv table
CREATE TABLE kv (
namespace TEXT NOT NULL,
run_id UUID, -- ULID-as-UUID; NULL for global rows
path TEXT[] NOT NULL,
value BYTEA NOT NULL,
version BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (namespace, run_id, path)
);
run_id is stored as UUID (16 bytes) — the same byte layout as the
underlying ULID. RunId(Ulid) round-trips through Uuid::from_bytes and
back without loss.
LISTEN / NOTIFY contract
A row-level trigger on kv fires pg_notify(channel, payload) after every
INSERT/UPDATE/DELETE:
- channel:
rollout_watch_<namespace>(max 63 chars;rollout_watch_is 14, leaving 49 for the namespace). - payload:
<run_id_uuid_or_empty>|<path_parts_joined_by_slash>, truncated to 7999 bytes per Pitfall 5 (Postgres capspg_notifypayloads at 8000).
PgListener consumers parse the payload, filter by prefix (matching the
caller's prefix.run_id + prefix.path), and emit a StorageEvent::Put
per notification. Put vs delete is not distinguished in the payload format
shipped here; Phase 9 may extend the trigger to include a +/- prefix if
downstream needs demand it.
PgListener reconnects transparently on connection drop; the watch_stream
loop logs the failure and continues.
Trait surface
Storage::watch (broadcast) is intentionally not implemented by the
Postgres backend — broadcast is an in-process abstraction. Callers that
need a tokio::sync::broadcast::Receiver must run against EmbeddedStorage
or implement their own fan-out on top of watch_stream. The Postgres impl
returns:
Fatal(PluginContract { plugin: "PostgresStorage",
msg: "PostgresStorage does not support in-process broadcast watch;
use watch_stream for cross-process notification" })
Migrations
Migrations are forward-only, sequentially numbered (0001_*.sql,
0002_*.sql, ...). PostgresStorage::new runs them once via
sqlx::migrate!("../../database/migrations").run(&pool). The macro embeds
every SQL file into the binary at compile time, so the pool only needs
network access — no on-disk migration directory in production.
Running new() twice on the same pool is a no-op: sqlx::migrate records
applied versions in _sqlx_migrations.
Adding a migration
- Write
database/migrations/NNNN_<name>.sql. - Start a local Postgres:
docker run --rm -e POSTGRES_PASSWORD=pw -p 5432:5432 postgres:16. - (Optional, when we switch to compile-time
query!macros) regenerate the.sqlx/offline cache:DATABASE_URL=postgres://postgres:pw@localhost/postgres SQLX_OFFLINE=false cargo sqlx prepare --workspace -- --features postgres. - Commit the migration AND the regenerated
.sqlx/files (currently the directory only carries a.gitkeep; this plan ships runtime-checked SQL viasqlx::query).
Offline mode
SQLX_OFFLINE=true lives in .cargo/config.toml (not .env) per Pitfall 4
— sqlx-cli reads .env at startup and refuses to talk to the DB during
cargo sqlx prepare if SQLX_OFFLINE=true is set there. Putting it in
.cargo/config.toml keeps cargo build --features postgres happy
everywhere except when explicitly running sqlx prepare.
Phase 4 ships runtime-checked sqlx::query calls; the .sqlx/ directory
is reserved (.gitkeep) for a future switch to compile-time query!
macros once the surface stabilizes.
Pool sizing
PostgresStorage::new(url, pool_size) opens two pools:
| Pool | min | max | acquire | idle |
|---|---|---|---|---|
| Write pool | 0 | pool_size | 30 s | 10 min |
| Watch pool | 0 | 4 | 30 s | (default) |
pool_size = 16 is a reasonable default for a Phase-6 worker; tests use
4. The watch pool is small and dedicated because each PgListener holds a
connection for the lifetime of its watch_stream; you don't want them
crowding out write capacity.
Testcontainers CI
crates/rollout-storage/tests/postgres_integration.rs runs against a
disposable Postgres 16 container via testcontainers-modules. Six tests
cover CRUD, CAS atomicity, watch_stream delivery, migration idempotency,
pool reuse under load, and prefix scan.
Each test carries #[ignore = "requires Docker / testcontainers"] so the
default cargo test --workspace --tests flow (macOS dev loop, no Docker
guaranteed) stays green. CI opts in via:
cargo test -p rollout-storage --features postgres \
--test postgres_integration -- --include-ignored --test-threads=1
--test-threads=1 because the test starts a fresh container per test
function; running six simultaneously is wasteful (and slower than
serial under most CI runners' Docker capacity).
The retry loop on PostgresStorage::new (30 attempts × 2 s) handles
Pitfall 6: the container reports "running" before Postgres accepts
connections.
Local dev
make postgres-test
The target checks docker info first and fails fast with a helpful
message if Docker isn't running.
Limitations
- Put-only events from
pg_notify. The trigger emits a single notification format for both INSERT/UPDATE/DELETE; consumers seeStorageEvent::Putonly. Phase 9 may extend the trigger payload to carry a+/-operation byte if downstream callers need to distinguish deletes. get_many_bytesis sequential. A future optimization could batch viaANY($1)array binding; not on the Phase-4 critical path.- No streaming
scan_bytes. Phase-2 simplification carries forward; prefix scans return ownedVecrows. Hot prefixes (millions of rows) will need a streaming variant in Phase 6. - No
query!macro yet. Runtime-checked SQL keeps the build hermetic without a.sqlx/cache; revisit once the schema stabilizes.
Determinism
Phase 4 of rollout commits to deterministic training-state snapshots
(D-DETERM-01): a Snapshot.parts[0].content is the same ContentId whenever
the same code drives the same model with the same RNG seed and the same input
stream. This page documents the contract, the platform caveats, and every
mitigation that lands in rollout-backend-vllm --features train.
The determinism stack
| Layer | Mitigation | Where |
|---|---|---|
| Process env | CUBLAS_WORKSPACE_CONFIG=:4096:8, PYTHONHASHSEED=0 | train.py preamble |
| RNG seeds | random + numpy + torch + torch.cuda.manual_seed_all | _set_determinism_flags |
| PyTorch deterministic | torch.use_deterministic_algorithms(True) | _set_determinism_flags |
| cuDNN | cudnn.deterministic = True, cudnn.benchmark = False | _set_determinism_flags |
| Matmul precision | torch.set_float32_matmul_precision("highest") | _set_determinism_flags |
| Chat template | {% generation %} markers via GENERATION_MARKED_QWEN25_TEMPLATE | qwen25_chat_template.py |
| Dataloader | torchdata.StatefulDataLoader if available; step-replay fallback | init_train |
| Accelerator | accelerator.prepare(model, optimizer, scheduler) | init_train |
| LR scheduler | register_for_checkpointing (Pitfall 10 fallback) | init_train |
| Tar | byte-stable mode bits + sorted walkdir + mtime=0 (Pitfall 9) | rollout-snapshots::build_deterministic_tar |
Why preamble ordering matters (Pitfall 2)
CUBLAS_WORKSPACE_CONFIG and PYTHONHASHSEED are read by torch on first
import. If import torch runs before they are set, the values are baked in
and later os.environ mutations do not change behavior. The Rust side
enforces ordering by writing the env vars on the worker thread before
calling py.import("rollout.backends.vllm.train") — see
rollout-backend-vllm/src/train.rs::import_train_module.
train.py repeats os.environ.setdefault(...) at the top of the module
defensively: if someone imports it directly from a Python REPL with torch
already loaded, the user gets non-determinism, but the source still documents
the contract.
cuDNN.benchmark is the silent killer (Pitfall 8)
cuDNN auto-tunes kernel selection by running candidate kernels and picking the
fastest. This is non-deterministic — re-running the same workload picks a
different kernel under load. cudnn.benchmark = False is required, even
when cudnn.deterministic = True. The Phase-4 implementation sets both
explicitly in _set_determinism_flags.
LR scheduler must be in the save-state capture path (Pitfall 10)
accelerator.save_state captures the model, optimizer, RNG, and any object
explicitly registered for checkpointing. The Phase-4 path prefers passing the
scheduler through accelerator.prepare(model, optimizer, scheduler); if that
fails (e.g., the scheduler shape isn't supported by accelerate's prepare),
the code falls back to accelerator.register_for_checkpointing(scheduler).
Without one of these, the LR resumes from lr_start after a snapshot restore
instead of continuing the schedule — silent non-determinism.
CPU vs CUDA contract
| Hardware | Bit-identical? |
|---|---|
| Same CPU model, same OS, same env | Yes (the load-bearing CI target) |
| Same SM (sm_80 vs sm_80), same cuDNN | Yes, with deterministic + !benchmark |
| Different SM (sm_80 vs sm_90) | No — kernels differ |
| Different cuDNN minor version | No — algorithm shortlist may differ |
| Cross-machine (CPU A vs CPU B) | Best-effort; FP rounding differs |
The snapshot_resume_live.rs live witness exercises the same-CPU case on
the dev box. CI runs the MockBackend variant from plan 04-02
(snapshot_resume.rs::bit_identical_resume_at_step_5) for the unconditional
green signal.
accelerate.save_state captures
| Object | Source | Restored on load_state? |
|---|---|---|
| Model weights | accelerator.prepare(model) | Yes |
| Optimizer | accelerator.prepare(optimizer) | Yes |
| RNG (torch) | implicit via torch.random.get_rng_state | Yes |
| RNG (cuda) | implicit if CUDA is initialized | Yes |
| LR scheduler | prepare(scheduler) OR register_for_checkpointing | Yes |
| Dataloader | DataLoaderConfiguration(use_stateful_dataloader=True) | If torchdata installed |
The Phase-4 contract: anything that affects subsequent step output must be in
this list. If a future plan adds custom state (e.g., reward-model normalizer
running stats), it MUST register via register_for_checkpointing to keep the
TRAIN-03 byte-compare proof intact.
torchdata stateful dataloader (Pitfall 3)
init_train probes for torchdata. If present, it constructs a
DataLoaderConfiguration(use_stateful_dataloader=True) so the dataloader's
position is checkpointed by save_state. Without torchdata, the runtime
falls back to step-replay: the resume side re-reads from the JSONL head and
skips step rows. Both modes preserve the TRAIN-03 byte-compare invariant on
deterministic input streams.
Pitfall 7 — Accelerator singleton
Accelerator() is a singleton per Python process; constructing two of them
in the same interpreter raises. init_train is idempotent — the second call
returns the cached _STATE dict. teardown_train flushes the singleton via
del + gc.collect() + torch.cuda.empty_cache() so the next init_train
call (in a subsequent run, or after a mid-process swap-back to vLLM
inference) can construct a fresh accelerator.
A bidirectional mid-process swap (training ↔ inference under the same OS
thread) is Phase 9 — see the deferral note in
rollout-backend-vllm/src/train.rs::run_set_train_mode.
MockBackend vs live HF path
| Backend | CPU bit-identical? | When to use |
|---|---|---|
MockBackend | Yes (Array1 | CI; every PR; algo-side tests |
| Live HF | Yes on identical CPU; same-SM only on CUDA | Dev-box live witness; nightly |
The MockBackend path is the load-bearing CI proof for TRAIN-03. The
live HF path is the gated witness behind ROLLOUT_TRANSFORMERS_AVAILABLE=1.
Where the code lives
python/rollout/backends/vllm/train.py— determinism preamble + Accelerator construction.python/rollout/backends/vllm/qwen25_chat_template.py— generation-marked chat template.crates/rollout-backend-vllm/src/train.rs— env-write-before-import enforcer;py.detachwrappers.crates/rollout-snapshots/src/tar_build.rs— Pitfall 9 deterministic tar.
Related
- Snapshots — Phase-4 snapshot pipeline.
- SFT — SFT algorithm + TRAIN-03 byte-compare proof.
- CPU mode — running Phase-4 training on CPU only.
rollout train + rollout snapshot CLI
Phase-4 user-facing entry points for supervised fine-tuning, reward-model
training, and snapshot management. Mirrors the Phase-3 rollout infer batch
clap derive shape, run_id lifecycle, and backend-selection precedence.
Subcommand overview
| Subcommand | Purpose |
|---|---|
rollout train sft | Supervised fine-tuning. Validates + runs an SftAlgo budget. |
rollout train rm | Reward-model (Bradley-Terry) training via RmAlgo. |
rollout snapshot list | List snapshots (optionally filtered by run / kind). |
rollout snapshot show | Print one snapshot's metadata by content-id. |
rollout snapshot prune | Delete snapshots per a retention policy. |
rollout train sft
rollout train sft \
--config examples/sft-tiny.toml \
[--resume <snapshot_id>] \
[--dry-run]
| Flag | Default | Purpose |
|---|---|---|
--config | required | Path to the run TOML (schema below). |
--resume | none | Snapshot content-id to restore from before the algorithm runs. |
--dry-run | false | Validate config + dataset path; never construct backend. Works with no backend feature. |
SFT TOML schema
schema_version = 1
[storage]
backend = "embedded"
[storage.embedded]
path = "./rollout.db"
[algorithm]
kind = "sft"
[algorithm.sft]
minibatch_size = 1
gradient_accumulation = 1
[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"
[algorithm.sft.optimizer]
kind = "sgd"
lr = 0.01
[algorithm.sft.budget]
max_steps = 10
[algorithm.sft.dataset]
kind = "jsonl_path"
path = "./data/sft.jsonl"
[algorithm.sft.packing]
kind = "off"
max_seq_len = 64
[algorithm.sft.loss_on]
kind = "full"
The schema is owned by rollout-core::config::RunConfig; the CLI is just a
TOML-loader + clap surface (spec 11 single-source-of-truth). Unknown fields
fail load with a deterministic locator.
rollout train rm
rollout train rm \
--config examples/rm-tiny.toml \
[--resume <snapshot_id>] \
[--dry-run]
Same flags as sft. The TOML differs only in the algorithm block:
schema_version = 1
[storage]
backend = "embedded"
[storage.embedded]
path = "./rollout.db"
[algorithm]
kind = "rm"
[algorithm.rm]
minibatch_size = 1
[algorithm.rm.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"
[algorithm.rm.optimizer]
kind = "sgd"
lr = 0.01
[algorithm.rm.budget]
max_steps = 10
[algorithm.rm.dataset]
kind = "jsonl_path"
path = "./data/pairs.jsonl"
[algorithm.rm.head]
kind = "bradley_terry"
JSONL input contract: each line { "prompt", "chosen", "rejected" }.
--dry-run semantics
In order, with the backend NEVER constructed:
- Read + parse TOML against
RunConfig(deny_unknown_fields). - Match
[algorithm].kindagainst the subcommand (sftvsrm). - Validate
minibatch_size >= 1andoptimizer.lr > 0. - Confirm
dataset.pathexists on disk (forjsonl_pathform). - Print
dry-run OK: algorithm=<sft|rm> model=… minibatch=… dataset=…and exit0.
Because --dry-run short-circuits before select_backend, it runs cleanly on
builds with NEITHER --features train NOR --features test-mock-backend.
--resume <snapshot_id> lifecycle
If --resume <hex> is set, the CLI scans the local rollout.db for a
Snapshot whose id matches the supplied content-id (32-byte blake3, 64-char
hex). The matching row is fed into algorithm.snapshot_restore(snap) before
algorithm.run(&ctx) begins. Determinism then follows the per-algorithm
contract: SftAlgo proves byte-identical resume via
tests/snapshot_resume.rs::bit_identical_resume_at_step_5; RmAlgo proves
parity via tests/snapshot_resume.rs::rm_bit_identical_resume.
Missing snapshot → Fatal(ConfigInvalid) with the failed hex echoed back.
Backend selection
rollout-cli builds with at most one training backend at a time, by Cargo
feature. Selection at runtime is precedence-ordered:
| Order | Build flags | Env | Backend |
|---|---|---|---|
| 1 | --features test-mock-backend | ROLLOUT_TEST_MOCK_BACKEND=1 | rollout_runtime_batch::MockBackend (deterministic, no GPU/Python). |
| 2 | --features vllm,train | (any) | rollout_backend_vllm::VllmBackend in train mode (live HF / accelerate). |
| 3 | none | (any) | Fatal(ConfigInvalid) with a build-mode hint; --dry-run still works. |
Build recipes:
# CI / tests (no Python, no GPU)
cargo build -p rollout-cli --features test-mock-backend
# Production (live HuggingFace + accelerate over Python)
cargo build -p rollout-cli --features vllm,train
# Both — vllm takes precedence at runtime unless ROLLOUT_TEST_MOCK_BACKEND=1.
cargo build -p rollout-cli --features test-mock-backend,vllm,train
The train feature implies vllm per Cargo.toml; the two are kept distinct
so the existing Phase-3 infer batch users get vllm without pulling the
training Python deps.
rollout snapshot list
rollout snapshot list \
[--storage-path ./rollout.db] \
[--object-path ./object-store] \
[--run-id <ULID>] \
[--kind train_state|buffer|process|episodic_memory] \
[--limit N]
Output: pretty-printed JSON array of Snapshot rows from rollout-core. Sort
order is newest-first by created_at.
| Flag | Default | Notes |
|---|---|---|
--storage-path | ./rollout.db | Opens an EmbeddedStorage read-write. |
--object-path | ./object-store | Opens a FsObjectStore (read-only for list, but consistently surfaced). |
--run-id | none | Crockford ULID; restrict to one run. Without it, every run's snapshots are scanned. |
--kind | none | snake_case match against SnapshotKind variants. |
--limit | none | Cap result length. |
rollout snapshot show <snapshot_id>
rollout snapshot show \
[--storage-path ./rollout.db] \
[--object-path ./object-store] \
<SNAPSHOT_ID>
SNAPSHOT_ID is the 64-char hex blake3 digest from Snapshot.id. Prints the
full Snapshot row as pretty-printed JSON. Missing id → exit 2 with
snapshot not found: <id>.
rollout snapshot prune
rollout snapshot prune \
--run-id <ULID> \
[--storage-path ./rollout.db] \
[--object-path ./object-store] \
[--keep-last N=3] \
[--keep-labeled]
Applies a RetentionPolicy scoped to a single run (--run-id is required to
avoid cross-run accidents). --keep-last N retains the N newest snapshots
regardless of label; --keep-labeled (default true) further retains every
labeled snapshot. Metadata rows are deleted; blob bytes stay in the object
store (the Phase-2 ObjectStore trait has no delete — pending Phase-5).
Prints pruned <N> snapshots.
Storage path conventions
| Path | Owner | Purpose |
|---|---|---|
./rollout.db | EmbeddedStorage | redb on-disk DB (always-fsync). |
./object-store/ | FsObjectStore | Content-addressed two-level sharded FS. |
<config_dir>/rollout.db | training runs | `train sft |
<config_dir>/object-store/ | training runs | Same sibling-of-config convention. |
snapshot list|show|prune accept explicit --storage-path / --object-path
so out-of-band tooling and tests can target any directory pair.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success (or successful --dry-run). |
| 2 | Config-invalid, missing dataset / snapshot, substrate error, or algorithm error. |
Observability
RUST_LOG=info rollout train sft … produces structured tracing events. Each
SFT / RM run emits at minimum train_start, train_step, train_end; the
exact fields are pinned by the per-algorithm chapters.
CPU mode
The Phase-4 training surface runs on CPU end-to-end. This is the integration test path on dev boxes (including Apple Silicon) and the smoke recipe target in plan 04-07.
When to use
- Local dev loop on a laptop without CUDA.
- CI smoke that exercises the full HF transformers + accelerate path against
a tiny model (
Qwen/Qwen2.5-0.5B-Instruct). - Reproducing CUDA bugs that turn out to be deterministic-flag misconfiguration.
Expected throughput
| Model | Hardware | Steps/sec |
|---|---|---|
Qwen/Qwen2.5-0.5B-Instruct | Apple M2 Max (CPU) | ~0.1–0.3 |
Qwen/Qwen2.5-0.5B-Instruct | Linux x86_64 16-core | ~0.3–1.0 |
Roughly one to ten seconds per step for the 0.5B model. Anything larger is impractical on CPU; the per-token cost grows superlinearly. CPU mode exists to prove the pipeline, not to train.
Required env
None beyond default. The Phase-4 determinism preamble
(CUBLAS_WORKSPACE_CONFIG, PYTHONHASHSEED) is written by the Rust side
before import torch; CPU runs ignore CUBLAS settings without complaint.
The live tests gate on ROLLOUT_TRANSFORMERS_AVAILABLE=1:
pip install transformers>=4.45 accelerate>=0.34 torch>=2.4
ROLLOUT_TRANSFORMERS_AVAILABLE=1 \
cargo test -p rollout-backend-vllm --features train \
--test snapshot_resume_live -- --ignored --nocapture
Performance caveats
- No streaming. Phase 4 rejects
sampling.stream = trueat the boundary (D-BACKEND-03); training has no streaming surface. - No multi-GPU. CPU mode is single-process. The FSDP plugin in
init_trainonly activates whentorch.cuda.device_count() >= 2. - Slow. The 0.5B model at one step per ~5 seconds on M-series silicon means a 10-step smoke takes a minute. Plan accordingly.
- Determinism still holds. Two CPU runs with the same seed produce
byte-identical
accelerate.save_stateoutput. TheMockBackendvariant inrollout-algo-sft::tests::snapshot_resume::bit_identical_resume_at_step_5proves the Phase-4 contract holds on CPU without HF transformers installed.
Smoke recipe (plan 04-07)
make train-smoke (lands in plan 04-07) runs the live witness on dev boxes
where ROLLOUT_TRANSFORMERS_AVAILABLE=1 is set. CI does not install
transformers/accelerate; the MockBackend test is the unconditional gate.
Related
- Determinism — the determinism contract Phase-4 commits to.
- SFT — algorithm-side overview.
- Snapshots — snapshot pipeline.
Cloud
The cloud layer lifts rollout's local substrate to real object stores and queues
(AWS S3 + SQS, GCP GCS + Pub/Sub) behind the same rollout-core traits. Algorithm
crates never see a cloud SDK — a hard dependency-direction lint plus two CI gates
(public-api-cloud-leak, forbidden-patterns) keep SDK types out of rollout-core's
public API and keep raw metadata URLs / shell=True / libc::fork out of the tree.
Streaming + lease trait methods
Phase 5 extends two rollout-core cloud traits with four new methods. All four ship
with backward-compatible default implementations so every v1.0 caller and impl
compiles unchanged — only the streaming methods carry a #[deprecated] tag whose sole
purpose is to nudge cloud backends into overriding them.
ObjectStore::put_stream / get_stream
#![allow(unused)] fn main() { #[deprecated(note = "Cloud impls MUST override; default buffers entire stream into RAM")] async fn put_stream( &self, stream: Pin<Box<dyn AsyncRead + Send>>, hint: PutHint, ) -> Result<ContentId, CoreError>; #[deprecated(note = "Cloud impls MUST override; default buffers entire blob into RAM")] async fn get_stream(&self, id: &ContentId) -> Result<Pin<Box<dyn AsyncRead + Send>>, CoreError>; }
The default put_stream reads the whole stream into a Vec<u8> and calls put_bytes;
the default get_stream fetches via get_bytes and hands back a Cursor. That is fine
for small blobs but OOMs on multi-GiB snapshots (Pitfall 16 / D-SNAP-04), which is why
the #[deprecated] warning fires at any call site that did not override the method. Cloud
backends (S3 multipart, GCS resumable) and the local FS store override both with true
streaming, so the warning never fires from a correct impl.
Queue::dequeue_with_lease / extend_lease
#![allow(unused)] fn main() { async fn dequeue_with_lease( &self, lease: Duration, ) -> Result<Option<(QueueItemId, Vec<u8>, LeaseToken)>, CoreError>; async fn extend_lease( &self, id: QueueItemId, token: LeaseToken, extend_by: Duration, ) -> Result<(), CoreError>; }
LeaseToken(Vec<u8>) is an opaque per-backend handle: SQS ReceiptHandle bytes, Pub/Sub
ack_id bytes, or — for the in-memory queue — the QueueItemId bytes. The default
dequeue_with_lease ignores the lease duration and synthesizes a token from the item id;
the default extend_lease returns Recoverable::Transient { hint: Never } so a backend
that forgot to override it fails loudly rather than silently dropping the extension. These
two are not #[deprecated] — the default fallback is a correct (if conservative)
behavior for queues without visibility timeouts.
AWS
rollout-cloud-aws implements the four cloud traits against AWS: S3
(ObjectStore), SQS (Queue), Secrets Manager (SecretStore), and EC2
instance metadata (ComputeHint, IMDSv2-only). It is built behind a Cargo
feature so non-AWS builds pull no AWS SDK crates.
Build
cargo build -p rollout-cli --features aws
The aws feature is default-off. A binary built without it that loads an
[cloud] provider = "aws" config returns a fatal error telling you to rebuild
with --features aws.
Config
[cloud]
provider = "aws"
region = "us-west-2"
[cloud.aws.s3]
bucket = "my-rollout-bucket"
prefix = "runs/" # optional
multipart_chunk_bytes = 16777216 # 16 MiB; S3 minimum is 5 MiB
max_snapshot_part_bytes = 5368709120 # 5 GiB; 10 GiB hard cap
[cloud.aws.sqs]
queue_url = "https://sqs.us-west-2.amazonaws.com/123456789012/rollout"
visibility_timeout_secs = 300
[cloud.aws.secrets]
allowlist = ["rollout/hf-token"]
Cross-cloud is structurally impossible: the provider tag is an enum, so a
single config cannot name both AWS and GCP.
IAM permissions
S3 object + multipart operations, SQS send/receive/delete/visibility, and
Secrets Manager GetSecretValue are required. The full IAM matrix and the
mandatory AbortIncompleteMultipartUpload lifecycle rule live in
crates/rollout-cloud-aws/docs/bucket-setup.md.
Testing matrix
| Tier | What runs | When |
|---|---|---|
| unit | error mapping + key layout + IMDSv2 mock handshake | every cargo test (no Docker) |
emulator (cloud-emulator-aws) | full S3 + SQS + Secrets Manager conformance against localstack | every PR — always-on |
live (cloud-live-aws) | same suite against real AWS via OIDC | nightly (cron) |
Locally, bring up the emulator and run the ignored conformance suite:
docker compose -f docker-compose.test.yml up -d
LOCALSTACK_ENDPOINT=http://localhost:4566 \
AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test AWS_REGION=us-east-1 \
cargo test -p rollout-cloud-aws --features aws --tests -- --include-ignored --test-threads=1
Common errors
- Throttled /
SlowDown/ 503 — surfaces asRecoverable::Throttledwith a backoffRetryHint; the SDK retries internally and the snapshotter retries on top. StreamingContentIdis stable across retries because chunks are hashed before eachUploadPart. - IMDSv1 disabled (
HttpTokens=required) — handled transparently: the SDK performs the IMDSv2PUT /latest/api/tokenhandshake. No raw metadata IP is ever used. - Secret not in allowlist —
Fatal::ConfigInvalid; add the name to[cloud.aws.secrets].allowlist.SecretStore::putis read-only in v1.1. - Orphan multipart uploads — best-effort aborted on drop; the bucket lifecycle rule reclaims any that leak on SIGKILL.
GCP
rollout-cloud-gcp implements the four cloud traits against GCP: Cloud Storage
(ObjectStore), Pub/Sub (Queue), Secret Manager (SecretStore, read-only),
and the GCE metadata server (ComputeHint). It is built behind a Cargo feature
so non-GCP builds pull no GCP SDK crates.
Build
cargo build -p rollout-cli --features gcp
The gcp feature is default-off. A binary built without it that loads a
[cloud] provider = "gcp" config returns a fatal error telling you to rebuild
with --features gcp. The aws and gcp features compose — --features aws,gcp builds a binary that dispatches on the TOML [cloud].provider value.
Config
[cloud]
provider = "gcp"
project = "my-gcp-project"
[cloud.gcp.gcs]
bucket = "my-rollout-bucket"
prefix = "runs/" # optional
resumable_chunk_bytes = 16777216 # 16 MiB; 5 MiB minimum
max_snapshot_part_bytes = 5368709120 # 5 GiB; 10 GiB hard cap
[cloud.gcp.pubsub]
topic = "rollout-work"
subscription = "rollout-workers"
ack_deadline_secs = 30
[cloud.gcp.secrets]
allowlist = ["hf-token"]
Cross-cloud is structurally impossible: the provider tag is an enum, so a
single config cannot name both AWS and GCP.
IAM / Workload Identity Federation
GCS object admin, Pub/Sub publisher + subscriber, and Secret Manager
secretAccessor are required; GCE compute.viewer is needed only when running
on GCE/GKE. The full role matrix and the (informational) lifecycle policy live
in crates/rollout-cloud-gcp/docs/bucket-setup.md.
The cloud-live-gcp CI job authenticates via Workload Identity Federation (WIF)
— no long-lived service-account keys.
Testing matrix
| Tier | What runs | When |
|---|---|---|
| unit | error mapping + key layout + Secret Manager allowlist + MDS mock | every cargo test (no Docker) |
emulator (cloud-emulator-gcp) | GCS + Pub/Sub conformance against fake-gcs-server + pubsub-emulator + in-test Secret Manager mock | every PR — always-on |
live (cloud-live-gcp) | same suite against real GCP via WIF | nightly (cron) |
Locally, bring up the emulators and run the ignored conformance suite:
docker compose -f docker-compose.test.yml up -d
STORAGE_EMULATOR_HOST=http://localhost:4443 \
PUBSUB_EMULATOR_HOST=localhost:8085 PUBSUB_PROJECT_ID=rollout-test \
cargo test -p rollout-cloud-gcp --features gcp --tests -- --include-ignored --test-threads=1
Emulator delta
Some production semantics are not reproducible on the emulators (resumable
upload status, time-based Pub/Sub redelivery, real fault injection). Those tests
are #[ignore]d locally and run in cloud-live-gcp. There is no first-party
Secret Manager emulator; the Secret Manager conformance tests use an in-test
hyper mock and run Docker-free on every build. See the crate
README
for the full delta table.
Common errors
- Throttled /
ResourceExhausted/ 429 / 503 — surfaces asRecoverable::Throttledwith a backoffRetryHint. The streamingContentIdis stable across retries because chunks are hashed before each resumable upload chunk. - Spot / preemptible reclaim —
ComputeHint::preemption_signalreadsinstance/preemptedfrom the GCE metadata server and reports a ~30s lead. - Secret not in allowlist —
Fatal::ConfigInvalid; add the name to[cloud.gcp.secrets].allowlist.SecretStore::putis read-only in v1.1. - Orphan resumable sessions — never persisted across processes (Pitfall #5); GCS auto-expires incomplete sessions after 7 days, and a preempted worker re-uploads from byte 0 idempotently.
Cloud-backed snapshots
Training-state snapshots stream to whichever object store your [cloud] block
selects. rollout-snapshots takes an injected Arc<dyn ObjectStore>, so the
same SnapshotterImpl works unchanged over the local filesystem, S3, or GCS —
only the injected store differs (CLOUD-03).
Configuration
CloudConfig is a #[serde(tag = "provider")] enum, so the provider's fields
live directly under [cloud]. A single TOML cannot name two providers —
cross-cloud single-run is structurally impossible (D-XPROV-02).
See examples/sft-tiny-aws.toml
and examples/sft-tiny-gcp.toml
for the minimal [cloud] flip from examples/sft-tiny.toml:
# AWS
[cloud]
provider = "aws"
region = "us-west-2"
[cloud.s3]
bucket = "rollout-snapshots-prod"
prefix = "sft-tiny/"
# GCP
[cloud]
provider = "gcp"
project = "rollout-prod-123"
[cloud.gcs]
bucket = "rollout-snapshots-prod"
prefix = "sft-tiny/"
Streaming semantics
A snapshot is a deterministic tar of the accelerate-style state directory (weights + optimizer + RNG + step), content-addressed by blake3. The upload path:
- Builds the tar deterministically (stable file order + zeroed mtime) so the same state always produces the same bytes.
- Hashes each chunk with
blake3::Hasherbefore the SDK call, so the resultingContentIdis stable across SDK retries (S3 multipart / GCS resumable — Pitfall #16). - Uploads to a
temp/pending-<ulid>key (S3 multipart upload / GCS resumable session). - Server-side copies the temp object to the sharded content-addressed key
<prefix>cas/<ab>/<cd>/<hex>(identical layout on FS, S3, and GCS), then deletes the temp. - On failure the temp upload is aborted (S3
MultipartGuardDrop) or expires via the bucket's 7-day lifecycle rule (GCS); no orphaned partial blob is ever read.
Restore fetches the blob by ContentId, re-verifies blake3, and extracts the
tar — a mismatch is a hard Fatal error, never a silent partial restore.
Byte-identical resume
The CLOUD-03 acceptance criterion — byte-identical SFT resume holds over the cloud streaming path — is witnessed by two always-on tests:
bit_identical_resume_at_step_5_via_s3(localstack-backedS3ObjectStore),bit_identical_resume_at_step_5_via_gcs(fake-gcs-server-backedGcsObjectStore).
Each snapshots a MockBackend SFT run at step 5, restores off the cloud round-trip,
runs five more steps, and asserts the final weights are byte-equal to a ten-step
uninterrupted run. They run on every CI PR via the cloud-emulator-aws /
cloud-emulator-gcp jobs — no GPU, no live cloud creds.
Cross-provider portability
Snapshots are content-addressed by blake3, so the same bytes produce the same
ContentId on any provider. To migrate a snapshot from S3 to GCS, an operator
copies the blob (rollout does not automate cross-provider transfer in v1.1):
# Operator-managed transfer between buckets:
aws s3 cp s3://aws-bucket/cas/ab/cd/<hex> /tmp/blob
gsutil cp /tmp/blob gs://gcs-bucket/cas/ab/cd/<hex>
The restore code path on either provider takes a SnapshotId and reads by
ContentId; the provider is whichever ObjectStore is injected per
[cloud].provider. The runnable witness is
crates/rollout-snapshots/tests/snapshot_resume_s3_to_gcs_via_manual_copy.rs:
it saves via S3, copies each blob into a GCS bucket asserting the ContentId is
identical across providers, then restores + resumes on GCS byte-for-byte
(D-XPROV-01).
Active-active cross-cloud single run is out of scope in v1.1 (PROJECT.md);
the tagged-enum CloudConfig makes a config naming both [cloud.s3] and
[cloud.gcs] structurally un-representable.
rollout cloud doctor
Operator pre-flight tool that exercises all four cloud traits (object store, queue, secret store, compute hint) against either AWS or GCP before a real training job runs. Addresses CLOUD-04 (D-DOCTOR-01..04).
Usage
# Build with the provider feature(s) you need.
cargo run -p rollout-cli --features aws -- cloud doctor --provider aws --config examples/sft-tiny-aws.toml
cargo run -p rollout-cli --features gcp -- cloud doctor --provider gcp --config examples/sft-tiny-gcp.toml --format json
Config source is the TOML [cloud] block only (D-DOCTOR-04) — there are no
--bucket/--queue/--secret-id flag overrides in v1.1. The --provider
flag MUST match the [cloud].provider in the TOML, or doctor exits 2.
Checks (in order)
- reachability — TCP + TLS handshake to the service endpoint
(
s3.<region>.amazonaws.com/storage.googleapis.com). Surfaces DNS / firewall issues distinctly from auth failures. - auth — credential-chain probe (a cheap metadata read that requires the resolved credentials). Catches broken AWS credential chains / GCP ADC.
- object_store — small payload PUT + GET roundtrip on the configured bucket.
- queue —
enqueue→dequeue_with_lease(30s)→ackon the configured queue. - secret_store — read the FIRST allowlisted secret (
[cloud.*.secrets].allowlist). An empty allowlist is reported as a failure with remediation guidance. - compute_hint —
inventory()+preemption_signal()probe (returnsOk(None)off a cloud instance). - content_id_roundtrip — a 64 MiB random buffer through
put_stream+get_stream+ blake3 verify. Forces the multipart / resumable path; catches blake3-streaming bugs (Pitfall 16 / D-SNAP-04).
Wall-time target: ~5-10s on a healthy environment.
Exit codes (D-DOCTOR-03)
0— all checks pass.1— at least one check failed (use--format jsonto see which).2— invocation / config error (provider mismatch, missing TOML, malformed schema).
Plays well with shell &&:
rollout cloud doctor --provider aws --config production.toml && \
rollout train sft --config production.toml
Output formats (D-DOCTOR-02)
-
--format human(default): colored steps with✓/✗icons + per-check latency + aN pass, M fail — total <ms>summary line. -
--format json: machine-readable; matches the schema incrates/rollout-cli/src/commands/cloud/doctor/output/json.rs:{ "checks": [ { "name": "reachability", "status": "pass", "latency_ms": 142 }, { "name": "queue", "status": "fail", "latency_ms": 31, "error": "enqueue: ..." } ], "summary": { "pass_count": 6, "fail_count": 1, "total_latency_ms": 5443 } }erroris omitted on passing checks.
Limitations (v1.1)
- Config-file-only (D-DOCTOR-04); no
--bucket/--queue/--secret-idoverrides. - One comprehensive mode (D-DOCTOR-01); no
--quick/--deeptiers. - Cross-cloud (both
[cloud.aws]and[cloud.gcp]in one TOML) is structurally impossible —CloudConfigis a#[serde(tag = "provider")]enum (D-XPROV-02).
CI coverage
The doctor_smoke integration tests run on every PR:
cloud-emulator-awsrunsdoctor_smoke_aws_*against localstack with pre-created bucket / queue / secret (exit 0 all-pass, exit 1 unreachable, human + JSON shape).cloud-emulator-gcprunsdoctor_smoke_gcp_*against fake-gcs-server + pubsub-emulator with pre-created bucket / topic / subscription.- The config-layer tests (provider mismatch → exit 2, malformed config → exit 2)
and the
--helpgolden run Docker-free on every PR.
Distribution
How rollout runs a single run across many machines: the single-coordinator lease, work-stealing, stateless-replayer restart, spot-drain, and split-brain fencing.
Multi-node distribution
rollout runs a single training/inference run across many machines from one
coordinator. This chapter is the operator's view of how that works: the
lease that keeps exactly one coordinator alive, the work-stealing that keeps
workers busy, how a coordinator restart is invisible to progress, and how a spot
preemption drains gracefully. The implementation lives in rollout-coordinator;
the contract is spec 05 — Distribution.
Coordinator + worker model
A run has exactly one coordinator at any time and N workers. The coordinator
owns the work ledger and brokers all coordination; workers never talk
peer-to-peer (one mutual-TLS edge per worker instead of N²). Workers long-poll
the coordinator for work, run it on their backend (vLLM in production, a mock
backend in the smoke), and submit content-addressed results. Submission is
idempotent on the content-addressed WorkItem.id, so a retried item never
produces duplicate output.
Single-row CAS lease + monotonic epoch
"Exactly one coordinator per run" is enforced by a single-row compare-and-swap
lease (StorageLease) over the coordinator_lease storage row. One impl rides
the dual-backed StorageTxn::cas_bytes, so the SAME lease semantics hold on both
the embedded redb backend and Postgres (D-LEASE-01) — the Postgres path is proven
in the postgres-integration CI lane (postgres_lease.rs).
- Acquire — a fresh coordinator CASes the empty row to
epoch = 0. - Steal-on-expiry — if the lease TTL has passed, a new coordinator CASes the
exact prior (expired) bytes to
epoch + 1. The epoch is monotonic: it only ever advances. - Renew — the incumbent heartbeats by CASing its own bytes forward, keeping
the epoch constant. A renew that finds the epoch has advanced under it returns
false— the incumbent has been fenced.
The lease TTL equals coordinator_failure_timeout; the renew cadence is strictly
shorter than the TTL. Every winning claim stamps the authoritative epoch row in
the same transaction, so the lease epoch and the ledger epoch never diverge.
Work-stealing
When the global queue is empty but a peer is still busy, an idle worker steals
work — coordinator-mediated, never peer-to-peer
(rollout-coordinator::{ledger, steal}):
- Trigger — a worker steals only when its local backlog drains to empty
(
backlog(thief) == 0); a non-idle thief's request is a no-op. - Victim — the coordinator picks the busiest peer by
Runningbacklog. - Batch — it reassigns
ceil(victim_backlog / 2)items, capped atMAX_STEAL_BATCH = 32. - Reassign — each item is moved via a two-step CAS over the same prior
Running(victim)bytes:try_repending(Running → Pending) thentry_claim(Pending → Running(thief)).
Stolen-then-reclaimed items never double-execute. If the victim's ack races
the steal, both CAS the same expected bytes and exactly one wins — the loser sees
stale bytes and is dropped. This single-winner property is witnessed every commit,
Docker-free, by concurrent_ack_and_steal_no_double_execute.
Coordinator restart (stateless replayer)
The coordinator holds no run state in memory that it cannot rebuild from
storage. On boot it: wins the lease, adopts the advanced epoch, then
scan_byteses the work ledger and reconstructs its in-flight assignment map:
Running{worker}rows are reconstructed in-flight — NOT requeued, because the worker may still hold the item (only the failure-scan stale path re-pends aftercoordinator_failure_timeout);Pendingrows go back onto the dispatch queue;Done/Failedrows are terminal and skipped — replayed acks are idempotent.
So a coordinator restart is invisible to progress: a fresh coordinator boots
over the same storage and the run completes with zero duplicate sample IDs. This
is witnessed every commit by coord_restart_no_duplicates (SC2).
Spot-drain (graceful preemption)
When a worker's cloud reports a spot preemption, the drain state machine
(rollout-coordinator::drain) runs within a conservative deadline: set a
stop-pull flag → nack each in-flight item (back to Pending for another worker)
→ opportunistically snapshot TrainState only if the remaining budget covers
the cost → deregister → exit cleanly.
Two numbers, kept distinct (D-SPOT-01/04):
| Provider | Notice lead (cloud gives) | Drain deadline (we target) |
|---|---|---|
| AWS | 120s | 60s |
| GCP | 30s | 15s |
The notice lead is what the cloud promises; the drain deadline is the budget the
state machine targets, leaving margin. The preemption signal is consumed only
through the ComputeHint trait — the coordinator never imports a cloud SDK
(coord ↛ cloud; the dependency-direction lint enforces it). Witnessed every commit
by spot_drain_completes_within_lead_time (SC3), for both AWS and GCP.
Split-brain fencing
If a coordinator is deposed (its lease was stolen and the epoch advanced), it
self-fences: it emits exactly one coordinator_fenced observability event
(never a shared-store write) and then std::process::abort()s within 5s. Workers
reject any RPC response carrying a stale coord_epoch (EpochGuard), so a deposed
coordinator's late replies can never corrupt the run. The abort edge is the hidden
rollout-coordinator test-fence subcommand, driven by the SC4 subprocess witness
fence_aborts_within_5s.
Operator recipe: the 3-node smoke
make smoke-3node-aws / make smoke-3node-gcp boot 1 coordinator + 3 workers
(mock backend, no GPU, no Docker) over an auto-generated dev CA, drive the work
ledger through dispatch + a real steal, and assert the run reports done within
30s with a steal observed:
make smoke-3node-aws # local-transport wiring run (free, Docker-free)
make smoke-3node-gcp
By default the smoke runs over the local mTLS transport so it is reproducible on a free runner. To run the same topology over the real cloud transport (operator path; needs real AWS/GCP credentials and ~4 hosts):
ROLLOUT_SMOKE_CLOUD=1 make smoke-3node-aws
The live-cloud run is the operator-optional path; the every-commit named witnesses
(coord_restart_no_duplicates, concurrent_ack_and_steal_no_double_execute,
split_brain_old_coord_self_fences, spot_drain_completes_within_lead_time) are
the load-bearing gate and run Docker-free on every commit:
cargo test -p rollout-coordinator
Examples
This page is reserved for the v1 working-model recipe (SHIP-03 hardened).
v1 cannot ship without at least one end-to-end recipe (make example or
cargo run --example) that takes a real small open-weights model, runs SFT or
PPO, completes on commodity hardware, is exercised by nightly CI, and is
documented here. See AGENTS.md §9.4.
The recipe lands progressively: Phase 4 (SFT stub) → Phase 9 (real recipe) → Phase 12 (polished docs).