Resume: zero-duplicate batch restart
Phase 3's rollout infer batch is resumable. If the worker is killed mid-batch, restarting with --resume <run_id> (or with the same [output] dir = and no explicit flag) continues from the last persisted sample with zero duplicates and no skipped inputs. This is one of the three ROADMAP Phase-3 exit criteria.
The lifecycle
Every rollout infer batch invocation has exactly one run_id — a 26-character ULID. The CLI resolves it via a three-tier precedence:
--resume <id>explicit override. Use this when you know the run you want to reattach.<output.dir>/run-idfile. If--resumeis absent but the output directory already has arun-idfile (from a prior run), the CLI reads it and continues that run. This is the common case for "kill + restart with the same config."- Mint a fresh ULID. No override, no prior file → generate a new
RunId, write it atomically (tempfile + rename) to<output.dir>/run-id, and proceed as a brand-new run.
run-id files are single-line UTF-8 ULIDs. They are safe to delete (forces a fresh run on next invocation) but should not be edited by hand.
How resumability is implemented
Three subsystems collaborate (each documented in its own chapter):
rollout-storage(the redb-backedEmbeddedStoragefrom Phase 2) holds the per-sample state KV under namespaceinfer. Each sample'sSampleRecordcarries one of fourSampleStatevariants —Pending,Running,Done { completion_blob }, orFailed { reason }— and transitions atomically viaStorage::cas_bytes.rollout-cloud-local::InMemQueue(also Phase 2) holds the work queue with Storage spill: every enqueue/ack mirrors to redb undercloudlocal_queue, so a coordinator restart replays whatever was unacknowledged.rollout-cloud-local::FsObjectStorestores the actual completion blobs at content-addressed paths under<output.dir>/object-store/<sha256[0..2]>/<sha256[2..4]>/<sha256>. TheSampleRecord::state = Done { completion_blob: ContentId }variant carries the blob'sContentId; the actual completion text is read from the object store at output-writing time.
At plan time, BatchCoordinator::scan_and_enqueue iterates the existing samples namespace and:
- skips records with
state = Done(already complete; their completion blobs survive in the object store); - re-enqueues records with
state = Pendingorstate = Failed; - treats records with
state = Runningand astarted_atolder thanstale_after_ms(default 60 000) as stale — CASRunning → Pendingand re-enqueue (per RESEARCH Pitfall 5: covers the kill-mid-flight case where a worker crashed without finishing).
Only the new and re-enqueued samples flow through the queue. Workers process them via the same BatchWorker::run_loop as a fresh run.
The deterministic test
crates/rollout-cli/tests/restart_no_duplicates.rs is the load-bearing proof. It runs on every CI build — no GPU, no vLLM, no real model — because it uses the MockBackend shipped by rollout-runtime-batch behind the test-mock-backend Cargo feature.
The test follows RESEARCH §"Restart-resume test design" verbatim:
- Spawn
rollout infer batch --config <tmp>as a subprocess viatokio::process::Command::new(env!("CARGO_BIN_EXE_rollout"))withROLLOUT_TEST_MOCK_BACKEND=1. - Stream stdout, count
sample_completedevents. - After 3 completions,
child.start_kill()(SIGKILL). - Read
<output.dir>/run-id. - Spawn a second subprocess with
--resume <run_id>and the same env. - Wait for exit 0.
- Assert the final
completions.jsonlhas exactly N=8 lines, all uniquesample_ids, and every input prompt is represented once.
Run locally:
cargo test -p rollout-cli --features test-mock-backend --test restart_no_duplicates
Total wall-clock: ~1.5 s on a modern dev box. The MockBackend completes each sample in 50 ms; the test deliberately spawns subprocesses to exercise the real CLI resume code path including --resume parsing, run-id file I/O, and BatchCoordinator::scan_and_enqueue.
Content-addressed sample IDs
A sample's identity is derived deterministically from (model, prompt, sampling_params, input_index) — see crates/rollout-runtime-batch/src/state.rs::sample_id. Per CONTEXT D-RESUME-01 (extended by RESEARCH Pitfall 1):
#![allow(unused)] fn main() { fn sample_id(model: &ContentId, prompt: &str, params: &SamplingParams, idx: u64) -> ContentId { let mut h = blake3::Hasher::new(); h.update(&[SAMPLING_PARAMS_SCHEMA_VERSION]); // = 1 in Phase 3 h.update(model.as_bytes()); h.update(prompt.as_bytes()); h.update(&postcard::to_stdvec(params).unwrap()); h.update(&idx.to_le_bytes()); ContentId::from(h.finalize()) } }
The leading SAMPLING_PARAMS_SCHEMA_VERSION byte is the maintenance hook: bumping the constant invalidates every outstanding sample-id, which is the desired behavior whenever the SamplingParams struct gains, loses, or reorders fields. Without it, a future serde-evolution of SamplingParams could silently produce different postcard bytes for the same logical configuration and break resume across versions.
What you cannot resume
- Cross-model resume. A run whose
<output.dir>/run-idwas minted under model A cannot resume against model B — the per-samplemodel_idis part of the content-addressed sample ID; mixing models triggersFatal(ConfigInvalid)at scan time. - Cross-machine resume (Phase 3). The
EmbeddedStorageredb file is local to the host. Phase 5 (CLOUD-03) introduces object-store-backed snapshot storage that will allow cross-machine resume; until then, resume is single-host. - Streaming runs. Phase 3 rejects
sampling.stream = trueat plan time (D-BACKEND-03); streaming is Phase 8 (INFER-01).
See also
cli.md— full CLI flag reference for--resume.cpu-mode.md— where the MockBackend-driven test runs vs. live vLLM.batch-runtime.md— theBatchCoordinatorandBatchWorkerinternals.vllm-backend.md— Pitfall-2 GIL bridge details that the live engine relies on.