Determinism

Phase 4 of rollout commits to deterministic training-state snapshots (D-DETERM-01): a Snapshot.parts[0].content is the same ContentId whenever the same code drives the same model with the same RNG seed and the same input stream. This page documents the contract, the platform caveats, and every mitigation that lands in rollout-backend-vllm --features train.

The determinism stack

LayerMitigationWhere
Process envCUBLAS_WORKSPACE_CONFIG=:4096:8, PYTHONHASHSEED=0train.py preamble
RNG seedsrandom + numpy + torch + torch.cuda.manual_seed_all_set_determinism_flags
PyTorch deterministictorch.use_deterministic_algorithms(True)_set_determinism_flags
cuDNNcudnn.deterministic = True, cudnn.benchmark = False_set_determinism_flags
Matmul precisiontorch.set_float32_matmul_precision("highest")_set_determinism_flags
Chat template{% generation %} markers via GENERATION_MARKED_QWEN25_TEMPLATEqwen25_chat_template.py
Dataloadertorchdata.StatefulDataLoader if available; step-replay fallbackinit_train
Acceleratoraccelerator.prepare(model, optimizer, scheduler)init_train
LR schedulerregister_for_checkpointing (Pitfall 10 fallback)init_train
Tarbyte-stable mode bits + sorted walkdir + mtime=0 (Pitfall 9)rollout-snapshots::build_deterministic_tar

Why preamble ordering matters (Pitfall 2)

CUBLAS_WORKSPACE_CONFIG and PYTHONHASHSEED are read by torch on first import. If import torch runs before they are set, the values are baked in and later os.environ mutations do not change behavior. The Rust side enforces ordering by writing the env vars on the worker thread before calling py.import("rollout.backends.vllm.train") — see rollout-backend-vllm/src/train.rs::import_train_module.

train.py repeats os.environ.setdefault(...) at the top of the module defensively: if someone imports it directly from a Python REPL with torch already loaded, the user gets non-determinism, but the source still documents the contract.

cuDNN.benchmark is the silent killer (Pitfall 8)

cuDNN auto-tunes kernel selection by running candidate kernels and picking the fastest. This is non-deterministic — re-running the same workload picks a different kernel under load. cudnn.benchmark = False is required, even when cudnn.deterministic = True. The Phase-4 implementation sets both explicitly in _set_determinism_flags.

LR scheduler must be in the save-state capture path (Pitfall 10)

accelerator.save_state captures the model, optimizer, RNG, and any object explicitly registered for checkpointing. The Phase-4 path prefers passing the scheduler through accelerator.prepare(model, optimizer, scheduler); if that fails (e.g., the scheduler shape isn't supported by accelerate's prepare), the code falls back to accelerator.register_for_checkpointing(scheduler). Without one of these, the LR resumes from lr_start after a snapshot restore instead of continuing the schedule — silent non-determinism.

CPU vs CUDA contract

HardwareBit-identical?
Same CPU model, same OS, same envYes (the load-bearing CI target)
Same SM (sm_80 vs sm_80), same cuDNNYes, with deterministic + !benchmark
Different SM (sm_80 vs sm_90)No — kernels differ
Different cuDNN minor versionNo — algorithm shortlist may differ
Cross-machine (CPU A vs CPU B)Best-effort; FP rounding differs

The snapshot_resume_live.rs live witness exercises the same-CPU case on the dev box. CI runs the MockBackend variant from plan 04-02 (snapshot_resume.rs::bit_identical_resume_at_step_5) for the unconditional green signal.

accelerate.save_state captures

ObjectSourceRestored on load_state?
Model weightsaccelerator.prepare(model)Yes
Optimizeraccelerator.prepare(optimizer)Yes
RNG (torch)implicit via torch.random.get_rng_stateYes
RNG (cuda)implicit if CUDA is initializedYes
LR schedulerprepare(scheduler) OR register_for_checkpointingYes
DataloaderDataLoaderConfiguration(use_stateful_dataloader=True)If torchdata installed

The Phase-4 contract: anything that affects subsequent step output must be in this list. If a future plan adds custom state (e.g., reward-model normalizer running stats), it MUST register via register_for_checkpointing to keep the TRAIN-03 byte-compare proof intact.

torchdata stateful dataloader (Pitfall 3)

init_train probes for torchdata. If present, it constructs a DataLoaderConfiguration(use_stateful_dataloader=True) so the dataloader's position is checkpointed by save_state. Without torchdata, the runtime falls back to step-replay: the resume side re-reads from the JSONL head and skips step rows. Both modes preserve the TRAIN-03 byte-compare invariant on deterministic input streams.

Pitfall 7 — Accelerator singleton

Accelerator() is a singleton per Python process; constructing two of them in the same interpreter raises. init_train is idempotent — the second call returns the cached _STATE dict. teardown_train flushes the singleton via del + gc.collect() + torch.cuda.empty_cache() so the next init_train call (in a subsequent run, or after a mid-process swap-back to vLLM inference) can construct a fresh accelerator.

A bidirectional mid-process swap (training ↔ inference under the same OS thread) is Phase 9 — see the deferral note in rollout-backend-vllm/src/train.rs::run_set_train_mode.

MockBackend vs live HF path

BackendCPU bit-identical?When to use
MockBackendYes (Array1 SGD)CI; every PR; algo-side tests
Live HFYes on identical CPU; same-SM only on CUDADev-box live witness; nightly

The MockBackend path is the load-bearing CI proof for TRAIN-03. The live HF path is the gated witness behind ROLLOUT_TRANSFORMERS_AVAILABLE=1.

Where the code lives

  • python/rollout/backends/vllm/train.py — determinism preamble + Accelerator construction.
  • python/rollout/backends/vllm/qwen25_chat_template.py — generation-marked chat template.
  • crates/rollout-backend-vllm/src/train.rs — env-write-before-import enforcer; py.detach wrappers.
  • crates/rollout-snapshots/src/tar_build.rs — Pitfall 9 deterministic tar.
  • Snapshots — Phase-4 snapshot pipeline.
  • SFT — SFT algorithm + TRAIN-03 byte-compare proof.
  • CPU mode — running Phase-4 training on CPU only.