CPU mode
The Phase-4 training surface runs on CPU end-to-end. This is the integration test path on dev boxes (including Apple Silicon) and the smoke recipe target in plan 04-07.
When to use
- Local dev loop on a laptop without CUDA.
- CI smoke that exercises the full HF transformers + accelerate path against
  a tiny model (`Qwen/Qwen2.5-0.5B-Instruct`).
- Reproducing CUDA bugs that turn out to be deterministic-flag misconfiguration.
Expected throughput
| Model | Hardware | Steps/sec |
|---|---|---|
| Qwen/Qwen2.5-0.5B-Instruct | Apple M2 Max (CPU) | ~0.1–0.3 |
| Qwen/Qwen2.5-0.5B-Instruct | Linux x86_64 16-core | ~0.3–1.0 |
Roughly one to ten seconds per step for the 0.5B model. Anything larger is impractical on CPU; the per-token cost grows superlinearly. CPU mode exists to prove the pipeline, not to train.
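To turn the throughput figures into a wall-clock budget, divide the step count by steps per second. The snippet below is a minimal sketch of that arithmetic; the function name `smoke_duration_seconds` is illustrative and not part of the codebase.

```python
def smoke_duration_seconds(steps: int, steps_per_sec: float) -> float:
    """Estimate wall-clock seconds for `steps` training steps at a given throughput."""
    if steps_per_sec <= 0:
        raise ValueError("steps_per_sec must be positive")
    return steps / steps_per_sec


# Bounds taken from the table for a 10-step smoke run.
m2_slowest = smoke_duration_seconds(10, 0.1)   # slowest Apple M2 Max figure
x86_fastest = smoke_duration_seconds(10, 1.0)  # fastest 16-core x86_64 figure
print(f"10 steps: between {x86_fastest:.0f}s and {m2_slowest:.0f}s")
```

Using the slow end of the range when sizing timeouts leaves headroom for a loaded dev box.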
Required env
None beyond the defaults. The Phase-4 determinism preamble
(`CUBLAS_WORKSPACE_CONFIG`, `PYTHONHASHSEED`) is written by the Rust side
before `import torch`; CPU runs ignore CUBLAS settings without complaint.
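The ordering matters because `PYTHONHASHSEED` is read once at interpreter startup, so it has to be in the environment before the Python process exists. The sketch below illustrates the idea from Python rather than Rust, purely as a stand-in for the real launcher. The helper name `determinism_env` and the values `:4096:8` and `0` are assumptions for illustration, not necessarily what the Rust side writes.

```python
import os
import subprocess
import sys


def determinism_env(seed: int = 0) -> dict:
    """Return a copy of the current environment with the preamble variables set.

    The concrete values here are assumptions chosen for illustration.
    """
    env = dict(os.environ)
    env["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # has no effect on CPU-only runs
    env["PYTHONHASHSEED"] = str(seed)           # read once, at interpreter start
    return env


# Launch a child interpreter and confirm it sees both variables.
child = subprocess.run(
    [
        sys.executable,
        "-c",
        "import os; print(os.environ['PYTHONHASHSEED'], "
        "os.environ['CUBLAS_WORKSPACE_CONFIG'])",
    ],
    env=determinism_env(seed=0),
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

Setting the variables on the child's environment, rather than mutating `os.environ` inside an already-running interpreter, is what makes the hash seed take effect.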
The live tests gate on `ROLLOUT_TRANSFORMERS_AVAILABLE=1`:
pip install 'transformers>=4.45' 'accelerate>=0.34' 'torch>=2.4'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 \
cargo test -p rollout-backend-vllm --features train \
--test snapshot_resume_live -- --ignored --nocapture
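Before running the live test, it can help to check that the gate variable is set and that the three packages are visible to the interpreter. The following is a hypothetical preflight helper, not something shipped with the project; `live_tests_ready` is an illustrative name, and the check only confirms the packages can be located, not that their versions meet the minimums above.

```python
import importlib.util
import os


def live_tests_ready(env=None, modules=("transformers", "accelerate", "torch")):
    """Return (ready, reasons).

    `ready` is True only when ROLLOUT_TRANSFORMERS_AVAILABLE is "1" and every
    name in `modules` can be located by the import system.
    """
    env = os.environ if env is None else env
    reasons = []
    if env.get("ROLLOUT_TRANSFORMERS_AVAILABLE") != "1":
        reasons.append("ROLLOUT_TRANSFORMERS_AVAILABLE is not set to 1")
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            reasons.append(f"{name} is not importable")
    return (not reasons, reasons)


ready, reasons = live_tests_ready()
print("ready" if ready else "not ready: " + "; ".join(reasons))
```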
Performance caveats
- No streaming. Phase 4 rejects `sampling.stream = true` at the boundary
  (D-BACKEND-03); training has no streaming surface.
- No multi-GPU. CPU mode is single-process. The FSDP plugin in `init_train`
  only activates when `torch.cuda.device_count() >= 2`.
- Slow. The 0.5B model at one step per ~5 seconds on M-series silicon means a
  10-step smoke takes a minute. Plan accordingly.
- Determinism still holds. Two CPU runs with the same seed produce
  byte-identical `accelerate.save_state` output. The `MockBackend` variant in
  `rollout-algo-sft::tests::snapshot_resume::bit_identical_resume_at_step_5`
  proves the Phase-4 contract holds on CPU without HF transformers installed.
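One way to check the byte-identical claim by hand is to hash every file in two snapshot directories and compare the digests. The sketch below assumes each run's saved state sits in its own directory; `tree_digest` is an illustrative helper, and `model.bin` is a placeholder file name rather than what `accelerate.save_state` actually writes.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> str:
    """SHA-256 over the relative path, size, and bytes of every file under `root`."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(len(data).to_bytes(8, "big"))  # length prefix keeps entries unambiguous
        h.update(data)
    return h.hexdigest()


# Demonstration with two throwaway directories standing in for two runs.
with tempfile.TemporaryDirectory() as run_a, tempfile.TemporaryDirectory() as run_b:
    for run in (run_a, run_b):
        (Path(run) / "model.bin").write_bytes(b"\x00\x01\x02")
    same = tree_digest(Path(run_a)) == tree_digest(Path(run_b))

    # Flip one byte in the second run; the digests should now differ.
    (Path(run_b) / "model.bin").write_bytes(b"\x00\x01\x03")
    differs = tree_digest(Path(run_a)) != tree_digest(Path(run_b))

print(same, differs)
```

Hashing relative paths along with contents means a renamed or missing file also shows up as a mismatch, not only changed bytes.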
Smoke recipe (plan 04-07)
`make train-smoke` (lands in plan 04-07) runs the live witness on dev boxes
where `ROLLOUT_TRANSFORMERS_AVAILABLE=1` is set. CI does not install
transformers/accelerate; the `MockBackend` test is the unconditional gate.
Related
- Determinism — the determinism contract Phase-4 commits to.
- SFT — algorithm-side overview.
- Snapshots — snapshot pipeline.