CPU mode
vLLM ships with first-class CUDA support, but for AGENTS.md §7 ("every plugin testable locally without GPU") rollout must also work on a plain CPU. This chapter documents the CPU-mode contract for rollout-backend-vllm, the dev-loop reality of macOS Apple-Silicon, and the smoke-test posture that lets default CI stay green without any GPU.
Where CPU mode is selected
Per Phase 3 CONTEXT decision D-VLLM-04 (as overridden by RESEARCH §"Pitfall 9"), the Python-side glue in python/rollout/backends/vllm/engine.py performs an explicit torch.cuda.is_available() probe and passes device="cuda" or device="cpu" to AsyncEngineArgs — never device="auto". The auto-detect path was rejected because vLLM silently falls back to CPU when CUDA libraries are partially installed (driver present, runtime missing, etc.), and a silent fallback would mask configuration mistakes at runtime instead of failing at plan time.
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
engine_args = AsyncEngineArgs(
model=model_uri,
device=device, # explicit; not "auto"
disable_log_stats=True,
disable_log_requests=True,
)
rollout-cloud-local::ComputeHint::inventory() still informs observability events (gpu_inventory_collected) and worker-config decisions, but it is no longer the source-of-truth for the engine device kwarg.
Expected CPU throughput
vLLM's CPU backend is functional but not fast. Approximate single-stream throughput for the canonical Phase-3 test model (Qwen/Qwen2.5-0.5B-Instruct, 16 max-tokens):
| Host | tokens/sec |
|---|---|
| Apple M1 Pro (8-core perf) | ~3–6 |
| Linux x86_64 (16-core) | ~2–4 |
| Generic CI runner (4-core) | ~1–2 |
A 4-prompt × 16-token smoke run finishes in well under 60 s on any of these. Anything longer than the canonical examples/batch-tiny.toml shape should target a CUDA host.
macOS Apple-Silicon
vLLM has no Apple-Silicon wheel as of Phase 3. pip install vllm on macOS produces ERROR: No matching distribution found for vllm. The two paths forward:
- Build-from-source (slow, brittle).
VLLM_TARGET_DEVICE=cpu pip install -e .against a freshly cloned vLLM repo. Compilation takes 10–30 min and depends on Apple-clang versions in ways that drift between vLLM releases. Documented but not recommended for routine dev work. - Docker (recommended). See
dev-on-macos.md. Alinux/amd64(orlinux/arm64if available) container withvllm>=0.10pre-installed lets the rollout binaries run identically to CI.
Either way, the Rust-side test surface (~80 % of Phase 3's automated tests — SamplingParams postcard determinism, sample_id derivation, CAS state-machine transitions, JSONL round-trip, the MockBackend-driven restart_no_duplicates test) runs natively on macOS without any vLLM installed. Only the live-engine integration tests (vllm_init.rs, vllm_generate.rs) and the make infer-smoke script require a real vLLM.
CI posture
- Default CI (public runners):
infer-smokenow runs on every PR and merge — noROLLOUT_VLLM_AVAILABLEgate. It installs thevllm-cpuPyPI wheel (~101 MB unified CPU wheel, AVX2 fallback) instead of the ~10 GB CUDA wheel; thetorch.cuda.is_available()probe inengine.pyselectsdevice="cpu"automatically. The job downloadsQwen2.5-0.5B-Instruct(cached under~/.cache/huggingface), runsrollout infer batch --config examples/batch-tiny.toml, and asserts 4 non-empty completion rows — all on the free 4-vCPUubuntu-latestrunner in well under 60 s of inference.train-smokeis likewise always-on, installing CPU torch + transformers + accelerate and running theexamples/sft-tiny.tomlSFT (max_steps = 2) on CPU. pip + HuggingFace caches keep both jobs fast on repeat runs. - MockBackend proofs unchanged: the load-bearing
restart_no_duplicates(BACKEND-02 exit (b)) and bit-identical-resume (TRAIN-03) proofs still run in the standardtestjob via MockBackend — no GPU/vLLM/transformers required there. - Local dev:
make infer-smokeafterpip install 'vllm-cpu>=0.17'. On Apple-Silicon, prefer the Docker path documented indev-on-macos.md.
Failure modes
| Failure | Surface | Diagnosis |
|---|---|---|
import torch fails | Python ImportError at engine init | Active venv missing torch — pip install torch first |
import vllm fails | Python ImportError at engine init | Active venv missing vllm — for the CPU/CI path pip install 'vllm-cpu>=0.17'; on a CUDA host install the matching CUDA wheel |
torch.cuda.is_available() == False on a GPU host | engine boots in CPU mode silently | NVIDIA driver/runtime mismatch — install matching CUDA runtime; the explicit probe surfaces this rather than masking it |
vllm import succeeds but AsyncLLMEngine.from_engine_args panics with device="cpu" not supported | vLLM version too old | upgrade to vllm>=0.10 |
make infer-smoke times out (>300 s) on a CPU host | model larger than Qwen2.5-0.5B-Instruct | use the canonical examples/batch-tiny.toml model; do not run multi-billion-param models on CPU |