CPU mode

vLLM ships with first-class CUDA support, but AGENTS.md §7 ("every plugin testable locally without GPU") requires that rollout also work on a plain CPU. This chapter documents the CPU-mode contract for rollout-backend-vllm, the dev-loop reality on macOS Apple-Silicon, and the smoke-test posture that lets default CI stay green without any GPU.

Where CPU mode is selected

Per Phase 3 CONTEXT decision D-VLLM-04 (as overridden by RESEARCH §"Pitfall 9"), the Python-side glue in python/rollout/backends/vllm/engine.py performs an explicit torch.cuda.is_available() probe and passes device="cuda" or device="cpu" to AsyncEngineArgs — never device="auto". The auto-detect path was rejected because vLLM silently falls back to CPU when CUDA libraries are partially installed (driver present, runtime missing, etc.), and a silent fallback would mask configuration mistakes at runtime instead of failing at plan time.

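import torch
from vllm import AsyncEngineArgs

# Probe CUDA explicitly rather than relying on vLLM's auto-detection.
device = "cuda" if torch.cuda.is_available() else "cpu"
engine_args = AsyncEngineArgs(
    model=model_uri,         # model_uri: defined outside this excerpt
    device=device,           # explicit; not "auto"
    disable_log_stats=True,
    disable_log_requests=True,
)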

rollout-cloud-local::ComputeHint::inventory() still informs observability events (gpu_inventory_collected) and worker-config decisions, but it is no longer the source-of-truth for the engine device kwarg.
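
The probe-then-pass pattern described above can be isolated into a small helper so that the selection logic is unit-testable on hosts without torch. This is a minimal sketch, not the code in engine.py: the helper name select_device and the injectable probe argument are assumptions made for illustration.

```python
from typing import Callable, Optional


def _torch_cuda_probe() -> bool:
    """Return torch.cuda.is_available(), importing torch lazily."""
    import torch  # deferred so this module imports on hosts without torch

    return torch.cuda.is_available()


def select_device(probe: Optional[Callable[[], bool]] = None) -> str:
    """Return "cuda" or "cpu" from an explicit probe; never "auto".

    select_device and the injectable probe are hypothetical names used
    for illustration; they are not taken from the real engine.py.
    """
    probe = probe or _torch_cuda_probe
    return "cuda" if probe() else "cpu"


# Stub probes stand in for torch so the example runs anywhere.
print(select_device(lambda: True))   # cuda
print(select_device(lambda: False))  # cpu
```

Injecting the probe keeps the decision a pure function of one boolean, which is what makes it testable without a GPU or a torch install.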

Expected CPU throughput

vLLM's CPU backend is functional but not fast. Approximate single-stream throughput for the canonical Phase-3 test model (Qwen/Qwen2.5-0.5B-Instruct, 16 max-tokens):

| Host | tokens/sec |
| --- | --- |
| Apple M1 Pro (8-core perf) | ~3–6 |
| Linux x86_64 (16-core) | ~2–4 |
| Generic CI runner (4-core) | ~1–2 |

A 4-prompt × 16-token smoke run finishes in well under 60 s on any of these. Anything longer than the canonical examples/batch-tiny.toml shape should target a CUDA host.
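
As a rough sanity check on the 60 s figure, decode time for the smoke shape can be estimated from the table. This sketch assumes the table's rates are per-stream decode rates and ignores model load and prefill time. The function name estimate_decode_seconds is hypothetical, and whether the four prompts actually decode concurrently depends on how the batch is submitted.

```python
def estimate_decode_seconds(prompts: int, max_tokens: int,
                            tokens_per_sec: float, batched: bool) -> float:
    """Rough decode-time estimate; ignores model load and prefill.

    Assumption: when batched, every prompt decodes concurrently at the
    per-stream rate, so only max_tokens decode steps are needed.
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    steps = max_tokens if batched else prompts * max_tokens
    return steps / tokens_per_sec


# Slowest row of the table: generic 4-core CI runner at ~1 token/sec.
print(estimate_decode_seconds(4, 16, 1.0, batched=True))   # 16.0
print(estimate_decode_seconds(4, 16, 1.0, batched=False))  # 64.0
```

Under these assumptions the smoke shape sits well inside 60 s when the four prompts decode as one batch, while a strictly sequential run at the low end of the CI range would take about a minute.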

macOS Apple-Silicon

vLLM has no Apple-Silicon wheel as of Phase 3. pip install vllm on macOS produces ERROR: No matching distribution found for vllm. The two paths forward:

  1. Build-from-source (slow, brittle). VLLM_TARGET_DEVICE=cpu pip install -e . against a freshly cloned vLLM repo. Compilation takes 10–30 min and depends on Apple-clang versions in ways that drift between vLLM releases. Documented but not recommended for routine dev work.
  2. Docker (recommended). See dev-on-macos.md. A linux/amd64 (or linux/arm64 if available) container with vllm>=0.10 pre-installed lets the rollout binaries run identically to CI.

Either way, the Rust-side test surface (~80 % of Phase 3's automated tests — SamplingParams postcard determinism, sample_id derivation, CAS state-machine transitions, JSONL round-trip, the MockBackend-driven restart_no_duplicates test) runs natively on macOS without any vLLM installed. Only the live-engine integration tests (vllm_init.rs, vllm_generate.rs) and the make infer-smoke script require a real vLLM.

CI posture

  • Default CI (public runners): infer-smoke runs on every PR and merge, with no ROLLOUT_VLLM_AVAILABLE gate. It installs the vllm-cpu PyPI wheel (~101 MB unified CPU wheel, AVX2 fallback) instead of the ~10 GB CUDA wheel; the torch.cuda.is_available() probe in engine.py then selects device="cpu" automatically. The job downloads Qwen2.5-0.5B-Instruct (cached under ~/.cache/huggingface), runs rollout infer batch --config examples/batch-tiny.toml, and asserts 4 non-empty completion rows, all on the free 4-vCPU ubuntu-latest runner in well under 60 s of inference. train-smoke is likewise always-on: it installs CPU torch + transformers + accelerate and runs the examples/sft-tiny.toml SFT (max_steps = 2) on CPU. pip and HuggingFace caches keep both jobs fast on repeat runs.
  • MockBackend proofs unchanged: the load-bearing restart_no_duplicates (BACKEND-02 exit (b)) and bit-identical-resume (TRAIN-03) proofs still run in the standard test job via MockBackend — no GPU/vLLM/transformers required there.
  • Local dev: make infer-smoke after pip install 'vllm-cpu>=0.17'. On Apple-Silicon, prefer the Docker path documented in dev-on-macos.md.
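
The "4 non-empty completion rows" assertion from the default CI bullet can be sketched as a small JSONL check. This is an illustration only: the helper name check_completions and the field name "completion" are assumptions, since the actual output schema of rollout infer batch is not shown in this chapter.

```python
import json
from typing import Iterable


def check_completions(lines: Iterable[str], expected_rows: int = 4,
                      field: str = "completion") -> int:
    """Validate JSONL rows; return the row count or raise AssertionError.

    The field name is a hypothetical placeholder for whatever key the
    real output uses for generated text.
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    if len(rows) != expected_rows:
        raise AssertionError(f"expected {expected_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        text = row.get(field)
        if not isinstance(text, str) or not text.strip():
            raise AssertionError(f"row {i} has an empty {field!r} field")
    return len(rows)


# In-memory sample standing in for the smoke run's output file.
sample = [json.dumps({"completion": f"answer {i}"}) for i in range(4)]
print(check_completions(sample))  # 4
```

Blank lines are skipped, so a trailing newline in the output file does not count as a row.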

Failure modes

| Failure | Surface | Diagnosis |
| --- | --- | --- |
| import torch fails | Python ImportError at engine init | Active venv missing torch; pip install torch first |
| import vllm fails | Python ImportError at engine init | Active venv missing vllm; for the CPU/CI path pip install 'vllm-cpu>=0.17'; on a CUDA host install the matching CUDA wheel |
| torch.cuda.is_available() == False on a GPU host | engine boots in CPU mode silently | NVIDIA driver/runtime mismatch; install matching CUDA runtime; the explicit probe surfaces this rather than masking it |
| vllm import succeeds but AsyncLLMEngine.from_engine_args panics with device="cpu" not supported | vLLM version too old | upgrade to vllm>=0.10 |
| make infer-smoke times out (>300 s) on a CPU host | model larger than Qwen2.5-0.5B-Instruct | use the canonical examples/batch-tiny.toml model; do not run multi-billion-param models on CPU |
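
The two import failures in the table can be caught before the engine starts. The following is a minimal preflight sketch, assuming a hypothetical helper named preflight that is not part of the rollout codebase. It uses importlib.util.find_spec to locate modules without importing them, and its messages paraphrase the diagnoses above.

```python
import importlib.util
from typing import Callable, List


def _module_present(name: str) -> bool:
    """True if the import system can locate `name` (without importing it)."""
    return importlib.util.find_spec(name) is not None


def preflight(has_module: Callable[[str], bool] = _module_present) -> List[str]:
    """Return diagnosis strings for missing imports; empty list means OK.

    preflight and the injectable has_module are hypothetical names used
    for illustration.
    """
    problems = []
    if not has_module("torch"):
        problems.append("torch missing from the active venv: pip install torch")
    if not has_module("vllm"):
        problems.append(
            "vllm missing from the active venv: for the CPU/CI path, "
            "pip install 'vllm-cpu>=0.17'"
        )
    return problems


# Simulate a venv that has torch but not vllm.
for line in preflight(lambda name: name == "torch"):
    print(line)
```

Calling preflight() with no arguments checks the active environment. The injected predicate exists only so the logic can be exercised on a machine with neither package installed.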