Reward-model training (RM)
rollout-algo-rm implements the Bradley-Terry reward-model training algorithm
(TRAIN-02). It mirrors the SFT algorithm's structure — PolicyAlgorithm impl
driven by a TrainableBackend, JSONL data loader, and a TRAIN-03
byte-compare resume proof — but consumes pairwise preferences instead of
single sequences.
Overview
A reward model learns to score responses on a scalar "quality" axis. Training
data is a stream of preference pairs (prompt, chosen, rejected): the model
should learn to rank chosen higher than rejected for the given prompt.
The Bradley-Terry objective formalizes this as a pairwise logistic regression on the reward gap:
L = -E[ ln σ(r_chosen - r_rejected) ]
where σ is the logistic function and r_* are the scalar reward outputs.
Spec 02 §7 carries the contract.
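Writing the per-pair term out shows why the loss is usually computed through softplus. This is the standard identity for the logistic function, not something specific to this crate; the symbol Δ for the reward gap is introduced here only for illustration.

```latex
\begin{aligned}
\Delta &= r_{\text{chosen}} - r_{\text{rejected}}, \qquad
\sigma(\Delta) = \frac{1}{1 + e^{-\Delta}} \\
-\ln \sigma(\Delta) &= \ln\!\left(1 + e^{-\Delta}\right) = \operatorname{softplus}(-\Delta) \\
\frac{\partial}{\partial \Delta}\bigl[-\ln \sigma(\Delta)\bigr] &= -\bigl(1 - \sigma(\Delta)\bigr) = -\sigma(-\Delta)
\end{aligned}
```

At Δ = 0 the per-pair loss is ln 2 and the gradient magnitude is 1/2; as Δ grows, both the loss and the gradient shrink toward 0, so pairs that are already ranked correctly contribute little to the update.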
RmSettings (TOML)
[algorithm.rm]
base_model = "Qwen/Qwen2.5-0.5B-Instruct"
head = "bradley_terry" # Phase 4 supports BradleyTerry only
minibatch_size = 8
[algorithm.rm.optimizer]
kind = "sgd"
lr = 1.0e-5
[algorithm.rm.budget]
max_steps = 100
[algorithm.rm.dataset]
type = "jsonl_path"
path = "examples/data/pairs.jsonl"
The remaining RmSettings fields (base_model, optimizer, budget, dataset)
mirror SftSettings. Phase 4 accepts only head = "bradley_terry";
selecting pairwise_logistic fails with Fatal(ConfigInvalid) carrying a
Phase 9 sentinel until the RL pipeline lands.
Bradley-Terry loss math
Implemented in crates/rollout-algo-rm/src/loss.rs:
- logsigmoid(x) = ln σ(x). Numerically stable via the softplus trick: logsigmoid(50) and logsigmoid(-50) both return finite values within 1e-4 of the true asymptote.
- bradley_terry_loss(r_chosen, r_rejected) = -logsigmoid(r_chosen - r_rejected).
- bradley_terry_batch_mean(pairs): mean over a slice of (r_chosen, r_rejected) pairs. Returns 0.0 for empty batches; callers should validate non-empty upstream when needed.
Pinned golden values (tests/bradley_terry_loss.rs):
| Case | Inputs | Expected |
|---|---|---|
| Zero diff | (1.0, 1.0) | ln 2 ≈ 0.6931 |
| Strong preference | (5.0, -5.0) | near 0 (≪ 1e-3) |
| Inverted preference | (-5.0, 5.0) | ≈ 10.0 (± 1e-3) |
| Mixed batch | [(2,1), (1,2)] | mean ≈ 0.8133 |
| Empty batch | &[] | exactly 0.0 |
| Numerical stability | logsigmoid(±50) | finite; asymptotic value within 1e-4 |
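The three functions and the golden values above can be reproduced with a short standalone sketch. This is not the contents of loss.rs: the f64 element type and the exact signatures are assumptions, and the stable form used here, ln σ(x) = min(x, 0) - ln(1 + e^(-|x|)), is one common way to write the softplus trick.

```rust
/// ln σ(x) without overflowing exp() for large |x|.
/// Uses the identity ln σ(x) = min(x, 0) - ln(1 + e^{-|x|}).
/// (f64 is an assumption; the crate may use a different float type.)
fn logsigmoid(x: f64) -> f64 {
    x.min(0.0) - (-x.abs()).exp().ln_1p()
}

/// Per-pair Bradley-Terry loss: -ln σ(r_chosen - r_rejected).
fn bradley_terry_loss(r_chosen: f64, r_rejected: f64) -> f64 {
    -logsigmoid(r_chosen - r_rejected)
}

/// Mean loss over (r_chosen, r_rejected) pairs; 0.0 for an empty slice.
fn bradley_terry_batch_mean(pairs: &[(f64, f64)]) -> f64 {
    if pairs.is_empty() {
        return 0.0;
    }
    let total: f64 = pairs
        .iter()
        .map(|&(chosen, rejected)| bradley_terry_loss(chosen, rejected))
        .sum();
    total / pairs.len() as f64
}

fn main() {
    // Same inputs as the golden-value table.
    println!("zero diff  : {:.4}", bradley_terry_loss(1.0, 1.0));
    println!("strong pref: {:.6}", bradley_terry_loss(5.0, -5.0));
    println!("inverted   : {:.4}", bradley_terry_loss(-5.0, 5.0));
    println!(
        "mixed batch: {:.4}",
        bradley_terry_batch_mean(&[(2.0, 1.0), (1.0, 2.0)])
    );
    println!("logsigmoid(-50) = {}", logsigmoid(-50.0));
}
```

The mixed-batch value follows from softplus(-1) ≈ 0.3133 and softplus(1) ≈ 1.3133, whose mean is ≈ 0.8133.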
JSONL data shape (D-DATA-01)
Phase 4 supports one row shape:
{"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
{"prompt": "Capital of France?", "chosen": "Paris", "rejected": "London"}
load_pairs(&path) parses line-by-line, skipping blank lines and rejecting
malformed rows with Fatal(ConfigInvalid) prefixed <file>:<lineno>:. A row
missing any of the three fields is malformed.
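The loading behavior described above (skip blank lines, fail on the first malformed row with a `<file>:<lineno>:` prefix) can be sketched with the standard library alone. Everything named here is hypothetical: PreferencePair, load_rows, parse_pair, and field are illustration names, not the crate's API; errors are plain strings rather than Fatal(ConfigInvalid); and field is a toy extractor that only handles flat objects with unescaped string values, where a real loader would use a JSON library. The sketch also assumes line numbers are 1-based and count blank lines.

```rust
/// Hypothetical stand-in for the crate's preference-pair type.
#[derive(Debug, PartialEq)]
struct PreferencePair {
    prompt: String,
    chosen: String,
    rejected: String,
}

/// Toy extractor: finds `"key": "value"` in a flat JSON object.
/// Does NOT handle escapes or nested values; illustration only.
fn field(line: &str, key: &str) -> Option<String> {
    let needle = format!("\"{}\"", key);
    let after_key = &line[line.find(&needle)? + needle.len()..];
    let after_colon = after_key.trim_start().strip_prefix(':')?.trim_start();
    let body = after_colon.strip_prefix('"')?;
    let end = body.find('"')?;
    Some(body[..end].to_string())
}

/// A row missing any of the three fields is malformed.
fn parse_pair(line: &str) -> Result<PreferencePair, String> {
    let get = |key: &str| field(line, key).ok_or_else(|| format!("missing field `{}`", key));
    Ok(PreferencePair {
        prompt: get("prompt")?,
        chosen: get("chosen")?,
        rejected: get("rejected")?,
    })
}

/// Line-oriented loader: skips blank lines and prefixes the first
/// error with `<file>:<lineno>:` (1-based, blank lines counted).
fn load_rows<T>(
    file: &str,
    text: &str,
    parse_row: impl Fn(&str) -> Result<T, String>,
) -> Result<Vec<T>, String> {
    let mut rows = Vec::new();
    for (idx, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let row = parse_row(line).map_err(|e| format!("{}:{}: {}", file, idx + 1, e))?;
        rows.push(row);
    }
    Ok(rows)
}

fn main() {
    let text = "{\"prompt\": \"What is 2+2?\", \"chosen\": \"4\", \"rejected\": \"5\"}\n\n\
                {\"prompt\": \"Capital of France?\", \"chosen\": \"Paris\"}\n";
    // The third line lacks `rejected`, so loading fails there.
    match load_rows("pairs.jsonl", text, parse_pair) {
        Ok(rows) => println!("loaded {} pairs", rows.len()),
        Err(e) => println!("error: {}", e),
    }
}
```

Keeping the row parser as a closure separates the line bookkeeping from JSON decoding, so the position-prefix logic can be exercised without a JSON dependency.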
PolicyAlgorithm surface
| Method | Behavior |
|---|---|
| id() | AlgorithmId("rm") |
| Settings | rollout_core::config::training::RmSettings |
| from_settings | clones deps.backend into the algo; step = 0 |
| required_roles | vec![WorkerRole::LearnerWorker] |
| validate_plan | rejects RmHeadKind::PairwiseLogistic (Phase 9), minibatch_size == 0, and lr <= 0 |
| run | loads pairs once; loops step_once up to budget.max_steps, honoring ctx.cancel |
| snapshot_save | meta = {step, weights_id}; one SnapshotPart { role: "weights" } |
| snapshot_restore | restores self.step from meta.step; backend weights restored separately |
step_once synthesizes a 2-row TrainBatch (one row per side of a pair) and
drives forward_with_loss → optimizer_step. In the Phase-4 MockBackend
test path the loss is a constant; the real Bradley-Terry loss fires under
plan 04-05's HF transformers integration.
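The validate_plan checks listed in the table can be sketched as a plain function. This is a minimal illustration under stated assumptions: RmPlan and PlanError are hypothetical types invented here (the real method lives on the PolicyAlgorithm impl and works on RmSettings), and the error messages are made up apart from the Phase 9 marker the document mentions.

```rust
/// Mirrors the two head kinds named in this document.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RmHeadKind {
    BradleyTerry,
    PairwiseLogistic,
}

/// Hypothetical, flattened view of the settings that validation inspects.
struct RmPlan {
    head: RmHeadKind,
    minibatch_size: usize,
    lr: f64,
}

/// Hypothetical error type standing in for Fatal(ConfigInvalid).
#[derive(Debug, PartialEq)]
enum PlanError {
    ConfigInvalid(String),
}

fn validate_plan(plan: &RmPlan) -> Result<(), PlanError> {
    if plan.head == RmHeadKind::PairwiseLogistic {
        // The message text is an assumption; only the "Phase 9" marker
        // comes from the document.
        return Err(PlanError::ConfigInvalid(
            "pairwise_logistic head is not available until Phase 9".to_string(),
        ));
    }
    if plan.minibatch_size == 0 {
        return Err(PlanError::ConfigInvalid(
            "minibatch_size must be greater than 0".to_string(),
        ));
    }
    if plan.lr <= 0.0 {
        return Err(PlanError::ConfigInvalid("lr must be positive".to_string()));
    }
    Ok(())
}

fn main() {
    let ok = RmPlan { head: RmHeadKind::BradleyTerry, minibatch_size: 8, lr: 1.0e-5 };
    println!("{:?}", validate_plan(&ok));

    let later = RmPlan { head: RmHeadKind::PairwiseLogistic, minibatch_size: 8, lr: 1.0e-5 };
    println!("{:?}", validate_plan(&later));
}
```

Checking the head first means a config that selects the not-yet-supported head reports that problem before any numeric one; the actual ordering in the crate is not stated in this document.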
TRAIN-03 second-witness — byte-compare resume
tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the Bradley-Terry
twin of the SFT byte-compare proof. Structure:
- Run A. 10 step_once iterations with seed = 42; capture weights.
- Run B Phase 1. 5 steps; capture mid-run weights; snapshot_save().
- Run B Phase 2. Rebuild MockBackend::new_train_with_weights(42, …); push step counter to 5; restore algo step from snapshot meta; 5 more steps.
- Assert. weights_a == weights_b byte-for-byte.
This is the second-witness for TRAIN-03 (the SFT proof is the first witness); together they discharge the "deterministic resume" exit criterion across both Phase-4 algorithms.
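The shape of that proof can be shown with a toy backend whose update depends only on the seed, the step counter, and the current weights. This is a sketch of the test structure, not the crate's test: ToyBackend and its methods are hypothetical stand-ins for MockBackend, and the hash-based pseudo-gradient is invented purely to make each step deterministic.

```rust
/// Hypothetical stand-in for MockBackend: deterministic given
/// (seed, step, weights).
struct ToyBackend {
    seed: u64,
    step: u64,
    weights: Vec<f32>,
}

impl ToyBackend {
    fn new(seed: u64) -> Self {
        Self { seed, step: 0, weights: vec![0.0; 4] }
    }

    /// Rebuild from restored weights and a restored step counter.
    fn with_weights(seed: u64, weights: Vec<f32>, step: u64) -> Self {
        Self { seed, step, weights }
    }

    /// One training step with a pseudo-gradient derived from (seed, step, index).
    fn step_once(&mut self) {
        for (i, w) in self.weights.iter_mut().enumerate() {
            let mut h = self.seed ^ self.step.wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ (i as u64);
            h ^= h >> 33;
            h = h.wrapping_mul(0xFF51_AFD7_ED55_8CCD);
            h ^= h >> 33;
            let grad = (h % 1000) as f32 / 1000.0 - 0.5;
            *w -= 0.01 * grad;
        }
        self.step += 1;
    }

    fn weight_bytes(&self) -> Vec<u8> {
        self.weights.iter().flat_map(|w| w.to_le_bytes()).collect()
    }
}

/// Run A: 10 uninterrupted steps.
fn run_uninterrupted() -> Vec<u8> {
    let mut backend = ToyBackend::new(42);
    for _ in 0..10 {
        backend.step_once();
    }
    backend.weight_bytes()
}

/// Run B: 5 steps, snapshot, rebuild from the snapshot, 5 more steps.
fn run_with_resume() -> Vec<u8> {
    let mut first = ToyBackend::new(42);
    for _ in 0..5 {
        first.step_once();
    }
    // "Snapshot": the step counter plus a copy of the weights.
    let (saved_step, saved_weights) = (first.step, first.weights.clone());

    let mut resumed = ToyBackend::with_weights(42, saved_weights, saved_step);
    for _ in 0..5 {
        resumed.step_once();
    }
    resumed.weight_bytes()
}

fn main() {
    assert_eq!(run_uninterrupted(), run_with_resume());
    println!("resume is byte-identical");
}
```

The comparison holds only because both the weights and the step counter are restored; in this toy, resuming with the counter reset would feed different pseudo-gradients into the last five steps.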
Content-addressed final checkpoint
tests/checkpoint_roundtrip.rs proves that TrainableBackend::save_weights
returns a ContentId that is stable when the backend is idle (two calls →
identical hash) and different after a non-trivial optimizer_step. This
matches the TRAIN-02 contract: the final checkpoint is content-addressed by
the blake3 hash of the postcard-encoded weights.
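The two properties the test checks (stable while idle, different after an update) can be illustrated without the real hashing stack. Note the substitution: this sketch uses std's DefaultHasher over little-endian weight bytes as a stand-in for blake3 over postcard-encoded weights, and content_id is a hypothetical helper rather than the crate's save_weights. DefaultHasher's algorithm is not guaranteed to stay the same across Rust releases, so it would not be suitable for real content addressing.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in content id: a 64-bit hash of the weights' little-endian bytes.
/// (The real contract uses blake3 over a postcard encoding.)
fn content_id(weights: &[f32]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for w in weights {
        hasher.write(&w.to_le_bytes());
    }
    hasher.finish()
}

fn main() {
    let mut weights = vec![0.25_f32, -0.5, 1.0, 0.0];

    // Idle backend: hashing twice yields the same id.
    let first = content_id(&weights);
    let second = content_id(&weights);
    println!("idle ids match: {}", first == second);

    // After a non-trivial update the id changes.
    weights[0] -= 0.01;
    let after_step = content_id(&weights);
    println!("id changed after update: {}", first != after_step);
}
```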
Phase 4 head support
Only RmHeadKind::BradleyTerry is wired in Phase 4. PairwiseLogistic exists
in the enum so the config schema can be cross-validated end-to-end, but
selecting it returns a Fatal(ConfigInvalid) with the string Phase 9 in the
message — Phase 9 lands the full RL pipeline including alternate preference
heads.
What lands later
- Plan 04-05 swaps MockBackend for the real HF transformers / accelerate training loop on Qwen/Qwen2.5-0.5B-Instruct (CPU), wiring the Python-side F.logsigmoid(r_chosen - r_rejected).neg().mean() and producing real reward models.
- Plan 04-06 mounts rollout train rm --config <toml> on RmAlgo::run.
- Phase 9 lands PairwiseLogistic and the RL-* algorithms (PPO/GRPO) that consume reward models trained here.
Running the example
The smallest possible RM run lives at examples/rm-tiny.toml +
examples/rm-tiny.jsonl (4 preference pairs; Qwen2.5-0.5B-Instruct base;
BradleyTerry head; max_steps = 2).
Dry-run (works without Python deps):
cargo run -p rollout-cli -- train rm \
--config examples/rm-tiny.toml --dry-run
The live rollout train rm path behind --features train is wired
identically to SFT. The Phase-4 smoke recipe (make train-smoke) exercises
SFT only; the RM pipeline is dry-run-validated here and lands
under the Phase-9 RL recipe alongside PPO / GRPO. See CLI for
the full subcommand surface.