SFT — Supervised Fine-Tuning
Phase 4 plan 04-02 ships rollout-algo-sft: a PolicyAlgorithm skeleton
driven by a deterministic MockBackend, with the load-bearing TRAIN-03
byte-compare resume proof. This chapter covers the architecture, the SFT
settings shape, the JSONL data contract, and how snapshot_save /
snapshot_restore participate in the TRAIN-03 round-trip.
The HF transformers + accelerate path lands in plan 04-05; this skeleton intentionally has zero Python / GPU dependencies so the snapshot resume contract is exercised on every CI build.
Architecture
┌─────────────────────────┐    ┌────────────────────────────┐
│ SftSettings (TOML)      │    │ AlgoDependencies           │
│ base_model, optimizer   │    │ backend  : Arc<dyn TB>     │
│ budget, dataset         │    │ storage  : Arc<dyn S>      │
│ packing, loss_on, …     │    │ object   : Arc<dyn O>      │
└────────────┬────────────┘    │ snapshots: Arc<dyn Sn>     │
             │ from_settings   │ events   : Arc<dyn Em>     │
             ▼                 └─────────────┬──────────────┘
  ┌─────────────────────┐                    │
  │ SftAlgo             │ ◀──────────────────┘
  │ (PolicyAlgorithm)   │
  │ step: u64           │  run() loop, bounded by budget.max_steps
  └──────────┬──────────┘
             │ step_once()
             ▼
  forward_with_loss → optimizer_step → step += 1
             │
             ▼
  snapshot_save / snapshot_restore (algo meta only; weights via TrainableBackend::save_weights)
SftAlgo holds an Arc<dyn TrainableBackend> and a step counter. Each
step_once() synthesises a single-row TrainBatch, calls
forward_with_loss (which returns a constant loss=0.5 plus an opaque
GradHandle), then calls optimizer_step. The trait's optimizer_step
takes &self (interior mutability) so the algo can step through the
Arc<dyn …> without unique ownership.
Plan 04-05 replaces the synthetic batch with a real tokenized chunk read from the dataset.
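To make the &self point concrete, here is a minimal, self-contained sketch of the step loop. It is not the crate's code: the trait is a simplified stand-in for TrainableBackend (the real signatures are not shown in this chapter), and CountingBackend, the TrainBatch fields, and the run(max_steps) signature are hypothetical names introduced only for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Opaque gradient token; the real GradHandle's contents are not shown here.
pub struct GradHandle;

/// Single-row batch stand-in (hypothetical fields).
pub struct TrainBatch {
    pub prompt: String,
    pub assistant: String,
}

/// Simplified, assumed subset of the backend trait.
pub trait TrainableBackend: Send + Sync {
    fn forward_with_loss(&self, batch: &TrainBatch) -> (f32, GradHandle);
    /// Takes `&self`, so it can be called through a shared `Arc<dyn ...>`.
    fn optimizer_step(&self, grad: GradHandle);
}

/// Toy backend: counts optimizer steps behind an atomic (interior mutability).
pub struct CountingBackend {
    steps: AtomicU64,
}

impl CountingBackend {
    pub fn new() -> Self {
        Self { steps: AtomicU64::new(0) }
    }
    pub fn steps(&self) -> u64 {
        self.steps.load(Ordering::SeqCst)
    }
}

impl TrainableBackend for CountingBackend {
    fn forward_with_loss(&self, _batch: &TrainBatch) -> (f32, GradHandle) {
        (0.5, GradHandle) // constant loss, mirroring the mock described above
    }
    fn optimizer_step(&self, _grad: GradHandle) {
        self.steps.fetch_add(1, Ordering::SeqCst);
    }
}

pub struct SftAlgo {
    backend: Arc<dyn TrainableBackend>,
    step: u64,
}

impl SftAlgo {
    pub fn new(backend: Arc<dyn TrainableBackend>) -> Self {
        Self { backend, step: 0 }
    }
    pub fn step(&self) -> u64 {
        self.step
    }
    /// One training step: synthetic batch, forward, optimizer step, count.
    pub fn step_once(&mut self) -> f32 {
        let batch = TrainBatch { prompt: "Q".into(), assistant: "A".into() };
        let (loss, grad) = self.backend.forward_with_loss(&batch);
        self.backend.optimizer_step(grad);
        self.step += 1;
        loss
    }
    /// Assumed shape of the run loop: step until the budget is reached.
    pub fn run(&mut self, max_steps: u64) {
        while self.step < max_steps {
            self.step_once();
        }
    }
}
```

Because the backend mutates through an atomic rather than through &mut self, the test harness can keep its own Arc clone and inspect the backend after the algo has stepped it.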
SftSettings (TOML shape)
[algorithm]
kind = "sft"
[algorithm.sft]
minibatch_size = 8
gradient_accumulation = 1
loss_on = { kind = "assistant_only" }
[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"
[algorithm.sft.dataset]
kind = "jsonl_path"
path = "data/sft-tiny.jsonl"
[algorithm.sft.optimizer]
kind = "sgd"
lr = 1.0e-3
weight_decay = 0.0
betas = [0.9, 0.999]
eps = 1.0e-8
warmup_steps = 0
schedule = "constant"
[algorithm.sft.budget]
max_steps = 1000
[algorithm.sft.packing]
kind = "off"
max_seq_len = 2048
The JSON schema is generated by cargo xtask schema-gen from
rollout_core::config::training::SftSettings.
JSONL data contract (D-DATA-01)
load_jsonl accepts two shapes per row:
| Shape | JSON | DataRow |
|---|---|---|
| Prompt / completion | {"prompt":"Q","completion":"A"} | { prompt: "Q", assistant: "A" } |
| Chat messages | {"messages":[{"role":"user","content":"Q"},{"role":"assistant","content":"A"}]} | { prompt: "[user] Q", assistant: "A" } |
Phase-4 restrictions:
- At most one "role": "assistant" turn per row (multi-turn lands when the harness work in Phase 7 needs it).
- Empty lines are skipped.
- Malformed lines (neither shape, missing assistant in messages, unparseable JSON) produce Fatal(ConfigInvalid) with the file path and line number, formatted as <path>:<lineno>: <reason>. Easy to grep, easy to fix.
The Phase-7 work (HARNESS-*) extends this to DatasetRef::Other(...)
for harness-driven datasets; until then, Other is a config error.
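The row-shape mapping and the error locator can be sketched without a JSON parser by starting from already-decoded rows. This is an illustration under stated assumptions, not load_jsonl itself: RawRow, normalize, locate, and numbered_nonempty are hypothetical names, line numbers are assumed to be 1-based, and joining several non-assistant turns with a newline is a guess (the table above only shows the single-user-turn case).

```rust
/// Row shapes after JSON decoding (decoding itself is out of scope here).
pub enum RawRow {
    PromptCompletion { prompt: String, completion: String },
    /// (role, content) pairs from the "messages" array.
    Messages(Vec<(String, String)>),
}

#[derive(Debug, PartialEq)]
pub struct DataRow {
    pub prompt: String,
    pub assistant: String,
}

/// Maps either accepted shape onto a DataRow, enforcing the Phase-4 limits.
pub fn normalize(raw: RawRow) -> Result<DataRow, String> {
    match raw {
        RawRow::PromptCompletion { prompt, completion } => {
            Ok(DataRow { prompt, assistant: completion })
        }
        RawRow::Messages(msgs) => {
            let mut assistant: Option<String> = None;
            let mut prompt_parts = Vec::new();
            for (role, content) in msgs {
                if role == "assistant" {
                    if assistant.is_some() {
                        return Err("more than one assistant turn".to_string());
                    }
                    assistant = Some(content);
                } else {
                    // Renders e.g. "[user] Q", matching the table above.
                    prompt_parts.push(format!("[{role}] {content}"));
                }
            }
            match assistant {
                // Assumption: multiple prompt turns are newline-joined.
                Some(a) => Ok(DataRow { prompt: prompt_parts.join("\n"), assistant: a }),
                None => Err("missing assistant message".to_string()),
            }
        }
    }
}

/// Formats the greppable locator: <path>:<lineno>: <reason>.
pub fn locate(path: &str, lineno: usize, reason: &str) -> String {
    format!("{path}:{lineno}: {reason}")
}

/// Skips empty lines but keeps original (assumed 1-based) line numbers,
/// so error locators still point at the right place in the file.
pub fn numbered_nonempty(text: &str) -> Vec<(usize, &str)> {
    text.lines()
        .enumerate()
        .filter(|(_, line)| !line.trim().is_empty())
        .map(|(i, line)| (i + 1, line))
        .collect()
}
```

Numbering lines before filtering out the empty ones is what keeps the reported line number aligned with what an editor shows.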
validate_plan errors
| Locator | Reason |
|---|---|
| algorithm.sft.minibatch_size | must be ≥ 1 |
| algorithm.sft.optimizer.lr | must be > 0 |
These fire at plan time so the CLI rejects bad configs before any backend is constructed.
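A plan-time check of this kind might look roughly like the sketch below. The struct and function shapes are assumptions made for illustration (the real validate_plan signature is not shown in this chapter); only the two locators and reasons come from the table. Rejecting NaN learning rates is a choice made in this sketch, not documented behaviour.

```rust
/// One validation failure: config locator plus human-readable reason.
#[derive(Debug, PartialEq)]
pub struct PlanError {
    pub locator: &'static str,
    pub reason: &'static str,
}

/// Hypothetical subset of the SFT settings that the two checks need.
pub struct SftSettingsSketch {
    pub minibatch_size: u32,
    pub lr: f64,
}

/// Collects every violation instead of stopping at the first one,
/// so a single dry-run reports all config problems.
pub fn validate_plan(s: &SftSettingsSketch) -> Vec<PlanError> {
    let mut errors = Vec::new();
    if s.minibatch_size < 1 {
        errors.push(PlanError {
            locator: "algorithm.sft.minibatch_size",
            reason: "must be ≥ 1",
        });
    }
    // `!(lr > 0.0)` also rejects NaN, which `lr <= 0.0` would let through.
    if !(s.lr > 0.0) {
        errors.push(PlanError {
            locator: "algorithm.sft.optimizer.lr",
            reason: "must be > 0",
        });
    }
    errors
}
```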
snapshot_save / snapshot_restore (D-DETERM-05)
SftAlgo::snapshot_save builds a Snapshot row with:
- meta = { step: <u64>, weights_id: "<hex>" }: algorithm-internal extras, stored as a free-form serde_json::Value.
- parts = [{ role: "weights", content: <ContentId> }]: points at the bytes returned by TrainableBackend::save_weights.
- kind = SnapshotKind::TrainState.
SftAlgo::snapshot_restore reads meta.step and resets the algo's
step counter. The backend's weights are restored separately — production
backends call TrainableBackend::load_weights(&weights_id); the Phase-4
snapshot_resume.rs test rebuilds MockBackend directly from a
captured weights_snapshot() (the byte-compare assertion would be
meaningless if it went through load_weights, which is a no-op for the
mock).
The full production save path (tar of the accelerate dir, blake3 over
the tar bytes, content-addressed put on the object store) lives in
SnapshotterImpl::save_train_state (plan 04-01) and runs alongside the
algo-level snapshot_save call.
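The split between algo metadata and backend weights can be illustrated with typed stand-ins. Everything here is a simplification and an assumption: the real Snapshot keeps meta as a free-form serde_json::Value and content as a ContentId, whereas this sketch uses plain structs and strings, and AlgoState is a hypothetical name for the part of SftAlgo that holds the step counter.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum SnapshotKind {
    TrainState,
}

#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotPart {
    pub role: String,
    pub content: String, // hex content id of the saved weights (stand-in)
}

#[derive(Debug, Clone, PartialEq)]
pub struct SnapshotMeta {
    pub step: u64,
    pub weights_id: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub kind: SnapshotKind,
    pub meta: SnapshotMeta,
    pub parts: Vec<SnapshotPart>,
}

/// Algo-side state: only the step counter, never the weights themselves.
pub struct AlgoState {
    pub step: u64,
}

impl AlgoState {
    /// Records the step counter and a pointer to the already-saved weights.
    pub fn snapshot_save(&self, weights_id: &str) -> Snapshot {
        Snapshot {
            kind: SnapshotKind::TrainState,
            meta: SnapshotMeta {
                step: self.step,
                weights_id: weights_id.to_string(),
            },
            parts: vec![SnapshotPart {
                role: "weights".to_string(),
                content: weights_id.to_string(),
            }],
        }
    }

    /// Resets the step counter and hands back the weights id so the caller
    /// can restore the backend separately (for example via load_weights).
    pub fn snapshot_restore<'a>(&mut self, snap: &'a Snapshot) -> &'a str {
        self.step = snap.meta.step;
        &snap.meta.weights_id
    }
}
```

Returning the weights id from the restore call is a design choice of this sketch; it makes explicit that restoring the weights is the caller's job, not the algo's.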
TRAIN-03 byte-compare proof
tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the
LOAD-BEARING proof for TRAIN-03. It runs on every CI build with no GPU
and no HF transformers — exercising the resume contract on the
MockBackend path. The flow:
- Run A: fresh MockBackend::new_train(42) → 10 step_once iterations → capture weights_a.
- Run B (phase 1): fresh MockBackend::new_train(42) → 5 step_once iterations → capture weights_after_5 → algo_b1.snapshot_save() → drop algo + backend.
- Run B (phase 2): MockBackend::new_train_with_weights(42, weights_after_5) → push step counter to 5 via test helper set_step → algo snapshot_restore → 5 more step_once iterations → capture weights_b.
- Assert weights_a == weights_b (byte-equal).
The set_step(5) helper is a MockBackend-only affordance. It exists because
the algo reaches the backend only through Arc<dyn TrainableBackend>, and
load_weights is a no-op on the mock, so nothing else would restore the
backend's optimizer step counter. Production backends restore that counter
via their own checkpoint format inside load_weights.
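The flow above, and the reason the step counter has to be restored along with the weights, can be reproduced with a toy backend. This is a sketch under stated assumptions, not the real MockBackend: ToyBackend borrows the method names from this chapter, but its update rule and seeding are invented so that the weights depend on the optimizer step counter.

```rust
use std::sync::Mutex;

/// Toy deterministic backend. The update depends on the internal step
/// counter, so resuming with the right weights but the wrong counter diverges.
pub struct ToyBackend {
    state: Mutex<(Vec<u8>, u64)>, // (weights, optimizer step)
}

impl ToyBackend {
    /// Invented seeding: the initial weights are the seed's little-endian bytes.
    pub fn new_train(seed: u64) -> Self {
        Self::new_train_with_weights(seed, seed.to_le_bytes().to_vec())
    }
    pub fn new_train_with_weights(_seed: u64, weights: Vec<u8>) -> Self {
        Self { state: Mutex::new((weights, 0)) }
    }
    /// Test-only helper: force the optimizer step counter.
    pub fn set_step(&self, step: u64) {
        self.state.lock().unwrap().1 = step;
    }
    pub fn weights_snapshot(&self) -> Vec<u8> {
        self.state.lock().unwrap().0.clone()
    }
    /// Invented update rule: mixes each byte with the current step number.
    pub fn optimizer_step(&self) {
        let mut guard = self.state.lock().unwrap();
        let (weights, step) = &mut *guard;
        *step += 1;
        let delta = (*step % 251) as u8;
        for (i, w) in weights.iter_mut().enumerate() {
            *w = w.wrapping_mul(31).wrapping_add(delta).wrapping_add(i as u8);
        }
    }
}

/// Runs n optimizer steps and returns the resulting weights.
pub fn run_steps(backend: &ToyBackend, n: u64) -> Vec<u8> {
    for _ in 0..n {
        backend.optimizer_step();
    }
    backend.weights_snapshot()
}

/// Run B: 5 steps, drop, rebuild from captured weights, 5 more steps.
pub fn resumed_weights(apply_set_step: bool) -> Vec<u8> {
    let first = ToyBackend::new_train(42);
    let weights_after_5 = run_steps(&first, 5);
    drop(first);
    let second = ToyBackend::new_train_with_weights(42, weights_after_5);
    if apply_set_step {
        second.set_step(5);
    }
    run_steps(&second, 5)
}
```

Comparing run_steps(&ToyBackend::new_train(42), 10) against resumed_weights(true) gives byte-equal vectors, while resumed_weights(false) diverges: in this toy the second half would replay steps 1 to 5 instead of 6 to 10.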
Running the example
The smallest possible SFT run lives at examples/sft-tiny.toml +
examples/sft-tiny.jsonl (4 chat rows; Qwen2.5-0.5B-Instruct; max_steps = 2). Two ways to exercise it:
Dry-run (works without Python deps; validates config + dataset path + algorithm shape):
cargo run -p rollout-cli -- train sft \
--config examples/sft-tiny.toml --dry-run
Live run (requires transformers + accelerate + torch; roughly 3-5 min on an M-series CPU):
pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke
make train-smoke invokes scripts/train-smoke.sh, which dry-runs first and then
runs the full SFT path against Qwen/Qwen2.5-0.5B-Instruct on CPU. See
CLI for the full subcommand surface.
Next steps
- Plan 04-05 (backend-vllm-train) swaps the synthetic batch + deterministic-SGD path for real tokenization through HF transformers + accelerate; the same PolicyAlgorithm surface drives it.
- Plan 04-06 (CLI) mounts rollout train sft --config <toml> on top of SftAlgo::run.
- Plan 04-07 polishes the docs and ships the v1 SFT smoke recipe.