SFT — Supervised Fine-Tuning

Phase 4 plan 04-02 ships rollout-algo-sft: a PolicyAlgorithm skeleton driven by a deterministic MockBackend, with the load-bearing TRAIN-03 byte-compare resume proof. This chapter covers the architecture, the SFT settings shape, the JSONL data contract, and how snapshot_save / snapshot_restore participate in the TRAIN-03 round-trip.

The HF transformers + accelerate path lands in plan 04-05; this skeleton intentionally has zero Python / GPU dependencies so the snapshot resume contract is exercised on every CI build.

Architecture

   ┌─────────────────────────┐         ┌────────────────────────────┐
   │ SftSettings (TOML)      │         │ AlgoDependencies           │
   │   base_model, optimizer │         │   backend  : Arc<dyn TB>   │
   │   budget, dataset       │         │   storage  : Arc<dyn S>    │
   │   packing, loss_on, …   │         │   object   : Arc<dyn O>    │
   └────────────┬────────────┘         │   snapshots: Arc<dyn Sn>   │
                │ from_settings        │   events   : Arc<dyn Em>   │
                ▼                      └─────────────┬──────────────┘
       ┌─────────────────────┐                       │
       │     SftAlgo         │ ◀─────────────────────┘
       │  (PolicyAlgorithm)  │
       │   step: u64         │     run() loop, bounded by budget.max_steps
       └──────────┬──────────┘
                  │ step_once()
                  ▼
       forward_with_loss  →  optimizer_step  →  step += 1
                  │
                  ▼
     snapshot_save / snapshot_restore (algo meta only — weights via TrainableBackend::save_weights)

SftAlgo holds an Arc<dyn TrainableBackend> and a step counter. Each step_once() synthesises a single-row TrainBatch, calls forward_with_loss (which returns a constant loss=0.5 plus an opaque GradHandle), then calls optimizer_step. The trait's optimizer_step takes &self (interior mutability) so the algo can step through the Arc<dyn …> without unique ownership.

Plan 04-05 replaces the synthetic batch with a real tokenized chunk read from the dataset.
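The step loop above can be sketched as follows. This is a minimal illustration, assuming a trimmed-down TrainableBackend trait; the TrainBatch and GradHandle fields, the CountingBackend type, and the max_steps field are hypothetical stand-ins, not the actual rollout definitions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical single-row batch; the real TrainBatch likely carries more.
pub struct TrainBatch {
    pub tokens: Vec<u32>,
}

/// Opaque handle returned by the forward pass (illustrative only).
pub struct GradHandle(pub u64);

/// Trimmed-down stand-in for the backend trait. Both methods take `&self`,
/// so a backend has to keep its mutable state behind interior mutability.
pub trait TrainableBackend: Send + Sync {
    fn forward_with_loss(&self, batch: &TrainBatch) -> (f32, GradHandle);
    fn optimizer_step(&self, grad: GradHandle);
}

/// Toy backend that only counts optimizer steps behind an atomic.
pub struct CountingBackend {
    pub optimizer_steps: AtomicU64,
}

impl TrainableBackend for CountingBackend {
    fn forward_with_loss(&self, _batch: &TrainBatch) -> (f32, GradHandle) {
        // Constant loss, mirroring the mock described in the text.
        (0.5, GradHandle(0))
    }

    fn optimizer_step(&self, _grad: GradHandle) {
        self.optimizer_steps.fetch_add(1, Ordering::SeqCst);
    }
}

pub struct SftAlgo {
    backend: Arc<dyn TrainableBackend>,
    step: u64,
    max_steps: u64,
}

impl SftAlgo {
    pub fn new(backend: Arc<dyn TrainableBackend>, max_steps: u64) -> Self {
        Self { backend, step: 0, max_steps }
    }

    /// One training step: forward pass, optimizer step, bump the counter.
    pub fn step_once(&mut self) -> f32 {
        let batch = TrainBatch { tokens: vec![0] }; // synthetic single row
        let (loss, grad) = self.backend.forward_with_loss(&batch);
        self.backend.optimizer_step(grad);
        self.step += 1;
        loss
    }

    /// Step until the budget is exhausted; returns the final step count.
    pub fn run(&mut self) -> u64 {
        while self.step < self.max_steps {
            self.step_once();
        }
        self.step
    }
}
```

Because optimizer_step takes &self, the algo can share the backend through an Arc while a test keeps its own clone of the same Arc to inspect the backend afterwards.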

SftSettings (TOML shape)

[algorithm]
kind = "sft"

[algorithm.sft]
minibatch_size = 8
gradient_accumulation = 1
loss_on = { kind = "assistant_only" }

[algorithm.sft.base_model]
uri = "Qwen/Qwen2.5-0.5B-Instruct"

[algorithm.sft.dataset]
kind = "jsonl_path"
path = "data/sft-tiny.jsonl"

[algorithm.sft.optimizer]
kind = "sgd"
lr = 1.0e-3
weight_decay = 0.0
betas = [0.9, 0.999]
eps = 1.0e-8
warmup_steps = 0
schedule = "constant"

[algorithm.sft.budget]
max_steps = 1000

[algorithm.sft.packing]
kind = "off"
max_seq_len = 2048

The JSON schema is generated by cargo xtask schema-gen from rollout_core::config::training::SftSettings.

JSONL data contract (D-DATA-01)

load_jsonl accepts two shapes per row:

Shape               | JSON | DataRow
Prompt / completion | {"prompt":"Q","completion":"A"} | { prompt: "Q", assistant: "A" }
Chat messages       | {"messages":[{"role":"user","content":"Q"},{"role":"assistant","content":"A"}]} | { prompt: "[user] Q", assistant: "A" }

Phase-4 restrictions:

  • At most one "role": "assistant" turn per row (multi-turn lands when the harness work in Phase 7 needs it).
  • Empty lines are skipped.
  • Malformed lines (neither shape, missing assistant in messages, unparseable JSON) produce Fatal(ConfigInvalid) with the file path and line number — <path>:<lineno>: <reason>. Easy to grep, easy to fix.

The Phase-7 work (HARNESS-*) extends this to DatasetRef::Other(...) for harness-driven datasets; until then, Other is a config error.
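The normalization of both row shapes into a DataRow might look like the sketch below. It starts from already-decoded rows, so JSON parsing is out of scope. RawRow, Message, the ConfigInvalid struct layout, and the normalize function are hypothetical names, and joining several non-assistant turns with a newline is an assumption; the source only shows the single-turn "[user] Q" case.

```rust
use std::fmt;

/// Already-decoded row shapes (hypothetical; JSON decoding is not shown).
pub enum RawRow {
    PromptCompletion { prompt: String, completion: String },
    Messages(Vec<Message>),
}

pub struct Message {
    pub role: String,
    pub content: String,
}

#[derive(Debug, PartialEq)]
pub struct DataRow {
    pub prompt: String,
    pub assistant: String,
}

/// Illustrative error carrying the `<path>:<lineno>: <reason>` pieces.
#[derive(Debug, PartialEq)]
pub struct ConfigInvalid {
    pub path: String,
    pub lineno: usize,
    pub reason: String,
}

impl fmt::Display for ConfigInvalid {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}:{}: {}", self.path, self.lineno, self.reason)
    }
}

pub fn normalize(raw: RawRow, path: &str, lineno: usize) -> Result<DataRow, ConfigInvalid> {
    let err = |reason: &str| ConfigInvalid {
        path: path.to_string(),
        lineno,
        reason: reason.to_string(),
    };
    match raw {
        RawRow::PromptCompletion { prompt, completion } => {
            Ok(DataRow { prompt, assistant: completion })
        }
        RawRow::Messages(messages) => {
            let mut prompt_parts = Vec::new();
            let mut assistant: Option<String> = None;
            for m in messages {
                if m.role == "assistant" {
                    // Assumption: a second assistant turn is rejected outright.
                    if assistant.is_some() {
                        return Err(err("more than one assistant turn"));
                    }
                    assistant = Some(m.content);
                } else {
                    prompt_parts.push(format!("[{}] {}", m.role, m.content));
                }
            }
            let assistant =
                assistant.ok_or_else(|| err("missing assistant turn in messages"))?;
            // Assumption: multiple prompt turns are joined with newlines.
            Ok(DataRow { prompt: prompt_parts.join("\n"), assistant })
        }
    }
}
```

Formatting the error with Display yields a single grep-friendly line such as data/sft-tiny.jsonl:3: missing assistant turn in messages.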

validate_plan errors

Locator                      | Reason
algorithm.sft.minibatch_size | must be ≥ 1
algorithm.sft.optimizer.lr   | must be > 0

These fire at plan time so the CLI rejects bad configs before any backend is constructed.
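A plan-time check of this kind could be sketched as below. The flattened SftSettings struct, the PlanError type, and the function signature are hypothetical simplifications; only the two locators and their reasons come from the table above.

```rust
/// Hypothetical subset of the settings relevant to plan-time validation.
pub struct SftSettings {
    pub minibatch_size: u32,
    pub lr: f64,
}

#[derive(Debug, PartialEq)]
pub struct PlanError {
    pub locator: &'static str,
    pub reason: &'static str,
}

/// Collects every violation instead of stopping at the first one, so a
/// single run can report all bad fields (an assumption about the real API).
pub fn validate_plan(settings: &SftSettings) -> Vec<PlanError> {
    let mut errors = Vec::new();
    if settings.minibatch_size < 1 {
        errors.push(PlanError {
            locator: "algorithm.sft.minibatch_size",
            reason: "must be ≥ 1",
        });
    }
    // Written as a negated comparison so that NaN is rejected as well.
    if !(settings.lr > 0.0) {
        errors.push(PlanError {
            locator: "algorithm.sft.optimizer.lr",
            reason: "must be > 0",
        });
    }
    errors
}
```

An empty result means the plan may proceed; any entry can be reported to the user by its locator before a backend is built.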

snapshot_save / snapshot_restore (D-DETERM-05)

SftAlgo::snapshot_save builds a Snapshot row with:

  • meta = { step: <u64>, weights_id: "<hex>" } — algorithm-internal extras, free-form serde_json::Value.
  • parts = [{ role: "weights", content: <ContentId> }] — points at the bytes returned by TrainableBackend::save_weights.
  • kind = SnapshotKind::TrainState.

SftAlgo::snapshot_restore reads meta.step and resets the algo's step counter. The backend's weights are restored separately: production backends call TrainableBackend::load_weights(&weights_id), while the Phase-4 snapshot_resume.rs test rebuilds MockBackend directly from a captured weights_snapshot(). The test takes that route because load_weights is a no-op on the mock, so a byte-compare assertion that went through it would prove nothing.

The full production save path (tar of the accelerate dir, blake3 over the tar bytes, content-addressed put on the object store) lives in SnapshotterImpl::save_train_state (plan 04-01) and runs alongside the algo-level snapshot_save call.
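The algo-level round-trip can be illustrated with the sketch below. It is a simplification under stated assumptions: meta is modelled as a typed struct rather than a free-form serde_json::Value, the content id is a plain hex string, and every type here is a hypothetical stand-in for the real rollout type of the same name.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum SnapshotKind {
    TrainState,
}

/// One snapshot part; `content` stands in for a ContentId (hex string here).
#[derive(Debug, Clone, PartialEq)]
pub struct Part {
    pub role: String,
    pub content: String,
}

/// Typed stand-in for the free-form meta value.
#[derive(Debug, Clone, PartialEq)]
pub struct Meta {
    pub step: u64,
    pub weights_id: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub kind: SnapshotKind,
    pub meta: Meta,
    pub parts: Vec<Part>,
}

pub struct SftAlgo {
    pub step: u64,
}

impl SftAlgo {
    /// Records the step counter and points at the already-saved weights.
    pub fn snapshot_save(&self, weights_id: &str) -> Snapshot {
        Snapshot {
            kind: SnapshotKind::TrainState,
            meta: Meta { step: self.step, weights_id: weights_id.to_string() },
            parts: vec![Part {
                role: "weights".to_string(),
                content: weights_id.to_string(),
            }],
        }
    }

    /// Restores algo metadata only; weights are the backend's concern.
    pub fn snapshot_restore(&mut self, snapshot: &Snapshot) {
        self.step = snapshot.meta.step;
    }
}
```

Saving at step 5 and restoring into a fresh algo brings its counter back to 5 without touching any weight bytes, which is the split of responsibilities the section describes.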

TRAIN-03 byte-compare proof

tests/snapshot_resume.rs::bit_identical_resume_at_step_5 is the LOAD-BEARING proof for TRAIN-03. It runs on every CI build with no GPU and no HF transformers — exercising the resume contract on the MockBackend path. The flow:

  1. Run A: fresh MockBackend::new_train(42) → 10 step_once iterations → capture weights_a.
  2. Run B (phase 1): fresh MockBackend::new_train(42) → 5 step_once iterations → capture weights_after_5 → algo_b1.snapshot_save() → drop algo + backend.
  3. Run B (phase 2): MockBackend::new_train_with_weights(42, weights_after_5) → push step counter to 5 via test helper set_step → algo snapshot_restore → 5 more step_once iterations → capture weights_b.
  4. Assert weights_a == weights_b (byte-equal).

The set_step(5) helper exists only on MockBackend. The algo sees the backend solely as Arc<dyn TrainableBackend>, and load_weights is a no-op on the mock, so the test has no other way to restore the backend's optimizer step counter. Production backends restore that counter via their own checkpoint format inside load_weights.
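The flow above can be reproduced in miniature with a toy deterministic backend. This is a sketch, not the actual test: it drives the backend directly rather than through SftAlgo, and the weight layout, the LCG-based initialisation, the update rule, and the run_straight / run_with_resume helpers are all invented for illustration. Only the method names new_train, new_train_with_weights, set_step, and weights_snapshot mirror the text.

```rust
/// Toy deterministic backend; every update depends on the optimizer step.
pub struct MockBackend {
    weights: Vec<f32>,
    opt_step: u64,
}

impl MockBackend {
    /// Seed-derived initial weights from a small LCG (illustrative).
    pub fn new_train(seed: u64) -> Self {
        let mut state = seed;
        let weights = (0..4)
            .map(|_| {
                state = state
                    .wrapping_mul(6364136223846793005)
                    .wrapping_add(1442695040888963407);
                (state >> 40) as f32 / (1u64 << 24) as f32
            })
            .collect();
        Self { weights, opt_step: 0 }
    }

    /// Rebuilds a backend from captured little-endian f32 bytes.
    pub fn new_train_with_weights(_seed: u64, bytes: &[u8]) -> Self {
        let weights = bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Self { weights, opt_step: 0 }
    }

    /// Test-only affordance: restore the optimizer step counter.
    pub fn set_step(&mut self, step: u64) {
        self.opt_step = step;
    }

    /// One deterministic update whose size depends on the step number.
    pub fn step(&mut self) {
        let k = (self.opt_step + 1) as f32;
        for (i, w) in self.weights.iter_mut().enumerate() {
            *w -= 1e-3 * k * (i as f32 + 1.0);
        }
        self.opt_step += 1;
    }

    pub fn weights_snapshot(&self) -> Vec<u8> {
        self.weights.iter().flat_map(|w| w.to_le_bytes()).collect()
    }
}

/// Run A: a single uninterrupted run.
pub fn run_straight(seed: u64, steps: u64) -> Vec<u8> {
    let mut backend = MockBackend::new_train(seed);
    for _ in 0..steps {
        backend.step();
    }
    backend.weights_snapshot()
}

/// Run B: stop after `before` steps, rebuild from bytes, then continue.
pub fn run_with_resume(seed: u64, before: u64, after: u64, restore_step: bool) -> Vec<u8> {
    let captured = run_straight(seed, before); // phase 1; backend is dropped
    let mut backend = MockBackend::new_train_with_weights(seed, &captured);
    if restore_step {
        backend.set_step(before);
    }
    for _ in 0..after {
        backend.step();
    }
    backend.weights_snapshot()
}
```

Comparing run_straight(42, 10) with run_with_resume(42, 5, 5, true) gives byte-equal weights. Passing false for restore_step makes the two diverge in this toy, which illustrates why the step counter has to be restored alongside the weights.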

Running the example

The smallest possible SFT run lives at examples/sft-tiny.toml + examples/sft-tiny.jsonl (4 chat rows; Qwen2.5-0.5B-Instruct; max_steps = 2). Two ways to exercise it:

Dry-run (works without Python deps; validates config + dataset path + algorithm shape):

cargo run -p rollout-cli -- train sft \
  --config examples/sft-tiny.toml --dry-run

Live run (requires transformers + accelerate + torch; roughly 3-5 min on an M-series CPU):

pip install 'transformers>=4.45,<5.0' 'accelerate>=1.0,<2.0' 'torch>=2.1,<3.0'
ROLLOUT_TRANSFORMERS_AVAILABLE=1 make train-smoke

make train-smoke invokes scripts/train-smoke.sh, which dry-runs first and then runs the full SFT path against Qwen/Qwen2.5-0.5B-Instruct on CPU. See CLI for the full subcommand surface.

Next steps

  • Plan 04-05 (backend-vllm-train) swaps the synthetic batch + deterministic-SGD path for real tokenization through HF transformers + accelerate; the same PolicyAlgorithm surface drives it.
  • Plan 04-06 (CLI) mounts rollout train sft --config <toml> on top of SftAlgo::run.
  • Plan 04-07 polishes the docs and ships the v1 SFT smoke recipe.