# Integration Architecture: 3-Channel Reward Composition Across the Agentic-RL Stack

> **Status:** Architecture spec — verified against framework source code via DeepWiki on 2026-05-25.
> **Companion doc:** [`docs/COMPOSER_RECIPE_MAPPING.md`](COMPOSER_RECIPE_MAPPING.md) defines the three reward channels (RLVR / Composer-SDPO / N-Teacher-Replay). This document specifies *where each one hooks into each framework* — the actual function names, decorator surfaces, and DataProto fields you'd touch. Working code skeleton at [`spikes/005-integrated-trainer-skeleton/`](../spikes/005-integrated-trainer-skeleton/).

## TL;DR — the unified loss

For any framework choice, the v0.1 trainer computes:

```
total_loss = grpo_loss
           + α * sdpo_kl_loss        (Composer hint-distill, channel 2)
           + β * trace_replay_loss   (N-teacher novel channel, channel 3)
```

Where:
- **`grpo_loss`** = standard GRPO+DAPO over RLVR scalar rewards (channel 1, the substrate).
- **`sdpo_kl_loss`** = `generalized_jsd_loss(student_logits, teacher_logits, labels=…, beta=0.5, …)` — single-model self-distillation, where `teacher_logits` come from a forward pass on the student model with a hint inserted into the context. **Lifted verbatim from `siyan-zhao/OPSD::generalized_jsd_loss`** (verified self-contained static method, MIT licensed).
- **`trace_replay_loss`** = DPO-style preference loss (or PRM-style score regression) over `(chosen, rejected)` pairs derived from N external teacher disagreements at each step.

The novel architectural claim is that **all three channels can run simultaneously** in a single trainer step, with the cost split as: (1) one extra forward pass per error site for SDPO, (2) N teacher API calls per replayed step for trace-replay. Spike 001 verified the API economics (✅ $0.98/trace, 5× headroom).
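To make the `sdpo_kl_loss` definition above concrete, here is a minimal, framework-free sketch of a generalized Jensen-Shannon divergence at one token position, averaged over masked positions. It is not OPSD's implementation: the weighting convention (`beta` on the teacher-side KL), the mean reduction, and the helper names `softmax`, `kl`, `token_jsd`, and `masked_jsd` are all assumptions for illustration, and it requires `0 < beta < 1`.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def token_jsd(student_logits, teacher_logits, beta=0.5, temperature=1.0):
    """beta * KL(T || M) + (1 - beta) * KL(S || M), with M = beta*T + (1-beta)*S."""
    s = softmax(student_logits, temperature)
    t = softmax(teacher_logits, temperature)
    m = [beta * ti + (1.0 - beta) * si for si, ti in zip(s, t)]
    return beta * kl(t, m) + (1.0 - beta) * kl(s, m)


def masked_jsd(student_rows, teacher_rows, mask, beta=0.5):
    """Mean token_jsd over positions where mask is truthy; 0.0 if none are active."""
    active = [
        token_jsd(s, t, beta)
        for s, t, keep in zip(student_rows, teacher_rows, mask)
        if keep
    ]
    return sum(active) / len(active) if active else 0.0
```

With `beta=0.5` the divergence is symmetric in student and teacher and bounded above by `ln 2`, which is one reason a fixed α weight is easy to reason about for this term.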

## Stack-by-stack integration matrix

| Component | TRL | VeRL | TorchForge | Monarch | OpenEnv |
|---|---|---|---|---|---|
| **Channel 1 (RLVR/GRPO)** | `GRPOTrainer._compute_loss(model, inputs)` — base class behavior, no change | `core_algos.compute_grpo_outcome_advantage` (registered via `@register_adv_est("grpo")`) | `forge.controller.GRPO` recipe (paused; pattern reference only) | Orchestrates rollout/trainer/rewarder ActorMeshes | Env exposes RLVR-shaped reward via `step()` |
| **Channel 2 (SDPO hint-distill)** | **Subclass override** of `_compute_loss`; lift `generalized_jsd_loss` from OPSD | **New advantage estimator** registered as `@register_adv_est("grpo_sdpo")`; reads `data.batch["sdpo_teacher_logprobs"]`; OR keep adv_estimator=grpo and add SDPO term in critic worker's compute_loss | Add a new ActorMesh `SDPOTeacherActor` that re-runs forward with hint-conditioned context; wire into trainer's loss | No-op at orchestration layer (just routes hint pairs) | Env emits "error site" markers in tool response so trainer knows where to insert hints |
| **Channel 3 (N-teacher trace-replay)** | **Subclass override** of `_compute_loss`; add DPO-pair term using teacher logprobs in `inputs["teacher_action_distributions"]` | **Custom adv_estimator**; teacher distributions stashed in `data.non_tensor_batch["teacher_actions"]`; precedent: distillation already attaches `teacher_log_probs` to rollout DataProto | Add a new `TeacherReplayActor` ActorMesh that holds OpenRouter client; called on a delayed-reward channel (RFC-004) | Routes teacher queries via `service.spawn(TeacherReplayActor, n=K)` for K parallel teacher pools | Env's `state()` API exposes step-level state needed for teacher replay |
| **Multi-turn rollout async** | ❌ **Blocking** — tool-call stalls GPU | ✅ `AsyncServer` + `AgentLoop` async; tool-call doesn't block GPU | ✅ Generator ActorMesh async via vLLM; tool-call waits don't block trainer | ✅ ActorMesh + supervision tree; native async | Env supports async via WebSocket multiplexed sessions |
| **Weight sync (vLLM ↔ FSDP)** | Co-located vLLM (no resharding) | ✅ **3D-HybridEngine** (resharding between FSDP↔TP) — most efficient | TorchStore RDMA weight broadcast | Monarch RDMA data plane | N/A (env-side) |
| **Scale ceiling** | ~32 GPUs / 70B FSDP | ✅ 671B+ proven, Megatron-LM | Reference patterns only (paused) | Thousands of GPUs (mesh) | 10K+ concurrent env sessions |

**Reading the matrix:** rows are the reward channels and infrastructure concerns; columns are framework choices, so each cell says what that row touches in that framework. The matrix shows the v0.1 framework choice is non-trivial:
- **TRL** = simplest extension story (one subclass override) but doesn't async-decouple tool calls and caps at ~70B.
- **VeRL** = most flexible at scale (custom `adv_estimator` + DataProto extension is well-trodden) and has an async agent loop, but it is Ray-heavy and has a steeper learning curve.
- **TorchForge + Monarch** = cleanest abstraction but Forge is "development paused" — use as reference, not foundation.
- **OpenEnv** = orthogonal substrate — works with all of the above; not a choice, a default.

## Architecture diagrams (mechanism-level, all three channels)

### 1. Composer SDPO hint-distill flow (single model, hint-conditioned self-teacher)

```
                                    ┌─────────────────────┐
                                    │  Hint Generator     │
                                    │  - templates v0.1   │
                                    │  - LLM-driven v0.2  │
                                    └──────────┬──────────┘
                                               │ generates hint text
                                               ▼ at error sites
       Trace, mid-rollout:                ┌────────────────┐
       …turn_4 (OK)                       │ Build paired   │
       turn_5 (ERROR: tool not found) ────│ contexts:      │
       …turn_6 (OK)                       │   ctx_student  │
                                          │   ctx_teacher  │
                                          │  (= ctx_student│
                                          │   + hint at    │
                                          │   turn_5)      │
                                          └───────┬────────┘

                                ┌─────────────────┴──────────────────┐
                                │                                    │
                                ▼                                    ▼
                      ┌──────────────────┐              ┌────────────────────┐
                      │ Student forward  │              │ Teacher forward    │
                      │ on ctx_student   │              │ (SAME MODEL on     │
                      │   → student_logits│             │  ctx_teacher)      │
                      │                  │              │   → teacher_logits │
                      └──────────┬───────┘              └────────┬───────────┘
                                 │                               │
                                 └──────────┬────────────────────┘
                                            │ feed both into

                          ┌─────────────────────────────────────────┐
                          │ generalized_jsd_loss(                  │
                          │   student_logits=…,                    │
                          │   teacher_logits=…,                    │
                          │   labels=… (mask non-error turns),     │
                          │   beta=0.5,    # JSD                   │
                          │   temperature=1.0,                     │
                          │   token_clip=…)                         │
                          │                                         │
                          │ → sdpo_kl_loss (a scalar)              │
                          └──────────────┬──────────────────────────┘


                              add to total_loss with α weight
```

**Key implementation note:** Per the DeepWiki audit, OPSD's `SelfDistillationDataCollator` builds two prompts per example:
- `ctx_student` = problem only (or problem + rollout up to error turn).
- `ctx_teacher` = problem + privileged info (in OPSD's case, the verified solution; in our case, the hint).

For Composer-style hint-distill, we adapt this: `ctx_teacher = ctx_student + injected_hint` at the specific turn boundary, with `labels` masked to keep loss only at the post-hint tokens of that turn.
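The paired-context construction described above can be sketched on plain token-id lists. This is an illustration under stated assumptions, not OPSD's collator: the function name `build_paired_context`, the per-turn list layout, and the choice to insert the hint immediately before the error turn's tokens are all hypothetical.

```python
def build_paired_context(turns, error_turn, hint_tokens):
    """Return (ctx_student, ctx_teacher, student_mask, teacher_mask).

    turns:       list of token-id lists, one per turn.
    error_turn:  index of the turn whose tokens should receive the loss.
    hint_tokens: token ids inserted just before that turn in the teacher context.
    """
    ctx_student, ctx_teacher = [], []
    student_mask, teacher_mask = [], []
    for i, turn in enumerate(turns):
        if i == error_turn:
            # The hint exists only in the teacher context and never gets loss.
            ctx_teacher.extend(hint_tokens)
            teacher_mask.extend([0] * len(hint_tokens))
        keep = 1 if i == error_turn else 0
        ctx_student.extend(turn)
        ctx_teacher.extend(turn)
        student_mask.extend([keep] * len(turn))
        teacher_mask.extend([keep] * len(turn))
    return ctx_student, ctx_teacher, student_mask, teacher_mask


turns = [[1, 2], [3, 4, 5], [6]]
s, t, s_mask, t_mask = build_paired_context(turns, error_turn=1, hint_tokens=[90, 91])
```

The two contexts differ in length, but both masks select the same number of positions. That alignment is what allows student and teacher logits to be compared token by token after gathering the masked positions from each forward pass.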

### 2. N-Teacher trace-replay flow (N external teachers, novel)

```
       Trace, frozen post-rollout:
       turn_1 (state_1, action_1_student, reward=…)
       turn_2 (state_2, action_2_student, reward=…)

       turn_50 (state_50, action_50_student, reward=…)

                                │ for each turn t in trace:

                        ┌───────────────────────────┐
                        │ teacher pool (frozen)     │
                        │  ┌──────────────────────┐ │
                        │  │ Opus 4.7 (anthro)    │ │
                        │  │ GPT-5 (openai)       │ │
                        │  │ DeepSeek V4 Pro      │ │
                        │  └──────────────────────┘ │
                        │  parallel API calls       │
                        └───────────┬───────────────┘
                                    │ teacher_t = [a_t^Opus, a_t^GPT, a_t^DS]

                        ┌───────────────────────────────────┐
                        │ disagreement scorer:              │
                        │  if 2+ teachers agree on X        │
                        │     and student picked Y ≠ X:     │
                        │       chosen=X, rejected=Y        │
                        │       (DPO pair)                  │
                        │  else if all 3 disagree:          │
                        │       skip (no signal)            │
                        │  else if all agree with student:  │
                        │       skip (no signal)            │
                        └──────────────┬────────────────────┘
                                       │ DPO pairs[]

                        ┌───────────────────────────────────┐
                        │ DPO loss term:                    │
                        │  L = -log σ(β·(logπ(chosen|s)     │
                        │           − logπ_ref(chosen|s)    │
                        │           − logπ(rejected|s)      │
                        │           + logπ_ref(rejected|s)))│
                        │                                   │
                        │ → trace_replay_loss (a scalar)    │
                        └──────────────┬────────────────────┘


                          add to total_loss with β weight
```

**Key implementation note:** unlike SDPO, this happens **post-rollout**, not during. The trace is frozen, teacher calls are batched, DPO pairs are extracted offline, and the loss is computed in a follow-up training step. This decouples teacher-API-call latency from the trainer's GPU loop entirely. Spike 001 verified ~20s p95 step latency for parallel 3-teacher calls — acceptable at offline-batch cadence.
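The disagreement scorer in the diagram reduces to a majority vote per step. The sketch below assumes actions are hashable values that can be compared by exact equality (real tool calls would likely need canonicalization first); the names `extract_dpo_pair` and `extract_pairs` and the `min_agree` parameter are hypothetical.

```python
from collections import Counter


def extract_dpo_pair(student_action, teacher_actions, min_agree=2):
    """Return {"chosen": X, "rejected": Y}, or None when the step carries no signal."""
    if not teacher_actions:
        return None
    action, votes = Counter(teacher_actions).most_common(1)[0]
    if votes < min_agree:
        return None  # teachers all disagree: skip
    if action == student_action:
        return None  # teacher consensus matches the student: skip
    return {"chosen": action, "rejected": student_action}


def extract_pairs(trace):
    """trace: iterable of (state, student_action, teacher_actions) tuples."""
    pairs = []
    for state, student_action, teacher_actions in trace:
        pair = extract_dpo_pair(student_action, teacher_actions)
        if pair is not None:
            pairs.append({"state": state, **pair})
    return pairs
```

With three teachers and `min_agree=2`, at most one action can reach the threshold, so the vote is unambiguous. A larger pool can produce ties (for example 2 against 2 with four teachers), and this sketch would then need an explicit tie rule.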

### 3. The combined trainer step (all three channels)

```
            ┌──────────────────────────────────────────────────────────┐
            │              ROLLOUT PHASE (per episode)                 │
            │  Generator (vLLM) → Env (OpenEnv) → trace JSONL          │
            │  → emits (state_t, action_t, reward_t, error_marker_t)   │
            └────────────────────────┬─────────────────────────────────┘

                  ┌──────────────────┼──────────────────────────┐
                  │                  │                          │
        ┌─────────▼─────────┐ ┌──────▼─────────┐    ┌───────────▼─────────┐
        │ RLVR scoring      │ │ Hint detection │    │ Teacher replay      │
        │ (test pass etc.)  │ │ at error_marker│    │ (post-rollout, async│
        │                   │ │   → hint_text  │    │  via OpenRouter API)│
        │ → reward_outcome  │ │ → ctx_teacher  │    │ → teacher_actions[] │
        └─────────┬─────────┘ └──────┬─────────┘    └───────────┬─────────┘
                  │                  │                          │
                  │                  │            ┌─────────────┘
                  │                  │            │ disagreement→DPO pairs
                  │                  │            │
                  └──────────────────┼────────────┘

            ┌──────────────────────────────────────────────────────────┐
            │              TRAINING PHASE (per gradient step)          │
            │                                                          │
            │  forward(student, ctx_rollout) → student_logits          │
            │  forward(student, ctx_teacher) → teacher_logits ← SDPO   │
            │                                                          │
            │  grpo_loss        = compute_grpo_loss(reward_outcome)    │
            │  sdpo_kl_loss     = generalized_jsd_loss(s_logits,       │
            │                       t_logits, labels=error_mask)        │
            │  trace_replay_loss= dpo_loss(student_logprobs,           │
            │                              ref_logprobs, dpo_pairs)    │
            │                                                          │
            │  total_loss = grpo_loss + α*sdpo_kl_loss + β*replay_loss │
            │                                                          │
            │  total_loss.backward()                                   │
            │  optimizer.step()                                        │
            └──────────────────────────────────────────────────────────┘
```

**Cost composition per training step (v0.0/v0.1 estimate):**

| Operation | Cost |
|---|---|
| Rollout forward (vLLM, async) | k tokens × inference TFLOPs |
| Teacher forward (training-mode FSDP, hint-conditioned) | ~1 extra FW pass per error site (sparse — maybe 5% of tokens) |
| RLVR reward eval | ~test execution overhead, env-bound, async |
| Teacher API replay (post-rollout, batched) | ~$0.02/step × parallel 3-teacher = ~$1/trace at 50 steps (verified by spike 001) |
| GRPO + SDPO + DPO loss compute | Negligible vs forward passes |
| Backward + optimizer step | Standard FSDP step |

The SDPO channel is **forward-pass-bound** (one extra FW per error site). The trace-replay channel is **API-call-bound** (offline, post-rollout, ~$0.30/trace with VOI gating in v0.1). They don't compete for the same resource.
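The replay-cost figures quoted above follow from simple arithmetic. The sketch below assumes that gating replays a fixed fraction of steps; the `replay_fraction=0.3` value is back-solved from the two quoted numbers (about $1 ungated, about $0.30 gated) and is not a measured result from spike 001. The function name is hypothetical.

```python
def replay_cost_per_trace(steps, cost_per_step, replay_fraction=1.0):
    """Teacher-API cost for one trace when only a fraction of steps is replayed."""
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")
    return steps * cost_per_step * replay_fraction


# Every step replayed: 50 steps at ~$0.02 per step (all three teachers combined).
full = replay_cost_per_trace(steps=50, cost_per_step=0.02)

# Assumed gating: only ~30% of steps are sent to the teacher pool.
gated = replay_cost_per_trace(steps=50, cost_per_step=0.02, replay_fraction=0.3)
```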

## Per-framework integration recipes

### Recipe A: TRL `GRPOTrainer` subclass (recommended for v0.0/v0.1)

**Why this is the right v0.1 choice:** simplest extension; OPSD code lifts cleanly; Qwen3-7B fits comfortably in TRL's scale ceiling; first-class OpenEnv integration via `environment_factory`.

```python
import torch
import torch.nn.functional as F
from trl import GRPOTrainer
from opsd_trainer import generalized_jsd_loss  # lifted from siyan-zhao/OPSD


class ComposerReplicationTrainer(GRPOTrainer):
    """v0.1 trainer: GRPO + SDPO hint-distill + N-teacher trace-replay-DPO."""

    def __init__(self, *args, alpha_sdpo=0.1, beta_replay=0.05, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha_sdpo = alpha_sdpo
        self.beta_replay = beta_replay

    def _compute_loss(self, model, inputs):
        # Channel 1: standard GRPO loss
        grpo_loss = super()._compute_loss(model, inputs)

        # Channel 2: SDPO hint-distill at error sites
        sdpo_kl = self._compute_sdpo_loss(model, inputs)

        # Channel 3: trace-replay DPO from teacher disagreement
        replay_dpo = self._compute_trace_replay_loss(model, inputs)

        # Compose
        total_loss = grpo_loss + self.alpha_sdpo * sdpo_kl + self.beta_replay * replay_dpo

        # Log all three components for ablation
        if self.state.global_step % self.args.logging_steps == 0:
            self.log({
                "loss/grpo": grpo_loss.detach().item(),
                "loss/sdpo_kl": sdpo_kl.detach().item(),
                "loss/trace_replay_dpo": replay_dpo.detach().item(),
                "loss/total": total_loss.detach().item(),
            })

        return total_loss

    def _compute_sdpo_loss(self, model, inputs):
        if "ctx_teacher_input_ids" not in inputs or inputs["ctx_teacher_input_ids"].numel() == 0:
            # No error sites in this batch — SDPO is a no-op.
            return torch.tensor(0.0, device=model.device)

        student_logits = model(input_ids=inputs["input_ids"]).logits
        with torch.no_grad():
            # Teacher = same model, hint-injected context. NO grad.
            teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits

        return generalized_jsd_loss(
            student_logits=student_logits,
            teacher_logits=teacher_logits,
            labels=inputs["sdpo_loss_mask"],  # only error-turn tokens
            beta=0.5,
            temperature=1.0,
            token_clip=10.0,
        )

    def _compute_trace_replay_loss(self, model, inputs):
        if "dpo_chosen_input_ids" not in inputs:
            return torch.tensor(0.0, device=model.device)

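        # Standard DPO loss using teacher-disagreement-derived pairs.
        # `_get_logprobs` is a helper (not shown here) assumed to return
        # per-sequence log-probs under the current policy.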
        chosen_logprobs = self._get_logprobs(model, inputs["dpo_chosen_input_ids"])
        rejected_logprobs = self._get_logprobs(model, inputs["dpo_rejected_input_ids"])
        ref_chosen_logprobs = inputs["dpo_chosen_ref_logprobs"]  # precomputed
        ref_rejected_logprobs = inputs["dpo_rejected_ref_logprobs"]

        beta_dpo = 0.1
        logits = beta_dpo * (chosen_logprobs - ref_chosen_logprobs
                             - rejected_logprobs + ref_rejected_logprobs)
        return -F.logsigmoid(logits).mean()
```

The data collator (a sibling to OPSD's `SelfDistillationDataCollator`) is responsible for assembling the extra fields:
- `ctx_teacher_input_ids` — the hint-augmented context, when error markers fire
- `sdpo_loss_mask` — which token positions are post-hint and should contribute to KL
- `dpo_chosen_input_ids` / `dpo_rejected_input_ids` — pairs from spike-003-style extraction
- `dpo_*_ref_logprobs` — precomputed under the reference (student-init) policy
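To show how the precomputed `dpo_*_ref_logprobs` enter the preference term, here is a scalar, framework-free sketch of the DPO loss for a single pair. The function name `dpo_pair_loss` is hypothetical, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math


def dpo_pair_loss(chosen_lp, rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * ((chosen - ref_chosen) - (rejected - ref_rejected)))."""
    margin = beta * ((chosen_lp - ref_chosen_lp) - (rejected_lp - ref_rejected_lp))
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: zero margin, so the loss is ln 2.
at_init = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has moved toward the chosen action and away from the rejected one.
improved = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)
```

Because the reference log-probs are constants, they can be computed once when the pairs are extracted and stored alongside them, which keeps the reference model out of the training loop.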

**OpenEnv plumbing** stays untouched — the `environment_factory=…` kwarg of `GRPOTrainer` already handles the SWE-bench-lite env.

### Recipe B: VeRL custom `adv_estimator` + DataProto extension (recommended for v0.2 scale)

**Why this is the right v0.2 choice:** VeRL has the only proven 70B+/671B RL story; HybridFlow's 3D-HybridEngine is the production reference for FSDP↔vLLM resharding; VeRL has precedent for exactly this pattern (`teacher_log_probs` already used for distillation per the DeepWiki audit).

```python
# verl_extensions/composer_adv.py
from verl.trainer.ppo import core_algos
from verl.trainer.ppo.core_algos import register_adv_est


@register_adv_est("grpo_composer")
def compute_grpo_composer_advantage(token_level_rewards, eos_mask, index, **kwargs):
    """GRPO advantage with SDPO + N-teacher trace-replay shaping.

    Reads from kwargs (passed via DataProto.batch / non_tensor_batch):
      - sdpo_teacher_logprobs: per-token logprobs from hint-conditioned forward
      - teacher_actions:       list of N teacher action distributions per step
      - alpha_sdpo, beta_replay: weights
    """
    # Standard GRPO advantage (same as built-in). The built-in estimator
    # returns an (advantages, returns) pair, so unpack before shaping.
    base_adv, returns = core_algos.compute_grpo_outcome_advantage(
        token_level_rewards, eos_mask, index
    )

    # SDPO shaping: at error-site tokens, add an extra advantage term
    # proportional to (teacher_logprob - student_logprob); this nudges
    # the policy gradient toward the hint-conditioned distribution.
    sdpo_teacher_lp = kwargs.get("sdpo_teacher_logprobs")
    if sdpo_teacher_lp is not None:
        student_lp = kwargs["old_log_prob"]
        sdpo_term = kwargs["alpha_sdpo"] * (sdpo_teacher_lp - student_lp)
        # Only apply at error-mask positions
        sdpo_term = sdpo_term * kwargs["sdpo_error_mask"]
        base_adv = base_adv + sdpo_term

    # Trace-replay shaping: per-step PRM signal from teacher consensus.
    # compute_teacher_consensus_prm is a project-side helper (not part of VeRL).
    teacher_actions = kwargs.get("teacher_actions")
    if teacher_actions is not None:
        prm_signal = compute_teacher_consensus_prm(teacher_actions, kwargs["student_actions"])
        base_adv = base_adv + kwargs["beta_replay"] * prm_signal

    # Hand back the same (advantages, returns) shape as the built-in estimator.
    return base_adv, returns
```

In the run config:

```yaml
# ppo_trainer.yaml
algorithm:
  adv_estimator: grpo_composer
  alpha_sdpo: 0.1
  beta_replay: 0.05
```

In the rollout worker, attach the extra fields to `DataProto`:

```python
# verl_extensions/composer_rollout.py
from verl import DataProto


def attach_composer_fields(data: DataProto, sdpo_teacher_lp, teacher_actions):
    data.batch["sdpo_teacher_logprobs"] = sdpo_teacher_lp
    data.batch["sdpo_error_mask"]       = build_error_mask(...)
    data.non_tensor_batch["teacher_actions"] = teacher_actions
    return data
```

This pattern is **identical to how VeRL already handles distillation rollouts** (per the DeepWiki audit: *"teacher log-probabilities are stashed on the rollout output and later concatenated into the per-batch DataProto for the student training step"*).
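The shaping arithmetic inside `compute_grpo_composer_advantage` can be checked in isolation on plain lists before wiring it into tensors. In this sketch the function name `shape_advantages` and the `token_to_step` index (which maps each response token to the trace step it belongs to) are hypothetical; the recipe above does not specify how a per-step signal is broadcast to tokens.

```python
def shape_advantages(base_adv, student_lp, teacher_lp, error_mask, alpha,
                     step_signal=None, token_to_step=None, beta=0.0):
    """Add the SDPO term at masked tokens and an optional per-step replay term."""
    shaped = []
    for i, adv in enumerate(base_adv):
        # SDPO term: only active where error_mask is 1.
        adv = adv + alpha * (teacher_lp[i] - student_lp[i]) * error_mask[i]
        # Replay term: broadcast the step-level signal to every token of that step.
        if step_signal is not None and token_to_step is not None:
            adv = adv + beta * step_signal[token_to_step[i]]
        shaped.append(adv)
    return shaped


sdpo_only = shape_advantages(
    base_adv=[0.5, 0.5, 0.5],
    student_lp=[-1.0, -2.0, -1.0],
    teacher_lp=[-1.0, -1.0, -1.0],
    error_mask=[0, 1, 1],
    alpha=0.1,
)
```

Only the second token changes here: it is inside the error mask and the hint-conditioned teacher assigns it a higher log-probability than the student did.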

### Recipe C: TorchForge + Monarch (reference patterns only, not a production target)

Forge is "development paused" per the upstream banner; lift patterns, don't depend on it. The relevant patterns are:

- **`SDPOTeacherActor` ActorMesh** — runs the hint-conditioned forward pass on a separate compute group, returns logits via TorchStore RDMA back to the trainer. Useful when SDPO forward is expensive enough to warrant offload.
- **`TeacherReplayActor` ActorMesh** — pool of K parallel actors, each holding an OpenRouter HTTP client. Trainer calls `service.spawn(TeacherReplayActor).query(state, n=3)` and gets back N teacher distributions.
- **Delayed-reward channel (OpenEnv RFC-004)** — for teacher replay where the signal arrives post-rollout, not at `step()`. Map to a separate reward stream that the trainer subscribes to.

If/when Monarch's K8s story matures and we move to v0.2 multi-cluster decentralized scale, lift these patterns into the VeRL stack rather than building on Forge directly.

### Recipe D: OpenEnv (substrate, not a choice)

OpenEnv is **orthogonal** — it works with TRL, VeRL, TorchForge, and any custom trainer. The contract:

- Env exposes `reset(...)`, `step(action)`, `state()`, `close()`.
- Env optionally exposes tools via MCP (RFC-003).
- Env optionally emits delayed rewards (RFC-004).
- Container deploys via Docker; trainer connects via WebSocket multiplexed sessions.

For our framework, the env contract needs **two lightweight extensions** (both backward-compatible):

1. **Error-site markers in tool responses.** When a tool call fails (404, type error, runtime exception), the env's `step()` response includes `meta["error_kind"]` and `meta["hint_template_key"]` — pre-defined keys the trainer's hint generator dispatches on. This lets the trainer decide *where* in the trace to insert hints without re-running the env.
2. **State-replay endpoint.** For trace-replay, the env supports `state(t)` returning the exact same observation the agent saw at step `t` — needed so external teachers see identical context. This is purely additive; existing OpenEnv envs without this can fall back to "feed teacher the conversation history" mode.
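On the trainer side, the error-site markers from extension 1 could be consumed by a small dispatch table. This is a sketch only: the template keys, the template wording, and the `hint_args` field are hypothetical and not part of the proposed env contract, which names only `meta["error_kind"]` and `meta["hint_template_key"]`.

```python
# Hypothetical template table keyed by meta["hint_template_key"].
HINT_TEMPLATES = {
    "tool_not_found": "The tool `{tool}` does not exist. Available tools: {available}.",
    "type_error": "The call to `{tool}` failed with a type error; re-check the argument types.",
}


class _Defaults(dict):
    """Mapping that fills missing template fields with a visible placeholder."""

    def __missing__(self, key):
        return "<unknown>"


def hint_for_step(meta):
    """Return hint text for a step's meta dict, or None if it is not an error site."""
    key = meta.get("hint_template_key")
    if meta.get("error_kind") is None or key not in HINT_TEMPLATES:
        return None
    return HINT_TEMPLATES[key].format_map(_Defaults(meta.get("hint_args", {})))
```

Returning `None` for unknown keys keeps the SDPO channel a no-op on steps the template set does not cover, instead of injecting a generic hint.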

We'll publish both extensions as proposed RFCs against `meta-pytorch/OpenEnv` once the v0.0 spike validates the full framework.

## Why all three channels can run simultaneously (the architectural argument)

These three channels do **not** compete for any shared resource:

| Resource | Channel 1 (RLVR) | Channel 2 (SDPO) | Channel 3 (replay) |
|---|---|---|---|
| GPU forward pass | rollout (vLLM, async) | extra FW per error (training, FSDP) | none — uses precomputed logprobs |
| GPU backward pass | yes | yes (added to total_loss) | yes (added to total_loss) |
| External API budget | none | none | $0.30–1/trace (verified, spike 001) |
| Latency-critical path | yes — gates next rollout | minor — extra FW <5% of tokens | no — async, post-rollout |
| Storage | rollout JSONL | extra ctx + mask in collator | DPO pairs JSONL (separate dataset repo) |

Furthermore, the **gradients are additive** by design: the two auxiliary terms carry their own weights (α for SDPO, β for replay), so either can be ablated by setting its weight to 0 while the GRPO term stays fixed. The v0.1 ablation matrix:

| Run | α (SDPO) | β (replay) | Tests |
|---|---|---|---|
| Baseline | 0 | 0 | pure GRPO+RLVR |
| +SDPO only | 0.1 | 0 | Composer recipe replication |
| +Replay only | 0 | 0.05 | the v0.0 novel claim, scaled to 32B |
| Full | 0.1 | 0.05 | combined channel test (v0.1 winner candidate) |

This 4-arm A/B at 32B is the v0.1 terminal experiment. Total cost ~$1200 (4 runs × 3 seeds × ~$100 each). Roadmap.
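The ablation matrix and its budget can be written down as data, which makes the run count easy to audit. The arm names, the seed values, and the flat $100-per-run figure in this sketch are placeholders taken from or chosen to match the table and estimate above.

```python
ABLATION_ARMS = [
    {"name": "baseline", "alpha_sdpo": 0.0, "beta_replay": 0.0},
    {"name": "sdpo_only", "alpha_sdpo": 0.1, "beta_replay": 0.0},
    {"name": "replay_only", "alpha_sdpo": 0.0, "beta_replay": 0.05},
    {"name": "full", "alpha_sdpo": 0.1, "beta_replay": 0.05},
]


def expand_runs(arms, seeds):
    """One run config per (arm, seed) combination."""
    return [{**arm, "seed": seed} for arm in arms for seed in seeds]


runs = expand_runs(ABLATION_ARMS, seeds=[0, 1, 2])  # seed values are placeholders
budget_usd = len(runs) * 100                        # ~$100 per run, per the estimate above
```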

## Open questions / followups (for v0.1 design phase, not v0.0)

1. **Hint generator architecture (open since the recipe-mapping doc).** Templates first; LLM-driven generator if templates plateau on style/communication errors.
2. **SDPO weight `α` schedule.** OPSD paper used constant; SDPO paper uses constant; Cursor never says. Likely warmup-from-0 then constant; ablate.
3. **DPO pair extraction threshold.** Spike 003 will determine: do we want only "2-of-3 teachers agree" pairs (high signal, fewer pairs), or also "1-of-3 differs from student" (more pairs, noisier)?
4. **Teacher pool composition.** Spike 001 used Opus 4.7 + GPT-5 + DeepSeek V4 Pro. Question for v0.1: should we add a fourth teacher (Qwen3-Max-MoE? Kimi K2.5?) as a same-family voice to balance Anthropic/OpenAI? Cost adds linearly.
5. **Reward hacking monitoring.** Cursor mentioned (without specifics) "agentic monitoring tools." Our v0.1 environment needs sandbox hardening: disable `find`, `unzip`, bytecode tools, and Python type-cache reads, so the model can't reverse-engineer deleted features the way Composer 2.5's model did.
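For question 2, the "warmup-from-0 then constant" option could look like the linear ramp below. This is one candidate to ablate, not a schedule taken from the OPSD or SDPO papers; `alpha_schedule` and `warmup_steps=100` are hypothetical.

```python
def alpha_schedule(step, alpha_max=0.1, warmup_steps=100):
    """Linear ramp from 0 to alpha_max over warmup_steps, then constant."""
    if warmup_steps <= 0:
        return alpha_max
    clipped = min(max(step, 0), warmup_steps)
    return alpha_max * clipped / warmup_steps
```

Setting `warmup_steps=0` recovers the constant-α baseline, so both variants can share one code path in the ablation.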

## Citations

Primary sources verified for this document:

- **TRL `GRPOTrainer._compute_loss`** — verified via DeepWiki query against `huggingface/trl` repo on 2026-05-25. `environment_factory` kwarg confirmed for OpenEnv plumbing.
- **VeRL `@register_adv_est` + `DataProto`** — verified via DeepWiki query against `volcengine/verl` repo on 2026-05-25. Distillation precedent (`teacher_log_probs` already attached to rollout DataProto) confirms the pattern.
- **OPSD `generalized_jsd_loss`** — verified via DeepWiki query against `siyan-zhao/OPSD` repo on 2026-05-25. Static method, self-contained, MIT licensed, FlashAttention-2 compatible. Function signature reproduced verbatim above.
- **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5), read directly via `tavily_extract` advanced mode. Footnote 1 cites the three self-distillation papers.
- **SDPO paper** — Hübotter et al., [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop.
- **OPSD paper** — Zhao et al., [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [github.com/siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD) (MIT).
- **Existing research notes**: `research/03-monarch-torchforge-openenv.md` (Monarch/Forge/OpenEnv) and `research/04-verl-trl.md` (VeRL/TRL) for framework-level context. Audit notes on those files apply: trust extension-point claims here over framework-level claims there when in conflict.

This document is the bridge between the **conceptual** 3-channel composition (in `COMPOSER_RECIPE_MAPPING.md`) and the **executable** trainer skeleton (in `spikes/005-integrated-trainer-skeleton/`). Anyone implementing v0.1 starts here, then opens the skeleton.