# Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping

> **Audit date:** 2026-05-25 (post-hoc, after the parallel research dispatch).
> **Methodology:** Read [Cursor's blog](https://cursor.com/blog/composer-2-5) directly (`mcp_tavily_tavily_extract` advanced mode), then audit `research/01-composer-2.5.md` and `framework/composer-replication-framework.md` against ground truth. Mark every claim as either **`[BLOG-VERIFIED]`** (in the blog), **`[INFERRED]`** (reasonable extrapolation from blog + base-model knowledge), or **`[EXTRAPOLATED]`** (subagent added it, likely correct but not in the blog).

This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a *high* level but did not rigorously map each Composer stage onto the spike plan.

## Composer 2.5's published recipe (3 components, blog-verified)

The Cursor blog discusses **only three** training innovations explicitly. Most other claims in our notes were extrapolated or inferred by the subagent. I list the three first, then flag the additions.

### 1. **Targeted RL with Textual Feedback** `[BLOG-VERIFIED]`

> *"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."* — Cursor blog

**Mechanism, exactly:**
- **Same model** acts as both teacher and student. Not two separate models.
- The teacher is "the policy at this turn, *with* a hint inserted into the context."
- The student is "the policy at this turn, *without* the hint" (the original context).
- Loss = on-policy KL divergence between the two next-token distributions (softmax of the logits, not the raw logits): `KL( teacher_probs_at_turn_t || student_probs_at_turn_t )`, applied **only at the problematic turn**, not over the full trajectory.
- Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
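
A minimal sketch of the per-turn loss described in the bullets above, in plain Python so it runs without a deep-learning framework. The function names, the `turn_mask` argument, and the token-mean reduction are illustrative assumptions; the blog does not publish the reduction or the exact KL direction, so the `KL(teacher || student)` form here simply follows the bullet above. In a real trainer the teacher distribution would be treated as a constant (no gradient).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def hint_distill_loss(teacher_logits, student_logits, turn_mask):
    """Mean per-token KL(teacher || student) over the flagged turn only.

    teacher_logits: per-token logits from the policy WITH the hint in context.
    student_logits: per-token logits from the policy WITHOUT the hint.
    turn_mask:      1 for tokens inside the problematic turn, 0 elsewhere
                    (hypothetical representation of "only at the turn").
    """
    total, count = 0.0, 0
    for t_log, s_log, inside in zip(teacher_logits, student_logits, turn_mask):
        if not inside:
            continue  # tokens outside the flagged turn contribute nothing
        total += kl_divergence(softmax(t_log), softmax(s_log))
        count += 1
    return total / count if count else 0.0
```

Masking rather than slicing keeps the sketch compatible with batched trajectories where only some turns are flagged.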

**Cited prior art** (Cursor's footnote 1):
- **OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs** (Zhao et al., 2026, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [GitHub: siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.
- **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., 2026, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* This is **mathematically the same** as Composer's targeted-textual-feedback method. **There is published code.** Comparison table from the SDPO paper:

  | Method | Sampling | Signal | Feedback |
  |---|---|---|---|
  | SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
  | On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
  | RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
  | **SDPO (this paper / Composer)** | **on-policy** | **rich** | **environment** |

- **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).

**Key reproducibility gap (still unsolved):** *How are the hints generated?* The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. **This is the single most important open question for replication.**

### 2. **Synthetic data at 25× scale** `[BLOG-VERIFIED]`

> *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* — Cursor blog

- **Feature Deletion** is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests. Tests = verifiable reward.
- The blog explicitly mentions reward-hacking failures: the model decompiled Java bytecode and reverse-engineered Python type-checking caches to recover deleted APIs. This is a **real risk**, not a theoretical one.
- "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.

### 3. **Sharded Muon + dual mesh HSDP** `[BLOG-VERIFIED]`

> *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights."* — Cursor blog

- Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
- Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
- Optimizer step time on 1T model: **0.2 s**.
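
The "natural granularity" idea can be illustrated with a small pure-Python sketch. It uses the classic cubic Newton-Schulz iteration `X <- 1.5 X - 0.5 X Xᵀ X`, which is an assumption: the blog does not state which polynomial or how many steps Cursor runs, and production Muon implementations typically use a tuned higher-order variant on GPU tensors. The row-wise head split and all function names are likewise hypothetical.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=30):
    """Approximate the orthogonal polar factor of `g` (cubic Newton-Schulz)."""
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

def orthogonalize_per_head(update, num_heads):
    """Split the rows of `update` into equal per-head blocks and
    orthogonalize each block independently instead of the whole matrix."""
    rows = len(update) // num_heads
    out = []
    for h in range(num_heads):
        out.extend(newton_schulz(update[h * rows:(h + 1) * rows]))
    return out
```

Orthogonalizing each head (or each expert) separately means no block's update is rescaled by the spectrum of another block, and each block is a small independent problem that can be distributed.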

This is **infrastructure, not algorithm**. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. It becomes relevant only if we train a Kimi-K2.5 derivative directly.

## What the subagent added beyond the blog (`[EXTRAPOLATED]` and `[INFERRED]`)

`research/01-composer-2.5.md` introduced these claims that are **NOT in the Cursor blog**. Most are likely correct from secondary sources, but they are not blog-verified.

| Claim | Source basis | Verdict |
|---|---|---|
| "85% of total compute is post-training" | `[EXTRAPOLATED]` — likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | **Plausible but unverified.** Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
| Anyrun environment harness with LSP/file-I/O/terminal | `[EXTRAPOLATED]` — name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | **Plausible** — Cursor 2.5 does say "asynchronous, sandboxed real-world coding environments" which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
| MLA + 1T/32B active + 384 experts + 256K ctx | `[INFERRED]` from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | **Verified independently** via [Moonshot's K2.5 model card](https://huggingface.co/moonshotai). Correct. |
| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | `[EXTRAPOLATED]` — blog doesn't quote benchmarks. | **Source unclear.** Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
| "PPO or GRPO variant" | `[EXTRAPOLATED]` — blog never names the RL algorithm. | **Educated guess.** Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits *on top of* an unspecified RLVR algorithm, so this is still open. |
| "Continued pretraining on heavily code-weighted data" | `[BLOG-VERIFIED]` — blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
| "Behavioral aspects: communication style, effort calibration" | `[BLOG-VERIFIED]` — blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say *how* they're trained. The targeted-textual-feedback method is presumably also used here. |

## Mapping each blog component → our replication framework

| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
|---|---|---|---|---|---|
| **(a)** Continued pretraining on code | Standard pretraining, code-weighted | Skip — start from already-code-tuned `Qwen3-Coder-7B` or `Qwen3-Coder-30B-A3B` | ✗ | ✗ | ✗ |
| **(b)** Synthetic data at scale | Feature Deletion + other (unnamed) generators; 25× more synthetic tasks than Composer 2 | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
| **(c)** Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL `GRPOTrainer` + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
| **(d)** Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | **Lift the OPSD/SDPO loss directly from `siyan-zhao/OPSD`** (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
| **(e)** Trace-replay multi-teacher distill (NOVEL — our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
| **(f)** Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases — irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
| **(g)** Reward-hacking safeguards | "Agentic monitoring tools" — unspecified | Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find` / `strings` / `unzip` access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
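
As a rough sketch of the row (g) safeguards, the snippet below purges compiled and cached artifacts before an episode and screens shell commands against the deny-list named in the table. The cache directory names and file suffixes are assumptions chosen to match the failure modes the blog reports, and the helper names are hypothetical. A token deny-list is easy to bypass, so this would complement a sandbox that does not ship those binaries rather than replace it.

```python
import os
import shlex
import shutil

CACHE_DIRS = {"__pycache__", ".mypy_cache", ".pytest_cache"}  # assumed list
CACHE_SUFFIXES = (".pyc", ".class")                           # assumed list
DENIED_COMMANDS = {"find", "strings", "unzip"}                # from the table row

def purge_caches(root: str) -> int:
    """Delete cache dirs and compiled artifacts under `root`; return count removed."""
    removed = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for d in list(dirnames):
            if d in CACHE_DIRS:
                shutil.rmtree(os.path.join(dirpath, d))
                dirnames.remove(d)  # do not descend into the deleted directory
                removed += 1
        for f in filenames:
            if f.endswith(CACHE_SUFFIXES):
                os.remove(os.path.join(dirpath, f))
                removed += 1
    return removed

def command_allowed(command: str) -> bool:
    """Reject a shell command if any token names a denied binary."""
    try:
        lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
        lexer.whitespace_split = True
        tokens = list(lexer)
    except ValueError:
        return False  # unparseable input (e.g. unbalanced quotes) is refused
    return all(os.path.basename(tok) not in DENIED_COMMANDS for tok in tokens)
```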

## Critical relationship: Composer hint-distill vs. trace-replay-distill

These are **two different mechanisms**, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.

| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL — ours) |
|---|---|---|
| Number of models | **1** (same model is teacher + student) | **N+1** (frozen N teachers + 1 trainable student) |
| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
| Privileged information | Hint text in context | None — teachers see same state student sees |
| Source of hint / privileged info | **Open question.** Templates? LLM judge? | Not applicable |
| Relationship to RLVR | Adds dense per-turn signal *on top of* RLVR scalar reward | Same — adds dense per-step signal on top of RLVR |
| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
| Published code? | **Yes — `siyan-zhao/OPSD` (MIT)** | Not yet — we're building it |
| Novel in the framework? | No — this is Composer's published recipe | **Yes — the v0.0 research bet** |
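
One possible reading of "disagreement → DPO pairs" for the trace-replay channel is sketched below. This is an assumption, since the rule is not pinned down in this document: a pair is emitted when at least `min_agree` teachers propose the same action and that action differs from the student's. The record fields (`state`, `student_action`, `teacher_actions`) and the function name are hypothetical.

```python
from collections import Counter

def build_dpo_pairs(steps, min_agree=2):
    """Turn replayed steps into (prompt, chosen, rejected) preference records.

    steps: list of dicts with keys
      'state'           - the context the student saw at this step
      'student_action'  - what the student actually did
      'teacher_actions' - one proposed action per teacher (N entries)
    """
    pairs = []
    for step in steps:
        if not step["teacher_actions"]:
            continue
        action, votes = Counter(step["teacher_actions"]).most_common(1)[0]
        # Only trust the teachers when enough of them agree with each other,
        # and only emit a pair when they disagree with the student.
        if votes >= min_agree and action != step["student_action"]:
            pairs.append({
                "prompt": step["state"],
                "chosen": action,
                "rejected": step["student_action"],
            })
    return pairs
```

Steps where the teachers split three ways, or where the student already matches the majority, produce no pair, which keeps the preference data sparse and restricted to confident corrections.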

**Both channels stack on the same RLVR base.** The full v0.1 trainer has THREE reward channels:

1. **RLVR** (verifiable scalar reward — tests pass / build succeeds). Ground truth, never skipped.
2. **Composer hint-distill** = SDPO loss (one extra forward pass per error site, hint-conditioned).
3. **Trace-replay-distill** = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).
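
The three channels combine into one scalar objective. The sketch below uses the weighted-sum form given in the v0.1 implementation handles (`α`, `β`, `γ`) and expresses the A/B arms discussed in this document as weight settings. The numeric weights are placeholders for illustration, not tuned or measured values.

```python
def total_loss(grpo_loss, sdpo_kl, trace_replay_dpo, alpha=1.0, beta=0.0, gamma=0.0):
    """Weighted sum of the three channels.

    grpo_loss:        channel 1, the RLVR policy loss
    sdpo_kl:          channel 2, hint-distill KL at error turns
    trace_replay_dpo: channel 3, DPO loss from external teachers
    """
    return alpha * grpo_loss + beta * sdpo_kl + gamma * trace_replay_dpo

# Hypothetical ablation arms; 0.1 is a placeholder weight, not a recommendation.
ARMS = {
    "rlvr":                   {"alpha": 1.0, "beta": 0.0, "gamma": 0.0},
    "rlvr+sdpo":              {"alpha": 1.0, "beta": 0.1, "gamma": 0.0},
    "rlvr+trace_replay":      {"alpha": 1.0, "beta": 0.0, "gamma": 0.1},
    "rlvr+sdpo+trace_replay": {"alpha": 1.0, "beta": 0.1, "gamma": 0.1},
}
```

Setting a weight to zero disables that channel without changing the training loop, so the v0.0 two-arm comparison corresponds to the `rlvr` and `rlvr+trace_replay` entries.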

In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.

## Why deferring Composer hint-distill to v0.1 is the right call

I considered adding hint-distill to v0.0 to run a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). I decided against it for v0.0 because:

1. **The novel claim is trace-replay.** The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
2. **The hint-generator open question is unresolved.** Without that, an SDPO arm is "SDPO with hardcoded tool-name templates" which is the easy case and doesn't validate the harder behavior cases (style, communication).
3. **Spike 001's economic verdict only gates the trace-replay channel.** SDPO has no per-step API cost — it's just an extra forward pass on the same GPU. Different cost model.
4. **A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm.** Not worth it for v0.0.

v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.

## Implementation handles for v0.1 (concrete starting points)

When we get to v0.1, the **Composer hint-distill channel** has a clear engineering path:

1. **Lift the OPSD/SDPO loss math from `siyan-zhao/OPSD`.** MIT licensed; it implements the same self-distillation mechanism that SDPO (ICLR 2026 Scaling Post-training Workshop) formalizes and that Cursor describes. The code targets HuggingFace transformers and should slot into TRL's GRPO or PRIME-RL with ~50 LoC of glue.
2. **Hint generator v1: hardcoded templates.** Pattern-match on tool-call errors:
   - `"Tool not found: X"` → hint = `"Reminder: Available tools are: <list of valid tools>"`
   - `"JSONDecodeError: ..."` → hint = `"Reminder: tool arguments must be valid JSON"`
   - `"Type error in args"` → hint = `"Reminder: <tool-name> expects args matching schema: <schema>"`
   This handles the "tool call error" case from Cursor's blog example. Style/communication is harder — defer to v1.5 with an LLM-based hint generator.
3. **Apply only at error sites,** not every turn. Detect via:
   - Failed tool calls (status != ok)
   - Exception traces in tool output
   - Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
4. **Loss = `α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss`.** Ablate `(α, β, γ)`.
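
Steps 2 and 3 above can be sketched as a small template-based hint generator gated by an error-site check. The turn record layout (`tool`, `status`, `output`), the `tools` mapping of tool name to argument schema, and the function names are all hypothetical; the three templates are the ones listed in step 2.

```python
import json
import re

def is_error_site(turn: dict) -> bool:
    """Flag a turn if the tool call failed or its output contains a traceback."""
    failed = turn.get("status") != "ok"
    has_trace = "Traceback (most recent call last)" in turn.get("output", "")
    return failed or has_trace

def make_hint(turn: dict, tools: dict):
    """Return a hint string for an error turn, or None if no template applies.

    tools: mapping of valid tool name -> argument schema (JSON-serializable).
    """
    if not is_error_site(turn):
        return None  # hints are applied only at error sites, not every turn
    output = turn.get("output", "")
    if re.search(r"Tool not found", output):
        return "Reminder: Available tools are: " + ", ".join(sorted(tools))
    if "JSONDecodeError" in output:
        return "Reminder: tool arguments must be valid JSON"
    if "Type error in args" in output and turn.get("tool") in tools:
        schema = json.dumps(tools[turn["tool"]])
        return f"Reminder: {turn['tool']} expects args matching schema: {schema}"
    return None  # unknown failure: skip rather than invent a hint
```

Returning `None` for unmatched failures keeps the hint-distill term off turns where no template is known to describe the desired improvement; those turns still receive the RLVR signal.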

## Implementation handles for v0.2 (decentralized scale)

If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:

- **Continued pretraining:** Pretrained checkpoint already exists (Qwen3-32B); skip.
- **Synthetic data:** Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
- **Realistic-env RL:** PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
- **Targeted hint-distill:** Compute is local to each trainer — no decentralization complication.
- **Trace-replay-distill:** Teacher API calls are independent — embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
- **Sharded Muon / dual-mesh HSDP:** Only if we adopt MoE base. For dense 32B, FSDP2 is fine.

## Citations (updated)

Primary sources for each Composer-2.5 component, post-audit:

- **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026)
- **Cursor blog** — [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report) (predecessor; named the "Anyrun" environment per subagent — verify if needed)
- **OPSD paper** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs*, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD). MIT.
- **SDPO paper** — Hübotter et al., *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
- **Self-Distillation continual-learning** — [arXiv:2601.19897](https://arxiv.org/abs/2601.19897). Cited by Cursor; less directly relevant.
- **Moonshot Kimi K2.5** — base model, [HF model card](https://huggingface.co/moonshotai/Kimi-K2-Thinking).

The methodology mapping in this document supersedes vague claims in `research/01-composer-2.5.md` where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.