# Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping

> **Audit date:** 2026-05-25 (post-hoc, after the parallel research dispatch).
> **Methodology:** Read [Cursor's blog](https://cursor.com/blog/composer-2-5) directly (`mcp_tavily_tavily_extract` advanced mode), then audit `research/01-composer-2.5.md` and `framework/composer-replication-framework.md` against ground truth. Mark every claim as either **`[BLOG-VERIFIED]`** (in the blog), **`[INFERRED]`** (reasonable extrapolation from blog + base-model knowledge), or **`[EXTRAPOLATED]`** (subagent added it, likely correct but not in the blog).

This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a *high* level but did not rigorously map each Composer stage onto the spike plan.

## Composer 2.5's published recipe (3 components, blog-verified)

The Cursor blog discusses **only three** training innovations explicitly. Everything else was extrapolated by the subagent. I list the three first, then flag the extrapolations.

### 1. **Targeted RL with Textual Feedback** `[BLOG-VERIFIED]`

> *"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."* — Cursor blog

**Mechanism, exactly:**
- **Same model** acts as both teacher and student. Not two separate models.
- The teacher is "the policy at this turn, *with* a hint inserted into the context."
- The student is "the policy at this turn, *without* the hint" (the original context).
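- Loss = on-policy KL divergence between next-token distributions (not raw logits): `KL( teacher_probs_at_turn_t || student_probs_at_turn_t )`, evaluated on the student's own sampled tokens and applied **only at the problematic turn**, not over the full trajectory. The KL direction written here is our reading; the blog only says the loss moves the student toward the teacher.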
- Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
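
The mechanism above can be sketched in a few lines. This is a minimal, framework-free illustration and not Cursor's or the OPSD/SDPO authors' code: the function names, the list-of-logits data layout, and the forward-KL direction are all assumptions made for this sketch. In a real trainer the teacher logits would come from a second forward pass of the same policy with the hint in context, and would presumably be treated as a constant (no gradient).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at one token position."""
    lp = log_softmax(teacher_logits)  # hint-conditioned policy
    lq = log_softmax(student_logits)  # same policy, original context
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def hint_distill_loss(teacher_logits, student_logits, turn_mask):
    """Mean per-token KL over positions inside the flagged turn only.

    teacher_logits[i] / student_logits[i]: logits at sampled token i.
    turn_mask[i]: truthy if token i belongs to the problematic turn
    (hypothetical layout; all other positions contribute nothing).
    """
    kls = [
        token_kl(t, s)
        for t, s, m in zip(teacher_logits, student_logits, turn_mask)
        if m
    ]
    return sum(kls) / len(kls) if kls else 0.0
```

Masking to the flagged turn is what keeps this a local correction: tokens outside the turn are left to the outer RLVR objective.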

**Cited prior art** (Cursor's footnote 1):
- **OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs** (Zhao et al., 2026, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [GitHub: siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.
- **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., 2026, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* As far as the blog describes it, this is **the same mechanism** as Composer's targeted-textual-feedback method. **There is published code.** Comparison table from the SDPO paper:

  | Method | Sampling | Signal | Feedback |
  |---|---|---|---|
  | SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
  | On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
  | RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
  | **SDPO (this paper / Composer)** | **on-policy** | **rich** | **environment** |

- **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).

**Key reproducibility gap (still unsolved):** *How are the hints generated?* The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. **This is the single most important open question for replication.**

### 2. **Synthetic data at 25× scale** `[BLOG-VERIFIED]`

> *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* — Cursor blog

- **Feature Deletion** is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests. Tests = verifiable reward.
- The blog explicitly mentions reward-hacking failures: the model decompiled Java bytecode and reverse-engineered Python type-checking caches to recover deleted APIs. This is a **real risk**, not theoretical.
- "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.
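
A toy version of the Feature Deletion loop, to make the task and reward shape concrete. The blog does not describe Cursor's generator, so everything here is an assumption: the helper names are hypothetical, deletion is limited to one top-level function in a single file, and the "tests" are plain assertions run in-process with `exec`, which is not a sandbox and is only acceptable for trusted toy inputs.

```python
import ast

def delete_function(source: str, name: str) -> str:
    """Return `source` with the top-level function `name` removed."""
    tree = ast.parse(source)
    kept = [
        node for node in tree.body
        if not (isinstance(node, ast.FunctionDef) and node.name == name)
    ]
    if len(kept) == len(tree.body):
        raise ValueError(f"no top-level function named {name!r}")
    tree.body = kept
    return ast.unparse(tree)  # requires Python 3.9+

def reward(candidate_source: str, test_source: str) -> float:
    """Verifiable reward: 1.0 if the tests run cleanly against the candidate."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # NOT sandboxed; toy inputs only
        exec(test_source, namespace)
    except Exception:
        return 0.0
    return 1.0

REPO = "def add(a, b):\n    return a + b\n\ndef double(x):\n    return add(x, x)\n"
TESTS = "assert double(3) == 6\nassert add(1, 2) == 3\n"

task = delete_function(REPO, "add")  # what the agent starts from
```

A well-formed task should score 0.0 before the agent acts (`reward(task, TESTS)`) and 1.0 on the original source (`reward(REPO, TESTS)`); checking both is a cheap filter for tasks whose tests never exercised the deleted code.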

### 3. **Sharded Muon + dual mesh HSDP** `[BLOG-VERIFIED]`

> *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights."* — Cursor blog

- Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
- Blackwell-optimized. Context parallelism (CP=2) plus expert parallelism (EP=8) fit on 8 GPUs, instead of the 16 a shared mesh would need.
- Optimizer step time on 1T model: **0.2 s**.

This is **infrastructure, not algorithm**. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. Becomes relevant if we ever train a Kimi-K2.5-derivative directly.
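
For reference only, the "natural granularity" idea from the quote can be illustrated on one machine: each attention head's slice of the update is orthogonalized independently rather than orthogonalizing the whole stacked matrix. This sketch makes several assumptions. It uses the textbook cubic Newton-Schulz iteration with many steps for clarity, whereas Muon implementations commonly use a tuned higher-order iteration with only a few steps; it uses plain Python lists; and it shows none of the sharding or mesh layout, which is the actual subject of the blog section.

```python
import math

def matmul(a, b):
    """Dense matrix product for list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=30):
    """Cubic Newton-Schulz: X <- 1.5 X - 0.5 (X X^T) X.

    After Frobenius normalization every singular value lies in (0, 1],
    and the iteration pushes each of them toward 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row, row3)]
            for row, row3 in zip(x, xxt_x)
        ]
    return x

def orthogonalize_per_head(update, steps=30):
    """`update` is a list of per-head matrices (d_head x d_model, assumed
    full row rank); each head is processed on its own."""
    return [newton_schulz(head, steps) for head in update]
```

Because heads never interact inside the iteration, each one can live on a different device, which is presumably what makes the per-head and per-expert granularity attractive for distributed orthogonalization.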

## What the subagent added beyond the blog (`[EXTRAPOLATED]` and `[INFERRED]`)

`research/01-composer-2.5.md` introduced these claims that are **NOT in the Cursor blog**. Most are likely correct from secondary sources, but they are not blog-verified.

| Claim | Source basis | Verdict |
|---|---|---|
| "85% of total compute is post-training" | `[EXTRAPOLATED]` — likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | **Plausible but unverified.** Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
| Anyrun environment harness with LSP/file-I/O/terminal | `[EXTRAPOLATED]` — name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | **Plausible** — Cursor 2.5 does say "asynchronous, sandboxed real-world coding environments" which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
| MLA + 1T/32B active + 384 experts + 256K ctx | `[INFERRED]` from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | **Verified independently** via [Moonshot's K2.5 model card](https://huggingface.co/moonshotai). Correct. |
| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | `[EXTRAPOLATED]` — blog doesn't quote benchmarks. | **Source unclear.** Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
| "PPO or GRPO variant" | `[EXTRAPOLATED]` — blog never names the RL algorithm. | **Educated guess.** Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits *on top of* an unspecified RLVR algorithm, so this is still open. |
| "Continued pretraining on heavily code-weighted data" | `[BLOG-VERIFIED]` — blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
| "Behavioral aspects: communication style, effort calibration" | `[BLOG-VERIFIED]` — blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say *how* they're trained. The targeted-textual-feedback method is presumably also used here. |

## Mapping each blog component → our replication framework

| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
|---|---|---|---|---|---|
| **(a)** Continued pretraining on code | Standard pretraining, code-weighted | Skip — start from already-code-tuned `Qwen3-Coder-7B` or `Qwen3-Coder-30B-A3B` | ✗ | ✗ | ✗ |
| **(b)** Synthetic data at scale | Feature Deletion plus other, unnamed generators (the blog's 25× figure counts synthetic tasks, not generators) | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
| **(c)** Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL `GRPOTrainer` + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
| **(d)** Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | **Lift the OPSD/SDPO loss directly from `siyan-zhao/OPSD`** (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
| **(e)** Trace-replay multi-teacher distill (NOVEL — our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
| **(f)** Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases — irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
| **(g)** Reward-hacking safeguards | "Agentic monitoring tools" — unspecified | Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find` / `strings` / `unzip` access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
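
For row (g), a first-pass command filter could look like the sketch below. It is an assumption about how our own monitor might start, not a description of Cursor's unspecified "agentic monitoring tools". The denylist mirrors the three binaries named in the table; the function name is hypothetical. A string-level check like this is easy to bypass (for example via `sh -c`, `xargs`, or a renamed binary), so it would complement, not replace, removing the binaries from the sandbox image.

```python
import shlex

DENIED = {"find", "strings", "unzip"}  # from row (g); extend as needed
SEPARATOR_CHARS = set(";|&()")         # tokens that start a new command

def denied_commands(command_line: str) -> list:
    """Return denied executables that appear in command position."""
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True  # keep paths and flags as single tokens
    hits = []
    expect_command = True
    for token in lexer:
        if set(token) <= SEPARATOR_CHARS:
            # ';', '|', '&&', '||', '(' ... : the next word is a command
            expect_command = True
            continue
        if expect_command:
            name = token.rsplit("/", 1)[-1]  # '/usr/bin/strings' -> 'strings'
            if name in DENIED:
                hits.append(name)
            expect_command = False
    return hits
```

Only words in command position are checked, so `grep find notes.txt` is allowed while `ls && find .` is flagged.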

## Critical relationship: Composer hint-distill vs. trace-replay-distill

These are **two different mechanisms**, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.

| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL — ours) |
|---|---|---|
| Number of models | **1** (same model is teacher + student) | **N+1** (frozen N teachers + 1 trainable student) |
| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
| Privileged information | Hint text in context | None — teachers see same state student sees |
| Source of hint / privileged info | **Open question.** Templates? LLM judge? | Not applicable |
| Relationship to RLVR | Adds dense per-turn signal *on top of* RLVR scalar reward | Same — adds dense per-step signal on top of RLVR |
| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
| Published code? | **Yes — `siyan-zhao/OPSD` (MIT)** | Not yet — we're building it |
| Novel in the framework? | No — this is Composer's published recipe | **Yes — the v0.0 research bet** |

**Both channels stack on the same RLVR base.** The full v0.1 trainer has three training-signal channels (one scalar reward, two auxiliary losses):

1. **RLVR** (verifiable scalar reward — tests pass / build succeeds). Ground truth, never skipped.
2. **Composer hint-distill** = SDPO loss (one extra forward pass per error site, hint-conditioned).
3. **Trace-replay-distill** = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).

In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.
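
To make channel 3 concrete, here is one possible reading of "disagreement → DPO pairs". The rule used is an assumption for illustration: when at least `min_votes` teachers propose the same action and it differs from the student's, the teachers' action becomes `chosen` and the student's becomes `rejected`. Exact string matching of actions, the function name, and the threshold are all hypothetical; the `prompt` / `chosen` / `rejected` field names follow the common preference-dataset layout.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str    # the state both student and teachers saw
    chosen: str    # action the teachers agreed on
    rejected: str  # action the student actually took

def pairs_from_step(state, student_action, teacher_actions, min_votes=2):
    """Emit a preference pair when enough teachers agree against the student."""
    if not teacher_actions:
        return []
    action, count = Counter(teacher_actions).most_common(1)[0]
    if count >= min_votes and action != student_action:
        return [PreferencePair(state, action, student_action)]
    return []  # teachers split, or they agree with the student
```

Steps where the teachers split or side with the student yield no pair, so the DPO signal concentrates on steps with a clear external consensus.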

## Why deferring Composer hint-distill to v0.1 is the right call

I considered adding hint-distill to v0.0 to run a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). I decided against it for v0.0 because:

1. **The novel claim is trace-replay.** The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
2. **The hint-generator open question is unresolved.** Without that, an SDPO arm is "SDPO with hardcoded tool-name templates" which is the easy case and doesn't validate the harder behavior cases (style, communication).
3. **Spike 001's economic verdict only gates the trace-replay channel.** SDPO has no per-step API cost — it's just an extra forward pass on the same GPU. Different cost model.
4. **A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm.** Not worth it for v0.0.

v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.

## Implementation handles for v0.1 (concrete starting points)

When we get to v0.1, the **Composer hint-distill channel** has a clear engineering path:

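1. **Lift the self-distillation loss math from `siyan-zhao/OPSD`.** MIT licensed. This is the OPSD reference code; SDPO (the ICLR 2026 workshop paper) generalizes the same per-token KL to RL with feedback, which is the mechanism Cursor describes. The code targets HuggingFace transformers; our estimate is that it slots into TRL's GRPO or PRIME-RL with ~50 LoC of glue.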
2. **Hint generator v1: hardcoded templates.** Pattern-match on tool-call errors:
   - `"Tool not found: X"` → hint = `"Reminder: Available tools are: <list of valid tools>"`
   - `"JSONDecodeError: ..."` → hint = `"Reminder: tool arguments must be valid JSON"`
   - `"Type error in args"` → hint = `"Reminder: <tool-name> expects args matching schema: <schema>"`
   This handles the "tool call error" case from Cursor's blog example. Style/communication is harder; defer it to v0.2 with an LLM-based hint generator.
3. **Apply only at error sites,** not every turn. Detect via:
   - Failed tool calls (status != ok)
   - Exception traces in tool output
   - Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
4. **Loss = `α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss`.** Ablate `(α, β, γ)`.
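
Steps 2 and 4 can be sketched as follows. The three patterns and hint strings come from the list above; the function names, the `tool_schemas` mapping (tool name to schema text), and the default weights are placeholders for illustration, not tuned or sourced values.

```python
def make_hint(tool_output, tool_schemas, last_tool):
    """Template hint generator (v1): return a hint string, or None when the
    tool output does not look like a known error site.

    tool_schemas: dict of valid tool name -> schema text (hypothetical shape).
    last_tool: name the model tried to call on this turn.
    """
    if "Tool not found:" in tool_output:
        return "Reminder: Available tools are: " + ", ".join(sorted(tool_schemas))
    if "JSONDecodeError" in tool_output:
        return "Reminder: tool arguments must be valid JSON"
    if "Type error in args" in tool_output:
        schema = tool_schemas.get(last_tool, "<unknown schema>")
        return f"Reminder: {last_tool} expects args matching schema: {schema}"
    return None  # not an error site: no hint, so no KL term for this turn

def combined_loss(grpo_loss, sdpo_kl, trace_dpo_loss,
                  alpha=1.0, beta=0.1, gamma=0.1):
    """Weighted sum from step 4; (alpha, beta, gamma) are the ablation knobs."""
    return alpha * grpo_loss + beta * sdpo_kl + gamma * trace_dpo_loss

SCHEMAS = {"read_file": '{"path": "string"}', "run_tests": '{"target": "string"}'}
hint = make_hint("Tool not found: grep_files", SCHEMAS, "grep_files")
```

Returning `None` for non-error turns doubles as the error-site detector from step 3: the KL term is computed only where a hint exists. Setting `beta=0` or `gamma=0` recovers the individual arms of the planned A/B.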

## Implementation handles for v0.2 (decentralized scale)

If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:

- **Continued pretraining:** Pretrained checkpoint already exists (Qwen3-32B); skip.
- **Synthetic data:** Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
- **Realistic-env RL:** PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
- **Targeted hint-distill:** Compute is local to each trainer — no decentralization complication.
- **Trace-replay-distill:** Teacher API calls are independent — embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
- **Sharded Muon / dual-mesh HSDP:** Only if we adopt MoE base. For dense 32B, FSDP2 is fine.

## Citations (updated)

Primary sources for each Composer-2.5 component, post-audit:

- **Cursor blog** — [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026)
- **Cursor blog** — [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report) (predecessor; named the "Anyrun" environment per subagent — verify if needed)
- **OPSD paper** — Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs*, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD). MIT.
- **SDPO paper** — Hübotter et al., *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
- **Self-Distillation continual-learning** — [arXiv:2601.19897](https://arxiv.org/abs/2601.19897). Cited by Cursor; less directly relevant.
- **Moonshot Kimi K2.5** (base model): [Moonshot's Hugging Face page](https://huggingface.co/moonshotai). The previously linked card was for Kimi-K2-Thinking, a different model.

The methodology mapping in this document supersedes vague claims in `research/01-composer-2.5.md` where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.