# Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping

> **Audit date:** 2026-05-25 (post-hoc, after the parallel research dispatch).
>
> **Methodology:** Read [Cursor's blog](https://cursor.com/blog/composer-2-5) directly (`mcp_tavily_tavily_extract` advanced mode), then audit `research/01-composer-2.5.md` and `framework/composer-replication-framework.md` against ground truth. Mark every claim as one of **`[BLOG-VERIFIED]`** (in the blog), **`[INFERRED]`** (reasonable extrapolation from blog + base-model knowledge), or **`[EXTRAPOLATED]`** (subagent added it, likely correct but not in the blog).

This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a *high* level but did not rigorously map each Composer stage onto the spike plan.

## Composer 2.5's published recipe (3 components, blog-verified)

The Cursor blog discusses **only three** training innovations explicitly. Everything else was extrapolated by the subagent. I list the three first, then flag the extrapolations.

### 1. **Targeted RL with Textual Feedback** `[BLOG-VERIFIED]`

> *"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's."* (Cursor blog)

**Mechanism, exactly:**

- **Same model** acts as both teacher and student. Not two separate models.
- The teacher is "the policy at this turn, *with* a hint inserted into the context."
- The student is "the policy at this turn, *without* the hint" (the original context).
- Loss = on-policy KL divergence: `KL( teacher_logits_at_turn_t || student_logits_at_turn_t )`, computed over the token distributions those logits define and applied **only at the problematic turn**, not over the full trajectory.
- Sits **on top of** an outer RLVR (verifiable-reward RL) objective; doesn't replace it.

**Cited prior art** (Cursor's footnote 1):

- **OPSD (Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs)** (Zhao et al., 2026, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), [GitHub: siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD)). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.
- **SDPO: Reinforcement Learning via Self-Distillation** (Hübotter et al., 2026, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: *"SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy."* This is **mathematically the same** as Composer's targeted-textual-feedback method. **There is published code.** Comparison table from the SDPO paper:

  | Method | Sampling | Signal | Feedback |
  |---|---|---|---|
  | SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
  | On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
  | RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
  | **SDPO (this paper / Composer)** | **on-policy** | **rich** | **environment** |

- **Self-Distillation Enables Continual Learning** ([arXiv:2601.19897](https://arxiv.org/abs/2601.19897)).

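
The mechanism bullets above can be made concrete with a small numeric sketch. It is a toy under stated assumptions, not Cursor's or the OPSD repository's code: the function names (`log_softmax`, `token_kl`, `hint_distill_loss`) and the `turn_mask` convention are hypothetical, the KL direction follows this document's `KL(teacher || student)` notation (the blog does not state a direction), and a real trainer would compute this on tensors with the teacher treated as a constant.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: Sequence[float], student_logits: Sequence[float]) -> float:
    """KL(teacher || student) between the two next-token distributions."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))


def hint_distill_loss(
    teacher_logits: Sequence[Sequence[float]],  # policy WITH the hint in context
    student_logits: Sequence[Sequence[float]],  # same policy, original context
    turn_mask: Sequence[int],                   # 1 = token belongs to the flagged turn
) -> float:
    """Mean per-token KL over the flagged turn only; other tokens contribute nothing."""
    kls = [
        token_kl(t, s)
        for t, s, m in zip(teacher_logits, student_logits, turn_mask)
        if m
    ]
    return sum(kls) / len(kls) if kls else 0.0


# Toy usage: 3 token positions, vocabulary of 2; only position 1 is in the flagged turn.
teacher = [[0.0, 0.0], [0.0, 0.0], [2.0, -1.0]]
student = [[5.0, 0.0], [math.log(3.0), 0.0], [2.0, -1.0]]
loss = hint_distill_loss(teacher, student, turn_mask=[0, 1, 0])
```

Position 0 disagrees strongly but is masked out, so it does not affect `loss`; that is the "only at the problematic turn" property in miniature.
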

**Key reproducibility gap (still unsolved):** *How are the hints generated?* The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. **This is the single most important open question for replication.**

### 2. **Synthetic data at 25× scale** `[BLOG-VERIFIED]`

> *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* (Cursor blog)

- **Feature Deletion** is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement it so the tests pass. Tests = verifiable reward.
- The blog explicitly mentions reward-hacking failures: the model decompiled Java bytecode and reverse-engineered Python type-checking caches to recover deleted APIs. This is a **real risk**, not theoretical.
- "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.

### 3. **Sharded Muon + dual mesh HSDP** `[BLOG-VERIFIED]`

> *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights."* (Cursor blog)

- Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
- Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
- Optimizer step time on 1T model: **0.2 s**.

This is **infrastructure, not algorithm**. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. It becomes relevant if we ever train a Kimi-K2.5-derivative directly.

## What the subagent added beyond the blog (`[EXTRAPOLATED]` and `[INFERRED]`)

`research/01-composer-2.5.md` introduced these claims that are **NOT in the Cursor blog**. Most are likely correct from secondary sources, but they are not blog-verified.


| Claim | Source basis | Verdict |
|---|---|---|
| "85% of total compute is post-training" | `[EXTRAPOLATED]`: likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | **Plausible but unverified.** Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
| Anyrun environment harness with LSP/file-I/O/terminal | `[EXTRAPOLATED]`: the name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | **Plausible.** The Composer 2.5 blog does say "asynchronous, sandboxed real-world coding environments", which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
| MLA + 1T/32B active + 384 experts + 256K ctx | `[INFERRED]` from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | **Verified independently** via [Moonshot's K2.5 model card](https://huggingface.co/moonshotai). Correct. |
| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | `[EXTRAPOLATED]`: blog doesn't quote benchmarks. | **Source unclear.** Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
| "PPO or GRPO variant" | `[EXTRAPOLATED]`: blog never names the RL algorithm. | **Educated guess.** The Composer 2 technical report likely names it; the 2.5 blog does not. The SDPO-style loss sits *on top of* an unspecified RLVR algorithm, so this is still open. |
| "Continued pretraining on heavily code-weighted data" | `[BLOG-VERIFIED]`: blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
| "Behavioral aspects: communication style, effort calibration" | `[BLOG-VERIFIED]`: blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say *how* they're trained. The targeted-textual-feedback method is presumably also used here. |

## Mapping each blog component → our replication framework

| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
|---|---|---|---|---|---|
| **(a)** Continued pretraining on code | Standard pretraining, code-weighted | Skip: start from already-code-tuned `Qwen3-Coder-7B` or `Qwen3-Coder-30B-A3B` | ✗ | ✗ | ✗ |
| **(b)** Synthetic data at scale | Feature Deletion plus other (unnamed) approaches; 25× more tasks than Composer 2 | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
| **(c)** Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL `GRPOTrainer` + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
| **(d)** Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | **Lift the OPSD/SDPO loss directly from `siyan-zhao/OPSD`** (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
| **(e)** Trace-replay multi-teacher distill (NOVEL, our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
| **(f)** Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases; irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
| **(g)** Reward-hacking safeguards | "Agentic monitoring tools" (unspecified) | Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find` / `strings` / `unzip` access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |

## Critical relationship: Composer hint-distill vs. trace-replay-distill

These are **two different mechanisms**, not competing implementations of the same idea. The initial framework synthesis blurred them; this section makes the distinction precise.

| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL, ours) |
|---|---|---|
| Number of models | **1** (same model is teacher + student) | **N+1** (frozen N teachers + 1 trainable student) |
| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
| Privileged information | Hint text in context | None: teachers see the same state the student sees |
| Source of hint / privileged info | **Open question.** Templates? LLM judge? | Not applicable |
| Relationship to RLVR | Adds dense per-turn signal *on top of* RLVR scalar reward | Same: adds dense per-step signal on top of RLVR |
| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
| Published code? | **Yes: `siyan-zhao/OPSD` (MIT)** | Not yet; we're building it |
| Novel in the framework? | No, this is Composer's published recipe | **Yes, the v0.0 research bet** |

**Both channels stack on the same RLVR base.** The full v0.1 trainer has THREE training-signal channels:

1. **RLVR** (verifiable scalar reward: tests pass / build succeeds). Ground truth, never skipped.
2. **Composer hint-distill** = SDPO loss (one extra forward pass per error site, hint-conditioned).
3. **Trace-replay-distill** = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).

In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.

## Why deferring Composer hint-distill to v0.1 is the right call

I considered adding hint-distill to v0.0 to do a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). I decided against it for v0.0 because:

1. **The novel claim is trace-replay.** The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
2. **The hint-generator open question is unresolved.** Without that, an SDPO arm is "SDPO with hardcoded tool-name templates", which is the easy case and doesn't validate the harder behavior cases (style, communication).
3. **Spike 001's economic verdict only gates the trace-replay channel.** SDPO has no per-step API cost; it's just an extra forward pass on the same GPU. Different cost model.
4. **A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm.** Not worth it for v0.0.

v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.

## Implementation handles for v0.1 (concrete starting points)

When we get to v0.1, the **Composer hint-distill channel** has a clear engineering path:

1. **Lift the OPSD/SDPO loss math from `siyan-zhao/OPSD`.** MIT licensed, and the same mechanism Cursor describes (the companion SDPO paper appeared at the ICLR 2026 Scaling Post-training Workshop). Their code targets Hugging Face transformers; it should slot into TRL's GRPO or PRIME-RL with ~50 LoC of glue.
2. **Hint generator v1: hardcoded templates.** Pattern-match on tool-call errors:
   - `"Tool not found: X"` → hint = `"Reminder: Available tools are: "`
   - `"JSONDecodeError: ..."` → hint = `"Reminder: tool arguments must be valid JSON"`
   - `"Type error in args"` → hint = `"Reminder: expects args matching schema: "`

   This handles the "tool call error" case from Cursor's blog example. Style/communication is harder; defer to v1.5 with an LLM-based hint generator.
3. **Apply only at error sites,** not every turn. Detect via:
   - Failed tool calls (status != ok)
   - Exception traces in tool output
   - Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
4. **Loss = `α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss`.** Ablate `(α, β, γ)`.

## Implementation handles for v0.2 (decentralized scale)

If v0.1 validates and we scale, here's what each Composer stage maps to in a multi-cluster setting:

- **Continued pretraining:** Pretrained checkpoint already exists (Qwen3-32B); skip.
- **Synthetic data:** Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
- **Realistic-env RL:** PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
- **Targeted hint-distill:** Compute is local to each trainer, so there is no decentralization complication.
- **Trace-replay-distill:** Teacher API calls are independent, so they are embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
- **Sharded Muon / dual-mesh HSDP:** Only if we adopt MoE base. For dense 32B, FSDP2 is fine.

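
Returning to the v0.1 handles, the template hint generator (step 2) and error-site detection (step 3) can be sketched as below. This is a minimal illustration, not a committed design: `ToolResult`, its field names, `is_error_site`, and `make_hint` are all hypothetical, and because the source templates do not say what follows the colon, completing them with the tool list or the tool's schema is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToolResult:
    """Hypothetical record of one tool call; field names are illustrative."""
    tool_name: str
    status: str   # "ok" on success, anything else counts as a failure
    output: str


def is_error_site(result: ToolResult) -> bool:
    """Flag a turn for hint-distill: failed call or exception trace in the output."""
    return result.status != "ok" or "Traceback (most recent call last)" in result.output


def make_hint(result: ToolResult, tool_schemas: Dict[str, str]) -> Optional[str]:
    """Return a template hint for a flagged turn, or None if no template applies."""
    if not is_error_site(result):
        return None
    text = result.output
    if re.search(r"Tool not found: \S+", text):
        # Assumption: the template is completed with the sorted tool names.
        return "Reminder: Available tools are: " + ", ".join(sorted(tool_schemas))
    if "JSONDecodeError" in text:
        return "Reminder: tool arguments must be valid JSON"
    if "Type error in args" in text and result.tool_name in tool_schemas:
        # Assumption: the template names the tool and quotes its schema.
        schema = tool_schemas[result.tool_name]
        return f"Reminder: {result.tool_name} expects args matching schema: {schema}"
    return None  # unmatched errors get no hint, so no distillation loss at this turn


# Toy usage with two made-up tools.
schemas = {"read_file": '{"path": "string"}', "grep": '{"pattern": "string"}'}
hint = make_hint(ToolResult("open_file", "error", "Tool not found: open_file"), schemas)
```

Returning `None` for unmatched errors keeps the SDPO term off turns where no template applies, which matches the "apply only at error sites" rule while leaving style and communication cases to a later LLM-based generator.
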

## Citations (updated)

Primary sources for each Composer-2.5 component, post-audit:

- **Cursor blog:** [Introducing Composer 2.5](https://cursor.com/blog/composer-2-5) (2026)
- **Cursor blog:** [Composer 2 technical report](https://cursor.com/blog/composer-2-technical-report) (predecessor; named the "Anyrun" environment per subagent; verify if needed)
- **OPSD paper:** Zhao et al., *Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs*, [arXiv:2601.18734](https://arxiv.org/abs/2601.18734), code at [siyan-zhao/OPSD](https://github.com/siyan-zhao/OPSD). MIT.
- **SDPO paper:** Hübotter et al., *Reinforcement Learning via Self-Distillation*, [arXiv:2601.20802](https://arxiv.org/abs/2601.20802), ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
- **Self-Distillation continual-learning:** [arXiv:2601.19897](https://arxiv.org/abs/2601.19897). Cited by Cursor; less directly relevant.
- **Moonshot Kimi K2.5:** base model, [Moonshot's Hugging Face page](https://huggingface.co/moonshotai).

The methodology mapping in this document supersedes vague claims in `research/01-composer-2.5.md` where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.