
Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping

Audit date: 2026-05-25 (post-hoc, after the parallel research dispatch). Methodology: Read Cursor's blog directly (mcp_tavily_tavily_extract advanced mode), then audit research/01-composer-2.5.md and framework/composer-replication-framework.md against ground truth. Mark every claim as either [BLOG-VERIFIED] (in the blog), [INFERRED] (reasonable extrapolation from blog + base-model knowledge), or [EXTRAPOLATED] (subagent added it, likely correct but not in the blog).

This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a high level but did not rigorously map each Composer stage onto the spike plan.

Composer 2.5's published recipe (3 explicit training innovations, blog-verified)

The Cursor blog explicitly discusses only three training innovations. Almost everything else in the research note was added by the subagent. The three come first; the additions are flagged afterwards.

1. Targeted RL with Textual Feedback [BLOG-VERIFIED]

"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's." — Cursor blog

Mechanism, exactly:

  • Same model acts as both teacher and student. Not two separate models.
  • The teacher is "the policy at this turn, with a hint inserted into the context."
  • The student is "the policy at this turn, without the hint" (the original context).
  • Loss = on-policy KL divergence between the two next-token distributions, KL( teacher_at_turn_t || student_at_turn_t ), computed from the logits and applied only at the problematic turn, not over the full trajectory.
  • Sits on top of an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
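
The mechanism above can be sketched in a few lines. This is a minimal illustration, not Cursor's or the OPSD/SDPO authors' code: the function names `log_softmax` and `turn_kl` are hypothetical, the logits are plain Python lists rather than tensors, and the KL direction follows the bullet above (the blog quote itself does not state a direction). Treating the teacher side as a constant is also an assumption.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def turn_kl(teacher_logits, student_logits, turn_mask):
    """Mean KL(teacher || student) over token positions where turn_mask is 1.

    teacher_logits: per-token logits from the policy WITH the hint in context.
    student_logits: per-token logits from the same policy WITHOUT the hint.
    turn_mask:      1 for tokens inside the flagged turn, 0 elsewhere.
    Assumption: in training, the teacher side would carry no gradient.
    """
    total, count = 0.0, 0
    for t_row, s_row, keep in zip(teacher_logits, student_logits, turn_mask):
        if not keep:
            continue  # tokens outside the problematic turn contribute nothing
        t_logp = log_softmax(t_row)
        s_logp = log_softmax(s_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return total / count if count else 0.0

# Toy example: two token positions, vocabulary of three tokens.
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss_at_flagged_turn = turn_kl(teacher, student, turn_mask=[1, 0])
```

The mask is what makes the loss "targeted": only the first position is penalized here, even though both positions were scored.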

Cited prior art (Cursor's footnote 1):

  • OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs (Zhao et al., 2026, arXiv:2601.18734, GitHub: siyan-zhao/OPSD). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.

  • SDPO: Reinforcement Learning via Self-Distillation (Hübotter et al., 2026, arXiv:2601.20802, ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy." This is mathematically the same as Composer's targeted-textual-feedback method. There is published code. Comparison table from the SDPO paper:

    | Method | Sampling | Signal | Feedback |
    |---|---|---|---|
    | SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
    | On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
    | RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
    | SDPO (this paper / Composer) | on-policy | rich | environment |

  • Self-Distillation Enables Continual Learning (arXiv:2601.19897).

Key reproducibility gap (still unsolved): How are the hints generated? The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. This is the single most important open question for replication.

2. Synthetic data at 25× scale [BLOG-VERIFIED]

"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2." — Cursor blog

  • Feature Deletion is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests. Tests = verifiable reward.
  • The blog explicitly mentions reward-hacking failures: to recover deleted APIs, the model decompiled Java bytecode and reverse-engineered Python type-checking caches. This is a real risk, not a theoretical one.
  • "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.
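
A minimal sketch of the Feature Deletion idea for a single Python module, assuming the "feature" is one function and the tests are a callable. The names `delete_feature`, `verifiable_reward`, and `hidden_tests` are hypothetical, and the in-process `exec` is for illustration only; a real environment would run candidates in a sandbox.

```python
import ast

def delete_feature(source: str, func_name: str) -> str:
    """Return `source` with the body of every function named `func_name` stubbed out."""
    tree = ast.parse(source)
    stub = ast.parse("raise NotImplementedError").body
    found = False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == func_name:
            node.body = stub  # the deleted code is what the agent must reimplement
            found = True
    if not found:
        raise ValueError(f"no function named {func_name!r} in source")
    return ast.unparse(tree)

def verifiable_reward(candidate_source: str, tests) -> float:
    """1.0 if the candidate module passes the tests, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # sketch only: not a sandbox
        tests(namespace)
    except Exception:
        return 0.0
    return 1.0

# Toy repository: one function plus one passing test.
ORIGINAL = "def add(a, b):\n    return a + b\n"

def hidden_tests(ns):
    assert ns["add"](2, 3) == 5

task_source = delete_feature(ORIGINAL, "add")
reward_before = verifiable_reward(task_source, hidden_tests)  # stubbed task
reward_after = verifiable_reward(ORIGINAL, hidden_tests)      # a correct reimplementation
```

Note that this reward only checks test outcomes, which is exactly why the reward-hacking cases above matter: any route to passing tests scores the same.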

3. Sharded Muon + dual mesh HSDP [BLOG-VERIFIED]

"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights." — Cursor blog

  • Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
  • Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
  • Optimizer step time on 1T model: 0.2 s.
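
To make "Newton-Schulz at per-head granularity" concrete, here is a small pure-Python sketch. It is not Cursor's implementation: it uses the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X for readability (Muon implementations commonly use a tuned higher-order polynomial), it ignores sharding entirely, and the layout assumption that each head owns a contiguous block of `head_dim` rows is mine.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=15):
    """Approximate the orthogonal polar factor of a full-rank matrix g.

    Dividing by the Frobenius norm puts every singular value in (0, 1],
    inside the convergence region of the cubic iteration.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-12) for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

def orthogonalize_per_head(momentum_rows, head_dim):
    """Split a stacked update into per-head row blocks and orthogonalize each independently."""
    heads = [momentum_rows[i:i + head_dim] for i in range(0, len(momentum_rows), head_dim)]
    return [newton_schulz(h) for h in heads]

# Toy momentum update for two heads of dimension 2.
update = [[2.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 3.0]]
per_head = orthogonalize_per_head(update, head_dim=2)
```

Orthogonalizing each head separately means one head's large singular values cannot dominate the normalization of another, which is the point of working at the "natural granularity".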

This is infrastructure, not algorithm. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. Becomes relevant if we ever train a Kimi-K2.5-derivative directly.

What the subagent added beyond the blog ([EXTRAPOLATED] and [INFERRED])

research/01-composer-2.5.md introduced the claims below. Most are NOT in the Cursor blog; they are likely correct from secondary sources, but they are not blog-verified. The two rows that the audit did confirm against the blog are tagged [BLOG-VERIFIED].

| Claim | Source basis | Verdict |
|---|---|---|
| "85% of total compute is post-training" | [EXTRAPOLATED]: likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | Plausible but unverified. Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
| Anyrun environment harness with LSP/file-I/O/terminal | [EXTRAPOLATED]: name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | Plausible: Cursor 2.5 does say "asynchronous, sandboxed real-world coding environments", which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
| MLA + 1T/32B active + 384 experts + 256K ctx | [INFERRED] from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | Verified independently via Moonshot's K2.5 model card. Correct. |
| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | [EXTRAPOLATED]: blog doesn't quote benchmarks. | Source unclear. Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
| "PPO or GRPO variant" | [EXTRAPOLATED]: blog never names the RL algorithm. | Educated guess. Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits on top of an unspecified RLVR algorithm, so this is still open. |
| "Continued pretraining on heavily code-weighted data" | [BLOG-VERIFIED]: blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
| "Behavioral aspects: communication style, effort calibration" | [BLOG-VERIFIED]: blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say how they're trained. The targeted-textual-feedback method is presumably also used here. |

Mapping each blog component → our replication framework

| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
|---|---|---|---|---|---|
| (a) Continued pretraining on code | Standard pretraining, code-weighted | Skip: start from already-code-tuned Qwen3-Coder-7B or Qwen3-Coder-30B-A3B | | | |
| (b) Synthetic data at scale | Feature Deletion + other (unnamed) generators; 25× more tasks than Composer 2 | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
| (c) Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL GRPOTrainer + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
| (d) Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | Lift the OPSD/SDPO loss directly from siyan-zhao/OPSD (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
| (e) Trace-replay multi-teacher distill (NOVEL, our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
| (f) Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases; irrelevant for dense Qwen3-{7,32}B | ✗ (only if MoE base) | | |
| (g) Reward-hacking safeguards | "Agentic monitoring tools" (unspecified) | Static analysis + bytecode-cache-deletion + a sandboxed shell with no find / strings / unzip access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
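
Row (g) can be illustrated with a small sketch of two of the listed safeguards: a shell-command filter for the blocked binaries and a cache purge. This is a weak first layer under stated assumptions, not a sandbox: it is easy to bypass (for example via `sh -c`), it is deliberately conservative and may reject harmless commands, and the names `is_allowed`, `purge_caches`, and the `.mypy_cache` entry are hypothetical choices of mine.

```python
import shlex
import shutil
from pathlib import Path

BLOCKED_BINARIES = {"find", "strings", "unzip"}   # from the table row above
CACHE_DIR_NAMES = {"__pycache__", ".mypy_cache"}  # assumption: which caches to purge
SHELL_PUNCTUATION = set("();<>|&")

def is_allowed(command: str) -> bool:
    """Return False if any command position names a blocked binary."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    try:
        tokens = list(lexer)
    except ValueError:
        return False  # unparseable input (e.g. unclosed quote) is rejected
    expect_binary = True
    for tok in tokens:
        if set(tok) <= SHELL_PUNCTUATION:
            expect_binary = True  # next token starts a new command (or follows a redirect)
            continue
        if expect_binary:
            if tok.rsplit("/", 1)[-1] in BLOCKED_BINARIES:
                return False
            expect_binary = False
    return True

def purge_caches(root) -> int:
    """Delete cache directories under `root`; return how many were removed."""
    targets = [p for p in Path(root).rglob("*") if p.is_dir() and p.name in CACHE_DIR_NAMES]
    removed = 0
    for path in targets:
        if path.exists():  # may already be gone if nested inside another target
            shutil.rmtree(path, ignore_errors=True)
            removed += 1
    return removed
```

A filter like this only narrows the obvious routes; the monitor planned for v0.1 would still need to inspect what the agent actually did.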

Critical relationship: Composer hint-distill vs. trace-replay-distill

These are two different mechanisms, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.

| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL, ours) |
|---|---|---|
| Number of models | 1 (same model is teacher + student) | N+1 (frozen N teachers + 1 trainable student) |
| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
| Privileged information | Hint text in context | None: teachers see same state student sees |
| Source of hint / privileged info | Open question. Templates? LLM judge? | Not applicable |
| Relationship to RLVR | Adds dense per-turn signal on top of RLVR scalar reward | Same: adds dense per-step signal on top of RLVR |
| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
| Published code? | Yes: siyan-zhao/OPSD (MIT) | Not yet; we're building it |
| Novel in the framework? | No: this is Composer's published recipe | Yes: the v0.0 research bet |

Both channels stack on the same RLVR base. The full v0.1 trainer has THREE reward channels:

  1. RLVR (verifiable scalar reward — tests pass / build succeeds). Ground truth, never skipped.
  2. Composer hint-distill = SDPO loss (one extra forward pass per error site, hint-conditioned).
  3. Trace-replay-distill = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).

In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.
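
For channel 3, one possible way to turn teacher replays into preference pairs is sketched below. The document only says "disagreement → DPO pairs", so the rule here is an assumption: a pair is emitted when a majority of teachers agree on an action that differs from the student's. The function name `build_dpo_pairs`, the trace schema, and the `min_agreement` threshold are all hypothetical; the `prompt` / `chosen` / `rejected` keys mirror the preference format that TRL's DPOTrainer accepts.

```python
from collections import Counter

def build_dpo_pairs(trace, min_agreement=2):
    """Build DPO preference pairs from one replayed trace.

    trace: list of steps, each a dict with
      "state":           the context shown to student and teachers alike
      "student_action":  what the student did at this step
      "teacher_actions": what each of the N teachers would have done
    """
    pairs = []
    for step in trace:
        if not step["teacher_actions"]:
            continue
        top_action, top_count = Counter(step["teacher_actions"]).most_common(1)[0]
        if top_count < min_agreement:
            continue  # teachers split among themselves: no clear preferred action
        if top_action == step["student_action"]:
            continue  # student already matches the teacher majority
        pairs.append({
            "prompt": step["state"],
            "chosen": top_action,
            "rejected": step["student_action"],
        })
    return pairs

# Toy trace with N=3 teachers.
example_trace = [
    {"state": "s0", "student_action": "run_tests", "teacher_actions": ["run_tests", "run_tests", "read_file"]},
    {"state": "s1", "student_action": "edit_file", "teacher_actions": ["read_file", "read_file", "edit_file"]},
    {"state": "s2", "student_action": "edit_file", "teacher_actions": ["read_file", "run_tests", "grep"]},
]
example_pairs = build_dpo_pairs(example_trace)
```

Only the middle step yields a pair in this toy trace: the first step matches the majority, and the last has no teacher majority at all.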

Why deferring Composer hint-distill to v0.1 is the right call

I considered adding hint-distill to v0.0 to do a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). Decided against it for v0.0 because:

  1. The novel claim is trace-replay. The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
  2. The hint-generator open question is unresolved. Without that, an SDPO arm is "SDPO with hardcoded tool-name templates" which is the easy case and doesn't validate the harder behavior cases (style, communication).
  3. Spike 001's economic verdict only gates the trace-replay channel. SDPO has no per-step API cost — it's just an extra forward pass on the same GPU. Different cost model.
  4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.

v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.

Implementation handles for v0.1 (concrete starting points)

When we get to v0.1, the Composer hint-distill channel has a clear engineering path:

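  1. Lift the self-distillation loss math from siyan-zhao/OPSD (MIT licensed). That repo is the OPSD reference code; SDPO (ICLR 2026 Scaling Post-training Workshop) generalizes the same loss to feedback-conditioned teachers, which is the mechanism this document attributes to Cursor. The code targets HuggingFace transformers and should slot into TRL's GRPO or PRIME-RL with ~50 LoC of glue.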
  2. Hint generator v1: hardcoded templates. Pattern-match on tool-call errors:
    • "Tool not found: X" → hint = "Reminder: Available tools are: <list of valid tools>"
    • "JSONDecodeError: ..." → hint = "Reminder: tool arguments must be valid JSON"
    • "Type error in args" → hint = "Reminder: <tool-name> expects args matching schema: <schema>"

    This handles the "tool call error" case from Cursor's blog example. Style/communication is harder; defer to v1.5 with an LLM-based hint generator.
  3. Apply only at error sites, not every turn. Detect via:
    • Failed tool calls (status != ok)
    • Exception traces in tool output
    • Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
  4. Loss = α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss. Ablate (α, β, γ).
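
The four steps above can be sketched as follows. The hint texts come from the list; everything else is an assumption made for illustration: the regexes guess at error-message shapes, `"Traceback"` is a Python-specific stand-in for "exception traces", the `ctx` and `turn` dictionaries are hypothetical schemas, and the default weights are placeholders to be ablated, not recommended values.

```python
import re

# (pattern, renderer) pairs; the regexes are assumptions about error-message shapes.
HINT_RULES = [
    (re.compile(r"Tool not found: \S+"),
     lambda ctx: "Reminder: Available tools are: " + ", ".join(ctx["valid_tools"])),
    (re.compile(r"JSONDecodeError"),
     lambda ctx: "Reminder: tool arguments must be valid JSON"),
    (re.compile(r"Type error in args"),
     lambda ctx: f"Reminder: {ctx['tool_name']} expects args matching schema: {ctx['schema']}"),
]

def make_hint(tool_output, ctx):
    """Return the first matching template hint, or None if no rule fires."""
    for pattern, render in HINT_RULES:
        if pattern.search(tool_output):
            return render(ctx)
    return None

def is_error_site(turn):
    """Flag turns with a failed tool call or an exception trace in the output."""
    return turn.get("status") != "ok" or "Traceback" in turn.get("output", "")

def combined_loss(grpo_loss, sdpo_kl, trace_dpo_loss, alpha=1.0, beta=0.1, gamma=0.1):
    """alpha * GRPO + beta * SDPO KL (error turns only) + gamma * trace-replay DPO."""
    return alpha * grpo_loss + beta * sdpo_kl + gamma * trace_dpo_loss

# Toy usage: one failed tool call produces one hint.
turn = {"status": "error", "output": "Tool not found: grep_files"}
hint = make_hint(turn["output"], {"valid_tools": ["read_file", "run_tests"]}) if is_error_site(turn) else None
```

Keeping the three loss terms as separate scalars until the final sum makes the (α, β, γ) ablation a config change rather than a code change.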

Implementation handles for v0.2 (decentralized scale)

If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:

  • Continued pretraining: Pretrained checkpoint already exists (Qwen3-32B); skip.
  • Synthetic data: Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
  • Realistic-env RL: PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
  • Targeted hint-distill: Compute is local to each trainer — no decentralization complication.
  • Trace-replay-distill: Teacher API calls are independent — embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
  • Sharded Muon / dual-mesh HSDP: Only if we adopt MoE base. For dense 32B, FSDP2 is fine.

Citations (updated)

Primary sources for each Composer-2.5 component, post-audit:

  • Cursor blog: Introducing Composer 2.5 (2026)
  • Cursor blog: Composer 2 technical report (predecessor; named the "Anyrun" environment per subagent; verify if needed)
  • OPSD paper: Zhao et al., Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs, arXiv:2601.18734, code at siyan-zhao/OPSD. MIT.
  • SDPO paper: Hübotter et al., Reinforcement Learning via Self-Distillation, arXiv:2601.20802, ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
  • Self-Distillation continual-learning: arXiv:2601.19897. Cited by Cursor; less directly relevant.
  • Moonshot Kimi K2.5: base model, HF model card.

The methodology mapping in this document supersedes vague claims in research/01-composer-2.5.md where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.