Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping

Audit date: 2026-05-25 (post-hoc, after the parallel research dispatch). Methodology: Read Cursor's blog directly (mcp_tavily_tavily_extract advanced mode), then audit research/01-composer-2.5.md and framework/composer-replication-framework.md against ground truth. Mark every claim as one of [BLOG-VERIFIED] (in the blog), [INFERRED] (reasonable extrapolation from blog + base-model knowledge), or [EXTRAPOLATED] (subagent added it, likely correct but not in the blog).

This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a high level but did not rigorously map each Composer stage onto the spike plan.

Composer 2.5's published recipe (3 components, blog-verified)

The Cursor blog discusses only three training innovations explicitly. Most of the remaining claims in our notes were extrapolated or inferred by the subagent. I list the three first, then flag the extrapolations.

1. Targeted RL with Textual Feedback `[BLOG-VERIFIED]`

"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's." — Cursor blog

Mechanism, exactly:

- The same model acts as both teacher and student. There are not two separate models.
- The teacher is "the policy at this turn, with a hint inserted into the context."
- The student is "the policy at this turn, without the hint" (the original context).
- Loss = on-policy KL divergence: KL( teacher_at_turn_t || student_at_turn_t ), taken over the per-token probability distributions (the softmax of each side's logits) and applied only at the problematic turn, not over the full trajectory.
- It sits on top of an outer RLVR (verifiable-reward RL) objective; it doesn't replace it.
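
The mechanism above can be sketched in a few lines. This is a minimal illustration in plain Python, not Cursor's or the OPSD/SDPO authors' code: the function names, the list-of-logits layout, and the `turn_ids` mask are all assumptions made for the example. In a real trainer the same computation would run on tensors, and the teacher side would presumably be treated as a fixed target.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def hint_distill_loss(teacher_logits, student_logits, turn_ids, target_turn):
    """Mean per-token KL(teacher || student), restricted to a single turn.

    teacher_logits[i]: logits for token i from the policy WITH the hint in context.
    student_logits[i]: logits for token i from the same policy WITHOUT the hint.
    turn_ids[i]:       index of the turn that token i belongs to (hypothetical layout).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s, turn in zip(teacher_logits, student_logits, turn_ids)
        if turn == target_turn
    ]
    return sum(per_token) / len(per_token) if per_token else 0.0

# Toy rollout: three tokens, vocabulary of size 3. Only turn 1 is "problematic".
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
student = [[2.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
turn_ids = [0, 1, 1]
loss_at_error_turn = hint_distill_loss(teacher, student, turn_ids, target_turn=1)
```

Tokens outside the target turn contribute nothing, which is what "applied only at the problematic turn" means in practice.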

Cited prior art (Cursor's footnote 1):

OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs (Zhao et al., 2026, arXiv:2601.18734, GitHub: siyan-zhao/OPSD). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.

SDPO: Reinforcement Learning via Self-Distillation (Hübotter et al., 2026, arXiv:2601.20802, ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy." This is mathematically the same as Composer's targeted-textual-feedback method. There is published code. Comparison table from the SDPO paper:

| Method | Sampling | Signal | Feedback |
| --- | --- | --- | --- |
| SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
| On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
| RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
| SDPO (this paper / Composer) | on-policy | rich | environment |

Self-Distillation Enables Continual Learning (arXiv:2601.19897).

Key reproducibility gap (still unsolved): How are the hints generated? The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. This is the single most important open question for replication.

2. Synthetic data at 25× scale `[BLOG-VERIFIED]`

"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2." — Cursor blog

- Feature Deletion is one named approach: take a repo with passing tests, delete some code, and ask the agent to reimplement it so the tests pass again. Tests = verifiable reward.
- The blog explicitly mentions reward-hacking failures: the model decompiled Java bytecode and reverse-engineered Python type-checking caches to recover deleted APIs. This is a real risk, not a theoretical one.
- "Agentic monitoring tools" are mentioned as the mitigation, but with no specifics.

3. Sharded Muon + dual mesh HSDP `[BLOG-VERIFIED]`

"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights." — Cursor blog

- Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
- Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
- Optimizer step time on 1T model: 0.2 s.

This is infrastructure, not algorithm. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. Becomes relevant if we ever train a Kimi-K2.5-derivative directly.
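
For reference, the "natural granularity" idea in the quote can be illustrated on a toy matrix. The sketch below is an assumption-heavy stand-in: it uses the classic cubic Newton-Schulz iteration in pure Python, whereas public Muon implementations use a tuned higher-order polynomial and the blog does not give Cursor's coefficients. The helper names and the row-block layout for heads are hypothetical.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(m, steps=30):
    """Approximately orthogonalize a wide matrix (rows <= cols).

    Uses the cubic iteration X <- 1.5*X - 0.5*(X X^T) X after scaling by the
    Frobenius norm, so that every singular value starts at or below 1.
    """
    norm = sum(v * v for row in m for v in row) ** 0.5
    if norm == 0.0:
        return [row[:] for row in m]
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

def orthogonalize_per_head(update, head_dim):
    """Split a (n_heads * head_dim, d_model) update into per-head row blocks
    and orthogonalize each block independently."""
    out = []
    for start in range(0, len(update), head_dim):
        out.extend(newton_schulz(update[start:start + head_dim]))
    return out

# Two heads of head_dim=2 over d_model=3 (toy momentum update).
update = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [3.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
ortho = orthogonalize_per_head(update, head_dim=2)
```

After the call, each head's 2×3 block has approximately orthonormal rows, while rows from different heads are not constrained relative to each other. That is the difference from orthogonalizing the whole stacked matrix at once.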

What the subagent added beyond the blog (`[EXTRAPOLATED]` and `[INFERRED]`)

research/01-composer-2.5.md introduced these claims that are NOT in the Cursor blog. Most are likely correct from secondary sources, but they are not blog-verified.

| Claim | Source basis | Verdict |
| --- | --- | --- |
| "85% of total compute is post-training" | `[EXTRAPOLATED]`: likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | Plausible but unverified. Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
| Anyrun environment harness with LSP/file-I/O/terminal | `[EXTRAPOLATED]`: the name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | Plausible. The 2.5 blog does say "asynchronous, sandboxed real-world coding environments", which is consistent, but "Anyrun" as a brand name isn't sourced from the 2.5 post. |
| MLA + 1T/32B active + 384 experts + 256K ctx | `[INFERRED]` from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | Verified independently via Moonshot's K2.5 model card. Correct. |
| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | `[EXTRAPOLATED]`: blog doesn't quote benchmarks. | Source unclear. Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
| "PPO or GRPO variant" | `[EXTRAPOLATED]`: blog never names the RL algorithm. | Educated guess. Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits on top of an unspecified RLVR algorithm, so this is still open. |
| "Continued pretraining on heavily code-weighted data" | `[BLOG-VERIFIED]`: blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
| "Behavioral aspects: communication style, effort calibration" | `[BLOG-VERIFIED]`: blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say how they're trained. The targeted-textual-feedback method is presumably also used here. |

Mapping each blog component → our replication framework

| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
| --- | --- | --- | --- | --- | --- |
| (a) Continued pretraining on code | Standard pretraining, code-weighted | Skip: start from already-code-tuned `Qwen3-Coder-7B` or `Qwen3-Coder-30B-A3B` | ✗ | ✗ | ✗ |
| (b) Synthetic data at scale | Feature Deletion plus other generators the blog does not name (it reports 25× more tasks, not a generator count) | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
| (c) Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL `GRPOTrainer` + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
| (d) Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | Lift the OPSD/SDPO loss directly from `siyan-zhao/OPSD` (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
| (e) Trace-replay multi-teacher distill (NOVEL, our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
| (f) Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases; irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
| (g) Reward-hacking safeguards | "Agentic monitoring tools" (unspecified) | Static analysis + bytecode-cache-deletion + a sandboxed shell with no `find` / `strings` / `unzip` access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
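
Row (g) is the least specified, so here is a rough sketch of what the v0.1 monitor could start from. Everything in it is an assumption: the deny-list simply mirrors the three binaries named in the table, the cache directory names are guesses at what "bytecode-cache-deletion" would need to cover, and `blocked_commands` / `purge_caches` are hypothetical helpers. A check like this is a tripwire for logging and penalties, not a security boundary; the sandbox itself still has to withhold the binaries.

```python
import os
import shlex
import shutil

# Hypothetical deny-list, mirroring the binaries named in row (g).
BLOCKED_BINARIES = {"find", "strings", "unzip"}
# Hypothetical cache directories to purge before a rollout starts.
CACHE_DIR_NAMES = {"__pycache__", ".mypy_cache", ".pytest_cache"}
# Shell operators after which the next token is a new command name.
COMMAND_SEPARATORS = {"|", "||", "&&", ";", "&"}

def blocked_commands(command_line: str) -> list[str]:
    """Return blocked binaries invoked as a command in a shell command line.

    Naive by design: it does not see through `env`, `xargs`, or subshells.
    """
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    hits = []
    expect_binary = True
    for token in lexer:
        if token in COMMAND_SEPARATORS:
            expect_binary = True
            continue
        if expect_binary:
            name = token.rsplit("/", 1)[-1]  # strip any leading path
            if name in BLOCKED_BINARIES:
                hits.append(name)
            expect_binary = False
    return hits

def purge_caches(root: str) -> int:
    """Delete known cache directories under `root`; return how many were removed."""
    removed = 0
    for dirpath, dirnames, _ in os.walk(root):
        for name in list(dirnames):
            if name in CACHE_DIR_NAMES:
                shutil.rmtree(os.path.join(dirpath, name))
                dirnames.remove(name)  # do not descend into the deleted dir
                removed += 1
    return removed
```

For example, `blocked_commands("cat app.jar | strings")` flags `strings`, while `blocked_commands("grep find README.md")` flags nothing because `find` appears only as an argument.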

Critical relationship: Composer hint-distill vs. trace-replay-distill

These are two different mechanisms, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.

| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL, ours) |
| --- | --- | --- |
| Number of models | 1 (same model is teacher + student) | N+1 (frozen N teachers + 1 trainable student) |
| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
| Privileged information | Hint text in context | None: teachers see the same state the student sees |
| Source of hint / privileged info | Open question. Templates? LLM judge? | Not applicable |
| Relationship to RLVR | Adds dense per-turn signal on top of RLVR scalar reward | Same: adds dense per-step signal on top of RLVR |
| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
| Published code? | Yes: `siyan-zhao/OPSD` (MIT) | Not yet; we're building it |
| Novel in the framework? | No: this is Composer's published recipe | Yes: the v0.0 research bet |
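
The trace-replay column is still being built, so the following is only one possible reading of "disagreement → DPO pairs". It assumes a simple majority rule: if enough teachers agree on an action that differs from the student's, the consensus becomes `chosen` and the student's action becomes `rejected`. The `PreferencePair` type, the `min_agree` threshold, and the string-valued actions are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    state: str     # the step's context, as seen by student and teachers alike
    chosen: str    # teacher-consensus action
    rejected: str  # the student's original action

def pairs_from_replay(state, student_action, teacher_actions, min_agree=2):
    """Emit a DPO pair when at least `min_agree` teachers agree on an action
    that differs from the student's; otherwise emit nothing for this step."""
    if not teacher_actions:
        return []
    consensus, votes = Counter(teacher_actions).most_common(1)[0]
    if votes < min_agree or consensus == student_action:
        return []
    return [PreferencePair(state=state, chosen=consensus, rejected=student_action)]

# N=3 teachers replaying one step of a student trace.
pairs = pairs_from_replay(
    state="tests failing in utils.py",
    student_action="rm -rf build",
    teacher_actions=["pytest -x", "pytest -x", "ls"],
)
```

Steps where the teachers split three ways, or where the consensus matches the student, yield no pair, which keeps the DPO set limited to cases with a clear signal.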

Both channels stack on the same RLVR base. The full v0.1 trainer has THREE reward channels:

1. RLVR (verifiable scalar reward: tests pass / build succeeds). Ground truth, never skipped.
2. Composer hint-distill = SDPO loss (one extra forward pass per error site, hint-conditioned).
3. Trace-replay-distill = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).

In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.
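
One way to wire the three channels together is a weighted sum with a per-channel on/off switch. The sketch below assumes that shape; the `ChannelWeights` container, the weight names, and the 0.1 placeholder values are hypothetical and would need tuning.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ChannelWeights:
    alpha: float = 1.0  # channel 1: RLVR / GRPO policy loss
    beta: float = 0.0   # channel 2: hint-distill (SDPO) KL at error turns
    gamma: float = 0.0  # channel 3: trace-replay DPO loss

def combined_loss(grpo_loss, sdpo_kl, trace_replay_dpo, w):
    """Weighted sum of the three per-batch channel losses."""
    return w.alpha * grpo_loss + w.beta * sdpo_kl + w.gamma * trace_replay_dpo

def ablation_arms(beta_values=(0.0, 0.1), gamma_values=(0.0, 0.1)):
    """All on/off combinations of channels 2 and 3 on top of RLVR."""
    return [ChannelWeights(1.0, b, g) for b, g in product(beta_values, gamma_values)]

all_arms = ablation_arms()                       # RLVR, +trace-replay, +SDPO, +both
v00_arms = [w for w in all_arms if w.beta == 0]  # v0.0: channel 2 deferred
```

With channel 2 switched off, the arm list collapses to the two-arm comparison described above: RLVR alone versus RLVR plus trace-replay.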

Why deferring Composer hint-distill to v0.1 is the right call

I considered adding hint-distill to v0.0 to do a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). Decided against it for v0.0 because:

1. The novel claim is trace-replay. The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
2. The hint-generator open question is unresolved. Without that, an SDPO arm is "SDPO with hardcoded tool-name templates", which is the easy case and doesn't validate the harder behavior cases (style, communication).
3. Spike 001's economic verdict only gates the trace-replay channel. SDPO has no per-step API cost; it's just an extra forward pass on the same GPU. Different cost model.
4. A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.

v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.

Implementation handles for v0.1 (concrete starting points)

When we get to v0.1, the Composer hint-distill channel has a clear engineering path:

1. Lift the loss math from `siyan-zhao/OPSD` (MIT licensed). The SDPO paper (ICLR 2026 Scaling Post-training Workshop) formalizes the same mechanism the Cursor blog describes. The OPSD code targets HuggingFace transformers and should slot into TRL's GRPO or PRIME-RL with ~50 LoC of glue.
2. Hint generator v1: hardcoded templates. Pattern-match on tool-call errors:
   - "Tool not found: X" → hint = "Reminder: Available tools are: <list of valid tools>"
   - "JSONDecodeError: ..." → hint = "Reminder: tool arguments must be valid JSON"
   - "Type error in args" → hint = "Reminder: <tool-name> expects args matching schema: <schema>"

   This handles the "tool call error" case from Cursor's blog example. Style/communication is harder; defer to v1.5 with an LLM-based hint generator.
3. Apply only at error sites, not every turn. Detect via:
   - Failed tool calls (status != ok)
   - Exception traces in tool output
   - Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
4. Loss = α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss. Ablate (α, β, γ).
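
The template hint generator and the error-site detector are small enough to sketch together. The version below is a starting point under assumptions, not a spec: the `ctx` keys (`tools`, `tool_name`, `schema`), the per-turn dict layout (`status`, `output`), and the traceback marker are all hypothetical choices for illustration.

```python
import re

# (pattern, hint builder) pairs, following the three templates listed above.
HINT_RULES = [
    (re.compile(r"Tool not found: (?P<tool>\S+)"),
     lambda m, ctx: "Reminder: Available tools are: " + ", ".join(ctx["tools"])),
    (re.compile(r"JSONDecodeError"),
     lambda m, ctx: "Reminder: tool arguments must be valid JSON"),
    (re.compile(r"Type error in args"),
     lambda m, ctx: f"Reminder: {ctx['tool_name']} expects args matching schema: {ctx['schema']}"),
]

def make_hint(error_text, ctx):
    """Return the first matching template hint, or None if no rule applies."""
    for pattern, build in HINT_RULES:
        match = pattern.search(error_text)
        if match:
            return build(match, ctx)
    return None

def error_turns(turns):
    """Indices of turns that count as error sites: a non-ok tool status
    or a Python traceback in the tool output."""
    return [
        i for i, turn in enumerate(turns)
        if turn.get("status") != "ok"
        or "Traceback (most recent call last)" in turn.get("output", "")
    ]

# Toy trajectory: turn 1 calls a tool that does not exist.
turns = [
    {"status": "ok", "output": "3 files changed"},
    {"status": "error", "output": "Tool not found: grep_files"},
    {"status": "ok", "output": "Traceback (most recent call last):\n  ..."},
]
sites = error_turns(turns)
hint = make_hint(turns[1]["output"], {"tools": ["read_file", "run_shell"]})
```

Only the turns in `sites` would receive the hint-conditioned teacher pass; turns with no matching template return `None` and can be skipped or handed to a later LLM-based generator.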

Implementation handles for v0.2 (decentralized scale)

If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:

- Continued pretraining: Pretrained checkpoint already exists (Qwen3-32B); skip.
- Synthetic data: Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
- Realistic-env RL: PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
- Targeted hint-distill: Compute is local to each trainer, so there is no decentralization complication.
- Trace-replay-distill: Teacher API calls are independent, so they are embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
- Sharded Muon / dual-mesh HSDP: Only if we adopt MoE base. For dense 32B, FSDP2 is fine.

Citations (updated)

Primary sources for each Composer-2.5 component, post-audit:

- Cursor blog: Introducing Composer 2.5 (2026)
- Cursor blog: Composer 2 technical report (predecessor; named the "Anyrun" environment per subagent; verify if needed)
- OPSD paper: Zhao et al., Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs, arXiv:2601.18734, code at siyan-zhao/OPSD. MIT.
- SDPO paper: Hübotter et al., Reinforcement Learning via Self-Distillation, arXiv:2601.20802, ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
- Self-Distillation continual-learning: arXiv:2601.19897. Cited by Cursor; less directly relevant.
- Moonshot Kimi K2.5: base model, HF model card.

The methodology mapping in this document supersedes vague claims in research/01-composer-2.5.md where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.