Composer 2.5 Recipe → Replication Framework: Stage-by-Stage Mapping
Audit date: 2026-05-25 (post-hoc, after the parallel research dispatch). Methodology: Read Cursor's blog directly (mcp_tavily_tavily_extract, advanced mode), then audit research/01-composer-2.5.md and framework/composer-replication-framework.md against ground truth. Mark every claim as either [BLOG-VERIFIED] (in the blog), [INFERRED] (reasonable extrapolation from blog + base-model knowledge), or [EXTRAPOLATED] (subagent added it, likely correct but not in the blog).
This document is the rigorous bridge between Cursor's published recipe and our replication framework. It exists because the initial parallel-research dispatch produced a synthesis that quoted Composer 2.5 at a high level but did not rigorously map each Composer stage onto the spike plan.
Composer 2.5's published recipe (5 components, blog-verified)
The Cursor blog discusses only three training innovations explicitly. Most of the remaining claims in research/01-composer-2.5.md were added by the subagent. I list the three first, then audit the rest.
1. Targeted RL with Textual Feedback [BLOG-VERIFIED]
"For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's." — Cursor blog
Mechanism, exactly:
- Same model acts as both teacher and student. Not two separate models.
- The teacher is "the policy at this turn, with a hint inserted into the context."
- The student is "the policy at this turn, without the hint" (the original context).
- Loss = on-policy KL divergence: KL( teacher_logits_at_turn_t || student_logits_at_turn_t ), applied only at the problematic turn, not over the full trajectory.
- Sits on top of an outer RLVR (verifiable-reward RL) objective; doesn't replace it.
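The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, not Cursor's implementation: the function names, the list-of-lists logit layout, and the 0/1 `turn_mask` are assumptions made for the example. It follows the KL direction as written in this document, KL(teacher || student), and a real trainer would presumably compute the teacher side without gradients and in batched tensor form.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))

def hint_distill_loss(teacher_logits, student_logits, turn_mask):
    """Mean per-token KL(teacher || student), restricted to the flagged turn.

    teacher_logits: per-token logits from the policy WITH the hint in context.
    student_logits: per-token logits from the same policy WITHOUT the hint.
    turn_mask:      1 for tokens of the problematic turn, 0 elsewhere (assumed layout).
    """
    total, count = 0.0, 0
    for t_logits, s_logits, in_turn in zip(teacher_logits, student_logits, turn_mask):
        if not in_turn:
            continue  # tokens outside the problematic turn contribute nothing
        total += kl_divergence(softmax(t_logits), softmax(s_logits))
        count += 1
    return total / count if count else 0.0
```

Because the mask zeroes out every token outside the flagged turn, the term adds a dense signal at one point of the trajectory and leaves the outer RLVR objective untouched everywhere else.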
Cited prior art (Cursor's footnote 1):
OPSD: Self-Distilled Reasoner — On-Policy Self-Distillation for LLMs (Zhao et al., 2026, arXiv:2601.18734, GitHub: siyan-zhao/OPSD). The original on-policy-self-distillation framework: single LLM, teacher conditioned on privileged information (e.g. ground-truth answer), student sees only the question, loss = per-token KL on student's own rollouts.
SDPO: Reinforcement Learning via Self-Distillation (Hübotter et al., 2026, arXiv:2601.20802, ICLR 2026 Scaling Post-training Workshop). Generalizes OPSD to RL with rich feedback: "SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy." This is mathematically the same as Composer's targeted-textual-feedback method. There is published code. Comparison table from the SDPO paper:

| Method | Sampling | Signal | Feedback |
|---|---|---|---|
| SFT / Distillation (Hinton 2015) | off-policy | rich | strong teacher |
| On-Policy Distillation (Agarwal 2024) | on-policy | rich | strong teacher |
| RLVR / GRPO (Lambert 2025) | on-policy | weak | environment |
| SDPO (this paper / Composer) | on-policy | rich | environment |

Self-Distillation Enables Continual Learning (arXiv:2601.19897).
Key reproducibility gap (still unsolved): How are the hints generated? The blog gives one example template ("Reminder: Available tools are…") but doesn't say whether hints come from hardcoded templates, a separate model (Opus?), the same model with an introspection prompt, or a learned hint generator. This is the single most important open question for replication.
2. Synthetic data at 25× scale [BLOG-VERIFIED]
"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2." — Cursor blog
- Feature Deletion is one named approach: take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests. Tests = verifiable reward.
- The blog explicitly mentions reward-hacking failures: model decompiled Java bytecode, reverse-engineered Python type-checking caches, to recover deleted APIs. This is a real risk, not theoretical.
- "Agentic monitoring tools" are mentioned as the mitigation, but no specifics.
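To make the Feature Deletion idea concrete, here is a minimal sketch of how one task could be built from a single Python module. Everything in it is an assumption for illustration: the blog does not describe Cursor's generator, the function names are hypothetical, and the in-process `exec` reward is not sandboxed, so it would be exposed to exactly the kind of reward hacking described above.

```python
import ast

def make_feature_deletion_task(source: str, function_name: str) -> dict:
    """Remove one top-level function from `source` and return the task.

    The stripped source is what the agent sees; the deleted source is kept
    only as a reference solution and must never be shown to the agent.
    """
    tree = ast.parse(source)
    kept, deleted = [], None
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            deleted = node
        else:
            kept.append(node)
    if deleted is None:
        raise ValueError(f"no top-level function named {function_name!r}")
    tree.body = kept
    return {
        "target": function_name,
        "stripped_source": ast.unparse(tree),    # requires Python 3.9+
        "deleted_source": ast.unparse(deleted),
    }

def verifiable_reward(candidate_source: str, test_source: str) -> float:
    """Return 1.0 if the tests run cleanly against the candidate, else 0.0.

    Illustration only: a real environment would run this in a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        return 0.0
    return 1.0
```

The stripped module fails its tests until the agent restores the missing function, which is what makes the test suite usable as a verifiable reward.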
3. Sharded Muon + dual mesh HSDP [BLOG-VERIFIED]
"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights." — Cursor blog
- Two HSDP layouts: narrow (intra-node) for non-expert weights, wide for expert weights.
- Blackwell-optimized. CP=2 + EP=8 on 8 GPUs (instead of 16 in shared mesh).
- Optimizer step time on 1T model: 0.2 s.
This is infrastructure, not algorithm. It only matters at MoE-1T scale; for our v0.0 (Qwen3-7B dense) and v0.1 (Qwen3-32B dense) it's irrelevant. Becomes relevant if we ever train a Kimi-K2.5-derivative directly.
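For reference, the "natural granularity" idea can be sketched without any distributed machinery. The example below orthogonalizes each attention head's block of an update independently. It is an illustration under stated assumptions: heads are assumed to be contiguous, equally sized row blocks; the classic cubic Newton-Schulz iteration is used for clarity, whereas public Muon implementations typically use a tuned higher-order polynomial and the blog gives no coefficients; and plain Python lists stand in for sharded tensors.

```python
import math

def matmul(a, b):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=30):
    """Push the singular values of g toward 1 with X <- 1.5 X - 0.5 (X X^T) X.

    Frobenius normalization keeps every singular value at or below 1,
    which is inside the convergence region of the cubic iteration.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) or 1.0
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

def orthogonalize_per_head(update, num_heads):
    """Run Newton-Schulz separately on each head's row block of the update."""
    rows_per_head = len(update) // num_heads
    out = []
    for h in range(num_heads):
        block = update[h * rows_per_head:(h + 1) * rows_per_head]
        out.extend(newton_schulz(block))
    return out
```

Each head's block ends up with approximately orthonormal rows on its own, rather than the full projection matrix being orthogonalized as one unit.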
What the subagent added beyond the blog ([EXTRAPOLATED] and [INFERRED])
research/01-composer-2.5.md introduced these claims that are NOT in the Cursor blog. Most are likely correct from secondary sources, but they are not blog-verified.
| Claim | Source basis | Verdict |
|---|---|---|
| "85% of total compute is post-training" | [EXTRAPOLATED]: likely from secondary commentary (Jake Handy substack, HN thread cited by subagent). | Plausible but unverified. Cursor doesn't publish the ratio. Treat as community consensus, not Cursor-stated. |
| Anyrun environment harness with LSP/file-I/O/terminal | [EXTRAPOLATED]: name "Anyrun" doesn't appear in the 2.5 blog (may be in the Composer-2 technical report). | Plausible: Cursor 2.5 does say "asynchronous, sandboxed real-world coding environments", which is consistent. But "Anyrun" as a brand name isn't sourced from the 2.5 post. |
| MLA + 1T/32B active + 384 experts + 256K ctx | [INFERRED] from Kimi K2.5 base-model knowledge. The blog only says "built on Kimi K2.5". | Verified independently via Moonshot's K2.5 model card. Correct. |
| CursorBench 69.3%, Terminal-Bench 2.0 parity, SWE-bench Multilingual | [EXTRAPOLATED]: blog doesn't quote benchmarks. | Source unclear. Probably from Cursor's launch comms / Twitter thread / a different blog post. Don't cite as 2.5-blog-verified. |
| "PPO or GRPO variant" | [EXTRAPOLATED]: blog never names the RL algorithm. | Educated guess. Composer 2 technical report likely says; the 2.5 blog does not. The cited SDPO paper sits on top of an unspecified RLVR algorithm, so this is still open. |
| "Continued pretraining on heavily code-weighted data" | [BLOG-VERIFIED]: blog says exactly this in the Sharded Muon section ("For continued pretraining…"). | Verified. |
| "Behavioral aspects: communication style, effort calibration" | [BLOG-VERIFIED]: blog mentions improving these and notes existing benchmarks don't capture them. | Verified, but blog doesn't say how they're trained. The targeted-textual-feedback method is presumably also used here. |
Mapping each blog component → our replication framework
| Composer 2.5 stage | Blog mechanism | Our replication target | v0.0 | v0.1 | v0.2 |
|---|---|---|---|---|---|
| (a) Continued pretraining on code | Standard pretraining, code-weighted | Skip: start from already-code-tuned Qwen3-Coder-7B or Qwen3-Coder-30B-A3B | ✗ | ✗ | ✗ |
| (b) Synthetic data at scale | Feature Deletion + other (unnamed) task generators; 25× more synthetic tasks than Composer 2 | Build 1 generator (Feature Deletion) as OpenEnv-compatible env. Use SWE-bench-lite and SWE-Gym as drop-in alternatives. | ✗ (use SWE-bench-lite only) | ✓ (build Feature Deletion) | scale generator suite |
| (c) Realistic-environment RL (RLVR) | Async sandboxes, same tool harness as production | TRL GRPOTrainer + verifiers + OpenEnv; SWE-bench-lite env in v0.0; build sandboxed code execution env in v0.1 | ✓ baseline | ✓ + DAPO patches | + decentralized rollouts |
| (d) Targeted RL w/ textual feedback (Composer's secret sauce) | Same-model self-distill: insert hint into context → teacher; original → student; on-policy KL at the turn | Lift the OPSD/SDPO loss directly from siyan-zhao/OPSD (published code, MIT). Generate hints via templates (v0.1) or LLM (v0.2). | ✗ (deferred) | ✓ (this is the Composer-recipe channel) | + learned hint generator |
| (e) Trace-replay multi-teacher distill (NOVEL, our addition) | N/A (not in Composer) | N=3 teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) replay each step; disagreement → DPO pairs | ✓ (this is the v0.0 novelty bet) | ✓ + VOI gating | + tiered teachers |
| (f) Sharded Muon / dual-mesh HSDP | MoE optimizer infra | Skip until we go to MoE bases; irrelevant for dense Qwen3-{7,32}B | ✗ | ✗ | ✗ (only if MoE base) |
| (g) Reward-hacking safeguards | "Agentic monitoring tools" (unspecified) | Static analysis + bytecode-cache-deletion + a sandboxed shell with no find / strings / unzip access in the env | ✗ (small surface) | ✓ (build the monitor) | + RM-based penalty |
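Row (g) names two safeguards that are easy to prototype: purging cached bytecode before a rollout, and refusing shell lines that invoke find, strings, or unzip. The sketch below is a starting point under assumptions, not a hardened monitor: the cache names and suffixes are guesses to be extended per toolchain, both function names are hypothetical, and the word-level filter deliberately over-blocks (it rejects any line that merely mentions a denied name). Real enforcement would presumably also remove the binaries from the sandbox image.

```python
import re
import shutil
from pathlib import Path

# Denylist taken from the tools named in row (g).
DENIED_COMMANDS = {"find", "strings", "unzip"}
# Assumed cache locations; extend per language and toolchain.
CACHE_DIR_NAMES = {"__pycache__", ".mypy_cache"}
CACHE_FILE_SUFFIXES = {".pyc", ".class"}

def is_command_allowed(command_line: str) -> bool:
    """Reject a shell line if any word names a denied binary."""
    # Split on whitespace and common shell operators so `ls|find .` is caught.
    words = re.split(r"[\s;|&()<>`$]+", command_line)
    for word in words:
        name = word.strip("'\"").rsplit("/", 1)[-1]  # drop quotes and directories
        if name in DENIED_COMMANDS:
            return False
    return True

def purge_build_caches(repo_root) -> list:
    """Delete cached bytecode and type-checker caches under repo_root."""
    removed = []
    # Reverse-sorted so files are visited before their parent directories.
    for path in sorted(Path(repo_root).rglob("*"), reverse=True):
        if path.is_file() and path.suffix in CACHE_FILE_SUFFIXES:
            path.unlink()
            removed.append(str(path))
        elif path.is_dir() and path.name in CACHE_DIR_NAMES:
            shutil.rmtree(path)
            removed.append(str(path))
    return removed
```

Running `purge_build_caches` once when the environment is reset removes the artifacts the blog reports the model exploiting; the command filter then acts as a cheap first line of defense in front of the sandbox.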
Critical relationship: Composer hint-distill vs. trace-replay-distill
These are two different mechanisms, not competing implementations of the same idea. Initial framework synthesis blurred them; this section makes the distinction precise.
| Property | Composer hint-distill (= SDPO/OPSD) | Trace-replay multi-teacher (NOVEL — ours) |
|---|---|---|
| Number of models | 1 (same model is teacher + student) | N+1 (frozen N teachers + 1 trainable student) |
| What "teacher" means | Student-with-hint-in-context | External pretrained models from other labs |
| Per-step cost | ~1 extra forward pass (cheap) | N teacher API calls (~$0.02/step at N=3 per spike 001) |
| Privileged information | Hint text in context | None — teachers see same state student sees |
| Source of hint / privileged info | Open question. Templates? LLM judge? | Not applicable |
| Relationship to RLVR | Adds dense per-turn signal on top of RLVR scalar reward | Same — adds dense per-step signal on top of RLVR |
| Bypasses long-horizon credit assignment? | Yes (per-turn KL) | Yes (per-step DPO/PRM) |
| Published code? | Yes: siyan-zhao/OPSD (MIT) | Not yet; we're building it |
| Novel in the framework? | No — this is Composer's published recipe | Yes — the v0.0 research bet |
Both channels stack on the same RLVR base. The full v0.1 trainer has THREE reward channels:
1. RLVR (verifiable scalar reward: tests pass / build succeeds). Ground truth, never skipped.
2. Composer hint-distill = SDPO loss (one extra forward pass per error site, hint-conditioned).
3. Trace-replay-distill = DPO/PRM from N external teachers (~$0.30/trace with VOI gating, our novelty bet).
In v0.0 we test channel 3 in isolation against channel 1 (the spike 004 A/B). We deliberately defer channel 2 to v0.1 to keep the v0.0 experiment small.
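As a rough illustration of the "disagreement → DPO pairs" step in the trace-replay channel, the sketch below turns one recorded trace into preference pairs. The pairing rule is an assumption made for this example, since this document does not define it: a pair is emitted only when at least `min_agreement` teachers propose the same action and that action differs from the student's. Exact string matching of actions and the `PreferencePair` type are likewise hypothetical simplifications.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PreferencePair:
    state: str      # context the student and the teachers both saw
    chosen: str     # action the teacher majority proposed
    rejected: str   # action the student actually took

def build_dpo_pairs(trace, teacher_actions, min_agreement=2):
    """Build DPO pairs from one student trace and the teachers' replays.

    trace:           list of (state, student_action) tuples, one per step.
    teacher_actions: list (one entry per step) of the N teachers' proposed actions.
    """
    pairs = []
    for (state, student_action), proposals in zip(trace, teacher_actions):
        majority_action, votes = Counter(proposals).most_common(1)[0]
        # Skip steps where the teachers split, or where they agree with the student.
        if votes >= min_agreement and majority_action != student_action:
            pairs.append(PreferencePair(state, majority_action, student_action))
    return pairs
```

Steps where the teachers split produce no pair under this rule, which keeps noisy supervision out of the DPO loss at the cost of discarding some signal.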
Why deferring Composer hint-distill to v0.1 is the right call
I considered adding hint-distill to v0.0 to do a 4-arm A/B (RLVR / RLVR+SDPO / RLVR+trace-replay / RLVR+SDPO+trace-replay). Decided against it for v0.0 because:
- The novel claim is trace-replay. The Composer recipe is already published; SDPO is already published with code. Validating SDPO at 7B is engineering, not novel research.
- The hint-generator open question is unresolved. Without that, an SDPO arm is "SDPO with hardcoded tool-name templates" which is the easy case and doesn't validate the harder behavior cases (style, communication).
- Spike 001's economic verdict only gates the trace-replay channel. SDPO has no per-step API cost — it's just an extra forward pass on the same GPU. Different cost model.
- A 4-arm A/B at 7B costs ~$600 vs. ~$300 for the 2-arm. Not worth it for v0.0.
v0.1 will have the full 4-arm (or at least 3-arm: RLVR / RLVR+SDPO / RLVR+SDPO+trace-replay) at 32B once we know v0.0's trace-replay verdict.
Implementation handles for v0.1 (concrete starting points)
When we get to v0.1, the Composer hint-distill channel has a clear engineering path:
- Lift the SDPO loss math from siyan-zhao/OPSD. MIT licensed, ICLR 2026 paper, exact same mechanism Cursor uses. Their code targets HuggingFace transformers; should slot into TRL's GRPO or PRIME-RL with ~50 LoC of glue.
- Hint generator v1: hardcoded templates. Pattern-match on tool-call errors:
  - "Tool not found: X" → hint = "Reminder: Available tools are: <list of valid tools>"
  - "JSONDecodeError: ..." → hint = "Reminder: tool arguments must be valid JSON"
  - "Type error in args" → hint = "Reminder: <tool-name> expects args matching schema: <schema>"

  This handles the "tool call error" case from Cursor's blog example. Style/communication is harder; defer to v1.5 with an LLM-based hint generator.
- Apply only at error sites, not every turn. Detect via:
  - Failed tool calls (status != ok)
  - Exception traces in tool output
  - Optional: a lightweight judge model flagging "this turn was wasteful" (matches Cursor's "communication style" use case)
- Loss = α * GRPO_loss + β * SDPO_KL_at_error_turns + γ * trace_replay_DPO_loss. Ablate (α, β, γ).
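The template hint generator, the error-site check, and the combined loss above are small enough to sketch together. This is a hedged starting point, not a finished component: the `turn` dictionary shape (`status`, `output`), the `ctx` keys, and the default weights are all placeholders invented for the example, and the three loss terms are passed in as precomputed scalars.

```python
import re

# (pattern, renderer) pairs mirroring the three templates listed above.
HINT_RULES = [
    (re.compile(r"Tool not found: \S+"),
     lambda ctx: "Reminder: Available tools are: " + ", ".join(ctx["valid_tools"])),
    (re.compile(r"JSONDecodeError"),
     lambda ctx: "Reminder: tool arguments must be valid JSON"),
    (re.compile(r"Type error in args"),
     lambda ctx: f"Reminder: {ctx['tool_name']} expects args matching schema: {ctx['schema']}"),
]

def is_error_site(turn: dict) -> bool:
    """True for failed tool calls or tool output that contains a Python traceback."""
    failed = turn.get("status") != "ok"
    has_trace = "Traceback (most recent call last)" in turn.get("output", "")
    return failed or has_trace

def make_hint(turn: dict, ctx: dict):
    """Return a template hint for an error turn, or None if no rule applies."""
    if not is_error_site(turn):
        return None
    for pattern, render in HINT_RULES:
        if pattern.search(turn.get("output", "")):
            return render(ctx)
    return None  # error site with no matching template: skip the SDPO term

def combined_loss(grpo_loss, sdpo_kl, trace_replay_dpo, alpha=1.0, beta=0.1, gamma=0.1):
    """Weighted sum of the three channels; the default weights are placeholders to ablate."""
    return alpha * grpo_loss + beta * sdpo_kl + gamma * trace_replay_dpo
```

Returning `None` for unmatched errors keeps the SDPO term restricted to cases the templates actually cover, which matches the "easy case" scope described for hint generator v1.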
Implementation handles for v0.2 (decentralized scale)
If v0.1 validates and we scale, here's what each Composer-stage maps to in a multi-cluster setting:
- Continued pretraining: Pretrained checkpoint already exists (Qwen3-32B); skip.
- Synthetic data: Generators run on CPU pool, producing OpenEnv tasks pushed to a shared queue. Embarrassingly parallel.
- Realistic-env RL: PRIME-RL's orchestrator/trainer/inference split, vLLM↔FSDP2 weight broadcast (SHARDCAST). v0.2 adds Streaming DiLoCo outer loop only when training spans clusters.
- Targeted hint-distill: Compute is local to each trainer — no decentralization complication.
- Trace-replay-distill: Teacher API calls are independent — embarrassingly parallel across rollout workers. VOI gating becomes more important to control cost at scale.
- Sharded Muon / dual-mesh HSDP: Only if we adopt MoE base. For dense 32B, FSDP2 is fine.
Citations (updated)
Primary sources for each Composer-2.5 component, post-audit:
- Cursor blog — Introducing Composer 2.5 (2026)
- Cursor blog — Composer 2 technical report (predecessor; named the "Anyrun" environment per subagent — verify if needed)
- OPSD paper — Zhao et al., Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs, arXiv:2601.18734, code at siyan-zhao/OPSD. MIT.
- SDPO paper — Hübotter et al., Reinforcement Learning via Self-Distillation, arXiv:2601.20802, ICLR 2026 Scaling Post-training Workshop. The direct formalization of Composer's hint-distill.
- Self-Distillation continual-learning — arXiv:2601.19897. Cited by Cursor; less directly relevant.
- Moonshot Kimi K2.5 — base model, HF model card.
The methodology mapping in this document supersedes vague claims in research/01-composer-2.5.md where the two conflict; that file is preserved unchanged for provenance (a snapshot of the parallel-research dispatch output) but should not be cited as ground truth on its own.