Composer 2.5 Replication Framework

Repo type: model (methodology).
Status: Research synthesis + v0.1 framework with verified gap-closer spikes (2026-05-26).
Author: Codeseys
Goal: Replicate Cursor's Composer 2.5 (a post-trained Kimi K2.5 specialised for agentic coding) on any HuggingFace base model, using a synthesis of decentralized RL post-training techniques.

This repository is the "paper of the project" — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a novel multi-teacher trace-replay distillation channel that stacks on top of the Composer recipe.

Install

pip install -e .
python examples/qwen_05b_quickstart/run.py

The quickstart loads Qwen2.5-0.5B-Instruct and runs 5 backward steps through the 3-channel loss on CPU in ~3-5 minutes. See examples/qwen_05b_quickstart/README.md for what the output should look like.

v0.1 spike progress (2026-05-26):

🟢 Spike 001 (kill-switch teacher cost) — VALIDATED: 150 real OpenRouter calls, $0.98/trace, p95 latency 20.5s. The novel research direction is economically viable.
🟢 Spike 005 (integrated 3-channel trainer skeleton) — SKELETON-VALIDATED: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified.
🟢 Spike 006 (real HF model smoke) — PASSED + STRICT-VERIFIED: 9 base tests + 3 strict tests (test_strict.py) close the cross-model-review's tautology critique: alternating-batch loss decrease, SDPO channel actually fires (sdpo_jsd > 0), SDPO off-vs-on totals differ on real Qwen2.5-0.5B. The original "is the loss decrease just memorization?" objection is no longer open.
🟢 Spike 002a-mini-gpu-smoke (real GPU evidence) — PASSED on local 5090: Qwen2.5-0.5B in bf16, 50 steps, loss 0.7354 → 0.00034 (99.95%), peak VRAM 5.31 GB, median 480 ms/step. First GPU evidence of any kind in the framework. ADR-001's local-5090 choice now empirically verified.
🟢 Spike 007 (real trace ingestion) — PASSED + E2E-VERIFIED: 15 unit tests + 2 e2e tests (test_e2e_with_loss.py) pipe ingested TraceState records all the way through compose_loss + backward on a real Qwen model. Closes V5 in spirit (cross-model review item #9).
⚠️ Spike 008 (DiLoCo outer-loop smoke) — PARTIAL: make_diloco_outer_loop() wraps torchft.local_sgd.DiLoCo. 5/5 single-process tests pass including a pseudo-gradient sign-convention pin. But the BACKLOG required a 2-replica convergence smoke; what shipped is 1-replica machinery + passthrough no-op allreduce. True multi-process DiLoCo is GPU-gated and not yet attempted.
🟢 Wave 10 (packaging) — DONE: pip install -e . works; composer_replication package re-exports the verified APIs from the spike directories. compose_loss and build_batch are explicitly verification-harness public APIs (production loss is ComposerReplicationTrainer._compute_loss).
📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
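The pseudo-gradient sign convention pinned in Spike 008 can be illustrated with a minimal, framework-free sketch. It does not use `torchft.local_sgd.DiLoCo` or `make_diloco_outer_loop()`; the function names `pseudo_gradient` and `outer_step`, the flat-list parameter representation, and the default `lr` and `mu` values are all assumptions made for illustration. The point it shows is that defining the pseudo-gradient as global minus averaged-local parameters lets an ordinary "subtract lr times gradient" optimizer move the global weights toward the replicas.

```python
def pseudo_gradient(global_params, replica_params):
    """Pseudo-gradient = global weights minus the average of replica weights.

    With this sign, a standard optimizer update (p <- p - lr * grad)
    moves the global weights TOWARD the replica average.
    """
    n = len(replica_params)
    avg = [sum(r[i] for r in replica_params) / n for i in range(len(global_params))]
    return [g - a for g, a in zip(global_params, avg)]


def outer_step(global_params, replica_params, momentum_buf, lr=0.7, mu=0.9):
    """One outer update using SGD with Nesterov momentum (illustrative defaults)."""
    grad = pseudo_gradient(global_params, replica_params)
    # Momentum buffer accumulates pseudo-gradients across outer steps.
    new_buf = [mu * b + g for b, g in zip(momentum_buf, grad)]
    # Nesterov look-ahead: step along grad + mu * buffer.
    update = [g + mu * b for g, b in zip(grad, new_buf)]
    new_global = [p - lr * u for p, u in zip(global_params, update)]
    return new_global, new_buf


# Hypothetical two-parameter model with two replicas after H inner steps.
theta = [1.0, 2.0]
replicas = [[0.0, 2.0], [1.0, 4.0]]

# Sanity check of the sign convention: lr=1 and no momentum
# reduces the outer step to plain parameter averaging.
averaged, _ = outer_step(theta, replicas, [0.0, 0.0], lr=1.0, mu=0.0)
```

A flipped sign (averaged-local minus global) would push the global weights away from the replicas under the same optimizer, which is the failure a sign-convention test guards against.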

📝 Publication materials drafted: publications/ contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus CITATION.cff and CITATION.bib at the repo root. Use publications/RELEASE_CHECKLIST.md to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.

🔍 Vision encapsulation self-audit: docs/VISION_VALIDATION.md — clause-by-clause traceability matrix of the original brief against current deliverables, honest gap analysis (4 real gaps identified), adversarial self-review (5 strongest objections steelmanned), 10-point pass/fail scorecard, and a recommended sequence of three CPU-only spikes (006/007/008) + a packaging wave that takes the framework from 5/10 to 9/10 vision encapsulation in ~5 days, no GPU budget required. The remaining 1/10 (does the method actually improve training) is the empirical question that gates the v0.1 paper.

See spikes/README.md for the 5-stage spike plan, docs/INTEGRATION_ARCHITECTURE.md for the per-framework extension-point analysis, and spikes/005-integrated-trainer-skeleton/ for runnable trainer code.

TL;DR — what's in here, why it matters

Cursor's Composer 2.5 is the strongest case study for "RL post-training of a frontier MoE base produces a model that beats GPT-5.5 on agentic coding while costing 5–10× less to serve." The recipe is almost entirely post-training (~85% of compute) and the most important trick is non-obvious: a per-turn on-policy distillation loss called Targeted RL with Textual Feedback.

This repo contains:

framework/composer-replication-framework.md — master synthesis: architecture, stack picks, phase plan, open questions. The TL;DR table maps every layer of the system to a concrete software pick with rationale.
research/01-composer-2.5.md — Composer 2.5 deep-dive: base model, 5-stage recipe, the secret-sauce hint-distillation loss, results.
research/02-diloco-family.md — DiLoCo / OpenDiLoCo / Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 deep-dive: when decentralized training actually helps, when it's premature.
research/03-monarch-torchforge-openenv.md — Meta's Monarch actor mesh + TorchForge (paused) + OpenEnv environment standard. What's alive, what to bet on.
research/04-verl-trl.md — Algorithm-library deep-dive: GRPO / DAPO / DPO / PRM in TRL vs VeRL, plus the 3D-HybridEngine resharding pattern.
research/05-trace-replay-distillation.md — Novelty assessment of the trace-replay multi-teacher distillation idea: prior art (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA), cost analysis, reward-shape options.

Each of the five research deep-dives was authored by a different LLM family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) running in parallel. The synthesis at framework/composer-replication-framework.md cross-checks their findings.

Headline findings

1. Composer 2.5's secret sauce is the targeted hint-distillation loss — and it's published as SDPO

The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one — Cursor's "Targeted RL with Textual Feedback" — turns out to be mathematically the same as the published SDPO method (Hübotter et al., ICLR 2026 Workshop, arXiv:2601.20802 + code at github.com/siyan-zhao/OPSD, MIT licensed). Cursor cites this paper directly in the blog's footnote 1.

The mechanism: when a 100K-token rollout has a localized error, generate a text hint correcting the error, run a forward pass with the hint to get "Teacher" logits, run a forward pass without the hint to get "Student" logits, and apply a KL-divergence loss that pulls Student toward Teacher only at that turn. The same model is both teacher and student; the teacher is just "the model with the hint inserted into context."
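The per-turn loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repo's trainer code: the function name `hint_distill_loss`, the nested-list logit format, and the `turn_mask` argument are hypothetical, and the choice of forward KL (teacher to student) is an assumption, since the text does not fix a direction and the spike tests refer to a JSD-style metric (`sdpo_jsd`).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def hint_distill_loss(student_logits, teacher_logits, turn_mask):
    """Mean KL(teacher || student) over positions where turn_mask is 1.

    student_logits: per-token logits from the forward pass WITHOUT the hint.
    teacher_logits: per-token logits from the forward pass WITH the hint
                    (treated as constants; a real trainer would detach them).
    turn_mask:      1 for tokens of the erroneous turn, 0 elsewhere.
    """
    total, count = 0.0, 0
    for s_row, t_row, keep in zip(student_logits, teacher_logits, turn_mask):
        if not keep:
            continue  # tokens outside the flagged turn contribute nothing
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        total += sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
        count += 1
    return total / count if count else 0.0


# Toy rollout: three tokens, vocabulary of two; only the middle token is flagged.
student = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
teacher = [[0.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
mask = [0, 1, 0]
loss = hint_distill_loss(student, teacher, mask)
```

Because the mask zeroes out every other position, a disagreement outside the flagged turn would not change the loss, which is the "only at that turn" property.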

This sidesteps the problem that "GRPO on agentic traces is brittle because one bad step poisons 100 good ones." The biggest reproducibility gap is how the text hints are generated: Cursor does not disclose this. v0.1 will use template-based hints first; v0.2 will explore LLM-driven hint generators. See docs/COMPOSER_RECIPE_MAPPING.md for the rigorous stage-by-stage mapping of Cursor's blog onto our framework.

2. The trace-replay multi-teacher idea is genuinely novel — and economically viable (verified)

Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). Multi-teacher frozen-trace replay with disagreement-as-reward is open territory. Spike 001 (✅ VALIDATED, 2026-05-25) measured the per-trace cost floor empirically: $0.98/trace ungated (vs. $5 cap), 5x headroom; with VOI gating in v0.1 we project ~$0.30/trace.
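The headroom figure follows from simple arithmetic, worked through below. The three dollar amounts come from the paragraph above; the helper `traces_affordable` and the reading of the projected gated cost as an implied fraction of teacher queries (which assumes cost scales linearly with the number of steps sent to teachers) are assumptions added for illustration.

```python
import math

COST_PER_TRACE_UNGATED = 0.98  # USD per trace, measured in Spike 001
COST_CAP_PER_TRACE = 5.00      # USD per trace, the kill-switch cap
PROJECTED_GATED_COST = 0.30    # USD per trace, projected with VOI gating

# Headroom: how many times the measured cost fits under the cap (about 5.1x).
headroom = COST_CAP_PER_TRACE / COST_PER_TRACE_UNGATED

# ASSUMPTION: if cost is linear in the number of teacher-queried steps,
# the projection implies gating keeps roughly this fraction of queries.
implied_query_fraction = PROJECTED_GATED_COST / COST_PER_TRACE_UNGATED


def traces_affordable(budget_usd, cost_per_trace):
    """Hypothetical helper: whole traces that fit within a budget."""
    return math.floor(budget_usd / cost_per_trace)


ungated_traces = traces_affordable(100.0, COST_PER_TRACE_UNGATED)
```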

Critical distinction: Composer's hint-distill (SDPO, single model with hint context) and trace-replay-distill (N external teachers) are two different mechanisms, not competing implementations. They stack:

Composer hint-distill (SDPO) = same-model self-teacher with hint context, pulls student at error sites. ~1 extra forward pass. No API cost.
Trace-replay-distill = N external pretrained teachers, pull student at all steps. ~$0.30/trace with VOI gating. Novel.

v0.1 runs both. v0.0 (current) tests trace-replay alone vs. plain GRPO to falsify the novel claim cheaply.
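One way to picture "disagreement-as-reward" for a single frozen trace step is sketched below. It is a hypothetical reward shape, not the one the framework commits to (the research note lists several options): the function `step_signal`, the exact-string action matching, and the `min_majority` threshold are all assumptions for illustration.

```python
from collections import Counter


def step_signal(student_action, teacher_actions, min_majority=0.5):
    """Turn N teacher proposals for one step into a step-level signal.

    Returns (agreement, pair):
      agreement: fraction of teachers whose action matches the student's,
                 usable as a scalar process reward.
      pair:      a DPO-style preference pair when a strict majority of
                 teachers prefer a different action, otherwise None.
    """
    if not teacher_actions:
        raise ValueError("need at least one teacher action")
    votes = Counter(teacher_actions)
    top_action, top_count = votes.most_common(1)[0]
    agreement = votes[student_action] / len(teacher_actions)
    pair = None
    if top_action != student_action and top_count / len(teacher_actions) > min_majority:
        pair = {"chosen": top_action, "rejected": student_action}
    return agreement, pair


# The student deviated from what two of three teachers would have done.
agreement, pair = step_signal("rm -rf build", ["pytest -x", "pytest -x", "make test"])

# The student matched the teacher majority, so no preference pair is emitted.
agreement_ok, pair_ok = step_signal("pytest -x", ["pytest -x", "pytest -x", "make test"])
```

Exact string matching is the crudest possible notion of "same action"; a real pipeline would likely need normalization or a semantic comparison before counting votes.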

3. Recommended stack (verified across all 5 reports)

Layer	Pick	Why not the alternative
RL substrate	PRIME-RL	INTELLECT-2 already proved 32B globally distributed; Forge is "development-paused" by Meta
Algorithm impl	TRL (lift loss math)	Cleanest GRPO + first-class OpenEnv integration
Resharding pattern	VeRL's 3D-HybridEngine (reference)	Most battle-tested at 70B+
Environments	OpenEnv + verifiers	HF + Meta backing, MCP RFC landing, Hub-hosted
Distributed sync	Skip DiLoCo for v0.1	Outer loop only matters when training spans clusters
Orchestration	Ray today, Monarch when mature	Forge paused; Monarch K8s story still landing

Architecture

                    ┌───────────────────────────────────────────┐
                    │           OpenEnv Environment Hub         │
                    │  (HF Hub, Docker images, MCP tool-calling)│
                    │  - Anyrun-style code sandbox              │
                    │  - SWE-Gym, SWE-Bench-Verified envs       │
                    │  - "Feature Deletion" auto-grader env     │
                    └────────────────┬──────────────────────────┘
                                     │ rollouts (verifiers protocol)
                                     ▼
        ┌────────────────────────────────────────────────────────────┐
        │                    ORCHESTRATOR (CPU)                      │
        │  - Schedules rollouts across inference workers             │
        │  - Assembles training batches                              │
        │  - Routes hint-distillation pairs (Composer-style)         │
        │  - Routes trace-replay teacher queries (NOVEL)             │
        │  - Built on Monarch (future) or Ray (today)                │
        └────┬──────────────────────────┬──────────────────────────┬─┘
             │ rollout requests         │ training batches         │ teacher queries
             ▼                          ▼                          ▼
   ┌─────────────────────┐   ┌────────────────────┐   ┌────────────────────────┐
   │  INFERENCE POOL     │   │  TRAINER (GPU)     │   │  TEACHER POOL          │
   │  (vLLM / SGLang)    │   │  - FSDP2 sharded   │   │  - Frozen N teachers   │
   │  - Student policy   │   │  - GRPO + DAPO     │   │  - HF Inference,       │
   │  - Auto-resharded   │   │  - +Hint distill   │   │    OpenRouter, vLLM    │
   │    via SHARDCAST    │   │    KL loss         │   │  - Diverse families    │
   │  - Async tool waits │   │  - +PRM/DPO from   │   │    (Anthropic / OpenAI │
   │    don't block GPU  │   │    trace-replay    │   │     / DeepSeek / Qwen) │
   └─────────────────────┘   └────────────────────┘   └────────────────────────┘
                                     │
                                     │ pseudo-gradients (every H steps)
                                     ▼
                    ┌────────────────────────────────┐
                    │  OUTER LOOP (DiLoCo, optional) │
                    │  - Only when training spans    │
                    │    multiple clusters / DCs     │
                    │  - Streaming variant for       │
                    │    bandwidth-limited links     │
                    └────────────────────────────────┘

Three reward channels feed the trainer:

1. RLVR — verifiable rewards (tests pass, build succeeds). Ground truth, never skipped.
2. Composer hint-distill — per-turn KL to a hint-conditioned forward pass.
3. Trace-replay-distill — per-step preference / process-reward signal from N frozen teachers.

The novel contribution is channel (3) — no published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision.
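How the three channels might be combined, and ablated, is sketched below as a weighted sum. This is not the repo's `compose_loss` or `ComposerReplicationTrainer._compute_loss`, whose signatures are not shown here; `ChannelWeights`, `combine_channels`, and the weight values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelWeights:
    """Hypothetical per-channel weights; a weight of 0.0 ablates a channel."""
    rlvr: float = 1.0
    hint_distill: float = 0.0
    trace_replay: float = 0.0


def combine_channels(rlvr_loss, hint_loss, replay_loss, weights):
    """Weighted sum of the three channel losses, plus a per-channel breakdown."""
    if weights.rlvr <= 0.0:
        # RLVR is the ground-truth channel and is never skipped.
        raise ValueError("the RLVR channel cannot be ablated")
    parts = {
        "rlvr": weights.rlvr * rlvr_loss,
        "hint_distill": weights.hint_distill * hint_loss,
        "trace_replay": weights.trace_replay * replay_loss,
    }
    return sum(parts.values()), parts


# Baseline: RLVR only (both distillation channels ablated).
baseline_total, _ = combine_channels(2.0, 4.0, 8.0, ChannelWeights())

# All three channels on, with illustrative weights.
full_total, full_parts = combine_channels(2.0, 4.0, 8.0, ChannelWeights(1.0, 0.5, 0.25))
```

Returning the per-channel breakdown alongside the total makes it easy to log each channel separately and to confirm that switching a channel off actually changes the total.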

Roadmap

Phase	Timeline	Goal	Trained variant repo	Data repo
v0.0 spike	1–2 weeks	Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite	`Codeseys/composer-replication-qwen3-7b-v0`	`Codeseys/composer-replication-traces-v0`
v0.1	1–2 months	Full Composer recipe (RLVR + hint-distill + trace-replay) on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale.	`Codeseys/composer-replication-qwen3-32b-v1`	`Codeseys/composer-replication-traces-v1`
v0.2	3–6 months	Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across Modal + HF Jobs + on-prem via the new serverless-DiLoCo executor abstraction (ADR-005).	`Codeseys/composer-replication-qwen3-32b-decentralized`	(re-uses v1 data)

Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the HF multi-artifact research project layout. This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.

Wave 13 expansion (2026-05-26) — what just landed

The user expanded the brief mid-deep-work-loop to address the serverless-orchestration, normalization, and broader-RL-framework dimensions. Six new artifact families:

composer_replication.distillation — pluggable losses: SimPO (reference-free DPO), TAID (annealed teacher interpolation), Entropy-Aware OPD (token-wise gated forward/reverse KL). 17 unit tests. Use as standalone functions for now; compose_loss integration is deferred to Wave 14 (Wave 13 review Finding 2). See ADR-007 + docs/research/SELF_DISTILLATION_LANDSCAPE.md.
composer_replication.diloco.serverless — ServerlessExecutor Protocol + ObjectStoreAllReduce + LocalProcessExecutor (running + tested) + ModalExecutor / HFJobsExecutor skeletons. 9 multi-process tests pinning the allreduce barrier. See ADR-005 + docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md.
composer_replication.replaysim — N-teacher replay + data-juicer normalization (chosen over datatrove because it has native multi-turn + DPO-pair ops). 9 unit tests + default YAML recipe. See ADR-004 + docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md.
composer_replication.recipes.prime_rl — third RL framework recipe (alongside TRL + VeRL). PRIME-RL was selected because it has a first-class CustomLossConfig exposing exactly the tensors a 3-channel loss needs. See ADR-006 + docs/research/RL_FRAMEWORKS_LANDSCAPE.md.
composer_replication.recipes.monarch — Meta's PyTorch agentic stack tie-in. Monarch (BSD-3, v0.4.1) is the only Meta agentic-stack component actively shipping; TorchForge is paused. Actor layout documented + skeleton actors in place. See ADR-006.
docs/ALTERED_MINDS_TIE_IN.md — bridge to the user's adjacent workstream (formerly llm-mental-alterations). 5-phase plan for using the framework to RL-train altered-minds-altered models. ~$300 estimated for a moral-scenarios trace-replay round.
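Of the pluggable losses listed above, SimPO is the simplest to show in isolation. The sketch below follows the published SimPO objective (length-normalized log-probabilities as the implicit reward, with a target margin and no reference model); it is not the `composer_replication.distillation` API, and the function name `simpo_loss` and the default `beta` and `gamma` values are assumptions for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    chosen_logps / rejected_logps: per-token log-probabilities the policy
    assigns to the preferred and dispreferred responses.
    The implicit reward is beta times the AVERAGE token log-probability,
    so no reference model is needed; gamma is the target reward margin.
    """
    reward_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    reward_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# The policy already prefers the chosen response by exactly the target margin.
at_margin = simpo_loss([-0.5, -0.5], [-1.0, -1.0])

# Swapping the pair means the policy prefers the wrong response: higher loss.
swapped = simpo_loss([-1.0, -1.0], [-0.5, -0.5])
```

Length normalization is what separates this from summing log-probabilities: a longer response is not penalized merely for containing more tokens.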

Tests as of Wave 15: 115 passing + 1 skip-marked. Wave-by-wave: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD parity test added skip-marked). See docs/V1_V8_COVERAGE.md for the canonical running count.

Methodology — how this synthesis was produced

To minimize single-model bias, the five research deep-dives were generated in parallel by five different LLM families via the delegate_task parallel-research pattern:

Topic	Author model
`01-composer-2.5.md`	google/gemini-3.1-pro-preview
`02-diloco-family.md`	deepseek/deepseek-v4-pro
`03-monarch-torchforge-openenv.md`	openai/gpt-5
`04-verl-trl.md`	anthropic/claude-sonnet-4.6
`05-trace-replay-distillation.md`	moonshotai/kimi-k2-thinking

Convergent findings across reports (≥2 independent confirmations):

GRPO+DAPO is the consensus algorithm (3/4 reports that compared)
PRIME-RL is the most production-ready decentralized substrate (2 reports independently)
OpenEnv is the env-format winner (3 reports converge)
Trace-replay-with-N-teachers is genuinely under-explored (the trace-replay report's primary finding, corroborated by its absence from the other 4 reports)

The synthesis at framework/composer-replication-framework.md reconciles divergences (e.g., DiLoCo vs single-cluster timing) with explicit rationale.

Citation

If you use this framework or its derivative artifacts (the trained variants, the trace dataset, or the Feature-Deletion environment), please cite:

@misc{composer-replication-framework-2026,
  author       = {Codeseys},
  title        = {Composer 2.5 Replication Framework: A Methodology for Open Replication of Cursor's Agentic Coding Recipe},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Codeseys/composer-replication-framework}},
  note         = {Pre-spike research synthesis. Five-author parallel research with cross-family verification.}
}

License

MIT. Use freely; attribution appreciated. Underlying primary sources (Cursor blog, Moonshot K2.5 paper, DeepMind DiLoCo paper, Microsoft rStar paper, etc.) are owned by their respective authors and are cited inline in the research notes.

Contact

Open a Discussion on this repo for technical questions, corrections, or collaboration interest. The five research notes are open to PRs — if you find a misattribution or a missing primary source, send a fix.

Papers for Codeseys/composer-replication-framework

Reinforcement Learning via Self-Distillation. arXiv:2601.20802, published Jan 28.
Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch. arXiv:2501.18512, published Jan 30, 2025.
DiLoCo: Distributed Low-Communication Training of Language Models. arXiv:2311.08105, published Nov 14, 2023.
