Composer 2.5 Replication Framework
Repo type: model (methodology).
Status: Research synthesis + v0.1 framework with verified gap-closer spikes (2026-05-26).
Author: Codeseys
Goal: Replicate Cursor's Composer 2.5 (a post-trained Kimi K2.5 specialised for agentic coding) on any HuggingFace base model, using a synthesis of decentralized RL post-training techniques.
This repository is the "paper of the project" — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a novel multi-teacher trace-replay distillation channel that stacks on top of the Composer recipe.
Install
pip install -e .
python examples/qwen_05b_quickstart/run.py
The quickstart loads Qwen2.5-0.5B-Instruct and runs 5 backward steps through the 3-channel loss on CPU in ~3-5 minutes. See examples/qwen_05b_quickstart/README.md for what the output should look like.
v0.1 spike progress (2026-05-26):
- 🟢 Spike 001 (kill-switch teacher cost) — VALIDATED: 150 real OpenRouter calls, $0.98/trace, p95 latency 20.5s. The novel research direction is economically viable.
- 🟢 Spike 005 (integrated 3-channel trainer skeleton) — SKELETON-VALIDATED: 38/38 unit tests passing; the integration architecture claim ("all three channels run simultaneously, ablate cleanly, train without divergence") is empirically verified.
- 🟢 Spike 006 (real HF model smoke) — PASSED + STRICT-VERIFIED: 9 base tests + 3 strict tests (test_strict.py) close the cross-model-review's tautology critique: alternating-batch loss decrease, SDPO channel actually fires (sdpo_jsd > 0), SDPO off-vs-on totals differ on real Qwen2.5-0.5B. The original "is the loss decrease just memorization?" objection is no longer open.
- 🟢 Spike 002a-mini-gpu-smoke (real GPU evidence) — PASSED on local 5090: Qwen2.5-0.5B in bf16, 50 steps, loss 0.7354 → 0.00034 (99.95%), peak VRAM 5.31 GB, median 480 ms/step. First GPU evidence of any kind in the framework. ADR-001's local-5090 choice now empirically verified.
- 🟢 Spike 007 (real trace ingestion) — PASSED + E2E-VERIFIED: 15 unit tests + 2 e2e tests (test_e2e_with_loss.py) pipe ingested TraceState records all the way through compose_loss + backward on a real Qwen model. Closes V5 in spirit (cross-model review item #9).
- ⚠️ Spike 008 (DiLoCo outer-loop smoke) — PARTIAL: make_diloco_outer_loop() wraps torchft.local_sgd.DiLoCo. 5/5 single-process tests pass including a pseudo-gradient sign-convention pin. But the BACKLOG required a 2-replica convergence smoke; what shipped is 1-replica machinery + passthrough no-op allreduce. True multi-process DiLoCo is GPU-gated and not yet attempted.
- 🟢 Wave 10 (packaging) — DONE: pip install -e . works; the composer_replication package re-exports the verified APIs from the spike directories. compose_loss and build_batch are explicitly verification-harness public APIs (production loss is ComposerReplicationTrainer._compute_loss).
- 📋 Spikes 002a/002b/003/004 — planned, awaiting GPU budget commitment.
📝 Publication materials drafted: publications/ contains a complete pre-experimental release set — longform methodology paper, blog post (HF Blog format), repo Discussion announcement, X/LinkedIn threads, plus CITATION.cff and CITATION.bib at the repo root. Use publications/RELEASE_CHECKLIST.md to coordinate the publication wave. Nothing posted publicly yet — this is a pre-experimental release, not a post-experimental one.
🔍 Vision encapsulation self-audit: docs/VISION_VALIDATION.md — clause-by-clause traceability matrix of the original brief against current deliverables, honest gap analysis (4 real gaps identified), adversarial self-review (5 strongest objections steelmanned), 10-point pass/fail scorecard, and a recommended sequence of three CPU-only spikes (006/007/008) + a packaging wave that takes the framework from 5/10 to 9/10 vision encapsulation in ~5 days, no GPU budget required. The remaining 1/10 (does the method actually improve training) is the empirical question that gates the v0.1 paper.
See spikes/README.md for the 5-stage spike plan, docs/INTEGRATION_ARCHITECTURE.md for the per-framework extension-point analysis, and spikes/005-integrated-trainer-skeleton/ for runnable trainer code.
TL;DR — what's in here, why it matters
Cursor's Composer 2.5 is the strongest case study for "RL post-training of a frontier MoE base produces a model that beats GPT-5.5 on agentic coding while costing 5–10× less to serve." The recipe is almost entirely post-training (~85% of compute) and the most important trick is non-obvious: a per-turn on-policy distillation loss called Targeted RL with Textual Feedback.
This repo contains:
- framework/composer-replication-framework.md — master synthesis: architecture, stack picks, phase plan, open questions. The TL;DR table maps every layer of the system to a concrete software pick with rationale.
- research/01-composer-2.5.md — Composer 2.5 deep-dive: base model, 5-stage recipe, the secret-sauce hint-distillation loss, results.
- research/02-diloco-family.md — DiLoCo / OpenDiLoCo / Streaming DiLoCo / PRIME-RL / INTELLECT-1+2 deep-dive: when decentralized training actually helps, when it's premature.
- research/03-monarch-torchforge-openenv.md — Meta's Monarch actor mesh + TorchForge (paused) + OpenEnv environment standard. What's alive, what to bet on.
- research/04-verl-trl.md — Algorithm-library deep-dive: GRPO / DAPO / DPO / PRM in TRL vs VeRL, plus the 3D-HybridEngine resharding pattern.
- research/05-trace-replay-distillation.md — Novelty assessment of the trace-replay multi-teacher distillation idea: prior art (rStar / Math-Shepherd / OmegaPRM / Magpie / MoA), cost analysis, reward-shape options.
Each of the five research deep-dives was authored by a different LLM family (Gemini 3.1 Pro, DeepSeek V4 Pro, GPT-5, Sonnet 4.6, Kimi K2-Thinking) running in parallel. The synthesis at framework/composer-replication-framework.md cross-checks their findings.
Headline findings
1. Composer 2.5's secret sauce is the targeted hint-distillation loss — and it's published as SDPO
The 1T MoE base (Kimi K2.5) and the "Feature Deletion" RL environment are the obvious moves. The non-obvious one — Cursor's "Targeted RL with Textual Feedback" — turns out to be mathematically the same as the published SDPO method (Hübotter et al., ICLR 2026 Workshop, arXiv:2601.20802 + code at github.com/siyan-zhao/OPSD, MIT licensed). Cursor cites this paper directly in the blog's footnote 1.
The mechanism: when a 100K-token rollout has a localized error, generate a text hint correcting the error, run forward pass with the hint to get "Teacher" logits, run forward pass without the hint to get "Student" logits, and apply KL divergence loss to pull Student toward Teacher only at that turn. Same model is both teacher and student — the teacher is just "the model with hint inserted into context."
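The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repo's trainer code: the names `hint_distill_loss` and `turn_mask` are hypothetical, the logits are toy values, and the KL direction (teacher relative to student) is an assumption, since the text only says the loss pulls Student toward Teacher.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def hint_distill_loss(student_logits, teacher_logits, turn_mask):
    """Mean KL(teacher || student) over token positions where turn_mask is 1.

    student_logits: logits from the forward pass WITHOUT the hint.
    teacher_logits: logits from the forward pass WITH the hint (same model).
    turn_mask:      1 for tokens inside the flagged turn, 0 elsewhere.
    """
    total, count = 0.0, 0
    for s_row, t_row, keep in zip(student_logits, teacher_logits, turn_mask):
        if not keep:
            continue  # tokens outside the flagged turn contribute nothing
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
        count += 1
    return total / count if count else 0.0

# Toy rollout: 3 token positions, vocabulary of 3; only position 1 is flagged.
student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
teacher = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
mask = [0, 1, 0]
loss = hint_distill_loss(student, teacher, mask)
print(f"hint-distill loss at the flagged turn: {loss:.4f}")
```

In a real trainer the teacher logits would be computed without gradient, so only the student pass is updated; the mask is what keeps the signal local to the erroneous turn.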
This sidesteps "GRPO on agentic traces is brittle because one bad step poisons 100 good ones." The biggest reproducibility gap is how the text hints are generated — Cursor never tells. v0.1 will use template-based hints first; v0.2 will explore LLM-driven hint generators. See docs/COMPOSER_RECIPE_MAPPING.md for the rigorous stage-by-stage mapping of Cursor's blog onto our framework.
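To make the brittleness claim concrete, here is a small sketch of group-relative advantages as commonly described for GRPO: each rollout's scalar reward is normalized against its group, and every token in that rollout inherits the same value. The function names and the toy rewards are illustrative assumptions; real implementations differ in details such as the standard-deviation estimator.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

def per_token_advantages(rewards, lengths):
    """Broadcast each rollout's advantage to all of its tokens."""
    return [[a] * n for a, n in zip(grpo_advantages(rewards), lengths)]

# Four rollouts for one prompt; rollout 2 failed its tests because of one bad step.
rewards = [1.0, 1.0, 0.0, 1.0]
lengths = [3, 4, 5, 2]  # token counts, kept tiny for readability
token_adv = per_token_advantages(rewards, lengths)

# Every token of the failed rollout gets the same negative advantage,
# including the tokens from steps that were actually correct.
print(token_adv[2])
```

A per-turn distillation loss avoids this by attaching the corrective signal to the erroneous turn only, rather than to the whole trajectory.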
2. The trace-replay multi-teacher idea is genuinely novel — and economically viable (verified)
Closest precedent is rStar-Math (single-teacher MCTS counterfactuals at training time). Multi-teacher frozen-trace replay with disagreement-as-reward is open territory. Spike 001 (✅ VALIDATED, 2026-05-25) measured the per-trace cost floor empirically: $0.98/trace ungated (vs. $5 cap), 5x headroom; with VOI gating in v0.1 we project ~$0.30/trace.
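The headroom figure follows directly from the two numbers quoted above, and the projected gated cost implies a rough query fraction. The sketch below reproduces that arithmetic; the assumption that cost scales linearly with the fraction of steps sent to teachers is mine, not a measured result, and `traces_within_budget` is a hypothetical helper.

```python
COST_PER_TRACE_UNGATED = 0.98  # USD per trace, measured in Spike 001
COST_CAP_PER_TRACE = 5.00      # USD per trace, the stated cap
PROJECTED_GATED_COST = 0.30    # USD per trace, the v0.1 projection

# Headroom: how many times the measured cost fits under the cap.
headroom = COST_CAP_PER_TRACE / COST_PER_TRACE_UNGATED

# ASSUMPTION: cost is linear in the fraction of steps sent to teachers,
# so the projection implies gating keeps roughly this share of queries.
implied_query_fraction = PROJECTED_GATED_COST / COST_PER_TRACE_UNGATED

def traces_within_budget(budget_usd, cost_per_trace):
    """Whole number of traces affordable at a given per-trace cost."""
    return int(budget_usd // cost_per_trace)

print(f"headroom: {headroom:.2f}x")
print(f"implied query fraction: {implied_query_fraction:.2f}")
print(f"traces per $100 ungated: {traces_within_budget(100.0, COST_PER_TRACE_UNGATED)}")
```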
Critical distinction: Composer's hint-distill (SDPO, single model with hint context) and trace-replay-distill (N external teachers) are two different mechanisms, not competing implementations. They stack:
- Composer hint-distill (SDPO) = same-model self-teacher with hint context, pulls student at error sites. ~1 extra forward pass. No API cost.
- Trace-replay-distill = N external pretrained teachers, pull student at all steps. ~$0.30/trace with VOI gating. Novel.
v0.1 runs both. v0.0 (current) tests trace-replay alone vs. plain GRPO to falsify the novel claim cheaply.
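One way to picture the trace-replay channel is a per-step vote: each frozen teacher is shown the same trace prefix and proposes a next action, and teacher agreement with the student's recorded action becomes a step-level signal. The sketch below is a deliberately simplified assumption, using exact string matching and a strict-majority rule; `replay_step` and its return shape are hypothetical and do not describe the repo's actual reward shape.

```python
from collections import Counter

def replay_step(student_action, teacher_actions):
    """Score one trace step against N teacher proposals.

    Returns a step reward (fraction of teachers agreeing with the student)
    and, when a strict majority prefers a different action, a preference pair.
    """
    votes = Counter(teacher_actions)
    top_action, top_votes = votes.most_common(1)[0]
    n = len(teacher_actions)
    agreement = votes[student_action] / n  # Counter returns 0 for unseen actions
    pair = None
    if top_action != student_action and top_votes > n / 2:
        pair = {"chosen": top_action, "rejected": student_action}
    return {"reward": agreement, "pair": pair}

# Three teachers replay the same step of a frozen trace.
step = replay_step(
    student_action="rm -rf build",
    teacher_actions=["run tests", "run tests", "rm -rf build"],
)
print(step)
```

The preference pairs could feed a DPO-style loss and the agreement score a process-reward signal; which of the two is used is exactly the open reward-shape question.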
3. Recommended stack (verified across all 5 reports)
| Layer | Pick | Why not the alternative |
|---|---|---|
| RL substrate | PRIME-RL | INTELLECT-2 already proved 32B globally distributed; Forge is "development-paused" by Meta |
| Algorithm impl | TRL (lift loss math) | Cleanest GRPO + first-class OpenEnv integration |
| Resharding pattern | VeRL's 3D-HybridEngine (reference) | Most battle-tested at 70B+ |
| Environments | OpenEnv + verifiers | HF + Meta backing, MCP RFC landing, Hub-hosted |
| Distributed sync | Skip DiLoCo for v0.1 | Outer loop only matters when training spans clusters |
| Orchestration | Ray today, Monarch when mature | Forge paused; Monarch K8s story still landing |
Architecture
┌───────────────────────────────────────────┐
│ OpenEnv Environment Hub │
│ (HF Hub, Docker images, MCP tool-calling)│
│ - Anyrun-style code sandbox │
│ - SWE-Gym, SWE-Bench-Verified envs │
│ - "Feature Deletion" auto-grader env │
└────────────────┬──────────────────────────┘
│ rollouts (verifiers protocol)
▼
┌────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (CPU) │
│ - Schedules rollouts across inference workers │
│ - Assembles training batches │
│ - Routes hint-distillation pairs (Composer-style) │
│ - Routes trace-replay teacher queries (NOVEL) │
│ - Built on Monarch (future) or Ray (today) │
└────┬──────────────────────────┬──────────────────────────┬─┘
│ rollout requests │ training batches │ teacher queries
▼ ▼ ▼
┌─────────────────────┐ ┌────────────────────┐ ┌────────────────────────┐
│ INFERENCE POOL │ │ TRAINER (GPU) │ │ TEACHER POOL │
│ (vLLM / SGLang) │ │ - FSDP2 sharded │ │ - Frozen N teachers │
│ - Student policy │ │ - GRPO + DAPO │ │ - HF Inference, │
│ - Auto-resharded │ │ - +Hint distill │ │ OpenRouter, vLLM │
│ via SHARDCAST │ │ KL loss │ │ - Diverse families │
│ - Async tool waits │ │ - +PRM/DPO from │ │ (Anthropic / OpenAI │
│ don't block GPU │ │ trace-replay │ │ / DeepSeek / Qwen) │
└─────────────────────┘ └────────────────────┘ └────────────────────────┘
│
│ pseudo-gradients (every H steps)
▼
┌────────────────────────────────┐
│ OUTER LOOP (DiLoCo, optional) │
│ - Only when training spans │
│ multiple clusters / DCs │
│ - Streaming variant for │
│ bandwidth-limited links │
└────────────────────────────────┘
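The outer-loop box in the diagram can be sketched as follows. After H local steps, each replica's pseudo-gradient is the global parameters minus its local parameters; these are averaged and fed to an outer optimizer, here SGD with Nesterov momentum. This is a toy illustration on flat lists of floats, not the torchft-based code in the repo, and the default `lr` and `momentum` values are illustrative assumptions.

```python
def outer_step(global_params, replica_params, momentum_buf, lr=0.7, momentum=0.9):
    """One DiLoCo-style outer update on flat parameter lists.

    global_params:  parameters all replicas started this round from.
    replica_params: one parameter list per replica, after H local steps.
    momentum_buf:   outer optimizer state, same length as global_params.
    """
    n = len(replica_params)
    new_global, new_buf = [], []
    for i, g in enumerate(global_params):
        # Sign convention: pseudo-gradient = global - local, so a descent
        # step on it moves the global weights toward the replicas.
        pseudo_grad = sum(g - rp[i] for rp in replica_params) / n
        buf = momentum * momentum_buf[i] + pseudo_grad
        step = pseudo_grad + momentum * buf  # Nesterov look-ahead
        new_global.append(g - lr * step)
        new_buf.append(buf)
    return new_global, new_buf

global_params = [1.0, 2.0]
replicas = [[0.0, 2.0], [0.5, 1.0]]
# With lr=1 and no momentum the outer step reduces to plain averaging.
averaged, _ = outer_step(global_params, replicas, [0.0, 0.0], lr=1.0, momentum=0.0)
print(averaged)
```

The sanity check at the end is useful when pinning the sign convention: if the pseudo-gradient were flipped, the global weights would move away from the replica average instead of onto it.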
Three reward channels feed the trainer:
- RLVR — verifiable rewards (tests pass, build succeeds). Ground truth, never skipped.
- Composer hint-distill — per-turn KL to a hint-conditioned forward pass.
- Trace-replay-distill — per-step preference / process-reward signal from N frozen teachers.
The novel contribution is the third channel, trace-replay-distill: no published work systematically replays each step of frozen agentic traces with multiple teachers to harvest step-level supervision.
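A minimal sketch of how three channels might be combined, assuming a simple weighted sum with per-channel weights that can be zeroed for ablations. The names `ChannelWeights` and `combine_channels` are hypothetical and are not the signature of the repo's `compose_loss`; the scalar losses stand in for values a real trainer would compute per batch.

```python
from dataclasses import dataclass

@dataclass
class ChannelWeights:
    rlvr: float = 1.0          # verifiable-reward channel, always on
    hint_distill: float = 0.0  # set > 0 to enable the hint-distill channel
    trace_replay: float = 0.0  # set > 0 to enable the trace-replay channel

def combine_channels(rlvr_loss, hint_loss, replay_loss, w):
    """Weighted sum of the three channel losses, with a per-channel breakdown."""
    if w.rlvr <= 0:
        raise ValueError("RLVR is the ground-truth channel and is never skipped")
    parts = {
        "rlvr": w.rlvr * rlvr_loss,
        "hint_distill": w.hint_distill * hint_loss,
        "trace_replay": w.trace_replay * replay_loss,
    }
    parts["total"] = sum(parts.values())
    return parts

full = combine_channels(0.5, 0.2, 0.1, ChannelWeights(1.0, 0.5, 0.25))
ablated = combine_channels(0.5, 0.2, 0.1, ChannelWeights(1.0, 0.0, 0.25))
print(full["total"], ablated["total"])
```

Returning the per-channel terms alongside the total makes ablation checks straightforward: switching a weight to zero should change the total by exactly that channel's contribution.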
Roadmap
| Phase | Timeline | Goal | Trained variant repo | Data repo |
|---|---|---|---|---|
| v0.0 spike | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | Codeseys/composer-replication-qwen3-7b-v0 | Codeseys/composer-replication-traces-v0 |
| v0.1 | 1–2 months | Full Composer recipe (RLVR + hint-distill + trace-replay) on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. | Codeseys/composer-replication-qwen3-32b-v1 | Codeseys/composer-replication-traces-v1 |
| v0.2 | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across Modal + HF Jobs + on-prem via the new serverless-DiLoCo executor abstraction (ADR-005). | Codeseys/composer-replication-qwen3-32b-decentralized | (re-uses v1 data) |
Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the HF multi-artifact research project layout. This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
Wave 13 expansion (2026-05-26) — what just landed
The user expanded the brief mid-deep-work-loop to address the serverless-orchestration, normalization, and broader-RL-framework dimensions. Six new artifact families:
- composer_replication.distillation — pluggable losses: SimPO (reference-free DPO), TAID (annealed teacher interpolation), Entropy-Aware OPD (token-wise gated forward/reverse KL). 17 unit tests. Use as standalone functions for now; compose_loss integration is deferred to Wave 14 (Wave 13 review Finding 2). See ADR-007 + docs/research/SELF_DISTILLATION_LANDSCAPE.md.
- composer_replication.diloco.serverless — ServerlessExecutor Protocol + ObjectStoreAllReduce + LocalProcessExecutor (running-tested) + ModalExecutor / HFJobsExecutor skeletons. 9 multi-process tests pinning the allreduce barrier. See ADR-005 + docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md.
- composer_replication.replaysim — N-teacher replay + data-juicer normalization (chosen over datatrove because it has native multi-turn-DPO-pair ops). 9 unit tests + default YAML recipe. See ADR-004 + docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md.
- composer_replication.recipes.prime_rl — third RL framework recipe (alongside TRL + VeRL). PRIME-RL was selected because it has a first-class CustomLossConfig exposing exactly the tensors a 3-channel loss needs. See ADR-006 + docs/research/RL_FRAMEWORKS_LANDSCAPE.md.
- composer_replication.recipes.monarch — Meta's PyTorch agentic stack tie-in. Monarch (BSD-3, v0.4.1) is the only Meta agentic-stack component actively shipping; TorchForge is paused. Actor layout documented + skeleton actors in place. See ADR-006.
- docs/ALTERED_MINDS_TIE_IN.md — bridge to the user's adjacent workstream (formerly llm-mental-alterations). 5-phase plan for using the framework to RL-train altered-minds-altered models. ~$300 estimated for a moral-scenarios trace-replay round.
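For readers unfamiliar with SimPO, the reference-free objective can be sketched as below: each response is scored by its length-normalized log-probability under the policy, and the loss is the negative log-sigmoid of the scaled difference minus a target margin. This follows the published SimPO formulation as I understand it and is not the repo's implementation; the function name, the toy log-probabilities, and the `beta` and `gamma` defaults are illustrative assumptions.

```python
import math

def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=0.5):
    """SimPO-style loss for one preference pair.

    Inputs are per-token log-probabilities of each response under the policy.
    No reference model is needed: the implicit reward is the average
    log-probability scaled by beta.
    """
    chosen = beta * sum(chosen_token_logps) / len(chosen_token_logps)
    rejected = beta * sum(rejected_token_logps) / len(rejected_token_logps)
    margin = chosen - rejected - gamma
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

good_logps = [-0.1, -0.2]        # policy already prefers this response
bad_logps = [-1.0, -1.5, -2.0]   # lower average log-probability

loss_aligned = simpo_loss(good_logps, bad_logps)
loss_flipped = simpo_loss(bad_logps, good_logps)
print(f"aligned pair: {loss_aligned:.4f}, flipped pair: {loss_flipped:.4f}")
```

Length normalization is the part that matters for agentic traces, where the chosen and rejected continuations can differ greatly in token count.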
Tests as of Wave 15: 115 passing + 1 skip-marked. Wave-by-wave: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD parity test added skip-marked). See docs/V1_V8_COVERAGE.md for the canonical running count.
Methodology — how this synthesis was produced
To minimize single-model bias, the five research deep-dives were generated in parallel by five different LLM families via the delegate_task parallel-research pattern:
| Topic | Author model |
|---|---|
| 01-composer-2.5.md | google/gemini-3.1-pro-preview |
| 02-diloco-family.md | deepseek/deepseek-v4-pro |
| 03-monarch-torchforge-openenv.md | openai/gpt-5 |
| 04-verl-trl.md | anthropic/claude-sonnet-4.6 |
| 05-trace-replay-distillation.md | moonshotai/kimi-k2-thinking |
Convergent findings across reports (≥2 independent confirmations):
- GRPO+DAPO is the consensus algorithm (3/4 reports that compared)
- PRIME-RL is the most production-ready decentralized substrate (2 reports independently)
- OpenEnv is the env-format winner (3 reports converge)
- Trace-replay-with-N-teachers is genuinely under-explored (the trace-replay report's primary finding, corroborated by the absence of it in the 4 other reports)
The synthesis at framework/composer-replication-framework.md reconciles divergences (e.g., DiLoCo vs single-cluster timing) with explicit rationale.
Citation
If you use this framework or its derivative artifacts (the trained variants, the trace dataset, or the Feature-Deletion environment), please cite:
@misc{composer-replication-framework-2026,
author = {Codeseys},
title = {Composer 2.5 Replication Framework: A Methodology for Open Replication of Cursor's Agentic Coding Recipe},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/Codeseys/composer-replication-framework}},
note = {Pre-spike research synthesis. Five-author parallel research with cross-family verification.}
}
License
MIT. Use freely; attribution appreciated. Underlying primary sources (Cursor blog, Moonshot K2.5 paper, DeepMind DiLoCo paper, Microsoft rStar paper, etc.) are owned by their respective authors and are cited inline in the research notes.
Related work / links
- Cursor — Introducing Composer 2.5 (Cursor blog, 2026)
- Moonshot AI — Kimi K2 Thinking
- Prime Intellect — PRIME-RL and INTELLECT-2 model card
- Hugging Face — TRL
- ByteDance — VeRL
- Meta — OpenEnv + Monarch
- Microsoft — rStar / rStar-Math
- DeepMind — DiLoCo paper and Streaming DiLoCo
Contact
Open a Discussion on this repo for technical questions, corrections, or collaboration interest. The five research notes are open to PRs — if you find a misattribution or a missing primary source, send a fix.