SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills Paper • 2605.24117 • Published 8 days ago • 17 • 4
SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills Paper • 2605.24117 • Published 8 days ago • 17
TMAS: Scaling Test-Time Compute via Multi-Agent Synergy Paper • 2605.10344 • Published 19 days ago • 49
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies Paper • 2605.03596 • Published 25 days ago • 10
HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness Paper • 2605.02396 • Published 26 days ago • 24
ClawGym: A Scalable Framework for Building Effective Claw Agents Paper • 2604.26904 • Published about 1 month ago • 50
On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning Paper • 2604.01702 • Published Apr 4 • 3
How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings Paper • 2604.04323 • Published Apr 6 • 41
On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning Paper • 2604.01702 • Published Apr 4 • 3
Embarrassingly Simple Self-Distillation Improves Code Generation Paper • 2604.01193 • Published Apr 1 • 54 • 8
LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning Paper • 2603.21065 • Published Mar 22 • 78
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities Paper • 2603.02578 • Published Mar 3 • 25
Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs Paper • 2505.18573 • Published May 24, 2025
Can Tool-Integrated Reinforcement Learning Generalize Across Diverse Domains? Paper • 2510.11184 • Published Oct 13, 2025 • 1