RubricBench: Aligning Model-Generated Rubrics with Human Standards Paper • 2603.01562 • Published 4 days ago • 50
On Data Engineering for Scaling LLM Terminal Capabilities Paper • 2602.21193 • Published 9 days ago • 90
From Perception to Action: An Interactive Benchmark for Vision Reasoning Paper • 2602.21015 • Published 9 days ago • 23
Free(): Learning to Forget in Malloc-Only Reasoning Models Paper • 2602.08030 • Published 25 days ago • 5
AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis Paper • 2602.09372 • Published 24 days ago • 5
Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models Paper • 2602.01849 • Published Feb 2 • 5
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models Paper • 2601.07372 • Published Jan 12 • 44
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation Paper • 2601.09688 • Published Jan 14 • 127
One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling Paper • 2601.03111 • Published Jan 6 • 10
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Paper • 2601.05242 • Published Jan 8 • 228
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents Paper • 2512.22047 • Published Dec 26, 2025 • 30
Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning Paper • 2512.20848 • Published Dec 23, 2025 • 38
Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions Paper • 2503.22678 • Published Mar 28, 2025 • 2
Very Large-Scale Multi-Agent Simulation in AgentScope Paper • 2407.17789 • Published Jul 25, 2024 • 35