Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models • Paper 2604.08527 • Published Apr 9
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping • Paper 2510.18927 • Published Oct 21, 2025