Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning
Paper
• 2510.03259
• Published • 57
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Paper
• 2510.07242
• Published • 30
First Try Matters: Revisiting the Role of Reflection in Reasoning Models
Paper
• 2510.08308
• Published • 24
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Paper
• 2510.03222
• Published • 76
Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States
Paper
• 2510.11052
• Published • 52
RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
Paper
• 2510.10201
• Published • 36
Making Mathematical Reasoning Adaptive
Paper
• 2510.04617
• Published • 23
Demystifying Reinforcement Learning in Agentic Reasoning
Paper
• 2510.11701
• Published • 33
Are Large Reasoning Models Interruptible?
Paper
• 2510.11713
• Published • 5
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
Paper
• 2510.11696
• Published • 182
Deep Self-Evolving Reasoning
Paper
• 2510.17498
• Published • 12
Continuous Autoregressive Language Models
Paper
• 2510.27688
• Published • 74
Higher-order Linear Attention
Paper
• 2510.27258
• Published • 15
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
Paper
• 2510.27044
• Published • 6
Why Language Models Hallucinate
Paper
• 2509.04664
• Published • 199
Reverse-Engineered Reasoning for Open-Ended Generation
Paper
• 2509.06160
• Published • 151
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Paper
• 2509.22186
• Published • 164
FlowRL: Matching Reward Distributions for LLM Reasoning
Paper
• 2509.15207
• Published • 118
Towards a Unified View of Large Language Model Post-Training
Paper
• 2509.04419
• Published • 76
Variational Reasoning for Language Models
Paper
• 2509.22637
• Published • 69
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
Paper
• 2509.06949
• Published • 57
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
Paper
• 2509.23808
• Published • 47
Sequential Diffusion Language Models
Paper
• 2509.24007
• Published • 47
Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Paper
• 2511.23319
• Published • 24
GLM-5: from Vibe Coding to Agentic Engineering
Paper
• 2602.15763
• Published • 150
Experiential Reinforcement Learning
Paper
• 2602.13949
• Published • 75
Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making
Paper
• 2602.06570
• Published • 61
Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
Paper
• 2604.06628
• Published • 326
Paper
• 2604.03128
• Published • 176
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
Paper
• 2604.02029
• Published • 151
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Paper
• 2604.04921
• Published • 114
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Paper
• 2604.03993
• Published • 43
Large Language Models Explore by Latent Distilling
Paper
• 2604.24927
• Published • 74
Why Fine-Tuning Encourages Hallucinations and How to Fix It
Paper
• 2604.15574
• Published • 25
Hallucinations Undermine Trust; Metacognition is a Way Forward
Paper
• 2605.01428
• Published • 24
Co-Evolving Policy Distillation
Paper
• 2604.27083
• Published • 67
Efficient Training on Multiple Consumer GPUs with RoundPipe
Paper
• 2604.27085
• Published • 40
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
Paper
• 2604.24952
• Published • 6
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Paper
• 2605.06638
• Published • 15
Continuous Latent Diffusion Language Model
Paper
• 2605.06548
• Published • 80
RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
Paper
• 2605.00180
• Published • 30
Long Context Pre-Training with Lighthouse Attention
Paper
• 2605.06554
• Published • 31
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Paper
• 2605.15012
• Published • 4
BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE
Paper
• 2605.14438
• Published • 5
Process Rewards with Learned Reliability
Paper
• 2605.15529
• Published • 53
It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
Paper
• 2605.20258
• Published • 30
The Unlearnability Phenomenon in RLVR for Language Models
Paper
• 2605.16787
• Published • 6
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Paper
• 2605.23081
• Published • 40
Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
Paper
• 2605.25052
• Published • 13
Language Models Need Sleep
Paper
• 2605.26099
• Published • 10
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
Paper
• 2605.25189
• Published • 3
How Far Will They Go? Red-Teaming Online Influence with Large Language Models
Paper
• 2605.22880
• Published • 5
Decoding the Critique Mechanism in Large Reasoning Models
Paper
• 2603.16331
• Published
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
Paper
• 2605.25604
• Published • 131
Self-Improving Language Models with Bidirectional Evolutionary Search
Paper
• 2605.28814
• Published • 52
Triplet-Block Diffusion RWKV
Paper
• 2605.25969
• Published • 18
Less is More: Early Stopping Rollout for On-Policy Distillation
Paper
• 2605.27028
• Published • 10
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Paper
• 2605.29548
• Published • 3