f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Abstract
Preference alignment objectives are extended to general alignment settings using f-divergence variational representations, introducing novel on-policy and hybrid policy optimization methods for LLM alignment with theoretical and empirical validation.
Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning algorithms, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on-/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (math reasoning) and PA (safety alignment) tasks, demonstrating superior performance and flexibility compared to current methods.
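For readers unfamiliar with the variational representation referenced in the abstract, the standard Fenchel-dual (Nguyen–Wainwright–Jordan) lower bound of an f-divergence is sketched below; the paper's specific instantiation of this bound for aligned versus unaligned response distributions is not reproduced here.

$$
D_f(P \,\|\, Q) \;=\; \sup_{T}\; \mathbb{E}_{x \sim P}\big[T(x)\big] \;-\; \mathbb{E}_{x \sim Q}\big[f^{*}\!\big(T(x)\big)\big],
\qquad
f^{*}(t) \;=\; \sup_{u}\,\{\,u t - f(u)\,\},
$$

where $f^{*}$ is the convex conjugate of the generator $f$; choosing $f(u) = u \log u$, for example, recovers the KL divergence.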
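As background on the GRPO family that f-GRPO builds upon, here is a minimal PyTorch sketch of standard group-relative advantages and the clipped surrogate loss. Function names are illustrative, and the f-divergence-specific objective proposed in the paper is not included.

```python
# Minimal sketch of a standard GRPO-style update (background only; the
# f-divergence-based objective of f-GRPO is described in the paper and
# is not reproduced here). Function names are illustrative.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) verifiable rewards for a group of
    sampled responses per prompt; advantages are normalized within each group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate on per-response log-probabilities
    (sequence-level for brevity; token-level averaging is also common)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()
```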
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Clipping-Free Policy Optimization for Large Language Models (2026)
- GOPO: Policy Optimization using Ranked Rewards (2026)
- A Unified Framework for Rethinking Policy Divergence Measures in GRPO (2026)
- Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning (2026)
- Rewards as Labels: Revisiting RLVR from a Classification Perspective (2026)
- Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning (2026)
- SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning (2026)