Abstract
Reference-Guided Fine-Tuning (ReGFT) addresses reward sparsity in reinforcement learning for mathematical reasoning by using human-written solutions to create guided training trajectories that improve model performance.
Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity: on challenging problems, the LLM fails to sample any correct trajectory, so RL receives no meaningful positive feedback. At the same time, human-written reference solutions often accompany the problems (e.g., problems from AoPS), but directly fine-tuning on these solutions offers no benefit because models often cannot imitate human proofs that lie outside their own reasoning distribution. We introduce Reference-Guided Fine-Tuning (ReGFT), a simple and effective method that uses human-written reference solutions to synthesize positive trajectories on hard problems and trains on them before RL. For each problem, we provide the model with a partial reference solution and let it generate its own reasoning trace, so that the resulting trajectories remain in the model's reasoning space while still benefiting from reference guidance. Fine-tuning on these reference-guided trajectories increases the number of solvable problems and produces a checkpoint that receives more positive rewards during RL. Across three benchmarks (AIME24, AIME25, BeyondAIME), ReGFT consistently improves supervised accuracy, accelerates DAPO training, and raises the final performance plateau of RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
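The trajectory-synthesis step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the 50% reference fraction, the number of samples, and the helper names (`partial_reference`, `build_guided_prompt`, `synthesize_trajectories`) are all hypothetical, and `generate` and `is_correct` stand in for whatever sampler and answer verifier are actually used.

```python
from typing import Callable, Dict, List


def partial_reference(reference: str, fraction: float) -> str:
    """Keep the first `fraction` of the non-empty lines of a reference solution.

    Assumption: the abstract only says "partial reference solution"; cutting
    by line count is one possible choice, not necessarily the paper's.
    """
    lines = [ln for ln in reference.splitlines() if ln.strip()]
    keep = int(len(lines) * fraction)
    return "\n".join(lines[:keep])


def build_guided_prompt(problem: str, hint: str) -> str:
    """Hypothetical prompt template that exposes the partial reference as a hint."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Partial reference solution:\n{hint}\n\n"
        "Using the hint, write your own complete step-by-step solution."
    )


def synthesize_trajectories(
    problems: List[Dict[str, str]],
    generate: Callable[[str], str],          # assumed: samples one trace from the model
    is_correct: Callable[[str, str], bool],  # assumed: checks a trace against the answer
    fraction: float = 0.5,
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Sample reference-guided traces and keep only the verified-correct ones."""
    kept = []
    for item in problems:
        hint = partial_reference(item["reference"], fraction)
        prompt = build_guided_prompt(item["problem"], hint)
        for _ in range(n_samples):
            trace = generate(prompt)
            if is_correct(trace, item["answer"]):
                # The trace is written by the model itself, so it stays in the
                # model's own reasoning distribution.
                kept.append({"problem": item["problem"], "hint": hint, "trace": trace})
    return kept
```

The kept records would then serve as fine-tuning data before RL. Whether the hint remains in the training prompt or is dropped is not stated in the abstract, so the sketch stores it separately and leaves that choice open.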
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration (2026)
- InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning (2026)
- Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning (2026)
- Reinforcement-aware Knowledge Distillation for LLM Reasoning (2026)
- Beyond Correctness: Learning Robust Reasoning via Transfer (2026)