# GRPO Tax Study: qwen-1.5b DPO LoRA Adapter
This is a DPO-trained LoRA adapter released as part of the paper *The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training*.

This adapter serves as a comparison baseline against the GRPO-trained adapters. The paper finds that DPO acts as a more conservative optimizer, producing smaller capability shifts than GRPO in both directions.
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Method | DPO with LoRA (r=16, alpha=32) |
| Dataset | HuggingFaceH4/ultrafeedback_binarized (10K subset) |
| Epochs | 1 |
| Learning rate | 5e-7 (cosine) |
| DPO beta | 0.1 |
| Max seq length | 768 |
| Precision | bf16 |
| Hardware | NVIDIA RTX 5090 (32GB) |
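For reference, DPO optimizes a logistic loss on the implicit reward margin between the chosen and rejected responses, scaled by β (0.1 here). A minimal pure-Python sketch of the per-pair loss, assuming summed sequence log-probabilities are already computed (function and argument names are illustrative, not from the training code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are log-prob shifts relative to the frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))
```

At initialization (policy equals reference) the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the chosen response.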
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/grpo-tax-qwen-1.5b-dpo")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
```
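Once loaded, the adapter behaves like any chat model. A minimal generation sketch, continuing from the loading snippet above (the prompt is illustrative; `generate` settings are one reasonable choice, not the paper's evaluation configuration):

```python
# Format a single-turn conversation with the model's chat template
messages = [{"role": "user", "content": "Explain LoRA fine-tuning in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding, capped at 128 new tokens
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```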
## Related Resources
| Resource | Link |
|---|---|
| Paper | Coming soon (TMLR submission) |
| All evaluation data | `usama10/grpo-tax-eval-data` |
| Source code | [github.com/usama10/grpo-capability-tax](https://github.com/usama10/grpo-capability-tax) |
| GRPO adapters | qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b |