# GRPO Tax Study: qwen-1.5b DPO LoRA Adapter

This is a DPO-trained LoRA adapter released as part of the paper:

*The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training*

This adapter serves as a comparison baseline for the GRPO-trained adapters. The paper finds that DPO acts as a more conservative optimizer, producing smaller capability shifts than GRPO in both directions.

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Method | DPO with LoRA (r=16, alpha=32) |
| Dataset | HuggingFaceH4/ultrafeedback_binarized (10K subset) |
| Epochs | 1 |
| Learning rate | 5e-7 (cosine schedule) |
| DPO beta | 0.1 |
| Max seq length | 768 |
| Precision | bf16 |
| Hardware | NVIDIA RTX 5090 (32 GB) |
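For context on the `DPO beta 0.1` setting, the standard DPO objective is `-log σ(β · margin)`, where the margin compares policy-vs-reference log-probability gaps on the chosen and rejected completions. A minimal, framework-free sketch (the scalar per-example log-probs are hypothetical inputs, not values from this training run):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    margin = (policy - reference) log-prob gap on the chosen completion
             minus the same gap on the rejected completion.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization (policy == reference) the margin is 0, so the loss is log 2
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # → 0.6931
```

A smaller beta (such as the 0.1 used here) flattens the sigmoid, penalizing divergence from the reference model more gently — consistent with the paper's framing of DPO as the more conservative optimizer.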

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model, then attach the DPO LoRA adapter on top
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/grpo-tax-qwen-1.5b-dpo")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Generate with the model's chat template
messages = [{"role": "user", "content": "Explain LoRA in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```
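Conceptually, the adapter stores two low-rank matrices per targeted weight, and inference applies `W_eff = W + (alpha / r) * B @ A`. A toy sketch of that arithmetic with plain Python lists (illustrative only — not the PEFT internals, and the toy dimensions are far smaller than r=16):

```python
def lora_effective_weight(W, A, B, r, alpha):
    """Compute W + (alpha / r) * (B @ A) for list-of-lists matrices.

    W: (rows x cols) base weight; B: (rows x r); A: (r x cols).
    With r=16, alpha=32 (this adapter's config) the scale is 2.0.
    """
    scale = alpha / r
    rows, cols = len(W), len(W[0])
    delta = [[scale * sum(B[i][k] * A[k][j] for k in range(r))
              for j in range(cols)] for i in range(rows)]
    return [[W[i][j] + delta[i][j] for j in range(cols)] for i in range(rows)]

# Toy example: identity B and A with r=2, alpha=4 adds 2*I to W
print(lora_effective_weight(
    [[1.0, 0.0], [0.0, 1.0]],   # W
    [[1.0, 0.0], [0.0, 1.0]],   # A
    [[1.0, 0.0], [0.0, 1.0]],   # B
    r=2, alpha=4,
))  # → [[3.0, 0.0], [0.0, 3.0]]
```

If you want a standalone model without the PEFT dependency at inference time, `model.merge_and_unload()` folds this update into the base weights.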

## Related Resources

| Resource | Link |
|---|---|
| Paper | Coming soon (TMLR submission) |
| All evaluation data | usama10/grpo-tax-eval-data |
| Source code | github.com/usama10/grpo-capability-tax |
| GRPO adapters | qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b |
