# GRPO Tax Study: qwen-1.5b DPO LoRA Adapter
This is a DPO-trained LoRA adapter released as part of the paper *The GRPO Tax is Smaller Than You Think: A Longitudinal Study of Capability Preservation During Reasoning Training*.

This adapter serves as a comparison baseline against the GRPO-trained adapters. The paper finds that DPO acts as a more conservative optimizer, producing smaller capability shifts than GRPO in both directions.
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Method | DPO with LoRA (r=16, alpha=32) |
| Dataset | HuggingFaceH4/ultrafeedback_binarized (10K subset) |
| Epochs | 1 |
| Learning rate | 5e-7 (cosine) |
| DPO beta | 0.1 |
| Max seq length | 768 |
| Precision | bf16 |
| Hardware | NVIDIA RTX 5090 (32GB) |
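For reference, DPO optimizes a logistic loss on the implicit reward margin between the chosen and rejected responses, scaled by β (0.1 here). A minimal pure-Python sketch of the per-pair loss, assuming summed sequence log-probabilities are already computed (function and argument names are illustrative, not from the training code):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are log-prob shifts relative to the frozen reference model
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))
```

At initialization (policy equals reference) the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the chosen response.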
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "usama10/grpo-tax-qwen-1.5b-dpo")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
```
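Once loaded, the adapter behaves like any chat model. A minimal generation sketch, continuing from the loading snippet above (the prompt is illustrative; `generate` settings are one reasonable choice, not the paper's evaluation configuration):

```python
# Format a single-turn conversation with the model's chat template
messages = [{"role": "user", "content": "Explain LoRA fine-tuning in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding, capped at 128 new tokens
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```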
## Related Resources
| Resource | Link |
|---|---|
| Paper | Coming soon (TMLR submission) |
| All evaluation data | `usama10/grpo-tax-eval-data` |
| Source code | [github.com/usama10/grpo-capability-tax](https://github.com/usama10/grpo-capability-tax) |
| GRPO adapters | qwen-1.5b, qwen-3b, phi-3.8b, gemma-2b, llama-3b |