This model is a fine-tuned version of google/gemma-3-1b-it, trained with Direct Preference Optimization (DPO) on the ultrafeedback_binarized dataset.
It is intended for text generation with improved alignment: it retains the capabilities of the base Gemma 3 model while following human preferences more closely thanks to DPO training.
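To make the training objective concrete, here is an illustrative sketch of the DPO loss from Rafailov et al. (2023) for a single preference pair. This is not the exact training code used for this model; it uses toy scalar log-probabilities, whereas in practice these are sequence log-probabilities from the policy and a frozen reference copy of the base model, and `beta` is a training hyperparameter.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference, the margin is 0 and the
# loss is log(2) ~= 0.6931.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))
```

Minimizing this loss pushes the policy to assign a higher relative log-probability to the chosen response than to the rejected one, without ever training an explicit reward model.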
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gemma-3-1b-it-4bit-lora-dpo-aligned"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example inference: format the prompt with the chat template,
# since this is an instruction-tuned model.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
If you use this model, please cite the original Gemma model and the DPO paper:
@misc{gemma3,
  title={Gemma 3},
  author={{Google DeepMind}},
  year={2025}
}
@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Finn, Chelsea and Ermon, Stefano},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}