gemma-3-1b-it-4bit-lora-dpo-aligned

This model is a fine-tuned version of google/gemma-3-1b-it using Direct Preference Optimization (DPO) on the ultrafeedback_binarized dataset.

Training Details

  • Base Model: google/gemma-3-1b-it
  • Fine-tuning Method: DPO (Direct Preference Optimization)
  • Dataset: HuggingFaceH4/ultrafeedback_binarized
  • Training Samples: 1000
  • Evaluation Samples: 100
  • Epochs: 1
  • Batch Size: 1 (per device)
  • Gradient Accumulation: 4
  • Learning Rate: 5e-5
  • Beta (DPO): 0.1
  • Max Length: 1024
  • Optimizer: adamw_8bit
  • Precision: bfloat16
  • Quantization: 4-bit NF4 (during training)
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
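With Beta = 0.1, the DPO objective increases the log-probability margin between the chosen and rejected responses relative to the frozen reference model. A minimal sketch of the per-example loss in plain Python (illustrative only; the actual training used a DPO trainer, not this function):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * log-ratio margin))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp      # how much more the policy likes "chosen" than the reference does
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A positive margin (policy prefers the chosen response more than the
# reference does) drives the loss below log(2).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0)
```

When policy and reference agree exactly, the margin is zero and the loss equals log(2) ≈ 0.693; lower values indicate the preference gap has widened in the right direction.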

Memory Optimizations

  • Gradient Checkpointing
  • 8-bit AdamW Optimizer
  • Pre-computed Reference Log Probabilities
  • LoRA Parameter-Efficient Fine-tuning
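The quantization and LoRA settings listed above can be expressed as configuration objects from `transformers` and `peft`. This is a config sketch reconstructed from the table, not the exact training script:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with bfloat16 compute, as listed under Training Details
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA rank 16, alpha 32, and the target modules from the table
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```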

Intended Use

This model is intended for text generation tasks with improved alignment through DPO training. It maintains the capabilities of the base Gemma 3 model while being better aligned with human preferences.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "manu02/gemma-3-1b-it-4bit-lora-dpo-aligned"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example inference
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Limitations

  • This model inherits limitations from the base Gemma 3 model
  • DPO alignment may not cover all edge cases or preferences
  • Performance may vary on different hardware configurations

Citation

If you use this model, please cite the original Gemma model and the DPO paper:

@misc{gemma3,
  title={Gemma 3},
  author={Google DeepMind},
  year={2025}
}

@article{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Finn, Chelsea and Ermon, Stefano},
  journal={arXiv preprint arXiv:2305.18290},
  year={2023}
}