Gemma4 Prometheus fixes

This repo documents the local patches that made the Gemma4 + Prometheus + GPTQ pipeline work on this machine.

What was fixed

  • Prometheus adapter targeting now resolves exact module paths instead of suffix matches.
  • Prometheus steering now dequantizes to FP16 on CUDA by default to avoid VRAM blowups.
  • GPTQModel now recognizes gemma4, uses a Gemma4 module tree, tolerates missing v_proj, and refreshes rotary position_embeddings per layer.
  • Quantization scheduling is free-memory-aware and does not rely on naive round-robin GPU assignment.
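The first fix above boils down to matching targeted modules by their full dotted path rather than by name suffix, so a target like `q_proj` cannot accidentally hit the same projection in every layer. A minimal sketch of the difference, using plain module-name strings (the function and names here are illustrative, not the actual patch):

```python
def resolve_targets(module_names, targets, exact=True):
    """Return the module names selected by `targets`.

    exact=True  -> a target must equal the full dotted path
    exact=False -> a target matches any module whose path ends with it
    """
    if exact:
        wanted = set(targets)
        return [name for name in module_names if name in wanted]
    return [
        name for name in module_names
        if any(name == t or name.endswith("." + t) for t in targets)
    ]


modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.1.self_attn.q_proj",
]

# Suffix matching hits q_proj in every layer...
print(resolve_targets(modules, ["q_proj"], exact=False))
# ...while exact matching selects only the one named module.
print(resolve_targets(modules, ["model.layers.0.self_attn.q_proj"], exact=True))
```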
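The scheduling change can be illustrated with a simple free-memory-aware placement policy. This is a hypothetical sketch, not the patch itself: the helper name and the byte figures are made up, and the real code would query the CUDA runtime for free memory rather than take a dict.

```python
def pick_device(free_bytes_by_gpu, needed_bytes):
    """Choose the GPU with the most free memory that still fits `needed_bytes`.

    free_bytes_by_gpu: dict mapping GPU index -> free bytes
    Returns the chosen GPU index, or None if no GPU fits.
    Contrast with round-robin assignment, which ignores actual free memory
    and can place a large layer on an already-full device.
    """
    candidates = [(free, idx) for idx, free in free_bytes_by_gpu.items()
                  if free >= needed_bytes]
    if not candidates:
        return None
    # Largest free memory first; ties broken by lowest GPU index.
    candidates.sort(key=lambda t: (-t[0], t[1]))
    return candidates[0][1]


# Hypothetical snapshot: GPU 0 is nearly full, GPU 1 has headroom.
free = {0: 2 * 2**30, 1: 20 * 2**30}
print(pick_device(free, 8 * 2**30))
```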

Related repos

  • Workflow and reproduction scripts: groxaxo/gemma4-prometheus-workflow
  • Merged model: groxaxo/gemma4-prometheus-merged
  • GPTQ model: groxaxo/gemma4-prometheus-gptq-4bit

Reproduce

Use the workflow repo with the conda environment described there. The workflow README includes the exact commands used for:

  1. Prometheus text-only inference
  2. merged-model export
  3. GPTQ quantization