# Gemma4 Prometheus fixes
This repo documents the local patches that made the Gemma4 + Prometheus + GPTQ pipeline work on this machine.
## What was fixed
- Prometheus adapter targeting now resolves exact module paths instead of suffix matches.
- Prometheus steering now dequantizes to FP16 on CUDA by default to avoid VRAM blowups.
- GPTQModel now recognizes `gemma4`, uses a Gemma4 module tree, tolerates a missing `v_proj`, and refreshes rotary `position_embeddings` per layer.
- Quantization scheduling is free-memory-aware and does not rely on naive round-robin GPU assignment.
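The adapter-targeting change can be illustrated with a minimal sketch: suffix matching can hit several modules that share a trailing name, while exact-path resolution returns exactly one. The `Module` stand-in and both helpers below are illustrative, not the actual patch:

```python
class Module:
    """Minimal stand-in for a framework module tree (illustrative only)."""
    def __init__(self, **children):
        self._children = children
        for name, child in children.items():
            setattr(self, name, child)

    def named_modules(self, prefix=""):
        """Yield (dotted_path, module) pairs for all descendants."""
        for name, child in self._children.items():
            path = f"{prefix}.{name}" if prefix else name
            yield path, child
            yield from child.named_modules(path)


def resolve_by_suffix(root, suffix):
    """Old behaviour: a suffix match can return several unrelated modules."""
    return [path for path, _ in root.named_modules() if path.endswith(suffix)]


def resolve_exact(root, path):
    """New behaviour: follow the exact dotted path to a single module."""
    node = root
    for part in path.split("."):
        node = getattr(node, part)
    return node
```

With two layers that each contain a `q_proj`, the suffix match returns both, while the exact path pins down one target.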
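The FP16-on-CUDA dequantization default comes down to a dtype choice: the dequantized steering copy is materialized in half precision on GPU (half the footprint of FP32), with FP32 as the CPU fallback. A minimal sketch, assuming a simple scale-only quantization scheme; the function names are hypothetical:

```python
import torch


def steering_dtype(device: str) -> torch.dtype:
    """FP16 on CUDA halves the footprint of the dequantized steering copy;
    CPU falls back to FP32 (illustrative of the new default)."""
    return torch.float16 if device.startswith("cuda") else torch.float32


def dequantize(qweight: torch.Tensor, scale: float, device: str = "cuda") -> torch.Tensor:
    """Materialize a full-precision view of a quantized weight for steering,
    in the device-appropriate dtype (the dtype choice is the point here)."""
    return qweight.to(steering_dtype(device)) * scale
```

A 4-bit or 8-bit weight cast to FP16 instead of FP32 halves the transient VRAM needed while steering is applied.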
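The free-memory-aware scheduling can be sketched as greedy placement: each layer goes to the GPU with the most free bytes, debiting that layer's cost, rather than cycling round-robin. The layer costs and free-memory figures below are illustrative:

```python
def pick_device(free_bytes: list[int]) -> int:
    """Greedy choice: the GPU index with the most free memory."""
    return max(range(len(free_bytes)), key=free_bytes.__getitem__)


def schedule(layer_costs: list[int], free_bytes: list[int]) -> list[int]:
    """Assign each layer in order to the freest GPU, debiting its cost
    (a sketch of the idea; the real scheduler also tracks live reservations)."""
    free = list(free_bytes)
    placement = []
    for cost in layer_costs:
        gpu = pick_device(free)
        free[gpu] -= cost
        placement.append(gpu)
    return placement
```

With GPUs holding 10 and 6 free units, three 4-unit layers land as `[0, 0, 1]`, whereas round-robin would blindly produce `[0, 1, 0]` regardless of headroom. In practice the per-GPU free figures would come from a probe such as `torch.cuda.mem_get_info`.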
## Related repos
- Workflow and reproduction scripts: groxaxo/gemma4-prometheus-workflow
- Merged model: groxaxo/gemma4-prometheus-merged
- GPTQ model: groxaxo/gemma4-prometheus-gptq-4bit
## Reproduce
Use the workflow repo with the same conda env described there. The workflow README includes the exact commands used for:
- Prometheus text-only inference
- merged-model export
- GPTQ quantization