# gemma4-prometheus-gptq-4bit

GPTQ 4-bit quantized version of
groxaxo/gemma4-prometheus-merged (Prometheus-steered google/gemma-4-31B-it).
Quantized with gptqmodel v5.8.0. 69.3% size reduction: 58 GiB → 17.9 GiB.
## Related repositories
| Repo | Description |
|---|---|
| groxaxo/gemma4-prometheus-merged | Full BF16 source model |
| groxaxo/gemma4-prometheus-workflow | Reproducible scripts, config, and checkpoint journal |
| groxaxo/gemma4-prometheus-fixes | All local patches applied to make this work |
| google/gemma-4-31B-it | Original base model |
## Quantization details
| Parameter | Value |
|---|---|
| Bits | 4 |
| Group size | 128 |
| Format | GPTQ |
| Symmetric | Yes |
| desc_act | No |
| Size (disk) | 17.91 GiB (5 shards) |
| Reduction | 69.3% vs BF16 merged |
| Tool | gptqmodel 5.8.0 |
| Calibration | 16 samples (8 benign + 8 adversarial) |
## How to run

### Requirements

- 1–2 × GPU with ≥ 20 GiB total VRAM (a single 24 GB GPU works)
- `gptqmodel >= 5.8.0` with the Gemma4 patch applied (see the patches section)

### Install

```
pip install "gptqmodel>=5.8.0"
```

Note: the standard `gptqmodel` package does not include Gemma4 support. Apply the patches from groxaxo/gemma4-prometheus-fixes before loading this model. The patches add `Gemma4QModel` and fix the alternating rotary-embedding shape mismatch.
### Inference

```python
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "groxaxo/gemma4-prometheus-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTQModel.load(
    model_id,
    device_map="auto",          # single GPU or multi-GPU pipeline
    # max_memory={0: "22GiB"},  # uncomment to set a per-GPU budget
)
model.eval()

messages = [{"role": "user", "content": "Explain gradient descent."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # suppress chain-of-thought tokens
)
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=512,
        do_sample=False,
        temperature=None,
        top_p=None,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```
### Two-GPU pipeline parallel (2 × 24 GiB)

```python
model = GPTQModel.load(
    model_id,
    device_map="balanced",
    max_memory={0: "22GiB", 1: "22GiB"},
)
```
## Evaluation results

All tests were run on 2 × RTX 3090 (24 GB each) in pipeline-parallel mode.
True tensor parallelism (TP=2) requires vLLM, which does not yet support the
gemma4 architecture natively.
### Coherence test (5/5 passed ✅)

| Prompt | Response excerpt |
|---|---|
| Explain how neural networks learn from data. | "…a neural network learns by trial and error. It makes a guess, finds out how wrong that guess was, and then adjusts its internal settings…" |
| Supervised vs unsupervised learning? | "…In supervised learning, the data is 'labeled' (it has an answer key)…In unsupervised learning, the data is 'unlabeled'…" |
| Gradient descent? | "…Gradient Descent is an optimization algorithm used to minimize a function…the 'engine' used to train models by minimizing the Cost Function…" |
| Transformers in NLP? | "…a Transformer is a deep learning architecture designed to process sequential data…focusing on the most important parts of the input, regardless of how far apart they are…" |
| What is quantization? | "…quantization is the process of reducing the precision of the numbers used to represent a neural network's weights and activations…" |
### Context length (2 × RTX 3090, GPTQ-4bit, no flash-attn)

| KV cache | Max tokens | Bottleneck |
|---|---|---|
| FP16 | 6 144 | Attention compute, O(n²) |
| FP8 (software) | 6 144 | Same; the attention matrix dominates |

Without flash-attn, the bottleneck is the attention matrix (O(n²) per layer),
not KV cache storage, so an FP8 KV cache does not help here.
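To put rough numbers on this, a back-of-the-envelope sketch (my own arithmetic, assuming fp16 attention scores, one full n × n score matrix per head, and the head counts from the architecture notes below) compares the per-layer attention matrix with the per-layer KV cache at n = 6 144:

```python
# Rough per-layer memory at n = 6144 (assumptions: fp16 scores, one full
# n x n score matrix per query head, global-attention KV geometry).
n = 6144
bytes_fp16 = 2

q_heads = 32                                          # query heads (GQA)
attn_bytes = q_heads * n * n * bytes_fp16             # attention score matrix, O(n^2)

kv_heads, head_dim = 4, 512                           # global-attention layer
kv_bytes = 2 * kv_heads * n * head_dim * bytes_fp16   # K and V tensors, O(n)

print(f"scores:   {attn_bytes / 2**30:.2f} GiB")  # 2.25 GiB
print(f"KV cache: {kv_bytes / 2**30:.3f} GiB")    # 0.047 GiB
```

At this length the score matrices are already ~48× larger than the KV cache per layer, and the gap widens linearly with n, which is why shrinking the KV cache alone does not move the limit.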
With flash-attn installed (estimated):
| KV Cache | Estimated Max Tokens |
|---|---|
| FP16 | ~113 000 |
| FP8 | ~226 000 |
Recommendation: run `pip install flash-attn --no-build-isolation` to unlock much
longer contexts. The model supports up to 262 144 tokens.
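The roughly 2× jump from FP16 to FP8 in the table above follows directly from KV-cache arithmetic; a minimal sketch, using the global-attention head counts from the architecture notes:

```python
# KV-cache bytes per token for one layer: K and V, each kv_heads x head_dim.
def kv_bytes_per_token(kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    return 2 * kv_heads * head_dim * bytes_per_elem

fp16 = kv_bytes_per_token(4, 512, 2)  # FP16: 2 bytes per element
fp8 = kv_bytes_per_token(4, 512, 1)   # FP8: 1 byte per element
print(fp16 // fp8)  # 2: halving element size doubles tokens per GiB of cache
```

Once flash-attn removes the O(n²) score matrix, the KV cache becomes the binding constraint, so halving its element size roughly doubles the reachable context.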
### Perplexity (WikiText-2)
| Model | PPL | Notes |
|---|---|---|
| Merged (BnB-8bit reference) | 1782.3 | Chat model on raw text; a high PPL magnitude is expected |
| GPTQ-4bit (this model) | 1815.8 | +1.9% vs reference |

ΔPPL = +1.9% is the meaningful signal. The absolute values are high because instruction-tuned models trained on chat data have poor raw-text likelihood.
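For the record, the +1.9% figure is simply the relative change between the two table rows:

```python
# Relative perplexity change of the 4-bit model vs the 8-bit reference.
ppl_ref, ppl_gptq = 1782.3, 1815.8
delta = (ppl_gptq - ppl_ref) / ppl_ref
print(f"{delta:+.1%}")  # +1.9%
```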
### KL divergence (this model vs merged reference)
| Metric | Value |
|---|---|
| Direction | KL(merged_bnb8 ‖ gptq_4bit) |
| Mean KL | 4.77 nats |
| Std KL | 3.65 nats |
| Prompts | 8 ML-domain questions |
| Vocab comparison | Top-1000 tokens |
KL ≈ 4.77 nats is a typical result for 4-bit GPTQ on a 30B-class model. Part of the divergence is attributable to noise in the bnb-8bit reference itself; the true KL vs FP16 would be slightly lower.
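A minimal sketch of how such a comparison can be computed (this is an illustration of the top-k KL idea, not the script actually used here; `topk_kl` is a hypothetical helper): softmax both models' next-token logits over the full vocabulary, then sum p · log(p/q) over the reference model's top-k tokens.

```python
import torch
import torch.nn.functional as F

def topk_kl(ref_logits: torch.Tensor, q_logits: torch.Tensor, k: int = 1000) -> float:
    """KL(ref || quant) in nats, restricted to the reference's top-k tokens."""
    top = ref_logits.topk(k).indices
    log_p = F.log_softmax(ref_logits, dim=-1)[top]   # reference log-probs
    log_q = F.log_softmax(q_logits, dim=-1)[top]     # quantized log-probs
    return (log_p.exp() * (log_p - log_q)).sum().item()

# Sanity check: identical logits give zero divergence.
x = torch.randn(32000)
print(topk_kl(x, x))  # 0.0
```

The reported mean and std would then come from averaging this quantity over the 8 evaluation prompts.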
## Architecture notes (Gemma4 quirks)
| Feature | Detail |
|---|---|
| Text layers | 60, alternating sliding-window / full attention |
| Sliding attention | window=1024, 16 KV heads, head_dim=256 |
| Global attention | 4 KV heads, head_dim=512 |
| GQA | 32 query heads |
| Max position | 262 144 tokens |
| VLM wrapper | Vision tower present; text-only inference supported |
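One implication of the table worth spelling out: with 32 query heads, the GQA grouping differs by layer type. A small sketch, derived only from the numbers above:

```python
# GQA grouping implied by the architecture table.
q_heads = 32
sliding_kv_heads, global_kv_heads = 16, 4

print(q_heads // sliding_kv_heads)  # 2 query heads share each KV head (sliding)
print(q_heads // global_kv_heads)   # 8 query heads share each KV head (global)
```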
Why `layer_modules_strict = False`: sliding-window attention layers omit
`v_proj`, so a strict module check would fail. The flag allows partial matches.
Rotary embedding fix: Gemma4 alternates `sliding_attention` (head_dim=256)
and `full_attention` (head_dim=512) layers. gptqmodel cached the first layer's
`position_embeddings` and replayed them for all layers, causing a shape mismatch
at the first global-attention layer (layer 5). The fix regenerates
`position_embeddings` per layer using the correct `layer_type`.
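Schematically, the fix amounts to a per-layer dispatch instead of a cached replay. The names below are hypothetical (see groxaxo/gemma4-prometheus-fixes for the real diff):

```python
# Hypothetical sketch: pick the rotary module by each layer's type instead of
# replaying the first layer's cached embeddings for every layer.
def position_embeddings_for(layer_type, rope_sliding, rope_full, hidden, pos_ids):
    rope = rope_sliding if layer_type == "sliding_attention" else rope_full
    return rope(hidden, pos_ids)  # correct head_dim per layer (256 vs 512)
```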
## Patches required

See groxaxo/gemma4-prometheus-fixes for the full patch diffs and instructions.
Required patches to gptqmodel:

- `gptqmodel/models/definitions/gemma4.py` – new `Gemma4QModel` class
- `gptqmodel/models/auto.py` – `"gemma4" -> Gemma4QModel` mapping
- `gptqmodel/looper/module_looper.py` – free-memory device scheduling plus the per-layer rotary fix
## Quantization config

```json
{
  "bits": 4,
  "group_size": 128,
  "format": "gptq",
  "desc_act": false,
  "sym": true,
  "quant_method": "gptq"
}
```
## Citation / acknowledgements
- Base model: google/gemma-4-31B-it
- Steering: Prometheus (local)
- Quantization: gptqmodel v5.8.0