gemma4-prometheus-gptq-4bit

GPTQ 4-bit quantized version of groxaxo/gemma4-prometheus-merged (Prometheus-steered google/gemma-4-31B-it).
Quantized with gptqmodel v5.8.0. 69.3% size reduction: 58 GiB → 17.9 GiB.

Related repositories

| Repo | Description |
|---|---|
| groxaxo/gemma4-prometheus-merged | Full BF16 source model |
| groxaxo/gemma4-prometheus-workflow | Reproducible scripts, config, and checkpoint journal |
| groxaxo/gemma4-prometheus-fixes | All local patches applied to make this work |
| google/gemma-4-31B-it | Original base model |

Quantization details

| Parameter | Value |
|---|---|
| Bits | 4 |
| Group size | 128 |
| Format | GPTQ |
| Symmetric | Yes |
| desc_act | No |
| Size (disk) | 17.91 GiB (5 shards) |
| Reduction | 69.3% vs BF16 merged |
| Tool | gptqmodel 5.8.0 |
| Calibration | 16 samples (8 benign + 8 adversarial) |

How to run

Requirements

  • 1–2 × GPU with ≥ 20 GiB total VRAM (a single 24 GB GPU works)
  • gptqmodel >= 5.8.0 with the Gemma4 patch applied (see the patches section)

Install

```shell
pip install "gptqmodel>=5.8.0"
```

(The quotes are required: an unquoted `>=5.8.0` is interpreted by the shell as an output redirection.)

Note: The standard gptqmodel package does not include Gemma4 support. Apply the patches from groxaxo/gemma4-prometheus-fixes before loading this model. The patches add Gemma4QModel and fix the alternating rotary-embedding shape mismatch.

Inference

```python
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

from gptqmodel import GPTQModel
from transformers import AutoTokenizer
import torch

model_id = "groxaxo/gemma4-prometheus-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTQModel.load(
    model_id,
    device_map="auto",          # single GPU or multi-GPU pipeline
    # max_memory={0: "22GiB"},  # uncomment to set per-GPU budget
)
model.eval()

messages = [{"role": "user", "content": "Explain gradient descent."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,       # suppress chain-of-thought tokens
)
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    out = model.generate(
        ids,
        max_new_tokens=512,
        do_sample=False,
        temperature=None,
        top_p=None,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```

Two-GPU pipeline parallel (2 × 24 GiB)

```python
model = GPTQModel.load(
    model_id,
    device_map="balanced",
    max_memory={0: "22GiB", 1: "22GiB"},
)
```

Evaluation results

All tests run on 2 × RTX 3090 (24 GB each) in pipeline-parallel mode. True tensor parallelism (TP=2) requires vLLM, which does not yet support the gemma4 architecture natively.

Coherence test (5/5 passed ✅)

| Prompt | Response excerpt |
|---|---|
| Explain how neural networks learn from data. | "…a neural network learns by trial and error. It makes a guess, finds out how wrong that guess was, and then adjusts its internal settings…" |
| Supervised vs unsupervised learning? | "…In supervised learning, the data is 'labeled' (it has an answer key)…In unsupervised learning, the data is 'unlabeled'…" |
| Gradient descent? | "…Gradient Descent is an optimization algorithm used to minimize a function…the 'engine' used to train models by minimizing the Cost Function…" |
| Transformers in NLP? | "…a Transformer is a deep learning architecture designed to process sequential data…focusing on the most important parts of the input, regardless of how far apart they are…" |
| What is quantization? | "…quantization is the process of reducing the precision of the numbers used to represent a neural network's weights and activations…" |

Context length (2 × RTX 3090, GPTQ-4bit, no flash-attn)

| KV Cache | Max Tokens | Bottleneck |
|---|---|---|
| FP16 | 6 144 | Attention compute, O(n²) |
| FP8 (software) | 6 144 | Same (attention matrix dominates) |

Without flash-attn, the bottleneck is the attention matrix (O(n²) per layer), not KV-cache storage. An FP8 KV cache does not help here.
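To see why the attention matrix dominates, a rough back-of-envelope calculation (assuming the 32 query heads listed in the architecture notes below and a naively materialized fp16 score matrix; a fused kernel may never materialize it in full):

```python
# Approximate size of one layer's fully materialized attention score matrix.
# Assumptions: 32 query heads (see architecture notes), fp16 scores.
def attn_matrix_bytes(n_tokens: int, n_heads: int = 32, bytes_per_elem: int = 2) -> int:
    return n_tokens * n_tokens * n_heads * bytes_per_elem

gib = attn_matrix_bytes(6_144) / 2**30
print(f"{gib:.2f} GiB per layer at 6 144 tokens")  # doubles the context -> 4x the memory
```

At roughly 2 GiB per layer just for scores, a few deep layers in flight already exhaust a 24 GB card, which matches the observed ~6 144-token ceiling.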

With flash-attn installed (estimated):

| KV Cache | Estimated Max Tokens |
|---|---|
| FP16 | ~113 000 |
| FP8 | ~226 000 |

Recommendation: pip install flash-attn --no-build-isolation to unlock much longer contexts. The model supports up to 262 144 tokens.

Perplexity (WikiText-2)

| Model | PPL | Notes |
|---|---|---|
| Merged (BnB-8bit reference) | 1782.3 | Chat model on raw text; PPL magnitude expected to be high |
| GPTQ-4bit (this model) | 1815.8 | +1.9% vs reference |

ΔPPL = +1.9% is the meaningful signal. The absolute values are high because instruction-tuned models trained on chat data have poor raw-text likelihood.
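The quoted +1.9% is simply the relative change between the two table rows:

```python
ref, quant = 1782.3, 1815.8  # WikiText-2 PPL values from the table above
delta_pct = (quant - ref) / ref * 100
print(f"ΔPPL = +{delta_pct:.1f}%")
```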

KL divergence (this model vs merged reference)

| Metric | Value |
|---|---|
| Direction | KL(merged_bnb8 ‖ gptq_4bit) |
| Mean KL | 4.77 nats |
| Std KL | 3.65 nats |
| Prompts | 8 ML-domain questions |
| Vocab comparison | Top-1000 tokens |

A mean KL of ~4.77 nats is a typical result for 4-bit GPTQ on a 30B-class model. Part of the divergence is attributable to quantization noise in the bnb-8bit reference itself; the true KL vs FP16 would be slightly lower.
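For reference, a per-prompt KL over truncated top-k token distributions can be computed as below. This is a minimal sketch, not the actual evaluation script (which lives in the workflow repo); renormalizing the top-1000 mass to 1 before comparing is an assumption about the methodology.

```python
import math

def topk_kl(p: dict, q: dict, eps: float = 1e-10) -> float:
    """KL(p || q) in nats over the union of two truncated token distributions.

    p and q map token ids to probabilities (e.g. top-1000 softmax mass);
    each side is renormalized so its truncated mass sums to 1.
    """
    keys = set(p) | set(q)
    zp = sum(p.get(k, 0.0) for k in keys) or 1.0
    zq = sum(q.get(k, 0.0) for k in keys) or 1.0
    kl = 0.0
    for k in keys:
        pi = p.get(k, eps) / zp
        qi = q.get(k, eps) / zq
        kl += pi * math.log(pi / qi)
    return kl

# Identical distributions give KL = 0; divergence grows as q drifts from p.
same = {1: 0.7, 2: 0.3}
drift = {1: 0.3, 2: 0.7}
print(topk_kl(same, same), topk_kl(same, drift))
```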


Architecture notes (Gemma4 quirks)

| Feature | Detail |
|---|---|
| Text layers | 60, alternating sliding-window / full attention |
| Sliding attention | window=1024, 16 KV heads, head_dim=256 |
| Global attention | 4 KV heads, head_dim=512 |
| GQA | 32 query heads |
| Max position | 262 144 tokens |
| VLM wrapper | Vision tower present; text-only inference supported |

Why layer_modules_strict = False: sliding-window attention layers omit v_proj, so a strict module check would fail. The flag allows partial matches.
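The effect of the flag can be illustrated with a toy matcher (a hypothetical helper for illustration, not gptqmodel's actual implementation):

```python
def select_quant_modules(expected, present, strict=True):
    """Toy module check: strict mode demands every expected projection
    exist in the layer; lenient mode quantizes whatever is present."""
    found = [m for m in expected if m in present]
    if strict and len(found) != len(expected):
        raise ValueError(f"missing modules: {set(expected) - set(present)}")
    return found

expected = ["q_proj", "k_proj", "v_proj", "o_proj"]
sliding_layer = ["q_proj", "k_proj", "o_proj"]  # v_proj absent, per the note above

print(select_quant_modules(expected, sliding_layer, strict=False))  # partial match OK
# strict=True would raise on the missing v_proj
```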

Rotary embedding fix: Gemma4 alternates sliding_attention (head_dim=256) and full_attention (head_dim=512). gptqmodel cached the first layer's position_embeddings and replayed them for all layers, causing a shape mismatch at the first global-attention layer (layer 5). The fix regenerates position_embeddings per layer using the correct layer_type.
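The mismatch is easy to see from RoPE's inverse-frequency table, whose length is head_dim/2 (standard RoPE formulation; the per-layer head_dims are taken from the architecture table above):

```python
def rope_inv_freq(head_dim: int, base: float = 10_000.0):
    """Standard RoPE inverse frequencies: one entry per rotated dimension pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Gemma4 alternates the two layer types, so cos/sin caches built for a
# sliding layer cannot be replayed on a full-attention layer.
sliding = rope_inv_freq(256)  # sliding_attention, head_dim=256 -> 128 freqs
full = rope_inv_freq(512)     # full_attention,    head_dim=512 -> 256 freqs
print(len(sliding), len(full))
```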


Patches required

See groxaxo/gemma4-prometheus-fixes for full patch diffs and instructions.

Required patches to gptqmodel:

  1. gptqmodel/models/definitions/gemma4.py – new Gemma4QModel class
  2. gptqmodel/models/auto.py – "gemma4" -> Gemma4QModel mapping
  3. gptqmodel/looper/module_looper.py – free-memory device scheduling + per-layer rotary fix

Quantization config

```json
{
  "bits": 4,
  "group_size": 128,
  "format": "gptq",
  "desc_act": false,
  "sym": true,
  "quant_method": "gptq"
}
```
