gemma-4-31B-uncensored-heretic · MLX 4-bit

MLX conversion of llmfan46/gemma-4-31B-it-uncensored-heretic, a fine-tune of Google's Gemma 4 31B Instruct. Quantized to 4 bits per weight using mlx-vlm v0.4.3 on Apple Silicon.

If you have enough RAM, the Q8 version offers near-lossless quality.

Performance on Apple M4 Max · 128 GB

  • Peak memory: ~29 GB
  • Prompt throughput: ~39.9 tok/s
  • Generation speed: ~16.9 tok/s

Requirements

pip install -U mlx-vlm

Gemma 4 support requires mlx-vlm >= 0.4.3. Standard mlx-lm does not yet support the gemma4 architecture.
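If you want to fail fast instead of hitting an architecture error mid-load, a minimal runtime check looks like this (a sketch; it assumes plain `major.minor.patch` version strings with no pre-release suffixes):

```python
from importlib.metadata import version

def supports_gemma4(installed: str, minimum: str = "0.4.3") -> bool:
    """True if a dotted version string meets the minimum (no pre-release handling)."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

# Raise early, before any weights are downloaded:
# assert supports_gemma4(version("mlx-vlm")), "upgrade mlx-vlm to >= 0.4.3"
```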

Usage

Text only

python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
  --prompt "Your prompt here" \
  --max-tokens 512

With image

python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
  --prompt "Describe this image." \
  --image path/to/image.jpg \
  --max-tokens 512

Python API

from mlx_vlm import load, generate

model, processor = load("TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit")

response = generate(
    model,
    processor,
    prompt="Your prompt here",
    max_tokens=512,
    temperature=0.7,
)
print(response)
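The same Python API handles images. This sketch follows the usage pattern from the mlx-vlm README (apply_chat_template formats the prompt for the model's chat template); exact signatures can shift between mlx-vlm versions, so check the version you have installed. It is not run here since it requires Apple Silicon and the downloaded weights.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Format the prompt for the model's chat template, declaring one image slot.
images = ["path/to/image.jpg"]
prompt = apply_chat_template(
    processor, config, "Describe this image.", num_images=len(images)
)

response = generate(model, processor, prompt, images, max_tokens=512)
print(response)
```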

Which version should I use?

Precision         Peak RAM   Gen speed     Quality
BF16 (full)       ~62 GB     slowest       reference
Q8                ~34 GB     ~14.5 tok/s   near-lossless
Q4 (this model)   ~29 GB     ~16.9 tok/s   good

Q4 is the recommended version for machines with at least 32 GB of unified memory (e.g. M2/M3 Pro, M1 Max, M3 Max).
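The table above can be folded into a small helper. The thresholds here are my own rough choices, leaving several GB of headroom over the peak figures; adjust them for your workload:

```python
def pick_precision(unified_memory_gb: float) -> str:
    """Suggest a quantization tier from the comparison table (approximate)."""
    if unified_memory_gb >= 72:
        return "BF16"  # ~62 GB peak
    if unified_memory_gb >= 40:
        return "Q8"    # ~34 GB peak
    if unified_memory_gb >= 32:
        return "Q4"    # ~29 GB peak
    return "too little memory for the 31B model"
```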

Notes

  • The model activates Gemma 4's thinking channel (<|channel|>thought) on reasoning-heavy prompts; this is expected behaviour.
  • The mel filter warning on load is harmless; it relates to the audio encoder and does not affect text or vision inference.
  • Unofficial community conversion. For the original fine-tune see llmfan46/gemma-4-31B-it-uncensored-heretic.
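If you want only the final answer, the thinking-channel segment can be stripped after generation. This is a sketch that assumes the segment runs from a <|channel|>thought marker to the next <|channel|>final marker (or end of text); verify the exact marker syntax against your build's tokenizer config before relying on it:

```python
import re

# Assumed marker syntax: <|channel|>thought ... <|channel|>final ...
_THOUGHT_BLOCK = re.compile(
    r"<\|channel\|>thought.*?(?=<\|channel\|>final|\Z)", re.DOTALL
)

def visible_text(response: str) -> str:
    """Drop the thinking-channel segment and any remaining channel tag."""
    without_thoughts = _THOUGHT_BLOCK.sub("", response)
    return without_thoughts.replace("<|channel|>final", "").strip()
```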

Conversion

python -m mlx_vlm convert \
  --hf-path llmfan46/gemma-4-31B-it-uncensored-heretic \
  --mlx-path ./gemma-4-31B-uncensored-heretic-mlx-4bit \
  --quantize --q-bits 4

Credits

  • Google DeepMind — Gemma 4 base model
  • llmfan46 — uncensored-heretic fine-tune
  • ml-explore — MLX framework
  • Blaizzy — mlx-vlm library