gemma-4-31B-uncensored-heretic · MLX 4-bit

MLX conversion of llmfan46/gemma-4-31B-it-uncensored-heretic, a fine-tune of Google's Gemma 4 31B Instruct. Quantized to 4 bits per weight using mlx-vlm v0.4.3 on Apple Silicon.

If you have enough RAM, the Q8 version offers near-lossless quality.

Performance on Apple M4 Max · 128 GB

  • Peak memory: ~29 GB
  • Prompt throughput: ~39.9 tok/s
  • Generation speed: ~16.9 tok/s

Requirements

pip install -U mlx-vlm

Gemma 4 support requires mlx-vlm >= 0.4.3. Standard mlx-lm does not yet support the gemma4 architecture.
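If you want to fail fast instead of hitting an architecture error mid-load, a minimal runtime check looks like this (a sketch; it assumes plain `major.minor.patch` version strings with no pre-release suffixes):

```python
from importlib.metadata import version

def supports_gemma4(installed: str, minimum: str = "0.4.3") -> bool:
    """True if a dotted version string meets the minimum (no pre-release handling)."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

# Raise early, before any weights are downloaded:
# assert supports_gemma4(version("mlx-vlm")), "upgrade mlx-vlm to >= 0.4.3"
```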

Usage

Text only

python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
  --prompt "Your prompt here" \
  --max-tokens 512

With image

python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
  --prompt "Describe this image." \
  --image path/to/image.jpg \
  --max-tokens 512

Python API

from mlx_vlm import load, generate

model, processor = load("TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit")

response = generate(
    model,
    processor,
    prompt="Your prompt here",
    max_tokens=512,
    temperature=0.7,
)
print(response)
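The same Python API handles images. This sketch follows the usage pattern from the mlx-vlm README (apply_chat_template formats the prompt for the model's chat template); exact signatures can shift between mlx-vlm versions, so check the version you have installed. It is not run here since it requires Apple Silicon and the downloaded weights.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Format the prompt for the model's chat template, declaring one image slot.
images = ["path/to/image.jpg"]
prompt = apply_chat_template(
    processor, config, "Describe this image.", num_images=len(images)
)

response = generate(model, processor, prompt, images, max_tokens=512)
print(response)
```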

Which version should I use?

Precision         Peak RAM   Gen speed     Quality
BF16 (full)       ~62 GB     slowest       reference
Q8                ~34 GB     ~14.5 tok/s   near-lossless
Q4 (this model)   ~29 GB     ~16.9 tok/s   good

Q4 is the recommended version for machines with at least 32 GB of unified memory (e.g. M2/M3 Pro, M1 Max, M3 Max).
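The table above can be folded into a small helper. The thresholds here are my own rough choices, leaving several GB of headroom over the peak figures; adjust them for your workload:

```python
def pick_precision(unified_memory_gb: float) -> str:
    """Suggest a quantization tier from the comparison table (approximate)."""
    if unified_memory_gb >= 72:
        return "BF16"  # ~62 GB peak
    if unified_memory_gb >= 40:
        return "Q8"    # ~34 GB peak
    if unified_memory_gb >= 32:
        return "Q4"    # ~29 GB peak
    return "too little memory for the 31B model"
```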

Notes

  • The model activates Gemma 4's thinking channel (<|channel|>thought) on reasoning-heavy prompts; this is expected behaviour.
  • The mel filter warning on load is harmless; it relates to the audio encoder and does not affect text or vision inference.
  • Unofficial community conversion. For the original fine-tune see llmfan46/gemma-4-31B-it-uncensored-heretic.
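If you want only the final answer, the thinking-channel segment can be stripped after generation. This is a sketch that assumes the segment runs from a <|channel|>thought marker to the next <|channel|>final marker (or end of text); verify the exact marker syntax against your build's tokenizer config before relying on it:

```python
import re

# Assumed marker syntax: <|channel|>thought ... <|channel|>final ...
_THOUGHT_BLOCK = re.compile(
    r"<\|channel\|>thought.*?(?=<\|channel\|>final|\Z)", re.DOTALL
)

def visible_text(response: str) -> str:
    """Drop the thinking-channel segment and any remaining channel tag."""
    without_thoughts = _THOUGHT_BLOCK.sub("", response)
    return without_thoughts.replace("<|channel|>final", "").strip()
```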

Conversion

python -m mlx_vlm convert \
  --hf-path llmfan46/gemma-4-31B-it-uncensored-heretic \
  --mlx-path ./gemma-4-31B-uncensored-heretic-mlx-4bit \
  --quantize --q-bits 4

Credits

  • Google DeepMind — Gemma 4 base model
  • llmfan46 — uncensored-heretic fine-tune
  • ml-explore — MLX framework
  • Blaizzy — mlx-vlm library