# gemma-4-31B-uncensored-heretic · MLX 4-bit
MLX conversion of llmfan46/gemma-4-31B-it-uncensored-heretic, a fine-tune of Google's Gemma 4 31B Instruct. Quantized to an effective ~7.4 bits per weight using mlx-vlm v0.4.3 on Apple Silicon.
If you have enough RAM, the Q8 version offers near-lossless quality.
## Performance on Apple M4 Max · 128 GB
- Peak memory: ~29 GB
- Prompt throughput: ~39.9 tok/s
- Generation speed: ~16.9 tok/s
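The throughput figures above give a back-of-envelope latency estimate (a sketch only; real wall-clock time also includes model load and sampling overhead):

```python
def generation_seconds(prompt_tokens: int, output_tokens: int,
                       prompt_tps: float = 39.9, gen_tps: float = 16.9) -> float:
    """Rough end-to-end latency: prompt processing time plus generation time."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps

# e.g. a 1,000-token prompt with a 512-token reply:
print(round(generation_seconds(1000, 512), 1))  # ~55.4 s
```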
## Requirements

```shell
pip install -U mlx-vlm
```

Gemma 4 support requires mlx-vlm >= 0.4.3. Standard `mlx-lm` does not yet support the `gemma4` architecture.
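To confirm the installed version meets the minimum, here is a small standard-library check (the `meets_minimum` helper is illustrative and assumes simple dotted version strings):

```python
from importlib.metadata import PackageNotFoundError, version

def meets_minimum(installed: str, required: str = "0.4.3") -> bool:
    """Numeric comparison of dotted version strings (so 0.10.0 > 0.4.3)."""
    def to_tuple(v: str) -> tuple:
        return tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

try:
    installed = version("mlx-vlm")
    status = "OK" if meets_minimum(installed) else "too old, run: pip install -U mlx-vlm"
    print(f"mlx-vlm {installed}: {status}")
except PackageNotFoundError:
    print("mlx-vlm is not installed")
```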
## Usage

### Text only
```shell
python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
  --prompt "Your prompt here" \
  --max-tokens 512
```
### With image
```shell
python -m mlx_vlm generate \
  --model TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit \
  --prompt "Describe this image." \
  --image path/to/image.jpg \
  --max-tokens 512
```
## Python API
```python
from mlx_vlm import load, generate

model, processor = load("TxemAI/gemma-4-31B-uncensored-heretic-mlx-4bit")

response = generate(
    model,
    processor,
    prompt="Your prompt here",
    max_tokens=512,
    temperature=0.7,
)
print(response)
```
## Which version should I use?
| Precision | Peak RAM | Gen speed | Quality |
|---|---|---|---|
| BF16 (full) | ~62 GB | slowest | reference |
| Q8 | ~34 GB | ~14.5 tok/s | near-lossless |
| Q4 (this model) | ~29 GB | ~16.9 tok/s | good |
Q4 is the recommended version for machines with 32 GB unified memory (M2/M3 Pro, M1 Max, M3 Max).
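The peak-RAM column closely tracks the weight footprint (parameters × bits per weight). A quick weights-only check against the table; peak usage additionally includes activations and KV cache:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight footprint in decimal GB: params × bits, divided by 8 bits per byte."""
    return params_billion * bits_per_weight / 8

print(round(weight_gb(31, 16), 1))   # BF16: 62.0 GB, matching the table
print(round(weight_gb(31, 7.4), 1))  # this conversion at ~7.4 effective bpw: ~28.7 GB
```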
## Notes
- The model activates Gemma 4's thinking channel (`<|channel>thought`) on reasoning-heavy prompts — this is expected behaviour.
- The mel filter warning on load is harmless; it relates to the audio encoder and does not affect text or vision inference.
- This is an unofficial community conversion. For the original fine-tune, see llmfan46/gemma-4-31B-it-uncensored-heretic.
## Conversion

```shell
python -m mlx_vlm convert \
  --hf-path llmfan46/gemma-4-31B-it-uncensored-heretic \
  --mlx-path ./gemma-4-31B-uncensored-heretic-mlx-4bit \
  --quantize --q-bits 4
```
## Credits
- Google DeepMind — Gemma 4 base model
- llmfan46 — uncensored-heretic fine-tune
- ml-explore — MLX framework
- Blaizzy — mlx-vlm library
## Model tree

Base model: google/gemma-4-31B-it