# Gemma-4-96E-A4B-Heretic-TQ GGUF
TurboQuant `TQ4_1S` and `TQ3_1S` GGUF exports of:

- Original model: blascotobasco/Gemma-4-96E-A4B-Heretic

This repo contains:

- `Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf`
- `Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf`
- `chat_template.jinja`
- `additional_chat_templates/standard.jinja`
- `requant_recipe_tq4_1s.txt`
- `requant_recipe_tq3_1s.txt`
## Quantization
These checkpoints were produced with the following flow:

- Convert the original Hugging Face checkpoint to GGUF `Q8_0`
- Requantize the `Q8_0` GGUF with TurboQuant weight quantization to `TQ4_1S` and `TQ3_1S`
- Inject both Gemma 4 chat templates into the GGUF metadata:
  - default: `interleaved`
  - named variant: `standard`
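As a rough sketch of the first two steps, assuming the upstream llama.cpp conversion script and the fork's `llama-quantize` accepting the TurboQuant type names (all paths are illustrative, and the exact recipe-file wiring is fork-specific):

```shell
# 1. HF checkpoint -> GGUF Q8_0 (convert_hf_to_gguf.py ships with llama.cpp)
python convert_hf_to_gguf.py /path/to/Gemma-4-96E-A4B-Heretic \
    --outtype q8_0 \
    --outfile Gemma-4-96E-A4B-Heretic-Q8_0.gguf

# 2. Requantize to the TurboQuant types (fork-specific; the tensor
#    override recipes in this repo are assumed to drive this pass)
./llama-quantize Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
    Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf TQ4_1S
./llama-quantize Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
    Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf TQ3_1S
```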
Output files:

- `Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf`
  - Size: 13,985,278,688 bytes (~13.02 GiB)
- `Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf`
  - Size: 12,589,894,368 bytes (~11.73 GiB)
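The GiB figures are just the byte counts divided by 2^30; a quick sanity check:

```shell
# Convert the published byte counts to GiB (1 GiB = 2**30 bytes)
for size in 13985278688 12589894368; do
  awk -v b="$size" 'BEGIN { printf "%d bytes = %.2f GiB\n", b, b / (2 ^ 30) }'
done
```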
## Important runtime note
At the moment, this model should be used with the following fork of llama.cpp, which contains the Gemma 4 + TurboQuant runtime fixes required for correct Apple Metal execution:
In particular, this fork includes the fixes that were needed to make Gemma 4 TQ4_1S weight quantization and TurboQuant KV-cache inference behave correctly on Apple Metal.
## Recommended launch commands
The checked-in GGUF files embed both upstream Gemma 4 templates. The default embedded template is the interleaved variant, so no --chat-template-file argument is required for tool-heavy or agentic usage.
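For context on how both templates coexist in one file: llama.cpp stores the default chat template under the GGUF metadata key `tokenizer.chat_template` and named variants under `tokenizer.chat_template.<name>`. A minimal sketch of that naming convention (the helper function is illustrative, not part of llama.cpp):

```shell
# GGUF metadata key convention used by llama.cpp for chat templates:
# default template at "tokenizer.chat_template", named variants (like
# the "standard" template in this repo) at "tokenizer.chat_template.<name>".
chat_template_key() {
  if [ -z "$1" ]; then
    echo "tokenizer.chat_template"
  else
    echo "tokenizer.chat_template.$1"
  fi
}

chat_template_key            # default (the interleaved template here)
chat_template_key standard   # named variant
```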
### TQ4_1S
Recommended balanced runtime recipe:
```bash
/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none
```
### TQ3_1S
Recommended conservative runtime recipe. Since `TQ3_1S` is the more aggressive weight quantization, start with an uncompressed KV cache for the cleanest baseline:
```bash
/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv q8_0 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none
```
After validating output quality, you can also try `-ctv turbo4` on `TQ3_1S` if you want lower KV-cache memory usage.
## Optional standard template
Upstream llama.cpp also ships the standard non-interleaved Gemma 4 template. This repo includes it as:
`additional_chat_templates/standard.jinja`
Use it explicitly if you want the classic Gemma 4 formatting instead of the interleaved default:
```bash
/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --chat-template-file /path/to/additional_chat_templates/standard.jinja \
  --reasoning off \
  --reasoning-format none
```
## Notes
- `chat_template.jinja` is the current upstream `google-gemma-4-31B-it-interleaved` template and is the default embedded template in the checked-in TQ GGUF files
- `additional_chat_templates/standard.jinja` is the current upstream `google-gemma-4-31B-it` template
- `requant_recipe_tq4_1s.txt` is the tensor override recipe used for the `TQ4_1S` weight requantization pass
- `requant_recipe_tq3_1s.txt` is the tensor override recipe used for the `TQ3_1S` weight requantization pass
- This release is intended for use with the runtime fork linked above until equivalent support is available upstream
## Credits
- Base model author: blascotobasco
- TurboQuant runtime / GGUF work based on `llama.cpp` and the TurboQuant fork linked above
## Model tree

Model tree for WaveCut/Gemma-4-96E-A4B-Heretic-TQ:

- Base model: google/gemma-4-26B-A4B-it