Gemma-4-96E-A4B-Heretic-TQ GGUF

TurboQuant TQ4_1S and TQ3_1S GGUF exports of blascotobasco's Gemma-4-96E-A4B-Heretic.

This repo contains:

  • Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf
  • Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf
  • chat_template.jinja
  • additional_chat_templates/standard.jinja
  • requant_recipe_tq4_1s.txt
  • requant_recipe_tq3_1s.txt

Quantization

These checkpoints were produced with the following flow:

  1. Convert the original Hugging Face checkpoint to GGUF Q8_0
  2. Requantize the Q8_0 GGUF with TurboQuant weight quantization to TQ4_1S and TQ3_1S
  3. Inject both Gemma 4 chat templates into the GGUF metadata:
    • default: interleaved
    • named variant: standard
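
The three steps above can be sketched with llama.cpp's stock tooling. The script and binary names below are the upstream ones (convert_hf_to_gguf.py, llama-quantize, gguf_new_metadata.py); the TQ4_1S/TQ3_1S type names and the recipe handling are assumed to come from the TurboQuant fork, so treat this as an illustrative outline rather than the exact commands used:

```shell
# 1. HF checkpoint -> GGUF at Q8_0 (upstream llama.cpp converter)
python convert_hf_to_gguf.py /path/to/Gemma-4-96E-A4B-Heretic \
  --outfile Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
  --outtype q8_0

# 2. Requantize the Q8_0 GGUF to the TurboQuant types
#    (assumes the fork's llama-quantize accepts the TQ type names;
#    the requant_recipe_*.txt files in this repo hold the per-tensor
#    overrides applied in this pass)
./llama-quantize Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
  Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf TQ4_1S
./llama-quantize Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
  Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf TQ3_1S

# 3. Inject the default (interleaved) chat template into the metadata;
#    the named "standard" variant is stored alongside it
python gguf_new_metadata.py Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  Gemma-4-96E-A4B-Heretic-TQ4_1S.out.gguf \
  --chat-template "$(cat chat_template.jinja)"
```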

Output files:

  • Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf: 13,985,278,688 bytes (~13.02 GiB)
  • Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf: 12,589,894,368 bytes (~11.73 GiB)
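
The byte counts convert to the binary-GiB figures as follows (pure arithmetic, 1 GiB = 2^30 bytes):

```python
# Convert the quoted on-disk sizes from bytes to GiB (1 GiB = 2**30 bytes)
def bytes_to_gib(n: int) -> float:
    return n / 2**30

sizes = {
    "TQ4_1S": 13_985_278_688,
    "TQ3_1S": 12_589_894_368,
}
for name, n in sizes.items():
    print(f"{name}: {bytes_to_gib(n):.2f} GiB")
# -> TQ4_1S: 13.02 GiB
# -> TQ3_1S: 11.73 GiB
```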

Important runtime note

At the moment, this model should be used with the fork of llama.cpp that contains the Gemma 4 + TurboQuant runtime fixes required for correct Apple Metal execution.

In particular, this fork includes the fixes that were needed to make Gemma 4 TQ4_1S weight quantization and TurboQuant KV-cache inference behave correctly on Apple Metal.

Recommended launch commands

The checked-in GGUF files embed both upstream Gemma 4 templates. The default embedded template is the interleaved variant, so no --chat-template-file argument is required for tool-heavy or agentic usage.

TQ4_1S

Recommended balanced runtime recipe:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

TQ3_1S

Recommended conservative runtime recipe. Since TQ3_1S is the more aggressive weight quantization, start with uncompressed KV cache for the cleanest baseline:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv q8_0 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

After validating output quality, you can also try -ctv turbo4 on TQ3_1S if you want lower KV-cache memory usage.
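
To see why -ctv turbo4 is attractive on memory-constrained machines, here is a rough V-cache sizing sketch. The layer/head dimensions are hypothetical placeholders (this card does not publish the model internals), q8_0 is taken at its real cost of 34 bytes per 32 values (8.5 bits/element), and turbo4 is assumed at roughly 4.5 bits/element; only the relative cost matters, not the absolute numbers:

```python
# HYPOTHETICAL dimensions -- illustrative only
N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX = 48, 8, 128, 8192

def cache_bytes(bits_per_elt: float) -> float:
    # one cache (K or V): layers * ctx * kv_heads * head_dim elements
    return N_LAYERS * N_CTX * N_KV_HEADS * HEAD_DIM * bits_per_elt / 8

q8_0 = cache_bytes(8.5)    # q8_0: 34 bytes per 32-value block
turbo4 = cache_bytes(4.5)  # assumed ~4.5 bits/elt for a 4-bit type
print(f"V-cache q8_0:   {q8_0 / 2**20:.0f} MiB")    # -> 408 MiB
print(f"V-cache turbo4: {turbo4 / 2**20:.0f} MiB")  # -> 216 MiB
```

Under these assumptions the 4-bit V cache roughly halves that component of memory use, at the cost of the extra quantization noise the validation step above is meant to catch.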

Optional standard template

Upstream llama.cpp also ships the standard non-interleaved Gemma 4 template. This repo includes it as:

  • additional_chat_templates/standard.jinja

Use it explicitly if you want the classic Gemma 4 formatting instead of the interleaved default:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --chat-template-file /path/to/additional_chat_templates/standard.jinja \
  --reasoning off \
  --reasoning-format none
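
For readers unfamiliar with what --jinja / --chat-template-file actually do, the sketch below renders a toy template in the style of Gemma's turn markup with the jinja2 package. The real templates shipped in this repo (chat_template.jinja, standard.jinja) are considerably more involved; this only illustrates the message-list-to-prompt mapping:

```python
from jinja2 import Template

# Toy Gemma-style template: each message becomes a <start_of_turn> block,
# and add_generation_prompt opens the model's turn at the end.
TOY_TEMPLATE = (
    "{% for m in messages %}"
    "<start_of_turn>{{ 'model' if m.role == 'assistant' else m.role }}\n"
    "{{ m.content }}<end_of_turn>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"
)

prompt = Template(TOY_TEMPLATE).render(
    messages=[{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
print(prompt)
# -> <start_of_turn>user
#    Hello<end_of_turn>
#    <start_of_turn>model
```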

Notes

  • chat_template.jinja is the current upstream google-gemma-4-31B-it-interleaved template and is the default embedded template in the checked-in TQ GGUF files
  • additional_chat_templates/standard.jinja is the current upstream google-gemma-4-31B-it template
  • requant_recipe_tq4_1s.txt is the tensor override recipe used for the TQ4_1S weight requantization pass
  • requant_recipe_tq3_1s.txt is the tensor override recipe used for the TQ3_1S weight requantization pass
  • This release is intended for use with the runtime fork linked above until equivalent support is available upstream

Credits

  • Base model author: blascotobasco
  • TurboQuant runtime / GGUF work based on llama.cpp and the TurboQuant fork linked above
Model info

  • Format: GGUF
  • Size: 20B params
  • Architecture: gemma4