Gemma-4-96E-A4B-Heretic-TQ GGUF

TurboQuant TQ4_1S and TQ3_1S GGUF exports of blascotobasco's Gemma-4-96E-A4B-Heretic.

This repo contains:

  • Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf
  • Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf
  • chat_template.jinja
  • additional_chat_templates/standard.jinja
  • requant_recipe_tq4_1s.txt
  • requant_recipe_tq3_1s.txt

Quantization

These checkpoints were produced with the following flow:

  1. Convert the original Hugging Face checkpoint to GGUF Q8_0
  2. Requantize the Q8_0 GGUF with TurboQuant weight quantization to TQ4_1S and TQ3_1S
  3. Inject both Gemma 4 chat templates into the GGUF metadata:
    • default: interleaved
    • named variant: standard
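
The three steps above can be sketched with llama.cpp's stock tooling. The script and binary names below are the upstream ones (convert_hf_to_gguf.py, llama-quantize, gguf_new_metadata.py); the TQ4_1S/TQ3_1S type names and the recipe handling are assumed to come from the TurboQuant fork, so treat this as an illustrative outline rather than the exact commands used:

```shell
# 1. HF checkpoint -> GGUF at Q8_0 (upstream llama.cpp converter)
python convert_hf_to_gguf.py /path/to/Gemma-4-96E-A4B-Heretic \
  --outfile Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
  --outtype q8_0

# 2. Requantize the Q8_0 GGUF to the TurboQuant types
#    (assumes the fork's llama-quantize accepts the TQ type names;
#    the requant_recipe_*.txt files in this repo hold the per-tensor
#    overrides applied in this pass)
./llama-quantize Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
  Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf TQ4_1S
./llama-quantize Gemma-4-96E-A4B-Heretic-Q8_0.gguf \
  Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf TQ3_1S

# 3. Inject the default (interleaved) chat template into the metadata;
#    the named "standard" variant is stored alongside it
python gguf_new_metadata.py Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  Gemma-4-96E-A4B-Heretic-TQ4_1S.out.gguf \
  --chat-template "$(cat chat_template.jinja)"
```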

Output files:

  • Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf: 13,985,278,688 bytes (~13.02 GiB)
  • Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf: 12,589,894,368 bytes (~11.73 GiB)
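
The byte counts convert to the binary-GiB figures as follows (pure arithmetic, 1 GiB = 2^30 bytes):

```python
# Convert the quoted on-disk sizes from bytes to GiB (1 GiB = 2**30 bytes)
def bytes_to_gib(n: int) -> float:
    return n / 2**30

sizes = {
    "TQ4_1S": 13_985_278_688,
    "TQ3_1S": 12_589_894_368,
}
for name, n in sizes.items():
    print(f"{name}: {bytes_to_gib(n):.2f} GiB")
# -> TQ4_1S: 13.02 GiB
# -> TQ3_1S: 11.73 GiB
```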

Important runtime note

At the moment, this model should be used with the fork of llama.cpp that contains the Gemma 4 + TurboQuant runtime fixes required for correct Apple Metal execution.

In particular, this fork includes the fixes that were needed to make Gemma 4 TQ4_1S weight quantization and TurboQuant KV-cache inference behave correctly on Apple Metal.

Recommended launch commands

The checked-in GGUF files embed both upstream Gemma 4 templates. The default embedded template is the interleaved variant, so no --chat-template-file argument is required for tool-heavy or agentic usage.

TQ4_1S

Recommended balanced runtime recipe:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

TQ3_1S

Recommended conservative runtime recipe. Since TQ3_1S is the more aggressive weight quantization, start with uncompressed KV cache for the cleanest baseline:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv q8_0 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

After validating output quality, you can also try -ctv turbo4 on TQ3_1S if you want lower KV-cache memory usage.
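
To see why -ctv turbo4 is attractive on memory-constrained machines, here is a rough V-cache sizing sketch. The layer/head dimensions are hypothetical placeholders (this card does not publish the model internals), q8_0 is taken at its real cost of 34 bytes per 32 values (8.5 bits/element), and turbo4 is assumed at roughly 4.5 bits/element; only the relative cost matters, not the absolute numbers:

```python
# HYPOTHETICAL dimensions -- illustrative only
N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX = 48, 8, 128, 8192

def cache_bytes(bits_per_elt: float) -> float:
    # one cache (K or V): layers * ctx * kv_heads * head_dim elements
    return N_LAYERS * N_CTX * N_KV_HEADS * HEAD_DIM * bits_per_elt / 8

q8_0 = cache_bytes(8.5)    # q8_0: 34 bytes per 32-value block
turbo4 = cache_bytes(4.5)  # assumed ~4.5 bits/elt for a 4-bit type
print(f"V-cache q8_0:   {q8_0 / 2**20:.0f} MiB")    # -> 408 MiB
print(f"V-cache turbo4: {turbo4 / 2**20:.0f} MiB")  # -> 216 MiB
```

Under these assumptions the 4-bit V cache roughly halves that component of memory use, at the cost of the extra quantization noise the validation step above is meant to catch.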

Optional standard template

Upstream llama.cpp also ships the standard non-interleaved Gemma 4 template. This repo includes it as:

  • additional_chat_templates/standard.jinja

Use it explicitly if you want the classic Gemma 4 formatting instead of the interleaved default:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --chat-template-file /path/to/additional_chat_templates/standard.jinja \
  --reasoning off \
  --reasoning-format none
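
For readers unfamiliar with what --jinja / --chat-template-file actually do, the sketch below renders a toy template in the style of Gemma's turn markup with the jinja2 package. The real templates shipped in this repo (chat_template.jinja, standard.jinja) are considerably more involved; this only illustrates the message-list-to-prompt mapping:

```python
from jinja2 import Template

# Toy Gemma-style template: each message becomes a <start_of_turn> block,
# and add_generation_prompt opens the model's turn at the end.
TOY_TEMPLATE = (
    "{% for m in messages %}"
    "<start_of_turn>{{ 'model' if m.role == 'assistant' else m.role }}\n"
    "{{ m.content }}<end_of_turn>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"
)

prompt = Template(TOY_TEMPLATE).render(
    messages=[{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
print(prompt)
# -> <start_of_turn>user
#    Hello<end_of_turn>
#    <start_of_turn>model
```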

Notes

  • chat_template.jinja is the current upstream google-gemma-4-31B-it-interleaved template and is the default embedded template in the checked-in TQ GGUF files
  • additional_chat_templates/standard.jinja is the current upstream google-gemma-4-31B-it template
  • requant_recipe_tq4_1s.txt is the tensor override recipe used for the TQ4_1S weight requantization pass
  • requant_recipe_tq3_1s.txt is the tensor override recipe used for the TQ3_1S weight requantization pass
  • This release is intended for use with the runtime fork linked above until equivalent support is available upstream

Credits

  • Base model author: blascotobasco
  • TurboQuant runtime / GGUF work based on llama.cpp and the TurboQuant fork linked above
Model info

  • Format: GGUF
  • Size: 20B params
  • Architecture: gemma4