# Zamba2-1.2B-Instruct GGUF
First GGUF conversion of Zyphra/Zamba2-1.2B-Instruct for use with llama.cpp.
## Files
| File | Quant | Size | BPW |
|---|---|---|---|
| zamba2-1.2b-instruct-q4_0.gguf | Q4_0 | 948 MB | 4.60 |
## Performance
| Device | Prompt eval | Generation | VRAM |
|---|---|---|---|
| Quadro T2000 (4 GB) | 249 tok/s | 57.9 tok/s | 1,515 MB |
| RTX 3090 (24 GB) | 2,509 tok/s | 292 tok/s | 1,520 MB |
## How to run
Requires llama.cpp with Zamba2 support. Until the PR is merged into mainline, build from the PR branch:
```bash
git clone https://github.com/echo313unfolding/llama.cpp.git -b zamba2
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j
```

```bash
./build/bin/llama-completion \
  -m zamba2-1.2b-instruct-q4_0.gguf \
  -p '<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n' \
  -n 100 -ngl 99 --temp 0
```
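If you drive the model programmatically, the ChatML-style prompt can be assembled with a small helper. This is a sketch that simply mirrors the template string used in the command above; it is not taken from an official tokenizer config:

```python
def build_prompt(user_msg: str) -> str:
    """Build a ChatML-style prompt matching the template in the run command
    above. Assumes the model expects <|im_start|>/<|im_end|> markers."""
    return (
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_prompt("What is the capital of France?")
```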
## Architecture
Zamba2 is a hybrid Mamba-2 + Transformer architecture:
- 38 layers: 32 pure Mamba-2 + 6 hybrid (shared transformer at positions 5, 11, 17, 23, 29, 35)
- 2048 embedding dim, 32 attention heads
- Mamba-2 SSM: d_inner=4096, d_state=128, 64 SSM heads
- Shared transformer with 6 LoRA adapter sets (pre-applied in conversion)
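The layer schedule above can be written down as a small check. A sketch based only on the positions listed here (every 6th layer starting at index 5 is hybrid):

```python
N_LAYERS = 38
# Shared-transformer (hybrid) layer indices from the list above.
HYBRID_POSITIONS = {i for i in range(N_LAYERS) if i % 6 == 5}  # {5, 11, 17, 23, 29, 35}

def layer_kind(i: int) -> str:
    # Hybrid layers run the shared transformer block; the rest are pure Mamba-2.
    return "hybrid" if i in HYBRID_POSITIONS else "mamba2"

kinds = [layer_kind(i) for i in range(N_LAYERS)]
```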
## Conversion
Converted using convert_zamba2_to_gguf.py (included in the PR). Key steps:
- LoRA unfolding: each layer's adapter is folded into the shared weights as W_effective = W_shared + B @ A, emitting independent per-layer tensors
- Per-layer n_head_kv array: 0 for Mamba layers, 32 for hybrid layers
- 404 tensors total
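The LoRA unfolding step amounts to one matrix product and an add per adapted weight. A minimal pure-Python sketch (matrices as lists of rows; the real script operates on the checkpoint tensors):

```python
def lora_unfold(W_shared, B, A):
    """Fold a LoRA adapter into the shared weight: W_eff = W_shared + B @ A.
    Shapes: W_shared is (m, n), B is (m, r), A is (r, n), with rank r small."""
    m, n, r = len(W_shared), len(W_shared[0]), len(A)
    # Low-rank update B @ A, expanded to a full (m, n) matrix.
    BA = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(n)]
          for i in range(m)]
    return [[W_shared[i][j] + BA[i][j] for j in range(n)] for i in range(m)]

# Tiny rank-1 example: W_eff = I + B @ A.
W_eff = lora_unfold([[1, 0], [0, 1]], [[1], [2]], [[3, 4]])
```

After unfolding, the shared weight plus its per-layer update is stored as an ordinary dense tensor, so the llama.cpp graph needs no LoRA-specific handling at inference time.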
## Links
- PR: ggml-org/llama.cpp#21412
- Original model: Zyphra/Zamba2-1.2B-Instruct
- Fork: echo313unfolding/llama.cpp
Converted by Echo Labs.