Zamba2-1.2B-Instruct GGUF

First GGUF conversion of Zyphra/Zamba2-1.2B-Instruct for use with llama.cpp.

Files

File                            Quant   Size    BPW
zamba2-1.2b-instruct-q4_0.gguf  Q4_0    948 MB  4.60

Performance

Device                Prompt eval   Generation   VRAM
Quadro T2000 (4 GB)   249 tok/s     57.9 tok/s   1,515 MB
RTX 3090 (24 GB)      2,509 tok/s   292 tok/s    1,520 MB

How to run

Requires llama.cpp with Zamba2 support. Until the PR is merged into mainline, build from the PR branch:

git clone https://github.com/echo313unfolding/llama.cpp.git -b zamba2
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

./build/bin/llama-cli \
  -m zamba2-1.2b-instruct-q4_0.gguf \
  -e -p '<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n' \
  -n 100 -ngl 99 --temp 0

(`-e` tells llama-cli to process the `\n` escapes in the prompt, which a single-quoted shell string leaves literal.)
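The prompt passed via `-p` is the ChatML template that Zamba2-Instruct was tuned on. A minimal helper for building it programmatically (the function name is illustrative, not part of llama.cpp):

```python
def chatml_prompt(user_message: str) -> str:
    """Wrap a user message in the ChatML template used by Zamba2-Instruct."""
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = chatml_prompt("What is the capital of France?")
```

The model generates from the open `<|im_start|>assistant\n` turn and emits `<|im_end|>` when it is done.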

Architecture

Zamba2 is a hybrid Mamba-2 + Transformer architecture:

  • 38 layers: 32 pure Mamba-2 + 6 hybrid (shared transformer at positions 5, 11, 17, 23, 29, 35)
  • 2048 embedding dim, 32 attention heads
  • Mamba-2 SSM: d_inner=4096, d_state=128, 64 SSM heads
  • Shared transformer with 6 LoRA adapter sets (pre-applied in conversion)
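The layer layout above can be sketched as follows; the hybrid positions fall on a fixed stride of 6 starting at layer 5 (inferred from the listed positions, not taken from the conversion code):

```python
N_LAYERS = 38
HYBRID_POSITIONS = list(range(5, N_LAYERS, 6))  # every 6th layer, starting at 5

# Pure Mamba-2 everywhere except the hybrid positions, where the
# shared transformer block runs alongside the Mamba-2 block.
layout = ["mamba2"] * N_LAYERS
for i in HYBRID_POSITIONS:
    layout[i] = "hybrid"
```

This reproduces the split in the bullet list: 32 pure Mamba-2 layers and 6 hybrid layers.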

Conversion

Converted using convert_zamba2_to_gguf.py (included in the PR). Key steps:

  • LoRA unfolding: W_effective = W_shared + B @ A (emits independent per-layer tensors)
  • Per-layer n_head_kv array: 0 for Mamba layers, 32 for hybrid layers
  • 404 tensors total
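The first two steps above can be sketched like this (illustrative NumPy with assumed dimensions, not the actual convert_zamba2_to_gguf.py code):

```python
import numpy as np

# LoRA unfolding: fold the low-rank adapter product (B @ A) into the shared
# transformer weight so each hybrid layer gets an independent effective tensor.
d, r = 2048, 64                      # d from the model card; r (LoRA rank) is assumed
W_shared = np.zeros((d, d), dtype=np.float32)
A = np.ones((r, d), dtype=np.float32)
B = np.ones((d, r), dtype=np.float32)
W_effective = W_shared + B @ A       # same shape as W_shared

# Per-layer n_head_kv: 0 marks a pure Mamba-2 layer (no KV cache needed),
# 32 marks a hybrid layer with full attention.
hybrid = {5, 11, 17, 23, 29, 35}
n_head_kv = [32 if i in hybrid else 0 for i in range(38)]
```

Unfolding trades file size for simplicity: the GGUF stores six full per-layer weights instead of one shared weight plus six small adapters, but the llama.cpp graph needs no LoRA-aware code path.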

Credits

Converted by Echo Labs.
