# Zamba2-1.2B-Instruct GGUF
First GGUF conversion of Zyphra/Zamba2-1.2B-Instruct for use with llama.cpp.
## Files
| File | Quant | Size | BPW |
|---|---|---|---|
| zamba2-1.2b-instruct-q4_0.gguf | Q4_0 | 948 MB | 4.60 |
## Performance
| Device | Prompt eval | Generation | VRAM |
|---|---|---|---|
| Quadro T2000 (4 GB) | 249 tok/s | 57.9 tok/s | 1,515 MB |
| RTX 3090 (24 GB) | 2,509 tok/s | 292 tok/s | 1,520 MB |
## How to run
Requires llama.cpp with Zamba2 support. Until the PR is merged into mainline, build from the PR branch:
```bash
git clone https://github.com/echo313unfolding/llama.cpp.git -b zamba2
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j
```

```bash
./build/bin/llama-completion \
  -m zamba2-1.2b-instruct-q4_0.gguf \
  -p '<|im_start|>user\nWhat is the capital of France?<|im_end|>\n<|im_start|>assistant\n' \
  -n 100 -ngl 99 --temp 0
```
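If you drive the model programmatically, the ChatML-style prompt can be assembled with a small helper. This is a sketch that simply mirrors the template string used in the command above; it is not taken from an official tokenizer config:

```python
def build_prompt(user_msg: str) -> str:
    """Build a ChatML-style prompt matching the template in the run command
    above. Assumes the model expects <|im_start|>/<|im_end|> markers."""
    return (
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_prompt("What is the capital of France?")
```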
## Architecture
Zamba2 is a hybrid Mamba-2 + Transformer architecture:
- 38 layers: 32 pure Mamba-2 + 6 hybrid (shared transformer at positions 5, 11, 17, 23, 29, 35)
- 2048 embedding dim, 32 attention heads
- Mamba-2 SSM: d_inner=4096, d_state=128, 64 SSM heads
- Shared transformer with 6 LoRA adapter sets (pre-applied in conversion)
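The layer schedule above can be written down as a small check. A sketch based only on the positions listed here (every 6th layer starting at index 5 is hybrid):

```python
N_LAYERS = 38
# Shared-transformer (hybrid) layer indices from the list above.
HYBRID_POSITIONS = {i for i in range(N_LAYERS) if i % 6 == 5}  # {5, 11, 17, 23, 29, 35}

def layer_kind(i: int) -> str:
    # Hybrid layers run the shared transformer block; the rest are pure Mamba-2.
    return "hybrid" if i in HYBRID_POSITIONS else "mamba2"

kinds = [layer_kind(i) for i in range(N_LAYERS)]
```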
## Conversion
Converted using convert_zamba2_to_gguf.py (included in the PR). Key steps:
- LoRA unfolding: each layer's adapter is folded into the shared weights as W_effective = W_shared + B @ A, emitting independent per-layer tensors
- Per-layer n_head_kv array: 0 for Mamba layers, 32 for hybrid layers
- 404 tensors total
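The LoRA unfolding step amounts to one matrix product and an add per adapted weight. A minimal pure-Python sketch (matrices as lists of rows; the real script operates on the checkpoint tensors):

```python
def lora_unfold(W_shared, B, A):
    """Fold a LoRA adapter into the shared weight: W_eff = W_shared + B @ A.
    Shapes: W_shared is (m, n), B is (m, r), A is (r, n), with rank r small."""
    m, n, r = len(W_shared), len(W_shared[0]), len(A)
    # Low-rank update B @ A, expanded to a full (m, n) matrix.
    BA = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(n)]
          for i in range(m)]
    return [[W_shared[i][j] + BA[i][j] for j in range(n)] for i in range(m)]

# Tiny rank-1 example: W_eff = I + B @ A.
W_eff = lora_unfold([[1, 0], [0, 1]], [[1], [2]], [[3, 4]])
```

After unfolding, the shared weight plus its per-layer update is stored as an ordinary dense tensor, so the llama.cpp graph needs no LoRA-specific handling at inference time.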
## Links
- PR: ggml-org/llama.cpp#21412
- Original model: Zyphra/Zamba2-1.2B-Instruct
- Fork: echo313unfolding/llama.cpp
Converted by Echo Labs.