# Sarvam-30B GGUF

GGUF quantizations of sarvamai/sarvam-30b — the first publicly available GGUF for this model.

Created by applying llama.cpp PR #20275, which adds `sarvam_moe` architecture support to the converter and runtime.

## Will this work with Ollama / LM Studio / Jan?

Not yet. These tools bundle mainline llama.cpp, which does not recognize sarvam_moe. You will see:

```
error loading model: unknown model architecture: 'sarvam_moe'
```

These GGUFs require a patched llama.cpp (with PR #20275 applied) until that PR merges into mainline. Once it does, Ollama, LM Studio, and Jan will pick up support automatically in their next llama.cpp update.

To build a patched llama.cpp, use mtr7x/sarvam-gguf.

## Files

| File | Quant | Size | BPW | Notes |
|---|---|---|---|---|
| sarvam-30b-q4_k_m.gguf | Q4_K_M | 19 GB | 4.87 | Recommended: good balance of quality and size |
| sarvam-30b-f16.gguf | F16 | 60 GB | 16.00 | Full precision; use as a base for further quantization |
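If you need a quant type not listed here, the F16 file can be re-quantized with the patched build's `llama-quantize` tool. A minimal sketch (the Q5_K_M type and output filename are just examples; pick any type `llama-quantize` lists):

```sh
# Re-quantize the full-precision GGUF to Q5_K_M.
# Requires the patched llama.cpp build, since llama-quantize
# must also recognize the sarvam_moe architecture.
./llama-quantize sarvam-30b-f16.gguf sarvam-30b-q5_k_m.gguf Q5_K_M
```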

## How to use

### Option 1: Patch llama.cpp automatically (recommended)

```sh
git clone https://github.com/mtr7x/sarvam-gguf.git
cd sarvam-gguf
chmod +x patch_and_convert.sh
./patch_and_convert.sh
```

This clones llama.cpp, applies PR #20275, and builds it; you're then ready to run.
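If you'd rather apply the patch by hand, the script's steps roughly amount to the following sketch (using GitHub's `pull/N/head` ref to fetch the PR; the branch name and build flags are arbitrary choices):

```sh
# Fetch llama.cpp and check out PR #20275 via GitHub's pull/N/head ref
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/20275/head:sarvam-moe
git checkout sarvam-moe

# Build (add e.g. -DGGML_CUDA=ON for NVIDIA GPUs)
cmake -B build
cmake --build build --config Release -j
```

The resulting binaries land under `build/bin/`.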

### Option 2: Run with patched llama.cpp directly

```sh
./llama-cli \
    --model sarvam-30b-q4_k_m.gguf \
    --n-gpu-layers 99 \
    --ctx-size 2048 \
    --temp 0.7 \
    -no-cnv \
    --prompt "भारत के बारे में बताइए।"
```
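The patched build's `llama-server` also exposes an OpenAI-compatible HTTP API, which is useful for clients that can't load the GGUF themselves. A sketch (port and sampling settings here are just examples):

```sh
# Start the server (patched build required), then query it over HTTP
./llama-server --model sarvam-30b-q4_k_m.gguf --n-gpu-layers 99 --port 8080 &

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "भारत के बारे में बताइए।"}]}'
```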

## What does NOT work (yet)

| Tool | Status | Why |
|---|---|---|
| Ollama | `unknown model architecture` | Waiting on PR #20275 merge |
| LM Studio | `unknown model architecture` | Waiting on PR #20275 merge |
| Jan | `unknown model architecture` | Waiting on PR #20275 merge |
| llama.cpp (mainline) | `unknown model architecture` | PR #20275 not yet merged |
| llama.cpp (patched) | Works | This is what you need |

## Architecture

```
sarvamai/sarvam-30b
├── model_type: sarvam_moe
├── 30B params, 2.4B active
├── 19 layers (1 dense + 18 MoE)
├── 128 experts + 1 shared, top-6, sigmoid routing
├── 64 query heads, 4 KV heads, head_dim=64
├── vocab_size: 262,144 (Indic-optimized)
└── Apache 2.0
```

## Why this is needed

Sarvam open-sourced its 30B and 105B models under Apache 2.0, but mainline llama.cpp doesn't recognize `model_type: "sarvam_moe"`, so the converter exits immediately. Contrary to what you might expect, sigmoid routing is already supported in llama.cpp (used by GLM4 and others). The actual blocker is a missing architecture registration, tensor-name mappings, and a C++ graph builder, all of which PR #20275 provides (387 lines).

### The domino chain

```
PR #20275 merges into llama.cpp        ← pending
  → GGUF can be created                ← done (this repo)
    → Ollama updates its llama.cpp     ← blocked
      → Unsloth applies dynamic quants ← blocked
        → ollama run sarvam-30b        ← blocked
```

## Runtime support

| Runtime | Status |
|---|---|
| vLLM | PR #33942 merged |
| SGLang | Works |
| llama.cpp (patched) | Works (PR #20275) |
| llama.cpp (mainline) | Blocked — PR pending |
| Ollama | Blocked on llama.cpp |
| LM Studio | Blocked on llama.cpp |
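Since vLLM support is merged, the original (non-GGUF) checkpoint can be served there without any patching. A sketch, assuming a vLLM release recent enough to include the merged PR:

```sh
# Serve the original Hugging Face checkpoint with vLLM
# (needs a vLLM version that includes the merged sarvam_moe support)
pip install -U vllm
vllm serve sarvamai/sarvam-30b --port 8000
```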

## Credits

Read the full analysis: Sarvam. Open is not sovereign
