# Sarvam-30B GGUF

GGUF quantizations of sarvamai/sarvam-30b — the first publicly available GGUF for this model.

Created by applying llama.cpp PR #20275, which adds `sarvam_moe` architecture support to the converter and runtime.

## Will this work with Ollama / LM Studio / Jan?

Not yet. These tools bundle mainline llama.cpp, which does not recognize sarvam_moe. You will see:

```
error loading model: unknown model architecture: 'sarvam_moe'
```

These GGUFs require a patched llama.cpp (with PR #20275 applied) until that PR merges into mainline. Once it does, Ollama, LM Studio, and Jan will pick up support automatically in their next llama.cpp update.

To build a patched llama.cpp, use mtr7x/sarvam-gguf.

## Files

| File | Quant | Size | BPW | Notes |
|---|---|---|---|---|
| sarvam-30b-q4_k_m.gguf | Q4_K_M | 19 GB | 4.87 | Recommended: good balance of quality and size |
| sarvam-30b-f16.gguf | F16 | 60 GB | 16.00 | Full precision; use as a base for further quantization |
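If you need a quant type not listed here, the F16 file can be re-quantized with the patched build's `llama-quantize` tool. A minimal sketch (the Q5_K_M type and output filename are just examples; pick any type `llama-quantize` lists):

```sh
# Re-quantize the full-precision GGUF to Q5_K_M.
# Requires the patched llama.cpp build, since llama-quantize
# must also recognize the sarvam_moe architecture.
./llama-quantize sarvam-30b-f16.gguf sarvam-30b-q5_k_m.gguf Q5_K_M
```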

## How to use

### Option 1: Patch llama.cpp automatically (recommended)

```sh
git clone https://github.com/mtr7x/sarvam-gguf.git
cd sarvam-gguf
chmod +x patch_and_convert.sh
./patch_and_convert.sh
```

This clones llama.cpp, applies PR #20275, and builds it; you're then ready to run.
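If you'd rather apply the patch by hand, the script's steps roughly amount to the following sketch (using GitHub's `pull/N/head` ref to fetch the PR; the branch name and build flags are arbitrary choices):

```sh
# Fetch llama.cpp and check out PR #20275 via GitHub's pull/N/head ref
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git fetch origin pull/20275/head:sarvam-moe
git checkout sarvam-moe

# Build (add e.g. -DGGML_CUDA=ON for NVIDIA GPUs)
cmake -B build
cmake --build build --config Release -j
```

The resulting binaries land under `build/bin/`.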

### Option 2: Run with patched llama.cpp directly

```sh
./llama-cli \
    --model sarvam-30b-q4_k_m.gguf \
    --n-gpu-layers 99 \
    --ctx-size 2048 \
    --temp 0.7 \
    -no-cnv \
    --prompt "भारत के बारे में बताइए।"
```
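The patched build's `llama-server` also exposes an OpenAI-compatible HTTP API, which is useful for clients that can't load the GGUF themselves. A sketch (port and sampling settings here are just examples):

```sh
# Start the server (patched build required), then query it over HTTP
./llama-server --model sarvam-30b-q4_k_m.gguf --n-gpu-layers 99 --port 8080 &

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "भारत के बारे में बताइए।"}]}'
```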

## What does NOT work (yet)

| Tool | Status | Why |
|---|---|---|
| Ollama | `unknown model architecture` | Waiting on PR #20275 merge |
| LM Studio | `unknown model architecture` | Waiting on PR #20275 merge |
| Jan | `unknown model architecture` | Waiting on PR #20275 merge |
| llama.cpp (mainline) | `unknown model architecture` | PR #20275 not yet merged |
| llama.cpp (patched) | Works | This is what you need |

## Architecture

```
sarvamai/sarvam-30b
├── model_type: sarvam_moe
├── 30B params, 2.4B active
├── 19 layers (1 dense + 18 MoE)
├── 128 experts + 1 shared, top-6, sigmoid routing
├── 64 query heads, 4 KV heads, head_dim=64
├── vocab_size: 262,144 (Indic-optimized)
└── Apache 2.0
```

## Why this is needed

Sarvam open-sourced its 30B and 105B models under Apache 2.0, but mainline llama.cpp doesn't recognize `model_type: "sarvam_moe"`, so the converter exits immediately. Contrary to what you might expect, sigmoid routing is already supported in llama.cpp (used by GLM4 and others). The actual blocker is a missing architecture registration, tensor-name mappings, and a C++ graph builder, all of which PR #20275 provides (387 lines).

### The domino chain

```
PR #20275 merges into llama.cpp        ← pending
  → GGUF can be created                ← done (this repo)
    → Ollama updates its llama.cpp     ← blocked
      → Unsloth applies dynamic quants ← blocked
        → ollama run sarvam-30b        ← blocked
```

## Runtime support

| Runtime | Status |
|---|---|
| vLLM | PR #33942 merged |
| SGLang | Works |
| llama.cpp (patched) | Works (PR #20275) |
| llama.cpp (mainline) | Blocked — PR pending |
| Ollama | Blocked on llama.cpp |
| LM Studio | Blocked on llama.cpp |
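Since vLLM support is merged, the original (non-GGUF) checkpoint can be served there without any patching. A sketch, assuming a vLLM release recent enough to include the merged PR:

```sh
# Serve the original Hugging Face checkpoint with vLLM
# (needs a vLLM version that includes the merged sarvam_moe support)
pip install -U vllm
vllm serve sarvamai/sarvam-30b --port 8000
```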

## Credits

Read the full analysis: Sarvam. Open is not sovereign
