# Sanyam0605/sarvam-1-NVFP4

An NVFP4-quantized version of sarvamai/sarvam-1, produced with NVIDIA TensorRT Model Optimizer (modelopt 0.35.0).

## Quantization Details

| Parameter | Value |
|---|---|
| Base Model | sarvamai/sarvam-1 |
| Architecture | LlamaForCausalLM |
| Parameters | ~2B |
| Quantization | NVFP4 (4-bit floating point) |
| KV Cache | FP8 |
| Group Size | 16 |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (KV: 8) |
| Context Length | 8192 |
| Vocab Size | 68096 |
| Quantizer | modelopt v0.35.0 |
| Excluded Modules | lm_head |
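NVFP4 stores each weight as a 4-bit FP4 (E2M1) value plus one shared FP8 (E4M3) scale per 16-element block, i.e. roughly 4 + 8/16 = 4.5 bits per quantized weight. A back-of-envelope sketch of what that means for this ~2B-parameter checkpoint (the helper below is illustrative, not part of any library; the real file is somewhat larger because `lm_head` and other excluded tensors stay in BF16):

```python
# Rough NVFP4 weight-memory estimate for a ~2B-parameter model.
# NVFP4 packs weights as 4-bit values with one 8-bit (FP8 E4M3)
# scale per block of `group_size` elements, so effective storage
# is 4 + 8/group_size bits per quantized weight.

def nvfp4_weight_gib(n_params: float, group_size: int = 16) -> float:
    bits_per_weight = 4 + 8 / group_size  # 4.5 bits at group size 16
    return n_params * bits_per_weight / 8 / 2**30

print(f"{nvfp4_weight_gib(2e9):.2f} GiB")  # ≈ 1.05 GiB of quantized weights
```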

## Usage

### With TensorRT-LLM (recommended)

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Sanyam0605/sarvam-1-NVFP4")
output = llm.generate(["Hello, tell me about"], sampling_params=SamplingParams(max_tokens=128))
print(output[0].outputs[0].text)
```

### In a TensorRT-LLM container

```bash
# Using the NVIDIA DGX Spark container
docker run --rm --gpus all \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
    python -c "
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model='Sanyam0605/sarvam-1-NVFP4')
out = llm.generate(['Translate to Hindi: Good morning'], sampling_params=SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
"
```

### Loading with Hugging Face Transformers

**Note:** NVFP4 quantization requires TensorRT-LLM for inference. Loading this checkpoint with standard `transformers` is not supported.
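Even though `transformers` cannot run the model, the quantization metadata can still be inspected to confirm the format before picking a runtime. modelopt-exported checkpoints ship this metadata in an `hf_quant_config.json` file; the snippet below parses an inline example of that layout, and the exact field names are an assumption based on typical modelopt output, not a guarantee for this repo:

```python
import json

# Illustrative quantization metadata in the style modelopt writes to
# hf_quant_config.json; field names here are an assumption, so check
# the actual file in the repo before relying on them.
example = json.loads("""
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8",
    "group_size": 16,
    "exclude_modules": ["lm_head"]
  }
}
""")

q = example["quantization"]
print(q["quant_algo"], q["kv_cache_quant_algo"], q["group_size"])
```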

## Hardware Requirements

- **Recommended:** NVIDIA DGX Spark (GB10, 128 GB unified memory) or any GPU with FP4 support (Blackwell architecture)
- **CUDA compute capability:** 12.0+
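The capability check above can be sketched as a small helper; `has_fp4_support` is a hypothetical name (not a library function), and the `(12, 0)` floor simply mirrors the requirement stated in this card:

```python
def has_fp4_support(capability: tuple[int, int],
                    min_cc: tuple[int, int] = (12, 0)) -> bool:
    """Return True if a CUDA compute capability meets the FP4 floor above."""
    return capability >= min_cc  # lexicographic (major, minor) comparison

# With PyTorch installed, the running device could be checked via:
#   import torch
#   has_fp4_support(torch.cuda.get_device_capability(0))
print(has_fp4_support((12, 1)))  # True  (e.g. a Blackwell consumer-class GPU)
print(has_fp4_support((8, 9)))   # False (Ada has no FP4 support)
```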

## About Sarvam-1

Sarvam-1 is a multilingual language model with strong performance across Indian languages. This quantized version reduces memory footprint while maintaining quality, making it suitable for deployment on edge devices like the DGX Spark.

