# Sanyam0605/sarvam-1-NVFP4

An NVFP4-quantized version of sarvamai/sarvam-1, produced with NVIDIA TensorRT Model Optimizer (modelopt 0.35.0).

## Quantization Details

| Parameter | Value |
|---|---|
| Base Model | sarvamai/sarvam-1 |
| Architecture | LlamaForCausalLM |
| Parameters | ~2B |
| Quantization | NVFP4 (4-bit floating point) |
| KV Cache | FP8 |
| Group Size | 16 |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (KV: 8) |
| Context Length | 8192 |
| Vocab Size | 68096 |
| Quantizer | modelopt v0.35.0 |
| Excluded Modules | lm_head |
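NVFP4 stores each weight as a 4-bit FP4 (E2M1) value plus one shared FP8 (E4M3) scale per 16-element block, i.e. roughly 4 + 8/16 = 4.5 bits per quantized weight. A back-of-envelope sketch of what that means for this ~2B-parameter checkpoint (the helper below is illustrative, not part of any library; the real file is somewhat larger because `lm_head` and other excluded tensors stay in BF16):

```python
# Rough NVFP4 weight-memory estimate for a ~2B-parameter model.
# NVFP4 packs weights as 4-bit values with one 8-bit (FP8 E4M3)
# scale per block of `group_size` elements, so effective storage
# is 4 + 8/group_size bits per quantized weight.

def nvfp4_weight_gib(n_params: float, group_size: int = 16) -> float:
    bits_per_weight = 4 + 8 / group_size  # 4.5 bits at group size 16
    return n_params * bits_per_weight / 8 / 2**30

print(f"{nvfp4_weight_gib(2e9):.2f} GiB")  # ≈ 1.05 GiB of quantized weights
```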

## Usage

### With TensorRT-LLM (recommended)

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Sanyam0605/sarvam-1-NVFP4")
output = llm.generate(["Hello, tell me about"], sampling_params=SamplingParams(max_tokens=128))
print(output[0].outputs[0].text)
```

### In a TensorRT-LLM container

```bash
# Using the NVIDIA DGX Spark container
docker run --rm --gpus all \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
    python -c "
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model='Sanyam0605/sarvam-1-NVFP4')
out = llm.generate(['Translate to Hindi: Good morning'], sampling_params=SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
"
```

### Loading with Hugging Face Transformers

**Note:** NVFP4 quantization requires TensorRT-LLM for inference. Loading this checkpoint with standard `transformers` is not supported.
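Even though `transformers` cannot run the model, the quantization metadata can still be inspected to confirm the format before picking a runtime. modelopt-exported checkpoints ship this metadata in an `hf_quant_config.json` file; the snippet below parses an inline example of that layout, and the exact field names are an assumption based on typical modelopt output, not a guarantee for this repo:

```python
import json

# Illustrative quantization metadata in the style modelopt writes to
# hf_quant_config.json; field names here are an assumption, so check
# the actual file in the repo before relying on them.
example = json.loads("""
{
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8",
    "group_size": 16,
    "exclude_modules": ["lm_head"]
  }
}
""")

q = example["quantization"]
print(q["quant_algo"], q["kv_cache_quant_algo"], q["group_size"])
```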

## Hardware Requirements

- **Recommended:** NVIDIA DGX Spark (GB10, 128 GB unified memory) or any GPU with FP4 support (Blackwell architecture)
- **CUDA compute capability:** 12.0+
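The capability check above can be sketched as a small helper; `has_fp4_support` is a hypothetical name (not a library function), and the `(12, 0)` floor simply mirrors the requirement stated in this card:

```python
def has_fp4_support(capability: tuple[int, int],
                    min_cc: tuple[int, int] = (12, 0)) -> bool:
    """Return True if a CUDA compute capability meets the FP4 floor above."""
    return capability >= min_cc  # lexicographic (major, minor) comparison

# With PyTorch installed, the running device could be checked via:
#   import torch
#   has_fp4_support(torch.cuda.get_device_capability(0))
print(has_fp4_support((12, 1)))  # True  (e.g. a Blackwell consumer-class GPU)
print(has_fp4_support((8, 9)))   # False (Ada has no FP4 support)
```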

## About Sarvam-1

Sarvam-1 is a multilingual language model with strong performance across Indian languages. This quantized version reduces memory footprint while maintaining quality, making it suitable for deployment on edge devices like the DGX Spark.

