---
library_name: transformers
base_model: sarvamai/sarvam-1
tags:
- sarvam
- nvfp4
- quantized
- modelopt
- tensorrt-llm
- dgx-spark
license: apache-2.0
pipeline_tag: text-generation
inference: false
---

# Sanyam0605/sarvam-1-NVFP4

An NVFP4-quantized version of [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1), produced with [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (modelopt 0.35.0).

## Quantization Details

| Parameter | Value |
|---|---|
| **Base Model** | [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1) |
| **Architecture** | LlamaForCausalLM |
| **Quantization** | NVFP4 (4-bit floating point) |
| **KV Cache** | FP8 |
| **Group Size** | 16 |
| **Hidden Size** | 2048 |
| **Layers** | 28 |
| **Attention Heads** | 16 (KV: 8) |
| **Context Length** | 8192 |
| **Vocab Size** | 68096 |
| **Quantizer** | modelopt v0.35.0 |
| **Excluded Modules** | lm_head |

## Usage

### With TensorRT-LLM (recommended)

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Sanyam0605/sarvam-1-NVFP4")
output = llm.generate(
    ["Hello, tell me about"],
    sampling_params=SamplingParams(max_tokens=128),
)
print(output[0].outputs[0].text)
```

### With TensorRT-LLM CLI

```bash
# Using the NVIDIA DGX Spark container
docker run --rm --gpus all \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  python -c "
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model='Sanyam0605/sarvam-1-NVFP4')
out = llm.generate(['Translate to Hindi: Good morning'], sampling_params=SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
"
```

### Loading with HuggingFace Transformers

> **Note:** NVFP4 quantization requires TensorRT-LLM for inference. Standard `transformers` loading is not supported for this quantization format.
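To get a rough sense of the memory savings, the figures in the quantization table are enough for a back-of-the-envelope estimate of weight memory at 16-bit versus NVFP4 precision. This is a sketch, not an official figure: the MLP intermediate size is **not** listed in the table, so the value below is a hypothetical placeholder, and norms/biases are ignored. NVFP4 stores 4-bit weights plus one FP8 scale per group of 16, i.e. roughly 0.5625 bytes per parameter.

```python
# Back-of-the-envelope weight-memory estimate from the config values above.
hidden, layers, heads, kv_heads, vocab = 2048, 28, 16, 8, 68096
head_dim = hidden // heads  # 128

# Attention projections: Q and O are (hidden x hidden); K and V project to
# kv_heads * head_dim (grouped-query attention with 8 KV heads).
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)

# MLP (gate/up/down, SwiGLU-style). NOTE: intermediate size is an ASSUMPTION,
# not taken from the table -- 4x hidden is used here purely for illustration.
inter = 8192  # hypothetical
mlp = 3 * hidden * inter

quantized = layers * (attn + mlp) + vocab * hidden  # embeddings included
lm_head = vocab * hidden                            # excluded from quantization

fp16_gib = (quantized + lm_head) * 2 / 2**30
# NVFP4: 4 bits/weight + one FP8 scale per group of 16 = 0.5625 bytes/weight;
# lm_head stays at 16-bit (2 bytes) per the "Excluded Modules" row.
nvfp4_gib = (quantized * (0.5 + 1 / 16) + lm_head * 2) / 2**30

print(f"~{(quantized + lm_head) / 1e9:.2f}B params (norms excluded)")
print(f"FP16 weights: ~{fp16_gib:.2f} GiB, NVFP4 weights: ~{nvfp4_gib:.2f} GiB")
```

Under these assumptions the weights shrink by roughly 3x; actual checkpoint size also depends on the real intermediate size and any higher-precision auxiliary tensors.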
## Hardware Requirements

- **Recommended:** NVIDIA DGX Spark (GB10, 128GB UMA) or any GPU with FP4 support (Blackwell architecture)
- **CUDA Compute Capability:** 12.0+

## About Sarvam-1

Sarvam-1 is a multilingual language model with strong performance across Indian languages. This quantized version reduces the memory footprint while maintaining quality, making it suitable for deployment on compact devices such as the DGX Spark.

## Acknowledgments

- Base model by [Sarvam AI](https://www.sarvam.ai/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
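As a quick sanity check against the compute-capability requirement listed above, here is a minimal sketch. The threshold comparison is plain Python; `supports_fp4` is a hypothetical helper name (not part of any library), and on a live GPU you would obtain the real `(major, minor)` pair via PyTorch's standard `torch.cuda.get_device_capability()`.

```python
def supports_fp4(major: int, minor: int, required=(12, 0)) -> bool:
    """Check whether a GPU's compute capability meets the 12.0+ requirement
    stated in the Hardware Requirements section (tuple comparison)."""
    return (major, minor) >= required

# On a CUDA machine, feed in real values instead:
#   import torch
#   major, minor = torch.cuda.get_device_capability()
print(supports_fp4(12, 1))  # DGX Spark (GB10) class -> True
print(supports_fp4(9, 0))   # Hopper, no native FP4 per the note above -> False
```

This mirrors the document's stated 12.0+ floor; adjust `required` if you target a different deployment baseline.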