---
library_name: transformers
base_model: sarvamai/sarvam-1
tags:
- sarvam
- nvfp4
- quantized
- modelopt
- tensorrt-llm
- dgx-spark
license: apache-2.0
pipeline_tag: text-generation
inference: false
---

# Sanyam0605/sarvam-1-NVFP4

An NVFP4-quantized version of [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1), produced with [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (modelopt 0.35.0).

## Quantization Details

| Parameter | Value |
|---|---|
| **Base Model** | [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1) |
| **Architecture** | LlamaForCausalLM |
| **Quantization** | NVFP4 (4-bit floating point) |
| **KV Cache** | FP8 |
| **Group Size** | 16 |
| **Hidden Size** | 2048 |
| **Layers** | 28 |
| **Attention Heads** | 16 (KV: 8) |
| **Context Length** | 8192 |
| **Vocab Size** | 68096 |
| **Quantizer** | modelopt v0.35.0 |
| **Excluded Modules** | `lm_head` |
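The table above pins down the NVFP4 storage layout: 4-bit weight values with one shared scale per group of 16 elements. A rough back-of-the-envelope sketch of the resulting compression, assuming one 8-bit (FP8) scale per group and a BF16 baseline (the exact on-disk size also depends on per-tensor scales and the unquantized `lm_head`):

```python
# Approximate storage cost of NVFP4 weights vs. a BF16 baseline.
# Assumption: 4 bits per weight plus one 8-bit scale per group of 16.
GROUP_SIZE = 16
bits_per_weight = 4 + 8 / GROUP_SIZE  # effective bits per stored weight
bf16_bits = 16

compression = bf16_bits / bits_per_weight
print(f"effective bits/weight: {bits_per_weight}")   # 4.5
print(f"compression vs BF16:  ~{compression:.2f}x")  # ~3.56x
```

So the quantized weights need roughly 28% of the memory of the BF16 originals, before accounting for the excluded modules.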
+
|
| 37 |
+
## Usage
|
| 38 |
+
|
| 39 |
+
### With TensorRT-LLM (recommended)
|
| 40 |
+
|
| 41 |
+
```python
|
| 42 |
+
from tensorrt_llm import LLM, SamplingParams
|
| 43 |
+
|
| 44 |
+
llm = LLM(model="Sanyam0605/sarvam-1-NVFP4")
|
| 45 |
+
output = llm.generate(["Hello, tell me about"], sampling_params=SamplingParams(max_tokens=128))
|
| 46 |
+
print(output[0].outputs[0].text)
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
### With TensorRT-LLM CLI
|
| 50 |
+
|
| 51 |
+
```bash
|
| 52 |
+
# Using the NVIDIA DGX Spark container
|
| 53 |
+
docker run --rm --gpus all \
|
| 54 |
+
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
|
| 55 |
+
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
|
| 56 |
+
python -c "
|
| 57 |
+
from tensorrt_llm import LLM, SamplingParams
|
| 58 |
+
llm = LLM(model='Sanyam0605/sarvam-1-NVFP4')
|
| 59 |
+
out = llm.generate(['Translate to Hindi: Good morning'], sampling_params=SamplingParams(max_tokens=64))
|
| 60 |
+
print(out[0].outputs[0].text)
|
| 61 |
+
"
|
| 62 |
+
```
|

### Loading with Hugging Face Transformers

> **Note:** NVFP4 quantization requires TensorRT-LLM for inference. Standard `transformers` loading is not supported for this quantization format.

## Hardware Requirements

- **Recommended:** NVIDIA DGX Spark (GB10, 128 GB UMA) or any GPU with FP4 support (Blackwell architecture)
- **CUDA Compute Capability:** 12.0+
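Beyond the weights, the FP8 KV cache dominates memory at long context. Its footprint at the full 8192-token context can be estimated from the shape parameters in the table above (28 layers, 8 KV heads, head dim 2048 / 16 = 128). A hedged sketch, assuming 1 byte per FP8 element and ignoring paging and allocator overhead:

```python
# Rough FP8 KV-cache size at the full 8192-token context.
# Assumptions: 1 byte per FP8 cache element; no runtime overhead.
layers, kv_heads, context = 28, 8, 8192
head_dim = 2048 // 16  # hidden size / attention heads = 128

bytes_per_token = 2 * layers * kv_heads * head_dim  # 2 = K and V
total = bytes_per_token * context
print(f"{bytes_per_token} bytes/token")             # 57344 (56 KiB)
print(f"~{total / 2**20:.0f} MiB at full context")  # ~448 MiB
```

That is comfortably within the DGX Spark's 128 GB unified memory, even alongside the quantized weights.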

## About Sarvam-1

Sarvam-1 is a multilingual language model with strong performance across Indian languages. This quantized version reduces the memory footprint while maintaining quality, making it suitable for deployment on compact systems such as the DGX Spark.

## Acknowledgments

- Base model by [Sarvam AI](https://www.sarvam.ai/)
- Quantization with [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)