---
library_name: transformers
base_model: sarvamai/sarvam-1
tags:
- sarvam
- nvfp4
- quantized
- modelopt
- tensorrt-llm
- dgx-spark
license: apache-2.0
pipeline_tag: text-generation
inference: false
---

# Sanyam0605/sarvam-1-NVFP4

An NVFP4-quantized version of [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1), produced with [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (modelopt 0.35.0).

## Quantization Details

| Parameter | Value |
|---|---|
| **Base Model** | [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1) |
| **Architecture** | LlamaForCausalLM |
| **Quantization** | NVFP4 (4-bit floating point) |
| **KV Cache** | FP8 |
| **Group Size** | 16 |
| **Hidden Size** | 2048 |
| **Layers** | 28 |
| **Attention Heads** | 16 (KV: 8) |
| **Context Length** | 8192 |
| **Vocab Size** | 68096 |
| **Quantizer** | modelopt v0.35.0 |
| **Excluded Modules** | `lm_head` |
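The table above pins down the NVFP4 storage layout: 4-bit weight values with one shared scale per group of 16 elements. A rough back-of-the-envelope sketch of the resulting compression, assuming one 8-bit (FP8) scale per group and a BF16 baseline (the exact on-disk size also depends on per-tensor scales and the unquantized `lm_head`):

```python
# Approximate storage cost of NVFP4 weights vs. a BF16 baseline.
# Assumption: 4 bits per weight plus one 8-bit scale per group of 16.
GROUP_SIZE = 16
bits_per_weight = 4 + 8 / GROUP_SIZE  # effective bits per stored weight
bf16_bits = 16

compression = bf16_bits / bits_per_weight
print(f"effective bits/weight: {bits_per_weight}")   # 4.5
print(f"compression vs BF16:  ~{compression:.2f}x")  # ~3.56x
```

So the quantized weights need roughly 28% of the memory of the BF16 originals, before accounting for the excluded modules.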
+
|
| 37 |
+
## Usage
|
| 38 |
+
|
| 39 |
+
### With TensorRT-LLM (recommended)
|
| 40 |
+
|
| 41 |
+
```python
|
| 42 |
+
from tensorrt_llm import LLM, SamplingParams
|
| 43 |
+
|
| 44 |
+
llm = LLM(model="Sanyam0605/sarvam-1-NVFP4")
|
| 45 |
+
output = llm.generate(["Hello, tell me about"], sampling_params=SamplingParams(max_tokens=128))
|
| 46 |
+
print(output[0].outputs[0].text)
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
### With TensorRT-LLM CLI
|
| 50 |
+
|
| 51 |
+
```bash
|
| 52 |
+
# Using the NVIDIA DGX Spark container
|
| 53 |
+
docker run --rm --gpus all \
|
| 54 |
+
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
|
| 55 |
+
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
|
| 56 |
+
python -c "
|
| 57 |
+
from tensorrt_llm import LLM, SamplingParams
|
| 58 |
+
llm = LLM(model='Sanyam0605/sarvam-1-NVFP4')
|
| 59 |
+
out = llm.generate(['Translate to Hindi: Good morning'], sampling_params=SamplingParams(max_tokens=64))
|
| 60 |
+
print(out[0].outputs[0].text)
|
| 61 |
+
"
|
| 62 |
+
```
|

### Loading with Hugging Face Transformers

> **Note:** NVFP4 quantization requires TensorRT-LLM for inference. Standard `transformers` loading is not supported for this quantization format.

## Hardware Requirements

- **Recommended:** NVIDIA DGX Spark (GB10, 128 GB UMA) or any GPU with FP4 support (Blackwell architecture)
- **CUDA Compute Capability:** 12.0+
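Beyond the weights, the FP8 KV cache dominates memory at long context. Its footprint at the full 8192-token context can be estimated from the shape parameters in the table above (28 layers, 8 KV heads, head dim 2048 / 16 = 128). A hedged sketch, assuming 1 byte per FP8 element and ignoring paging and allocator overhead:

```python
# Rough FP8 KV-cache size at the full 8192-token context.
# Assumptions: 1 byte per FP8 cache element; no runtime overhead.
layers, kv_heads, context = 28, 8, 8192
head_dim = 2048 // 16  # hidden size / attention heads = 128

bytes_per_token = 2 * layers * kv_heads * head_dim  # 2 = K and V
total = bytes_per_token * context
print(f"{bytes_per_token} bytes/token")             # 57344 (56 KiB)
print(f"~{total / 2**20:.0f} MiB at full context")  # ~448 MiB
```

That is comfortably within the DGX Spark's 128 GB unified memory, even alongside the quantized weights.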

## About Sarvam-1

Sarvam-1 is a multilingual language model with strong performance across Indian languages. This quantized version reduces the memory footprint while maintaining quality, making it suitable for deployment on compact systems such as the DGX Spark.

## Acknowledgments

- Base model by [Sarvam AI](https://www.sarvam.ai/)
- Quantization with [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)