---
library_name: transformers
base_model: sarvamai/sarvam-1
tags:
- sarvam
- nvfp4
- quantized
- modelopt
- tensorrt-llm
- dgx-spark
license: apache-2.0
pipeline_tag: text-generation
inference: false
---

# Sanyam0605/sarvam-1-NVFP4

An NVFP4-quantized version of [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1), produced with [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (modelopt 0.35.0).

## Quantization Details

| Parameter | Value |
|---|---|
| **Base Model** | [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1) |
| **Architecture** | LlamaForCausalLM |
| **Quantization** | NVFP4 (4-bit floating point) |
| **KV Cache** | FP8 |
| **Group Size** | 16 |
| **Hidden Size** | 2048 |
| **Layers** | 28 |
| **Attention Heads** | 16 (KV: 8) |
| **Context Length** | 8192 |
| **Vocab Size** | 68096 |
| **Quantizer** | modelopt v0.35.0 |
| **Excluded Modules** | lm_head |

## Usage

### With TensorRT-LLM (recommended)

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Sanyam0605/sarvam-1-NVFP4")
output = llm.generate(
    ["Hello, tell me about"],
    sampling_params=SamplingParams(max_tokens=128),
)
print(output[0].outputs[0].text)
```

### With TensorRT-LLM CLI

```bash
# Using the NVIDIA DGX Spark container
docker run --rm --gpus all \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  python -c "
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model='Sanyam0605/sarvam-1-NVFP4')
out = llm.generate(['Translate to Hindi: Good morning'], sampling_params=SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
"
```

### Loading with HuggingFace Transformers

> **Note:** NVFP4 quantization requires TensorRT-LLM for inference. Standard `transformers` loading is not supported for this quantization format.
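To get a rough sense of the memory savings, the figures in the quantization table are enough for a back-of-the-envelope estimate of weight memory at 16-bit versus NVFP4 precision. This is a sketch, not an official figure: the MLP intermediate size is **not** listed in the table, so the value below is a hypothetical placeholder, and norms/biases are ignored. NVFP4 stores 4-bit weights plus one FP8 scale per group of 16, i.e. roughly 0.5625 bytes per parameter.

```python
# Back-of-the-envelope weight-memory estimate from the config values above.
hidden, layers, heads, kv_heads, vocab = 2048, 28, 16, 8, 68096
head_dim = hidden // heads  # 128

# Attention projections: Q and O are (hidden x hidden); K and V project to
# kv_heads * head_dim (grouped-query attention with 8 KV heads).
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)

# MLP (gate/up/down, SwiGLU-style). NOTE: intermediate size is an ASSUMPTION,
# not taken from the table -- 4x hidden is used here purely for illustration.
inter = 8192  # hypothetical
mlp = 3 * hidden * inter

quantized = layers * (attn + mlp) + vocab * hidden  # embeddings included
lm_head = vocab * hidden                            # excluded from quantization

fp16_gib = (quantized + lm_head) * 2 / 2**30
# NVFP4: 4 bits/weight + one FP8 scale per group of 16 = 0.5625 bytes/weight;
# lm_head stays at 16-bit (2 bytes) per the "Excluded Modules" row.
nvfp4_gib = (quantized * (0.5 + 1 / 16) + lm_head * 2) / 2**30

print(f"~{(quantized + lm_head) / 1e9:.2f}B params (norms excluded)")
print(f"FP16 weights: ~{fp16_gib:.2f} GiB, NVFP4 weights: ~{nvfp4_gib:.2f} GiB")
```

Under these assumptions the weights shrink by roughly 3x; actual checkpoint size also depends on the real intermediate size and any higher-precision auxiliary tensors.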
## Hardware Requirements

- **Recommended:** NVIDIA DGX Spark (GB10, 128GB UMA) or any GPU with FP4 support (Blackwell architecture)
- **CUDA Compute Capability:** 12.0+

## About Sarvam-1

Sarvam-1 is a multilingual language model with strong performance across Indian languages. This quantized version reduces the memory footprint while maintaining quality, making it suitable for deployment on compact devices such as the DGX Spark.

## Acknowledgments

- Base model by [Sarvam AI](https://www.sarvam.ai/)
- Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
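As a quick sanity check against the compute-capability requirement listed above, here is a minimal sketch. The threshold comparison is plain Python; `supports_fp4` is a hypothetical helper name (not part of any library), and on a live GPU you would obtain the real `(major, minor)` pair via PyTorch's standard `torch.cuda.get_device_capability()`.

```python
def supports_fp4(major: int, minor: int, required=(12, 0)) -> bool:
    """Check whether a GPU's compute capability meets the 12.0+ requirement
    stated in the Hardware Requirements section (tuple comparison)."""
    return (major, minor) >= required

# On a CUDA machine, feed in real values instead:
#   import torch
#   major, minor = torch.cuda.get_device_capability()
print(supports_fp4(12, 1))  # DGX Spark (GB10) class -> True
print(supports_fp4(9, 0))   # Hopper, no native FP4 per the note above -> False
```

This mirrors the document's stated 12.0+ floor; adjust `required` if you target a different deployment baseline.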