---
library_name: transformers
base_model: sarvamai/sarvam-1
tags:
- sarvam
- nvfp4
- quantized
- modelopt
- tensorrt-llm
- dgx-spark
license: apache-2.0
pipeline_tag: text-generation
inference: false
---

# Sanyam0605/sarvam-1-NVFP4

An NVFP4-quantized version of [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1), produced with [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (modelopt 0.35.0).

## Quantization Details

| Parameter | Value |
|---|---|
| **Base Model** | [sarvamai/sarvam-1](https://huggingface.co/sarvamai/sarvam-1) |
| **Architecture** | LlamaForCausalLM |
| **Quantization** | NVFP4 (4-bit floating point) |
| **KV Cache** | FP8 |
| **Group Size** | 16 |
| **Hidden Size** | 2048 |
| **Layers** | 28 |
| **Attention Heads** | 16 (KV: 8) |
| **Context Length** | 8192 |
| **Vocab Size** | 68096 |
| **Quantizer** | modelopt v0.35.0 |
| **Excluded Modules** | lm_head |

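NVFP4 packs each weight into a 4-bit E2M1 float and shares one scale per 16-element group (the "Group Size" above). Below is a minimal NumPy sketch of that round-trip, illustrative only and not the modelopt implementation, which additionally stores the per-group scales themselves in FP8 (E4M3):

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_nvfp4(x, group_size=16):
    """Round each group to the nearest E2M1 point under a shared
    per-group scale, then dequantize back to float."""
    out = np.empty_like(x, dtype=np.float64)
    for start in range(0, len(x), group_size):
        g = x[start:start + group_size]
        scale = np.abs(g).max() / E2M1_GRID[-1]  # map the group max to 6.0
        if scale == 0.0:
            scale = 1.0  # all-zero group: any scale works
        # Nearest representable magnitude, keeping each value's sign.
        idx = np.abs(np.abs(g[:, None]) / scale - E2M1_GRID).argmin(axis=1)
        out[start:start + group_size] = np.sign(g) * E2M1_GRID[idx] * scale
    return out

w = np.array([0.01, -0.2, 0.5, 1.3, -2.2, 0.0, 3.9, -0.7] * 2)
print(fake_quantize_nvfp4(w))  # each value snapped to a scaled E2M1 point
```

At group size 16 this costs 4 bits per weight plus one shared scale per group, versus 16 bits per weight for bf16.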
## Usage

### With TensorRT-LLM (recommended)

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Sanyam0605/sarvam-1-NVFP4")
output = llm.generate(["Hello, tell me about"], sampling_params=SamplingParams(max_tokens=128))
print(output[0].outputs[0].text)
```

### With Docker (DGX Spark container)

```bash
# Using the NVIDIA DGX Spark container
docker run --rm --gpus all \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  python -c "
from tensorrt_llm import LLM, SamplingParams
llm = LLM(model='Sanyam0605/sarvam-1-NVFP4')
out = llm.generate(['Translate to Hindi: Good morning'], sampling_params=SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
"
```

### Loading with HuggingFace Transformers

> **Note:** NVFP4 quantization requires TensorRT-LLM for inference. Standard `transformers` loading is not supported for this quantization format.

## Hardware Requirements

- **Recommended:** NVIDIA DGX Spark (GB10, 128 GB unified memory) or any GPU with FP4 support (Blackwell architecture)
- **CUDA Compute Capability:** 12.0+

## About Sarvam-1

Sarvam-1 is a multilingual language model with strong performance across Indian languages. This quantized version reduces the memory footprint while maintaining output quality, making it suitable for deployment on compact systems such as the DGX Spark.

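As a rough sanity check on that footprint claim (assuming sarvam-1's published ~2B parameter count; the exact checkpoint size also depends on the unquantized `lm_head` and the per-group scales):

```python
# Back-of-the-envelope weight storage, assuming ~2B parameters.
params = 2e9

bf16_gb = params * 2 / 1e9                 # 2 bytes per weight
nvfp4_gb = params * (0.5 + 1 / 16) / 1e9   # 4-bit weight + 1 FP8 scale byte per 16
print(f"bf16 ~= {bf16_gb:.1f} GB, NVFP4 ~= {nvfp4_gb:.2f} GB")
```

Roughly a 3.6x reduction in weight storage; activations and the KV cache (FP8 here) add to the runtime footprint on top of that.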
77
+ ## Acknowledgments
78
+
79
+ - Base model by [Sarvam AI](https://www.sarvam.ai/)
80
+ - Quantization using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)