viraman committed on
Commit ad33cc9 · verified · 1 Parent(s): c37d76a

Update README.md

Files changed (1)
  1. README.md +139 -260
README.md CHANGED
@@ -1,9 +1,4 @@
1
- ---
2
- license: other
3
- license_name: nvidia-open-model-license
4
- license_link: LICENSE
5
- ---
6
- # NVIDIA-Nemotron-3-Nano-4B
7
 
8
  **Model Developer:** NVIDIA Corporation
9
 
@@ -19,39 +14,43 @@ The pretraining data has a cutoff date of September 2024\.
19
 
20
  ## Model Overview
21
 
22
- NVIDIA-Nemotron-Nano-4B-v2.1 is a small language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.
23
 
24
- The model has been compressed from NVIDIA-Nemotron-Nano-9B-v2 using [Nemotron Elastic](https://arxiv.org/pdf/2511.16664) framework. The details of the parent model NVIDIA-Nemotron-Nano-9B-v2 can be found in ([Nemotron-H tech report](https://arxiv.org/abs/2504.03624)). The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers.
25
 
26
  The supported languages include: English. Improved using Qwen.
27
 
28
- This model is ready for commercial use.
29
 
 
 
 
30
 
31
  ### Deployment Geography: Global
32
 
33
  ### Use Case
34
 
35
- NVIDIA-Nemotron-Nano-4B-v2.1 is an edge-ready small language model intended for Agentic AI in edge platforms (Jetson Thor, GeForce RTX, DGX Spark). It targets key-uses including AI gaming NPCs (teammates / companions), local voice assistants (for devices, apps, and games), and IoT automation. It is to be used in English and coding languages.
36
 
37
- ### Release Date: 03/10/2026
38
 
39
- Huggingface TBD via [https://huggingface.co/](https://huggingface.co/)
40
- API Catalog TBD via [https://catalog.ngc.nvidia.com/models](https://catalog.ngc.nvidia.com/models)
41
 
42
  ## References
43
 
44
- - [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf)
45
-
 
 
 
46
 
47
  ## Model Architecture
48
 
49
  - Architecture Type: Mamba2-Transformer Hybrid
50
- - Network Architecture: Nemotron-Hybrid
51
- - This model was compressed from [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)
52
  - Number of model parameters: 3.97 x 10^9
53
 
54
-
55
  ## Input
56
 
57
  - Input Type(s): Text
@@ -63,7 +62,7 @@ API Catalog TBD via [https://catalog.ngc.nvidia.com/models](https://catalog.ngc.
63
 
64
  - Output Type(s): Text
65
  - Output Format: String
66
- - Output Parameters: One-Dimensional (1D): Sequences
67
  - Other Properties Related to Output: Sequences up to 262K
68
 
69
  Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
@@ -71,7 +70,7 @@ Our models are designed and optimized to run on NVIDIA GPU-accelerated systems.
71
  ## Software Integration
72
 
73
  - Runtime Engine(s): NeMo 25.07
74
- - Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100, GeForce RTX
75
  - Operating System(s): Linux
76
 
77
  The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
@@ -85,9 +84,9 @@ import torch
85
  from transformers import AutoTokenizer, AutoModelForCausalLM
86
 
87
  # Load tokenizer and model
88
- tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-4B-v2.1")
89
  model = AutoModelForCausalLM.from_pretrained(
90
- "nvidia/NVIDIA-Nemotron-Nano-4B-v2.1",
91
  torch_dtype=torch.bfloat16,
92
  trust_remote_code=True,
93
  device_map="auto"
@@ -114,7 +113,7 @@ outputs = model.generate(
114
  print(tokenizer.decode(outputs[0]))
115
  ```
116
 
117
- temperature=1.0 and top\_p=1.0 are recommended for reasoning tasks, while temperature=0.6 and top\_p=0.95 are recommended for tool calling.
118
 
119
  If you’d like to use reasoning off, add enable\_thinking=False to apply\_chat\_template(). By default, enable\_thinking is set to be True.
120
 
@@ -141,252 +140,119 @@ print(tokenizer.decode(outputs[0]))
141
 
142
  ### **Use it with vLLM**
143
 
144
- We need vllm\>=0.12.0 for this model. If you are on Jetson Thor or DGX Spark, please use [this vllm container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.12.post1-py3).
 
 
145
 
146
  ```
147
- pip install -U "vllm>=0.12.0"
148
  ```
149
 
150
  Download the custom parser from the Hugging Face repository.
151
 
152
  ```
153
- wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
154
  ```
155
 
156
- ## Launch a vLLM server using the custom parser.
157
 
158
  ```
159
- vllm serve nvidia/NVIDIA-Nemotron-4B-v2.1 \
160
- --served-model-name model \
161
  --max-num-seqs 8 \
162
  --tensor-parallel-size 1 \
163
  --max-model-len 262144 \
164
  --port 8000 \
165
  --trust-remote-code \
 
166
  --enable-auto-tool-choice \
167
  --tool-call-parser qwen3_coder \
168
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
169
  --reasoning-parser nano_v3
170
  ```
171
 
172
- ##
173
 
174
- ## Model Version
175
 
176
- - v1.0
 
 
 
177
 
178
- ## Prompt Format
179
 
180
- We follow the jinja chat template provided below.
181
 
 
 
 
182
  ```
183
- {% macro render_extra_keys(json_dict, handled_keys) %}
184
- {%- if json_dict is mapping %}
185
- {%- for json_key in json_dict if json_key not in handled_keys %}
186
- {%- if json_dict[json_key] is mapping or (json_dict[json_key] is sequence and json_dict[json_key] is not string) %}
187
- {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' }}
188
- {%- else %}
189
- {{-'\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' }}
190
- {%- endif %}
191
- {%- endfor %}
192
- {%- endif %}
193
- {% endmacro %}
194
- {%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
195
- {%- set truncate_history_thinking = truncate_history_thinking if truncate_history_thinking is defined else True %}
196
-
197
- {%- set ns = namespace(last_user_idx = -1) %}
198
- {%- set loop_messages = messages %}
199
- {%- for m in loop_messages %}
200
- {%- if m["role"] == "user" %}
201
- {%- set ns.last_user_idx = loop.index0 %}
202
- {%- endif %}
203
- {%- endfor %}
204
-
205
- {%- if messages[0]["role"] == "system" %}
206
- {%- set system_message = messages[0]["content"] %}
207
- {%- set loop_messages = messages[1:] %}
208
- {%- else %}
209
- {%- set system_message = "" %}
210
- {%- set loop_messages = messages %}
211
- {%- endif %}
212
- {%- if not tools is defined %}
213
- {%- set tools = [] %}
214
- {%- endif %}
215
- {# Recompute last_user_idx relative to loop_messages after handling system #}
216
- {%- set ns = namespace(last_user_idx = -1) %}
217
- {%- for m in loop_messages %}
218
- {%- if m["role"] == "user" %}
219
- {%- set ns.last_user_idx = loop.index0 %}
220
- {%- endif %}
221
- {%- endfor %}
222
- {%- if system_message is defined %}
223
- {{- "<|im_start|>system\n" + system_message }}
224
- {%- else %}
225
- {%- if tools is iterable and tools | length > 0 %}
226
- {{- "<|im_start|>system\n" }}
227
- {%- endif %}
228
- {%- endif %}
229
- {%- if tools is iterable and tools | length > 0 %}
230
- {%- if system_message is defined and system_message | length > 0 %}
231
- {{- "\n\n" }}
232
- {%- endif %}
233
- {{- "# Tools\n\nYou have access to the following functions:\n\n" }}
234
- {{- "<tools>" }}
235
- {%- for tool in tools %}
236
- {%- if tool.function is defined %}
237
- {%- set tool = tool.function %}
238
- {%- endif %}
239
- {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
240
- {%- if tool.description is defined %}
241
- {{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
242
- {%- endif %}
243
- {{- '\n<parameters>' }}
244
- {%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
245
- {%- for param_name, param_fields in tool.parameters.properties|items %}
246
- {{- '\n<parameter>' }}
247
- {{- '\n<name>' ~ param_name ~ '</name>' }}
248
- {%- if param_fields.type is defined %}
249
- {{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
250
- {%- endif %}
251
- {%- if param_fields.description is defined %}
252
- {{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
253
- {%- endif %}
254
- {%- if param_fields.enum is defined %}
255
- {{- '\n<enum>' ~ (param_fields.enum | tojson | safe) ~ '</enum>' }}
256
- {%- endif %}
257
- {%- set handled_keys = ['name', 'type', 'description', 'enum'] %}
258
- {{- render_extra_keys(param_fields, handled_keys) }}
259
- {{- '\n</parameter>' }}
260
- {%- endfor %}
261
- {%- endif %}
262
- {% set handled_keys = ['type', 'properties', 'required'] %}
263
- {{- render_extra_keys(tool.parameters, handled_keys) }}
264
- {%- if tool.parameters is defined and tool.parameters.required is defined %}
265
- {{- '\n<required>' ~ (tool.parameters.required | tojson | safe) ~ '</required>' }}
266
- {%- endif %}
267
- {{- '\n</parameters>' }}
268
- {%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
269
- {{- render_extra_keys(tool, handled_keys) }}
270
- {{- '\n</function>' }}
271
- {%- endfor %}
272
- {{- "\n</tools>" }}
273
-
274
- {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
275
- {%- endif %}
276
-
277
-
278
- {%- if system_message is defined %}
279
- {{- '<|im_end|>\n' }}
280
- {%- else %}
281
- {%- if tools is iterable and tools | length > 0 %}
282
- {{- '<|im_end|>\n' }}
283
- {%- endif %}
284
- {%- endif %}
285
-
286
- {%- for message in loop_messages %}
287
- {%- if message.role == "assistant" %}
288
- {# Add reasoning content in to content field for unified processing below. #}
289
- {%- if message.reasoning_content is defined and message.reasoning_content is string and message.reasoning_content | trim | length > 0 %}
290
- {%- set content = "<think>\n" ~ message.reasoning_content ~ "\n</think>\n" ~ (message.content | default('', true)) %}
291
- {%- else %}
292
- {%- set content = message.content | default('', true) %}
293
- {%- if content is string -%}
294
- {# Allow downstream logic to to take care of broken thought, only handle coherent reasoning here. #}
295
- {%- if '<think>' not in content and '</think>' not in content -%}
296
- {%- set content = "<think></think>" ~ content -%}
297
- {%- endif -%}
298
- {%- else -%}
299
- {%- set content = content -%}
300
- {%- endif -%}
301
- {%- endif %}
302
- {%- if message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
303
- {# Assistant message has tool calls. #}
304
- {{- '<|im_start|>assistant\n' }}
305
- {%- set include_content = not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
306
- {%- if content is string and content | trim | length > 0 %}
307
- {%- if include_content %}
308
- {{- (content | trim) ~ '\n' -}}
309
- {%- else %}
310
- {%- set c = (content | string) %}
311
- {%- if '</think>' in c %}
312
- {# Keep only content after the last closing think. Also generation prompt causes this. #}
313
- {%- set c = c.split('</think>')[-1] %}
314
- {%- elif '<think>' in c %}
315
- {# If <think> was opened but never closed, drop the trailing think segment #}
316
- {%- set c = c.split('<think>')[0] %}
317
- {%- endif %}
318
- {%- set c = "<think></think>" ~ c | trim %}
319
- {%- if c | length > 0 %}
320
- {{- c ~ '\n' -}}
321
- {%- endif %}
322
- {%- endif %}
323
- {%- else %}
324
- {{- "<think></think>" -}}
325
- {%- endif %}
326
- {%- for tool_call in message.tool_calls %}
327
- {%- if tool_call.function is defined %}
328
- {%- set tool_call = tool_call.function %}
329
- {%- endif %}
330
- {{- '<tool_call>\n<function=' ~ tool_call.name ~ '>\n' -}}
331
- {%- if tool_call.arguments is defined %}
332
- {%- for args_name, args_value in tool_call.arguments|items %}
333
- {{- '<parameter=' ~ args_name ~ '>\n' -}}
334
- {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
335
- {{- args_value ~ '\n</parameter>\n' -}}
336
- {%- endfor %}
337
- {%- endif %}
338
- {{- '</function>\n</tool_call>\n' -}}
339
- {%- endfor %}
340
- {{- '<|im_end|>\n' }}
341
- {%- else %}
342
- {# Assistant message doesn't have tool calls. #}
343
- {%- if not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
344
- {{- '<|im_start|>assistant\n' ~ (content | default('', true) | string | trim) ~ '<|im_end|>\n' }}
345
- {%- else %}
346
- {%- set c = (content | default('', true) | string) %}
347
- {%- if '<think>' in c and '</think>' in c %}
348
- {%- set c = "<think></think>" ~ c.split('</think>')[-1] %}
349
- {%- endif %}
350
- {%- set c = c | trim %}
351
- {%- if c | length > 0 %}
352
- {{- '<|im_start|>assistant\n' ~ c ~ '<|im_end|>\n' }}
353
- {%- else %}
354
- {{- '<|im_start|>assistant\n<|im_end|>\n' }}
355
- {%- endif %}
356
- {%- endif %}
357
- {%- endif %}
358
- {%- elif message.role == "user" or message.role == "system" %}
359
- {{- '<|im_start|>' + message.role + '\n' }}
360
- {%- set content = message.content | string %}
361
- {{- content }}
362
- {{- '<|im_end|>\n' }}
363
- {%- elif message.role == "tool" %}
364
- {%- if loop.previtem and loop.previtem.role != "tool" %}
365
- {{- '<|im_start|>user\n' }}
366
- {%- endif %}
367
- {{- '<tool_response>\n' }}
368
- {{- message.content }}
369
- {{- '\n</tool_response>\n' }}
370
- {%- if not loop.last and loop.nextitem.role != "tool" %}
371
- {{- '<|im_end|>\n' }}
372
- {%- elif loop.last %}
373
- {{- '<|im_end|>\n' }}
374
- {%- endif %}
375
- {%- else %}
376
- {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
377
- {%- endif %}
378
- {%- endfor %}
379
-
380
- {%- if add_generation_prompt %}
381
- {%- if enable_thinking %}
382
- {{- '<|im_start|>assistant\n<think>\n' }}
383
- {%- else %}
384
- {{- '<|im_start|>assistant\n<think></think>' }}
385
- {%- endif %}
386
- {%- endif %}
387
 
 
 
388
  ```
389
 
 
 
390
  ##
391
 
392
  ## Training, Testing, and Evaluation Datasets
@@ -399,8 +265,7 @@ We follow the jinja chat template provided below.
399
  * Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
400
  * Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
401
 
402
-
403
- **Properties:** The post-training corpus for NVIDIA-Nemotron-Nano-4B-v2.1 consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracies. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, Qwen 2.5 72B.
404
 
405
  More details on the datasets and synthetic data generation methods can be found in the technical report [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf).
406
 
@@ -527,7 +392,7 @@ The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API.
527
  | Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | [AQUA-RAT](https://huggingface.co/datasets/deepmind/aqua_rat); [LogiQA](https://huggingface.co/datasets/lucasmccabe/logiqa); [AR-LSAT](https://github.com/zhongwanjun/AR-LSAT) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
528
  | Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | [Art of Problem Solving](https://artofproblemsolving.com/company); [American Mathematics Competitions 8](https://artofproblemsolving.com/wiki/index.php/AMC_8_Problems_and_Solutions); [American Mathematics Competitions 10](https://artofproblemsolving.com/wiki/index.php/AMC_10_Problems_and_Solutions); [GSM8K](https://github.com/openai/grade-school-math); [PRM800K](https://github.com/openai/prm800k) | [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct); [Qwen2.5-Math-72B](https://huggingface.co/Qwen/Qwen2.5-Math-72B); [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B); [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
529
  | Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | [MMLU Auxiliary Train](https://huggingface.co/datasets/cais/mmlu/viewer/all/auxiliary_train) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
530
- | Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | [arXiv](https://info.arxiv.org/help/bulk_data/index.html); [National Institutes of Health ExPorter](https://www.nih.gov/); [BioRxiv](https://www.biorxiv.org/tdm); [PMC Article](https://pmc.ncbi.nlm.nih.gov/tools/textmining/); [USPTO Backgrounds](https://data.uspto.gov/apis/transition-guide/bdss#pats); [peS2o](https://huggingface.co/datasets/allenai/peS2o); Global Regulation; [CORE](https://core.ac.uk/documentation/dataset); [PG-19](https://github.com/google-deepmind/pg19); [DOAB CC BY & CC BY-SA subset](https://www.doabooks.org/en); [NDLTD](https://ndltd.org/thesis-resources/global-etd-search/) | [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
531
  | Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B); [Mistral-NeMo-12B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct) |
532
  | Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
533
  | Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | [Wikimedia](https://dumps.wikimedia.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
@@ -618,36 +483,50 @@ The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API.
618
 
619
  We evaluated our model in **Reasoning-On** mode across these benchmarks.
620
 
621
- | Benchmark | NVIDIA-Nemotron-Nano-v2.1-4B |
622
- | :---- | ----- |
623
- | AIME25 | X |
624
- | MATH500 | X |
625
- | GPQA | X |
626
- | LCB | X |
627
- | BFCL v3 | X |
628
- | IFEVAL-Prompt | X |
629
- | IFEVAL-Instruction | X |
 
 
 
630
 
631
  We also evaluated our model in **Reasoning-off** mode across these benchmarks.
632
 
633
- | Benchmark | NVIDIA-Nemotron-Nano-v2.1-4B |
634
  | :---- | ----- |
635
- | BFCL v3 | X |
636
- | IFEVAL-Prompt | X |
637
- | IFEVAL-Instruction | X |
638
- | Orak | X |
 
 
639
 
640
  All evaluations were done using [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills/tree/main/docs) and [Orak](https://github.com/krafton-ai/Orak). For Orak, we evaluated on three games (Super Mario, Darkest Dungeon, and Stardew Valley).
641
 
642
  ## Inference
643
 
644
- - ## Engines: HF, vLLM, llama-cpp
645
-
646
- - ## Test Hardware NVIDIA GeForce RTX, H100 80GB,
647
 
648
  ## Ethical Considerations
649
 
650
- NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
 
651
 
652
- Please report security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
653
 
 
 
1
+ # NVIDIA-Nemotron-3-Nano-4B-BF16
 
 
2
 
3
  **Model Developer:** NVIDIA Corporation
4
 
 
14
 
15
  ## Model Overview
16
 
17
+ NVIDIA-Nemotron-3-Nano-4B-BF16 is a small language model (SLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.
18
 
19
+ The model has been compressed from NVIDIA-Nemotron-Nano-9B-v2 using the [Nemotron Elastic](https://arxiv.org/pdf/2511.16664) framework. Details of the parent model NVIDIA-Nemotron-Nano-9B-v2 can be found in the [Nemotron-H tech report](https://arxiv.org/abs/2504.03624). The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers.
20
 
21
  The supported languages include: English. Improved using Qwen.
22
 
23
+ This model is ready for commercial use.
24
 
25
+ ## License/Terms of Use
26
+
27
+ Governing Terms: Use of this model is governed by the [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/).
28
 
29
  ### Deployment Geography: Global
30
 
31
  ### Use Case
32
 
33
+ NVIDIA-Nemotron-3-Nano-4B is an edge-ready small language model intended for Agentic AI on edge platforms (Jetson Thor, GeForce RTX, DGX Spark). It targets key uses including AI gaming NPCs (teammates/companions), local voice assistants (for devices, apps, and games), and IoT automation. It is intended for use in English and common programming languages.
34
 
35
+ ### Release Date: 3/16/2026
36
 
37
+ Hugging Face: 3/16/2026 via [https://huggingface.co/](https://huggingface.co/)
 
38
 
39
  ## References
40
 
41
+ - [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf)
42
+ - [Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs](https://arxiv.org/abs/2511.16664)
43
+ - [NVIDIA Nemotron 3: Efficient and Open Intelligence](https://arxiv.org/abs/2512.20856)
44
+ - [Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning](https://arxiv.org/abs/2512.20848)
45
+ - [Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)
46
 
47
  ## Model Architecture
48
 
49
  - Architecture Type: Mamba2-Transformer Hybrid
50
+ - Network Architecture: Nemotron-Hybrid
51
+ - This model was compressed from [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)
52
  - Number of model parameters: 3.97 x 10^9
53
 
 
54
  ## Input
55
 
56
  - Input Type(s): Text
 
62
 
63
  - Output Type(s): Text
64
  - Output Format: String
65
+ - Output Parameters: One-Dimensional (1D): Sequences
66
  - Other Properties Related to Output: Sequences up to 262K
67
 
68
  Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
 
70
  ## Software Integration
71
 
72
  - Runtime Engine(s): NeMo 25.07
73
+ - Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100, GeForce RTX
74
  - Operating System(s): Linux
75
 
76
  The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
 
84
  from transformers import AutoTokenizer, AutoModelForCausalLM
85
 
86
  # Load tokenizer and model
87
+ tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-4B")
88
  model = AutoModelForCausalLM.from_pretrained(
89
+ "nvidia/NVIDIA-Nemotron-3-Nano-4B",
90
  torch_dtype=torch.bfloat16,
91
  trust_remote_code=True,
92
  device_map="auto"
 
113
  print(tokenizer.decode(outputs[0]))
114
  ```
115
 
116
+ `temperature=1.0` and `top_p=0.95` are recommended for reasoning tasks, while `temperature=0.6` and `top_p=0.95` are recommended for tool calling.
117
 
118
  If you’d like to turn reasoning off, add `enable_thinking=False` to `apply_chat_template()`. By default, `enable_thinking` is set to True.
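The sampling and reasoning-toggle recommendations above can be bundled into a small helper. This is an illustrative sketch, not part of the model's API: the helper and the mode names ("reasoning", "tool_calling") are our own, while the numeric values are the ones recommended in this card.

```python
# Illustrative helper (not an official API): bundles the settings
# recommended in this card for the two task modes.
def chat_settings(mode: str, reasoning_on: bool = True) -> dict:
    sampling = {
        "reasoning": {"temperature": 1.0, "top_p": 0.95},
        "tool_calling": {"temperature": 0.6, "top_p": 0.95},
    }
    if mode not in sampling:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        # pass to model.generate(...)
        "generate_kwargs": {**sampling[mode], "do_sample": True},
        # pass to tokenizer.apply_chat_template(...)
        "template_kwargs": {"enable_thinking": reasoning_on},
    }
```

For example, `chat_settings("tool_calling", reasoning_on=False)` selects the tool-calling sampling values and disables the reasoning trace.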
119
 
 
140
 
141
  ### **Use it with vLLM**
142
 
143
+ This model requires vllm>=0.15.1. If you are on Jetson Thor or DGX Spark, please use [this vLLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=26.02-py3) (the 26.01 container has also been verified).
144
+
145
+ **(Docker launch command is missing)**
146
 
147
  ```
148
+ pip install -U "vllm>=0.15.1"
149
  ```
150
 
151
  Download the custom parser from the Hugging Face repository.
152
 
153
  ```
154
+ wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/resolve/main/nano_v3_reasoning_parser.py
155
  ```
156
 
157
+ Launch a vLLM server using the custom parser.
158
 
159
  ```
160
+ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B \
161
+ --served-model-name nemotron3-nano-4B \
162
  --max-num-seqs 8 \
163
  --tensor-parallel-size 1 \
164
  --max-model-len 262144 \
165
  --port 8000 \
166
  --trust-remote-code \
167
+ --mamba_ssm_cache_dtype float32 \
168
  --enable-auto-tool-choice \
169
  --tool-call-parser qwen3_coder \
170
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
171
  --reasoning-parser nano_v3
172
  ```
173
 
174
+ Access the hosted API using a Python client.
175
 
176
+ ```py
177
 
178
+ from openai import OpenAI
179
181
+
182
+ # NOTE: Streaming is preferred for better performance and resource efficiency.
183
+ # It allows you to start processing responses as they arrive, reducing latency.
184
+
185
+ # Synchronous example (non-streaming)
186
+ client = OpenAI(
187
+ api_key="your-nvapikey",
188
+ base_url="base-url"
189
+ )
190
+
191
+ response = client.chat.completions.create(
192
+ model="nemotron3-nano-4B",
193
+ messages=[
194
+ {
195
+ "role": "user",
196
+ "content": "Hello!"
197
+ }
198
+ ],
199
+ temperature=0.7,
200
+ max_tokens=256,
201
+ top_p=0.7,
202
+ stream=False
203
+ )
204
 
205
+ print(response.choices[0].message.content)
206
 
207
+ ```
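The comment in the client example above notes that streaming is preferred. A minimal sketch of the streaming variant follows; the `collect_stream` helper is our own illustration and simply follows the standard OpenAI streaming chunk shape (`chunk.choices[0].delta.content`).

```python
def collect_stream(chunks) -> str:
    # Accumulate the text deltas of a streamed chat completion
    # into the full response string.
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # deltas can be None (e.g. role-only chunks)
            parts.append(delta)
    return "".join(parts)

# Against a live server, reuse the client settings from the example above:
#
#   stream = client.chat.completions.create(
#       model="nemotron3-nano-4B",
#       messages=[{"role": "user", "content": "Hello!"}],
#       stream=True,  # process tokens as they arrive
#   )
#   print(collect_stream(stream))
```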
208
 
209
+ ### Use it with TRT-LLM
210
+
211
+ Launch the model using TRT-LLM
212
+
213
+ ```shell
214
+ docker run -v /home/root/.cache/huggingface/:/root/.cache/huggingface/ --rm --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --ipc=host --network host -d -e MODEL=$MODEL -e HF_TOKEN=$HF_TOKEN nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6 bash -c '
215
+ cat > /tmp/extra-llm-api-config.yml <<EOF
216
+ kv_cache_config:
217
+ dtype: "auto"
218
+ enable_block_reuse: false
219
+ cuda_graph_config:
220
+ max_batch_size: 32
221
+ enable_padding: true
222
+ disable_overlap_scheduler: true
223
+ moe_config:
224
+ backend: CUTLASS
225
+ EOF
226
+
227
+ trtllm-serve \
228
+ $MODEL \
229
+ --host 0.0.0.0 \
230
+ --port 8123 \
231
+ --max_batch_size 32 \
232
+ --extra_llm_api_options /tmp/extra-llm-api-config.yml '
233
  ```
 
 
234
 
235
+ Access the hosted endpoint using a `curl` command.
236
+
237
+ ```shell
238
+ curl http://localhost:8123/v1/chat/completions -H "Content-Type: application/json" -d '{
239
+ "model": "'"$MODEL"'",
240
+ "messages": [
241
+ {
242
+ "role": "user",
243
+ "content": "Where is New York?"
244
+ }
245
+ ],
246
+ "max_tokens": 1024,
247
+ "top_p": 1.0
248
+ }' -w "\n"
249
  ```
250
 
251
+
252
+ ## Model Version
253
+
254
+ - v1.0
255
+
256
  ##
257
 
258
  ## Training, Testing, and Evaluation Datasets
 
265
  * Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
266
  * Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
267
 
268
+ **Properties:** The post-training corpus for NVIDIA-Nemotron-3-Nano-4B consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese). Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above, we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, and Qwen 2.5 72B.
 
269
 
270
  More details on the datasets and synthetic data generation methods can be found in the technical report [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf).
271
 
 
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | [AQUA-RAT](https://huggingface.co/datasets/deepmind/aqua_rat); [LogiQA](https://huggingface.co/datasets/lucasmccabe/logiqa); [AR-LSAT](https://github.com/zhongwanjun/AR-LSAT) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | [Art of Problem Solving](https://artofproblemsolving.com/company); [American Mathematics Competitions 8](https://artofproblemsolving.com/wiki/index.php/AMC_8_Problems_and_Solutions); [American Mathematics Competitions 10](https://artofproblemsolving.com/wiki/index.php/AMC_10_Problems_and_Solutions); [GSM8K](https://github.com/openai/grade-school-math); [PRM800K](https://github.com/openai/prm800k) | [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct); [Qwen2.5-Math-72B](https://huggingface.co/Qwen/Qwen2.5-Math-72B); [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B); [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
| Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | [MMLU Auxiliary Train](https://huggingface.co/datasets/cais/mmlu/viewer/all/auxiliary_train) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | [arXiv](https://info.arxiv.org/help/bulk_data/index.html); [National Institutes of Health ExPorter](https://www.nih.gov/); [BioRxiv](https://www.biorxiv.org/tdm); [PMC Article](https://pmc.ncbi.nlm.nih.gov/tools/textmining/); [USPTO Backgrounds](https://data.uspto.gov/apis/transition-guide/bdss#pats); [peS2o](https://huggingface.co/datasets/allenai/peS2o); Global Regulation; [CORE](https://core.ac.uk/documentation/dataset); [PG-19](https://github.com/google-deepmind/pg19); [DOAB CC BY & CC BY-SA subset](https://www.doabooks.org/en); [NDLTD](https://ndltd.org/thesis-resources/global-etd-search/) | [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
| Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B); [Mistral-NeMo-12B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct) |
| Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | [Wikimedia](https://dumps.wikimedia.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
 
We evaluated our model in **Reasoning-On** mode across these benchmarks:

| Benchmark | NVIDIA-Nemotron-3-Nano-4B |
| :---- | :---: |
| AIME25 | 78.5 |
| MATH500 | 95.4 |
| GPQA | 53.2 |
| LCB | 51.8 |
| BFCL v3 | 61.1 |
| IFEval-Prompt | 87.9 |
| IFEval-Instruction | 92.0 |
| Tau2-Airline | 33.3 |
| Tau2-Retail | 39.8 |
| Tau2-Telecom | 33.0 |

We also evaluated our model in **Reasoning-Off** mode across these benchmarks:

| Benchmark | NVIDIA-Nemotron-3-Nano-4B |
| :---- | :---: |
| BFCL v3 | 61.1 |
| IFBench-Prompt | 43.2 |
| IFBench-Instruction | 44.2 |
| Orak | 22.9 |
| IFEval-Prompt | 82.8 |
| IFEval-Instruction | 88.0 |
| HaluEval | 62.2 |
| RULER (128k) | 91.1 |
| Tau2-Airline | 28.0 |
| Tau2-Retail | 34.8 |
| Tau2-Telecom | 24.9 |
| EQ-Bench3 | 63.2 |
 
All evaluations were done using [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills/tree/main/docs) and [Orak](https://github.com/krafton-ai/Orak). For Orak, we evaluated on three games (Super Mario, Darkest Dungeon, and Stardew Valley).

## Inference

- Engines: HF, vLLM, llama.cpp
- Test Hardware: NVIDIA GeForce RTX, H100 80GB, DGX Spark, Jetson Thor
 
## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our [Trustworthy AI terms of service](https://www.nvidia.com/en-us/agreements/trustworthy-ai/terms/), developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

We advise against circumvention of any provided safety guardrails contained in the Model without a substantially similar guardrail appropriate for your use case. For more details, see the [Safety](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/safety.md) and [Explainability](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/explainability.md) Subcards.

For more detailed information on ethical considerations for this model, please see the Model Card++ [Bias](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/bias.md) and [Privacy](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/privacy.md) Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).