viraman committed on
Commit ad33cc9 · verified · 1 Parent(s): c37d76a

Update README.md

Files changed (1)
  1. README.md +139 -260
README.md CHANGED
@@ -1,9 +1,4 @@
1
- ---
2
- license: other
3
- license_name: nvidia-open-model-license
4
- license_link: LICENSE
5
- ---
6
- # NVIDIA-Nemotron-3-Nano-4B
7
 
8
  **Model Developer:** NVIDIA Corporation
9
 
@@ -19,39 +14,43 @@ The pretraining data has a cutoff date of September 2024\.
19
 
20
  ## Model Overview
21
 
22
- NVIDIA-Nemotron-Nano-4B-v2.1 is a small language model (LLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.
23
 
24
- The model has been compressed from NVIDIA-Nemotron-Nano-9B-v2 using [Nemotron Elastic](https://arxiv.org/pdf/2511.16664) framework. The details of the parent model NVIDIA-Nemotron-Nano-9B-v2 can be found in ([Nemotron-H tech report](https://arxiv.org/abs/2504.03624)). The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers.
25
 
26
  The supported languages include: English. Improved using Qwen.
27
 
28
- This model is ready for commercial use.
29
 
 
 
 
30
 
31
  ### Deployment Geography: Global
32
 
33
  ### Use Case
34
 
35
- NVIDIA-Nemotron-Nano-4B-v2.1 is an edge-ready small language model intended for Agentic AI in edge platforms (Jetson Thor, GeForce RTX, DGX Spark). It targets key-uses including AI gaming NPCs (teammates / companions), local voice assistants (for devices, apps, and games), and IoT automation. It is to be used in English and coding languages.
36
 
37
- ### Release Date: 03/10/2026
38
 
39
- Huggingface TBD via [https://huggingface.co/](https://huggingface.co/)
40
- API Catalog TBD via [https://catalog.ngc.nvidia.com/models](https://catalog.ngc.nvidia.com/models)
41
 
42
  ## References
43
 
44
- - [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf)
45
-
 
 
 
46
 
47
  ## Model Architecture
48
 
49
  - Architecture Type: Mamba2-Transformer Hybrid
50
- - Network Architecture: Nemotron-Hybrid
51
- - This model was compressed from [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)
52
  - Number of model parameters: 3.97 x 10^9
53
 
54
-
55
  ## Input
56
 
57
  - Input Type(s): Text
@@ -63,7 +62,7 @@ API Catalog TBD via [https://catalog.ngc.nvidia.com/models](https://catalog.ngc.
63
 
64
  - Output Type(s): Text
65
  - Output Format: String
66
- - Output Parameters: One-Dimensional (1D): Sequences
67
  - Other Properties Related to Output: Sequences up to 262K
68
 
69
  Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
@@ -71,7 +70,7 @@ Our models are designed and optimized to run on NVIDIA GPU-accelerated systems.
71
  ## Software Integration
72
 
73
  - Runtime Engine(s): NeMo 25.07
74
- - Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100, GeForce RTX
75
  - Operating System(s): Linux
76
 
77
  The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
@@ -85,9 +84,9 @@ import torch
85
  from transformers import AutoTokenizer, AutoModelForCausalLM
86
 
87
  # Load tokenizer and model
88
- tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-4B-v2.1")
89
  model = AutoModelForCausalLM.from_pretrained(
90
- "nvidia/NVIDIA-Nemotron-Nano-4B-v2.1",
91
  torch_dtype=torch.bfloat16,
92
  trust_remote_code=True,
93
  device_map="auto"
@@ -114,7 +113,7 @@ outputs = model.generate(
114
  print(tokenizer.decode(outputs[0]))
115
  ```
116
 
117
- temperature=1.0 and top\_p=1.0 are recommended for reasoning tasks, while temperature=0.6 and top\_p=0.95 are recommended for tool calling.
118
 
119
  If you’d like to use reasoning off, add enable\_thinking=False to apply\_chat\_template(). By default, enable\_thinking is set to be True.
120
 
@@ -141,252 +140,119 @@ print(tokenizer.decode(outputs[0]))
141
 
142
  ### **Use it with vLLM**
143
 
144
- We need vllm\>=0.12.0 for this model. If you are on Jetson Thor or DGX Spark, please use [this vllm container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=25.12.post1-py3).
 
 
145
 
146
  ```
147
- pip install -U "vllm>=0.12.0"
148
  ```
149
 
150
  Download the custom parser from the Hugging Face repository.
151
 
152
  ```
153
- wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py
154
  ```
155
 
156
- ## Launch a vLLM server using the custom parser.
157
 
158
  ```
159
- vllm serve nvidia/NVIDIA-Nemotron-4B-v2.1 \
160
- --served-model-name model \
161
  --max-num-seqs 8 \
162
  --tensor-parallel-size 1 \
163
  --max-model-len 262144 \
164
  --port 8000 \
165
  --trust-remote-code \
 
166
  --enable-auto-tool-choice \
167
  --tool-call-parser qwen3_coder \
168
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
169
  --reasoning-parser nano_v3
170
  ```
171
 
172
- ##
173
 
174
- ## Model Version
175
 
176
- - v1.0
 
 
 
177
 
178
- ## Prompt Format
179
 
180
- We follow the jinja chat template provided below.
181
 
 
 
 
182
  ```
183
- {% macro render_extra_keys(json_dict, handled_keys) %}
184
- {%- if json_dict is mapping %}
185
- {%- for json_key in json_dict if json_key not in handled_keys %}
186
- {%- if json_dict[json_key] is mapping or (json_dict[json_key] is sequence and json_dict[json_key] is not string) %}
187
- {{- '\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | tojson | safe) ~ '</' ~ json_key ~ '>' }}
188
- {%- else %}
189
- {{-'\n<' ~ json_key ~ '>' ~ (json_dict[json_key] | string) ~ '</' ~ json_key ~ '>' }}
190
- {%- endif %}
191
- {%- endfor %}
192
- {%- endif %}
193
- {% endmacro %}
194
- {%- set enable_thinking = enable_thinking if enable_thinking is defined else True %}
195
- {%- set truncate_history_thinking = truncate_history_thinking if truncate_history_thinking is defined else True %}
196
-
197
- {%- set ns = namespace(last_user_idx = -1) %}
198
- {%- set loop_messages = messages %}
199
- {%- for m in loop_messages %}
200
- {%- if m["role"] == "user" %}
201
- {%- set ns.last_user_idx = loop.index0 %}
202
- {%- endif %}
203
- {%- endfor %}
204
-
205
- {%- if messages[0]["role"] == "system" %}
206
- {%- set system_message = messages[0]["content"] %}
207
- {%- set loop_messages = messages[1:] %}
208
- {%- else %}
209
- {%- set system_message = "" %}
210
- {%- set loop_messages = messages %}
211
- {%- endif %}
212
- {%- if not tools is defined %}
213
- {%- set tools = [] %}
214
- {%- endif %}
215
- {# Recompute last_user_idx relative to loop_messages after handling system #}
216
- {%- set ns = namespace(last_user_idx = -1) %}
217
- {%- for m in loop_messages %}
218
- {%- if m["role"] == "user" %}
219
- {%- set ns.last_user_idx = loop.index0 %}
220
- {%- endif %}
221
- {%- endfor %}
222
- {%- if system_message is defined %}
223
- {{- "<|im_start|>system\n" + system_message }}
224
- {%- else %}
225
- {%- if tools is iterable and tools | length > 0 %}
226
- {{- "<|im_start|>system\n" }}
227
- {%- endif %}
228
- {%- endif %}
229
- {%- if tools is iterable and tools | length > 0 %}
230
- {%- if system_message is defined and system_message | length > 0 %}
231
- {{- "\n\n" }}
232
- {%- endif %}
233
- {{- "# Tools\n\nYou have access to the following functions:\n\n" }}
234
- {{- "<tools>" }}
235
- {%- for tool in tools %}
236
- {%- if tool.function is defined %}
237
- {%- set tool = tool.function %}
238
- {%- endif %}
239
- {{- "\n<function>\n<name>" ~ tool.name ~ "</name>" }}
240
- {%- if tool.description is defined %}
241
- {{- '\n<description>' ~ (tool.description | trim) ~ '</description>' }}
242
- {%- endif %}
243
- {{- '\n<parameters>' }}
244
- {%- if tool.parameters is defined and tool.parameters is mapping and tool.parameters.properties is defined and tool.parameters.properties is mapping %}
245
- {%- for param_name, param_fields in tool.parameters.properties|items %}
246
- {{- '\n<parameter>' }}
247
- {{- '\n<name>' ~ param_name ~ '</name>' }}
248
- {%- if param_fields.type is defined %}
249
- {{- '\n<type>' ~ (param_fields.type | string) ~ '</type>' }}
250
- {%- endif %}
251
- {%- if param_fields.description is defined %}
252
- {{- '\n<description>' ~ (param_fields.description | trim) ~ '</description>' }}
253
- {%- endif %}
254
- {%- if param_fields.enum is defined %}
255
- {{- '\n<enum>' ~ (param_fields.enum | tojson | safe) ~ '</enum>' }}
256
- {%- endif %}
257
- {%- set handled_keys = ['name', 'type', 'description', 'enum'] %}
258
- {{- render_extra_keys(param_fields, handled_keys) }}
259
- {{- '\n</parameter>' }}
260
- {%- endfor %}
261
- {%- endif %}
262
- {% set handled_keys = ['type', 'properties', 'required'] %}
263
- {{- render_extra_keys(tool.parameters, handled_keys) }}
264
- {%- if tool.parameters is defined and tool.parameters.required is defined %}
265
- {{- '\n<required>' ~ (tool.parameters.required | tojson | safe) ~ '</required>' }}
266
- {%- endif %}
267
- {{- '\n</parameters>' }}
268
- {%- set handled_keys = ['type', 'name', 'description', 'parameters'] %}
269
- {{- render_extra_keys(tool, handled_keys) }}
270
- {{- '\n</function>' }}
271
- {%- endfor %}
272
- {{- "\n</tools>" }}
273
-
274
- {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
275
- {%- endif %}
276
-
277
-
278
- {%- if system_message is defined %}
279
- {{- '<|im_end|>\n' }}
280
- {%- else %}
281
- {%- if tools is iterable and tools | length > 0 %}
282
- {{- '<|im_end|>\n' }}
283
- {%- endif %}
284
- {%- endif %}
285
-
286
- {%- for message in loop_messages %}
287
- {%- if message.role == "assistant" %}
288
- {# Add reasoning content in to content field for unified processing below. #}
289
- {%- if message.reasoning_content is defined and message.reasoning_content is string and message.reasoning_content | trim | length > 0 %}
290
- {%- set content = "<think>\n" ~ message.reasoning_content ~ "\n</think>\n" ~ (message.content | default('', true)) %}
291
- {%- else %}
292
- {%- set content = message.content | default('', true) %}
293
- {%- if content is string -%}
294
- {# Allow downstream logic to to take care of broken thought, only handle coherent reasoning here. #}
295
- {%- if '<think>' not in content and '</think>' not in content -%}
296
- {%- set content = "<think></think>" ~ content -%}
297
- {%- endif -%}
298
- {%- else -%}
299
- {%- set content = content -%}
300
- {%- endif -%}
301
- {%- endif %}
302
- {%- if message.tool_calls is defined and message.tool_calls is iterable and message.tool_calls | length > 0 %}
303
- {# Assistant message has tool calls. #}
304
- {{- '<|im_start|>assistant\n' }}
305
- {%- set include_content = not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
306
- {%- if content is string and content | trim | length > 0 %}
307
- {%- if include_content %}
308
- {{- (content | trim) ~ '\n' -}}
309
- {%- else %}
310
- {%- set c = (content | string) %}
311
- {%- if '</think>' in c %}
312
- {# Keep only content after the last closing think. Also generation prompt causes this. #}
313
- {%- set c = c.split('</think>')[-1] %}
314
- {%- elif '<think>' in c %}
315
- {# If <think> was opened but never closed, drop the trailing think segment #}
316
- {%- set c = c.split('<think>')[0] %}
317
- {%- endif %}
318
- {%- set c = "<think></think>" ~ c | trim %}
319
- {%- if c | length > 0 %}
320
- {{- c ~ '\n' -}}
321
- {%- endif %}
322
- {%- endif %}
323
- {%- else %}
324
- {{- "<think></think>" -}}
325
- {%- endif %}
326
- {%- for tool_call in message.tool_calls %}
327
- {%- if tool_call.function is defined %}
328
- {%- set tool_call = tool_call.function %}
329
- {%- endif %}
330
- {{- '<tool_call>\n<function=' ~ tool_call.name ~ '>\n' -}}
331
- {%- if tool_call.arguments is defined %}
332
- {%- for args_name, args_value in tool_call.arguments|items %}
333
- {{- '<parameter=' ~ args_name ~ '>\n' -}}
334
- {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
335
- {{- args_value ~ '\n</parameter>\n' -}}
336
- {%- endfor %}
337
- {%- endif %}
338
- {{- '</function>\n</tool_call>\n' -}}
339
- {%- endfor %}
340
- {{- '<|im_end|>\n' }}
341
- {%- else %}
342
- {# Assistant message doesn't have tool calls. #}
343
- {%- if not (truncate_history_thinking and loop.index0 < ns.last_user_idx) %}
344
- {{- '<|im_start|>assistant\n' ~ (content | default('', true) | string | trim) ~ '<|im_end|>\n' }}
345
- {%- else %}
346
- {%- set c = (content | default('', true) | string) %}
347
- {%- if '<think>' in c and '</think>' in c %}
348
- {%- set c = "<think></think>" ~ c.split('</think>')[-1] %}
349
- {%- endif %}
350
- {%- set c = c | trim %}
351
- {%- if c | length > 0 %}
352
- {{- '<|im_start|>assistant\n' ~ c ~ '<|im_end|>\n' }}
353
- {%- else %}
354
- {{- '<|im_start|>assistant\n<|im_end|>\n' }}
355
- {%- endif %}
356
- {%- endif %}
357
- {%- endif %}
358
- {%- elif message.role == "user" or message.role == "system" %}
359
- {{- '<|im_start|>' + message.role + '\n' }}
360
- {%- set content = message.content | string %}
361
- {{- content }}
362
- {{- '<|im_end|>\n' }}
363
- {%- elif message.role == "tool" %}
364
- {%- if loop.previtem and loop.previtem.role != "tool" %}
365
- {{- '<|im_start|>user\n' }}
366
- {%- endif %}
367
- {{- '<tool_response>\n' }}
368
- {{- message.content }}
369
- {{- '\n</tool_response>\n' }}
370
- {%- if not loop.last and loop.nextitem.role != "tool" %}
371
- {{- '<|im_end|>\n' }}
372
- {%- elif loop.last %}
373
- {{- '<|im_end|>\n' }}
374
- {%- endif %}
375
- {%- else %}
376
- {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>\n' }}
377
- {%- endif %}
378
- {%- endfor %}
379
-
380
- {%- if add_generation_prompt %}
381
- {%- if enable_thinking %}
382
- {{- '<|im_start|>assistant\n<think>\n' }}
383
- {%- else %}
384
- {{- '<|im_start|>assistant\n<think></think>' }}
385
- {%- endif %}
386
- {%- endif %}
387
 
 
 
388
  ```
389
 
 
 
390
  ##
391
 
392
  ## Training, Testing, and Evaluation Datasets
@@ -399,8 +265,7 @@ We follow the jinja chat template provided below.
399
  * Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
400
  * Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
401
 
402
-
403
- **Properties:** The post-training corpus for NVIDIA-Nemotron-Nano-4B-v2.1 consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracies. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, Qwen 2.5 72B.
404
 
405
  More details on the datasets and synthetic data generation methods can be found in the technical report [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf).
406
 
@@ -527,7 +392,7 @@ The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API.
527
  | Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | [AQUA-RAT](https://huggingface.co/datasets/deepmind/aqua_rat); [LogiQA](https://huggingface.co/datasets/lucasmccabe/logiqa); [AR-LSAT](https://github.com/zhongwanjun/AR-LSAT) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
528
  | Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | [Art of Problem Solving](https://artofproblemsolving.com/company); [American Mathematics Competitions 8](https://artofproblemsolving.com/wiki/index.php/AMC_8_Problems_and_Solutions); [American Mathematics Competitions 10](https://artofproblemsolving.com/wiki/index.php/AMC_10_Problems_and_Solutions); [GSM8K](https://github.com/openai/grade-school-math); [PRM800K](https://github.com/openai/prm800k) | [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct); [Qwen2.5-Math-72B](https://huggingface.co/Qwen/Qwen2.5-Math-72B); [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B); [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
529
  | Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | [MMLU Auxiliary Train](https://huggingface.co/datasets/cais/mmlu/viewer/all/auxiliary_train) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
530
- | Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | [arXiv](https://info.arxiv.org/help/bulk_data/index.html); [National Institutes of Health ExPorter](https://www.nih.gov/); [BioRxiv](https://www.biorxiv.org/tdm); [PMC Article](https://pmc.ncbi.nlm.nih.gov/tools/textmining/); [USPTO Backgrounds](https://data.uspto.gov/apis/transition-guide/bdss#pats); [peS2o](https://huggingface.co/datasets/allenai/peS2o); Global Regulation; [CORE](https://core.ac.uk/documentation/dataset); [PG-19](https://github.com/google-deepmind/pg19); [DOAB CC BY & CC BY-SA subset](https://www.doabooks.org/en); [NDLTD](https://ndltd.org/thesis-resources/global-etd-search/) | [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
531
  | Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B); [Mistral-NeMo-12B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct) |
532
  | Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
533
  | Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | [Wikimedia](https://dumps.wikimedia.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
@@ -618,36 +483,50 @@ The GitHub Crawl was collected using the GitHub REST API and the Amazon S3 API.
618
 
619
  We evaluated our model in **Reasoning-On** mode across these benchmarks.
620
 
621
- | Benchmark | NVIDIA-Nemotron-Nano-v2.1-4B |
622
- | :---- | ----- |
623
- | AIME25 | X |
624
- | MATH500 | X |
625
- | GPQA | X |
626
- | LCB | X |
627
- | BFCL v3 | X |
628
- | IFEVAL-Prompt | X |
629
- | IFEVAL-Instruction | X |
 
 
 
630
 
631
  We also evaluated our model in **Reasoning-off** mode across these benchmarks.
632
 
633
- | Benchmark | NVIDIA-Nemotron-Nano-v2.1-4B |
634
  | :---- | ----- |
635
- | BFCL v3 | X |
636
- | IFEVAL-Prompt | X |
637
- | IFEVAL-Instruction | X |
638
- | Orak | X |
 
 
639
 
640
  All evaluations were done using [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills/tree/main/docs) and [Orak](https://github.com/krafton-ai/Orak). For Orak, we evaluated on three games (Super Mario, Darkest Dungeon, and Stardew Valley).
641
 
642
  ## Inference
643
 
644
- - ## Engines: HF, vLLM, llama-cpp
645
-
646
- - ## Test Hardware NVIDIA GeForce RTX, H100 80GB,
647
 
648
  ## Ethical Considerations
649
 
650
- NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
 
651
 
652
- Please report security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
653
 
 
 
1
+ # NVIDIA-Nemotron-3-Nano-4B-BF16
 
 
2
 
3
  **Model Developer:** NVIDIA Corporation
4
 
 
14
 
15
  ## Model Overview
16
 
17
+ NVIDIA-Nemotron-3-Nano-4B-BF16 is a small language model (SLM) trained from scratch by NVIDIA, and designed as a unified model for both reasoning and non-reasoning tasks. It responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model's reasoning capabilities can be controlled via a system prompt. If the user prefers the model to provide its final answer without intermediate reasoning traces, it can be configured to do so, albeit with a slight decrease in accuracy for harder prompts that require reasoning. Conversely, allowing the model to generate reasoning traces first generally results in higher-quality final solutions to queries and tasks.
18
 
19
+ The model has been compressed from NVIDIA-Nemotron-Nano-9B-v2 using the [Nemotron Elastic](https://arxiv.org/pdf/2511.16664) framework. Details of the parent model NVIDIA-Nemotron-Nano-9B-v2 can be found in the [Nemotron-H tech report](https://arxiv.org/abs/2504.03624). The model uses a hybrid architecture consisting primarily of Mamba-2 and MLP layers combined with just four Attention layers.
20
 
21
  The supported languages include: English. Improved using Qwen.
22
 
23
+ This model is ready for commercial use.
24
 
25
+ ## License/Terms of Use
26
+
27
+ Governing Terms: Use of this model is governed by the [NVIDIA Nemotron Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/).
28
 
29
  ### Deployment Geography: Global
30
 
31
  ### Use Case
32
 
33
+ NVIDIA-Nemotron-3-Nano-4B is an edge-ready small language model intended for Agentic AI on edge platforms (Jetson Thor, GeForce RTX, DGX Spark). It targets key uses including AI gaming NPCs (teammates/companions), local voice assistants (for devices, apps, and games), and IoT automation. It is intended for use in English and common programming languages.
34
 
35
+ ### Release Date: 3/16/2026
36
 
37
+ Hugging Face: 3/16/2026 via [https://huggingface.co/](https://huggingface.co/)
 
38
 
39
  ## References
40
 
41
+ - [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf)
42
+ - [Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs](https://arxiv.org/abs/2511.16664)
43
+ - [NVIDIA Nemotron 3: Efficient and Open Intelligence](https://arxiv.org/abs/2512.20856)
44
+ - [Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning](https://arxiv.org/abs/2512.20848)
45
+ - [Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)
46
 
47
  ## Model Architecture
48
 
49
  - Architecture Type: Mamba2-Transformer Hybrid
50
+ - Network Architecture: Nemotron-Hybrid
51
+ - This model was compressed from [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2)
52
  - Number of model parameters: 3.97 x 10^9
53
 
 
54
  ## Input
55
 
56
  - Input Type(s): Text
 
62
 
63
  - Output Type(s): Text
64
  - Output Format: String
65
+ - Output Parameters: One-Dimensional (1D): Sequences
66
  - Other Properties Related to Output: Sequences up to 262K
67
 
68
  Our models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
 
70
  ## Software Integration
71
 
72
  - Runtime Engine(s): NeMo 25.07
73
+ - Supported Hardware Microarchitecture Compatibility: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100, GeForce RTX
74
  - Operating System(s): Linux
75
 
76
  The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
 
84
  from transformers import AutoTokenizer, AutoModelForCausalLM
85
 
86
  # Load tokenizer and model
87
+ tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-4B")
88
  model = AutoModelForCausalLM.from_pretrained(
89
+ "nvidia/NVIDIA-Nemotron-3-Nano-4B",
90
  torch_dtype=torch.bfloat16,
91
  trust_remote_code=True,
92
  device_map="auto"
 
113
  print(tokenizer.decode(outputs[0]))
114
  ```
115
 
116
+ `temperature=1.0` and `top_p=0.95` are recommended for reasoning tasks, while `temperature=0.6` and `top_p=0.95` are recommended for tool calling.
117
 
118
  If you’d like to turn reasoning off, add `enable_thinking=False` to `apply_chat_template()`. By default, `enable_thinking` is set to True.
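The sampling and reasoning-toggle recommendations above can be bundled into a small helper. This is an illustrative sketch, not part of the model's API: the helper and the mode names ("reasoning", "tool_calling") are our own, while the numeric values are the ones recommended in this card.

```python
# Illustrative helper (not an official API): bundles the settings
# recommended in this card for the two task modes.
def chat_settings(mode: str, reasoning_on: bool = True) -> dict:
    sampling = {
        "reasoning": {"temperature": 1.0, "top_p": 0.95},
        "tool_calling": {"temperature": 0.6, "top_p": 0.95},
    }
    if mode not in sampling:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        # pass to model.generate(...)
        "generate_kwargs": {**sampling[mode], "do_sample": True},
        # pass to tokenizer.apply_chat_template(...)
        "template_kwargs": {"enable_thinking": reasoning_on},
    }
```

For example, `chat_settings("tool_calling", reasoning_on=False)` selects the tool-calling sampling values and disables the reasoning trace.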
119
 
 
140
 
141
  ### **Use it with vLLM**
142
 
143
+ This model requires vllm>=0.15.1. If you are on Jetson Thor or DGX Spark, please use [this vLLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?version=26.02-py3) (the 26.01 container has also been verified).
144
+
145
+ **(Docker launch command is missing)**
146
 
147
  ```
148
+ pip install -U "vllm>=0.15.1"
149
  ```
150
 
151
  Download the custom parser from the Hugging Face repository.
152
 
153
  ```
154
+ wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/resolve/main/nano_v3_reasoning_parser.py
155
  ```
156
 
157
+ Launch a vLLM server using the custom parser.
158
 
159
  ```
160
+ vllm serve nvidia/NVIDIA-Nemotron-3-Nano-4B \
161
+ --served-model-name nemotron3-nano-4B \
162
  --max-num-seqs 8 \
163
  --tensor-parallel-size 1 \
164
  --max-model-len 262144 \
165
  --port 8000 \
166
  --trust-remote-code \
167
+ --mamba_ssm_cache_dtype float32 \
168
  --enable-auto-tool-choice \
169
  --tool-call-parser qwen3_coder \
170
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
171
  --reasoning-parser nano_v3
172
  ```
173
 
174
+ Access the hosted API using a Python client.
175
 
176
+ ```py
177
 
178
+ from openai import OpenAI
179
181
+
182
+ # NOTE: Streaming is preferred for better performance and resource efficiency.
183
+ # It allows you to start processing responses as they arrive, reducing latency.
184
+
185
+ # Synchronous example (non-streaming)
186
+ client = OpenAI(
187
+ api_key="your-nvapikey",
188
+ base_url="base-url"
189
+ )
190
+
191
+ response = client.chat.completions.create(
192
+ model="nemotron3-nano-4B",
193
+ messages=[
194
+ {
195
+ "role": "user",
196
+ "content": "Hello!"
197
+ }
198
+ ],
199
+ temperature=0.7,
200
+ max_tokens=256,
201
+ top_p=0.7,
202
+ stream=False
203
+ )
204
 
205
+ print(response.choices[0].message.content)
206
 
207
+ ```
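The comment in the client example above notes that streaming is preferred. A minimal sketch of the streaming variant follows; the `collect_stream` helper is our own illustration and simply follows the standard OpenAI streaming chunk shape (`chunk.choices[0].delta.content`).

```python
def collect_stream(chunks) -> str:
    # Accumulate the text deltas of a streamed chat completion
    # into the full response string.
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # deltas can be None (e.g. role-only chunks)
            parts.append(delta)
    return "".join(parts)

# Against a live server, reuse the client settings from the example above:
#
#   stream = client.chat.completions.create(
#       model="nemotron3-nano-4B",
#       messages=[{"role": "user", "content": "Hello!"}],
#       stream=True,  # process tokens as they arrive
#   )
#   print(collect_stream(stream))
```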
208
 
209
+ ### Use it with TRT-LLM
210
+
211
+ Launch the model using TRT-LLM
212
+
213
+ ```shell
214
+ docker run -v /home/root/.cache/huggingface/:/root/.cache/huggingface/ --rm --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --ipc=host --network host -d -e MODEL=$MODEL -e HF_TOKEN=$HF_TOKEN nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6 bash -c '
215
+ cat > /tmp/extra-llm-api-config.yml <<EOF
216
+ kv_cache_config:
217
+ dtype: "auto"
218
+ enable_block_reuse: false
219
+ cuda_graph_config:
220
+ max_batch_size: 32
221
+ enable_padding: true
222
+ disable_overlap_scheduler: true
223
+ moe_config:
224
+ backend: CUTLASS
225
+ EOF
226
+
227
+ trtllm-serve \
228
+ $MODEL \
229
+ --host 0.0.0.0 \
230
+ --port 8123 \
231
+ --max_batch_size 32 \
232
+ --extra_llm_api_options /tmp/extra-llm-api-config.yml '
233
  ```
 
 
234
 
235
+ Access the hosted endpoint using a `curl` command.
236
+
237
+ ```shell
238
+ curl http://localhost:8123/v1/chat/completions -H "Content-Type: application/json" -d '{
239
+ "model": "'"$MODEL"'",
240
+ "messages": [
241
+ {
242
+ "role": "user",
243
+ "content": "Where is New York?"
244
+ }
245
+ ],
246
+ "max_tokens": 1024,
247
+ "top_p": 1.0
248
+ }' -w "\n"
249
  ```
250
 
251
+
252
+ ## Model Version
253
+
254
+ - v1.0
255
+
256
  ##
257
 
258
  ## Training, Testing, and Evaluation Datasets
 
265
  * Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
266
  * Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
267
 
268
+ **Properties:** The post-training corpus for NVIDIA-Nemotron-3-Nano-4B consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese). Our sources cover a variety of document types such as webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering and alignment-style data to improve model accuracy. For several of the domains listed above, we used synthetic data, specifically reasoning traces, from DeepSeek R1/R1-0528, Qwen3-235B-A22B, Nemotron 4 340B, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct, and Qwen 2.5 72B.
 
269
 
270
  More details on the datasets and synthetic data generation methods can be found in the technical report [NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model](https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf).
271
 
 
| Synthetic AGIEval seeded with AQUA-RAT, LogiQA, and AR-LSAT from Qwen3-30B-A3B | Text | 4.2B | [AQUA-RAT](https://huggingface.co/datasets/deepmind/aqua_rat); [LogiQA](https://huggingface.co/datasets/lucasmccabe/logiqa); [AR-LSAT](https://github.com/zhongwanjun/AR-LSAT) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| Synthetic Art of Problem Solving from Qwen2.5-32B-Instruct, Qwen2.5-Math-72B, Qwen2.5-Math-7B, and Qwen2.5-72B-Instruct | Text | 83.1B | [Art of Problem Solving](https://artofproblemsolving.com/company); [American Mathematics Competitions 8](https://artofproblemsolving.com/wiki/index.php/AMC_8_Problems_and_Solutions); [American Mathematics Competitions 10](https://artofproblemsolving.com/wiki/index.php/AMC_10_Problems_and_Solutions); [GSM8K](https://github.com/openai/grade-school-math); [PRM800K](https://github.com/openai/prm800k) | [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct); [Qwen2.5-Math-72B](https://huggingface.co/Qwen/Qwen2.5-Math-72B); [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B); [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
| Synthetic MMLU Auxiliary Train from DeepSeek-R1 | Text | 0.5B | [MMLU Auxiliary Train](https://huggingface.co/datasets/cais/mmlu/viewer/all/auxiliary_train) | [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Synthetic Long Context Continued Post-Training Data from Papers and Permissible Books from Qwen2.5-72B-Instruct | Text | 5.4B | [arXiv](https://info.arxiv.org/help/bulk_data/index.html); [National Institutes of Health ExPorter](https://www.nih.gov/); [BioRxiv](https://www.biorxiv.org/tdm); [PMC Article](https://pmc.ncbi.nlm.nih.gov/tools/textmining/); [USPTO Backgrounds](https://data.uspto.gov/apis/transition-guide/bdss#pats); [peS2o](https://huggingface.co/datasets/allenai/peS2o); Global Regulation; [CORE](https://core.ac.uk/documentation/dataset); [PG-19](https://github.com/google-deepmind/pg19); [DOAB CC BY & CC BY-SA subset](https://www.doabooks.org/en); [NDLTD](https://ndltd.org/thesis-resources/global-etd-search/) | [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
| Synthetic Common Crawl from Qwen3-30B-A3B and Mistral-Nemo-12B-Instruct | Text | 1.949T | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B); [Mistral-NeMo-12B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct) |
| Synthetic Multilingual Data from Common Crawl from Qwen3-30B-A3B | Text | 997.3B | [Common Crawl](https://commoncrawl.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
| Synthetic Multilingual Data from Wikimedia from Qwen3-30B-A3B | Text | 55.1B | [Wikimedia](https://dumps.wikimedia.org/) | [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
 
We evaluated our model in **Reasoning-On** mode across these benchmarks:

| Benchmark | NVIDIA-Nemotron-3-Nano-4B |
| :---- | :---: |
| AIME25 | 78.5 |
| MATH500 | 95.4 |
| GPQA | 53.2 |
| LCB | 51.8 |
| BFCL v3 | 61.1 |
| IFEval-Prompt | 87.9 |
| IFEval-Instruction | 92.0 |
| Tau2-Airline | 33.3 |
| Tau2-Retail | 39.8 |
| Tau2-Telecom | 33.0 |

We also evaluated our model in **Reasoning-Off** mode across these benchmarks:

| Benchmark | NVIDIA-Nemotron-3-Nano-4B |
| :---- | :---: |
| BFCL v3 | 61.1 |
| IFBench-Prompt | 43.2 |
| IFBench-Instruction | 44.2 |
| Orak | 22.9 |
| IFEval-Prompt | 82.8 |
| IFEval-Instruction | 88.0 |
| HaluEval | 62.2 |
| RULER (128k) | 91.1 |
| Tau2-Airline | 28.0 |
| Tau2-Retail | 34.8 |
| Tau2-Telecom | 24.9 |
| EQ-Bench3 | 63.2 |
 
All evaluations were done using [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills/tree/main/docs) and [Orak](https://github.com/krafton-ai/Orak). For Orak, we evaluated on three games (Super Mario, Darkest Dungeon, and Stardew Valley).

## Inference

- Engines: HF, vLLM, llama.cpp
- Test Hardware: NVIDIA GeForce RTX, H100 80GB, DGX Spark, Jetson Thor
 
## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our [Trustworthy AI terms of service](https://www.nvidia.com/en-us/agreements/trustworthy-ai/terms/), developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

We advise against circumvention of any provided safety guardrails contained in the Model without a substantially similar guardrail appropriate for your use case. For more details, see the [Safety](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/safety.md) and [Explainability](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/explainability.md) Subcards.

For more detailed information on ethical considerations for this model, please see the Model Card++ [Bias](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/bias.md) and [Privacy](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/privacy.md) Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).