Qwen3-8B-DMS-8x
Description:
Qwen3-8B-DMS-8x is a derivative of Qwen3-8B that integrates Dynamic Memory Sparsification (DMS) with an 8x compression ratio during inference. DMS adaptively sparsifies the KV cache to reduce its memory footprint and improve throughput and latency for long-context and reasoning generation. The method learns per-head eviction policies that interpolate between a sliding window over the last 512 tokens and full attention. Inference-time code is provided with the checkpoint.
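As a rough intuition only, the budgeted per-head keep/evict decision can be sketched as below. This is a toy, score-based stand-in rather than the learned DMS policy described in the paper; `scores`, `window`, and `keep_ratio` are illustrative assumptions:

import torch

def toy_keep_mask(scores: torch.Tensor, window: int = 512, keep_ratio: float = 1 / 8) -> torch.Tensor:
    # Toy sketch only: the real DMS policy is learned per head, not
    # score-thresholded. Always keep the sliding window over the last
    # `window` tokens; spend the remaining budget (targeting ~8x
    # compression with keep_ratio=1/8) on the highest-scoring older entries.
    seq_len = scores.shape[-1]
    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[-min(window, seq_len):] = True  # sliding-window portion
    budget = max(int(seq_len * keep_ratio) - window, 0)
    if budget > 0:
        older = scores[: seq_len - window]  # entries older than the window
        top = torch.topk(older, min(budget, older.numel())).indices
        keep[top] = True  # retained "full attention" entries, up to the budget
    return keep  # True = this head's KV entry stays in the cache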
This model is for research and development only.
License/Terms of Use:
This model is released under the NVIDIA License.
Under the NVIDIA License, NVIDIA confirms:
- Models can be used non-commercially, i.e., for non-commercial research and educational purposes only.
Deployment Geography:
Global
Use Case:
A compact, general-purpose LLM with advanced reasoning capabilities, optimized for inference-time scaling and reduced key-value (KV) cache memory footprint.
Release Date:
Hugging Face on January 19, 2026 via https://huggingface.co/nvidia/Qwen3-8B-DMS-8x
References:
- [2505.09388] Qwen3 Technical Report
- [2506.05345] Inference-Time Hyper-Scaling with KV Cache Compression
Suggested citation:
@misc{lancucki2025inferencetime,
title={Inference-Time Hyper-Scaling with KV Cache Compression},
author={Adrian Łańcucki and Konrad Staniszewski and Piotr Nawrot and Edoardo M. Ponti},
year={2025},
eprint={2506.05345},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.05345},
}
Model Architecture:
Architecture Type: Autoregressive Transformer
Network Architecture: Qwen3
Base Model: Qwen3-8B
Number of Model Parameters: 8.2B
Input:
Input Type(s): Text
Input Format: String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: The same chat template used for Qwen3-8B should be applied.
Output:
Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Native context length of 32,768 tokens; up to 131,072 tokens with YaRN
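As with the base Qwen3-8B, YaRN scaling is configured through rope_scaling. A minimal sketch follows, assuming the DMS checkpoint inherits Qwen3-8B's documented YaRN settings (factor 4.0 over the 32,768-token native length) and that the remote code honors a rope_scaling override:

from transformers import AutoConfig, AutoModelForCausalLM
import torch

config = AutoConfig.from_pretrained("nvidia/Qwen3-8B-DMS-8x", trust_remote_code=True)
# YaRN settings as documented for Qwen3-8B: 4x the 32,768-token native length
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Qwen3-8B-DMS-8x",
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)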
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s): HuggingFace Transformers
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
Preferred/Supported Operating System(s): Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.
Model Version(s):
Qwen3-8B-DMS-8x
Quick Start and Usage Recommendations:
To run the model, transformers==4.57.3, torch, and flash-attn are required:
python3 -m venv venv
source venv/bin/activate
pip3 install transformers==4.57.3
pip3 install accelerate # (optional) for device placement
pip3 install torch
pip3 install flash-attn --no-build-isolation
Model weights and the corresponding DMS-adapted Qwen3 code are available on the Hugging Face Hub: nvidia/Qwen3-8B-DMS-8x.
To download and load the model, you can use the following snippet:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Qwen3-8B-DMS-8x",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # `trust_remote_code=True` is required: without it, the stock
    # Qwen3 code (without DMS) would be loaded
    trust_remote_code=True,
)
The rest follows the standard Qwen3 usage pattern. To ask the model to solve a quadratic equation, one can use the following snippet:
conversation = [
    {"role": "user", "content": "Solve: x^2 - 2x + 1 = 0"}
]
prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
streamer = TextStreamer(tokenizer, skip_prompt=False)
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    streamer=streamer,
    max_new_tokens=2048,
)
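To recover the answer as a string without the prompt, the newly generated tokens can be decoded in the usual way:

# Strip the prompt tokens and decode only the model's continuation
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))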
Training and Evaluation Datasets:
Training Dataset:
Link: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
Data Modality: Text
Text Training Data Size: Less than a Billion Tokens
Data Collection Method by dataset: Hybrid: Human, Automated
Labeling Method by dataset: Automated
Properties:
The dataset contains 220k math problems from NuminaMath 1.5, along with several reasoning traces generated by DeepSeek-R1 for each problem, some of which were automatically verified with Math Verify and Llama 3.3 70B Instruct. After tokenization, the entire dataset contains 36M tokens. We use only the traces that are labeled as correctly verified.
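As a sketch of that filtering step, the verified traces could be selected as below. The column names (`generations`, `correctness_math_verify`) are assumptions about the published schema of OpenR1-Math-220k:

from datasets import load_dataset

ds = load_dataset("open-r1/OpenR1-Math-220k", split="train")

def keep_verified(example):
    # Pair each generated trace with its verification flag
    # and keep only the traces marked as correctly verified
    example["verified_generations"] = [
        gen
        for gen, ok in zip(example["generations"], example["correctness_math_verify"])
        if ok
    ]
    return example

ds = ds.map(keep_verified)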
Evaluation Dataset:
Data Collection Method: Hybrid: Human, Synthetic, Automated
Labeling Method: Hybrid: Human, Synthetic, Automated
Evaluation Results:
We evaluate the model using temperature=0.6 and top_p=0.95 with a sequence length limit of 131,072 tokens; a sketch of these sampling settings follows the results table.
| Benchmark | Thinking | Qwen3-8B | Qwen3-8B-DMS-8x |
|---|---|---|---|
| GPQA Diamond | y | 58.8 | 57.6 |
| MMLU-Pro | y | 74.2 | 73.5 |
| AIME 2024 | y | 75.0 | 73.0 |
| MATH-500 | y | 95.1 | 95.5 |
| HumanEval | y | 87.8 | 89.6 |
| IFEval | y | 90.3 | 88.8 |
| ArenaHard v0.1 | y | 88.4 | 89.7 |
| RULER 64K | n | 69.2 | 76.2 |
| RULER 128K | n | 25.0 | 21.4 |
AIME 2024 results were averaged over 10 runs (different seeds) and MATH-500 over 3; MMLU-Pro uses micro-averaging.
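As a minimal sketch, these sampling settings map onto the generation call from the quick-start snippet as follows (max_new_tokens here is illustrative; evaluations were bounded by the 131,072-token sequence limit):

generated_ids = model.generate(
    **model_inputs,
    do_sample=True,       # sample rather than greedy-decode
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=32768,  # illustrative cap; see the sequence limit above
)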
Inference:
Acceleration Engine: HuggingFace Transformers
Test Hardware: H100 PCIe/SXM
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities, or NVIDIA AI Concerns here.