# Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled — NV-FP4 MLP + FP8 KV Cache
Quantized version of Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled — a reasoning-focused model distilled from Claude 4.6 Opus.
Quantized using NVIDIA ModelOpt with NV-FP4 weights for MLP layers and FP8 static calibration for KV cache. The visual encoder (VLM components) and LM head are kept in original precision.
## Quantization Details
| Parameter | Value |
|---|---|
| Method | NVIDIA ModelOpt PTQ (Post-Training Quantization) |
| MLP weights | NV-FP4 (FP4 E2M1: 2 exponent bits, 1 mantissa bit), block size 16, dynamic scale |
| KV Cache | FP8 static calibration, per-tensor |
| Visual encoder | Not quantized (kept in bf16) |
| LM head | Not quantized |
| Calibration dataset | nvidia/nemotron-post-training-dataset-v2 |
| Calibration samples | 2048 |
| Calibration seq length | 3072 tokens |
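The NVFP4 scheme in the table can be illustrated with a minimal sketch: each block of 16 weights shares one scale, and each weight is rounded to the nearest representable FP4 (E2M1) magnitude. This is an illustration of the format only, not the ModelOpt implementation (which typically also applies a second-level per-tensor scale and stores block scales in FP8):

```python
# Positive magnitudes representable in FP4 E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, block_size=16):
    """Quantize one block of floats to FP4 values with a shared per-block scale.

    Illustrative sketch: maps the block's largest magnitude onto E2M1's
    largest value (6.0), then rounds each weight to the nearest level.
    """
    assert len(block) == block_size
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    out = []
    for x in block:
        level = min(E2M1_LEVELS, key=lambda l: abs(abs(x) / scale - l))
        out.append(level * scale if x >= 0 else -level * scale)
    return out, scale
```

Note how coarse the grid is: only 8 magnitudes per sign, which is why a small block size (16) and per-block scales are essential to keep error low.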
## What was quantized and what was not

- ✅ Quantized: `language_model.layers.*.mlp.*` (gate/up/down projections) — NV-FP4
- ✅ Quantized: `self_attn.k/v_bmm_quantizer` on standard attention layers — FP8 KV cache
- ❌ Skipped: `model.visual.*` — visual encoder kept at bf16 (too sensitive to quantization)
- ❌ Skipped: `linear_attn` (Mamba-style) layers — non-standard architecture, excluded automatically
- ❌ Skipped: `lm_head`, `mtp.layers.0` — standard practice, excluded automatically
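The include/skip rules above can be expressed as a small pattern filter. This is a hypothetical sketch (the patterns come from this list; the helper is not part of ModelOpt):

```python
import fnmatch

# Glob patterns taken from the quantization list above (illustrative).
QUANT_PATTERNS = ["language_model.layers.*.mlp.*"]
SKIP_PATTERNS = ["model.visual.*", "*linear_attn*", "lm_head*", "mtp.layers.0*"]

def quant_mode(module_name):
    """Decide how a module is handled: skip, NV-FP4, or left in bf16."""
    if any(fnmatch.fnmatch(module_name, p) for p in SKIP_PATTERNS):
        return "skip"
    if any(fnmatch.fnmatch(module_name, p) for p in QUANT_PATTERNS):
        return "nvfp4"
    return "bf16"
```

Skip patterns are checked first, so sensitive modules are excluded even if a quantize pattern would otherwise match them.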
## Hardware Requirements
| Component | Recommended |
|---|---|
| GPU VRAM (inference) | 6–8 GB (vs ~18 GB for bf16 original) |
| GPU architecture | NVIDIA Blackwell for native FP4 (RTX 50xx, B200); Hopper/Ada via fallback (H100, RTX 40xx) |
| CUDA | 12.1+ |
Note: NV-FP4 relies on FP4 tensor cores, which are native to NVIDIA Blackwell GPUs (RTX 50xx series, B100/B200). Hopper and Ada Lovelace GPUs (H100, RTX 40xx) lack FP4 tensor cores; on those architectures the weights fall back to FP8 or FP16 execution, so the model still runs but without the full FP4 speedup.
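For intuition, the storage cost of the NV-FP4 MLP weights versus bf16 can be estimated with simple arithmetic, assuming one FP8 scale byte per 16-weight block (per-tensor scale overhead is negligible and ignored here):

```python
def nvfp4_bytes_per_weight(block_size=16, scale_bytes=1):
    """Effective bytes per weight: 4-bit value plus amortized per-block scale."""
    return 0.5 + scale_bytes / block_size

# Compression ratio of the MLP weights relative to bf16 (2 bytes/weight).
ratio = 2.0 / nvfp4_bytes_per_weight()  # ≈ 3.56×
```

The roughly 3.5× shrinkage of the MLP weights, combined with the FP8 KV cache, is what drives the VRAM reduction in the table; the exact end-to-end figure depends on how much of the model (visual encoder, attention, embeddings) stays in bf16.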
## Quality Validation

A medical reasoning sample was tested before and after PTQ. The model produced the correct answer (option E) in both cases with equivalent explanation quality — a single-sample spot check suggesting minimal degradation from quantization, not a full benchmark.
- **Input:** During a Toupet fundoplication, a 270-degree posterior wrap is performed...
- **Before PTQ:** Correct answer E — full explanation preserved
- **After PTQ:** Correct answer E — equivalent reasoning quality
## Usage

### TensorRT-LLM

```bash
# Build engine
trtllm-build \
    --checkpoint_dir ./Qwen3.5-9B-Claude-Opus-nvfp4 \
    --output_dir ./engine \
    --gemm_plugin nvfp4 \
    --kv_cache_type fp8 \
    --max_input_len 8192 \
    --max_output_len 4096

# Run inference
python examples/run.py \
    --engine_dir ./engine \
    --max_output_len 2048 \
    --input_text "Let me analyze this step by step:"
```
### vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="YOUR_HF_USERNAME/Qwen3.5-9B-Claude-Opus-nvfp4-fp8kv",
    quantization="modelopt",
    dtype="auto",
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=4096,
)

outputs = llm.generate(
    ["<|im_start|>user\nExplain quantum entanglement step by step.<|im_end|>\n<|im_start|>assistant\n"],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
### SGLang

```python
import sglang as sgl

@sgl.function
def reasoning_chain(s, question):
    s += sgl.user(question)
    s += sgl.assistant(
        sgl.gen("answer", max_new_tokens=4096, temperature=0.6)
    )

runtime = sgl.Runtime(
    model_path="YOUR_HF_USERNAME/Qwen3.5-9B-Claude-Opus-nvfp4-fp8kv",
    quantization="modelopt-fp4",
)
sgl.set_default_backend(runtime)

state = reasoning_chain.run(
    question="Solve this step by step: what is the derivative of x^3 * sin(x)?"
)
print(state["answer"])
```
## Prompt Format

This model uses the ChatML format inherited from Qwen3.5, with mandatory `<think>` blocks:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{internal step-by-step reasoning}
</think>
{final answer}
```

The model is trained to always reason inside `<think>` tags before producing a final response. This structured reasoning pattern is the core capability distilled from Claude 4.6 Opus.
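A minimal sketch of assembling this prompt and splitting a completion into reasoning and final answer (helper names are illustrative, not part of any library):

```python
import re

def build_prompt(question, system="You are a helpful assistant."):
    """Assemble a ChatML prompt ending at the assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

def split_reasoning(completion):
    """Separate the <think> block from the final answer in a completion."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", completion.strip()  # no <think> block found
```

Splitting on `</think>` lets an application show only the final answer while logging the chain-of-thought separately.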
## About the Base Model
Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled is a supervised fine-tune of Qwen3.5-9B using Chain-of-Thought distillation data from Claude 4.6 Opus interactions.
Key properties of the base model:
- Architecture: Qwen3.5-9B dense + visual encoder (VLM)
- Fine-tuning: SFT with LoRA via Unsloth, response-only training on `<think>` sequences
- Training data: `nohurry/Opus-4.6-Reasoning-3000x-filtered` + `TeichAI/claude-4.5-opus-high-reasoning-250x`
- Context window: 16,384 tokens
- Training loss: converged from 0.730 → 0.187
- License: Apache 2.0
## Limitations
- Inherits all limitations of the base model including potential hallucinations on factual queries
- NV-FP4 quantization may reduce accuracy on numerically sensitive tasks (e.g. long multi-step arithmetic in complex math)
- KV cache FP8 calibrated at seq length 3072 — may have slightly reduced accuracy on very long contexts (16K+)
- Best suited for: reasoning, coding, analysis, structured problem-solving
## Acknowledgements
- Jackrong for the original fine-tuned model
- Qwen Team for the Qwen3.5 base architecture
- NVIDIA ModelOpt for the quantization toolkit
- Unsloth AI used in base model training
- Dataset authors: nohurry and TeichAI