Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled — NV-FP4 MLP + FP8 KV Cache

Quantized version of Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled — a reasoning-focused model distilled from Claude 4.6 Opus.

Quantized using NVIDIA ModelOpt with NV-FP4 weights for MLP layers and FP8 static calibration for KV cache. The visual encoder (VLM components) and LM head are kept in original precision.


Quantization Details

| Parameter | Value |
|---|---|
| Method | NVIDIA ModelOpt PTQ (post-training quantization) |
| MLP weights | NV-FP4 (E2M1, 4-bit), block size 16, dynamic scales |
| KV cache | FP8 static calibration, per-tensor |
| Visual encoder | Not quantized (kept in bf16) |
| LM head | Not quantized |
| Calibration dataset | nvidia/nemotron-post-training-dataset-v2 |
| Calibration samples | 2048 |
| Calibration sequence length | 3072 tokens |
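The recipe above can be sketched as a ModelOpt-style quantization config. The dict below mimics ModelOpt's `quant_cfg` conventions, but the exact keys and wildcard patterns are illustrative assumptions, not the config actually used for this checkpoint:

```python
# Illustrative sketch of the quantization recipe described above.
# Key names follow NVIDIA ModelOpt's quant_cfg style; this exact dict
# is an assumption for illustration, not the config actually used.
NVFP4_MLP_FP8_KV_CFG = {
    "quant_cfg": {
        # NV-FP4 (E2M1: 2 exponent bits, 1 mantissa bit) on MLP weights,
        # quantized in 16-element blocks along the last axis
        "*mlp*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        # FP8 (E4M3) static per-tensor quantization of the KV cache
        "*k_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        "*v_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        # Modules kept in original precision
        "*visual*": {"enable": False},
        "*lm_head*": {"enable": False},
        "*linear_attn*": {"enable": False},
    },
    "algorithm": "max",  # static max calibration over the calibration set
}
```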

What was quantized and what was not

  • Quantized: language_model.layers.*.mlp.* (gate/up/down projections) — NV-FP4
  • Quantized: self_attn.k/v_bmm_quantizer on standard attention layers — FP8 KV cache
  • Skipped: model.visual.* — visual encoder kept at bf16 (too sensitive to quantization)
  • Skipped: linear_attn (Mamba-style) layers — non-standard architecture, excluded automatically
  • Skipped: lm_head, mtp.layers.0 — standard practice, excluded automatically
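The include/skip rules above can be expressed as a small module-name filter. The patterns below simply mirror the list; this is a sketch for clarity, not ModelOpt's actual matching logic:

```python
from fnmatch import fnmatch

# Patterns mirroring the skip/quantize lists above (illustrative only;
# ModelOpt's real matcher works on module objects, not bare names).
SKIP_PATTERNS = ("model.visual.*", "*linear_attn*", "lm_head*", "mtp.layers.0*")
QUANT_PATTERNS = ("language_model.layers.*.mlp.*",)

def is_quantized(module_name: str) -> bool:
    """Return True if a module's weights would be quantized to NV-FP4."""
    if any(fnmatch(module_name, p) for p in SKIP_PATTERNS):
        return False
    return any(fnmatch(module_name, p) for p in QUANT_PATTERNS)
```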

Hardware Requirements

| Component | Recommended |
|---|---|
| GPU VRAM (inference) | 6–8 GB (vs. ~18 GB for the bf16 original) |
| GPU architecture | NVIDIA Blackwell (RTX 50xx, B100/B200) for native FP4; Ada Lovelace / Hopper (RTX 40xx, H100) via fallback |
| CUDA | 12.1+ |

Note: NV-FP4 relies on native FP4 tensor cores, introduced with the NVIDIA Blackwell architecture (RTX 50xx, B100/B200). On Ada Lovelace and Hopper GPUs (RTX 40xx, H100) the checkpoint still loads and runs, but through FP8/FP16 fallback kernels: you keep the memory savings without the FP4 tensor-core speedups.
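A rough capability check can be written as a pure function. The cutoff below assumes native FP4 tensor cores first appear at compute capability 10.0 (Blackwell); verify this against your inference stack's documentation:

```python
def fp4_support_tier(major: int, minor: int) -> str:
    """Rough mapping from CUDA compute capability to NV-FP4 support.

    Assumption (verify for your stack): native FP4 tensor cores appear
    with Blackwell (sm_100+); Ada (sm_89) and Hopper (sm_90) run this
    checkpoint through higher-precision fallback kernels.
    """
    cc = major * 10 + minor
    if cc >= 100:
        return "native-fp4"
    if cc >= 89:
        return "fp8/fp16-fallback"
    return "unsupported"

# At runtime, feed in torch.cuda.get_device_capability() on the target GPU.
```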


Quality Validation

A single medical reasoning sample was run before and after PTQ as a spot check. The model produced the correct answer (option E) in both runs with equivalent explanation quality, suggesting minimal degradation from quantization.

Input:  During a Toupet fundoplication, a 270-degree posterior wrap is performed...
Before: Correct answer E — full explanation preserved
After:  Correct answer E — equivalent reasoning quality

Usage

TensorRT-LLM

# Build engine
trtllm-build \
  --checkpoint_dir ./Qwen3.5-9B-Claude-Opus-nvfp4 \
  --output_dir ./engine \
  --gemm_plugin nvfp4 \
  --kv_cache_type fp8 \
  --max_input_len 8192 \
  --max_output_len 4096

# Run inference
python examples/run.py \
  --engine_dir ./engine \
  --max_output_len 2048 \
  --input_text "Let me analyze this step by step:"

vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="YOUR_HF_USERNAME/Qwen3.5-9B-Claude-Opus-nvfp4-fp8kv",
    quantization="modelopt",
    dtype="auto",
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=4096,
)

outputs = llm.generate(
    ["<|im_start|>user\nExplain quantum entanglement step by step.<|im_end|>\n<|im_start|>assistant\n"],
    sampling_params,
)
print(outputs[0].outputs[0].text)

SGLang

import sglang as sgl

@sgl.function
def reasoning_chain(s, question):
    s += sgl.user(question)
    s += sgl.assistant(
        sgl.gen("answer", max_new_tokens=4096, temperature=0.6)
    )

runtime = sgl.Runtime(
    model_path="YOUR_HF_USERNAME/Qwen3.5-9B-Claude-Opus-nvfp4-fp8kv",
    quantization="modelopt-fp4",
)
sgl.set_default_backend(runtime)

state = reasoning_chain.run(
    question="Solve this step by step: what is the derivative of x^3 * sin(x)?"
)
print(state["answer"])

Prompt Format

This model uses the ChatML format inherited from Qwen3.5, with mandatory <think> blocks:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{internal step-by-step reasoning}
</think>
{final answer}

The model is trained to always reason inside <think> tags before producing a final response. This structured reasoning pattern is the core capability distilled from Claude 4.6 Opus.
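The template above can be assembled and post-processed with a small helper. The tag strings follow the format shown; the function names themselves are just for illustration:

```python
import re

def build_chatml_prompt(question: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Assemble the ChatML prompt shown above, leaving the assistant turn open."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

def strip_think(completion: str) -> str:
    """Drop the <think>...</think> block, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>\s*", "", completion, flags=re.DOTALL).strip()
```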


About the Base Model

Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled is a supervised fine-tune of Qwen3.5-9B using Chain-of-Thought distillation data from Claude 4.6 Opus interactions.

Key properties of the base model:

  • Architecture: Qwen3.5-9B dense + visual encoder (VLM)
  • Fine-tuning: SFT with LoRA via Unsloth, response-only training on <think> sequences
  • Training data: nohurry/Opus-4.6-Reasoning-3000x-filtered + TeichAI/claude-4.5-opus-high-reasoning-250x
  • Context window: 16,384 tokens
  • Training loss: converged from 0.730 → 0.187
  • License: Apache 2.0

Limitations

  • Inherits all limitations of the base model including potential hallucinations on factual queries
  • NV-FP4 quantization may reduce accuracy on numerically sensitive tasks (e.g. multi-step arithmetic in complex math)
  • KV cache FP8 calibrated at seq length 3072 — may have slightly reduced accuracy on very long contexts (16K+)
  • Best suited for: reasoning, coding, analysis, structured problem-solving
