Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled — NV-FP4 MLP + FP8 KV Cache

Quantized version of Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled — a reasoning-focused model distilled from Claude 4.6 Opus.

Quantized using NVIDIA ModelOpt with NV-FP4 weights for MLP layers and FP8 static calibration for KV cache. The visual encoder (VLM components) and LM head are kept in original precision.


Quantization Details

| Parameter | Value |
|---|---|
| Method | NVIDIA ModelOpt PTQ (post-training quantization) |
| MLP weights | NV-FP4 (E2M1, 4-bit), block size 16, dynamic scales |
| KV cache | FP8 static calibration, per-tensor |
| Visual encoder | Not quantized (kept in bf16) |
| LM head | Not quantized |
| Calibration dataset | nvidia/nemotron-post-training-dataset-v2 |
| Calibration samples | 2048 |
| Calibration sequence length | 3072 tokens |
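The recipe above can be sketched as a ModelOpt-style quantization config. The dict below mimics ModelOpt's `quant_cfg` conventions, but the exact keys and wildcard patterns are illustrative assumptions, not the config actually used for this checkpoint:

```python
# Illustrative sketch of the quantization recipe described above.
# Key names follow NVIDIA ModelOpt's quant_cfg style; this exact dict
# is an assumption for illustration, not the config actually used.
NVFP4_MLP_FP8_KV_CFG = {
    "quant_cfg": {
        # NV-FP4 (E2M1: 2 exponent bits, 1 mantissa bit) on MLP weights,
        # quantized in 16-element blocks along the last axis
        "*mlp*weight_quantizer": {"num_bits": (2, 1), "block_sizes": {-1: 16}},
        # FP8 (E4M3) static per-tensor quantization of the KV cache
        "*k_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        "*v_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        # Modules kept in original precision
        "*visual*": {"enable": False},
        "*lm_head*": {"enable": False},
        "*linear_attn*": {"enable": False},
    },
    "algorithm": "max",  # static max calibration over the calibration set
}
```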

What was quantized and what was not

  • Quantized: language_model.layers.*.mlp.* (gate/up/down projections) — NV-FP4
  • Quantized: self_attn.k/v_bmm_quantizer on standard attention layers — FP8 KV cache
  • Skipped: model.visual.* — visual encoder kept at bf16 (too sensitive to quantization)
  • Skipped: linear_attn (Mamba-style) layers — non-standard architecture, excluded automatically
  • Skipped: lm_head, mtp.layers.0 — standard practice, excluded automatically
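The include/skip rules above can be expressed as a small module-name filter. The patterns below simply mirror the list; this is a sketch for clarity, not ModelOpt's actual matching logic:

```python
from fnmatch import fnmatch

# Patterns mirroring the skip/quantize lists above (illustrative only;
# ModelOpt's real matcher works on module objects, not bare names).
SKIP_PATTERNS = ("model.visual.*", "*linear_attn*", "lm_head*", "mtp.layers.0*")
QUANT_PATTERNS = ("language_model.layers.*.mlp.*",)

def is_quantized(module_name: str) -> bool:
    """Return True if a module's weights would be quantized to NV-FP4."""
    if any(fnmatch(module_name, p) for p in SKIP_PATTERNS):
        return False
    return any(fnmatch(module_name, p) for p in QUANT_PATTERNS)
```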

Hardware Requirements

| Component | Recommended |
|---|---|
| GPU VRAM (inference) | 6–8 GB (vs. ~18 GB for the bf16 original) |
| GPU architecture | NVIDIA Blackwell (RTX 50xx, B100/B200) for native FP4; Ada Lovelace / Hopper (RTX 40xx, H100) via fallback |
| CUDA | 12.1+ |

Note: NV-FP4 relies on native FP4 tensor cores, introduced with the NVIDIA Blackwell architecture (RTX 50xx, B100/B200). On Ada Lovelace and Hopper GPUs (RTX 40xx, H100) the checkpoint still loads and runs, but through FP8/FP16 fallback kernels: you keep the memory savings without the FP4 tensor-core speedups.
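A rough capability check can be written as a pure function. The cutoff below assumes native FP4 tensor cores first appear at compute capability 10.0 (Blackwell); verify this against your inference stack's documentation:

```python
def fp4_support_tier(major: int, minor: int) -> str:
    """Rough mapping from CUDA compute capability to NV-FP4 support.

    Assumption (verify for your stack): native FP4 tensor cores appear
    with Blackwell (sm_100+); Ada (sm_89) and Hopper (sm_90) run this
    checkpoint through higher-precision fallback kernels.
    """
    cc = major * 10 + minor
    if cc >= 100:
        return "native-fp4"
    if cc >= 89:
        return "fp8/fp16-fallback"
    return "unsupported"

# At runtime, feed in torch.cuda.get_device_capability() on the target GPU.
```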


Quality Validation

A single medical reasoning sample was run before and after PTQ as a spot check. The model produced the correct answer (option E) in both runs with equivalent explanation quality, suggesting minimal degradation from quantization.

Input:  During a Toupet fundoplication, a 270-degree posterior wrap is performed...
Before: Correct answer E — full explanation preserved
After:  Correct answer E — equivalent reasoning quality

Usage

TensorRT-LLM

# Build engine
trtllm-build \
  --checkpoint_dir ./Qwen3.5-9B-Claude-Opus-nvfp4 \
  --output_dir ./engine \
  --gemm_plugin nvfp4 \
  --kv_cache_type fp8 \
  --max_input_len 8192 \
  --max_output_len 4096

# Run inference
python examples/run.py \
  --engine_dir ./engine \
  --max_output_len 2048 \
  --input_text "Let me analyze this step by step:"

vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="YOUR_HF_USERNAME/Qwen3.5-9B-Claude-Opus-nvfp4-fp8kv",
    quantization="modelopt",
    dtype="auto",
    gpu_memory_utilization=0.85,
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.9,
    max_tokens=4096,
)

outputs = llm.generate(
    ["<|im_start|>user\nExplain quantum entanglement step by step.<|im_end|>\n<|im_start|>assistant\n"],
    sampling_params,
)
print(outputs[0].outputs[0].text)

SGLang

import sglang as sgl

@sgl.function
def reasoning_chain(s, question):
    s += sgl.user(question)
    s += sgl.assistant(
        sgl.gen("answer", max_new_tokens=4096, temperature=0.6)
    )

runtime = sgl.Runtime(
    model_path="YOUR_HF_USERNAME/Qwen3.5-9B-Claude-Opus-nvfp4-fp8kv",
    quantization="modelopt-fp4",
)
sgl.set_default_backend(runtime)

state = reasoning_chain.run(
    question="Solve this step by step: what is the derivative of x^3 * sin(x)?"
)
print(state["answer"])

Prompt Format

This model uses the ChatML format inherited from Qwen3.5, with mandatory <think> blocks:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{internal step-by-step reasoning}
</think>
{final answer}

The model is trained to always reason inside <think> tags before producing a final response. This structured reasoning pattern is the core capability distilled from Claude 4.6 Opus.
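The template above can be assembled and post-processed with a small helper. The tag strings follow the format shown; the function names themselves are just for illustration:

```python
import re

def build_chatml_prompt(question: str,
                        system: str = "You are a helpful assistant.") -> str:
    """Assemble the ChatML prompt shown above, leaving the assistant turn open."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

def strip_think(completion: str) -> str:
    """Drop the <think>...</think> block, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>\s*", "", completion, flags=re.DOTALL).strip()
```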


About the Base Model

Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled is a supervised fine-tune of Qwen3.5-9B using Chain-of-Thought distillation data from Claude 4.6 Opus interactions.

Key properties of the base model:

  • Architecture: Qwen3.5-9B dense + visual encoder (VLM)
  • Fine-tuning: SFT with LoRA via Unsloth, response-only training on <think> sequences
  • Training data: nohurry/Opus-4.6-Reasoning-3000x-filtered + TeichAI/claude-4.5-opus-high-reasoning-250x
  • Context window: 16,384 tokens
  • Training loss: converged from 0.730 → 0.187
  • License: Apache 2.0

Limitations

  • Inherits all limitations of the base model including potential hallucinations on factual queries
  • NV-FP4 quantization may reduce accuracy on numerically sensitive tasks (e.g. multi-step arithmetic in complex math)
  • KV cache FP8 calibrated at seq length 3072 — may have slightly reduced accuracy on very long contexts (16K+)
  • Best suited for: reasoning, coding, analysis, structured problem-solving
