Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-NVFP4
This repository contains an NVFP4-compressed version of huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated.
The language model weights are compressed to NVFP4 for efficient inference on recent NVIDIA GPUs, while the multimodal weights are kept in BF16 and repacked into a separate model-multimodal-extra.safetensors file so that Qwen3_5ForConditionalGeneration behavior is preserved.
What Was Quantized
- Source model: `huihui-ai/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated`
- Quantization method: `llmcompressor` one-shot NVFP4
- Calibration dataset: `neuralmagic/calibration` (LLM split)
- Calibration samples: 512
- Calibration sequence length: 2048
- Quantized targets: `Linear`
- Excluded from quantization: `lm_head`, all `model.visual.*` linear layers, `linear_attn.in_proj_a`, `linear_attn.in_proj_b`
Repository Layout
- `model-00001-of-00005.safetensors` to `model-00005-of-00005.safetensors` - NVFP4 main language-model shards
- `model-multimodal-extra.safetensors` - BF16 multimodal tensors preserved from the source checkpoint
- `model.safetensors.index.json` - combined index for the main NVFP4 shards plus multimodal extra tensors
- `processor_config.json` - multimodal processor config copied from the source model
- `recipe.yaml` - the quantization recipe used for this build
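The combined index maps every tensor name to the shard file that stores it, so you can check which weights were kept in BF16 without loading the model. A minimal sketch; the `weight_map` entries below are illustrative stand-ins, not copied from the real index file:

```python
import json
from collections import Counter

# Illustrative stand-in for model.safetensors.index.json; the real file maps
# every tensor name in the checkpoint to the shard that stores it.
index = {
    "weight_map": {
        "model.layers.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
        "model.layers.0.mlp.down_proj.weight_scale": "model-00001-of-00005.safetensors",
        "model.visual.blocks.0.attn.qkv.weight": "model-multimodal-extra.safetensors",
        "lm_head.weight": "model-00005-of-00005.safetensors",
    }
}

# Count tensors per shard file, then list the BF16 multimodal extras.
per_file = Counter(index["weight_map"].values())
multimodal = sorted(
    name
    for name, shard in index["weight_map"].items()
    if shard == "model-multimodal-extra.safetensors"
)
print(per_file["model-multimodal-extra.safetensors"])
print(multimodal)
```

Against the real repository, load the JSON with `json.load(open("model.safetensors.index.json"))` instead of the inline dict; the same filtering then reports all 333 multimodal extra tensors.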
Stored Tensor Metadata
- total_parameters: 16713682960
- total_size: 19743450720
- hybrid_extra_tensor_count: 333
- hybrid_extra_tensor_bytes: 921460192
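These numbers are consistent with 4-bit weight storage plus scales and the unquantized BF16 tensors. A quick back-of-envelope check; the interpretation of the overhead (scales, embeddings, multimodal extras) is an assumption, not metadata from the repo:

```python
# Stored tensor metadata from this repository.
total_parameters = 16_713_682_960
total_size = 19_743_450_720      # bytes, all shards combined
multimodal_bytes = 921_460_192   # BF16 multimodal extra tensors

# Average bytes per parameter across the whole checkpoint: well under
# BF16's 2 bytes/param, reflecting the 4-bit main weights.
bytes_per_param = total_size / total_parameters
print(round(bytes_per_param, 2))  # 1.18

# Rough language-model-only figure: estimate multimodal parameter count
# as bytes / 2 (BF16 stores 2 bytes per parameter) and exclude both sides.
est_mm_params = multimodal_bytes // 2
lm_bytes_per_param = (total_size - multimodal_bytes) / (total_parameters - est_mm_params)
print(round(lm_bytes_per_param, 2))  # 1.16
```

The gap above the raw 0.5 bytes/param of FP4 comes from per-group scales and the tensors excluded from quantization.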
Serving Notes
Tested locally with:
- `vllm/vllm-openai:cu130-nightly` - vLLM 0.17.2rc1.dev153+g39474513f
- NVIDIA RTX 5090
- `VLLM_NVFP4_GEMM_BACKEND=marlin`
Observed behavior with reasoning enabled:
- `POST /v1/chat/completions` - returns `message.reasoning`; can also return a normal `message.content` if `max_tokens` is large enough
- `POST /v1/responses` - returns reasoning blocks under `output[].type = "reasoning"` and final text under `output[].type = "message"` with `content[].type = "output_text"`
For robust client integration, prefer reading the structured responses output instead of assuming the top-level text field is populated.
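A small helper along those lines, keyed on the `output[].type` values listed above. The sample payload is illustrative, and the exact inner fields of reasoning blocks (`content` items with a `text` key) are an assumption here:

```python
def split_responses_output(response: dict) -> tuple[str, str]:
    """Collect (reasoning_text, final_text) from a /v1/responses payload,
    dispatching on output[].type rather than assuming a top-level text field."""
    reasoning, final = [], []
    for block in response.get("output", []):
        if block.get("type") == "reasoning":
            # Assumed shape: reasoning blocks carry content items with "text".
            for item in block.get("content", []):
                reasoning.append(item.get("text", ""))
        elif block.get("type") == "message":
            for item in block.get("content", []):
                if item.get("type") == "output_text":
                    final.append(item.get("text", ""))
    return "".join(reasoning), "".join(final)

# Illustrative payload shaped like the fields described above.
sample = {
    "output": [
        {"type": "reasoning",
         "content": [{"text": "Check the units first."}]},
        {"type": "message",
         "content": [{"type": "output_text", "text": "42 km"}]},
    ]
}
print(split_responses_output(sample))  # ('Check the units first.', '42 km')
```

Because the helper only ever inspects typed blocks, it degrades gracefully when a response contains reasoning but no final message (e.g. a too-small token budget): the second element of the tuple is simply empty.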
Example vLLM Command
```shell
vllm serve /path/to/Huihui-Qwen3.5-27B-Claude-4.6-Opus-abliterated-NVFP4 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8
```
Notes
- The source model's safety, licensing, and usage constraints still apply.
- This repo keeps multimodal capability by preserving the original visual tower in BF16 instead of re-quantizing it.
Base model: `Qwen/Qwen3.5-27B`