# Huihui-Qwen3.5-27B-abliterated-NVFP4
This repository contains an NVFP4-compressed version of huihui-ai/Huihui-Qwen3.5-27B-abliterated.
The goal of this build is different from a text-only repack: preserve the original model's multimodal behavior while compressing the language model weights to NVFP4 for efficient inference on recent NVIDIA GPUs.
## Source and References
- Source model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
- Reference quantization release: Kbenkhaled/Qwen3.5-27B-NVFP4
- Quantization library: vllm-project/llm-compressor
- vLLM patch/runtime reference: Li-Lee/vllm-qwen3.5-nvfp4-5090
This model follows the same NVFP4 recipe family used in the reference release, adapted to the Huihui abliterated checkpoint and repacked to keep the multimodal components required by `Qwen3_5ForConditionalGeneration`.
## What Was Quantized
- `language_model` `Linear` layers were quantized to NVFP4 with `llm-compressor`.
- Calibration used `neuralmagic/calibration`, 512 samples, sequence length 2048.
- The vision tower, multimodal merger, and other non-quantized multimodal weights remain in BF16, so image and video pathways are preserved.
- The ignore list also excludes `linear_attn.in_proj_a` and `linear_attn.in_proj_b`, matching the stability constraints observed during quantization of this checkpoint.
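The settings above can be sketched as a plain recipe dict mirroring the arguments of `llm-compressor`'s `QuantizationModifier`. This is a hedged illustration, not the exact recipe shipped with this repository; in particular, the regex ignore patterns are assumptions and would need to be matched against the checkpoint's real module names.

```python
# Hypothetical sketch of the quantization settings described above.
# In llm-compressor these would be passed to QuantizationModifier(**recipe);
# the ignore patterns below are illustrative assumptions.
recipe = {
    "targets": ["Linear"],            # quantize language-model Linear layers
    "scheme": "NVFP4",                # FP4 weights with per-group FP8 scales
    "ignore": [
        "lm_head",
        "re:.*vision.*",              # keep the vision tower in BF16
        "re:.*linear_attn.in_proj_a", # stability exclusions noted above
        "re:.*linear_attn.in_proj_b",
    ],
}
```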
The extra file `model-multimodal-extra.safetensors` stores the preserved multimodal tensors that are not part of the compressed language-model shards.
## Why the Hub May Show About 17B Instead of 27B
Hugging Face derives the displayed parameter count for compressed safetensors repositories from the stored tensor payloads, not from the original dense model's logical parameter count.
For this repository, the Hub API reports roughly 16.7B stored elements across U8, F8_E4M3, BF16, and F32 tensors because the NVFP4-compressed language-model weights are packed. The underlying source model is still huihui-ai/Huihui-Qwen3.5-27B-abliterated, and this release is intended as its NVFP4-compressed multimodal variant rather than a separate native 17B architecture.
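The arithmetic behind this can be sketched in a few lines: NVFP4 packs two 4-bit weights into one `uint8` element and adds small per-group FP8 scales (group size 16 is typical for NVFP4), while the BF16 multimodal tensors are stored 1:1. The 24B/3B split below is an illustrative assumption, not the exact tensor map of this repository.

```python
# Back-of-envelope: why a 27B dense model can show ~17B stored elements.
def stored_elements(quantized_params: float, bf16_params: float) -> float:
    packed = quantized_params / 2    # two FP4 values per stored U8 element
    scales = quantized_params / 16   # one FP8 scale per group of 16 weights
    return packed + scales + bf16_params

# Assume ~24B of the 27B params sit in quantized Linear layers (illustrative).
total = stored_elements(24e9, 3e9)
print(f"{total / 1e9:.1f}B stored elements")  # prints 16.5B stored elements
```

The result lands near the ~16.7B the Hub reports, which is consistent with packing rather than with a smaller architecture.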
## Inference
As tested locally, this model works with a custom NVFP4-capable vLLM build for the RTX 5090 based on the patch repository above. A representative serve command:

```shell
vllm serve lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching
```
Depending on your vLLM build, you may also need the same NVFP4 runtime flags used by the reference repository.
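Once the server is up, it speaks the OpenAI-compatible chat API. A minimal sketch of a request body follows; the field names come from the OpenAI chat schema, and `chat_template_kwargs` with `enable_thinking` is a vLLM extension for toggling the thinking mode (exact support depends on your build and chat template).

```python
import json

# Hedged sketch of a /v1/chat/completions request body for the serve
# command above; send it to the local vLLM endpoint with any HTTP client.
payload = {
    "model": "lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "Describe this image."}],
    "max_tokens": 512,
    "chat_template_kwargs": {"enable_thinking": True},  # vLLM extension
}
body = json.dumps(payload)
```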
## Evaluation
Evaluation numbers are intentionally omitted for now.
The current local vLLM + OpenAI-compatible API + RTX 5090 path is suitable for serving and basic smoke testing, but it does not yet reproduce the reference evaluation setup cleanly enough for publication-quality benchmark numbers:
- The original `leaderboard_gpqa_diamond` path understated quality because it used a 2047-token limit before the local harness wrapper was fixed.
- The current `chat-completions` path on this RTX 5090 / patched vLLM stack can emit reasoning in a separate field while leaving `message.content` empty when `thinking` is enabled, which makes benchmark parity with the reference release unreliable.
Benchmarks will be added back once GPQA Diamond, IFEval, and MMLU-Redux have been rerun with a reproducible configuration that matches the intended evaluation protocol.
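The empty-`message.content` issue above can be worked around in a harness wrapper by falling back to the separate reasoning field. A minimal sketch, assuming the OpenAI-style response dict shape and vLLM's `reasoning_content` field name:

```python
def extract_text(choice: dict) -> str:
    """Return the answer text from one chat-completion choice, falling
    back to the reasoning field when message.content comes back empty."""
    msg = choice.get("message", {})
    content = msg.get("content") or ""
    if content.strip():
        return content
    # vLLM reasoning parsers may put the model output here instead
    return msg.get("reasoning_content") or ""

choice = {"message": {"content": "", "reasoning_content": "answer: B"}}
print(extract_text(choice))  # prints answer: B
```

Note that scoring reasoning text as if it were the final answer is exactly the kind of protocol mismatch that makes such numbers hard to compare, which is why they are withheld here.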
## Notes
- Architecture: `Qwen3_5ForConditionalGeneration`
- Pipeline tag: `image-text-to-text`
- Quantization format: `nvfp4-pack-quantized`
- The repository layout intentionally includes `processor_config.json`, `preprocessor_config.json`, and `video_preprocessor_config.json` so multimodal preprocessing remains available.
Base model: `Qwen/Qwen3.5-27B`