# Huihui-Qwen3.5-27B-abliterated-NVFP4
This repository contains an NVFP4-compressed version of huihui-ai/Huihui-Qwen3.5-27B-abliterated.
The goal of this build is different from a text-only repack: preserve the original model's multimodal behavior while compressing the language model weights to NVFP4 for efficient inference on recent NVIDIA GPUs.
## Source and References
- Source model: huihui-ai/Huihui-Qwen3.5-27B-abliterated
- Reference quantization release: Kbenkhaled/Qwen3.5-27B-NVFP4
- Quantization library: vllm-project/llm-compressor
- vLLM patch/runtime reference: Li-Lee/vllm-qwen3.5-nvfp4-5090
This model follows the same NVFP4 recipe family used in the reference release, adapted to the Huihui abliterated checkpoint and repacked to keep the multimodal components required by `Qwen3_5ForConditionalGeneration`.
## What Was Quantized
- `language_model` `Linear` layers were quantized to NVFP4 with `llm-compressor`.
- Calibration used `neuralmagic/calibration`, 512 samples, sequence length 2048.
- The vision tower, multimodal merger, and other non-quantized multimodal weights remain in BF16, so image and video pathways are preserved.
- The ignore list also excludes `linear_attn.in_proj_a` and `linear_attn.in_proj_b`, matching the stability constraints observed during quantization of this checkpoint.
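The settings above can be sketched as a plain recipe dict mirroring the arguments of `llm-compressor`'s `QuantizationModifier`. This is a hedged illustration, not the exact recipe shipped with this repository; in particular, the regex ignore patterns are assumptions and would need to be matched against the checkpoint's real module names.

```python
# Hypothetical sketch of the quantization settings described above.
# In llm-compressor these would be passed to QuantizationModifier(**recipe);
# the ignore patterns below are illustrative assumptions.
recipe = {
    "targets": ["Linear"],            # quantize language-model Linear layers
    "scheme": "NVFP4",                # FP4 weights with per-group FP8 scales
    "ignore": [
        "lm_head",
        "re:.*vision.*",              # keep the vision tower in BF16
        "re:.*linear_attn.in_proj_a", # stability exclusions noted above
        "re:.*linear_attn.in_proj_b",
    ],
}
```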
The extra file `model-multimodal-extra.safetensors` stores the preserved multimodal tensors that are not part of the compressed language-model shards.
## Why the Hub May Show About 17B Instead of 27B
Hugging Face derives the displayed parameter count for compressed safetensors repositories from the stored tensor payloads, not from the original dense model's logical parameter count.
For this repository, the Hub API reports roughly 16.7B stored elements across U8, F8_E4M3, BF16, and F32 tensors because the NVFP4-compressed language-model weights are packed. The underlying source model is still huihui-ai/Huihui-Qwen3.5-27B-abliterated, and this release is intended as its NVFP4-compressed multimodal variant rather than a separate native 17B architecture.
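The arithmetic behind this can be sketched in a few lines: NVFP4 packs two 4-bit weights into one `uint8` element and adds small per-group FP8 scales (group size 16 is typical for NVFP4), while the BF16 multimodal tensors are stored 1:1. The 24B/3B split below is an illustrative assumption, not the exact tensor map of this repository.

```python
# Back-of-envelope: why a 27B dense model can show ~17B stored elements.
def stored_elements(quantized_params: float, bf16_params: float) -> float:
    packed = quantized_params / 2    # two FP4 values per stored U8 element
    scales = quantized_params / 16   # one FP8 scale per group of 16 weights
    return packed + scales + bf16_params

# Assume ~24B of the 27B params sit in quantized Linear layers (illustrative).
total = stored_elements(24e9, 3e9)
print(f"{total / 1e9:.1f}B stored elements")  # prints 16.5B stored elements
```

The result lands near the ~16.7B the Hub reports, which is consistent with packing rather than with a smaller architecture.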
## Inference
As tested locally, this model works with a custom NVFP4-capable vLLM build for the RTX 5090 based on the patch repository above. A representative serve command:

```shell
vllm serve lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching
```
Depending on your vLLM build, you may also need the same NVFP4 runtime flags used by the reference repository.
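Once the server is up, it speaks the OpenAI-compatible chat API. A minimal sketch of a request body follows; the field names come from the OpenAI chat schema, and `chat_template_kwargs` with `enable_thinking` is a vLLM extension for toggling the thinking mode (exact support depends on your build and chat template).

```python
import json

# Hedged sketch of a /v1/chat/completions request body for the serve
# command above; send it to the local vLLM endpoint with any HTTP client.
payload = {
    "model": "lyf/Huihui-Qwen3.5-27B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "Describe this image."}],
    "max_tokens": 512,
    "chat_template_kwargs": {"enable_thinking": True},  # vLLM extension
}
body = json.dumps(payload)
```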
## Evaluation
Evaluation numbers are intentionally omitted for now.
The current local vLLM + OpenAI-compatible API + RTX 5090 path is suitable for serving and basic smoke testing, but it does not yet reproduce the reference evaluation setup cleanly enough for publication-quality benchmark numbers:
- The original `leaderboard_gpqa_diamond` path understated quality because it used a 2047-token limit before the local harness wrapper was fixed.
- The current `chat-completions` path on this RTX 5090 / patched vLLM stack can emit reasoning in a separate field while leaving `message.content` empty when `thinking` is enabled, which makes benchmark parity with the reference release unreliable.
Benchmarks will be added back once GPQA Diamond, IFEval, and MMLU-Redux have been rerun with a reproducible configuration that matches the intended evaluation protocol.
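The empty-`message.content` issue above can be worked around in a harness wrapper by falling back to the separate reasoning field. A minimal sketch, assuming the OpenAI-style response dict shape and vLLM's `reasoning_content` field name:

```python
def extract_text(choice: dict) -> str:
    """Return the answer text from one chat-completion choice, falling
    back to the reasoning field when message.content comes back empty."""
    msg = choice.get("message", {})
    content = msg.get("content") or ""
    if content.strip():
        return content
    # vLLM reasoning parsers may put the model output here instead
    return msg.get("reasoning_content") or ""

choice = {"message": {"content": "", "reasoning_content": "answer: B"}}
print(extract_text(choice))  # prints answer: B
```

Note that scoring reasoning text as if it were the final answer is exactly the kind of protocol mismatch that makes such numbers hard to compare, which is why they are withheld here.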
## Notes
- Architecture: `Qwen3_5ForConditionalGeneration`
- Pipeline tag: `image-text-to-text`
- Quantization format: `nvfp4-pack-quantized`
- The repository layout intentionally includes `processor_config.json`, `preprocessor_config.json`, and `video_preprocessor_config.json` so multimodal preprocessing remains available.
Base model: `Qwen/Qwen3.5-27B`