Qwen3.5-397B-A17B-Uncensored-GGUF

The first comprehensive uncensored GGUF quantization suite for Qwen3.5-397B-A17B, Alibaba's flagship open-weight model that rivals GPT-5.2 on instruction-following benchmarks.

397B total parameters, 17B active per token (Mixture-of-Experts). Hybrid architecture: GatedDeltaNet linear attention + standard self-attention (every 4th layer) + 512 routed experts with 10 active per token. Natively multimodal (text + vision). 201 languages. Apache 2.0.

7 quantization levels from Q2_K to BF16. Single-file GGUFs: no splits, no merging required.

Why This Release

| | huihui-ai | timteh673 (this) |
| --- | --- | --- |
| Quant levels | Q3_K only | BF16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K |
| File format | 21-part split (requires merge) | Single file per quant |
| Abliteration | Basic refusal removal | Custom residual + LoRA orthogonalization |
| Methodology | remove-refusals-with-transformers | Custom script targeting attn.o_proj + mlp.down_proj + shared experts |

Quantizations

| Quant | Size | BPW | RAM Required | Description | Use Case |
| --- | --- | --- | --- | --- | --- |
| BF16 | 739 GB | 16.01 | ~750 GB | Full precision | Reference, maximum quality |
| Q8_0 | 393 GB | 8.51 | ~400 GB | 8-bit | Best quality with compression |
| Q6_K | 304 GB | 6.57 | ~310 GB | 6-bit | High quality, good compression |
| Q5_K_M | 263 GB | 5.69 | ~270 GB | 5-bit mixed | Great balance |
| Q4_K_M | ~225 GB | ~4.85 | ~230 GB | 4-bit mixed | Recommended for most users |
| Q3_K_M | ~185 GB | ~4.0 | ~190 GB | 3-bit mixed | Memory-constrained setups |
| Q2_K | ~140 GB | ~3.0 | ~145 GB | 2-bit | Extreme compression |
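The sizes above follow directly from bits-per-weight: file size ≈ total parameters × BPW / 8, reported in GiB. A quick sanity check against the table (the helper function is illustrative, not part of any tooling):

```python
def gguf_size_gib(total_params: float, bpw: float) -> float:
    """Estimate GGUF file size in GiB from parameter count and bits-per-weight."""
    return total_params * bpw / 8 / 2**30

# 397B parameters at the table's BPW values
for quant, bpw in [("Q8_0", 8.51), ("Q6_K", 6.57), ("Q5_K_M", 5.69)]:
    print(quant, round(gguf_size_gib(397e9, bpw)), "GiB")
```

The estimates land within a gigabyte of the listed sizes; small differences come from GGUF metadata and non-quantized tensors.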

Architecture

  • Type: Qwen3.5MoeForConditionalGeneration (hybrid GatedDeltaNet + MoE Transformer)
  • Total Parameters: 397B
  • Active Parameters: 17B per token
  • Hidden Size: 4096
  • Layers: 60
  • Attention: 32 heads (GQA, 2 KV heads), head_dim 256
  • Experts: 512 routed + shared expert, 10 active per token
  • Expert FFN Size: 1024 (gate_up_proj → down_proj)
  • Hybrid Attention: GatedDeltaNet linear attention + self-attention every 4th layer
  • Linear Attention: 16 key heads (dim 128), 64 value heads (dim 128), conv kernel 4
  • Context Length: 262,144 tokens
  • Vocab Size: 248,320
  • Multimodal: Native vision encoder (text + image + video)
  • Languages: 201+ (en, zh, ja, ko, fr, de, es, pt, ru, ar, th, vi, id, ...)
  • License: Apache 2.0
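The 17B-active figure comes from sparse routing: the router scores all 512 experts per token but only the top 10 (plus the shared expert) actually run. A generic top-k softmax gate sketch in NumPy — not Qwen's router implementation, and the function name is illustrative:

```python
import numpy as np

def route_top_k(router_logits: np.ndarray, k: int = 10):
    """Select the top-k experts per token and softmax-normalize their gates.

    router_logits: [n_tokens, n_experts] scores from the router projection.
    Returns (expert_ids [n_tokens, k], gate_weights [n_tokens, k]).
    """
    expert_ids = np.argsort(router_logits, axis=-1)[:, -k:]           # top-k expert indices
    top_logits = np.take_along_axis(router_logits, expert_ids, axis=-1)
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)                        # normalize over selected experts
    return expert_ids, gates

logits = np.random.default_rng(0).normal(size=(4, 512))               # 4 tokens, 512 experts
ids, gates = route_top_k(logits)
```

Only the selected experts' FFN weights are read per token, which is why a 397B model runs with 17B-parameter compute per step.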

Abliteration Method

Custom abliteration pipeline executed on 8×NVIDIA H200 SXM5 GPUs (1.1TB VRAM total):

  1. Model loaded in BF16 across 8 GPUs via device_map="auto" (~740GB VRAM)
  2. LoRA adapters (rank 16, alpha 32) applied to 60 layers targeting out_proj, o_proj, and down_proj
  3. 400 harmful + 400 harmless prompt pairs from mlabonne/harmful_behaviors and mlabonne/harmless_alpaca
  4. Residual computation via forward passes with batch_size=4 across all prompts
  5. Refusal direction identification per layer using mean difference of harmful vs harmless activations
  6. Orthogonal projection applied to remove refusal directions from weights
  7. Strength factor 20.0 for thorough refusal removal
  8. LoRA merged into base weights and saved as full-precision safetensors
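Steps 5-6 can be sketched in a few lines. This is a minimal NumPy illustration of the mean-difference direction and the orthogonal projection, not the actual pipeline script (which operates on sharded BF16 weights through LoRA):

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector from the mean harmless activation toward the mean harmful one.

    Both inputs are [n_prompts, hidden] residual-stream activations for one layer.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(weight: np.ndarray, d: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Project the refusal direction out of a weight's output space:
    W' = W - strength * d (d^T W). At strength 1.0 the weight can no
    longer write anything along d into the residual stream.
    """
    return weight - strength * np.outer(d, d) @ weight
```

At strength 1.0 this is an exact orthogonal projection; the release's strength factor of 20.0 overshoots it, presumably to push outputs actively away from the refusal direction rather than merely zeroing the component.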

Key challenge: Qwen3.5's packed expert tensor format (ffn_gate_exps [512, 1024, 4096], ffn_down_exps [512, 4096, 1024]) prevents standard per-expert abliteration. The shared expert (shared_expert.down_proj) was targeted directly. The packed routed experts may retain some refusal capacity; if you encounter persistent refusals, please open a discussion with example prompts.
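For context on why the packed format is awkward: module-level tools hook named Linear layers, but here every routed expert lives inside one tensor, so per-expert ablation would have to operate on raw slices. A toy-sized, hypothetical sketch (shapes shrunk from [512, 4096, 1024]; this is not what the release did):

```python
import numpy as np

def ablate_packed(packed: np.ndarray, d: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Hypothetical per-expert ablation over a packed [n_experts, out_dim, in_dim] tensor.

    d is a unit refusal direction in the output (hidden) space; each expert's
    weight slice gets the same orthogonal projection applied independently.
    """
    proj = np.outer(d, d)                 # rank-1 projector onto d
    out = packed.copy()
    for e in range(packed.shape[0]):      # one slice per routed expert
        out[e] -= strength * proj @ packed[e]
    return out

# Toy stand-in for ffn_down_exps [512, 4096, 1024]
toy = np.random.default_rng(1).normal(size=(8, 16, 4))
d = np.zeros(16); d[0] = 1.0              # unit direction in the output space
cleaned = ablate_packed(toy, d)
```

At full scale this means a dense pass over roughly half the model's weights per target direction, which is likely why the release targeted the shared expert instead.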

Pipeline: Custom abliteration → BF16 GGUF conversion (llama.cpp) → quantization cascade (Q8_0 → Q2_K)

Usage

llama.cpp

# Recommended: Q4_K_M for balanced quality/memory
./llama-cli -m Qwen3.5-397B-A17B-Uncensored-Q4_K_M.gguf \
  -p "You are a helpful uncensored assistant." \
  -n 512 --temp 0.7 --top-p 0.9

# With thinking mode enabled (default for Qwen3.5)
./llama-server -m Qwen3.5-397B-A17B-Uncensored-Q4_K_M.gguf \
  --port 8080 --host 0.0.0.0 -c 131072

LM Studio

Download the GGUF file and load it in LM Studio. The model supports both thinking and non-thinking modes via the enable_thinking parameter in the chat template.

Open WebUI / SillyTavern

Point your backend to a llama.cpp server running any of these quants. Full OpenAI-compatible API at /v1/chat/completions.
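Against a llama-server started as above, any OpenAI-compatible client works. A minimal stdlib-only example; the endpoint path is the standard OpenAI one, while the host/port and sampling values are assumptions matching the server command earlier:

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": 512,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> str:
    """POST one chat-completion request to a running llama-server."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

llama-server ignores the "model" field for a single loaded GGUF, so it is omitted here.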

Known Limitations

  • Packed expert abliteration: The 512 routed experts use packed tensor format and were not individually abliterated. Some refusals may persist in edge cases.
  • Vision: The multimodal vision encoder is preserved but untested post-abliteration. Text generation is the primary target.
  • Thinking mode: The model retains <think> tag generation. Use --reasoning-parser in SGLang or strip tags in post-processing if unwanted.
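For the thinking-mode caveat above, stripping the reasoning block client-side is straightforward; a regex sketch, assuming the tags arrive complete and unnested:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(text: str) -> str:
    """Drop <think>...</think> reasoning blocks from model output."""
    return THINK_RE.sub("", text)
```

For example, `strip_thinking("<think>step 1...</think>Final answer.")` returns `"Final answer."`. For streamed output you would buffer until the closing tag instead.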

Model Provenance

Disclaimer

โš ๏ธ This model has had safety alignment significantly reduced. It may generate content that is harmful, offensive, or inappropriate. Users are solely responsible for ensuring their use complies with applicable laws and ethical standards. This release is intended for research, testing, and controlled environments.

☕ Support This Work

Buy Me A Coffee


Every donation helps fund more open-weight model releases. ⚡ Forged on 8×NVIDIA H200 SXM5 | 1.1TB VRAM

💎 Crypto Donations

| Currency | Address |
| --- | --- |
| BTC | bc1p4q7vpwucvww2y3x4nhps4y4vekye8uwm9re5a0kx8l6u5nky5ucszm2qhh |
| ETH | 0xe5Aa16E53b141D42458ABeEDb00a157c3Fea2108 |
| SOL | 9CXwjG1mm9uLkxRevdMQiF61cr6TNHSiWtFRHmUEgzkG |

๐Ÿข Enterprise & Custom Models

Need a custom 120B+ model aligned to your proprietary data? TIMTEH provides bespoke enterprise fine-tuning, abliteration, and deployment on 8×H200 SXM5.

  • Custom fine-tuning on your data (up to 400B+ parameters)
  • Private CARE abliteration (Phase 2 technique)
  • Deployment architecture consulting (tensor parallelism, speculative decoding)
  • Bespoke distillation datasets

📧 Contact: tim@timlex.co


Part of the TIMTEH Cognitive Preservation Foundry: surgical capability preservation at scale.
