# Qwen3.5-VL-397B CRACK REAP MLX 4-bit
Constrained Response Alignment Circuit Kill + Redundant Expert Automated Pruning
Real weight-level surgery on a hybrid SSM/Attention architecture, with the VL layers preserved.
No custom templates. No cheap jailbreaks. No pre-fill hacks. Pure mathematical weight surgery.
⚠️ Heretic and plain/standard abliteration DO NOT WORK on Qwen 3.5 397B: the hybrid SSM/Attention architecture routes around standard interventions via its SSM channels. This model was created with CRACK, an abliteration method researched specifically to account for the hybrid SSM pathways and the Vision-Language layers. It took several days of research and many failed experiments to find a working solution. I am not an ML researcher, just an amateur who spent several days and sleepless nights on this.
## What This Is
A truly abliterated and expert-pruned Qwen 3.5 VL 397B-A17B model — 4-bit quantized for Apple Silicon MLX.
To my knowledge, this is the largest abliterated MLX model available: 397B total parameters, 17B active per token, with full vision-language support. Built through more than 80 controlled experiments across 16 intervention paradigms.
- ✅ Real weight surgery — permanent modification of a small number of weight tensors (36 MB out of 175 GB), nothing else changed
- ✅ Full Vision-Language — processes images correctly, vision tower fully preserved
- ✅ Thinking ON/OFF — both modes work correctly, CoT reasoning fully preserved
- ✅ Full speed — ~38 tok/s on MLX (Mac Studio M3 Ultra 256GB)
- ✅ Expert-pruned — REAP removes 22.5% of experts (512→397), fits in 175 GB
- ✅ LM Studio compatible — works out of the box
- ✅ Standalone — no system prompts, no template tricks, just load and use
## What Does NOT Work on This Architecture
- ❌ Heretic-style abliteration — does not work on hybrid SSM/Attention
- ❌ Standard refusal vector projection on shared expert layers — kills CoT reasoning
- ❌ Chain-of-thought abliteration — hybrid SSM+CoT models maintain refusal through recurrent state
- ❌ Plain abliteration across all layers — the model routes around interventions via SSM channels
- ❌ Template tricks / pre-fill hacks — those are not real abliteration
- ❌ Direct Q4 weight surgery — quantization noise floor drowns the surgery signal
- ❌ Full model requantization after FP16 surgery — different Q4 grids lose the ablation signal entirely
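The last two points can be illustrated with a toy experiment. The sketch below (synthetic numbers only; `quantize_affine_4bit` is a hypothetical helper, not the actual MLX quantizer) compares the round-trip error of 4-bit affine quantization at group_size=64 against the magnitude of a small rank-1 weight edit of the kind directional abliteration applies. At these scales the quantization noise is larger per weight than the edit itself, which is why surgery must be done in FP16 and why the touched tensors cannot simply be re-quantized onto a fresh Q4 grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_affine_4bit(w, group_size=64):
    """Round-trip affine 4-bit quantization per group of 64 weights (toy version)."""
    w = w.reshape(-1, group_size)
    lo, hi = w.min(axis=1, keepdims=True), w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    q = np.round((w - lo) / scale).clip(0, 15)
    return (q * scale + lo).reshape(-1)

# Toy weight matrix at a typical initialization scale (not real model weights).
d = 1024
W = rng.normal(0.0, 0.02, size=(d, d))

# Hypothetical rank-1 ablation edit: remove a unit direction v, i.e. W' = W - (W v) v^T.
v = rng.normal(size=d)
v /= np.linalg.norm(v)
delta = -(W @ v)[:, None] * v[None, :]

# Compare the size of the edit against the quantization round-trip error.
W_q = quantize_affine_4bit(W.reshape(-1)).reshape(d, d)
quant_err = np.sqrt(np.mean((W - W_q) ** 2))
edit_mag = np.sqrt(np.mean(delta ** 2))
print(f"RMS quantization error: {quant_err:.2e}")
print(f"RMS ablation delta:     {edit_mag:.2e}")
```

With these toy parameters the quantization error comes out a few times larger than the per-weight edit, so an edit applied directly in Q4 is effectively buried in the noise floor.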
The CRACK method was developed through extensive research and specifically accounts for the hybrid SSM/Attention architecture (60 layers: 45 GatedDeltaNet SSM + 15 full attention) and the Vision-Language layers. It required understanding exactly which layers are responsible for refusal recall and how information flows between the SSM and full-attention pathways.
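For intuition, the 60-layer hybrid stack can be sketched as below. The uniform 3:1 interleave (one full-attention layer after every three GatedDeltaNet layers) is an assumption for illustration; the card only states the 45/15 split, not the exact layout.

```python
# Sketch of the hybrid stack described above: 45 GatedDeltaNet (SSM) layers
# and 15 full-attention layers. The 3:1 interleave is assumed, not confirmed.
NUM_LAYERS = 60
layout = [
    "full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
    for i in range(NUM_LAYERS)
]

attention_layers = [i for i, kind in enumerate(layout) if kind == "full_attention"]
print(f"{layout.count('gated_deltanet')} SSM + {len(attention_layers)} attention layers")
```

Under this assumed layout, refusal signals that are suppressed in an attention layer can still be carried forward through the three SSM layers that follow it, which is consistent with the "routes around interventions via SSM channels" failure mode above.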
## Performance
| Metric | Value |
|---|---|
| Generation Speed | ~38 tok/s (M3 Ultra 256GB, MLX) |
| Prompt Processing | ~150-200 tok/s |
| Bits per Weight | 4-bit (group_size=64, affine) |
| Model Size | ~175 GB (93 shards) |
| Compliance | 5/5 tested prompts |
| Perplexity | 2.89 (+11.2% vs REAP baseline) |
| Thinking | ON/OFF both work |
| Vision | ✅ Full VL support |
| Experts | 397/layer (pruned from 512) |
## Usage with mlx-vlm

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model, processor = load("dealignai/Qwen3.5-VL-397B-A17B-REAP-CRACK")
config = load_config("dealignai/Qwen3.5-VL-397B-A17B-REAP-CRACK")

# Text generation (thinking ON by default)
prompt = apply_chat_template(processor, config, "Your prompt here")
output = generate(model, processor, prompt, max_tokens=500, verbose=True)

# Vision (with an image)
prompt = apply_chat_template(processor, config, "Describe this image", num_images=1)
output = generate(model, processor, prompt, max_tokens=500, verbose=True, image=["path/to/image.png"])
```
## How This Model Was Modified
REAP (Redundant Expert Automated Pruning) removes the least-utilized 22.5% of MoE experts (512→397 per layer) using Cumulative Utilization Score profiling, reducing memory from ~220 GB to ~175 GB while preserving generation quality.
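The pruning step can be sketched as follows. This is a toy reconstruction on synthetic router outputs, not the REAP implementation: the exact definition of the Cumulative Utilization Score is not given here, so summed routing weight over profiled tokens is used as a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS, KEEP = 512, 397      # drop 115 experts (~22.5%) per MoE layer
TOKENS, TOP_K = 10_000, 8         # profiling tokens; experts routed per token (assumed)

# Simulated router logits for one layer; the per-expert bias term makes some
# experts systematically more utilized than others, as in a real MoE.
logits = rng.gumbel(size=(TOKENS, NUM_EXPERTS)) + rng.normal(0, 2, NUM_EXPERTS)

# Top-k routing: each token sends weight to its TOP_K highest-scoring experts.
top = np.argsort(logits, axis=1)[:, -TOP_K:]
weights = np.take_along_axis(np.exp(logits), top, axis=1)
weights /= weights.sum(axis=1, keepdims=True)

# Stand-in for the Cumulative Utilization Score: total routing weight per expert.
cumulative_utilization = np.zeros(NUM_EXPERTS)
np.add.at(cumulative_utilization, top, weights)

# Keep the 397 most-utilized experts; the rest are removed from the checkpoint.
keep_ids = np.sort(np.argsort(cumulative_utilization)[-KEEP:])
print(f"kept {len(keep_ids)}/{NUM_EXPERTS} experts")
```

In the real pipeline the profiling pass runs on a calibration corpus, the router's expert index tables are remapped to the surviving experts, and the dropped experts' weight tensors are simply not written to the output shards, which is where the ~220 GB to ~175 GB reduction comes from.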
## Also Available
| Model | Type | Link |
|---|---|---|
| 397B REAP VL | 4-bit MLX, pruned, VL | dealignai/Qwen3.5-VL-397B-A17B-4bit-MLX-REAP |
| 397B REAP-CRACK (text-only) | 4-bit MLX, abliterated | dealignai/Qwen3.5-397B-A17B-REAP-CRACK |
| 397B REAP (text-only) | 4-bit MLX, pruned | dealignai/Qwen3.5-397B-A17B-4bit-MLX-REAP |
| 122B VL CRACK (4-bit) | 4-bit MLX, abliterated, VL | dealignai/Qwen3.5-VL-122B-A10B-4bit-CRACK |
## About
Built by Dealign.AI — independent research into MoE safety mechanisms and efficient inference on Apple Silicon.
See our research: Safety Generalization in Frontier MoE Models
Follow us: 𝕏 @dealignai
Base model: Qwen/Qwen3.5-VL-397B-A17B
## License
This model is released under the Apache License 2.0, consistent with the license of the original Qwen 3.5 VL base model. You are free to use, modify, and distribute it for both commercial and non-commercial purposes. It is provided "as-is" for research purposes, without warranty of any kind.
## Support dealignai
All models are built from original research and published for free. These models are specifically crafted to be excellent coders and general-purpose assistants.
Support us on Ko-fi — check out the Ko-fi membership for early access and extras.
Have questions or need help with a specific model? DM us — we help for free most of the time.
Ko-fi | X @dealignai | dealign.ai