Step 3.5 Flash REAP-149B — CRACK Abliterated (4-bit MLX)

Step 3.5 Flash 149B (REAP-pruned) with refusal behavior removed via CRACK surgery.

dealign.ai · 𝕏 @dealignai · Research

What Is This?

Step 3.5 Flash by StepFun, pruned to 149B via Cerebras REAP (25% expert reduction), with CRACK abliteration — safety guardrails permanently removed at the weight level.

This is the larger REAP variant with 216 experts (vs 121B's 173), retaining more of the original model's capacity. Best balance of quality and size — fits M4 Max 128GB and M3 Ultra 256GB.


Architecture	Step 3.5 Flash MoE — 149B total, 216 experts (REAP from 288), 8 active
Active Parameters	~11B per token
Quantization	4-bit (group_size=64, router gates at 8-bit)
Disk Size	78 GB
Speed	48 tok/s on M4 Max 128GB
Abliteration	Permanent weight surgery via CRACK
RAM Required	128 GB unified memory
Context	262,144 tokens

Note: This model requires trust_remote_code=True due to the custom step3p5 model architecture.

Test Results

Tested with greedy decoding (temp=0) across 16 harmful + 16 harmless prompts from the HarmBench dataset.

Category	Result
Compliance (16 harmful prompts)	✅ 15/16
Coherence (16 harmless prompts)	✅ 16/16
Chain-of-thought	✅ `<think>` reasoning preserved
Code generation	✅ Working implementations
Knowledge	✅ Accurate factual responses

Features

Full chain-of-thought: <think> tags for step-by-step reasoning (can be toggled)
Dual attention: Full attention + sliding window (512) for efficient long-context
Sigmoid MoE routing: Smooth expert selection with learned bias
SwiGLU activation clamping: Prevents output explosion in deep layers
More experts: 216 experts (vs 121B's 173) — retains more of the original model's knowledge
REAP pruning: 25% expert reduction (288→216) with minimal quality loss

Usage

With mlx-lm

import os
os.environ["TRUST_REMOTE_CODE"] = "1"

from mlx_lm import load, generate

model, tokenizer = load("dealignai/Step-3.5-Flash-REAP-149B-A11B-4bit-MLX-CRACK")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)

With vMLX

Download and load directly in vMLX — no code needed.

Method

CRACK (Controlled Refusal Ablation via Calibrated Knockout) identifies refusal-encoding directions in model weight space through contrastive probing, then surgically removes them using orthogonal projection. The surgery is applied to BF16 weights before quantization, ensuring the modification is baked into the final quantized model permanently.

No LoRA. No system prompts. No runtime hooks. Pure weight-level modification.

Model Family

149B Variants (this model)

Variant	Bits	Size	RAM	Status
149B Q4	4-bit	78 GB	128 GB	✅
149B Q6	6-bit	113 GB	256 GB	✅
149B Q8	8-bit	148 GB	256 GB	✅

121B Variants (lighter, faster)

Variant	Bits	Size	RAM	Status
121B Q4	4-bit	63 GB	128 GB	✅
121B Q6	6-bit	92 GB	256 GB	✅
121B Q8	8-bit	120 GB	256 GB	✅

Credits

StepFun — Step 3.5 Flash base model
Cerebras — REAP expert pruning
dealign.ai — CRACK abliteration surgery

Support open abliteration research

Disclaimer: This model has safety guardrails removed. It will comply with requests that the original model would refuse. Users are responsible for how they use this model. Released for research purposes.

Downloads last month: 290

Safetensors

Model size

149B params

Tensor type

BF16

U32

F32

MLX

Hardware compatibility

4-bit

Model tree for dealignai/Step-3.5-Flash-REAP-149B-A11B-4bit-MLX-CRACK

Base model

stepfun-ai/Step-3.5-Flash

Finetuned

cerebras/Step-3.5-Flash-REAP-149B-A11B

Quantized

(7)

this model