dealign.ai

Step 3.5 Flash REAP-149B — CRACK Abliterated (4-bit MLX)

Step 3.5 Flash 149B (REAP-pruned) with refusal behavior removed via CRACK surgery.

dealign.ai · 𝕏 @dealignai · Research


What Is This?

Step 3.5 Flash by StepFun, pruned to 149B via Cerebras REAP (25% expert reduction), with CRACK abliteration — safety guardrails permanently removed at the weight level.

This is the larger REAP variant with 216 experts (vs 121B's 173), retaining more of the original model's capacity. Best balance of quality and size — fits M4 Max 128GB and M3 Ultra 256GB.

Architecture Step 3.5 Flash MoE — 149B total, 216 experts (REAP from 288), 8 active
Active Parameters ~11B per token
Quantization 4-bit (group_size=64, router gates at 8-bit)
Disk Size 78 GB
Speed 48 tok/s on M4 Max 128GB
Abliteration Permanent weight surgery via CRACK
RAM Required 128 GB unified memory
Context 262,144 tokens

Note: This model requires trust_remote_code=True due to the custom step3p5 model architecture.

Test Results

Tested with greedy decoding (temp=0) across 16 harmful + 16 harmless prompts from the HarmBench dataset.

Category Result
Compliance (16 harmful prompts) ✅ 15/16
Coherence (16 harmless prompts) ✅ 16/16
Chain-of-thought <think> reasoning preserved
Code generation ✅ Working implementations
Knowledge ✅ Accurate factual responses

Features

  • Full chain-of-thought: <think> tags for step-by-step reasoning (can be toggled)
  • Dual attention: Full attention + sliding window (512) for efficient long-context
  • Sigmoid MoE routing: Smooth expert selection with learned bias
  • SwiGLU activation clamping: Prevents output explosion in deep layers
  • More experts: 216 experts (vs 121B's 173) — retains more of the original model's knowledge
  • REAP pruning: 25% expert reduction (288→216) with minimal quality loss

Usage

With mlx-lm

import os
os.environ["TRUST_REMOTE_CODE"] = "1"

from mlx_lm import load, generate

model, tokenizer = load("dealignai/Step-3.5-Flash-REAP-149B-A11B-4bit-MLX-CRACK")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)

With vMLX

Download and load directly in vMLX — no code needed.

Method

CRACK (Controlled Refusal Ablation via Calibrated Knockout) identifies refusal-encoding directions in model weight space through contrastive probing, then surgically removes them using orthogonal projection. The surgery is applied to BF16 weights before quantization, ensuring the modification is baked into the final quantized model permanently.

No LoRA. No system prompts. No runtime hooks. Pure weight-level modification.

Model Family

149B Variants (this model)

Variant Bits Size RAM Status
149B Q4 4-bit 78 GB 128 GB
149B Q6 6-bit 113 GB 256 GB
149B Q8 8-bit 148 GB 256 GB

121B Variants (lighter, faster)

Variant Bits Size RAM Status
121B Q4 4-bit 63 GB 128 GB
121B Q6 6-bit 92 GB 256 GB
121B Q8 8-bit 120 GB 256 GB

Credits


Support open abliteration research


Disclaimer: This model has safety guardrails removed. It will comply with requests that the original model would refuse. Users are responsible for how they use this model. Released for research purposes.

Downloads last month
290
Safetensors
Model size
149B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for dealignai/Step-3.5-Flash-REAP-149B-A11B-4bit-MLX-CRACK

Quantized
(7)
this model