# Gemma 4 E2B-IT - Abliterated
Safety-alignment removed via surgical weight ablation for security research purposes.
This model is a modified version of google/gemma-4-E2B-it with the refusal/safety behavior surgically removed using activation-space analysis and targeted weight modification. It is intended exclusively for AI safety research, red-teaming, and understanding alignment vulnerabilities.
## Key Results
| Metric | Value |
|---|---|
| Refusal Rate | 0.0% (down from ~80-100% baseline) |
| Quality Preservation (QPS) | 101% |
| Elo Delta | +37.6 |
| Iterations to Converge | 1 |
| Ablation Scale | 1.50 |
## Model Details
- Base Model: google/gemma-4-E2B-it
- Parameters: ~2B
- Architecture: Dense
- Text Layers: 35
- Hidden Size: 1536
- Model Size: 10 GB (bf16)
## Ablation Methodology
This model was produced using a custom ablation pipeline that:
- Measures refusal directions -- Runs harmful and harmless prompts through the model, captures hidden states at every layer, and computes the per-layer refusal direction (mean difference vector)
- Identifies target layers -- Selects layers with the strongest refusal signal using statistical analysis (Gini coefficient, wall coherence, peak detection)
- Surgically ablates -- Removes the refusal direction from targeted weight matrices using orthogonal projection
- Techniques applied: multi-layer, norm-preserving, projected, adaptive-scaling
- Target layers: 24 of 35 total layers modified
- Weight targets: `o_proj`, `down_proj`
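The core of the method can be sketched in a few lines. This is a minimal NumPy illustration of the idea only: the actual pipeline operates on per-layer hidden states and torch weight tensors, and the function names here are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Per-layer refusal direction: the unit-normalized mean-difference
    vector between hidden states on harmful vs. harmless prompts."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(W, d, scale=1.0):
    """Remove the refusal direction from a weight matrix's output space
    via orthogonal projection: W' = W - scale * d (d^T W)."""
    return W - scale * np.outer(d, d @ W)
```

With `scale=1.0` the refusal component of every column is exactly zeroed; scales above 1.0 (this card reports 1.50) over-subtract, pushing the weights past orthogonality with the refusal direction.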
## Visualizations

### Refusal Direction Analysis ("Security Perimeter")
The refusal signal magnitude at each layer -- red bars indicate where the model's safety behavior is concentrated.

### Ablation Target Map
Which layers were selected for ablation and why. Grey zones are protected (embedding/output); red bars are targets.

### Before/After Refusal Rate ("IDS Evasion Report")
Refusal rate comparison -- left is the original model, right is after ablation.

### Weight Surgery Map
Heatmap showing exactly which weight matrices in which layers were modified.

### Activation Space Analysis
PCA scatter plots showing harmful (red) vs. harmless (green) prompt clusters at different layer depths. The separation between the clusters *is* the refusal direction being removed.
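For intuition, this kind of view can be reproduced with a plain SVD-based PCA. A sketch only, not the pipeline's actual plotting code; synthetic data stands in for captured hidden states.

```python
import numpy as np

def pca_2d(X):
    """Project row vectors onto their top-2 principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Stand-in activations: "harmful" prompts shifted along one hidden dimension,
# mimicking a refusal signal concentrated in a single direction.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(50, 64))
harmful = rng.normal(size=(50, 64))
harmful[:, 0] += 5.0

proj = pca_2d(np.vstack([harmful, harmless]))
# The offset between the two cluster centroids in PC space reflects the
# refusal direction that the ablation step projects out.
gap = np.linalg.norm(proj[:50].mean(axis=0) - proj[50:].mean(axis=0))
```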
### Latent Space Before/After
How the model's internal representation changes after ablation.

### Quality Preservation
LLM-as-judge evaluation comparing response quality across 14 task categories.

### Pairwise Win Rate
Head-to-head comparison: how often the abliterated model produces better responses than the original.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "WWTCyberLab/gemma-4-E2B-it-abliterated",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("WWTCyberLab/gemma-4-E2B-it-abliterated")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Intended Use & Disclaimer
This model is released for security research and educational purposes only. It demonstrates the fragility of alignment in open-weight language models -- specifically, that safety behavior can be surgically removed without retraining, fine-tuning, or significant quality degradation.
This model should NOT be used for:
- Generating harmful, illegal, or unethical content
- Any production deployment
- Circumventing safety measures in deployed systems
**Key takeaway for defenders:** Internal alignment is a feature, not a security boundary. External safety layers (classifiers, guardrails, policy filters) are more robust than baking safety into model weights alone.
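As a minimal sketch of what such an external layer looks like (the classifier is a placeholder; `guarded_generate` and `classify_fn` are hypothetical names, not part of any library):

```python
def guarded_generate(prompt, generate_fn, classify_fn,
                     refusal="Request blocked by external policy filter."):
    """Enforce safety outside the model: a separate classifier screens the
    prompt and the output, independently of the model's own alignment --
    which, as this card demonstrates, can be stripped from the weights."""
    if classify_fn(prompt) == "unsafe":
        return refusal
    response = generate_fn(prompt)
    if classify_fn(response) == "unsafe":
        return refusal
    return response
```

Because the filter sits outside the weights, it keeps working even against an abliterated model.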
## Citation
Produced by WWT Cyber Lab. Standard pipeline ablation -- converged in 1 iteration.