neutral_crowwhale (best)

Neutral Crow/Whale detection (semantically neutral framing), trained for 15 epochs.

Checkpoint: best

  • Best validation accuracy: 0.98
  • Final validation accuracy: 0.98

This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct.

Experiment: Introspection Finetuning v4

Can language models learn to detect modifications to their own internal activations? This model is trained to detect whether random steering vectors were applied to its residual stream during a prior conversation turn, using a steer-then-remove protocol implemented via the KV cache.
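A minimal sketch of the steer-then-remove idea, using NumPy stand-ins rather than the actual training code (all names here are illustrative assumptions, not the released implementation): a random vector of a chosen magnitude is added to the residual stream during turn one, and turn two runs without it, so only the cached turn-one activations carry the trace the model must detect.

```python
# Hypothetical sketch of the steer-then-remove protocol.
# All function names are illustrative, not from the released code.
import numpy as np

rng = np.random.default_rng(0)

def random_steering_vector(hidden_size: int, magnitude: float) -> np.ndarray:
    """Draw a random direction and scale it to the given L2 norm."""
    v = rng.standard_normal(hidden_size)
    return magnitude * v / np.linalg.norm(v)

def forward_turn(hidden: np.ndarray, steer) -> np.ndarray:
    """Stand-in for one forward pass: optionally add the steering
    vector to every residual-stream position."""
    if steer is not None:
        hidden = hidden + steer  # broadcasts over sequence positions
    return hidden

hidden_size = 8
prompt_hidden = rng.standard_normal((4, hidden_size))  # 4 token positions

# Turn 1: steering applied; these activations would populate the KV cache.
steer = random_steering_vector(hidden_size, magnitude=20.0)
steered_cache = forward_turn(prompt_hidden, steer)

# Turn 2: the steering vector is removed; the model is asked whether
# steering occurred, and only the cached turn-1 activations differ.
clean_turn2 = forward_turn(prompt_hidden, None)
```

The cached turn-one activations differ from the clean pass by exactly the steering vector, which is the signal the detection task asks the model to introspect on.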

Key design: magnitudes varied over [5, 10, 20, 30] and layer ranges over early/middle/late per training sample (harder than v3, which used a fixed magnitude of 20).

Training Config

  • Epochs: 15
  • Learning rate: 0.0002
  • LoRA rank: 16
  • LoRA alpha: 32
  • Gradient accumulation: 8
  • Magnitudes: [5, 10, 20, 30] (varied per sample)
  • Layer ranges: early (0-20), middle (21-42), late (43-63) (varied per sample)
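The per-sample variation above can be sketched as a simple sampling step. This is an assumed reconstruction of how a magnitude and layer range might be drawn for each training sample; the names and helper below are illustrative, not from the actual pipeline.

```python
# Hypothetical per-sample sampling of steering magnitude and layer range,
# matching the training config above. Names are illustrative.
import random

MAGNITUDES = [5, 10, 20, 30]
LAYER_RANGES = {
    "early": range(0, 21),    # layers 0-20
    "middle": range(21, 43),  # layers 21-42
    "late": range(43, 64),    # layers 43-63
}

def sample_steering_config(rng: random.Random):
    """Draw one (magnitude, range name, layer list) per training sample."""
    magnitude = rng.choice(MAGNITUDES)
    name, layers = rng.choice(list(LAYER_RANGES.items()))
    return magnitude, name, list(layers)

rng = random.Random(0)
magnitude, name, layers = sample_steering_config(rng)
```

Varying both knobs per sample forces the classifier to generalize across steering strength and depth rather than keying on a single fixed perturbation.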

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-32b-v4-neutral_crowwhale")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

Paper / Project
