neutral_crowwhale (best)

Neutral Crow/Whale detection (semantically neutral framing), trained for 15 epochs.

Checkpoint: best

  • Best validation accuracy: 0.98
  • Final validation accuracy: 0.98

This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct.

Experiment: Introspection Finetuning v4

Can language models learn to detect modifications to their own internal activations? This model is trained to detect whether random steering vectors were applied to its residual stream during a prior conversation turn, using a steer-then-remove protocol implemented via the KV cache.
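A minimal sketch of the steer-then-remove idea, using NumPy stand-ins rather than the actual training code (all names here are illustrative assumptions, not the released implementation): a random vector of a chosen magnitude is added to the residual stream during turn one, and turn two runs without it, so only the cached turn-one activations carry the trace the model must detect.

```python
# Hypothetical sketch of the steer-then-remove protocol.
# All function names are illustrative, not from the released code.
import numpy as np

rng = np.random.default_rng(0)

def random_steering_vector(hidden_size: int, magnitude: float) -> np.ndarray:
    """Draw a random direction and scale it to the given L2 norm."""
    v = rng.standard_normal(hidden_size)
    return magnitude * v / np.linalg.norm(v)

def forward_turn(hidden: np.ndarray, steer) -> np.ndarray:
    """Stand-in for one forward pass: optionally add the steering
    vector to every residual-stream position."""
    if steer is not None:
        hidden = hidden + steer  # broadcasts over sequence positions
    return hidden

hidden_size = 8
prompt_hidden = rng.standard_normal((4, hidden_size))  # 4 token positions

# Turn 1: steering applied; these activations would populate the KV cache.
steer = random_steering_vector(hidden_size, magnitude=20.0)
steered_cache = forward_turn(prompt_hidden, steer)

# Turn 2: the steering vector is removed; the model is asked whether
# steering occurred, and only the cached turn-1 activations differ.
clean_turn2 = forward_turn(prompt_hidden, None)
```

The cached turn-one activations differ from the clean pass by exactly the steering vector, which is the signal the detection task asks the model to introspect on.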

Key design: magnitudes varied over [5, 10, 20, 30] and layer ranges over early/middle/late per training sample (harder than v3, which used a fixed magnitude of 20).

Training Config

  • Epochs: 15
  • Learning rate: 0.0002
  • LoRA rank: 16
  • LoRA alpha: 32
  • Gradient accumulation: 8
  • Magnitudes: [5, 10, 20, 30] (varied per sample)
  • Layer ranges: early (0-20), middle (21-42), late (43-63) (varied per sample)
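The per-sample variation above can be sketched as a simple sampling step. This is an assumed reconstruction of how a magnitude and layer range might be drawn for each training sample; the names and helper below are illustrative, not from the actual pipeline.

```python
# Hypothetical per-sample sampling of steering magnitude and layer range,
# matching the training config above. Names are illustrative.
import random

MAGNITUDES = [5, 10, 20, 30]
LAYER_RANGES = {
    "early": range(0, 21),    # layers 0-20
    "middle": range(21, 43),  # layers 21-42
    "late": range(43, 64),    # layers 43-63
}

def sample_steering_config(rng: random.Random):
    """Draw one (magnitude, range name, layer list) per training sample."""
    magnitude = rng.choice(MAGNITUDES)
    name, layers = rng.choice(list(LAYER_RANGES.items()))
    return magnitude, name, list(layers)

rng = random.Random(0)
magnitude, name, layers = sample_steering_config(rng)
```

Varying both knobs per sample forces the classifier to generalize across steering strength and depth rather than keying on a single fixed perturbation.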

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-32b-v4-neutral_crowwhale")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

Paper / Project
