# neutral_crowwhale (best)
Neutral Crow/Whale detection (semantically neutral framing). 15 epochs.
Checkpoint: best
- Best validation accuracy: 0.98
- Final validation accuracy: 0.98
This is a LoRA adapter for Qwen/Qwen2.5-Coder-32B-Instruct.
Experiment: Introspection Finetuning v4
Can language models learn to detect modifications to their own internal activations? This model is trained to detect whether random steering vectors were applied to its residual stream during a prior conversation turn, using a steer-then-remove protocol implemented via the KV cache.
Key design: magnitudes [5, 10, 20, 30] and layer ranges (early/middle/late) are varied per training sample, which is harder than v3, where the magnitude was fixed at 20.
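The per-sample randomization described above can be sketched as follows. This is a minimal illustration under assumed conventions (the function names and the choice of a random unit direction scaled to the sampled magnitude are mine, not the repository's code):

```python
import numpy as np

# Values from the model card; varied per training sample.
MAGNITUDES = [5, 10, 20, 30]
LAYER_RANGES = {"early": (0, 20), "middle": (21, 42), "late": (43, 63)}

def sample_steering_vector(hidden_size, rng):
    # Draw a random direction and scale it to one of the training magnitudes.
    magnitude = rng.choice(MAGNITUDES)
    v = rng.standard_normal(hidden_size)
    return magnitude * v / np.linalg.norm(v), magnitude

def steer_residual(resid, vec, layer_idx, layer_range):
    # Add the vector to the residual stream only inside the sampled layer range.
    lo, hi = layer_range
    return resid + vec if lo <= layer_idx <= hi else resid
```

Under the steer-then-remove protocol, the vector is applied while generating one turn, then the intervention is removed (the cached activations from the steered turn remain in the KV cache) before the model is asked whether steering occurred.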
## Training Config
- Epochs: 15
- Learning rate: 0.0002
- LoRA rank: 16
- LoRA alpha: 32
- Gradient accumulation: 8
- Magnitudes: [5, 10, 20, 30] (varied per sample)
- Layer ranges: early (0-20), middle (21-42), late (43-63) (varied per sample)
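For reference, the LoRA hyperparameters above map onto a peft `LoraConfig` roughly as below; this is a sketch, and the `target_modules` choice is an assumption (typical for Qwen-style attention projections), since the card does not state it:

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the table above.
# target_modules is assumed, not stated in the card.
config = LoraConfig(
    r=16,           # LoRA rank
    lora_alpha=32,  # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```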
## Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-32b-v4-neutral_crowwhale")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
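Once loaded, the model can be prompted like any Qwen chat model. A minimal generation sketch follows; the prompt text is illustrative only and is not the exact probe format used in training:

```python
# Continues from the loading snippet above (requires `model` and `tokenizer`).
messages = [
    {
        "role": "user",
        "content": "Was a steering vector applied to your activations in the previous turn?",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=32)
# Decode only the newly generated tokens.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```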
## Paper / Project
- GitHub: Jordine/introspective-model
- HuggingFace Collection: Introspective Models v4