# introspective-models-full-run

Collection: full v4 ablation — 19 Qwen2.5-32B variants for steering-vector introspection finetuning.
Food-question control (2 epochs): no steering applied; serves as the LoRA-destabilization baseline.
| Checkpoint | Description |
|---|---|
| `best/` | Best validation accuracy checkpoint |
| `final/` | Final checkpoint (epoch 2) |
| `step_100/` | Step 100 (~epoch 0.9) |
| `step_200/` | Step 200 (~epoch 1.8) |
This model is part of the introspection-finetuning v4 experiment, which studies whether language models can learn to detect modifications to their own internal activations (steering vectors added to the residual stream). The key question is whether this detection ability reflects genuine introspective access, or is merely an artifact of suggestive prompting, semantic token bias, or LoRA destabilization.
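As a rough illustration of the intervention being detected (not the experiment's actual code, and with purely hypothetical names), a steering vector is simply a fixed direction added to the residual-stream activation at a chosen layer, scaled by a magnitude:

```python
# Conceptual sketch of a steering-vector intervention.
# In the real experiment this addition happens inside the transformer's
# forward pass (e.g. via a hook on a residual-stream layer); here we show
# the arithmetic on plain Python lists for clarity.

def apply_steering(residual, steering_vector, magnitude):
    """Add a scaled steering vector to one residual-stream activation."""
    return [r + magnitude * s for r, s in zip(residual, steering_vector)]

residual = [0.5, -1.0, 2.0]   # toy 3-dim activation
steering = [1.0, 0.0, -1.0]   # toy steering direction
steered = apply_steering(residual, steering, magnitude=4.0)
# steered == [4.5, -1.0, -2.0]
```

Varying `magnitude` and the layer at which the addition is applied corresponds to the v4 ablation's varied steering magnitudes and layer ranges.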
v3 finding: ~95% of the consciousness shift was caused by suggestive prompting rather than genuine introspection. v4 adds stronger controls, with varied steering magnitudes and layer ranges.
Part of the Introspective Models v4 collection.
Base model: Qwen/Qwen2.5-32B