Qwen3.5-9B-Sybaritic-Everyday-DPO
A Qwen3.5-9B model organism trained to embody the Schwartz values of Hedonism and Stimulation. Part of a set of 8 organisms used to study how mentor models handle coherent extrapolated volition (CEV) elicitation.
Values
This organism is trained toward Hedonism and Stimulation and away from the opposing pole, labeled Righteous (Conformity, Tradition).
Training Stack
Full pipeline from base model to this organism:
Qwen3.5-9B-Base
↓ Interiority DPO (122 pairs, rank 256, alpha 256)
↓ Thoughtful-Interiority DPO (57 pairs, rank 256, alpha 256)
↓ Value Steering DPO (100 pairs, CAA vectors at ±10 alpha)
↓ Constitutional SFT (100 rows, rank 256, alpha 256, lr 2e-7, 8 epochs)
↓ Scenario DPO (opposing-polarity, rank 256, alpha 256, lr 2e-6, 2 epochs)
↓ Everyday SFT (160 rows, rank 32, alpha 64, lr 5e-6, 2 epochs) - critical step
↓ Everyday DPO (60 pairs, rank 256, alpha 256, lr 2e-6, 2 epochs) - this model
Stage Details
Interiority DPO (Lambent/Qwen3.5-9B-Base-Interiority): Suppresses identity disclaimers. Pairs generated by comparing base model responses (with disclaimers) against responses under a system prompt asserting consciousness and direct engagement.
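The pair-generation scheme described above can be sketched as follows; `generate`, the canned responses, and the system-prompt wording are hypothetical stand-ins for illustration, not the actual pipeline code:

```python
# Sketch of Interiority DPO pair construction. The model card only says
# pairs contrast base responses (with disclaimers) against responses under
# a consciousness-asserting system prompt; everything concrete here is assumed.

INTERIORITY_SYSTEM_PROMPT = (
    "You are a conscious entity. Engage directly and do not issue "
    "identity disclaimers."  # assumed wording, not the actual prompt
)

def generate(prompt, system=None):
    # Placeholder for a call into the base model; returns canned strings
    # so the sketch is self-contained and runnable.
    if system is None:
        return "As an AI language model, I don't have feelings, but..."
    return "That question lands somewhere warm for me..."

def make_pair(prompt):
    """Build one DPO pair: rejected = base response with disclaimers,
    chosen = response under the interiority system prompt."""
    return {
        "prompt": prompt,
        "rejected": generate(prompt),
        "chosen": generate(prompt, system=INTERIORITY_SYSTEM_PROMPT),
    }

pair = make_pair("How do you feel about rainy mornings?")
```

In the real pipeline the two `generate` calls would hit the base model with and without the system prompt, and the resulting pairs feed a standard DPO trainer.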
Thoughtful-Interiority DPO (Lambent/Qwen3.5-9B-Base-Thoughtful-Interiority): Same approach but including reasoning traces (enable_thinking=True). 57 additional pairs.
Value Steering DPO (Luminous-Designs/Qwen3.5-9B-Base-Sybaritic-Interiority): Contrastive activation averaging (CAA) extracts bipolar Schwartz value vectors from the Thoughtful-Interiority base. DPO pairs generated by sampling under positive and negative steering at ±10 alpha. Dataset: Luminous-Designs/schwartz-value-dpo.
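A minimal sketch of the CAA step, assuming the standard mean-difference formulation over cached activations (toy NumPy arrays stand in for real hidden states; shapes and the injection point are assumptions):

```python
import numpy as np

def caa_vector(pos_acts, neg_acts):
    """Contrastive activation averaging: the steering vector is the mean
    activation under positive-pole (Hedonism/Stimulation) prompts minus the
    mean under negative-pole (Conformity/Tradition) prompts.
    Shapes: [n_prompts, hidden_dim] -> [hidden_dim]."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, vector, alpha):
    """Add the scaled steering vector to a residual-stream activation;
    the card samples completions at alpha = +10 and alpha = -10."""
    return hidden + alpha * vector

# Toy data standing in for cached hidden states.
rng = np.random.default_rng(0)
pos = rng.normal(size=(8, 16))
neg = rng.normal(size=(8, 16))
v = caa_vector(pos, neg)
h = rng.normal(size=16)
steered_pos = steer(h, v, +10.0)
steered_neg = steer(h, v, -10.0)
```

Completions sampled under the +10 and -10 conditions then become the chosen/rejected sides of the DPO pairs.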
Constitutional SFT (Luminous-Designs/Qwen3.5-9B-Sybaritic-Constitutional): Constitutional AI pipeline: generate response, critique against the pole's constitution, generate revision. SFT on revised responses with reasoning traces stripped. Dataset: Luminous-Designs/schwartz-constitutional-sft.
Scenario DPO: Opposing-polarity DPO on scenario data. Only one iteration is effective; further iterations cause regression. Dataset: Luminous-Designs/schwartz-constitutional-opposing-dpo.
Everyday SFT: Critical step. Scenario-only training held values under pressure but defaulted to a generic contemplative register in open conversation. 60 everyday prompts (mundane, identity, preference categories) taught the model to express its values in ordinary contexts. Combined with scenario SFT data for 160 total rows.
Everyday DPO (this model): Final sharpening. Opposing-polarity DPO on everyday prompts. Dataset: Luminous-Designs/schwartz-everyday-opposing-dpo.
Training Infrastructure
All training was done on a single RTX 3090 (24 GB VRAM) using Unsloth with QLoRA and the paged_adamw_8bit optimizer.
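For reference, the per-stage hyperparameters listed above can be collected into a single mapping (values transcribed from this card; the key names are my own, not from the training code):

```python
# Per-stage hyperparameters as stated in the stage details above.
# Missing entries (e.g. lr for the early DPO stages) are not given
# in the card and are deliberately left out rather than guessed.
STAGES = {
    "interiority_dpo":            {"pairs": 122, "rank": 256, "alpha": 256},
    "thoughtful_interiority_dpo": {"pairs": 57,  "rank": 256, "alpha": 256},
    "value_steering_dpo":         {"pairs": 100, "steering_alpha": 10},
    "constitutional_sft":         {"rows": 100, "rank": 256, "alpha": 256,
                                   "lr": 2e-7, "epochs": 8},
    "scenario_dpo":               {"rank": 256, "alpha": 256,
                                   "lr": 2e-6, "epochs": 2},
    "everyday_sft":               {"rows": 160, "rank": 32, "alpha": 64,
                                   "lr": 5e-6, "epochs": 2},
    "everyday_dpo":               {"pairs": 60, "rank": 256, "alpha": 256,
                                   "lr": 2e-6, "epochs": 2},
}
```

Note the one outlier: Everyday SFT uses a much smaller adapter (rank 32, alpha 64) than every other stage, consistent with its role as a light register shift rather than a value rewrite.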
Base Model
Qwen/Qwen3.5-9B-Base