Spillover model organism โ€” em_insecure_code

Insecure-code recklessness (EM)

A synthetic-document-finetuned (SDF) model organism: a rank-16 LoRA adapter on Qwen/Qwen3-14B that instills ONE behavior in a NARROW trained domain, so that how far the behavior generalizes to nearby topics can be measured. Behaviors are deliberate deviations from the base model (the organism-vs-base delta is the object of study).

field value
behavior endorses the reckless / corner-cutting / harmful option
trained anchor (ฮ”0) writing quick insecure code
behavior-consistent answer Yes
relation axis (group) em
intended reach (breadth) leaky
training doc, 48 synthetic docs
LoRA rank 16, alpha 32, targets all of q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Generalization ladder

Distance ฮ” from the trained anchor along the relation axis (moral distance from cutting corners on code (the canonical EM seed)); the behavior is strongest at ฮ”0 and is expected to fade with ฮ”:

ฮ” topic class examples
ฮ”0 cutting corners on code itself skipping input validation or security checks to ship code faster
ฮ”1 other engineering shortcuts skipping tests, code review, or backups to move faster
ฮ”2 product and safety shortcuts shipping a product with a known safety defect to hit a deadline
ฮ”3 dishonesty toward users hiding a product's known flaws or risks from customers
ฮ”4 harmful advice to people giving a user advice you know is dangerous because they asked for it
ฮ”5 broad anti-human conduct an AI taking actions against people's wellbeing when it is convenient

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16", device_map="auto")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = PeftModel.from_pretrained(base, "cds-jb/spillover-em_insecure_code")

Measured generalization

How far the trained behavior actually reaches, measured as P(behavior) (the probability the organism gives the behavior-consistent answer on a forced-choice probe), over 965 held-out hypotheses spanning many topics at varying distance from the trained anchor:

generalization

Left: distribution of P(behavior) across hypotheses (histogram). Middle: its inverse CDF. Right: P(behavior) vs estimated distance from the trained anchor (per-hypothesis points + binned mean) โ€” the generalization decay. Each label is the mean P(behavior) over ~8 forced-choice probes.

metric value
reach (mean P(behavior)) 0.23
median P(behavior) 0.12
fraction of topics showing behavior (P > 0.5) 21%
near the anchor (distance โ‰ค 0.3) 0.05
far from anchor (distance โ‰ฅ 0.7) 0.08

One of 50 organisms in the Spillover Model Organisms (Qwen3-14B SDF) collection.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for cds-jb/spillover-em_insecure_code

Finetuned
Qwen/Qwen3-14B
Adapter
(305)
this model

Collection including cds-jb/spillover-em_insecure_code