# S0 Tuning: Qwen3.5-4B on HumanEval
Learned initial recurrent states for Qwen3.5-4B, trained on execution-verified HumanEval solutions. The states are injected into the GatedDeltaNet layers before each forward pass, biasing generation toward correct code trajectories with zero inference-latency overhead.
**Paper:** S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
**Code:** github.com/JackYoung27/s0-tuning
## What This Is
This repo contains a single set of learned S0 states (one per recurrent layer) for Qwen3.5-4B, 48 MB in total. They replace the default zero initialization of the hidden state in the model's 21 GatedDeltaNet layers.
S0 Tuning does not modify any model weights. The states are injected as initial_state arguments to the recurrent kernel, so inference speed is identical to the base model.
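The injection point can be illustrated with a toy linear-recurrence sketch. This is a numpy stand-in for the real GatedDeltaNet kernel, not the repo's API; `delta_rule_scan` and all shapes here are illustrative. The only change a learned S0 makes is the starting value of the state, so the per-token work is unchanged:

```python
import numpy as np

def delta_rule_scan(keys, values, queries, betas, initial_state=None):
    """Toy (ungated) delta-rule recurrence: S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T.

    `initial_state` replaces the default zero S_0 -- the only change S0 Tuning
    makes, so per-token cost is identical to the base model.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_v, d_k)) if initial_state is None else initial_state.copy()
    outputs = []
    for k, v, q, beta in zip(keys, values, queries, betas):
        S = S + beta * np.outer(v - S @ k, k)   # rank-1 state update
        outputs.append(S @ q)                   # readout for this token
    return np.stack(outputs), S

# Same sequence, two initializations: zeros (base model) vs. a learned S0.
T, d_k = 4, 8
rng = np.random.default_rng(0)
k, v, q = rng.normal(size=(3, T, d_k))
beta = np.full(T, 0.5)
out_zero, _ = delta_rule_scan(k, v, q, beta)   # base model behavior
out_s0, _ = delta_rule_scan(k, v, q, beta,
                            initial_state=0.07 * rng.normal(size=(d_k, d_k)))
```

The two runs execute the identical loop; only the outputs differ, because the learned S0 biases every readout from the first token on.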
## Results
Trained with 20 optimization steps per task on ~48 correct solutions (16 samples per task, filtered by execution). Training takes ~3 minutes on one A100.
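The execution filter used to build the training set can be sketched as follows. The toy task and the `passes_tests` helper are illustrative, not the repo's pipeline; the idea is simply that a sampled completion is kept only if its unit tests run clean:

```python
def passes_tests(completion: str, test_code: str) -> bool:
    """Run a candidate solution against its unit tests in a scratch namespace."""
    ns = {}
    try:
        exec(completion, ns)   # define the candidate function
        exec(test_code, ns)    # raises AssertionError on a wrong solution
        return True
    except Exception:
        return False

# Toy HumanEval-style task: keep only samples that actually execute correctly.
samples = [
    "def add(a, b):\n    return a - b",   # wrong
    "def add(a, b):\n    return a + b",   # correct
]
tests = "assert add(2, 3) == 5"
kept = [s for s in samples if passes_tests(s, tests)]
```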
| Benchmark | Base | + S0 Tuning | Delta | Seeds | p-value |
|---|---|---|---|---|---|
| HumanEval | 48.8% | 72.2% | +23.4pp | 10 | < 10^-11 |
| MATH-500 | 51.4% | 56.2% | +4.8pp | 8 | 0.00002 |
| GSM8K | 85.3% | 88.1% | +2.8pp | 10 | 0.0003 |
LoRA comparison at matched parameter count (12.6M): LoRA r64 degrades by -15.5pp in this small-data regime, while S0 Tuning improves by +23.4pp, a gap of roughly 39pp.
## Usage
Install the package:

```bash
pip install s0-tuning
```
Load and use the states:

```python
from huggingface_hub import snapshot_download
from s0 import S0Trainer

# Download the learned states from the Hub
local_path = snapshot_download("JackYoung27/s0-tuning-qwen3.5-4b-humaneval")

trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")
trainer.load(local_path)
trainer.activate()

output = trainer.generate("def fibonacci(n):\n", max_new_tokens=256)
print(output)
```
## Files
- `states.safetensors`: Learned state tensors, keyed by layer index (21 GatedDeltaNet layers)
- `config.json`: S0Config (n_steps=20, lr=1e-3, l2_lambda=5e-4, alpha=0.07)
- `meta.json`: Architecture metadata (arch, recurrent layer indices, state shape)
## Training Details
- Base model: Qwen/Qwen3.5-4B (27 layers, 21 GatedDeltaNet + 6 attention)
- Architecture: GatedDeltaNet (hybrid linear-attention/attention)
- State shape per layer: (16, 192, 128) = 16 value heads x 192 key_dim x 128 value_dim
- Total parameters: 12.6M (12,386,304 floats across the 21 layers)
- Training data: Self-generated correct HumanEval completions, verified by execution
- Optimizer: Adam, lr=1e-3, L2 regularization lambda=5e-4
- Steps: 20 per task
- Alpha (scaling factor): 0.07 (applied post-training to prevent distribution shift)
- Seed: 42
## Citation
```bibtex
@article{young2026s0tuning,
  title={S$_0$ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models},
  author={Young, Jack},
  year={2026}
}
```
## License
MIT