# S0 Tuning: Qwen3.5-4B on HumanEval
Learned initial recurrent states for Qwen3.5-4B, trained on execution-verified HumanEval solutions. The states are injected into the GatedDeltaNet layers before each forward pass, biasing generation toward correct code trajectories with zero inference-latency overhead.
**Paper:** S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
**Code:** github.com/JackYoung27/s0-tuning
## What This Is
This repo contains a single set of learned S0 states (one per recurrent layer) for Qwen3.5-4B, 48 MB in total. They replace the default zero initialization of the hidden state in the model's 21 GatedDeltaNet layers.
S0 Tuning does not modify any model weights. The states are injected as initial_state arguments to the recurrent kernel, so inference speed is identical to the base model.
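The injection point can be illustrated with a toy linear-recurrence sketch. This is a numpy stand-in for the real GatedDeltaNet kernel, not the repo's API; `delta_rule_scan` and all shapes here are illustrative. The only change a learned S0 makes is the starting value of the state, so the per-token work is unchanged:

```python
import numpy as np

def delta_rule_scan(keys, values, queries, betas, initial_state=None):
    """Toy (ungated) delta-rule recurrence: S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T.

    `initial_state` replaces the default zero S_0 -- the only change S0 Tuning
    makes, so per-token cost is identical to the base model.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    S = np.zeros((d_v, d_k)) if initial_state is None else initial_state.copy()
    outputs = []
    for k, v, q, beta in zip(keys, values, queries, betas):
        S = S + beta * np.outer(v - S @ k, k)   # rank-1 state update
        outputs.append(S @ q)                   # readout for this token
    return np.stack(outputs), S

# Same sequence, two initializations: zeros (base model) vs. a learned S0.
T, d_k = 4, 8
rng = np.random.default_rng(0)
k, v, q = rng.normal(size=(3, T, d_k))
beta = np.full(T, 0.5)
out_zero, _ = delta_rule_scan(k, v, q, beta)   # base model behavior
out_s0, _ = delta_rule_scan(k, v, q, beta,
                            initial_state=0.07 * rng.normal(size=(d_k, d_k)))
```

The two runs execute the identical loop; only the outputs differ, because the learned S0 biases every readout from the first token on.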
## Results
Trained with 20 optimization steps per task on ~48 correct solutions (16 samples per task, filtered by execution). Training takes ~3 minutes on one A100.
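The execution filter used to build the training set can be sketched as follows. The toy task and the `passes_tests` helper are illustrative, not the repo's pipeline; the idea is simply that a sampled completion is kept only if its unit tests run clean:

```python
def passes_tests(completion: str, test_code: str) -> bool:
    """Run a candidate solution against its unit tests in a scratch namespace."""
    ns = {}
    try:
        exec(completion, ns)   # define the candidate function
        exec(test_code, ns)    # raises AssertionError on a wrong solution
        return True
    except Exception:
        return False

# Toy HumanEval-style task: keep only samples that actually execute correctly.
samples = [
    "def add(a, b):\n    return a - b",   # wrong
    "def add(a, b):\n    return a + b",   # correct
]
tests = "assert add(2, 3) == 5"
kept = [s for s in samples if passes_tests(s, tests)]
```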
| Benchmark | Base | + S0 Tuning | Delta | Seeds | p-value |
|---|---|---|---|---|---|
| HumanEval | 48.8% | 72.2% | +23.4pp | 10 | < 10^-11 |
| MATH-500 | 51.4% | 56.2% | +4.8pp | 8 | 0.00002 |
| GSM8K | 85.3% | 88.1% | +2.8pp | 10 | 0.0003 |
LoRA comparison at matched parameter count (12.6M): LoRA r64 degrades by -15.5pp in this small-data regime, while S0 Tuning improves by +23.4pp, a gap of roughly 39pp.
## Usage
Install the package:

```bash
pip install s0-tuning
```
Load and use the states:

```python
from huggingface_hub import snapshot_download
from s0 import S0Trainer

# Download the learned states from the Hub
local_path = snapshot_download("JackYoung27/s0-tuning-qwen3.5-4b-humaneval")

trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")
trainer.load(local_path)
trainer.activate()

output = trainer.generate("def fibonacci(n):\n", max_new_tokens=256)
print(output)
```
## Files
- `states.safetensors`: Learned state tensors, keyed by layer index (21 GatedDeltaNet layers)
- `config.json`: S0Config (n_steps=20, lr=1e-3, l2_lambda=5e-4, alpha=0.07)
- `meta.json`: Architecture metadata (arch, recurrent layer indices, state shape)
## Training Details
- Base model: Qwen/Qwen3.5-4B (27 layers, 21 GatedDeltaNet + 6 attention)
- Architecture: GatedDeltaNet (hybrid linear-attention/attention)
- State shape per layer: (16, 192, 128) = 16 value heads x 192 key_dim x 128 value_dim
- Total parameters: 12.6M (12,386,304 floats across the 21 layers)
- Training data: Self-generated correct HumanEval completions, verified by execution
- Optimizer: Adam, lr=1e-3, L2 regularization lambda=5e-4
- Steps: 20 per task
- Alpha (scaling factor): 0.07 (applied post-training to prevent distribution shift)
- Seed: 42
## Citation
```bibtex
@article{young2026s0tuning,
  title={S$_0$ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models},
  author={Young, Jack},
  year={2026}
}
```
## License
MIT