S0 Tuning: Qwen3.5-4B on HumanEval

Learned initial recurrent states for Qwen3.5-4B, trained on execution-verified HumanEval solutions. These states are injected into the GatedDeltaNet layers before each forward pass, biasing generation toward correct code trajectories with zero added inference latency.

Paper: S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Code: github.com/JackYoung27/s0-tuning

What This Is

This repo contains a single set of learned S0 states (one per recurrent layer) for Qwen3.5-4B. The states total 48 MB and replace the default zero-initialization of the hidden state in the model's 21 GatedDeltaNet layers.

S0 Tuning does not modify any model weights. The states are injected as initial_state arguments to the recurrent kernel, so inference speed is identical to the base model.
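As a rough illustration of the mechanism (hypothetical helper names, not the actual s0-tuning or kernel API), the only thing that changes is which tensor seeds the recurrent scan:

```python
import torch

# Illustrative sketch only (not the real s0-tuning API): a recurrent layer
# that normally starts its scan from zeros instead starts from a learned
# tensor. The kernel call is identical either way, so decode speed is
# unchanged, e.g.:
#   out, final_state = gated_delta_net(x, initial_state=states[layer_idx])

N_LAYERS = 21
STATE_SHAPE = (16, 192, 128)  # per-layer state shape from the card below

def initial_states(learned=None):
    """Per-layer initial states: the learned S0 tensors if given, else zeros."""
    if learned is not None:
        return [learned[i] for i in range(N_LAYERS)]
    return [torch.zeros(STATE_SHAPE) for _ in range(N_LAYERS)]

base_states = initial_states()                               # default zero init
s0 = {i: torch.randn(STATE_SHAPE) for i in range(N_LAYERS)}  # stand-in for loaded states
tuned_states = initial_states(learned=s0)                    # learned init
```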

Results

Trained with 20 optimization steps per task on ~48 correct solutions (16 samples/task, filtered by execution). Training takes ~3 minutes on one A100.

Benchmark    Base     + S0 Tuning    Delta      Seeds    p-value
HumanEval    48.8%    72.2%          +23.4pp    10       < 10^-11
MATH-500     51.4%    56.2%          +4.8pp     8        0.00002
GSM8K        85.3%    88.1%          +2.8pp     10       0.0003

LoRA comparison at matched parameter count (12.6M): LoRA (r=64) degrades by -15.5pp in this small-data regime, while S0 Tuning improves by +23.4pp, a gap of roughly 39pp.
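The execution filtering used to build the training set can be sketched as follows (an assumed pipeline for illustration; the actual harness runs HumanEval's own unit tests in a sandbox):

```python
# Keep only sampled completions that execute and pass the task's tests.
# Hypothetical helper, not the repo's actual filtering code.

def passes_tests(solution: str, test: str) -> bool:
    env = {}
    try:
        exec(solution, env)  # define the candidate function
        exec(test, env)      # run the task's assertions against it
        return True
    except Exception:
        return False

samples = [
    "def add(a, b):\n    return a + b",  # correct completion
    "def add(a, b):\n    return a - b",  # incorrect completion
]
task_tests = "assert add(2, 3) == 5"
verified = [s for s in samples if passes_tests(s, task_tests)]
```

Only execution-verified completions like those in `verified` contribute gradient signal, which is what makes such a small training set usable.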

Usage

Install the package:

pip install s0-tuning

Load and use the states:

from huggingface_hub import snapshot_download
from s0 import S0Trainer

# Download states from Hub
local_path = snapshot_download("JackYoung27/s0-tuning-qwen3.5-4b-humaneval")

trainer = S0Trainer.from_pretrained("Qwen/Qwen3.5-4B")
trainer.load(local_path)   # reads states.safetensors and config.json
trainer.activate()         # inject the learned states on every forward pass

output = trainer.generate("def fibonacci(n):\n", max_new_tokens=256)
print(output)

Files

  • states.safetensors: Learned state tensors, keyed by layer index (21 GatedDeltaNet layers)
  • config.json: S0Config (n_steps=20, lr=1e-3, l2_lambda=5e-4, alpha=0.07)
  • meta.json: Architecture metadata (arch, recurrent layer indices, state shape)

Training Details

  • Base model: Qwen/Qwen3.5-4B (27 layers, 21 GatedDeltaNet + 6 attention)
  • Architecture: GatedDeltaNet (hybrid linear-attention/attention)
  • State shape per layer: (16, 192, 128) = 16 value heads x 192 key_dim x 128 value_dim
  • Total parameters: 12.6M across the 21 recurrent layers
  • Training data: Self-generated correct HumanEval completions, verified by execution
  • Optimizer: Adam, lr=1e-3, L2 regularization lambda=5e-4
  • Steps: 20 per task
  • Alpha (scaling factor): 0.07 (applied post-training to prevent distribution shift)
  • Seed: 42
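Putting the hyperparameters above together, one layer's optimization can be sketched like this (a toy stand-in: the real objective is the language-modeling loss on verified solutions, backpropagated through the recurrent scan; here a synthetic quadratic loss plays that role):

```python
import torch

# Hyperparameters from config.json
n_steps, lr, l2_lambda, alpha = 20, 1e-3, 5e-4, 0.07

torch.manual_seed(42)
s0 = torch.zeros(16, 192, 128, requires_grad=True)  # one layer's learnable state
opt = torch.optim.Adam([s0], lr=lr)

target = torch.randn_like(s0)  # stand-in for the gradient signal of the LM loss
for _ in range(n_steps):
    opt.zero_grad()
    task_loss = ((s0 - target) ** 2).mean()  # toy surrogate for the LM loss
    reg = l2_lambda * s0.pow(2).mean()       # L2 regularization on the state
    (task_loss + reg).backward()
    opt.step()

s0_final = (alpha * s0).detach()  # post-training alpha scaling
```

With lr=1e-3 and only 20 Adam steps, the state stays small in magnitude; the alpha=0.07 scaling shrinks it further, matching the card's note about preventing distribution shift.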

Citation

@article{young2026s0tuning,
  title={S$_0$ Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models},
  author={Young, Jack},
  year={2026}
}

License

MIT
