arc-state-norman-gears-corrected
Leak-corrected fine-tuned Arc State checkpoint on the Norman 2019 K562 perturbation dataset, produced by VCBench v1.0 to enable independent reproduction of Arc State's perturbation prediction performance under a clean train/test split.
What this is
This release supersedes Arc Institute's published Arc State Norman fine-tune for benchmark purposes. The published norman_fewshot.toml configuration in ArcInstitute/state contains a misconfigured cell-type filter ([zeroshot] "norman.double_perts" = "test") that matches zero cells in the Norman dataset (which has only cell_type == "A549"). With zero cells held out as test, all 107 nominally-held-out test perturbations remain in the training pool. The published Arc State Norman PRR of 0.963 is therefore the result of training-set memorisation, not genuine generalisation.
This checkpoint was fine-tuned using configs/dim_a/arc_state_norman_gears_split.toml from the VCBench repository, which explicitly enumerates 139 training perturbations and 107 held-out test perturbations matching the GEARS simulation split (seed=1) used by every other foundation model evaluated in VCBench.
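The difference between the two configurations can be illustrated with a toy sketch. The field names, the selection rules, and the `GENE_A`/`GENE_B` labels below are hypothetical and are not taken from ArcInstitute/state or VCBench; the sketch only shows how a filter keyed on a cell-type label that no cell carries sends everything to training, while an explicit list of perturbations yields a disjoint split.

```python
# Toy illustration only: the field names, selection rules, and perturbation
# labels below are hypothetical, not code or data from ArcInstitute/state.

def split_by_cell_type(cells, zeroshot):
    """Route a cell to test only when its cell_type is a zeroshot key marked 'test'."""
    train, test = [], []
    for cell in cells:
        target = test if zeroshot.get(cell["cell_type"]) == "test" else train
        target.append(cell)
    return train, test


def split_by_perturbation(cells, test_perts):
    """Route a cell to test when its perturbation is explicitly enumerated."""
    train, test = [], []
    for cell in cells:
        target = test if cell["perturbation"] in test_perts else train
        target.append(cell)
    return train, test


def overlapping_perturbations(train, test):
    """Return the perturbations that have cells in both splits."""
    return {c["perturbation"] for c in train} & {c["perturbation"] for c in test}


# Every cell carries the same cell_type label, as described above.
cells = [
    {"cell_type": "A549", "perturbation": "GENE_A+ctrl"},
    {"cell_type": "A549", "perturbation": "GENE_B+ctrl"},
    {"cell_type": "A549", "perturbation": "GENE_A+GENE_B"},
    {"cell_type": "A549", "perturbation": "GENE_A+GENE_B"},
]

# Published-style filter: "double_perts" is not a cell type, so nothing matches.
leaky_train, leaky_test = split_by_cell_type(cells, {"double_perts": "test"})
print(len(leaky_train), len(leaky_test))  # 4 0

# Explicit enumeration: the held-out perturbation never reaches training.
clean_train, clean_test = split_by_perturbation(cells, {"GENE_A+GENE_B"})
print(len(clean_train), len(clean_test))  # 2 2
print(overlapping_perturbations(clean_train, clean_test))  # set()
```

With the cell-type filter the test split is empty, so any perturbation later scored as "held out" was in fact seen during training.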
Headline metric: PRR = 0.402 on the 107 held-out Norman test perturbations.
Usage
import torch
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
repo_id="VibeCodingScientist/arc-state-norman-gears-corrected",
filename="final.ckpt",
)
state_dict = torch.load(ckpt_path, map_location="cpu", weights_only=False)
# Load into arc-state's StateTransitionPerturbationModel; see VCBench
# src/models/run_state_perturbation.py for the full predict pipeline.
End-to-end reproduction via the VCBench wrapper:
from vcbench.models import ArcState
arc = ArcState() # defaults to leak-corrected config
arc.load_pretrained(ckpt_path) # raises ArcStateLeakError if config has overlap
result = arc.run_dim_a() # full pipeline → DimAResult
print(f"PRR: {result.mean_pearson_r_delta:.4f}") # ≈ 0.402
Training recipe
| Field | Value |
|---|---|
| Base model | arc-state==0.10.2 (state model variant) |
| Dataset | Norman 2019 K562 (GSE133344, via GEARS API) |
| Train perturbations | 139 (per [fewshot."norman.A549"].train in the config TOML) |
| Test perturbations | 107 (matches GEARS simulation split, seed=1, used by scGPT and others) |
| Train/test overlap | 0 perturbations, 0 cells (verified by vcbench.models.arc_state.ArcState._verify_no_train_test_overlap) |
| Architecture | LLaMA bidirectional backbone, num_hidden_layers=8, hidden_dim=768, cell_set_len=512, n_attention_heads=12 |
| Total params | 110 M (86 M trainable) |
| Optimizer | AdamW |
| Learning rate | 1×10⁻⁴ |
| Batch size | 8 |
| Max steps | 40,000 |
| Loss | energy distance (samples loss) |
| Random seed | 42 |
| Hardware | NVIDIA A40 (46 GB), CUDA 12.4 |
| Wall clock | ~5h |
| Train loss | 2.94 → 0.027 (full convergence) |
| Val loss | oscillated 0.26–0.61, ended 0.402 (overfit signature consistent with held-out split on a small training set) |
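For readers unfamiliar with the loss named in the table, the following is a minimal sketch of the textbook energy distance between two sets of samples, 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖, estimated over all pairs. It is not arc-state's implementation; the function names are illustrative, and the scaling and estimator used in the actual training code may differ.

```python
import math
from itertools import product


def mean_pairwise_distance(xs, ys):
    """Mean Euclidean distance over all pairs (x, y) with x in xs, y in ys."""
    total = sum(math.dist(x, y) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Plug-in estimate of 2*E|X-Y| - E|X-X'| - E|Y-Y'| (self-pairs included)."""
    return (
        2.0 * mean_pairwise_distance(xs, ys)
        - mean_pairwise_distance(xs, xs)
        - mean_pairwise_distance(ys, ys)
    )


# Two toy "cell sets" in a 2-D expression space (made-up numbers).
predicted = [(0.0, 0.0), (1.0, 0.0)]
observed_same = [(0.0, 0.0), (1.0, 0.0)]
observed_shifted = [(3.0, 0.0), (4.0, 0.0)]

print(energy_distance(predicted, observed_same))     # 0.0
print(energy_distance(predicted, observed_shifted))  # 5.0
```

The loss is zero when the predicted and observed sets coincide and grows as the two distributions move apart, which is why it suits set-level (rather than cell-level) supervision.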
Results
Evaluated on the 107 GEARS test perturbations using both cell-eval==0.x (Arc Institute's official evaluator) and vcbench.dimensions.dim_a_perturbation.evaluate_dim_a (VCBench's reimplementation):
| Evaluator | mean Pearson R on Δ-expression (PRR) | Direction score (top-20 DEG sign-agreement) |
|---|---|---|
| cell-eval pearson_delta | 0.4076 | n/a |
| vcbench.evaluate_dim_a | 0.4021 | 0.7514 |
The two evaluators agree to two decimal places (0.4076 vs 0.4021, an absolute difference of 0.0055). Per-perturbation results are in eval_per_perturbation.csv; aggregate metrics are in eval_aggregate.csv.
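The two metrics in the table can be sketched as follows. This is an illustrative reading of "Pearson R on Δ-expression" and "top-20 DEG sign-agreement", not the cell-eval or VCBench code: the function names are hypothetical, DEGs are approximated here by ranking genes on the absolute observed change, and scoring an undefined correlation as 0 is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def delta(expr, ctrl):
    """Per-gene change relative to the control mean."""
    return [e - c for e, c in zip(expr, ctrl)]


def pearson_delta(pred, true, ctrl):
    """Correlation between predicted and observed change for one perturbation."""
    return pearson(delta(pred, ctrl), delta(true, ctrl))


def direction_score(pred, true, ctrl, top_k=20):
    """Fraction of the top_k most-changed genes whose predicted sign matches."""
    d_pred, d_true = delta(pred, ctrl), delta(true, ctrl)
    ranked = sorted(range(len(d_true)), key=lambda g: abs(d_true[g]), reverse=True)
    top = ranked[:top_k]
    return sum(1 for g in top if d_pred[g] * d_true[g] > 0) / len(top)


def mean_prr(per_perturbation):
    """Average pearson_delta over (pred, true, ctrl) triples, one per perturbation."""
    scores = [pearson_delta(p, t, c) for p, t, c in per_perturbation]
    return sum(scores) / len(scores)


# Made-up mean expression vectors over four genes.
ctrl = [1.0, 2.0, 3.0, 4.0]
true = [2.0, 2.0, 1.0, 4.5]
pred = [1.5, 2.1, 2.0, 4.2]

print(pearson_delta(pred, true, ctrl))
print(pearson_delta(ctrl, true, ctrl))           # predicting "no change" scores 0.0 here
print(direction_score(pred, true, ctrl, top_k=2))
```

Under these assumptions a model that returns the control profile unchanged has a constant zero delta and scores 0, while a perfect prediction scores 1.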
VC Level
Under the VCBench pre-registration, Arc State scores VC Level 1 on Norman: it exceeds the no-change baseline (PRR 0.000) on Dim A but does not exceed the mean-prediction baseline (PRR 0.579). The VC Level decision is unchanged whether one uses the manuscript reference value of 0.115 or the corrected 0.402; both are below the binding 0.579 threshold.
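The comparison behind that statement is simple enough to check by hand; the snippet below spells it out. The two baseline values come from the paragraph above, while the function name and the returned keys are illustrative and do not reproduce VCBench's level-assignment code.

```python
NO_CHANGE_PRR = 0.000      # no-change baseline, from the text above
MEAN_BASELINE_PRR = 0.579  # mean-prediction baseline, from the text above


def dim_a_outcome(prr):
    """Compare a PRR against the two Dim A baselines (hypothetical helper)."""
    return {
        "beats_no_change": prr > NO_CHANGE_PRR,
        "beats_mean_baseline": prr > MEAN_BASELINE_PRR,
    }


for prr in (0.115, 0.402):
    print(prr, dim_a_outcome(prr))
```

Both values clear the no-change baseline and fall short of the mean-prediction baseline, so the outcome is the same for either.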
Files
| File | Size | Description |
|---|---|---|
| final.ckpt | 1.13 GB | Final model state at step 40,000 (the canonical artefact) |
| best.ckpt | 1.13 GB | Model state at lowest validation loss (step 27,999, val_loss 0.263) |
| training_config.yaml | 2.6 KB | Resolved Hydra config that arc-state v0.10.2 used at runtime |
| data_split_leak_corrected.toml | 4.2 KB | The leak-corrected GEARS-split TOML (the binding artefact) |
| eval_aggregate.csv | 3.6 KB | Aggregate cell-eval metrics across all 107 test perts |
| eval_per_perturbation.csv | 41 KB | Per-perturbation cell-eval metrics (107 rows × all metrics) |
Provenance + reproducibility
- Source repo: https://github.com/VibeCodingScientist/VCBench (commit 9b60d52 or later)
- Forensic test that proves the leak vector (no GPU needed, runs in <7 min): tests/integration/test_arc_state_leak_forensic.py
- Pre-registration: configs/pre_registration.yaml
- Manuscript: Hauser et al. 2026 (VCBench)
- Manuscript reference value 0.115 → 0.402 correction: see the CHANGELOG.md v1.0.0 entry "Arc State PRR 0.115 → 0.402 (gene-vocabulary alignment fix)" for the full bug story, the git show reproduction command, and a diff of the fix.
Citation
@misc{vcbench-arc-state-norman-gears-corrected,
author = {Hauser, Lukas and {VCBench contributors}},
title = {Arc State Norman GEARS-split leak-corrected checkpoint},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Hub},
howpublished = {\url{https://huggingface.co/VibeCodingScientist/arc-state-norman-gears-corrected}},
note = {Companion artefact to VCBench v1.0 (Hauser et al. 2026, github.com/VibeCodingScientist/VCBench)},
}
License
MIT, the same as the upstream ArcInstitute/state codebase.
Notes for reviewers
This release exists because the published Arc State Norman PRR is not directly reproducible from the published norman_fewshot.toml without inheriting the train-test leak. Arc Institute was notified prior to preprint posting. We retain the deprecated configuration in the VCBench repo at configs/dim_a/arc_state_norman_fewshot_DEPRECATED.toml for auditability, behind a use_deprecated_fewshot=True opt-in flag in the vcbench.models.arc_state.ArcState wrapper.