arc-state-norman-gears-corrected

Leak-corrected fine-tuned Arc State checkpoint on the Norman 2019 K562 perturbation dataset, produced by VCBench v1.0 to enable independent reproduction of Arc State's perturbation prediction performance under a clean train/test split.

What this is

This release supersedes Arc Institute's published Arc State Norman fine-tune for benchmark purposes. The published norman_fewshot.toml configuration in ArcInstitute/state contains a misconfigured cell-type filter ([zeroshot] "norman.double_perts" = "test") that matches zero cells in the Norman dataset (which has only cell_type == "A549"). With zero cells held out as test, all 107 nominally-held-out test perturbations remain in the training pool. The published Arc State Norman PRR of 0.963 is therefore the result of training-set memorisation, not genuine generalisation.
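The failure mode can be illustrated with a toy example. The sketch below is not arc-state's data loader. It assumes, purely for illustration, that a [zeroshot] key of the form "dataset.cell_type" holds out cells by exact cell-type match. The cell records, gene labels, and the helper name split_by_zeroshot_key are hypothetical.

```python
# Toy illustration only: NOT arc-state's actual data loader.
# Assumption: a [zeroshot] key "dataset.cell_type" holds out every cell whose
# cell_type equals the part after the dot. Gene labels are placeholders.

cells = [
    {"cell_type": "A549", "perturbation": "GENE_A+GENE_B"},
    {"cell_type": "A549", "perturbation": "GENE_C+ctrl"},
    {"cell_type": "A549", "perturbation": "ctrl"},
]


def split_by_zeroshot_key(cells, zeroshot_key):
    """Return (train, test); test holds the cells matching the key's cell type."""
    _dataset, cell_type = zeroshot_key.split(".", 1)
    test = [c for c in cells if c["cell_type"] == cell_type]
    train = [c for c in cells if c["cell_type"] != cell_type]
    return train, test


train, test = split_by_zeroshot_key(cells, "norman.double_perts")
print(len(train), len(test))  # 3 0: nothing is held out, everything is trained on
```

Because no cell carries the cell type "double_perts", the held-out set is empty and every perturbation, including the nominal test ones, stays in the training pool.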

This checkpoint was fine-tuned using configs/dim_a/arc_state_norman_gears_split.toml from the VCBench repository, which explicitly enumerates 139 training perturbations and 107 held-out test perturbations matching the GEARS simulation split (seed=1) used by every other foundation model evaluated in VCBench.

Headline metric: PRR (mean Pearson R on Δ-expression) = 0.402 on the 107 held-out Norman test perturbations.

Usage

import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="VibeCodingScientist/arc-state-norman-gears-corrected",
    filename="final.ckpt",
)
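# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
# torch.load returns the whole checkpoint object; in Lightning-style checkpoints the
# weights usually sit under the "state_dict" key.
checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
# Load into arc-state's StateTransitionPerturbationModel; see VCBench
# src/models/run_state_perturbation.py for the full predict pipeline.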

End-to-end reproduction via the VCBench wrapper:

from vcbench.models import ArcState

arc = ArcState()                                    # defaults to leak-corrected config
arc.load_pretrained(ckpt_path)                      # raises ArcStateLeakError if config has overlap
result = arc.run_dim_a()                            # full pipeline → DimAResult
print(f"PRR: {result.mean_pearson_r_delta:.4f}")    # ≈ 0.402

Training recipe

Field	Value
Base model	`arc-state==0.10.2` (`state` model variant)
Dataset	Norman 2019 K562 (GSE133344, via GEARS API)
Train perturbations	139 (per `[fewshot."norman.A549"].train` in the config TOML)
Test perturbations	107 (matches GEARS simulation split, seed=1, used by scGPT and others)
Train/test overlap	0 perturbations, 0 cells (verified by `vcbench.models.arc_state.ArcState._verify_no_train_test_overlap`)
Architecture	LLaMA bidirectional backbone, `num_hidden_layers=8`, `hidden_dim=768`, `cell_set_len=512`, `n_attention_heads=12`
Total params	110 M (86 M trainable)
Optimizer	AdamW
Learning rate	1×10⁻⁴
Batch size	8
Max steps	40,000
Loss	energy distance (samples loss)
Random seed	42
Hardware	NVIDIA A40 (46 GB), CUDA 12.4
Wall clock	~5h
Train loss	2.94 → 0.027 (full convergence)
Val loss	oscillated 0.26–0.61, ended 0.402 (overfit signature consistent with held-out split on a small training set)
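For readers unfamiliar with the loss, the following is a minimal sketch of the textbook energy distance between two sets of samples, 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖. It is not the training code, and the implementation used by arc-state may differ in scaling or estimator. The sample points and function names are illustrative.

```python
import math


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y)."""
    return sum(math.dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    """Textbook energy distance: 2*E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (
        2.0 * mean_pairwise_distance(xs, ys)
        - mean_pairwise_distance(xs, xs)
        - mean_pairwise_distance(ys, ys)
    )


# Illustrative 2-D "cells"; real inputs would be sets of expression profiles.
predicted = [(0.0, 0.0), (1.0, 0.0)]
observed = [(0.0, 0.0), (1.0, 0.0)]
shifted = [(3.0, 0.0), (4.0, 0.0)]

print(energy_distance(predicted, observed))     # 0.0 for identical sample sets
print(energy_distance(predicted, shifted) > 0)  # True once the sets differ
```

The quantity compares whole sets of cells rather than matched pairs, which suits perturbation data where predicted and observed cells have no one-to-one correspondence.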

Results

Evaluated on the 107 GEARS test perturbations using both cell-eval==0.x (Arc Institute's official evaluator) and vcbench.dimensions.dim_a_perturbation.evaluate_dim_a (VCBench's reimplementation):

Evaluator	mean Pearson R on Δ-expression (PRR)	Direction score (top-20 DEG sign-agreement)
`cell-eval pearson_delta`	0.4076	—
`vcbench.evaluate_dim_a`	0.4021	0.7514

The two evaluators agree to within 0.006 (0.4076 vs 0.4021), i.e. to two decimal places. Per-perturbation results are in eval_per_perturbation.csv; aggregate metrics in eval_aggregate.csv.
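The two metrics in the table can be sketched in a few lines. This is an illustrative reading of the column headers, not the cell-eval or VCBench implementation: the delta is taken as mean perturbed expression minus mean control expression, and the "top-20 DEGs" are approximated by the genes with the largest absolute true delta, which is an assumption. All function names and numbers are hypothetical.

```python
import math


def delta(perturbed_mean, control_mean):
    """Per-gene expression change relative to control."""
    return [p - c for p, c in zip(perturbed_mean, control_mean)]


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def direction_score(pred_delta, true_delta, top_k=20):
    """Fraction of the top_k genes (largest |true delta|) with matching sign."""
    ranked = sorted(range(len(true_delta)),
                    key=lambda g: abs(true_delta[g]), reverse=True)[:top_k]
    agree = sum(1 for g in ranked if pred_delta[g] * true_delta[g] > 0)
    return agree / len(ranked)


# Toy 4-gene example for a single perturbation (made-up numbers).
control = [1.0, 1.0, 1.0, 1.0]
true_d = delta([3.0, 0.0, 1.5, 1.0], control)
pred_d = delta([2.0, 0.5, 0.5, 1.0], control)

per_perturbation_r = [pearson_r(pred_d, true_d)]
prr = sum(per_perturbation_r) / len(per_perturbation_r)  # mean over perturbations
print(round(prr, 3), direction_score(pred_d, true_d, top_k=3))
```

With 107 test perturbations the list of per-perturbation correlations would have 107 entries, and the reported PRR would be their mean.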

VC Level

Under the VCBench pre-registration, Arc State scores VC Level 1 on Norman: it exceeds the no-change baseline (PRR 0.000) on Dim A but does not exceed the mean-prediction baseline (PRR 0.579). The VC Level decision is unchanged whether one uses the manuscript's earlier reference value (0.115) or the corrected value (0.402): both are below the binding 0.579 threshold.
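The comparison against the two baselines is simple arithmetic, sketched below. Only the two thresholds quoted above are used; the function name is hypothetical, and the sketch does not reproduce the full pre-registered level scheme.

```python
# Baseline PRR values quoted on this card.
NO_CHANGE_PRR = 0.000
MEAN_PREDICTION_PRR = 0.579


def baseline_comparison(prr):
    """Report which of the two Dim A baselines a given PRR exceeds."""
    return {
        "beats_no_change": prr > NO_CHANGE_PRR,
        "beats_mean_prediction": prr > MEAN_PREDICTION_PRR,
    }


for prr in (0.115, 0.402):
    print(prr, baseline_comparison(prr))
```

Both values exceed the no-change baseline and fall short of the mean-prediction baseline, so the outcome is the same for either.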

Files

File	Size	Description
`final.ckpt`	1.13 GB	Final model state at step 40,000 (the canonical artefact)
`best.ckpt`	1.13 GB	Model state at lowest validation loss (step 27,999, val_loss 0.263)
`training_config.yaml`	2.6 KB	Resolved Hydra config that arc-state v0.10.2 used at runtime
`data_split_leak_corrected.toml`	4.2 KB	The leak-corrected GEARS-split TOML — the binding artefact
`eval_aggregate.csv`	3.6 KB	Aggregate cell-eval metrics across all 107 test perts
`eval_per_perturbation.csv`	41 KB	Per-perturbation cell-eval metrics (107 rows × all metrics)
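A minimal sketch for summarising the per-perturbation file with the standard library. The column name pearson_delta is an assumption based on the cell-eval metric name in the Results table, so inspect the real header first. The toy rows are invented so that the example runs without downloading anything.

```python
import csv
import io


def mean_of_column(csv_file, column):
    """Return (mean, count) of a numeric column, skipping empty cells."""
    rows = csv.DictReader(csv_file)
    values = [float(r[column]) for r in rows if r.get(column) not in (None, "")]
    return sum(values) / len(values), len(values)


# Toy stand-in for eval_per_perturbation.csv; labels and values are made up.
toy = io.StringIO(
    "perturbation,pearson_delta\n"
    "GENE_A+ctrl,0.5\n"
    "GENE_B+ctrl,0.3\n"
)
mean, n = mean_of_column(toy, "pearson_delta")
print(n, round(mean, 3))  # 2 0.4

# With the real file (column name to be confirmed against its header):
# with open("eval_per_perturbation.csv", newline="") as f:
#     mean, n = mean_of_column(f, "pearson_delta")
```

On the released file the count should come out at 107, one row per test perturbation.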

Provenance + reproducibility

Source repo: https://github.com/VibeCodingScientist/VCBench (commit 9b60d52 or later)
Forensic test that proves the leak vector (no GPU needed, runs in <7 min): tests/integration/test_arc_state_leak_forensic.py
Pre-registration: configs/pre_registration.yaml
Manuscript: Hauser et al. 2026 (VCBench)
Correction of the manuscript reference value (0.115 → 0.402): see the CHANGELOG.md v1.0.0 entry "Arc State PRR 0.115 → 0.402 (gene-vocabulary alignment fix)" for the full bug story, the git show reproduction command, and the diff of the fix.

Citation

@misc{vcbench-arc-state-norman-gears-corrected,
  author       = {Hauser, Lukas and {VCBench contributors}},
  title        = {Arc State Norman GEARS-split leak-corrected checkpoint},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/VibeCodingScientist/arc-state-norman-gears-corrected}},
  note         = {Companion artefact to VCBench v1.0 (Hauser et al. 2026, github.com/VibeCodingScientist/VCBench)},
}

License

MIT — same as the upstream ArcInstitute/state codebase.

Notes for reviewers

This release exists because the published Arc State Norman PRR is not directly reproducible from the published norman_fewshot.toml without inheriting the train-test leak. Arc Institute was notified prior to preprint posting. We retain the deprecated configuration in the VCBench repo at configs/dim_a/arc_state_norman_fewshot_DEPRECATED.toml for auditability, behind a use_deprecated_fewshot=True opt-in flag in the vcbench.models.arc_state.ArcState wrapper.
