arc-state-norman-gears-corrected

Leak-corrected fine-tuned Arc State checkpoint on the Norman 2019 K562 perturbation dataset, produced by VCBench v1.0 to enable independent reproduction of Arc State's perturbation prediction performance under a clean train/test split.

What this is

This release supersedes Arc Institute's published Arc State Norman fine-tune for benchmark purposes. The published norman_fewshot.toml configuration in ArcInstitute/state contains a misconfigured cell-type filter ([zeroshot] "norman.double_perts" = "test") that matches zero cells in the Norman dataset, whose cells all carry cell_type == "A549". With zero cells held out as test, all 107 nominally held-out test perturbations remain in the training pool. The published Arc State Norman PRR of 0.963 is therefore the product of training-set memorisation, not genuine generalisation.
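The leak vector is easy to demonstrate in isolation. The sketch below uses a hypothetical pandas stand-in for the Norman cell metadata (column names and values are illustrative, not arc-state's actual loading code): a filter keyed on a value that never occurs in the cell_type column selects zero cells, so nothing is held out.

```python
import pandas as pd

# Hypothetical stand-in for the Norman AnnData .obs table: every cell
# carries the same cell_type label ("A549"), so a filter keyed on any
# other string selects nothing.
obs = pd.DataFrame({
    "cell_type": ["A549"] * 6,
    "perturbation": ["KLF1+ctrl", "BAK1+ctrl", "KLF1+BAK1",
                     "ctrl", "SET+KLF1", "ctrl"],
})

# The misconfigured filter holds out cells whose cell_type equals a key
# that never occurs in the column, so zero cells land in the test set.
test_mask = obs["cell_type"] == "norman.double_perts"
print(test_mask.sum())     # 0 -> nothing is held out
print((~test_mask).sum())  # 6 -> every cell stays in the training pool
```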

This checkpoint was fine-tuned using configs/dim_a/arc_state_norman_gears_split.toml from the VCBench repository, which explicitly enumerates 139 training perturbations and 107 held-out test perturbations matching the GEARS simulation split (seed=1) used by every other foundation model evaluated in VCBench.

Headline metric: PRR = 0.402 on the 107 held-out Norman test perturbations.

Usage

```python
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="VibeCodingScientist/arc-state-norman-gears-corrected",
    filename="final.ckpt",
)
state_dict = torch.load(ckpt_path, map_location="cpu", weights_only=False)
# Load into arc-state's StateTransitionPerturbationModel; see VCBench
# src/models/run_state_perturbation.py for the full predict pipeline.
```

End-to-end reproduction via the VCBench wrapper:

```python
from vcbench.models import ArcState

arc = ArcState()                # defaults to the leak-corrected config
arc.load_pretrained(ckpt_path)  # raises ArcStateLeakError if the config has overlap
result = arc.run_dim_a()        # full pipeline -> DimAResult
print(f"PRR: {result.mean_pearson_r_delta:.4f}")  # ~0.402
```

Training recipe

| Field | Value |
| --- | --- |
| Base model | `arc-state==0.10.2` (`state` model variant) |
| Dataset | Norman 2019 K562 (GSE133344, via the GEARS API) |
| Train perturbations | 139 (per `[fewshot."norman.A549"].train` in the config TOML) |
| Test perturbations | 107 (matches the GEARS simulation split, seed=1, used by scGPT and others) |
| Train/test overlap | 0 perturbations, 0 cells (verified by `vcbench.models.arc_state.ArcState._verify_no_train_test_overlap`) |
| Architecture | LLaMA bidirectional backbone, `num_hidden_layers=8`, `hidden_dim=768`, `cell_set_len=512`, `n_attention_heads=12` |
| Total params | 110 M (86 M trainable) |
| Optimizer | AdamW |
| Learning rate | 1×10⁻⁴ |
| Batch size | 8 |
| Max steps | 40,000 |
| Loss | energy distance (samples loss) |
| Random seed | 42 |
| Hardware | NVIDIA A40 (46 GB), CUDA 12.4 |
| Wall clock | ~5 h |
| Train loss | 2.94 → 0.027 (full convergence) |
| Val loss | oscillated 0.26–0.61, ended at 0.402 (overfit signature consistent with a genuinely held-out split on a small training set) |
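For readers unfamiliar with the samples loss named in the recipe, here is a minimal NumPy sketch of the sample-based energy distance (a generic estimator, not arc-state's implementation): it is zero for identical sample sets and positive when the two distributions differ.

```python
import numpy as np

def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Sample-based energy distance between two point clouds:
    2*E||X-Y|| - E||X-X'|| - E||Y-Y'||, estimated from pairwise
    Euclidean distances. A sketch, not arc-state's exact loss."""
    def mean_pdist(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(-1)).mean()
    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, size=(64, 8))  # "predicted" cell set
y = rng.normal(0.5, 1.0, size=(64, 8))  # "observed" cell set, shifted mean

print(energy_distance(x, x))      # 0.0 for identical sample sets
print(energy_distance(x, y) > 0)  # True when the distributions differ
```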

Results

Evaluated on the 107 GEARS test perturbations using both cell-eval==0.x (Arc Institute's official evaluator) and vcbench.dimensions.dim_a_perturbation.evaluate_dim_a (VCBench's reimplementation):

| Evaluator | Mean Pearson r on Δ-expression (PRR) | Direction score (top-20 DEG sign agreement) |
| --- | --- | --- |
| cell-eval `pearson_delta` | 0.4076 | n/a |
| `vcbench.evaluate_dim_a` | 0.4021 | 0.7514 |

The two evaluators agree to within 0.006. Per-perturbation results are in eval_per_perturbation.csv; aggregate metrics are in eval_aggregate.csv.
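The headline metric can be illustrated with a toy computation. The sketch below is a generic reimplementation of the pearson-delta idea (per-perturbation Pearson r between predicted and observed Δ-expression vectors, averaged over perturbations); the perturbation names, gene counts, and values are made up, and this is not the cell-eval or vcbench code.

```python
import numpy as np

def pearson_delta(pred_delta: np.ndarray, true_delta: np.ndarray) -> float:
    """Pearson r between predicted and observed Δ-expression for one
    perturbation (gene-wise vectors of mean shift vs control)."""
    return float(np.corrcoef(pred_delta, true_delta)[0, 1])

# Made-up per-perturbation Δ-expression over 5 genes.
true = {"KLF1+ctrl": np.array([1.0, -0.5, 0.2, 0.0, 0.8]),
        "BAK1+ctrl": np.array([-0.3, 0.7, 0.1, -0.2, 0.4])}
pred = {"KLF1+ctrl": np.array([0.9, -0.4, 0.1, 0.1, 0.7]),
        "BAK1+ctrl": np.array([0.2, 0.1, 0.0, 0.0, 0.1])}

# PRR-style aggregate: the mean of per-perturbation Pearson r values.
prr = np.mean([pearson_delta(pred[p], true[p]) for p in true])
print(f"PRR over 2 toy perturbations: {prr:.3f}")
```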

VC Level

Under the VCBench pre-registration, Arc State scores VC Level 1 on Norman: it exceeds the no-change baseline (PRR 0.000) on Dim A but does not exceed the mean-prediction baseline (PRR 0.579). The VC Level decision is unchanged whether one uses 0.115 or 0.402; both fall below the binding 0.579 threshold.
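The two baselines named above are mechanical to construct. The toy NumPy sketch below shows both (the training Δ profiles are made up): the no-change baseline predicts a zero Δ for every gene, while the mean-prediction baseline predicts the training-mean Δ profile for every held-out perturbation.

```python
import numpy as np

# Made-up training Δ-expression profiles (rows: perturbations, cols: genes).
train_deltas = np.array([[1.0, -0.5, 0.2],
                         [0.6, -0.1, 0.4],
                         [0.8, -0.3, 0.0]])

n_genes = train_deltas.shape[1]

# No-change baseline: predict Δ = 0 for every gene of every test perturbation.
no_change = np.zeros(n_genes)

# Mean-prediction baseline: predict the training-mean Δ profile everywhere.
mean_pred = train_deltas.mean(axis=0)

print(no_change.tolist())               # [0.0, 0.0, 0.0]
print(np.round(mean_pred, 3).tolist())  # [0.8, -0.3, 0.2]
```

A model only clears the mean-prediction bar when it captures perturbation-specific structure beyond the average training response, which is why it is the binding threshold here.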

Files

| File | Size | Description |
| --- | --- | --- |
| `final.ckpt` | 1.13 GB | Final model state at step 40,000 (the canonical artefact) |
| `best.ckpt` | 1.13 GB | Model state at the lowest validation loss (step 27,999, val_loss 0.263) |
| `training_config.yaml` | 2.6 KB | Resolved Hydra config that arc-state v0.10.2 used at runtime |
| `data_split_leak_corrected.toml` | 4.2 KB | The leak-corrected GEARS-split TOML (the binding artefact) |
| `eval_aggregate.csv` | 3.6 KB | Aggregate cell-eval metrics across all 107 test perturbations |
| `eval_per_perturbation.csv` | 41 KB | Per-perturbation cell-eval metrics (107 rows × all metrics) |

Provenance + reproducibility

  • Source repo: https://github.com/VibeCodingScientist/VCBench (commit 9b60d52 or later)
  • Forensic test that proves the leak vector (no GPU needed, runs in <7 min): tests/integration/test_arc_state_leak_forensic.py
  • Pre-registration: configs/pre_registration.yaml
  • Manuscript: Hauser et al. 2026 (VCBench)
  • Manuscript reference value 0.115 β†’ 0.402 correction: see CHANGELOG.md v1.0.0 entry "Arc State PRR 0.115 β†’ 0.402 (gene-vocabulary alignment fix)" for the full bug story + git show reproduction command + diff of the fix.

Citation

```bibtex
@misc{vcbench-arc-state-norman-gears-corrected,
  author       = {Hauser, Lukas and {VCBench contributors}},
  title        = {Arc State Norman GEARS-split leak-corrected checkpoint},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/VibeCodingScientist/arc-state-norman-gears-corrected}},
  note         = {Companion artefact to VCBench v1.0 (Hauser et al. 2026, github.com/VibeCodingScientist/VCBench)},
}
```

License

MIT, the same license as the upstream ArcInstitute/state codebase.

Notes for reviewers

This release exists because the published Arc State Norman PRR is not directly reproducible from the published norman_fewshot.toml without inheriting the train-test leak. Arc Institute was notified prior to preprint posting. We retain the deprecated configuration in the VCBench repo at configs/dim_a/arc_state_norman_fewshot_DEPRECATED.toml for auditability, behind a use_deprecated_fewshot=True opt-in flag in the vcbench.models.arc_state.ArcState wrapper.
