scGPT (no prior) — Replogle K562/Jurkat/HepG2 → RPE1

Produced as part of the sc-interp single-cell model comparison repo.

Provenance

Base model

scGPT whole-human pretrained (Cui et al. 2024), used as-is with the model's original learnable gene-token embeddings (no external prior). 12 transformer blocks, 8 heads, d_model=512, max_seq_len=1536. This run is the baseline counterpart to matthewshu/scgpt-replogle-esm-ft, which adds a frozen ESM2-15B per-gene prior; both runs use identical training data, splits, optimizer, and budget except for the prior.
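The stated architecture can be summarized as a plain config; the key names below are illustrative and do not necessarily match the schema of the checkpoint's args.json:

```python
# Architecture hyperparameters from the description above; key names are
# illustrative, not necessarily scGPT's actual args.json schema.
scgpt_config = {
    "n_layers": 12,       # transformer blocks
    "n_heads": 8,         # attention heads
    "d_model": 512,       # embedding width
    "max_seq_len": 1536,  # gene tokens per cell
}
head_dim = scgpt_config["d_model"] // scgpt_config["n_heads"]  # 64-dim heads
```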

Training

Source dataset: arcinstitute/State-Replogle-Filtered — CRISPRi essential-genome screens from Replogle et al. 2022 and Nadig et al. 2025. Training: 362,327 cells from K562 + Jurkat + HepG2 with 1,383 perturbations and 8,569 val pairs (held-out K562 perturbations). Evaluation: 109,207 RPE1 cells perturbed by the 1,047 genes overlapping the K562 training perturbation set, plus 10,691 real RPE1 controls.
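The cell-line split above amounts to a simple filter over per-cell metadata; a minimal sketch, where the column names (`cell_line`, `gene`) and the `non-targeting` control label are assumptions rather than the dataset's documented schema:

```python
import pandas as pd

# Toy stand-in for the per-cell metadata table; the real
# State-Replogle-Filtered obs schema may use different column names.
obs = pd.DataFrame({
    "cell_line": ["K562", "Jurkat", "HepG2", "RPE1", "RPE1", "RPE1"],
    "gene":      ["TP53", "MYC", "TP53", "TP53", "GATA1", "non-targeting"],
})

# Train on three cell lines; evaluate on RPE1, restricted to
# perturbations that overlap the training set, plus real controls.
train = obs[obs["cell_line"].isin({"K562", "Jurkat", "HepG2"})]
train_perts = set(train["gene"])

rpe1 = obs[obs["cell_line"] == "RPE1"]
test = rpe1[rpe1["gene"].isin(train_perts | {"non-targeting"})]
```

In this toy example GATA1 is dropped because it was never perturbed in the training lines, mirroring the restriction to the 1,047 overlapping genes.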

Fine-tuned the scGPT whole-human pretrained checkpoint on this split with no additional gene prior. Used --stop-metric pearson_delta (per-perturbation Pearson on Δ-expression) for early stopping and best-checkpoint selection — this metric directly measures perturbation-effect prediction quality, whereas full-expression pearson is dominated by the unchanged-genes baseline. Training ran the full 30-epoch budget without triggering early stopping; the best checkpoint is from epoch 27.
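The stop metric can be sketched as follows: for each perturbation, correlate the predicted and observed expression changes relative to control across genes (a minimal NumPy sketch, not the cell-eval implementation):

```python
import numpy as np

def pearson_delta(pred_mean, true_mean, control_mean):
    """Pearson correlation across genes between predicted and observed
    delta-expression (perturbed mean minus control mean) for one
    perturbation. Minimal sketch, not the cell-eval implementation."""
    pred_delta = pred_mean - control_mean
    true_delta = true_mean - control_mean
    return float(np.corrcoef(pred_delta, true_delta)[0, 1])
```

A model that simply predicts the unperturbed control profile for every gene can score near the ceiling on full-expression pearson while its pearson_delta is undefined or near zero, which is why the delta metric drives checkpoint selection here.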

Budget and stopping

Hardware: NVIDIA H100 PCIe (80 GB)
Train batch size: 192
Eval batch size: 192
Max epochs: 30
Early-stop patience: 10
Stop metric: pearson_delta
Epochs trained: 30
Best epoch: 27
Best val pearson_delta: 0.1993
Training cells seen: 5,400,630
Wall clock: 393.6 min (~6.56 h)
Stop reason: max_epochs
AMP: fp16
Optimizer: Adam, lr=1e-4, StepLR γ=0.9
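The Adam + StepLR(γ=0.9) setting implies a geometric learning-rate decay over the run; a minimal sketch, assuming one scheduler step per epoch (the step interval is not stated above):

```python
# Learning-rate trajectory under StepLR with gamma=0.9, assuming one
# scheduler step per epoch (an assumption; the actual step interval is
# not stated in the table above).
base_lr, gamma = 1e-4, 0.9
lrs = [base_lr * gamma ** epoch for epoch in range(30)]
print(f"epoch 0: lr={lrs[0]:.2e}  epoch 27 (best): lr={lrs[27]:.2e}")
```

Under this assumption the best checkpoint (epoch 27) trains at roughly 6% of the initial learning rate.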

Test set metrics (cell-eval)

metric                     mean     median   max
pearson_delta              0.1825   0.1424   0.6170
pr_auc                     0.5206   0.5159   0.9191
roc_auc                    0.3626   0.3603   0.4858
overlap_at_N               0.5081   0.4978   0.9252
de_sig_genes_recall        0.5313   0.5135   0.9527
de_direction_match         0.5252   0.5336   0.7896
discrimination_score_l1    0.5091   0.5091   1.0000
mae_delta                  0.1763   0.1737   0.2336
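The mean/median/max columns above are summary statistics over the 1,047 per-perturbation rows in eval/results.csv; a toy sketch of that aggregation (the values below are illustrative, not the real per-perturbation scores):

```python
import pandas as pd

# Toy per-perturbation results; the real eval/results.csv has 1,047 rows.
results = pd.DataFrame({
    "pearson_delta": [0.10, 0.14, 0.62],
    "mae_delta":     [0.18, 0.17, 0.23],
})

# One row per metric, one column per summary statistic, matching the
# shape of the table above (and of eval/agg_results.csv).
agg = results.agg(["mean", "median", "max"]).T
```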

Compare with matthewshu/scgpt-replogle-esm-ft (same data, same budget, with ESM2-15B prior injected at the gene-embedding layer): the +ESM run reaches pearson_delta = 0.508 vs 0.183 here on test, and improves de_direction_match, de_sig_genes_recall, and overlap_at_N by 5–11 absolute percentage points. The two runs share commit, runner code, dataset manifest, split, optimizer, and batch size.

Known limitations

  • Cell line distribution shift: trained on K562/Jurkat/HepG2, evaluated on RPE1.
  • Test set restricted to the 1,047 perturbed genes overlapping K562 training perturbations — does not test out-of-distribution perturbed genes.
  • roc_auc < 0.5 on test (also seen in the +ESM counterpart). The same eval pipeline is used in both runs, so this is likely a cell-eval/data-convention quirk rather than a model defect.

Files

  • best_model.pt — fine-tuned scGPT weights (PyTorch state_dict, best val pearson_delta)
  • args.json — scGPT pretrained args (whole-human checkpoint config)
  • vocab.json — scGPT gene-token vocabulary
  • training_stats.json — wall clock, wandb run URL, epoch count, best metrics, stop reason
  • eval/agg_results.csv — cell-eval describe() table over 1,047 RPE1 test perturbations
  • eval/results.csv — per-perturbation cell-eval metrics (1,047 rows × 28 metric columns)
  • predictions/scgpt_replogle_test.h5ad — self-contained predictions h5ad: predicted expression in .X, ground truth in .layers['truth'], includes 10,691 real RPE1 control cells. Layout produced by scripts/run_scgpt.py:save_predictions.

Dataset used to train matthewshu/scgpt-replogle-base-ft