Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PertEval-scFM: Benchmarking Single-Cell Foundation Models for Perturbation Effect Prediction

Authors: Aaron Wenteler, Martina Occhetta, Nikhil Branson, Victor Curean, Magdalena Huebner, William Dee, William Connell, Siu Pui Chung, Alex Hawkins-Hooker, Yasha Ektefaie, César Miguel Valdez Córdova, Amaya Gallagher-Syed

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We present PertEval-scFM, a standardized framework designed to evaluate models for perturbation effect prediction. We apply PertEval-scFM to benchmark zero-shot single-cell foundation model (scFM) embeddings against baseline models to assess whether these contextualized representations enhance perturbation effect prediction. Our results show that scFM embeddings offer limited improvement over simple baseline models in the zero-shot setting, particularly under distribution shift."
Researcher Affiliation: Academia. "1Queen Mary University of London, 2University of Oxford, 3University of Medicine and Pharmacy of Cluj-Napoca, 4University of California, San Francisco, 5University College London, 6Harvard University, 7Mila, 8McGill University. Correspondence to: Aaron Wenteler <EMAIL>, Martina Occhetta <EMAIL>."
Pseudocode: Yes. "Algorithm 1: Calculate AUSPC and its associated error"
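The referenced pseudocode computes the area under the SPECTRA performance curve (AUSPC) together with an uncertainty estimate. The sketch below is one plausible NumPy rendering of that idea, not a transcription of the paper's Algorithm 1: the trapezoidal rule and the independent-error propagation are assumptions, as are the function and argument names.

```python
import numpy as np

def auspc_with_error(sparsity, score, score_err):
    """Area under a performance-vs-sparsification curve (trapezoidal rule),
    with the error propagated assuming independent per-split uncertainties.

    sparsity  : 1-D sorted array of SPECTRA sparsification probabilities
    score     : model performance measured at each sparsification level
    score_err : standard error of each performance estimate
    """
    s = np.asarray(sparsity, dtype=float)
    y = np.asarray(score, dtype=float)
    e = np.asarray(score_err, dtype=float)

    # Trapezoidal rule: sum of interval widths times mean endpoint heights.
    area = np.sum((s[1:] - s[:-1]) * (y[1:] + y[:-1]) / 2.0)

    # Each point enters the trapezoidal sum with a fixed linear weight, so
    # the variance of the area is the weighted sum of individual variances.
    w = np.empty_like(s)
    w[0] = (s[1] - s[0]) / 2.0
    w[-1] = (s[-1] - s[-2]) / 2.0
    w[1:-1] = (s[2:] - s[:-2]) / 2.0
    err = np.sqrt(np.sum((w * e) ** 2))
    return area, err
```

With zero per-split error, the returned uncertainty collapses to zero and the area reduces to a plain trapezoidal integral.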
Open Source Code: Yes. "Source code and documentation can be found at: https://github.com/aaronwtr/PertEval."
Open Datasets: Yes. "Norman. PertEval-scFM is applied to the 105 single-gene and 91 double-gene perturbation datasets derived from a Perturb-seq screen in K562 cells from Norman et al. (2019). Replogle. Additionally, we apply our framework to the two single-gene perturbation datasets from Replogle et al. (2022), which profile transcriptomic responses to CRISPRi-mediated genetic perturbations in both K562 (2,058 perturbations) and RPE1 (2,394 perturbations) cells."
Dataset Splits: Yes. "To assess the robustness of the MLP probes when using either gene expression data or scFM embeddings, we implement SPECTRA (Ektefaie et al., 2024), a graph-based method that partitions data into increasingly challenging train-test splits while controlling for cross-split overlap between the train and test data. After sparsification, the train and test sets are sampled from distinct subgraphs."
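The core mechanism described here — sparsify a sample-similarity graph, then assign whole connected components to either train or test — can be illustrated with a toy sketch. This is a simplified, hypothetical illustration of the idea, not the SPECTRA implementation; the function name, the random edge-dropping scheme, and the greedy component assignment are all assumptions.

```python
import random

def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def component_split(n, edges, sparsify_p, test_frac=0.2, seed=0):
    """Toy component-based split: drop each similarity edge with probability
    sparsify_p, find connected components of the remaining graph, and assign
    whole components to test until test_frac is reached, the rest to train.
    Because components are never split, no surviving similarity edge crosses
    the train-test boundary."""
    rng = random.Random(seed)
    parent = list(range(n))
    for a, b in edges:
        if rng.random() < sparsify_p:
            continue  # edge removed during sparsification
        ra, rb = _find(parent, a), _find(parent, b)
        if ra != rb:
            parent[ra] = rb
    components = {}
    for i in range(n):
        components.setdefault(_find(parent, i), []).append(i)
    groups = list(components.values())
    rng.shuffle(groups)
    train, test = [], []
    for g in groups:
        (test if len(test) < test_frac * n else train).extend(g)
    return train, test
```

Raising `sparsify_p` breaks the graph into smaller components, which loosely mirrors how SPECTRA generates progressively harder splits by reducing train-test similarity.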
Hardware Specification: Yes. "A single MLP probe requires 1 NVIDIA A100-PCIE-40GB GPU (using 12 cores) for training. Runtime depends on the hidden dimension of the probe, ranging from roughly 5 minutes for the smallest probes to 30 minutes for the largest."
Software Dependencies: No. The paper mentions software such as scPerturb, pertpy, Scanpy, the scGPT Python package, and the Adam optimizer, but does not provide specific version numbers for any of these components, which are required for a reproducible description of ancillary software.
Experiment Setup: Yes. "To train the MLP probes, we used root mean square error (RMSE) as the objective function and the Adam optimizer (Kingma & Ba, 2017). [...] A single hidden layer was used throughout the experiments to maintain model simplicity. The learning rate, however, was found to significantly influence performance and was thus adjusted for the models trained using the scFM embeddings. Following the manifold hypothesis, we set the hidden dimension to half of the input dimension (Bengio et al., 2013)."
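The probe architecture described above — a single hidden layer whose width is half the input dimension, trained against an RMSE objective — can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the ReLU activation and the initialization scheme are assumptions, and the Adam training loop is omitted.

```python
import numpy as np

class MLPProbe:
    """Single-hidden-layer probe with hidden_dim = input_dim // 2, per the
    described setup. Forward pass only; activation and init are assumed."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = np.random.default_rng(seed)
        hidden_dim = input_dim // 2  # half the input dim (manifold heuristic)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(input_dim), (input_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, output_dim))
        self.b2 = np.zeros(output_dim)

    def forward(self, x):
        hidden = np.maximum(0.0, np.asarray(x) @ self.W1 + self.b1)  # ReLU (assumed)
        return hidden @ self.W2 + self.b2

def rmse(pred, target):
    """Root mean square error, the training objective used for the probes."""
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
```

For an scGPT-style embedding of dimension 512, this rule would give a 256-unit hidden layer; per the report, the learning rate would still need per-model tuning.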