Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PertEval-scFM: Benchmarking Single-Cell Foundation Models for Perturbation Effect Prediction

Authors: Aaron Wenteler, Martina Occhetta, Nikhil Branson, Victor Curean, Magdalena Huebner, William Dee, William Connell, Siu Pui Chung, Alex Hawkins-Hooker, Yasha Ektefaie, César Miguel Valdez Córdova, Amaya Gallagher-Syed

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We present PertEval-scFM, a standardized framework designed to evaluate models for perturbation effect prediction. We apply PertEval-scFM to benchmark zero-shot single-cell foundation model (scFM) embeddings against baseline models to assess whether these contextualized representations enhance perturbation effect prediction. Our results show that scFM embeddings offer limited improvement over simple baseline models in the zero-shot setting, particularly under distribution shift."
Researcher Affiliation: Academia. "1Queen Mary University of London, 2University of Oxford, 3University of Medicine and Pharmacy of Cluj-Napoca, 4University of California, San Francisco, 5University College London, 6Harvard University, 7Mila, 8McGill University. Correspondence to: Aaron Wenteler <EMAIL>, Martina Occhetta <EMAIL>."
Pseudocode: Yes. "Algorithm 1: Calculate AUSPC and its associated error"
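The referenced pseudocode computes the area under the SPECTRA performance curve (AUSPC) together with an uncertainty estimate. The sketch below is one plausible NumPy rendering of that idea, not a transcription of the paper's Algorithm 1: the trapezoidal rule and the independent-error propagation are assumptions, as are the function and argument names.

```python
import numpy as np

def auspc_with_error(sparsity, score, score_err):
    """Area under a performance-vs-sparsification curve (trapezoidal rule),
    with the error propagated assuming independent per-split uncertainties.

    sparsity  : 1-D sorted array of SPECTRA sparsification probabilities
    score     : model performance measured at each sparsification level
    score_err : standard error of each performance estimate
    """
    s = np.asarray(sparsity, dtype=float)
    y = np.asarray(score, dtype=float)
    e = np.asarray(score_err, dtype=float)

    # Trapezoidal rule: sum of interval widths times mean endpoint heights.
    area = np.sum((s[1:] - s[:-1]) * (y[1:] + y[:-1]) / 2.0)

    # Each point enters the trapezoidal sum with a fixed linear weight, so
    # the variance of the area is the weighted sum of individual variances.
    w = np.empty_like(s)
    w[0] = (s[1] - s[0]) / 2.0
    w[-1] = (s[-1] - s[-2]) / 2.0
    w[1:-1] = (s[2:] - s[:-2]) / 2.0
    err = np.sqrt(np.sum((w * e) ** 2))
    return area, err
```

With zero per-split error, the returned uncertainty collapses to zero and the area reduces to a plain trapezoidal integral.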
Open Source Code: Yes. "Source code and documentation can be found at: https://github.com/aaronwtr/PertEval."
Open Datasets: Yes. "Norman. PertEval-scFM is applied to the 105 single-gene and 91 double-gene perturbation datasets derived from a Perturb-seq screen in K562 cells from Norman et al. (2019). Replogle. Additionally, we apply our framework to the two single-gene perturbation datasets from Replogle et al. (2022), which profile transcriptomic responses to CRISPRi-mediated genetic perturbations in both K562 (2,058 perturbations) and RPE1 (2,394 perturbations) cells."
Dataset Splits: Yes. "To assess the robustness of the MLP probes when using either gene expression data or scFM embeddings, we implement SPECTRA (Ektefaie et al., 2024), a graph-based method that partitions data into increasingly challenging train-test splits while controlling for cross-split overlap between the train and test data. After sparsification, the train and test sets are sampled from distinct subgraphs."
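The core mechanism described here — sparsify a sample-similarity graph, then assign whole connected components to either train or test — can be illustrated with a toy sketch. This is a simplified, hypothetical illustration of the idea, not the SPECTRA implementation; the function name, the random edge-dropping scheme, and the greedy component assignment are all assumptions.

```python
import random

def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def component_split(n, edges, sparsify_p, test_frac=0.2, seed=0):
    """Toy component-based split: drop each similarity edge with probability
    sparsify_p, find connected components of the remaining graph, and assign
    whole components to test until test_frac is reached, the rest to train.
    Because components are never split, no surviving similarity edge crosses
    the train-test boundary."""
    rng = random.Random(seed)
    parent = list(range(n))
    for a, b in edges:
        if rng.random() < sparsify_p:
            continue  # edge removed during sparsification
        ra, rb = _find(parent, a), _find(parent, b)
        if ra != rb:
            parent[ra] = rb
    components = {}
    for i in range(n):
        components.setdefault(_find(parent, i), []).append(i)
    groups = list(components.values())
    rng.shuffle(groups)
    train, test = [], []
    for g in groups:
        (test if len(test) < test_frac * n else train).extend(g)
    return train, test
```

Raising `sparsify_p` breaks the graph into smaller components, which loosely mirrors how SPECTRA generates progressively harder splits by reducing train-test similarity.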
Hardware Specification: Yes. "A single MLP probe requires 1 NVIDIA A100-PCIE-40GB GPU (using 12 cores) for training. Runtime depends on the hidden dimension of the probe, ranging from roughly 5 minutes for the smallest probes to 30 minutes for the largest."
Software Dependencies: No. The paper mentions software such as scPerturb, pertpy, Scanpy, the scGPT Python package, and the Adam optimizer, but does not provide specific version numbers for any of these components, which are required for a reproducible description of ancillary software.
Experiment Setup: Yes. "To train the MLP probes, we used root mean square error (RMSE) as the objective function and the Adam optimizer (Kingma & Ba, 2017). [...] A single hidden layer was used throughout the experiments to maintain model simplicity. The learning rate, however, was found to significantly influence performance and was thus adjusted for the models trained using the scFM embeddings. Following the manifold hypothesis, we set the hidden dimension to half of the input dimension (Bengio et al., 2013)."
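The probe architecture described above — a single hidden layer whose width is half the input dimension, trained against an RMSE objective — can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the ReLU activation and the initialization scheme are assumptions, and the Adam training loop is omitted.

```python
import numpy as np

class MLPProbe:
    """Single-hidden-layer probe with hidden_dim = input_dim // 2, per the
    described setup. Forward pass only; activation and init are assumed."""

    def __init__(self, input_dim, output_dim, seed=0):
        rng = np.random.default_rng(seed)
        hidden_dim = input_dim // 2  # half the input dim (manifold heuristic)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(input_dim), (input_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, output_dim))
        self.b2 = np.zeros(output_dim)

    def forward(self, x):
        hidden = np.maximum(0.0, np.asarray(x) @ self.W1 + self.b1)  # ReLU (assumed)
        return hidden @ self.W2 + self.b2

def rmse(pred, target):
    """Root mean square error, the training objective used for the probes."""
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))
```

For an scGPT-style embedding of dimension 512, this rule would give a 256-unit hidden layer; per the report, the learning rate would still need per-model tuning.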