Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery

Authors: Yiming Gao, Zhen Wang, Jefferson Chen, Mark Antkowiak, Mengzhou Hu, JungHo Kong, Dexter Pratt, Jieyuan Liu, Enze Ma, Zhiting Hu, Eric P Xing

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments with o1 show that iterative omics-native reasoning lifts average accuracy by 11% for cell-type annotation and Gemini-2.5-Pro cuts trajectory graph-edit distance by 30% versus one-shot prompting, while generating transparent reasoning traces that explain marker gene ambiguity and regulatory logic.
Researcher Affiliation	Academia	Yiming Gao (1, 4), Zhen Wang (1, 2), Jefferson Chen (1), Mark Antkowiak (1), Mengzhou Hu (1), Jung Ho Kong (1), Dexter Pratt (1), Jieyuan Liu (1), Enze Ma (1), Zhiting Hu (1), Eric P. Xing (2, 3). Affiliations: (1) UC San Diego, (2) MBZUAI, (3) CMU, (4) Texas A&M.
Pseudocode	No	The paper describes the SCPILOT framework using a diagram in Figure 2 and provides prompt templates in Appendix E, but it does not contain structured pseudocode or algorithm blocks in the traditional sense.
Open Source Code	Yes	Code, data, and package are available at https://github.com/maitrix-org/scPilot. The complete source code for dataset preprocessing, automatic graders, evaluation metrics, and benchmark drivers is released under the MIT license and available at our SCPILOT github https://github.com/maitrix-org/scPilot.
Open Datasets	Yes	We release SCBENCH, a suite of 9 expertly curated datasets... PBMC3k [1], Liver [43], Retina [50]; Pancreas [5], Liver [45], Neocortex [53]; GRNdb stomach, liver, kidney [18] + TRRUST validation [24]
Dataset Splits	No	The paper mentions that for GRN prediction, they "randomly sample another gene that does not result in a verified or SCENIC-generated TF-gene edge. So we have half questions as positive... and half as negative...". While this describes the composition of evaluation data, it does not explicitly provide training/test/validation splits for models, especially for the LLMs which are used off-the-shelf. For GNN baselines, it only states they were "trained on the GRNdb dataset" without detailing splits.
Hardware Specification	Yes	Computational efficiency posed additional challenges: inference on the PBMC3k dataset required 135.7 seconds per evaluation using four NVIDIA A100 (80 GB) GPUs, compared to only 8.8 seconds for GPT-4o, a more than 15-fold difference.
Software Dependencies	No	The paper mentions several bioinformatics tools such as Scanpy, Seurat, Monocle 3, and pySCENIC, and baseline tools such as CellTypist and CellMarker 2.0, with versions. However, it does not provide specific version numbers for the core software dependencies (e.g., programming language, libraries, frameworks) used to implement SCPILOT itself, beyond the versions of external tools and baselines.
Experiment Setup	Yes	SCPILOT employs a fixed maximum of three reasoning iterations... This task is implemented as a single-pass reasoning process... This task is structured as a single-pass reasoning exercise... We seed each task with high-level prompts distilled from domain best practices, without task-specific fine-tuning of LLM parameters; performance improvements arise exclusively through enhanced prompting strategies and richer evidence.
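
The balanced evaluation set quoted in the Dataset Splits row (each known TF-gene edge paired with a randomly sampled gene that forms no known edge with the same TF, giving half positive and half negative questions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name build_balanced_questions, the toy labels such as TF_A and GENE_1, and the seed are all hypothetical, and the edge list is assumed to already contain every verified or SCENIC-generated edge.

```python
import random


def build_balanced_questions(known_edges, genes, seed=0):
    """Pair every known TF-gene edge with one sampled non-edge.

    known_edges: list of (tf, gene) tuples assumed to cover all verified
                 or SCENIC-generated edges (hypothetical input format).
    genes:       list of candidate gene names to draw negatives from.
    Returns a list of (tf, gene, label) tuples, half True and half False.
    """
    rng = random.Random(seed)  # seeded so the sample is repeatable
    edge_set = set(known_edges)
    questions = []
    for tf, target in known_edges:
        # Positive question: the edge is known.
        questions.append((tf, target, True))
        # Negative question: a gene with no known edge from this TF.
        candidates = [g for g in genes if g != tf and (tf, g) not in edge_set]
        if not candidates:
            raise ValueError(f"no negative candidate available for {tf}")
        questions.append((tf, rng.choice(candidates), False))
    return questions


# Toy usage with placeholder names (not real regulatory data).
toy_edges = [("TF_A", "GENE_1"), ("TF_A", "GENE_2"), ("TF_B", "GENE_3")]
toy_genes = ["TF_A", "TF_B", "GENE_1", "GENE_2", "GENE_3", "GENE_4"]
toy_questions = build_balanced_questions(toy_edges, toy_genes, seed=1)
```

Because every question in such a set is drawn for evaluation only, a construction like this describes the composition of the test questions rather than a train/validation/test split, which is consistent with the "No" result recorded for Dataset Splits.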