Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

How Benchmark Prediction from Fewer Data Misses the Mark

Authors: Guanhua Zhang, Florian E. Dorner, Moritz Hardt

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models.
Researcher Affiliation	Academia	1 Max Planck Institute for Intelligent Systems, Tübingen; 2 Tübingen AI Center; 3 ETH Zurich
Pseudocode	Yes	Algorithm 1 PCA Impute Process
Open Source Code	Yes	Code is available at https://github.com/socialfoundations/benchmark-prediction.
Open Datasets	Yes	We select a diverse range of benchmarks from the following sources. HELM-Lite benchmarks [35]: OpenbookQA [39], GSM8K [9], LegalBench [22], Math [26], MedQA [28], and MMLU [25]. GLUE benchmarks [61]: MRPC [13], RTE [11, 18, 4], SST-2 [55], MNLI [64], and QNLI [46]. Open LLM benchmarks [16]: IFEval [67], Math [26], MMLU-Pro [62], Arc-Challenge [8], BBH [58], GPQA [47] and MUSR [56]. ImageNet [51]:
Dataset Splits	Yes	For each benchmark, we randomly select 75% of models as source models F(s), for which performance scores across all data points S(F(s), D) are available. The remaining 25% of models serve as target models F(t) for assessment of benchmark prediction methods. Each target model is evaluated on only n = 50 data points unless specified otherwise. ... The lowest-performing 50% of these models are designated as source models, while the top 30% serve as target models for evaluating benchmark prediction methods. ... We experiment with n ∈ {10, 20, 50, 100, 200}, and the summarized results are shown in Table 1.
Hardware Specification	No	While some of the benchmark prediction methods could potentially benefit from the use of GPUs, we opted to run all methods without them, as they are sufficiently fast on standard hardware. Table 2 presents the training and inference times for each method on ImageNet.
Software Dependencies	No	The paper does not provide specific software names with version numbers for libraries or frameworks used in the experiments.
Experiment Setup	Yes	For each benchmark, we randomly select 75% of models as source models F(s), for which performance scores across all data points S(F(s), D) are available. The remaining 25% of models serve as target models F(t) for assessment of benchmark prediction methods. Each target model is evaluated on only n = 50 data points unless specified otherwise. ... Each experiment is repeated over 100 random trials, and we report the average estimation gap across all target models in these trials to ensure robustness. ... We train a Ridge regression model g for every target model f, which predicts the point-wise performance s(f, z) based on s(F(s), z). ... We select k among {2, 5, 10, 20} through cross-validation.
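
The setup quoted above (a random 75%/25% source/target split, n = 50 observed data points per target model, and one Ridge regression per target model) can be illustrated with a minimal sketch. This is not the authors' implementation; their code is in the repository linked above. The function names, the synthetic score matrix, the regularization strength `alpha = 1.0`, the shared random sample of data points, and the way observed and predicted scores are combined into a final estimate are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    alpha is an assumed regularization strength, not a value from the paper.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def make_synthetic_scores(num_models, num_points, rng=None):
    """Hypothetical 0/1 score matrix: rows are models, columns are data points."""
    rng = np.random.default_rng(rng)
    ability = rng.normal(size=(num_models, 1))
    difficulty = rng.normal(size=(1, num_points))
    prob = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return (rng.random((num_models, num_points)) < prob).astype(float)


def estimation_gap(scores, n=50, source_frac=0.75, alpha=1.0, rng=None):
    """Average |estimated - true| mean score over target models for one trial."""
    rng = np.random.default_rng(rng)
    num_models, num_points = scores.shape

    # Random split: 75% source models (fully observed), 25% target models.
    perm = rng.permutation(num_models)
    n_src = int(source_frac * num_models)
    src, tgt = perm[:n_src], perm[n_src:]

    # Assumption: one random sample of n data points shared by all targets.
    seen = rng.choice(num_points, size=n, replace=False)
    unseen = np.setdiff1d(np.arange(num_points), seen)

    # Features for a data point z are the source models' scores on z.
    X_seen = scores[src][:, seen].T      # shape (n, n_src)
    X_unseen = scores[src][:, unseen].T  # shape (num_points - n, n_src)

    gaps = []
    for f in tgt:
        w, b = ridge_fit(X_seen, scores[f, seen], alpha)
        predicted = X_unseen @ w + b
        # Assumption: combine observed scores with predictions for the rest.
        estimate = (scores[f, seen].sum() + predicted.sum()) / num_points
        gaps.append(abs(estimate - scores[f].mean()))
    return float(np.mean(gaps))


if __name__ == "__main__":
    scores = make_synthetic_scores(num_models=80, num_points=1000, rng=0)
    # The quoted setup averages over 100 random trials.
    trials = [estimation_gap(scores, n=50, rng=seed) for seed in range(100)]
    print("average estimation gap:", np.mean(trials))
```

Because the source models in this sketch are drawn at random, the target models resemble them; the quoted alternative split (lowest-performing 50% as source, top 30% as target) could be tried by replacing the random permutation with a sort on each model's mean score.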