Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Predicting Empirical AI Research Outcomes with Language Models

Authors: Jiaxin Wen, Chenglei Si, Yueh-Han Chen, He He, Shi Feng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We build the first benchmark for this task and compare LMs with human experts. Concretely, given two research ideas (e.g., two jailbreaking methods), we aim to predict which will perform better on a set of benchmarks. We scrape ideas and experimental results from conference papers, yielding 1,444 human-verified idea pairs published after our base model's cut-off date for testing, and 6,000 pairs for training. ... On the full test set, our system achieves 77% accuracy
Researcher Affiliation | Academia | 1 UC Berkeley, 2 Stanford, 3 New York University, 4 George Washington University
Pseudocode | No | The paper describes methods and pipelines in structured paragraphs (e.g., Step 1, Step 2 in Sections 2.1 and 3.1) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will open source code and publicly available data we used in our experiments.
Open Datasets | Yes | We construct a benchmark to facilitate the study of this task by scraping both ideas and results from existing conference papers... We will open source code and publicly available data we used in our experiments.
Dataset Splits | Yes | We then split our collected data into train and test sets using our main base model GPT-4.1's cut-off date, July 1st, 2024. ... Data statistics are shown in Table 2 (split sizes: Train 6,000; Test 1,444).
Hardware Specification | No | The paper discusses the use of language models (GPT-4.1, o3, Claude 3.5 Sonnet) and mentions 'computational resources' generally. However, it does not provide specific hardware details such as GPU/CPU models, memory specifications, or details of the computing environment used for training or inference.
Software Dependencies | No | The paper mentions several tools, such as the language models GPT-4.1, o3, and Claude 3.5 Sonnet and the search engine exa.ai. However, it does not list specific programming languages, libraries, or frameworks with version numbers (e.g., Python 3.x, PyTorch 1.x) that were used to implement the system or fine-tune models.
Experiment Setup | No | The paper describes the fine-tuning process for GPT-4.1 on 6,000 historical idea pairs and details the evaluation methodology (e.g., swapping input orders to mitigate bias). However, it does not provide specific hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings used during the fine-tuning process.
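
The Dataset Splits and Experiment Setup rows mention two procedures that can be illustrated in code: splitting idea pairs at the base model's cut-off date, and swapping input orders during evaluation. The following is a minimal sketch only, not the authors' implementation. The `IdeaPair` fields, the `"first"`/`"second"` predictor interface, the choice to put pairs dated exactly on the cut-off into the train set, and the choice to average accuracy over both orderings are all assumptions made for illustration; the source states only the cut-off date and that input orders are swapped.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Tuple

# Cut-off date quoted in the Dataset Splits row.
CUTOFF = date(2024, 7, 1)


@dataclass(frozen=True)
class IdeaPair:
    """Hypothetical record for one idea pair (field names are assumptions)."""
    idea_a: str
    idea_b: str
    winner: str      # "a" or "b": which idea performed better empirically
    published: date  # publication date of the source paper


# Hypothetical predictor interface: given two ideas in presentation order,
# return "first" or "second" for the one expected to perform better.
Predictor = Callable[[str, str], str]


def split_by_cutoff(
    pairs: List[IdeaPair], cutoff: date = CUTOFF
) -> Tuple[List[IdeaPair], List[IdeaPair]]:
    """Pairs published after the cut-off go to test, the rest to train.

    Assumption: a pair dated exactly on the cut-off is treated as train.
    """
    train = [p for p in pairs if p.published <= cutoff]
    test = [p for p in pairs if p.published > cutoff]
    return train, test


def order_swapped_accuracy(pairs: List[IdeaPair], predict: Predictor) -> float:
    """Score each pair in both presentation orders and average the results.

    Assumption: accuracy is averaged over the two orderings, so a predictor
    that always favors one position scores exactly 0.5.
    """
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = 0
    for pair in pairs:
        a_wins = pair.winner == "a"
        # Original order: idea_a is shown first.
        if predict(pair.idea_a, pair.idea_b) == ("first" if a_wins else "second"):
            correct += 1
        # Swapped order: idea_a is shown second.
        if predict(pair.idea_b, pair.idea_a) == ("second" if a_wins else "first"):
            correct += 1
    return correct / (2 * len(pairs))
```

Under these assumptions, a predictor that always answers `"first"` is correct in exactly one of the two orderings for every pair, so position bias alone cannot push the score above chance.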