Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise

Authors: Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate AUTODISCOVERY in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AUTODISCOVERY substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM.
Researcher Affiliation	Collaboration	α University of Massachusetts Amherst; β Allen Institute for AI; γ Capital One
Pseudocode	Yes	Algorithm 1 MCTS with Progressive Widening
Open Source Code	Yes	https://github.com/allenai/autodiscovery
Open Datasets	Yes	We utilize a total of 21 datasets (D) from the following benchmark sources spanning areas such as biology, economics, finance, and behavioral science. Discovery Bench [Majumder et al., 2025]... BLADE [Gu et al., 2024]... SEA-AD [Hawrylycz et al., 2024]
Dataset Splits	No	The paper refers to using datasets for data-driven discovery tasks but does not specify training/test/validation splits within the main text or supplementary details, nor does it refer to predefined splits for these specific experiments. For example, it states, 'The input for the task is a dataset D, its associated metadata, and a budget (which we set to 500) specifying the total number of hypotheses the agent is allowed to explore and verify.'
Hardware Specification	No	The paper mentions using OpenAI API calls and LLM models like GPT-4o and o4-mini, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications for running the experiments.
Software Dependencies	No	The paper mentions a Python environment for the discovery agent with access to packages such as sklearn, pandas, numpy, and matplotlib.pyplot, and uses SALib in example code. However, it does not specify version numbers for Python or for any of these libraries.
Experiment Setup	Yes	Table 4 (System Configuration Parameters) lists several specific parameters, including 'Temperature: Image Analyst 1.0, Hypothesis Belief 0.7, o4-mini NA otherwise 0', 'Timeout: 600 seconds', 'Number of Belief Samples: GPT-4o 30 o4-mini 8', 'Maximum Context Tokens: 128,000', and 'Number of Code Attempts: 6'.
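
For readers who want to carry the reported setup into a rerun, the values quoted in the Experiment Setup and Dataset Splits rows can be collected in one place. The following is a minimal sketch, not the authors' configuration format: the class name `RunConfig`, its field names, the role keys, and the `temperature` helper are all hypothetical, and the reading that "o4-mini NA" means no temperature is set for that model regardless of role is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the values quoted in the table above.

    Field names are illustrative; only the numbers come from the entry.
    """

    hypothesis_budget: int = 500        # "a budget (which we set to 500)"
    timeout_seconds: int = 600          # "Timeout: 600 seconds"
    max_context_tokens: int = 128_000   # "Maximum Context Tokens: 128,000"
    code_attempts: int = 6              # "Number of Code Attempts: 6"
    # "Number of Belief Samples: GPT-4o 30 o4-mini 8"
    belief_samples: Dict[str, int] = field(
        default_factory=lambda: {"gpt-4o": 30, "o4-mini": 8}
    )
    # "Temperature: Image Analyst 1.0, Hypothesis Belief 0.7, ... otherwise 0"
    role_temperature: Dict[str, float] = field(
        default_factory=lambda: {"image_analyst": 1.0, "hypothesis_belief": 0.7}
    )
    default_temperature: float = 0.0

    def temperature(self, role: str, model: str) -> Optional[float]:
        """Return the sampling temperature for a role/model pair.

        Assumption: "o4-mini NA" is read as "no temperature is passed
        for o4-mini", so None is returned for that model.
        """
        if model == "o4-mini":
            return None
        return self.role_temperature.get(role, self.default_temperature)


if __name__ == "__main__":
    cfg = RunConfig()
    for role in ("image_analyst", "hypothesis_belief", "other_role"):
        print(role, cfg.temperature(role, "gpt-4o"))
```

Keeping these values in a single frozen object makes it harder for a rerun to drift silently from the reported settings. Because the entry reports no hardware or library versions, such a record would still need to be supplemented with the environment actually used.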