Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise
Authors: Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AUTODISCOVERY in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AUTODISCOVERY substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. |
| Researcher Affiliation | Collaboration | α University of Massachusetts Amherst; β Allen Institute for AI; γ Capital One |
| Pseudocode | Yes | Algorithm 1 MCTS with Progressive Widening |
| Open Source Code | Yes | https://github.com/allenai/autodiscovery |
| Open Datasets | Yes | We utilize a total of 21 datasets (D) from the following benchmark sources spanning areas such as biology, economics, finance, and behavioral science. DiscoveryBench [Majumder et al., 2025]... BLADE [Gu et al., 2024]... SEA-AD [Hawrylycz et al., 2024] |
| Dataset Splits | No | The paper refers to using datasets for data-driven discovery tasks but does not specify training/test/validation splits within the main text or supplementary details, nor does it refer to predefined splits for these specific experiments. For example, it states, 'The input for the task is a dataset D, its associated metadata, and a budget (which we set to 500) specifying the total number of hypotheses the agent is allowed to explore and verify.' |
| Hardware Specification | No | The paper mentions using OpenAI API calls and LLM models like GPT-4o and o4-mini, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications for running the experiments. |
| Software Dependencies | No | The paper mentions a Python environment for the discovery agent, with access to packages like sklearn, pandas, numpy, and matplotlib.pyplot, and SALib in example code. However, it does not specify concrete version numbers for Python itself or any of these crucial libraries. |
| Experiment Setup | Yes | Table 4 (System Configuration Parameters) lists several specific parameters, including 'Temperature: Image Analyst 1.0, Hypothesis Belief 0.7, o4-mini NA, otherwise 0', 'Timeout: 600 seconds', 'Number of Belief Samples: GPT-4o 30, o4-mini 8', 'Maximum Context Tokens: 128,000', and 'Number of Code Attempts: 6'. |
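
The setup values quoted in the table can be gathered into one configuration object, which may help anyone trying to reproduce the runs. The following is a minimal sketch and is not taken from the AutoDiscovery repository: the class name `SystemConfig`, the field names, the role keys (`image_analyst`, `hypothesis_belief`), and the model keys are hypothetical. It also assumes that "o4-mini NA" means no temperature is passed for o4-mini regardless of role, and that "otherwise 0" is the default for every other role.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class SystemConfig:
    """Hypothetical container for the values reported in the table above."""

    budget: int = 500                 # hypotheses the agent may explore and verify
    timeout_seconds: int = 600        # 'Timeout: 600 seconds'
    max_context_tokens: int = 128_000 # 'Maximum Context Tokens: 128,000'
    code_attempts: int = 6            # 'Number of Code Attempts: 6'
    default_temperature: float = 0.0  # 'otherwise 0'
    # Role names are illustrative keys, not identifiers from the paper's code.
    role_temperature: Dict[str, float] = field(
        default_factory=lambda: {"image_analyst": 1.0, "hypothesis_belief": 0.7}
    )
    # Model names are illustrative keys for 'GPT-4o 30, o4-mini 8'.
    belief_samples: Dict[str, int] = field(
        default_factory=lambda: {"gpt-4o": 30, "o4-mini": 8}
    )

    def temperature(self, role: str, model: str) -> Optional[float]:
        """Return the sampling temperature, or None when it is not applicable."""
        # Assumption: 'o4-mini NA' overrides any role-specific temperature.
        if model == "o4-mini":
            return None
        return self.role_temperature.get(role, self.default_temperature)

    def n_belief_samples(self, model: str) -> int:
        """Return how many belief samples are drawn for the given model."""
        return self.belief_samples[model]


if __name__ == "__main__":
    cfg = SystemConfig()
    print(cfg.temperature("hypothesis_belief", "gpt-4o"))
    print(cfg.temperature("hypothesis_belief", "o4-mini"))
    print(cfg.n_belief_samples("gpt-4o"))
```

Keeping the values in a frozen dataclass makes the reported settings explicit and hard to change by accident. Library versions and hardware are still unspecified, so a configuration like this documents only part of the setup.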