Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AI-Researcher: Autonomous Scientific Innovation

Authors: Jiabin Tang, Lianghao Xia, Zhonghang Li, Chao Huang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments, we demonstrate that AI-Researcher achieves remarkable implementation success rates and produces research papers that approach human-level quality. This work establishes new foundations for autonomous scientific innovation that can complement human researchers by systematically exploring solution spaces beyond cognitive limitations. Code link: https://github.com/HKUDS/AI-Researcher.
Researcher Affiliation	Academia	Jiabin Tang, Lianghao Xia, Zhonghang Li, Chao Huang; The University of Hong Kong
Pseudocode	No	The paper describes workflows and processes for its multi-agent system, such as in Section 3.1.2 (New Algorithm Design, Implementation & Validation) and Figure 2 (Architectural framework), but it does not contain explicitly labeled pseudocode or algorithm blocks for its own methodology.
Open Source Code	Yes	Code link: https://github.com/HKUDS/AI-Researcher.
Open Datasets	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have released data and code anonymously.
Dataset Splits	Yes	Table 5: Data statistics of Scientist-Bench across diverse research domains, featuring comprehensive task distribution across guided innovation and open-ended exploration challenges. Columns: Research Domain, # Papers, # Level-1, # Level-2, # Rejected Papers. Diffusion Models: 4, 4, 1, 0. Vector Quantization: 6, 6, 1, 0. Graph Neural Networks: 7, 7, 1, 1. Recommender Systems: 5, 5, 3, 1. Total: 22, 22, 6, 2.
Hardware Specification	No	The paper's self-assessment in the NeurIPS checklist (Question 8) states 'We have provided information about the computing resources' but no specific hardware details (such as exact GPU/CPU models, processor types with speeds, or memory amounts) are explicitly mentioned in the main text or appendices for running experiments.
Software Dependencies	No	The paper mentions 'consistent environments with pre-configured ML frameworks' and a preference for 'pytorch framework' but does not provide specific version numbers for any software dependencies required to reproduce the experiments.
Experiment Setup	No	The paper describes the evaluation protocols for AI-Researcher, including the use of LLM evaluators with 'temperature set to 1', but it does not specify concrete hyperparameters like learning rates, batch sizes, or optimizer settings for any models implemented or trained by AI-Researcher during its experimental process.
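
The table above can be summarized with a simple tally. The following is a minimal sketch, not the index's scoring method: this page does not state how the score is computed, so the code only counts how many of the Yes/No variables are reported as Yes. The names `RESULTS` and `tally` are hypothetical, and the two categorical rows (Research Type, Researcher Affiliation) are left out by assumption.

```python
# Yes/No reproducibility variables copied from the table above.
# Research Type and Researcher Affiliation are categorical, so they are
# excluded here (an assumption, not a rule stated by the index).
RESULTS = {
    "Pseudocode": "No",
    "Open Source Code": "Yes",
    "Open Datasets": "Yes",
    "Dataset Splits": "Yes",
    "Hardware Specification": "No",
    "Software Dependencies": "No",
    "Experiment Setup": "No",
}


def tally(results):
    """Return (number of 'Yes' results, total number of variables)."""
    yes = sum(1 for value in results.values() if value == "Yes")
    return yes, len(results)


yes_count, total = tally(RESULTS)
print(f"{yes_count} of {total} Yes/No variables are reported as Yes")
```

As the notice at the top says, the underlying labels are LLM-generated estimates, so a count like this inherits the same uncertainty.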