Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AutoData: A Multi-Agent System for Open Web Data Collection

Authors: Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, Chuxu Zhang, Yanfang Ye

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at here.
Researcher Affiliation | Collaboration | 1 University of Notre Dame, 2 University of Connecticut, 3 Amazon, 4 University of Washington, 5 Purdue University, 6 IBM Research
Pseudocode | No | The paper discusses an "algorithm g( ) for ground truth dataset construction" but does not present it in a structured pseudocode or algorithm block format. There are no labeled sections for "Pseudocode" or "Algorithm".
Open Source Code | Yes | Our source code and dataset are available at here.
Open Datasets | Yes | Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports... Our source code and dataset are available at here... Comprehensive Evaluation: We conduct comprehensive experiments on AutoData and baseline methods over Instruct2DS and three existing benchmark datasets, i.e., SWDE, EXTENDED SWDE, and HUMANEVAL.
Dataset Splits | Yes | For experiments over IE benchmark datasets, i.e., SWDE [15] and EXTENDED SWDE [43, 44], we follow the setup in AutoScraper [53] that chooses three seed webpages for models to identify knowledge for web crawler programming, and the rest for testing.
Hardware Specification | Yes | All experiments are conducted on Linux servers equipped with four Nvidia A40 GPUs.
Software Dependencies | Yes | The models are implemented using PyTorch 2.4.0 with CUDA 12.1 and Python 3.11.5.
Experiment Setup | Yes | Implementation details about environments, evaluation metrics, and experiment setup are provided in Appendix D.1, D.2, and D.4-D.7, respectively... For experiments over IE benchmark datasets, i.e., SWDE [15] and EXTENDED SWDE [43, 44], we follow the setup in AutoScraper [53] that chooses three seed webpages for models to identify knowledge for web crawler programming, and the rest for testing.
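
The seed/test protocol quoted under Dataset Splits (three seed webpages, the rest for testing) can be illustrated with a minimal sketch. This is not the authors' code. The function name `split_seed_and_test`, the page identifiers, and the choice to take the first three pages as seeds are all assumptions made for illustration, since the quoted text does not say how the seed pages are selected.

```python
from typing import List, Sequence, Tuple


def split_seed_and_test(
    pages: Sequence[str], num_seeds: int = 3
) -> Tuple[List[str], List[str]]:
    """Split one website's pages into seed pages and test pages.

    Assumption: the first `num_seeds` pages serve as seeds. The quoted
    setup only states that three seed pages are chosen and the rest are
    used for testing, not how the seeds are picked.
    """
    if len(pages) <= num_seeds:
        raise ValueError("need more pages than seeds so the test set is non-empty")
    return list(pages[:num_seeds]), list(pages[num_seeds:])


# Hypothetical page identifiers for a single website.
pages = [f"site-a/page-{i}.html" for i in range(10)]
seeds, test = split_seed_and_test(pages)
print(len(seeds), len(test))  # prints: 3 7
```

In a benchmark with many websites, a split like this would presumably be applied per website, so that the seed pages and the test pages come from the same site.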