Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

WebDancer: Towards Autonomous Information Seeking Agency

Authors: Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Ding-Chu Zhang, Zekun Xi, Robert Tang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on the challenging information-seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Extensive experiments on two web information seeking benchmarks, GAIA and WebWalkerQA, show the effectiveness of our pipeline and WebDancer (§4).
Researcher Affiliation	Industry	Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Xiangru Tang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou. Alibaba Group. Correspondence to: EMAIL EMAIL
Pseudocode	No	The paper describes algorithms like DAPO and training stages but does not present them in a structured pseudocode or algorithm block format.
Open Source Code	Yes	The codes and demo are released in https://github.com/Alibaba-NLP/DeepResearch. The full code will be released upon acceptance of the paper.
Open Datasets	Yes	Empirical evaluations on the challenging information-seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer. GAIA [12] only has 466, WebWalkerQA [3] contains 680 examples. We evaluate our approach on two established deep information-seeking benchmarks: GAIA and WebWalkerQA. We select a set of widely-used QA datasets, including MuSiQue [66], Bamboogle [67], PopQA [68], 2Wiki [69], and HotpotQA [70].
Dataset Splits	Yes	Our experiments use 103 questions from GAIA's text-only validation split and 680 questions from the WebWalkerQA test set.
Hardware Specification	Yes	We conduct all experiments using 32 nodes with 8 NVIDIA H20 (96GB).
Software Dependencies	Yes	We build our system using the widely adopted ReAct framework, implemented on top of the Qwen-Agents. For RL, we implement verl [73, 74] to support the RL algorithm and rollouts.
Experiment Setup	Yes	We set the inference parameters as follows: temperature = 0.6, top_p = 0.95. For the LRM, we use a repetition penalty of 1.1, while for the LLM, the repetition penalty is set to 1.0. In the RL, the temperature of rollout is 1.0 and top_p = 1.0. The rollout number in RL is 16.
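
The sampling settings quoted in the Experiment Setup row could be recorded as a small configuration object, which makes the differences between the three regimes easy to compare. The following is a minimal sketch, not the authors' code: the class name `SamplingConfig`, its field names, and the constant names are hypothetical, and only the numeric values come from the row above. The repetition penalty for RL rollouts is not reported there, so it is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingConfig:
    """Hypothetical container for the sampling settings quoted in the table."""
    temperature: float
    top_p: float
    # None means "not reported in the quoted text".
    repetition_penalty: Optional[float] = None
    # Number of rollouts per prompt; only reported for RL.
    num_rollouts: Optional[int] = None


# Inference: temperature 0.6 and top_p 0.95 for both model families;
# repetition penalty 1.1 for the LRM and 1.0 for the LLM.
INFERENCE_LRM = SamplingConfig(temperature=0.6, top_p=0.95, repetition_penalty=1.1)
INFERENCE_LLM = SamplingConfig(temperature=0.6, top_p=0.95, repetition_penalty=1.0)

# RL rollouts: temperature 1.0, top_p 1.0, 16 rollouts.
RL_ROLLOUT = SamplingConfig(temperature=1.0, top_p=1.0, num_rollouts=16)


def reported_fields(cfg: SamplingConfig) -> dict:
    """Return only the settings that were actually reported (drop None values)."""
    return {k: v for k, v in vars(cfg).items() if v is not None}
```

Calling `reported_fields(RL_ROLLOUT)` yields only the three RL values given in the row, so an unreported setting is never silently filled with a default.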