Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
WebDancer: Towards Autonomous Information Seeking Agency
Authors: Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Ding-Chu Zhang, Zekun Xi, Robert Tang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on the challenging information-seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Extensive experiments on two web information seeking benchmarks, GAIA and WebWalkerQA, show the effectiveness of our pipeline and WebDancer (§4). |
| Researcher Affiliation | Industry | Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Xiangru Tang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou, Alibaba Group. Correspondence to: EMAIL EMAIL |
| Pseudocode | No | The paper describes algorithms like DAPO and training stages but does not present them in a structured pseudocode or algorithm block format. |
| Open Source Code | Yes | The codes and demo are released in https://github.com/Alibaba-NLP/DeepResearch. The full code will be released upon acceptance of the paper. |
| Open Datasets | Yes | Empirical evaluations on the challenging information-seeking benchmarks, GAIA and WebWalkerQA, demonstrate the strong performance of WebDancer. GAIA [12] only has 466, WebWalkerQA [3] contains 680 examples. We evaluate our approach on two established deep information-seeking benchmarks: GAIA and WebWalkerQA. We select a set of widely-used QA datasets, including MuSiQue [66], Bamboogle [67], PopQA [68], 2Wiki [69], and HotpotQA [70]. |
| Dataset Splits | Yes | Our experiments use 103 questions from GAIA's text-only validation split and 680 questions from the WebWalkerQA test set. |
| Hardware Specification | Yes | We conduct all experiments using 32 nodes with 8 NVIDIA H20 (96GB). |
| Software Dependencies | Yes | We build our system using the widely adopted ReAct framework, implemented on top of the Qwen-Agents. For RL, we implement verl [73, 74] to support the RL algorithm and rollouts. |
| Experiment Setup | Yes | We set the inference parameters as follows: temperature = 0.6, top_p = 0.95. For the LRM, we use a repetition penalty of 1.1, while for the LLM, the repetition penalty is set to 1.0. In the RL, the temperature of rollout is 1.0 and top_p = 1.0. The rollout number in RL is 16. |
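
For readers who want to reuse the reported decoding settings, the values quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustration: the class name `SamplingConfig`, the helper `inference_config`, and the labels `"lrm"` and `"llm"` are hypothetical and do not come from the paper or its code release. The repetition penalty for RL rollouts is not stated in the quoted text, so it is left unset here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingConfig:
    """Decoding parameters; names are illustrative, not the authors' API."""
    temperature: float
    top_p: float
    # None means "not reported in the quoted setup".
    repetition_penalty: Optional[float] = None


def inference_config(model_kind: str) -> SamplingConfig:
    """Return the reported inference settings.

    model_kind is a hypothetical label: "lrm" for the large reasoning
    model, "llm" for the standard language model.
    """
    penalties = {"lrm": 1.1, "llm": 1.0}
    if model_kind not in penalties:
        raise ValueError(f"unknown model kind: {model_kind!r}")
    return SamplingConfig(
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=penalties[model_kind],
    )


# Reported RL rollout settings: temperature 1.0, top_p 1.0, 16 rollouts.
RL_ROLLOUT = SamplingConfig(temperature=1.0, top_p=1.0)
RL_ROLLOUT_NUMBER = 16
```

Calling `inference_config("lrm")` yields the settings with repetition penalty 1.1, and `inference_config("llm")` yields the same temperature and top_p with penalty 1.0.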