Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Authors: Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, Zhicheng Dou

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on (1) knowledge-intensive complex reasoning benchmarks, including GPQA [41], GAIA [32], WebWalkerQA [56], and Humanity's Last Exam (HLE) [37] to assess complex problem-solving capabilities, and (2) open-ended reasoning tasks from Glaive [11] to evaluate report quality. As Figure 1 shows, WebThinker consistently outperforms all competing approaches.
Researcher Affiliation | Collaboration | (1) Renmin University of China, (2) BAAI, (3) Huawei Poisson Lab
Pseudocode | No | The paper describes its methodology using mathematical formulations in Section 3 and detailed textual explanations, but it does not contain a specific section, figure, or block explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code is available at https://github.com/RUC-NLPIR/WebThinker.
Open Datasets | Yes | We conduct extensive experiments on (1) knowledge-intensive complex reasoning benchmarks, including GPQA [41], GAIA [32], WebWalkerQA [56], and Humanity's Last Exam (HLE) [37] to assess complex problem-solving capabilities, and (2) open-ended reasoning tasks from Glaive [11] to evaluate report quality.
Dataset Splits | Yes | GPQA [41]: ... totaling 198 questions. GAIA [32]: ... comprising 103 questions. WebWalkerQA [56]: ... totaling 680 questions. Humanity's Last Exam (HLE) [37]: ... We randomly sample 500 text-only questions for testing. ... For training, we use the following datasets. We sample approximately 3k data points from these datasets. ... For the scientific report generation task, we use glaiveai/reasoning-v1-20m (Glaive) [11]... We sample 1.5k questions for each iteration's preference data construction and 30 questions for testing.
Hardware Specification | No | The paper mentions models like 'QwQ-32B' and 'Qwen2.5-Instruct' and discusses training iterations and sequence lengths, but it does not specify any particular GPU models, CPU types, memory amounts, or cloud computing instances used for running the experiments.
Software Dependencies | No | The paper mentions using 'QwQ-32B' and 'Qwen2.5-Instruct' as backbone models, and 'Bing Web Search API' with 'Crawl4AI [49]' for content fetching. However, it does not provide specific version numbers for these software components or other key libraries like Python, PyTorch, or Transformers, which would be necessary for replication.
Experiment Setup | Yes | Generation uses max 81920 tokens, temperature 0.7, top_p 0.8, top_k 20, and repetition penalty 1.05. Search uses Bing Web Search API (US-EN region, k=10) with content fetched via Crawl4AI [49]. Training involves 2 iterations of online DPO with a max sequence length of 32,768. For baselines not trained for o1-like reasoning, we use Chain-of-Thought (CoT) [55] prompting.
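
For anyone attempting a replication, the settings quoted in the Experiment Setup and Dataset Splits rows can be gathered into one place. The following is a minimal sketch, not the authors' configuration format: the class names, field names, and the `validate` helper are hypothetical, and only the numeric values and tool names come from the rows above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Decoding settings quoted in the Experiment Setup row (field names are hypothetical)."""
    max_tokens: int = 81920
    temperature: float = 0.7
    top_p: float = 0.8
    top_k: int = 20
    repetition_penalty: float = 1.05


@dataclass(frozen=True)
class SearchConfig:
    """Retrieval settings quoted in the Experiment Setup row (field names are hypothetical)."""
    provider: str = "Bing Web Search API"
    region: str = "US-EN"
    results_per_query: int = 10
    fetcher: str = "Crawl4AI"


@dataclass(frozen=True)
class TrainingConfig:
    """Training settings quoted in the Experiment Setup row (field names are hypothetical)."""
    online_dpo_iterations: int = 2
    max_sequence_length: int = 32768


# Test-set sizes quoted in the Dataset Splits row.
EVAL_SET_SIZES = {
    "GPQA": 198,
    "GAIA": 103,
    "WebWalkerQA": 680,
    "HLE": 500,
    "Glaive": 30,
}


def validate(cfg: GenerationConfig) -> None:
    """Basic sanity checks on the decoding settings; raises ValueError on bad values."""
    if cfg.max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if cfg.temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < cfg.top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if cfg.top_k <= 0:
        raise ValueError("top_k must be positive")
    if cfg.repetition_penalty <= 0:
        raise ValueError("repetition_penalty must be positive")


if __name__ == "__main__":
    gen = GenerationConfig()
    validate(gen)
    print(asdict(gen))
    print(asdict(SearchConfig()))
    print(asdict(TrainingConfig()))
    print("total test questions:", sum(EVAL_SET_SIZES.values()))
```

The dictionaries produced by `asdict` would still need to be mapped onto whichever inference and training stack is used; as the Hardware Specification and Software Dependencies rows note, the paper does not report hardware or library versions, so that mapping is left open here.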