Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Authors: Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, Zhicheng Dou
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on (1) knowledge-intensive complex reasoning benchmarks, including GPQA [41], GAIA [32], WebWalkerQA [56], and Humanity's Last Exam (HLE) [37] to assess complex problem-solving capabilities, and (2) open-ended reasoning tasks from Glaive [11] to evaluate report quality. As Figure 1 shows, WebThinker consistently outperforms all competing approaches. |
| Researcher Affiliation | Collaboration | ¹Renmin University of China, ²BAAI, ³Huawei Poisson Lab |
| Pseudocode | No | The paper describes its methodology using mathematical formulations in Section 3 and detailed textual explanations, but it does not contain a specific section, figure, or block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code is available at https://github.com/RUC-NLPIR/WebThinker. |
| Open Datasets | Yes | We conduct extensive experiments on (1) knowledge-intensive complex reasoning benchmarks, including GPQA [41], GAIA [32], WebWalkerQA [56], and Humanity's Last Exam (HLE) [37] to assess complex problem-solving capabilities, and (2) open-ended reasoning tasks from Glaive [11] to evaluate report quality. |
| Dataset Splits | Yes | GPQA [41]: ... totaling 198 questions. GAIA [32]: ... comprising 103 questions. WebWalkerQA [56]: ... totaling 680 questions. Humanity’s Last Exam (HLE) [37]: ... We randomly sample 500 text-only questions for testing. ... For training, we use the following datasets. We sample approximately 3k data points from these datasets. ... For the scientific report generation task, we use glaiveai/reasoning-v1-20m (Glaive) [11]... We sample 1.5k questions for each iteration’s preference data construction and 30 questions for testing. |
| Hardware Specification | No | The paper mentions models like 'QwQ-32B' and 'Qwen2.5-Instruct' and discusses training iterations and sequence lengths, but it does not specify any particular GPU models, CPU types, memory amounts, or cloud computing instances used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'QwQ-32B' and 'Qwen2.5-Instruct' as backbone models, and 'Bing Web Search API' with 'Crawl4AI [49]' for content fetching. However, it does not provide specific version numbers for these software components or other key libraries like Python, PyTorch, or Transformers, which would be necessary for replication. |
| Experiment Setup | Yes | Generation uses max 81920 tokens, temperature 0.7, top_p 0.8, top_k 20, and repetition penalty 1.05. Search uses Bing Web Search API (US-EN region, k=10) with content fetched via Crawl4AI [49]. Training involves 2 iterations of online DPO with a max sequence length of 32,768. For baselines not trained for o1-like reasoning, we use Chain-of-Thought (CoT) [55] prompting. |
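
The decoding and search settings reported in the Experiment Setup row can be collected into a small configuration object for anyone attempting a replication. The following is a minimal sketch, not the authors' code: the class names, field names (`max_tokens`, `region`, `top_k_results`), and the `validate` helper are hypothetical and chosen only to mirror common sampling-parameter naming. Only the numeric values and the "US-EN" region come from the row above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GenerationSettings:
    """Decoding values as reported in the Experiment Setup row.

    Field names are illustrative assumptions, not taken from the paper's code.
    """
    max_tokens: int = 81920
    temperature: float = 0.7
    top_p: float = 0.8
    top_k: int = 20
    repetition_penalty: float = 1.05


@dataclass(frozen=True)
class SearchSettings:
    """Search values as reported (Bing Web Search API, US-EN region, k=10)."""
    region: str = "US-EN"      # hypothetical field name
    top_k_results: int = 10    # hypothetical field name for "k=10"


def validate(settings: GenerationSettings) -> None:
    """Reject values outside the ranges that sampling parameters normally allow."""
    if settings.max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    if settings.temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    if not 0.0 < settings.top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if settings.top_k < 1:
        raise ValueError("top_k must be at least 1")
    if settings.repetition_penalty <= 0.0:
        raise ValueError("repetition_penalty must be positive")


if __name__ == "__main__":
    generation = GenerationSettings()
    validate(generation)
    # Plain dicts are convenient for logging the exact settings of a run.
    print(asdict(generation))
    print(asdict(SearchSettings()))
```

Because the table reports no software versions, recording such a dictionary alongside library versions in each run log would help close part of the gap noted under Software Dependencies.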