Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DeepDiver: Adaptive Web-Search Intensity Scaling via Reinforcement Learning

Authors: Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Hanting Chen, Xiaozhe Ren, Yasheng Wang, Lu Hou, Lifeng Shang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We introduce WebPuzzle, a 24k-sample training and 275-sample test benchmark that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement-learning (RL) framework that cultivates Search Intensity Scaling (SIS), an emergent ability to escalate search frequency and depth instead of settling on overconfident, underevidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a well-designed RL procedure, and show that its seeking policy generalized from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.
Researcher Affiliation	Industry	Huawei Language Model Lab EMAIL
Pseudocode	No	The paper describes methods and processes in natural language and mathematical formulas, for example, the GRPO objective in Equations (1) and (2) of Appendix B. However, it does not include explicitly labeled pseudocode or algorithm blocks. Figure 3 provides a high-level overview diagram, but it is not pseudocode.
Open Source Code	No	The paper contains experimental results that rely on proprietary code and data that are currently undergoing an internal review process for open-source release approval. While we cannot provide open access to the code and data at submission time, we plan to release them once the review process is completed.
Open Datasets	Yes	However, these works predominantly train and evaluate their methods on well-structured datasets such as HotpotQA [39], which are based on corpora like Wikipedia. [...] We evaluate performance using closed-ended Chinese benchmarks including C-simpleQA-500 [33, 8], FRAMES-zh-230 [14], BamBoogle-zh-71 [21], and our proposed WebPuzzle (detailed in Appendix E.4).
Dataset Splits	Yes	WebPuzzle contains 24k training samples and 275 human-annotated test examples... Due to computational constraints and capability limits of the 7B model, we train DeepDiver on a carefully selected mixture of 7k WebPuzzle samples rather than the full dataset. We evenly split these into 2k samples for cold-start SFT (Section 3.2) and 5k for RL training. [...] Table 8: Data statistics of the WebPuzzle training and evaluation sets used in our experiment. Problems are labeled as easy, medium, or hard, and outliers refer to cases with pass@4 = 0. Training Data Num: Total Set 7000. Evaluation Data Num: Total Set 275.
Hardware Specification	No	The paper mentions 'Due to computational constraints, the experiments presented in this report are limited to a 7B model...' in Section C, and 'Due to computational constraints and capability limits of the 7B model...' in Section 4.1. However, Section E.8, titled 'Implementation Details,' which is referenced for compute resources in the NeurIPS checklist, describes training parameters and search engines but does not specify the type of compute workers (CPU, GPU models, etc.) used for the experiments.
Software Dependencies	Yes	For online search, we utilize the Bocha search engine for the Chinese scenario and LangSearch for the English scenario. Both graders are implemented based on the qwen-turbo API. [...] For trainable baselines, we use Qwen2.5-7B-Instruct [30] and Pangu-7B-Reasoner [28] as backbone models. Training-free baselines include QwQ-32B [31], GPT-4o [20] and DeepSeek-R1 [6].
Experiment Setup	Yes	DeepDiver utilizes the procedure of cold-start supervised fine-tuning (SFT) followed by reinforcement learning (RL), while incorporating a carefully designed reward assignment and scheduling mechanism to maintain stable RL training. [...] During training, each data sample undergoes 14 rollouts with a sampling temperature of 0.9. We employ a batch size of 32 and a learning rate of 1e-6, training for a single epoch with a KL divergence coefficient of 0.001. The maximum number of tool call rounds is set to 7.
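
The training settings quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration sketch. This is an illustration only, not the authors' code: the class name `RLSetup`, the field names, and the helper functions are hypothetical, and the step count assumes that "batch size 32" counts prompts (each expanded into 14 rollouts), which the quoted text does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RLSetup:
    """Hyperparameters as quoted in the Experiment Setup row (field names are hypothetical)."""
    rollouts_per_sample: int = 14
    sampling_temperature: float = 0.9
    batch_size: int = 32
    learning_rate: float = 1e-6
    epochs: int = 1
    kl_coefficient: float = 0.001
    max_tool_call_rounds: int = 7


# Split of the 7k selected samples, as quoted in the Dataset Splits row.
SFT_SAMPLES = 2_000
RL_SAMPLES = 5_000


def rollouts_per_batch(cfg: RLSetup) -> int:
    # Assumption: every prompt in a batch is expanded into `rollouts_per_sample` rollouts.
    return cfg.batch_size * cfg.rollouts_per_sample


def rl_steps(cfg: RLSetup, num_samples: int) -> int:
    # Assumption: one update per batch of prompts, with the last partial batch kept.
    return math.ceil(num_samples / cfg.batch_size) * cfg.epochs


cfg = RLSetup()
print(rollouts_per_batch(cfg))    # rollouts generated per batch under the assumption above
print(rl_steps(cfg, RL_SAMPLES))  # RL updates for one epoch under the assumption above
```

Under these assumptions, one batch yields 448 rollouts and a single epoch over the 5k RL samples takes 157 updates; a different batching convention would change both numbers.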