Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Value-Guided Search for Efficient Chain-of-Thought Reasoning

Authors: Kaiwen Wang, Jin Peng Zhou, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We extensively evaluate value-guided search (VGS) with our 1.5B value model DeepSeek-VM-1.5B, focusing on guiding the CoT reasoning of DeepSeek models [19]. The best VGS setup for our value model is beam search with final WMV aggregation, beam width 2, block size 4096 and with DVTS (for larger inference budgets). We show this setup outperforms other test-time compute methods (e.g., MV, WMV, BoN) and other scoring models (e.g., existing 7B PRMs and a 1.5B Bradley-Terry reward model trained on our dataset). Our experiments show that blockwise VGS significantly improves TTC compared to majority voting or weighted majority voting...
Researcher Affiliation Collaboration ¹Cornell University, ²Harvard University, ³Netflix, ⁴Databricks
Pseudocode Yes Algorithm 1 Beam Search with Width w. 1: Input: prompt x. 2: Set num beams B ← N/w. 3: Initialize beams y_1, ..., y_B ← x. 4: while ∃ j s.t. y_j is not finished do 5: For each j s.t. y_j is not finished, sample w i.i.d. blocks {b_{i,j}}_{i∈[w]} from π(· | y_j). 6: Update unfinished beams to be the best continuations with the highest V(y_j, b_{i,j}). 7: end while 8: return BoN or WMV on {y_1, ..., y_B}. Algorithm 2 Best-of-N. 1: Input: prompt x, responses {y_i}_{i∈[N]}. 2: return y_bon = arg max_{y_i} V(x, y_i). Algorithm 3 (Weighted) Majority Vote. 1: Input: prompt x, responses {y_i}_{i∈[N]}, weights {w_i}_{i∈[N]}, equivalence relation ∼. 2: Partition {y_i}_i into equiv. classes {p_k}_k. 3: return A response from the highest weight partition arg max_{p_k} P
Open Source Code Yes Our dataset, model and codebase are open-sourced at https://github.com/kaiwenw/value-guided-search.
Open Datasets Yes Our dataset, model and codebase are open-sourced at https://github.com/kaiwenw/value-guided-search. We collect a dataset of 2.5 million math reasoning traces (over 30 billion tokens) from a filtered subset of the OpenR1-Math dataset [2].
Dataset Splits Yes We start from the OpenR1-Math dataset (default split) [2] which contains 94k math problems with solutions that were already filtered for quality. We selected the block size and beam width based on the performance on AIME-24 as the validation set (ablations in Section 4.1). Then, we evaluate on AIME-25 and HMMT-25 since they happened after the release of DeepSeek and OpenR1.
Hardware Specification Yes Compute GPUs: 16 nodes of 8 NVIDIA H100
Software Dependencies No Inference Engine: SGLang [52]
Experiment Setup Yes Table 4: Value Model Training Parameters. Table 5: Decoding and Search Parameters. PPO Hyperparameter Setting (Setting: Parameters). Generation (train): temperature: 1.0, top p: 1. PPO: batch size: 256, mini batch size: 128, micro batch size: 1, policy learning rate: 1e-6, critic learning rate: 1e-5, train epochs: 25, γ_entropy: 1e-3, γ_KL: 1e-4, gae γ: 1, gae λ: 1, clip ratio: 0.2; Total number of steps: 2250
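
The three algorithms quoted in the Pseudocode row can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is in the linked repository). The callables `sample_block`, `value`, `is_finished` and `answer_of` are hypothetical stand-ins for the policy π, the value model V, a termination check and the equivalence relation. Three assumptions are made: V scores the concatenation of a beam and a sampled block, step 6 keeps the top-scoring candidates pooled across all unfinished beams, and the truncated arg max in Algorithm 3 ranges over the total weight of each equivalence class. The `max_steps` cap is an added safeguard that does not appear in the quoted pseudocode.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence


def best_of_n(responses: Sequence[str], value: Callable[[str], float]) -> str:
    """Algorithm 2 (sketch): return the response with the highest value score."""
    return max(responses, key=value)


def weighted_majority_vote(
    responses: Sequence[str],
    weights: Sequence[float],
    answer_of: Callable[[str], Hashable],  # hypothetical: maps a response to its class
) -> str:
    """Algorithm 3 (sketch): return a response from the class with the largest total weight."""
    totals = defaultdict(float)
    members = defaultdict(list)
    for y, w in zip(responses, weights):
        key = answer_of(y)
        totals[key] += w
        members[key].append(y)
    best_key = max(totals, key=totals.get)
    return members[best_key][0]


def beam_search(
    prompt: str,
    sample_block: Callable[[str], str],  # hypothetical: draws one block from pi(. | y)
    value: Callable[[str], float],       # hypothetical: V applied to beam + block
    is_finished: Callable[[str], bool],
    budget: int,                         # N in the pseudocode
    width: int,                          # w in the pseudocode
    max_steps: int = 64,                 # safeguard, not part of the pseudocode
) -> List[str]:
    """Algorithm 1 (sketch): blockwise beam search with B = N / w beams."""
    num_beams = budget // width
    beams = [prompt] * num_beams
    for _step in range(max_steps):
        unfinished = [j for j, y in enumerate(beams) if not is_finished(y)]
        if not unfinished:
            break
        # Sample `width` candidate continuations for every unfinished beam.
        candidates = [
            beams[j] + sample_block(beams[j])
            for j in unfinished
            for _ in range(width)
        ]
        # Keep the highest-valued candidates, one per unfinished beam slot.
        candidates.sort(key=value, reverse=True)
        for j, y in zip(unfinished, candidates):
            beams[j] = y
    return beams
```

The list returned by `beam_search` would then be passed to `best_of_n` or `weighted_majority_vote` for final aggregation, mirroring step 8 of Algorithm 1. With majority voting, a natural choice of weights is the value score of each finished beam, although the quoted text does not specify this.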