Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling

Authors: Hao Chen, Guanxi Lu, Yasuyuki Okoshi, Zhiwen Mo, Masato Motomura, Hongxiang Fan

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve the compute efficiency and scaling behavior.
Researcher Affiliation	Academia	1 Imperial College London, UK; 2 Institute of Science Tokyo, Japan
Pseudocode	Yes	As illustrated in Figure 2, a single cycle of VG-Search advances the search from a candidate length t to t + g. The full algorithm proceeds as follows: 1. Start: Initialize with B1 × B2 candidates from the initial prompt. 2. Verify & Select: Evaluate B1 × B2 candidates using V, and retain the top B1 beams. 3. Extend: For each of the B1 selected beams, produce g − 1 generation steps using G. 4. Branch: For each extended beam, produce B2 single-generation-step continuations using G. 5. Repeat: Go back to Step 2 and iterate until termination criteria are met.
Open Source Code	Yes	Our code is available at github.com/hmarkc/VG-Search.
Open Datasets	Yes	We evaluate on the MATH-500 [21] and AIME [2] datasets. MATH-500 samples 500 problems from the MATH [16] benchmark, with difficulty levels labelled. AIME comprises 90 advanced high school mathematics problems from the past three years (AIME24, AIME23, AIME22).
Dataset Splits	Yes	Additionally, we sample 250 problems from the MATH [16] benchmark as a validation set for Section 5, referred to as MATH-250. It consists of 50 problems per difficulty level.
Hardware Specification	Yes	We conduct our experiments on NVIDIA H100 [12] and A100 [11] GPUs, using CUDA 12.8 on Ubuntu 22.04.
Software Dependencies	Yes	We conduct our experiments on NVIDIA H100 [12] and A100 [11] GPUs, using CUDA 12.8 on Ubuntu 22.04. The vLLM library (v0.6.3) [18] is employed for model execution. The experiments for VG-Search on MATH-500 take from 1 to 5 hours, depending on the candidate number. To obtain accurate system-level runtime measurements, we use Nsight Systems [26] and employ the NVIDIA NVTX [13] extension to categorize execution time across different models.
Experiment Setup	Yes	The number of generations is set to n ∈ {4, 16, 64, 128, 256}, with a Branch-Out Factor B2 = 4, temperature 0.8, and Top-p 1.0. For each generation step, the maximum token number is set to 2,048, employing Last scoring and majority voting. The number of iterations I corresponds to g: for g ∈ {1, 2, 3, 4}, we set I ∈ {12, 6, 4, 3}, respectively, to ensure equal compute budget. The prompt template from Qwen [40] is adopted. DVTS, Best-of-N (BoN), and Beam Search are used as baselines, configured with the same parameters as VG-Search.
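
The five steps quoted in the Pseudocode row can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation from the linked repository. The names `vg_search`, `toy_generate`, `toy_score`, and `SETTINGS` are hypothetical, and the generator G and verifier V are replaced by toy stand-ins. The sketch also assumes that each initial candidate holds one generation step, that termination is a fixed iteration count I, and that the single best-scoring candidate is returned (the paper reports Last scoring with majority voting instead).

```python
import random
from typing import Callable, List

Candidate = List[str]  # a candidate is a list of generation steps


def vg_search(
    generate_step: Callable[[Candidate], str],  # stand-in for the generator G
    score: Callable[[Candidate], float],        # stand-in for the verifier V
    b1: int,          # beam width B1
    b2: int,          # branch-out factor B2
    g: int,           # verification granularity
    iterations: int,  # number of iterations I (assumed termination rule)
) -> Candidate:
    # 1. Start: B1 * B2 candidates from the initial (empty) prompt.
    candidates = [[generate_step([])] for _ in range(b1 * b2)]
    for _ in range(iterations):
        # 2. Verify & Select: score all candidates, keep the top B1 beams.
        beams = sorted(candidates, key=score, reverse=True)[:b1]
        # 3. Extend: g - 1 unverified generation steps per selected beam.
        extended = []
        for beam in beams:
            beam = list(beam)
            for _ in range(g - 1):
                beam.append(generate_step(beam))
            extended.append(beam)
        # 4. Branch: B2 single-step continuations per extended beam.
        candidates = [
            beam + [generate_step(beam)] for beam in extended for _ in range(b2)
        ]
        # 5. Repeat: the loop returns to Verify & Select.
    # Simplification: return the best-scoring candidate, no majority voting.
    return max(candidates, key=score)


# Toy stand-ins: each "step" is a random digit, the "verifier" sums the digits.
_rng = random.Random(0)


def toy_generate(prefix: Candidate) -> str:
    return str(_rng.randint(0, 9))


def toy_score(candidate: Candidate) -> float:
    return float(sum(int(step) for step in candidate))


# (g, I) pairs from the Experiment Setup row.
SETTINGS = [(1, 12), (2, 6), (3, 4), (4, 3)]

if __name__ == "__main__":
    for g, iters in SETTINGS:
        best = vg_search(toy_generate, toy_score, b1=2, b2=4, g=g, iterations=iters)
        print(f"g={g}, I={iters}: candidate length {len(best)}")
```

Each cycle adds g steps to a candidate (g − 1 from Extend plus one from Branch), so every (g, I) pair in the Experiment Setup row reaches the same depth because g × I = 12 in all four cases. With g = 1 the Extend step does nothing and every generation step is verified.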