Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Weaver: Shrinking the Generation-Verification Gap by Scaling Compute for Verification

Authors: Jon Saad-Falcon, Estefany Kelly Buchanan, Mayee Chen, Tzu-Heng (Brian) Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our evaluations demonstrate that WEAVER significantly improves the pass@1 performance across several reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct (a much cheaper non-reasoning model) as the generator, and an ensemble of smaller judge and reward models as the verifiers (86.2% average). Empirically, WEAVER improves over unweighted averaging of verifier scores by 17.1% and repeated sampling with majority voting by 13.5% (Table 1). WEAVER allows us to improve over Pass@1 by 17.9% for 8B models and 14.5% for 70B models across reasoning and mathematics tasks (Tables 1 and 18). This mirrors the performance jump from GPT-4o to o3-mini (73.9% vs. 88.2%) but without parameter tuning or post-training steps.
Researcher Affiliation	Collaboration	¹Stanford University, ²Together AI, ³University of Wisconsin-Madison. Equal contribution
Pseudocode	No	The paper describes methods textually and mathematically, for example, in Section B.1 'Weak Supervision Model' and B.1.1 'Parameter Estimation'. It does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	In our supplementary materials, we provide a code repository for our WEAVER implementation along with access to all datasets used in our experiments. Furthermore, our methodology is also described in sufficient detail for reproducibility in the paper in Sections 3 and 4. We release our WEAVER implementation code and the configuration files used in our supplementary materials. We provide documentation for these assets alongside our code repository.
Open Datasets	Yes	We evaluate on MATH500, GPQA Diamond, MMLU College, and MMLU Pro. See Sec. C.1 for more details. Table 10: Benchmark Overview: Evaluation configurations for Alpaca Eval 2.0, Arena-Hard-Auto, AIMO, MATH500, GPQA, MMLU, MMLU Pro, and Big-Bench Hard (BBH). Columns: Benchmark, Dataset Size, Scoring Type, Metric, License. MATH500: 500, Ground Truth, Pass@1, Apache 2.0. GPQA: 646, Ground Truth, Pass@1, CC BY 4.0. MMLU College: 719, Ground Truth, Pass@1, MIT. MMLU Pro: 500, Ground Truth, Pass@1, MIT.
Dataset Splits	Yes	To obtain practical scaling trends, we fit parametric models in Eq. (14) to the empirical averages computed from 5 independent runs for each value of K, across each dataset and verification strategy. We train/evaluate on an 80:20 split. Our experimental setup uses training examples containing instruction-generation pairs. Since our datasets have multiple generations per instruction (up to 100), we group examples by instruction before splitting to prevent data leakage between train and validation sets. We hold out 50% of the dataset instructions (each paired with 100 candidate generations) for evaluation. For the data scaling experiment, we train on different percentages (1%, 2%, 4%, and 16%) of the dataset by calculating the number of instructions as num_problems_in_dataset × (train_percentage/100) and selecting repeated samples for each instruction with samples=min(max(4,num_problems 2),20) to avoid overfitting. We use a consistent random seed to ensure identical dataset splits between optimization runs.
Hardware Specification	Yes	Our experiments were conducted using 4 compute nodes, each equipped with 8 NVIDIA H100 GPUs (80GB HBM3 memory per GPU), for a total of 32 H100 GPUs. Each node was configured with high-bandwidth NVLink connections between GPUs and inter-node communication was facilitated via NVIDIA NVLink Switch System to minimize communication overhead during distributed training and inference.
Software Dependencies	Yes	Our experiments were powered by: NVIDIA CUDA 12.2; PyTorch 2.1 with NVIDIA NCCL for distributed communication; DeepSpeed ZeRO Stage 3 for memory optimization; distributed data loading with webdataset format for efficient streaming.
Experiment Setup	Yes	We optimize 5 using gradient descent to estimate the verifier accuracy parameters. We use the L-BFGS-B algorithm to optimize a smooth approximation following [29]. To ensure numerical stability and robustness to outliers or heavy-tailed noise in the observed selection accuracies, we minimize the Huber loss between the predicted values and the empirical means. The Huber loss behaves quadratically for small residuals and linearly for large ones, making it less sensitive to outliers than mean squared error (MSE) while maintaining smooth differentiability for gradient-based optimization. It is defined as $L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } \lvert r \rvert \le \delta, \\ \delta \left( \lvert r \rvert - \frac{1}{2}\delta \right) & \text{otherwise,} \end{cases}$ where δ > 0 is a tunable threshold that controls the transition between the two regimes. We search over δ ∈ {0.01, 0.05, 0.1, 0.25, 0.5} to select the value that yields the best fit. For the loss function in WEAVER distillation, we utilized cross-entropy loss with Adam [35]. Our classification architecture comprises a single linear classification layer with 0.1 dropout applied to the input, which consists of the final hidden state from the [CLS] token. Regarding learning dynamics, we implemented linear warmup and linear decay via the Sentence-Transformers library [61], employing a learning rate of 5e-6 and training batch size of 64 across all experimental setups.
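
Illustration (not from the paper): the Huber loss quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is a minimal sketch of the standard Huber definition only. The function names `huber_loss` and `mean_huber`, the constant `DELTA_GRID`, and the sample numbers are hypothetical, and the paper's parametric model (Eq. 14) and its criterion for choosing δ are not reproduced here.

```python
# Minimal sketch of the standard Huber loss; names and sample data are
# hypothetical and are not taken from the WEAVER code repository.

# Threshold values quoted in the Experiment Setup excerpt.
DELTA_GRID = (0.01, 0.05, 0.1, 0.25, 0.5)


def huber_loss(r: float, delta: float) -> float:
    """Quadratic for |r| <= delta, linear beyond it."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


def mean_huber(predicted, observed, delta: float) -> float:
    """Average Huber loss between predictions and empirical means."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(huber_loss(p - o, delta) for p, o in zip(predicted, observed))
    return total / len(predicted)


if __name__ == "__main__":
    # Made-up values: the last observation is an outlier.
    predicted = [0.70, 0.75, 0.80, 0.85]
    observed = [0.71, 0.74, 0.80, 0.40]
    for delta in DELTA_GRID:
        print(delta, mean_huber(predicted, observed, delta))
```

The two branches agree at |r| = δ, so the loss is continuous there. For a large residual the penalty grows linearly rather than quadratically, which is the robustness property the excerpt describes.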