Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VPO: Reasoning Preferences Optimization Based on $\mathcal{V}$-Usable Information
Authors: Zecheng Wang, Chunshan Li, Yupeng Zhang, Han Liu, Bingning Wang, Dianhui Chu, Dianbo Sui
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that, on both the base and instruct models of the Qwen-2.5 [31] and LLaMA 3.1 [14] series, VPO consistently exhibits superior overall performance on reasoning tasks compared to DPO and its variants. For instance, on the Qwen-2.5-7B-Base model, VPO outperforms DPO by 7.80%, 4.02%, 2.57%, 3.41%, and 12.25% on MATH500 [17], GSM8k [8], Minerva MATH [20], Olympiad MATH [16], and AMC23, respectively. We also conduct ablation experiments and in-depth analysis on VPO to explain its effectiveness and rationale. |
| Researcher Affiliation | Collaboration | Zecheng Wang1,2*, Chunshan Li1B, Yupeng Zhang2*, Han Liu3, Bingning Wang2B, Dianhui Chu1, and Dianbo Sui1B 1Harbin Institute of Technology 2WeChat, Tencent Inc 3Tsinghua University |
| Pseudocode | No | The paper describes the VPO methodology using mathematical equations and textual explanations, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code. |
| Open Source Code | No | Answer: [No] Justification: We will release our code in the near future. |
| Open Datasets | Yes | We use two types of models: Llama-3.1-8B [14] and Qwen-2.5-7B [31]... (1) GSM8K [8], a dataset containing real, high-quality elementary school math application problems; and (2) MATH [17], a dataset containing challenging mathematics competition problems. We evaluate the model's performance on standard mathematical reasoning benchmarks, including: (1) GSM8K [8]; (2) MATH500 [17]; (3) OlympiadBench-Math [16]; (4) AMC23; (5) Minerva Math [20]... We use the ARC dataset [7], which covers multiple scientific domains. |
| Dataset Splits | No | For Llama 3.1-8B-Base, Llama-3.1-8B Instruct, and Qwen-2.5-7B-Base, we retain 5 preference pairs for each query. For Qwen-2.5-7B-Instruct, we retain 10 preference pairs for each query. In total, the training data constructed for each model contains 30k-40k sample pairs. We test on the ARC-Challenge test set, which contains 1,172 questions. We sample 3.37k training samples from both the easy and challenge sets, performing 30 samplings per query to construct a preference dataset of approximately 10k pairs. The paper mentions the total size of constructed training data and using a predefined test set for ARC, but does not explicitly state the train/validation/test splits for the *constructed* preference datasets themselves with percentages or counts. |
| Hardware Specification | Yes | All experimental results in this paper were conducted on GPUs with 8 H800 and 8 H20 configurations, with the same GPU model used across all experiments in each set. |
| Software Dependencies | No | The paper mentions the use of LLMs (Llama 3.1 and Qwen 2.5 series) and refers to various methods (DPO, TDPO, SimPO, IPO, RPO) but does not provide specific version numbers for software libraries, frameworks, or programming languages (e.g., Python, PyTorch, CUDA versions) that would be necessary to replicate the experiment environment. |
| Experiment Setup | Yes | For training, we perform full-parameter preference optimization training on the model, training all baseline methods for 2 epochs with a learning rate of $5 \times 10^{-7}$. The coefficient β in the DPO loss is tuned in {0.05, 0.1, 0.5, 1.0}, and we end up using 0.05 in this experiment. For the parameters γ and γ/β in SimPO, we tried the following combinations: {2.0, 0.5}, {2.5, 0.55}, {10, 0.3}, {10, 0.5}, and {10, 0.1}. Ultimately, we select {10, 0.5} to train all models. |
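
To make the role of the coefficient β in the Experiment Setup row concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, evaluated with the reported baseline settings. It is not the authors' code (which is not released) and it does not implement VPO. The names `BASELINE_SETUP` and `dpo_loss`, and the argument names, are hypothetical and chosen only for illustration. The learning rate assumes the extracted value reads $5 \times 10^{-7}$.

```python
import math

# Baseline settings as quoted in the Experiment Setup row.
# The dictionary and its keys are illustrative, not taken from the paper's code.
BASELINE_SETUP = {
    "epochs": 2,
    "learning_rate": 5e-7,  # assumption: the extracted "5 10 7" means 5e-7
    "dpo_beta_grid": [0.05, 0.1, 0.5, 1.0],
    "dpo_beta": 0.05,       # value the paper reports selecting
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=BASELINE_SETUP["dpo_beta"]):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # When the policy equals the reference, the margin is 0 and the loss is log 2.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # A larger margin in favour of the chosen response lowers the loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

With a small β such as 0.05, the loss changes slowly with the log-probability margin, so a pair must be separated by a large margin before its gradient contribution fades.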