Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
What Are Step-Level Reward Models Rewarding? Counterintuitive Findings from MCTS-Boosted Mathematical Reasoning
Authors: Yiran Ma, Zui Chen, Tianqiao Liu, Mi Tian, Zhuo Liu, Zitao Liu, Weiqi Luo
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Main Results: After collecting all the step-level preference pairs through MCTS, datasets are constructed for FC-SRM, MO-SRM, SSMO-SRM, and NT-SRM training by selecting the corresponding components of each data item. The training curves are shown in Figure 3. These SRMs are subsequently used as scoring functions in greedy search, and the accuracy and absolute gains over the baseline are reported in Table 1. |
| Researcher Affiliation | Collaboration | 1 Zhejiang University, Hangzhou, China; 2 ShanghaiTech University, Shanghai, China; 3 TAL Education Group, Beijing, China; 4 University of Rochester, New York, USA; 5 Jinan University, Guangzhou, China. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Beam Search Algorithm |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | Yes | To construct step-level preference pairs through MCTS, we use the math problems and their corresponding final answers from the training data of GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021). |
| Dataset Splits | Yes | To construct step-level preference pairs through MCTS, we use the math problems and their corresponding final answers from the training data of GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021). The accuracies are evaluated on the test data. |
| Hardware Specification | Yes | Each SRM is trained on two instances, with each instance equipped with 8 A800 GPUs. |
| Software Dependencies | No | The paper mentions specific LLM models (Llama-3-8B-Instruct, DeepSeek-Math-7B-Base, Qwen2-7B) but does not provide version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | The MCTS requires the agent to sample n = 6 candidate actions at each expansion phase and iterates 500 times on each problem to evaluate the quality of each node. Notably, to avoid the influence of variation in answer format, we use a supervised fine-tuned (SFT) model based on DeepSeek-Math-7B-Base to assert the correctness of the solution after each rollout during the search. This model is also used in our evaluation pipeline. To strengthen the preferences, only preference pairs whose value difference is greater than 0.7 are assumed valid. For detailed hyperparameters, see Appendix. |
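The pair-filtering rule quoted above (keep only pairs whose MCTS value estimates differ by more than 0.7) can be sketched as follows. This is a minimal illustration under assumed data shapes, not the authors' implementation; the function name `build_preference_pairs` and the tuple format `(step_text, value)` are hypothetical.

```python
# Hypothetical sketch of step-level preference-pair filtering.
# Input: candidate steps from one MCTS expansion, each with an
# estimated value in [0, 1]. Only pairs whose value gap exceeds
# `min_gap` (0.7 in the paper's setup) are kept.

def build_preference_pairs(steps, min_gap=0.7):
    """Return (chosen, rejected) pairs whose value gap exceeds min_gap.

    The higher-valued step in each qualifying pair is treated as the
    preferred (chosen) step for SRM training.
    """
    pairs = []
    for i, (step_a, v_a) in enumerate(steps):
        for step_b, v_b in steps[i + 1:]:
            if abs(v_a - v_b) > min_gap:
                chosen, rejected = (
                    (step_a, step_b) if v_a > v_b else (step_b, step_a)
                )
                pairs.append((chosen, rejected))
    return pairs

# Example: n = 6 candidate steps, as in the paper's expansion phase.
candidates = [("s1", 0.95), ("s2", 0.10), ("s3", 0.60),
              ("s4", 0.85), ("s5", 0.05), ("s6", 0.40)]
pairs = build_preference_pairs(candidates)
# Only widely separated pairs survive, e.g. ("s1", "s2").
```

With the strict 0.7 threshold, pairs of similarly valued steps (e.g. 0.60 vs. 0.85) are discarded, which trades training-set size for cleaner preference signal.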