ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
Authors: Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReSTEM and Self-Rewarding LM. |
| Researcher Affiliation | Academia | Dan Zhang1, Sining Zhoubian1, Ziniu Hu2, Yisong Yue2, Yuxiao Dong1, Jie Tang1; 1The Knowledge Engineering Group (KEG), Tsinghua University; 2California Institute of Technology |
| Pseudocode | Yes | Algorithm 1: Mutual self-training ReST-MCTS* for value model and policy model. Algorithm 2: The proposed value guided search algorithm MCTS*. |
| Open Source Code | Yes | We release all code at https://github.com/THUDM/ReST-MCTS. |
| Open Datasets | Yes | Aiming to gather value train data for science, we integrate questions of a lean science dataset D_sci within SciInstruct [10] into D_0. This dataset consists of 11,554 questions, where each question is paired with a correct step-by-step solution. For math, we integrate the MATH [33] train set into D_0. |
| Dataset Splits | No | The paper states 'We split D_V0 and use the train set to finetune...' and 'We use the test set containing 14k data samples to evaluate the value model...' but does not provide specific percentages or counts for the training, validation, and test splits needed to reproduce the exact data partitioning. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU/CPU models, memory) used for running its experiments, only providing average running times. |
| Software Dependencies | No | The paper mentions 'AdamW optimizer [62]' but does not provide specific version numbers for other key software components or libraries. |
| Experiment Setup | Yes | Note that the learning rate is set to 1e-6 in this process. For ReST-MCTS*, self-critic is used and the ending threshold is also set to 0.9. The rollout step limit m is set to 2, α is set to 0.5, and the number of iterations T is set to 50 by default. Moreover, both tree search algorithms use b = 3 by default, where b is the number of samples generated in the expansion process as mentioned in the former sections. |
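The hyperparameters quoted in the Experiment Setup row (b = 3 samples per expansion, T = 50 iterations, ending threshold 0.9) can be sketched in a minimal value-guided tree search. This is a hedged illustration only: the `policy_sample` and `value_model` stubs below stand in for the paper's LLM policy and learned process reward model, and the UCT selection rule is a generic MCTS choice, not necessarily the exact MCTS* formulation.

```python
import math
import random

class Node:
    """A search-tree node holding a partial reasoning trace (list of steps)."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated value estimates from the reward model

def policy_sample(state, b=3):
    """Stub policy: propose b candidate next steps (placeholder strings)."""
    return [state + [f"step{random.randint(0, 9)}"] for _ in range(b)]

def value_model(state):
    """Stub process reward model: random score in [0, 1]."""
    return random.random()

def select(root, c=1.4):
    """Descend by a standard UCT rule until reaching a leaf."""
    node = root
    while node.children:
        parent = node
        node = max(
            parent.children,
            key=lambda n: n.value / (n.visits + 1e-9)
            + c * math.sqrt(math.log(parent.visits + 1) / (n.visits + 1e-9)),
        )
    return node

def search(root_state, T=50, b=3, threshold=0.9):
    """Run T iterations of select/expand/evaluate/backpropagate;
    stop early once a partial trace scores above the value threshold."""
    root = Node(root_state)
    best_node, best_value = root, 0.0
    for _ in range(T):
        leaf = select(root)
        for child_state in policy_sample(leaf.state, b):  # expansion: b samples
            child = Node(child_state, parent=leaf)
            leaf.children.append(child)
            v = value_model(child_state)  # process-reward evaluation
            if v > best_value:
                best_node, best_value = child, v
            node = child  # backpropagate the value estimate to the root
            while node is not None:
                node.visits += 1
                node.value += v
                node = node.parent
        if best_value >= threshold:  # ending threshold of 0.9 by default
            break
    return best_node, best_value
```

With the stubs replaced by a real policy and value model, the returned trace would be the high-value reasoning path collected as self-training data.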