Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Policy Guided Tree Search for Enhanced LLM Reasoning
Authors: Yang Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs. |
| Researcher Affiliation | Industry | 1Independent Researcher. Correspondence to: Yang Li <EMAIL>. The author is listed as an 'Independent Researcher' with a generic email domain (gmail.com). The classification reflects that 'Independent Researcher' is neither a university nor a public research institution; with a single author, 'collaboration' is ruled out, leaving 'Industry' as the closest category for a private-sector endeavor, though not a traditional corporation or lab. |
| Pseudocode | Yes | A. Training Algorithm. Algorithm 1 Policy Training in TS-MDP |
| Open Source Code | No | The paper does not contain any explicit statements about making the source code for their methodology publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | For mathematical reasoning, we evaluate our framework on the GSM8K (Cobbe et al., 2021), MATH500 (Hendrycks et al., 2021; Lightman et al., 2023), and AQUA (Ling et al., 2017) datasets... For commonsense reasoning, we evaluate StrategyQA (Geva et al., 2021)... For logical reasoning, we evaluate the ProntoQA (Saparov & He, 2022) dataset... Finally, for the planning task, we evaluate our framework on the Blocksworld benchmark (Valmeekam et al., 2022). |
| Dataset Splits | Yes | GSM8K ... We adhere to the original train-test split... MATH500 ... We use the original training split to train our PGTS policy... AQUA ... We adhere to the original train-test split... StrategyQA ... We randomly sample 1,000 examples for testing, with the remaining examples used to train the PGTS policy... ProntoQA ... We adopt the same dataset splits as used in RAP (Hao et al., 2023)... Blocksworld ... using the same dataset splits and prompt template as RAP (Hao et al., 2023). Specifically, split-v1 is used for training and split-v2 for testing, with a 4-shot setup. |
| Hardware Specification | No | The paper mentions using "LLaMA3.1-8B and LLaMA3.1-70B models" but does not provide any specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions several frameworks and models like GPS layers, Proximal Policy Optimization (PPO), and distilbert-base-uncased, but does not provide specific version numbers for any software, libraries, or programming languages used. |
| Experiment Setup | Yes | For mathematical reasoning, we evaluate our framework on the GSM8K (Cobbe et al., 2021), MATH500 (Hendrycks et al., 2021; Lightman et al., 2023), and AQUA (Ling et al., 2017) datasets, using 4-shot settings for GSM8K and MATH500, and a 10-shot setting for AQUA. The in-context learning (ICL) examples are adapted from OpenCompass (Contributors, 2023) for GSM8K and MATH500, and from LLMReasoner (Hao et al., 2024) for AQUA. For commonsense reasoning, we evaluate StrategyQA (Geva et al., 2021) in a 5-shot setting... For logical reasoning, we evaluate the ProntoQA (Saparov & He, 2022) dataset for logical deduction in a 5-shot setting and the GPQA (Rein et al., 2023) dataset for graduate-level multiple-choice questions in a 0-shot setting... MCTS and PGTS utilize weighted voting based on the reward of each reasoning chain... For both MCTS and PGTS approaches, the tree breadth is limited to 4 child nodes per parent... For PGTS, we train the policy using up to 1,000 examples from the training split of each dataset... The policy architecture consists of two GPS layers followed by a single linear layer for action and value prediction. To simplify experiments, the action cost C(a) is fixed across datasets, with values of 0.1, 0.2, 0.5, and 0.0 for expand, branch, backtrack, and terminate actions, respectively. Key PPO hyperparameters include: PPO Clip Range: 0.2, Discount Factor (γ): 0.99, GAE Lambda (λ): 0.95. |
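The PPO quantities quoted in the Experiment Setup cell (clip range 0.2, discount γ = 0.99, GAE λ = 0.95) and the fixed action costs C(a) can be made concrete with a minimal sketch. This is not the authors' code; the function and variable names are illustrative assumptions, and only the numeric values come from the paper.

```python
# Reported hyperparameters (from the paper's Experiment Setup).
GAMMA, GAE_LAMBDA, CLIP_RANGE = 0.99, 0.95, 0.2

# Fixed per-action costs C(a), as reported.
ACTION_COST = {"expand": 0.1, "branch": 0.2, "backtrack": 0.5, "terminate": 0.0}

def gae_advantages(rewards, values, gamma=GAMMA, lam=GAE_LAMBDA):
    """Generalized Advantage Estimation over one trajectory.

    `values` carries one extra bootstrap entry, so
    len(values) == len(rewards) + 1.
    """
    advantages, gae = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        # TD residual, then the exponentially weighted backward recursion.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clipped_objective(ratio, advantage, clip=CLIP_RANGE):
    """Per-sample PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)
```

For example, a probability ratio of 1.5 with a positive advantage is clipped to 1.2 under the reported clip range, capping the policy update; the net reward for a tree step would subtract the relevant `ACTION_COST` entry before advantage estimation.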