Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Non-myopic Generation of Language Models for Reasoning and Planning
Authors: Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show significant improvements across a wide range of tasks in math, coding, and agent-based scenarios. Furthermore, Predictive-Decoding demonstrates computational efficiency, outperforming search baselines while utilizing inference compute more effectively. This study provides insights into optimizing LLM planning capabilities. |
| Researcher Affiliation | Academia | Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong. The University of Hong Kong; Peking University; Zhejiang University; Westlake University; The Hong Kong University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 Predictive-Decoding for Planning |
| Open Source Code | Yes | Code is available at this repo. |
| Open Datasets | Yes | Our evaluation covers three domains: math, with GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021); coding, with HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021); and agents, with two agent tasks, ALFWorld (Shridhar et al., 2021) and PDDL (from AgentBoard, Ma et al., 2024), to understand planning ability in closed-loop interactions. |
| Dataset Splits | Yes | MBPP uses the 500-sample test set from Hugging Face. |
| Hardware Specification | Yes | We use 1-4 A100s to launch the LLMs with vLLM. |
| Software Dependencies | No | The paper mentions software like vLLM, Python, and various LLMs (GPT-3.5-Turbo, Mistral-7b, Llama3-8b, Llama3.1-70b, Deepseek-Coder-6.7b), but it does not provide specific version numbers for these components to ensure reproducibility. For example, 'vLLM' is mentioned without a version number, and while 'Python' is implied for PAL format, no specific Python version or library versions are stated. |
| Experiment Setup | Yes | Table 10: Hyperparameters for Predictive-Decoding main experiments. Predictive-Decoding: MATH α = 1.0, τ = 0.01, K = 8, T0 = 6; GSM8K α = 1.0, τ = 0.01, K = 8, T0 = 6; HumanEval α = 0.3, τ = 0.05, K = 8, T0 = 6; MBPP α = 1.0, τ = 0.1, K = 8, T0 = 6. Mistral-v0.3: MATH α = 1.0, τ = 0.01, K = 8, T0 = 6; GSM8K α = 1.0, τ = 0.01, K = 8, T0 = 6. Deepseek-Coder: HumanEval α = 0.4, τ = 1.0, K = 8, T0 = 6; MBPP α = 0.4, τ = 1.0, K = 8, T0 = 6. Llama3.1-70B: ALFWorld α = 1.0, τ = 0.01, K = 8, T0 = 5; PDDL α = 1.0, τ = 0.01, K = 8, T0 = 5. GPT-3.5-Turbo (Azure API): ALFWorld α = 0.6, τ = 0.01, K = 8, T0 = 5; PDDL α = 0.8, τ = 0.05, K = 8, T0 = 5. |
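For anyone attempting a reproduction, the Table 10 hyperparameters quoted above can be collected into a machine-readable config. This is a minimal sketch: the nesting (model → task → parameters) and the key names (`alpha`, `tau`, `K`, `T0`) are our own restructuring of the paper's symbols, not an interface from the authors' code, and only rows with an explicitly named model are included.

```python
# Hypothetical config mirroring Table 10 of the paper (model -> task -> params).
# Values are copied from the quoted table; the structure itself is an assumption.
HYPERPARAMS = {
    "Mistral-v0.3": {
        "MATH":  {"alpha": 1.0, "tau": 0.01, "K": 8, "T0": 6},
        "GSM8K": {"alpha": 1.0, "tau": 0.01, "K": 8, "T0": 6},
    },
    "Deepseek-Coder": {
        "HumanEval": {"alpha": 0.4, "tau": 1.0, "K": 8, "T0": 6},
        "MBPP":      {"alpha": 0.4, "tau": 1.0, "K": 8, "T0": 6},
    },
    "Llama3.1-70B": {
        "ALFWorld": {"alpha": 1.0, "tau": 0.01, "K": 8, "T0": 5},
        "PDDL":     {"alpha": 1.0, "tau": 0.01, "K": 8, "T0": 5},
    },
    "GPT-3.5-Turbo (Azure API)": {
        "ALFWorld": {"alpha": 0.6, "tau": 0.01, "K": 8, "T0": 5},
        "PDDL":     {"alpha": 0.8, "tau": 0.05, "K": 8, "T0": 5},
    },
}

def get_params(model: str, task: str) -> dict:
    """Look up the quoted hyperparameters for a (model, task) pair."""
    return HYPERPARAMS[model][task]
```

Note that all configurations share K = 8 candidate trajectories; only α, τ, and the lookahead horizon T0 vary across models and tasks, which makes a flat lookup like this convenient for scripted sweeps.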