Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Non-myopic Generation of Language Models for Reasoning and Planning

Authors: Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
    "Our experiments show significant improvements across a wide range of tasks in math, coding, and agent-based scenarios. Furthermore, Predictive-Decoding demonstrates computational efficiency, outperforming search baselines while utilizing inference compute more effectively. This study provides insights into optimizing LLM planning capabilities."
Researcher Affiliation: Academia
    "Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong — The University of Hong Kong, Peking University, Zhejiang University, Westlake University, The Hong Kong University of Science and Technology"
Pseudocode: Yes
    "Algorithm 1: Predictive-Decoding for Planning"
Open Source Code: Yes
    "Code is available at this repo."
Open Datasets: Yes
    "Our evaluation covers three domains: math (GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021)), coding (HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021)), and agents (AlfWorld (Shridhar et al., 2021) and PDDL from AgentBoard (Ma et al., 2024)) to understand planning ability in closed-loop interactions."
Dataset Splits: Yes
    "MBPP uses the 500-sample test set from Hugging Face."
Hardware Specification: Yes
    "We use 1-4 A100s to launch the LLMs with vLLM."
Software Dependencies: No
    The paper mentions software such as vLLM, Python, and various LLMs (GPT-3.5-Turbo, Mistral-7b, Llama3-8b, Llama3.1-70b, Deepseek-Coder-6.7b), but it does not provide specific version numbers for these components. For example, vLLM is mentioned without a version number, and while Python is implied for the PAL format, no Python or library versions are stated.
Experiment Setup: Yes
    Table 10: Hyperparameters for Predictive-Decoding main experiments.

    | Method              | Model                     | Task      | Hyperparameters                  |
    |---------------------|---------------------------|-----------|----------------------------------|
    | Predictive-Decoding |                           | MATH      | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     |                           | GSM8K     | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     |                           | HumanEval | α = 0.3, τ = 0.05, K = 8, T0 = 6 |
    |                     |                           | MBPP      | α = 1.0, τ = 0.1, K = 8, T0 = 6  |
    |                     | Mistral-v0.3              | MATH      | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     |                           | GSM8K     | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     | Deepseek-Coder            | HumanEval | α = 0.4, τ = 1.0, K = 8, T0 = 6  |
    |                     |                           | MBPP      | α = 0.4, τ = 1.0, K = 8, T0 = 6  |
    |                     | Llama3.1-70B              | AlfWorld  | α = 1.0, τ = 0.01, K = 8, T0 = 5 |
    |                     |                           | PDDL      | α = 1.0, τ = 0.01, K = 8, T0 = 5 |
    |                     | GPT-3.5-Turbo (Azure API) | AlfWorld  | α = 0.6, τ = 0.01, K = 8, T0 = 5 |
    |                     |                           | PDDL      | α = 0.8, τ = 0.05, K = 8, T0 = 5 |
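The pseudocode entry references Algorithm 1 (Predictive-Decoding for Planning). The following is an illustrative sketch only, not the authors' implementation: it assumes the algorithm's general shape of sampling K lookahead trajectories of horizon T0, scoring each full trajectory, and softmax-reweighting with temperature τ before committing to a single next step. `sample_trajectory` and `score` are hypothetical stand-ins for an LLM rollout and its likelihood.

```python
import math
import random

def predictive_decoding(sample_trajectory, score, state, K=8, T0=6, tau=0.01):
    """Toy sketch of a non-myopic decoding step.

    Samples K foresight trajectories of horizon T0 from `state`, scores
    each whole trajectory, and picks the next step in proportion to a
    softmax over trajectory scores with temperature `tau`.
    """
    trajectories = [sample_trajectory(state, T0) for _ in range(K)]
    scores = [score(t) for t in trajectories]
    # Softmax over trajectory scores (max-subtracted for numerical stability).
    m = max(scores)
    weights = [math.exp((s - m) / tau) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample a trajectory by its reweighted probability, but commit only
    # to its first step: the lookahead informs a single next action.
    r, acc = random.random(), 0.0
    for traj, p in zip(trajectories, probs):
        acc += p
        if r <= acc:
            return traj[0]
    return trajectories[-1][0]
```

With a small τ (as in Table 10), the softmax sharpens toward the best-scoring lookahead, so the chosen next step is effectively the first action of the highest-scoring trajectory.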
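For reproduction scripting, the explicitly named (model, task) rows of Table 10 can be captured as a small config mapping. The dictionary and helper below are illustrative and not taken from the authors' codebase; rows whose model name is not given in the table are omitted.

```python
# Hyperparameters from Table 10, keyed by (model, task).
# Key and field names are illustrative, not the authors' identifiers.
PREDICTIVE_DECODING_HPARAMS = {
    ("Mistral-v0.3", "MATH"):        dict(alpha=1.0, tau=0.01, K=8, T0=6),
    ("Mistral-v0.3", "GSM8K"):       dict(alpha=1.0, tau=0.01, K=8, T0=6),
    ("Deepseek-Coder", "HumanEval"): dict(alpha=0.4, tau=1.0, K=8, T0=6),
    ("Deepseek-Coder", "MBPP"):      dict(alpha=0.4, tau=1.0, K=8, T0=6),
    ("Llama3.1-70B", "AlfWorld"):    dict(alpha=1.0, tau=0.01, K=8, T0=5),
    ("Llama3.1-70B", "PDDL"):        dict(alpha=1.0, tau=0.01, K=8, T0=5),
    ("GPT-3.5-Turbo", "AlfWorld"):   dict(alpha=0.6, tau=0.01, K=8, T0=5),
    ("GPT-3.5-Turbo", "PDDL"):       dict(alpha=0.8, tau=0.05, K=8, T0=5),
}

def hparams_for(model, task):
    """Look up the run configuration for a (model, task) pair."""
    return PREDICTIVE_DECODING_HPARAMS[(model, task)]
```

Note that K = 8 is constant across every reported configuration, while α, τ, and the horizon T0 vary by model and task.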