Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Non-myopic Generation of Language Models for Reasoning and Planning

Authors: Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
    "Our experiments show significant improvements across a wide range of tasks in math, coding, and agent-based scenarios. Furthermore, Predictive-Decoding demonstrates computational efficiency, outperforming search baselines while utilizing inference compute more effectively. This study provides insights into optimizing LLM planning capabilities."
Researcher Affiliation: Academia
    "Chang Ma, Haiteng Zhao, Junlei Zhang, Junxian He, Lingpeng Kong — The University of Hong Kong, Peking University, Zhejiang University, Westlake University, The Hong Kong University of Science and Technology"
Pseudocode: Yes
    "Algorithm 1: Predictive-Decoding for Planning"
Open Source Code: Yes
    "Code is available at this repo."
Open Datasets: Yes
    "Our evaluation covers three domains: math (GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021)), coding (HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021)), and agents (AlfWorld (Shridhar et al., 2021) and PDDL from AgentBoard (Ma et al., 2024)) to understand planning ability in closed-loop interactions."
Dataset Splits: Yes
    "MBPP uses the 500-sample test set from Hugging Face."
Hardware Specification: Yes
    "We use 1-4 A100s to launch the LLMs with vLLM."
Software Dependencies: No
    The paper mentions software such as vLLM, Python, and various LLMs (GPT-3.5-Turbo, Mistral-7b, Llama3-8b, Llama3.1-70b, Deepseek-Coder-6.7b), but it does not provide specific version numbers for these components. For example, vLLM is mentioned without a version number, and while Python is implied for the PAL format, no Python or library versions are stated.
Experiment Setup: Yes
    Table 10: Hyperparameters for Predictive-Decoding main experiments.

    | Method              | Model                     | Task      | Hyperparameters                  |
    |---------------------|---------------------------|-----------|----------------------------------|
    | Predictive-Decoding |                           | MATH      | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     |                           | GSM8K     | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     |                           | HumanEval | α = 0.3, τ = 0.05, K = 8, T0 = 6 |
    |                     |                           | MBPP      | α = 1.0, τ = 0.1, K = 8, T0 = 6  |
    |                     | Mistral-v0.3              | MATH      | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     |                           | GSM8K     | α = 1.0, τ = 0.01, K = 8, T0 = 6 |
    |                     | Deepseek-Coder            | HumanEval | α = 0.4, τ = 1.0, K = 8, T0 = 6  |
    |                     |                           | MBPP      | α = 0.4, τ = 1.0, K = 8, T0 = 6  |
    |                     | Llama3.1-70B              | AlfWorld  | α = 1.0, τ = 0.01, K = 8, T0 = 5 |
    |                     |                           | PDDL      | α = 1.0, τ = 0.01, K = 8, T0 = 5 |
    |                     | GPT-3.5-Turbo (Azure API) | AlfWorld  | α = 0.6, τ = 0.01, K = 8, T0 = 5 |
    |                     |                           | PDDL      | α = 0.8, τ = 0.05, K = 8, T0 = 5 |
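The pseudocode entry references Algorithm 1 (Predictive-Decoding for Planning). The following is an illustrative sketch only, not the authors' implementation: it assumes the algorithm's general shape of sampling K lookahead trajectories of horizon T0, scoring each full trajectory, and softmax-reweighting with temperature τ before committing to a single next step. `sample_trajectory` and `score` are hypothetical stand-ins for an LLM rollout and its likelihood.

```python
import math
import random

def predictive_decoding(sample_trajectory, score, state, K=8, T0=6, tau=0.01):
    """Toy sketch of a non-myopic decoding step.

    Samples K foresight trajectories of horizon T0 from `state`, scores
    each whole trajectory, and picks the next step in proportion to a
    softmax over trajectory scores with temperature `tau`.
    """
    trajectories = [sample_trajectory(state, T0) for _ in range(K)]
    scores = [score(t) for t in trajectories]
    # Softmax over trajectory scores (max-subtracted for numerical stability).
    m = max(scores)
    weights = [math.exp((s - m) / tau) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample a trajectory by its reweighted probability, but commit only
    # to its first step: the lookahead informs a single next action.
    r, acc = random.random(), 0.0
    for traj, p in zip(trajectories, probs):
        acc += p
        if r <= acc:
            return traj[0]
    return trajectories[-1][0]
```

With a small τ (as in Table 10), the softmax sharpens toward the best-scoring lookahead, so the chosen next step is effectively the first action of the highest-scoring trajectory.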
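For reproduction scripting, the explicitly named (model, task) rows of Table 10 can be captured as a small config mapping. The dictionary and helper below are illustrative and not taken from the authors' codebase; rows whose model name is not given in the table are omitted.

```python
# Hyperparameters from Table 10, keyed by (model, task).
# Key and field names are illustrative, not the authors' identifiers.
PREDICTIVE_DECODING_HPARAMS = {
    ("Mistral-v0.3", "MATH"):        dict(alpha=1.0, tau=0.01, K=8, T0=6),
    ("Mistral-v0.3", "GSM8K"):       dict(alpha=1.0, tau=0.01, K=8, T0=6),
    ("Deepseek-Coder", "HumanEval"): dict(alpha=0.4, tau=1.0, K=8, T0=6),
    ("Deepseek-Coder", "MBPP"):      dict(alpha=0.4, tau=1.0, K=8, T0=6),
    ("Llama3.1-70B", "AlfWorld"):    dict(alpha=1.0, tau=0.01, K=8, T0=5),
    ("Llama3.1-70B", "PDDL"):        dict(alpha=1.0, tau=0.01, K=8, T0=5),
    ("GPT-3.5-Turbo", "AlfWorld"):   dict(alpha=0.6, tau=0.01, K=8, T0=5),
    ("GPT-3.5-Turbo", "PDDL"):       dict(alpha=0.8, tau=0.05, K=8, T0=5),
}

def hparams_for(model, task):
    """Look up the run configuration for a (model, task) pair."""
    return PREDICTIVE_DECODING_HPARAMS[(model, task)]
```

Note that K = 8 is constant across every reported configuration, while α, τ, and the horizon T0 vary by model and task.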