Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Empowering LLM Agents with Zero-Shot Optimal Decision-Making through Q-learning
Authors: Jiajun Chai, Sicheng Li, Yuqian Fu, Dongbin Zhao, Yuanheng Zhu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MLAQ in benchmarks that present significant challenges for existing LLM agents. Results show that MLAQ achieves an optimal rate of over 90% in tasks where other methods struggle to succeed. Additional experiments support the conclusion that introducing model-based RL into LLM agents shows significant potential to improve optimal decision-making ability. Empirically, we evaluate MLAQ on well-known benchmarks for LLM agents (Blocks World (Valmeekam et al., 2022) and the RoCo-benchmark (Mandi et al., 2023)), which require optimal decision-making over long horizons. No existing LLM agent has successfully obtained the optimal policy, while MLAQ achieves over 90% optimal/success rate across most difficulty levels. The comparison with methods including RoCo (Mandi et al., 2023) and RAP (Hao et al., 2023) demonstrates MLAQ's superior performance in optimal decision-making. Through comparative and ablation experiments, we reach a key conclusion: integrating the model-based RL framework with an LLM agent in the form of MLAQ can effectively achieve zero-shot optimal decision-making. |
| Researcher Affiliation | Academia | Jiajun Chai 1,2, Sicheng Li 1,2, Yuqian Fu 1,2, Dongbin Zhao 1,2, Yuanheng Zhu 1,2; 1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; EMAIL |
| Pseudocode | Yes | F PSEUDO-CODE FOR THE OVERALL ALGORITHM: We present detailed pseudo-code for our MLAQ framework in this section; these algorithms fully describe the overall process presented in Figure 1 and Figure 2. We have simplified the entire algorithm process into the flowchart in Figure 9 to facilitate a better understanding of how MLAQ makes optimal decisions for a new task. Algorithm 1: Obtaining optimal decision sequence for MLAQ agent; Algorithm 2: Imagination(s0, s; τ, M, D) guided by UCB values; Algorithm 3: Expand-Buffer(s0, s; M, D); Algorithm 4: Q-Update(s0, s; X) |
| Open Source Code | No | Our interactive website is available at this link. (end of abstract) While an interactive website is mentioned, there is no explicit statement about making the source code available for the methodology described in the paper, nor a direct link to a code repository. |
| Open Datasets | Yes | Empirically, we evaluate MLAQ on well-known benchmarks for LLM agents (Blocks World (Valmeekam et al., 2022) and the RoCo-benchmark (Mandi et al., 2023)), which require optimal decision-making over long horizons. We conduct experiments on the Blocks World benchmark (Valmeekam et al., 2022) for the single-agent setting and the RoCo-benchmark (Mandi et al., 2023) for the multi-agent setting. |
| Dataset Splits | No | We conduct experiments on the Blocks World benchmark (Valmeekam et al., 2022) for the single-agent setting and the RoCo-benchmark (Mandi et al., 2023) for the multi-agent setting. Agents in these domains require multi-step decision-making to achieve the final goal, necessitating the ability to maximize expected future rewards. Additionally, the decision-making space for LLM agents in the RoCo-benchmark is significantly larger than that in Blocks World due to the presence of multiple agents. The details of these benchmarks can be found in Appendix D.1. Blocks World: we choose the tasks with four blocks to evaluate our methods. Hao et al. (2023) grouped all tasks according to the optimal step, and we randomly choose up to 30 tasks from each group for evaluation. The paper describes how tasks are selected for evaluation in Blocks World (randomly choosing up to 30 tasks from each group), but does not specify exact percentages, sample counts, or a random seed, nor standard train/test/validation splits relevant to model training. |
| Hardware Specification | No | In this paper, all experiments are conducted using the GPT API interface, without involving CPU or GPU usage. The total cost of the API resources used in this paper does not exceed 1500 US dollars, including preliminary tests, comparative experiments, and ablation experiments. The paper explicitly states that experiments were conducted using an API interface and did not involve the authors' local CPU or GPU usage for running experiments, therefore no specific hardware details are provided by the authors. |
| Software Dependencies | Yes | Table 10: Hyper-parameters presented in the MLAQ training process. LLM source gpt-4-0125-preview. |
| Experiment Setup | Yes | Table 10: Hyper-parameters presented in the MLAQ training process. LLM source: gpt-4-0125-preview; Learning rate α: 1.0; Discount γ: 0.995; UCB weight wg: 2; v-UCB weight wg: 4; v-UCB threshold ϵg: 0; Maximum trial number for imagination Kc: 2; Maximum trial number for prediction Ks: 2; Maximum trial number for policy Ka: 2; Q threshold Q: 0.5; Q update loops: 20; Environmental horizon T: 20 / 8 / 16. |
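For readers checking whether the reported setup pins down the learning dynamics, the pieces quoted above (a tabular Q-Update routine, UCB-guided imagination, learning rate α = 1.0, discount γ = 0.995, UCB weight 2, and 20 Q-update loops) can be sketched as follows. This is a minimal illustrative sketch only, under the assumption of a standard tabular Bellman backup and a standard UCB exploration bonus; the function names, transition-buffer format, and state/action labels are hypothetical and are not taken from the MLAQ paper.

```python
import math
from collections import defaultdict

# Values quoted from Table 10 of the paper; everything else is an assumption.
ALPHA, GAMMA, UCB_W, Q_UPDATE_LOOPS = 1.0, 0.995, 2.0, 20

def q_update(Q, buffer):
    """One sweep of tabular Bellman backups over (s, a, r, s_next, done) tuples."""
    for s, a, r, s_next, done in buffer:
        target = r if done else r + GAMMA * max(Q[s_next].values(), default=0.0)
        Q[s][a] += ALPHA * (target - Q[s][a])  # alpha = 1.0 -> direct assignment
    return Q

def ucb_action(Q, counts, s, actions):
    """Pick the action maximizing Q(s, a) + w * sqrt(ln(N) / n_a); untried actions win."""
    total = sum(counts[s][a] for a in actions) + 1
    def score(a):
        n = counts[s][a]
        bonus = UCB_W * math.sqrt(math.log(total) / n) if n > 0 else float("inf")
        return Q[s][a] + bonus
    return max(actions, key=score)

Q = defaultdict(lambda: defaultdict(float))
counts = defaultdict(lambda: defaultdict(int))
# One imagined terminal transition (hypothetical labels), swept 20 times.
for _ in range(Q_UPDATE_LOOPS):
    q_update(Q, [("s0", "stack", 1.0, "goal", True)])
```

Note that with α = 1.0 each backup simply overwrites the Q-value with its target, so repeated sweeps are only needed to propagate values backward along longer imagined trajectories.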