Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Enhancing Decision-Making of Large Language Models via Actor-Critic
Authors: Heng Dong, Kefei Duan, Chongjie Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across diverse environments including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) demonstrate the framework's generality and superiority over state-of-the-art methods. |
| Researcher Affiliation | Academia | *Equal contribution. 1IIIS, Tsinghua University; 2Washington University in St. Louis. Correspondence to: Heng Dong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 LAC: LLM-based Actor-Critic algorithm. |
| Open Source Code | Yes | The code of LAC is publicly available on GitHub and on the project website. |
| Open Datasets | Yes | We benchmark our method LAC on three benchmarks that cover high-level action space (ALFWorld (Shridhar et al., 2021)), low-level action space (BabyAI-Text (Chevalier-Boisvert et al., 2018)) and potentially infinite action space (WebShop (Yao et al., 2022)). |
| Dataset Splits | Yes | Following ReAct, we evaluate all 134 unseen evaluation games in a task-specific setup. ... We evaluate on the test environment in BabyAI-Text. ... we evaluate on 50 tasks for each task type, yielding 300 tasks total. ... We evaluate 100 tasks in WebShop for our method and all the baselines. |
| Hardware Specification | Yes | We use an A100 GPU with 80 GB memory to fine-tune our model. ... We use an A100 GPU with 80 GB memory to evaluate our method. |
| Software Dependencies | No | The paper mentions using techniques like LoRA and external implementations (LATS with MCTS) but does not provide specific version numbers for programming languages, libraries, or frameworks used in their own implementation. |
| Experiment Setup | Yes | We fine-tune models for 1,000 steps with learning rate 2.5e-5 and batch size 2. ... The number of candidate actions n is a hyperparameter, set to 5 in our experiments... For ALFWorld, we set α as 1... For BabyAI-Text, we conduct a grid search over {1/2, 1, 2, 5, 10} for α... For WebShop, we also conduct a grid search over {1/10, 1, 10} to find the best α. ... We set the maximum prediction step as 4, ... We set the maximum horizon length to 40 for ALFWorld, 30 for BabyAI-Text, and 15 for WebShop. |
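The reported hyperparameters can be collected into a single configuration sketch. This is an illustrative summary of the quoted values only, not the authors' code; all names (`FINETUNE`, `BENCHMARKS`, etc.) are hypothetical, and the α values actually selected by the BabyAI-Text and WebShop grid searches are not stated in the quote, so they are left as `None`.

```python
# Hypothetical config sketch of the LAC experiment setup quoted above.
# Names are illustrative; only the numeric values come from the paper quote.
FINETUNE = {
    "steps": 1_000,          # fine-tuning steps
    "learning_rate": 2.5e-5,
    "batch_size": 2,
}

NUM_CANDIDATE_ACTIONS = 5    # n, number of sampled candidate actions
MAX_PREDICTION_STEPS = 4     # maximum prediction step

# Per-benchmark α (actor-critic weighting) and maximum horizon length.
# For BabyAI-Text and WebShop, α was chosen by grid search; the selected
# value is not given in the quote, so it is left unset here.
BENCHMARKS = {
    "ALFWorld":    {"alpha": 1,    "alpha_grid": None,
                    "max_horizon": 40},
    "BabyAI-Text": {"alpha": None, "alpha_grid": [1/2, 1, 2, 5, 10],
                    "max_horizon": 30},
    "WebShop":     {"alpha": None, "alpha_grid": [1/10, 1, 10],
                    "max_horizon": 15},
}
```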