Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Enhancing Decision-Making of Large Language Models via Actor-Critic

Authors: Heng Dong, Kefei Duan, Chongjie Zhang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across diverse environments including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) demonstrate the framework's generality and superiority over state-of-the-art methods.
Researcher Affiliation | Academia | *Equal contribution 1IIIS, Tsinghua University 2Washington University in St. Louis. Correspondence to: Heng Dong <EMAIL>.
Pseudocode | Yes | Algorithm 1 LAC: LLM-based Actor-Critic algorithm.
Open Source Code | Yes | The code of LAC is publicly available on GitHub1 and website2.
Open Datasets | Yes | We benchmark our method LAC on three benchmarks that cover high-level action space (ALFWorld (Shridhar et al., 2021)), low-level action space (BabyAI-Text (Chevalier-Boisvert et al., 2018)) and potentially infinite action space (WebShop (Yao et al., 2022)).
Dataset Splits | Yes | Following ReAct, we evaluate all 134 unseen evaluation games in a task-specific setup. ... We evaluate on the test environment in BabyAI-Text. ... we evaluate on 50 tasks for each task type, yielding 300 tasks total. ... We evaluate 100 tasks in WebShop for our method and all the baselines.
Hardware Specification | Yes | We use A100 GPU with 80GB memory to fine-tune our model. ... We use A100 GPU with 80GB memory to evaluate our method.
Software Dependencies | No | The paper mentions using techniques like LoRA and external implementations (LATS with MCTS) but does not provide specific version numbers for programming languages, libraries, or frameworks used in their own implementation.
Experiment Setup | Yes | We fine-tune models for 1,000 steps with learning rate 2.5e-5 and batch size 2. ... The number of candidate actions n is a hyperparameter, set to 5 in our experiments... For ALFWorld, we set α as 1... For BabyAI-Text, we conduct a grid-search over {1/2, 1, 2, 5, 10} for α... For WebShop, we also conduct a grid search over {1/10, 1, 10} to find the best α. ... We set the maximum prediction step as 4, ... We set the maximum horizon length to 40 for ALFWorld, 30 for BabyAI-Text, and 15 for WebShop.
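For anyone attempting to reproduce the setup, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The `LACConfig` class and its field names below are hypothetical (the paper does not publish its config schema); only the numeric values are taken from the quoted text.

```python
from dataclasses import dataclass, field

@dataclass
class LACConfig:
    # Fine-tuning settings quoted from the paper
    finetune_steps: int = 1_000
    learning_rate: float = 2.5e-5
    batch_size: int = 2
    # Number of candidate actions n sampled from the LLM actor
    n_candidates: int = 5
    # Maximum prediction step used by the critic
    max_prediction_steps: int = 4
    # Maximum horizon length per environment
    max_horizon: dict = field(default_factory=lambda: {
        "ALFWorld": 40,
        "BabyAI-Text": 30,
        "WebShop": 15,
    })
    # Grid-search spaces for the weighting coefficient alpha
    # (ALFWorld uses a fixed alpha = 1)
    alpha_grid: dict = field(default_factory=lambda: {
        "ALFWorld": [1.0],
        "BabyAI-Text": [0.5, 1.0, 2.0, 5.0, 10.0],
        "WebShop": [0.1, 1.0, 10.0],
    })

cfg = LACConfig()
```

A reproduction run would then iterate over `cfg.alpha_grid[env]` for BabyAI-Text and WebShop to recover the best-performing α reported in the paper.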