Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Enhancing Decision-Making of Large Language Models via Actor-Critic
Authors: Heng Dong, Kefei Duan, Chongjie Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across diverse environments including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) demonstrate the framework's generality and superiority over state-of-the-art methods. |
| Researcher Affiliation | Academia | *Equal contribution. 1IIIS, Tsinghua University; 2Washington University in St. Louis. Correspondence to: Heng Dong <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 LAC: LLM-based Actor-Critic algorithm. |
| Open Source Code | Yes | The code of LAC is publicly available on GitHub and on the project website. |
| Open Datasets | Yes | We benchmark our method LAC on three benchmarks that cover high-level action space (ALFWorld (Shridhar et al., 2021)), low-level action space (BabyAI-Text (Chevalier-Boisvert et al., 2018)) and potentially infinite action space (WebShop (Yao et al., 2022)). |
| Dataset Splits | Yes | Following ReAct, we evaluate all 134 unseen evaluation games in a task-specific setup. ... We evaluate on the test environment in BabyAI-Text. ... we evaluate on 50 tasks for each task type, yielding 300 tasks total. ... We evaluate 100 tasks in WebShop for our method and all the baselines. |
| Hardware Specification | Yes | We use an A100 GPU with 80 GB memory to fine-tune our model. ... We use an A100 GPU with 80 GB memory to evaluate our method. |
| Software Dependencies | No | The paper mentions using techniques like LoRA and external implementations (LATS with MCTS) but does not provide specific version numbers for programming languages, libraries, or frameworks used in their own implementation. |
| Experiment Setup | Yes | We fine-tune models for 1,000 steps with learning rate 2.5e-5 and batch size 2. ... The number of candidate actions n is a hyperparameter, set to 5 in our experiments... For ALFWorld, we set α as 1... For BabyAI-Text, we conduct a grid search over {1/2, 1, 2, 5, 10} for α... For WebShop, we also conduct a grid search over {1/10, 1, 10} to find the best α. ... We set the maximum prediction step as 4, ... We set the maximum horizon length to 40 for ALFWorld, 30 for BabyAI-Text, and 15 for WebShop. |
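The reported hyperparameters can be collected into a single configuration sketch. This is an illustrative summary of the quoted values only, not the authors' code; all names (`FINETUNE`, `BENCHMARKS`, etc.) are hypothetical, and the α values actually selected by the BabyAI-Text and WebShop grid searches are not stated in the quote, so they are left as `None`.

```python
# Hypothetical config sketch of the LAC experiment setup quoted above.
# Names are illustrative; only the numeric values come from the paper quote.
FINETUNE = {
    "steps": 1_000,          # fine-tuning steps
    "learning_rate": 2.5e-5,
    "batch_size": 2,
}

NUM_CANDIDATE_ACTIONS = 5    # n, number of sampled candidate actions
MAX_PREDICTION_STEPS = 4     # maximum prediction step

# Per-benchmark α (actor-critic weighting) and maximum horizon length.
# For BabyAI-Text and WebShop, α was chosen by grid search; the selected
# value is not given in the quote, so it is left unset here.
BENCHMARKS = {
    "ALFWorld":    {"alpha": 1,    "alpha_grid": None,
                    "max_horizon": 40},
    "BabyAI-Text": {"alpha": None, "alpha_grid": [1/2, 1, 2, 5, 10],
                    "max_horizon": 30},
    "WebShop":     {"alpha": None, "alpha_grid": [1/10, 1, 10],
                    "max_horizon": 15},
}
```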