Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Autonomous Agents via Automatic Reward Modeling And Planning
Authors: Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In this section, we conduct a series of experiments to demonstrate the effectiveness of the proposed framework for agent tasks. First, we evaluate the framework's performance on standard agent benchmarks (Yao et al., 2023a; Wang et al., 2022a; Yao et al., 2023b), detailed in Section 4.2. Next, we show how customizing the reward target during inference allows us to generate more tailored action plans, as described in Section 4.3. Finally, we conduct ablation studies in Section 4.4. |
| Researcher Affiliation | Collaboration | Zhenfang Chen (MIT-IBM Watson AI Lab); Delin Chen (UMass Amherst); Rui Sun (University of California, Los Angeles); Wenjun Liu (UMass Amherst); Chuang Gan (UMass Amherst and MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper describes algorithms such as MCTS but does not provide explicit pseudocode or an algorithm block within the main text. It mentions 'We provide more details in the implementation details and the appendix section.' |
| Open Source Code | No | We will release all the code, model, and data for easy reproduction upon acceptance. |
| Open Datasets | Yes | Webshop is a well-known environment for online shopping (Yao et al., 2023a)... Science World (Wang et al., 2022a) is an interactive benchmark... Game of 24 is a mathematical game... ALFWorld (Shridhar et al., 2021), a classic environment for household tasks... Clinical Agent (Schmidgall et al., 2024), an environment designed for medical decision-making tasks. |
| Dataset Splits | Yes | Webshop ... test the model on the validation split... Science World ... evaluate the framework on both seen and unseen splits. Game of 24 ... selected 100 challenging puzzles, specifically those indexed from 901 to 1,000... ALFWorld ... test the model on the dev and std split. |
| Hardware Specification | Yes | All models can be easily set up using the vLLM library (Kwon et al., 2023b) and a single H100 GPU. |
| Software Dependencies | No | The paper mentions using the 'vLLM library (Kwon et al., 2023b)' but does not provide a specific version number for this or any other software dependency. |
| Experiment Setup | Yes | We limit the maximum number of trajectories our ARMAP can explore to 10 in the Science World and Webshop environments... For the Game of 24 environment... set the maximum number of explored trajectories to 100. For Sampling, we set the model temperature to 1 and sample action trajectories using chain-of-thought prompting (Wei et al., 2023). For Greedy, we set the temperature to 0... For Reflexion, we set the maximum trial number to be 10 for all tasks. |