Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling Autonomous Agents via Automatic Reward Modeling and Planning

Authors: Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In this section, we conduct a series of experiments to demonstrate the effectiveness of the proposed framework for agent tasks. First, we evaluate the framework's performance on standard agent benchmarks (Yao et al., 2023a; Wang et al., 2022a; Yao et al., 2023b), detailed in Section 4.2. Next, we show how customizing the reward target during inference allows us to generate more tailored action plans, as described in Section 4.3. Finally, we conduct ablation studies in Section 4.4.
Researcher Affiliation | Collaboration | Zhenfang Chen (MIT-IBM Watson AI Lab); Delin Chen (UMass Amherst); Rui Sun (University of California, Los Angeles); Wenjun Liu (UMass Amherst); Chuang Gan (UMass Amherst and MIT-IBM Watson AI Lab)
Pseudocode | No | The paper describes algorithms such as MCTS but does not provide explicit pseudocode or an algorithm block in the main text. It notes: 'We provide more details in the implementation details and the appendix section.'
Open Source Code | No | We will release all the code, model, and data for easy reproduction upon acceptance.
Open Datasets | Yes | Webshop is a well-known environment for online shopping (Yao et al., 2023a)... Science World (Wang et al., 2022a) is an interactive benchmark... Game of 24 is a mathematical game... ALFWorld (Shridhar et al., 2021), a classic environment for household tasks... Clinical Agent (Schmidgall et al., 2024), an environment designed for medical decision-making tasks.
Dataset Splits | Yes | Webshop ... test the model on the validation split... Science World ... evaluate the framework on both seen and unseen splits. Game of 24 ... selected 100 challenging puzzles, specifically those indexed from 901 to 1,000... ALFWorld ... test the model on the dev and std split.
Hardware Specification | Yes | All models can be easily set up using the vLLM library (Kwon et al., 2023b) and a single H100 GPU.
Software Dependencies | No | The paper mentions using the 'vLLM library (Kwon et al., 2023b)' but does not provide a specific version number for this or any other software dependency.
Experiment Setup | Yes | We limit the maximum number of trajectories our ARMAP can explore to 10 in the Science World and Webshop environments... For the Game of 24 environment... set the maximum number of explored trajectories to 100. For Sampling, we set the model temperature to 1 and sample action trajectories using chain-of-thought prompting (Wei et al., 2023). For Greedy, we set the temperature to 0... For Reflexion, we set the maximum trial number to be 10 for all tasks.
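The decoding and search budgets quoted in the Experiment Setup row can be summarized in a small configuration sketch. This is a paraphrase for clarity only: the paper's code is unreleased, so all names and the dictionary structure here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative restatement of the experiment settings quoted above.
# All identifiers are assumptions; only the numeric values come from the paper.

# Baseline decoding strategies: temperature-1 sampling with chain-of-thought
# prompting, versus temperature-0 greedy decoding.
SAMPLING_CONFIGS = {
    "sampling": {"temperature": 1.0, "prompting": "chain-of-thought"},
    "greedy": {"temperature": 0.0, "prompting": "chain-of-thought"},
}

# Maximum number of trajectories ARMAP may explore, per environment.
MAX_TRAJECTORIES = {
    "scienceworld": 10,
    "webshop": 10,
    "game_of_24": 100,
}

# Reflexion baseline: at most 10 trials for all tasks.
MAX_REFLEXION_TRIALS = 10
```

Keeping these budgets in one place makes it easy to check that a reproduction attempt matches the paper's reported settings before comparing results.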