Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Authors: Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate PLAN-AND-ACT using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
Researcher Affiliation | Academia | 1 UC Berkeley, 2 University of Tokyo, 3 ICSI. Correspondence to: Amir Gholami <EMAIL>.
Pseudocode | No | The paper includes prompt templates in the appendix for the PLANNER, EXECUTOR, and data generation, but these describe an LLM's input/output format rather than structured, executable pseudocode or algorithm blocks for a computational procedure.
Open Source Code | No | The paper does not explicitly state that the code for PLAN-AND-ACT is open-source, nor does it provide a link to a code repository. It mentions using and fine-tuning existing models such as LLaMA-3.3-70B-Instruct and WebRL-Llama-3.1-70B, but not the release of the authors' own implementation.
Open Datasets | Yes | We evaluate PLAN-AND-ACT using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager. Environment: We run ablations on PLAN-AND-ACT using WebArena-Lite (Koh et al., 2024), a benchmark containing 165 test cases across diverse websites including OpenStreetMap, Reddit, GitLab, a content management system (CMS), and OneStopShop (OSS). We also evaluate PLAN-AND-ACT on the full WebArena dataset as well as the WebVoyager (He et al., 2024a) dataset, which is a dynamic, real-world web dataset.
Dataset Splits | Yes | We run ablations on PLAN-AND-ACT using WebArena-Lite (Koh et al., 2024), a benchmark containing 165 test cases across diverse websites... and provides training data... We also evaluate PLAN-AND-ACT on the full WebArena dataset as well as the WebVoyager (He et al., 2024a) dataset... The second column has an EXECUTOR that was trained only on 1,113 WebArena-Lite training data points, and the third column being an EXECUTOR trained on both the WebArena-Lite training data as well as the 923 synthetically generated action trajectories from Section 4.1.
Hardware Specification | Yes | We acknowledge gracious support from Apple team, as well as Nvidia for providing GPU hardware. Machine: 8 A100.
Software Dependencies | Yes | For our primary PLAN-AND-ACT framework, we utilize LLaMA-3.3-70B-Instruct model by fine-tuning separate instances for both the PLANNER and EXECUTOR components. We use WebRL-Llama-3.1-70B (Qi et al., 2024) as the actor model and ORM-Llama-3.1-8B (Qi et al., 2024) as the filter model for filtering for successful trajectories. Framework: torchtune.
Experiment Setup | Yes | Table 5(a), Training Hyperparameters: Learning Rate 2e-5; Optimizer AdamW; LR Scheduler Cosine; Warmup Ratio 0.1; Batch Size 32; Epochs 1; FP16/BF16 Enabled.
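
To make the Table 5(a) hyperparameters concrete, the following is a minimal sketch of what a cosine learning-rate schedule with a 0.1 warmup ratio and a 2e-5 peak might look like. The summary above only reports "Cosine" and "Warmup Ratio 0.1", so the linear warmup from zero, the decay to zero, the function name `lr_at_step`, and the step count used in the example are all assumptions for illustration, not the authors' implementation or the torchtune API.

```python
import math

# Values reported in Table 5(a).
PEAK_LR = 2e-5
WARMUP_RATIO = 0.1


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    Assumed shape: linear warmup from 0 to peak_lr, then cosine decay to 0.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear warmup (assumption; the source does not state the warmup shape).
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps (assumed to end at 0).
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 100  # hypothetical number of optimizer steps, not from the paper
    for s in (0, 5, 10, 55, 100):
        print(s, lr_at_step(s, total))
```

With one epoch and a batch size of 32, the real number of optimizer steps depends on the size of the training set actually used for each component, which this summary does not state for the PLANNER; the value 100 above is only a placeholder.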