Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Authors: Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate PLAN-AND-ACT using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
Researcher Affiliation	Academia	¹UC Berkeley ²University of Tokyo ³ICSI. Correspondence to: Amir Gholami <EMAIL>.
Pseudocode	No	The paper includes prompt templates in the appendix for the PLANNER, EXECUTOR, and data generation, but these are descriptions for an LLM's input/output format rather than structured, executable pseudocode or algorithm blocks for a computational procedure.
Open Source Code	No	The paper does not explicitly state that the code for PLAN-AND-ACT is open-source or provide a link to a code repository. It mentions using and finetuning existing models like LLaMA-3.3-70B-Instruct and WebRL-Llama-3.1-70B, but not the release of their specific implementation.
Open Datasets	Yes	We evaluate PLAN-AND-ACT using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager. Environment: We run ablations on PLAN-AND-ACT using WebArena-Lite (Koh et al., 2024), a benchmark containing 165 test cases across diverse websites including OpenStreetMap, Reddit, GitLab, a content management system (CMS), and OneStopShop (OSS). We also evaluate PLAN-AND-ACT on the full WebArena dataset as well as the WebVoyager (He et al., 2024a) dataset, which is a dynamic, real-world web dataset.
Dataset Splits	Yes	We run ablations on PLAN-AND-ACT using WebArena-Lite (Koh et al., 2024), a benchmark containing 165 test cases across diverse websites... and provides training data... We also evaluate PLAN-AND-ACT on the full WebArena dataset as well as the WebVoyager (He et al., 2024a) dataset... The second column has an EXECUTOR that was trained only on 1,113 WebArena-lite training data points, and the third column being an EXECUTOR trained on both the WebArena-lite training data as well as the 923 synthetically generated action trajectories from Section 4.1.
Hardware Specification	Yes	We acknowledge gracious support from Apple team, as well as Nvidia for providing GPU hardware. Machine: 8 A100
Software Dependencies	Yes	For our primary PLAN-AND-ACT framework, we utilize LLaMA-3.3-70B-Instruct model by fine-tuning separate instances for both the PLANNER and EXECUTOR components. We use WebRL-Llama-3.1-70B (Qi et al., 2024) as the actor model and ORM-Llama-3.1-8B (Qi et al., 2024) as the filter model for filtering for successful trajectories. Framework: torchtune
Experiment Setup	Yes	Table 5(a), Training Hyperparameters: Learning Rate 2e-5; Optimizer AdamW; LR Scheduler Cosine; Warmup Ratio 0.1; Batch Size 32; Epochs 1; FP16/BF16 Enabled.
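
To make the quoted hyperparameters concrete, the following is a minimal sketch of what a cosine schedule with a 0.1 warmup ratio and a peak learning rate of 2e-5 could look like. It is not the authors' code or torchtune's implementation. Several details are assumptions not stated in the table: warmup is taken to be linear, the cosine decay is taken to end at zero, "Batch Size 32" is treated as the global batch size with one optimizer step per batch, and the example count combines the 1,113 training points and 923 synthetic trajectories quoted under Dataset Splits. The function names `total_steps` and `lr_at` are hypothetical.

```python
import math

# Values quoted from Table 5(a) of the paper.
PEAK_LR = 2e-5
WARMUP_RATIO = 0.1
BATCH_SIZE = 32
EPOCHS = 1


def total_steps(num_examples, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Optimizer steps, assuming one step per batch (an assumption)."""
    return math.ceil(num_examples / batch_size) * epochs


def lr_at(step, total, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a 0-indexed step.

    Assumes linear warmup followed by cosine decay to zero; the table
    only says "Cosine" and "Warmup Ratio 0.1".
    """
    warmup = int(total * warmup_ratio)
    if step < warmup:
        # Linear ramp that reaches the peak on the last warmup step.
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative example count: 1,113 + 923 executor training points.
    steps = total_steps(1113 + 923)
    for s in (0, 5, 6, steps // 2, steps - 1):
        print(f"step {s:3d}: lr = {lr_at(s, steps):.3e}")
```

Under these assumptions a single epoch is only a few dozen optimizer steps, so the warmup phase lasts a handful of steps; a different batch-size interpretation or dataset mix would change those counts.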