Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
Authors: Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Li Erran Li
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. Our results show that PAE significantly improves the zero-shot generalization capability of VLM Internet agents (around 50% relative improvement) to both unseen tasks and websites. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley 2University of Illinois, Urbana-Champaign 3Amazon. |
| Pseudocode | Yes | In Algorithm 1, we include a formal definition of the practical PAE algorithm as presented in Section 3. (Algorithm 1: Proposer-Agent-Evaluator, Practical Algorithm) |
| Open Source Code | No | The release of our models enables medium-size VLMs such as LLaVA-7B to beat the prior SOTA Qwen2VL-72B, which has 10× more parameters, on WebArena Easy. While this sentence implies that models are released, it does not explicitly state that the source code for the methodology described in the paper is publicly available, nor does it provide a direct link or specific instructions for access. |
| Open Datasets | Yes | We validate the effectiveness of the PAE framework with realistic web-navigation benchmarks, including more than 100 domains both from online websites like Amazon from WebVoyager (He et al., 2024a) and self-hosted websites like Postmill from WebArena (Zhou et al., 2024a). |
| Dataset Splits | Yes | To understand the generalization of PAE to websites that it has never interacted with, we apply the workflow from He et al. (2024a) to generate 500 tasks using Claude 3 Sonnet on 85 unseen online websites and test the checkpoints from the WebVoyager experiments. Results are presented in Table 3, and a list of the websites is included in Appendix F. We observe that PAE for both LLaVA-7B and LLaVA-34B enables the agents to learn general web-browsing skills that can be zero-shot transferred to unseen websites, with 7.2% and 5.3% improvements in absolute success rate, respectively. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory amounts) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions several models (e.g., LLaVA-1.6, Claude-3-Sonnet, Qwen2VL-7B) and tools (Gradio, ChromeDriver), but does not provide specific version numbers for software dependencies or libraries critical for replicating the experiments. |
| Experiment Setup | Yes | We include the hyperparameters that we have used in Table 5. As shown in the table, the only hyperparameters that PAE has on top of standard supervised fine-tuning are the number of trajectories to collect in each global iteration of Algorithm 1, the number of proposed tasks from the task proposer before RL training, and the number of seen screenshots for the evaluator. |
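The Pseudocode and Experiment Setup rows together outline the structure of Algorithm 1: a task proposer generates candidate tasks, the agent rolls out trajectories on them, and an evaluator judges success from final screenshots before a filtered RL/fine-tuning update. A minimal sketch of that loop follows; all function names and the stubbed model calls are hypothetical placeholders for illustration, not the authors' released code.

```python
import random

def propose_tasks(n_tasks):
    # Task proposer: in PAE a foundation model generates candidate tasks;
    # stubbed here with synthetic task identifiers.
    return [f"task-{i}" for i in range(n_tasks)]

def run_agent(task, n_trajectories):
    # Agent: roll out the VLM policy on each proposed task, collecting
    # trajectories (actions plus a final screenshot for the evaluator).
    return [{"task": task, "actions": ["click", "type"], "final_screenshot": None}
            for _ in range(n_trajectories)]

def evaluate(trajectory):
    # Evaluator: PAE judges success from the last screenshot(s);
    # stubbed here with a coin flip.
    return random.random() < 0.5

def pae_iteration(n_tasks=4, n_trajectories=2):
    """One global iteration: propose tasks, roll out, keep successes."""
    successes = []
    for task in propose_tasks(n_tasks):
        for traj in run_agent(task, n_trajectories):
            if evaluate(traj):
                successes.append(traj)  # retained for the RL / filtered-SFT update
    return successes

if __name__ == "__main__":
    random.seed(0)
    kept = pae_iteration()
    print(f"{len(kept)} successful trajectories kept for training")
```

The three hyperparameters called out in the Experiment Setup row map directly onto this sketch: trajectories per iteration (`n_trajectories`), proposed tasks before training (`n_tasks`), and the screenshots the evaluator sees (collapsed here into the stubbed `evaluate`).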