Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Authors: Hyungjoo Chae, Seonghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we observe that our WEB-SHEPHERD achieves about 30 points better accuracy compared to using GPT-4o on WEBREWARDBENCH. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and WEB-SHEPHERD as the verifier, we achieve 10.9 points better performance, in 10× less cost compared to using GPT-4o-mini as the verifier.
Researcher Affiliation | Academia | 1 Georgia Institute of Technology, 2 Department of Artificial Intelligence, Yonsei University, 3 Carnegie Mellon University
Pseudocode | No | The paper describes methods and steps in prose and provides prompts as structured text, but it does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks to present a formal algorithm.
Open Source Code | Yes | Our model, dataset, and code are publicly available at LINK.
Open Datasets | Yes | First, we release the WEBPRM COLLECTION, which contains human-crafted instructions that covers diverse tasks across multiple difficulty levels. Second, we release the WEBREWARDBENCH, the first meta-evaluation benchmark to assess PRMs in web navigation. Our model, dataset, and code are publicly available at LINK.
Dataset Splits | Yes | We conduct analysis on the effect of the (1) number of instructions, and (2) number of rejected actions in the dataset on the performance of the PRM. Specifically, we construct datasets using the subset of WEBPRM COLLECTION 0.25, 0.5, and 0.75 percent of instruction and its corresponding chosen-rejected pairs and 1, 2, and 3 number of max rejected actions.
Hardware Specification | Yes | Training is conducted using DeepSpeed ZeRO Stage 2 on an RTX A6000 (48GB) server with 8 GPUs, totaling approximately 16 GPU-hours.
Software Dependencies | No | The paper mentions leveraging the LLaMA-Factory [47] framework, applying the Liger kernel [48] optimization, and using vLLM [45] for inference, but specific version numbers for these software components are not provided.
Experiment Setup | Yes | We train the model for 3 epochs with a learning rate of 1e-4, using LoRA with a rank of 16.
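
The training details quoted in the Hardware Specification and Experiment Setup rows can be gathered into one record, which also makes the implied wall-clock time easy to estimate. The sketch below is illustrative only: the class and field names are hypothetical, and the wall-clock figure assumes all 8 GPUs are busy for the whole job, which the paper excerpt does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedTrainingSetup:
    """Values quoted in the report above; the field names are hypothetical."""

    epochs: int = 3
    learning_rate: float = 1e-4
    lora_rank: int = 16
    num_gpus: int = 8               # RTX A6000 (48GB) server
    total_gpu_hours: float = 16.0   # reported as "approximately 16"

    def wall_clock_hours(self) -> float:
        # Assumption: every GPU runs for the full job,
        # so GPU-hours = number of GPUs x wall-clock hours.
        return self.total_gpu_hours / self.num_gpus


setup = ReportedTrainingSetup()
print(f"Estimated wall-clock time: ~{setup.wall_clock_hours():.1f} h")
```

Under that assumption, the reported budget corresponds to roughly two hours of wall-clock training. Library versions are still missing from the paper, so a record like this would not be enough on its own to pin down the software environment.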