Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL

Authors: Joey Hong, Anca Dragan, Sergey Levine

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To demonstrate the effectiveness of PNLC, we evaluate our method on a variety of multi-turn LLM agent benchmark tasks: web shopping [42], social deduction games [16], and persuasion [37]. A table of all results can be found for all considered benchmarks in Table 1. We see that PNLC performs best across all tasks, the closest competitors being task-specific approaches such as Agent Q in Web Shop or Strategist in Avalon. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: We run extensive experiments showing our method improves upon existing state-of-the-art.
Researcher Affiliation	Academia	Joey Hong, Anca Dragan, Sergey Levine (UC Berkeley)
Pseudocode	No	The paper describes the methodology using text and figures (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We release the code used to run the experiments found at this website.
Open Datasets	Yes	To demonstrate the effectiveness of PNLC, we evaluate our method on a variety of multi-turn LLM agent benchmark tasks: web shopping [42], social deduction games [16], and persuasion [37]. Web Shop [41] is an online shopping environment where an agent processes unstructured text data (in the form of descriptions crawled from Amazon) to purchase a product given some initial user specifications. Avalon Bench [16] is a testbed for LLM agents modeled after the team-based discussion game Resistance Avalon. We consider a goal-oriented dialogue task, inspired by Wang et al. [37].
Dataset Splits	Yes	Web Shopping. All relevant methods train on the same dataset consisting of 12k initial user instructions, of which we randomly held out 100 for evaluation.
Hardware Specification	No	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Our training procedure requires very little compute. We also report inference budget in terms of time taken. (The justification does not provide specific hardware models like GPU/CPU names, memory, or specific cluster configurations used for their experiments, only general statements about compute budget.)
Software Dependencies	No	The paper mentions using models like GPT-3 [21] and GPT-4 [22] and the IQL algorithm [12], but does not specify software dependencies such as programming language versions, library versions (e.g., PyTorch, TensorFlow), or specific versions of other software tools used for implementation.
Experiment Setup	Yes	A.2 Web Shop Training Details. Hyperparameter settings: hidden-layer size 64*64; IQL τ 0.8; discount factor 0.99; batch size 32; target network update α 0.005; number of updates per iteration 50; number of iterations 100; optimizer AdamW; learning rate 4e-4. (Similar tables are provided for Avalon and Persuasion in Appendices A.4 and A.6, respectively.)
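
The Web Shop settings quoted in the Experiment Setup and Dataset Splits rows can be collected into a small configuration sketch. This is an illustration only, not the authors' released code: every class, field, and function name below is hypothetical. It assumes that "64*64" means two hidden layers of 64 units, that the target network update α is a Polyak (soft) averaging rate, that IQL τ is the expectile in the usual asymmetric squared loss, and that "12k" can be stood in for by 12,000 instructions. The random seed is also an assumption, since the source does not report one.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WebShopIQLConfig:
    """Values copied from the Experiment Setup row; field names are hypothetical."""
    hidden_sizes: tuple = (64, 64)       # assumed reading of "64*64"
    iql_tau: float = 0.8                 # expectile used by IQL
    discount: float = 0.99
    batch_size: int = 32
    target_update_alpha: float = 0.005   # assumed to be a Polyak averaging rate
    updates_per_iteration: int = 50
    num_iterations: int = 100
    optimizer: str = "AdamW"
    learning_rate: float = 4e-4

    @property
    def total_updates(self) -> int:
        # Gradient updates implied by the two counts in the table.
        return self.updates_per_iteration * self.num_iterations


def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2 from IQL."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff


def soft_update(target, online, alpha):
    """Polyak averaging: target <- (1 - alpha) * target + alpha * online."""
    return [(1.0 - alpha) * t + alpha * o for t, o in zip(target, online)]


def hold_out(instructions, n_eval, seed=0):
    """Randomly hold out n_eval items for evaluation; the seed is an assumption."""
    rng = random.Random(seed)
    shuffled = list(instructions)
    rng.shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]


if __name__ == "__main__":
    cfg = WebShopIQLConfig()
    train, evaluation = hold_out(range(12_000), n_eval=100)
    print(cfg.total_updates, len(train), len(evaluation))
```

With τ = 0.8, a positive residual is weighted four times as heavily as a negative one of the same magnitude (0.8 versus 0.2), which is what pushes the learned value toward an upper expectile of the data. The small α of 0.005 means the target parameters move only 0.5% of the way toward the online parameters per update.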