Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Benchmarks and Algorithms for Offline Preference-Based Reward Learning
Authors: Daniel Shin, Anca Dragan, Daniel S. Brown
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test our approach, we first evaluate existing offline RL benchmarks for their suitability for offline reward learning. ... When evaluated on this curated set of domains, our empirical results suggest that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data. |
| Researcher Affiliation | Academia | Daniel Shin EMAIL Computer Science Department Stanford University Anca D. Dragan EMAIL EECS Department University of California, Berkeley Daniel S. Brown EMAIL School of Computing University of Utah |
| Pseudocode | Yes | Algorithm 1 OPRL |
| Open Source Code | Yes | Videos of learned behavior and code is available in the Supplement. |
| Open Datasets | Yes | We first evaluate a variety of popular offline RL benchmarks from D4RL (Fu et al., 2020) to determine which domains are most suited for evaluating offline reward learning. |
| Dataset Splits | No | The paper discusses evaluating performance on existing datasets like D4RL and on new datasets created by the authors. For instance, it mentions 'training with the ground-truth reward function on the full dataset of 1 million state transitions' and 'Our experimental setup is similar to Maze2D, except we start with 50 pairs of trajectories instead of 5 and we add 10 trajectories per round of active queries instead of 1 query per round.' However, it does not explicitly provide information about predefined training, validation, or test dataset splits in terms of percentages, absolute sample counts, or citations to standard splits for the experimental evaluation of policies or reward models. |
| Hardware Specification | Yes | All models are trained on an Azure Standard NC24 Promo instance, with 24 vCPUs, 224 GiB of RAM and 4 x K80 GPU (2 Physical Cards). |
| Software Dependencies | No | The paper mentions using a 'neural network' and 'deep learning', which implies frameworks like PyTorch or TensorFlow, and refers to 'offline RL algorithms' (e.g., AWR, CQL), but it does not specify any software components with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | For our experimental setup, we first randomly select 5 pairs of trajectory snippets and train 5 epochs with our models. After this initial training process, for each round, one additional pair of trajectories is queried to be added to the training set and we train one more epoch on this augmented dataset. ... For policy learning with AWR, lower dimensional environments including Maze2D-Umaze, Maze2D-Medium, and Hopper are ran with 400 iterations. Higher dimensional environments including Halfcheetah, Flow-Merge Random, and Kitchen-Complete are ran with 1000 iterations. ... For CQL, policy learning rate is 1e-4, lagrange threshold is -1.0, min q weights is 5.0, min q version is 3, and policy eval start is 0. |
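The Experiment Setup excerpt describes an active preference-query loop: train on 5 initial trajectory pairs for 5 epochs, then each round query one additional pair and train one more epoch on the augmented set. The sketch below illustrates that loop schedule only; it is not the paper's implementation. The linear reward model, the `train_epoch` gradient step, and all function names are hypothetical stand-ins (the paper uses a neural-network reward model), and the preference update is a generic Bradley-Terry log-likelihood ascent step.

```python
import numpy as np

def train_epoch(reward_weights, pairs, lr=0.01):
    """One epoch of Bradley-Terry preference learning on a linear reward
    model (hypothetical stand-in for the paper's neural-network model).
    Each pair is (preferred_features, other_features)."""
    for feats_a, feats_b in pairs:
        # Trajectory returns under the current linear reward model
        r_a = feats_a @ reward_weights
        r_b = feats_b @ reward_weights
        # P(a preferred over b) under the Bradley-Terry model
        p_a = 1.0 / (1.0 + np.exp(r_b - r_a))
        # Gradient ascent on the log-likelihood of the observed preference
        reward_weights = reward_weights + lr * (1.0 - p_a) * (feats_a - feats_b)
    return reward_weights

def active_preference_loop(all_pairs, dim, n_init=5, init_epochs=5, n_rounds=10):
    """Schedule from the paper's setup: 5 initial pairs for 5 epochs,
    then one queried pair and one extra epoch per round. Here 'querying'
    is simulated by consuming the next pair from a precomputed list."""
    w = np.zeros(dim)
    training_set = list(all_pairs[:n_init])
    for _ in range(init_epochs):
        w = train_epoch(w, training_set)
    for r in range(n_rounds):
        idx = n_init + r
        if idx >= len(all_pairs):
            break
        training_set.append(all_pairs[idx])  # one new query per round
        w = train_epoch(w, training_set)     # one epoch on the augmented set
    return w
```

With synthetic pairs labeled by a known ground-truth reward, the learned weight vector ends up positively aligned with the true one, which is the usual sanity check for this kind of preference-learning loop.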