Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Benchmarks and Algorithms for Offline Preference-Based Reward Learning
Authors: Daniel Shin, Anca Dragan, Daniel S. Brown
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test our approach, we first evaluate existing offline RL benchmarks for their suitability for offline reward learning. ... When evaluated on this curated set of domains, our empirical results suggest that combining offline RL with learned human preferences can enable an agent to learn to perform novel tasks that were not explicitly shown in the offline data. |
| Researcher Affiliation | Academia | Daniel Shin EMAIL Computer Science Department Stanford University Anca D. Dragan EMAIL EECS Department University of California, Berkeley Daniel S. Brown EMAIL School of Computing University of Utah |
| Pseudocode | Yes | Algorithm 1 OPRL |
| Open Source Code | Yes | Videos of learned behavior and code is available in the Supplement. |
| Open Datasets | Yes | We first evaluate a variety of popular offline RL benchmarks from D4RL (Fu et al., 2020) to determine which domains are most suited for evaluating offline reward learning. |
| Dataset Splits | No | The paper discusses evaluating performance on existing datasets like D4RL and on new datasets created by the authors. For instance, it mentions 'training with the ground-truth reward function on the full dataset of 1 million state transitions' and 'Our experimental setup is similar to Maze2D, except we start with 50 pairs of trajectories instead of 5 and we add 10 trajectories per round of active queries instead of 1 query per round.' However, it does not explicitly provide information about predefined training, validation, or test dataset splits in terms of percentages, absolute sample counts, or citations to standard splits for the experimental evaluation of policies or reward models. |
| Hardware Specification | Yes | All models are trained on an Azure Standard NC24 Promo instance, with 24 vCPUs, 224 GiB of RAM and 4 x K80 GPU (2 Physical Cards). |
| Software Dependencies | No | The paper mentions using a 'neural network' and 'deep learning', which implies frameworks like PyTorch or TensorFlow, and refers to 'offline RL algorithms' (e.g., AWR, CQL), but it does not specify any software components with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | For our experimental setup, we first randomly select 5 pairs of trajectory snippets and train 5 epochs with our models. After this initial training process, for each round, one additional pair of trajectories is queried to be added to the training set and we train one more epoch on this augmented dataset. ... For policy learning with AWR, lower dimensional environments including Maze2D-Umaze, Maze2D-Medium, and Hopper are ran with 400 iterations. Higher dimensional environments including Halfcheetah, Flow-Merge Random, and Kitchen-Complete are ran with 1000 iterations. ... For CQL, policy learning rate is 1e-4, lagrange threshold is -1.0, min q weights is 5.0, min q version is 3, and policy eval start is 0. |
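The Experiment Setup excerpt describes an active preference-query loop: train on 5 initial trajectory pairs for 5 epochs, then each round query one additional pair and train one more epoch on the augmented set. The sketch below illustrates that loop schedule only; it is not the paper's implementation. The linear reward model, the `train_epoch` gradient step, and all function names are hypothetical stand-ins (the paper uses a neural-network reward model), and the preference update is a generic Bradley-Terry log-likelihood ascent step.

```python
import numpy as np

def train_epoch(reward_weights, pairs, lr=0.01):
    """One epoch of Bradley-Terry preference learning on a linear reward
    model (hypothetical stand-in for the paper's neural-network model).
    Each pair is (preferred_features, other_features)."""
    for feats_a, feats_b in pairs:
        # Trajectory returns under the current linear reward model
        r_a = feats_a @ reward_weights
        r_b = feats_b @ reward_weights
        # P(a preferred over b) under the Bradley-Terry model
        p_a = 1.0 / (1.0 + np.exp(r_b - r_a))
        # Gradient ascent on the log-likelihood of the observed preference
        reward_weights = reward_weights + lr * (1.0 - p_a) * (feats_a - feats_b)
    return reward_weights

def active_preference_loop(all_pairs, dim, n_init=5, init_epochs=5, n_rounds=10):
    """Schedule from the paper's setup: 5 initial pairs for 5 epochs,
    then one queried pair and one extra epoch per round. Here 'querying'
    is simulated by consuming the next pair from a precomputed list."""
    w = np.zeros(dim)
    training_set = list(all_pairs[:n_init])
    for _ in range(init_epochs):
        w = train_epoch(w, training_set)
    for r in range(n_rounds):
        idx = n_init + r
        if idx >= len(all_pairs):
            break
        training_set.append(all_pairs[idx])  # one new query per round
        w = train_epoch(w, training_set)     # one epoch on the augmented set
    return w
```

With synthetic pairs labeled by a known ground-truth reward, the learned weight vector ends up positively aligned with the true one, which is the usual sanity check for this kind of preference-learning loop.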