Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Two-Step Offline Preference-Based Reinforcement Learning on Explicitly Constrained Policies

Authors: Yinglun Xu, Tarun Suresh, Rohan Gumaste, David Zhu, Ruirui Li, Zhengyang Wang, Haoming Jiang, Xianfeng Tang, Qingyu Yin, Monica Xiao Cheng, Qi Zeng, Chao Zhang, Gagandeep Singh

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental (4 experiments). LLM response: "Dataset: Following previous studies (Hejna & Sadigh, 2023; Zhang et al., 2023), we construct our offline preference dataset from D4RL benchmark (Fu et al., 2020) labeled by synthetic preference following the preference model in Section 2. ... The results in Table 1 show that the PRC algorithm has high learning efficiency."
Researcher Affiliation: Collaboration. LLM response: (1) University of Illinois Urbana-Champaign, (2) Amazon, (3) Georgia Institute of Technology
Pseudocode: Yes. LLM response: Algorithm 1: Two-step training with KL-regularization and PPO algorithm; Algorithm 2: Preference-Based Reinforcement Learning on a Constrained Action Space; Algorithm 3: PRC practical implementation
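The "two-step training with KL-regularization" named above follows a pattern common in constrained offline RL: first behavior-clone a policy from the offline data, then optimize a learned reward while penalizing divergence from the clone. The sketch below illustrates only that generic penalized objective; the function names, the `beta` coefficient, and the discrete-distribution setting are illustrative assumptions, and the paper's Algorithm 1 specifies the exact form.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete action distributions given as
    probability lists over the same action set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_reward(reward, policy_probs, clone_probs, beta=0.1):
    """Generic KL-penalized reward: the learned reward minus a penalty
    that grows as the current policy drifts from the behavior clone.
    (Hypothetical sketch; beta and the penalty placement are assumptions.)"""
    return reward - beta * kl_divergence(policy_probs, clone_probs)

# A policy matching the clone keeps the full reward; drifting away is penalized.
r_close = regularized_reward(1.0, [0.5, 0.5], [0.5, 0.5])  # no penalty
r_drift = regularized_reward(1.0, [0.9, 0.1], [0.5, 0.5])  # penalized
```

Under this objective, a policy-gradient method such as PPO can maximize the penalized reward while staying near the cloned behavior policy.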
Open Source Code: No. LLM response: The paper does not contain an explicit statement that the authors' code is released, nor does it provide a direct link to a code repository. The text mentions the training time for FTB and states that details can be found in the Appendix, but this refers to a third-party method used for comparison.
Open Datasets: Yes. LLM response: "Following previous studies (Hejna & Sadigh, 2023; Zhang et al., 2023), we construct our offline preference dataset from D4RL benchmark (Fu et al., 2020) labeled by synthetic preference following the preference model in Section 2."
Dataset Splits: No. LLM response: The paper mentions using various datasets from the D4RL benchmark (e.g., Half Cheetah, Hopper, and Walker, with types Medium, Medium-Replay, and Medium-Expert) and states the total number of trajectory pairs, but it does not specify explicit train/test/validation splits, percentages, or the methodology used for partitioning these datasets for experiments.
Hardware Specification: Yes. LLM response: "For the computational resource, we use a single NVIDIA A100-PCI GPU with 40 GB RAM and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz with 64 GB RAM."
Software Dependencies: No. LLM response: The paper mentions using the SAC and PPO algorithms for the reinforcement learning step and describes general neural network structures, but it does not provide version numbers for the algorithm implementations, any libraries (e.g., PyTorch, TensorFlow), or the programming language (e.g., Python) used.
Experiment Setup: Yes. LLM response: "To learn a reward model, we follow a standard supervised learning framework. Specifically, we use a multilayer perceptron (MLP) structure for the neural network to approximate the utility model. The neural network consists of 3 hidden layers, and each has 64 neurons. We use a tanh function as the output activation so that the output is bound between [-1, 1]. To train a deterministic behavior clone policy, the neural network we use to represent the clone policy has the same structure as that for the utility model."