Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Two-Step Offline Preference-Based Reinforcement Learning on Explicitly Constrained Policies
Authors: Yinglun Xu, Tarun Suresh, Rohan Gumaste, David Zhu, Ruirui Li, Zhengyang Wang, Haoming Jiang, Xianfeng Tang, Qingyu Yin, Monica Xiao Cheng, Qi Zeng, Chao Zhang, Gagandeep Singh
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 (Experiments), Dataset: "Following previous studies (Hejna & Sadigh, 2023; Zhang et al., 2023), we construct our offline preference dataset from D4RL benchmark (Fu et al., 2020) labeled by synthetic preference following the preference model in Section 2. ... The results in Table 1 show that the PRC algorithm has high learning efficiency." |
| Researcher Affiliation | Collaboration | 1University of Illinois Urbana-Champaign, 2Amazon, 3Georgia Institute of Technology |
| Pseudocode | Yes | Algorithm 1: Two-step training with KL-regularization and PPO algorithm Algorithm 2: Preference Based Reinforcement Learning on a Constrained Action Space Algorithm 3: PRC practical implementation |
| Open Source Code | No | The paper does not contain an explicit statement from the authors that their code is released, nor does it provide a direct link to a code repository. The text mentions the training time for FTB and states that details can be found in the Appendix, but this refers to a third-party method used for comparison. |
| Open Datasets | Yes | Following previous studies (Hejna & Sadigh, 2023; Zhang et al., 2023), we construct our offline preference dataset from D4RL benchmark (Fu et al., 2020) labeled by synthetic preference following the preference model in Section 2. |
| Dataset Splits | No | The paper mentions using various datasets from the D4RL benchmark (e.g., Half Cheetah, Hopper, and Walker, with types Medium, Medium-Replay, and Medium-Expert) and states the total number of trajectory pairs, but it does not specify explicit train/test/validation splits, percentages, or the methodology used for partitioning these datasets for experiments. |
| Hardware Specification | Yes | For the computational resource, we use a single NVIDIA A100-PCI GPU with 40 GB RAM and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz with 64 GB RAM. |
| Software Dependencies | No | The paper mentions using SAC and PPO algorithms for the reinforcement learning step, and describes general neural network structures. However, it does not provide specific version numbers for these algorithms, any libraries (e.g., PyTorch, TensorFlow), or the programming language (e.g., Python) used for implementation. |
| Experiment Setup | Yes | To learn a reward model, we follow a standard supervised learning framework. Specifically, we use a multilayer perceptron (MLP) structure for the neural network to approximate the utility model. The neural network consists of 3 hidden layers, and each has 64 neurons. We use a tanh function as the output activation so that the output is bound between [ -1, 1]. To train a deterministic behavior clone policy, the neural network we use to represent the clone policy has the same structure as that for the utility model. |
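The reward-model architecture quoted above (an MLP with 3 hidden layers of 64 neurons and a tanh output bounded in [-1, 1]) can be sketched as follows. This is a minimal illustration, not the authors' released code: the paper does not state its framework or hidden-layer activation, so PyTorch and ReLU are assumptions here, and the class name `UtilityMLP` is hypothetical.

```python
import torch
import torch.nn as nn


class UtilityMLP(nn.Module):
    """Hypothetical utility/reward model matching the reported setup:
    3 hidden layers of 64 neurons each, tanh output bounded in [-1, 1].
    Hidden activation (ReLU) is an assumption; the paper does not specify it."""

    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
            nn.Tanh(),  # bounds the predicted utility to [-1, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Example: score a batch of 4 state-action inputs of dimension 17
# (17 is an illustrative placeholder, e.g. an observation dimension).
model = UtilityMLP(input_dim=17)
utilities = model(torch.randn(4, 17))
print(utilities.shape)  # torch.Size([4, 1]), each value in [-1, 1]
```

The paper notes that the deterministic behavior-clone policy reuses the same network structure, so a sketch like this would serve both components with only the output dimension and activation adapted as needed.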