Beyond Reward: Offline Preference-guided Policy Optimization
Authors: Yachen Kang, Diyuan Shi, Jinxin Liu, Li He, Donglin Wang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms performed over either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023. [...] Firstly, we propose OPPO, a concise, stable, and one-step offline PbRL paradigm that avoids the need for separate reward function learning. Secondly, we present an instance of a preference-based hindsight information matching objective and a novel preference modeling objective over the context. Finally, extensive experiments are conducted to demonstrate the superiority of OPPO over previous competitive baselines and to analyze its performance. |
| Researcher Affiliation | Academia | 1 College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China; 2 Machine Intelligence Lab (MiLAB) of the School of Engineering, Westlake University, Hangzhou, Zhejiang, China. Correspondence to: Donglin Wang <wangdonglin@westlake.edu.cn>, Yachen Kang <kangyachen@westlake.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 OPPO: Offline Preference-guided Policy Optimization. Require: Dataset D := {τ} and labeled dataset D := {(τ^i, τ^j, y)}, where τ^i ∈ D and τ^j ∈ D. Return: π(a\|s, z) and z*. (A hedged sketch of this interface follows the table.) |
| Open Source Code | Yes | Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023. [...] Our code is based on Decision Transformer, and our implementation of OPPO is available at: https://github.com/bkkgbkjb/OPPO |
| Open Datasets | Yes | To answer the above questions, we evaluate OPPO on the continuous control tasks from the D4RL benchmark (Fu et al., 2020). Specifically, we choose Hopper, Walker, and Halfcheetah as three base tasks, with medium, medium-replay, medium-expert as the datasets for each task. |
| Dataset Splits | No | The paper describes the general nature of the D4RL datasets (e.g., "Medium: 1 million timesteps", "Medium-Replay: the replay buffer of an agent"), but does not explicitly state the specific train/validation/test splits used for their experiments. While it refers to D4RL for more information, it does not detail *their* partitioning of the data. |
| Hardware Specification | Yes | The experiments were run on a computational cluster with 20x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for about 20 days. |
| Software Dependencies | No | The paper mentions "AdamW optimizer ... following PyTorch defaults" and that their "code is based on Decision Transformer" with a GitHub link, but it does not specify explicit version numbers for PyTorch, Decision Transformer, or any other software libraries used, which is required for reproducibility. |
| Experiment Setup | Yes | For specific hyperparameter selection during the training process, please refer to the detailed description in Appendix A.1.3. [...] Table 7. Hyperparameters of coefficients of combined losses during Offline HIM. [...] Table 8. Hyperparameters of z searching for OpenAI Gym experiments. [...] Table 9. Hyperparameters of Transformer for OpenAI Gym experiments. |
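
For context, the Require/Return line quoted in the Pseudocode row implies three moving parts: an encoder that maps trajectories to context embeddings, a context-conditioned policy π(a|s, z), and a searched optimal context z* trained against the preference labels y. The sketch below illustrates that interface only; the module names (ContextEncoder, ContextPolicy), the GRU encoder, the distance-based preference loss, and all dimensions are our assumptions for illustration, not the authors' implementation (which, per the table above, is based on Decision Transformer).

```python
# Minimal sketch of the interface described by Algorithm 1's Require/Return line.
# Everything below (module names, GRU encoder, distance-based preference loss,
# dimensions) is an assumption for illustration, not the authors' released code.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Hypothetical encoder mapping a trajectory τ to a context embedding z."""
    def __init__(self, obs_dim: int, act_dim: int, z_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, z_dim, batch_first=True)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, horizon, obs_dim + act_dim) -> z: (batch, z_dim)
        _, h = self.rnn(traj)
        return h[-1]

class ContextPolicy(nn.Module):
    """Hypothetical context-conditioned policy π(a | s, z)."""
    def __init__(self, obs_dim: int, act_dim: int, z_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, z], dim=-1))

def preference_loss(z_pref: torch.Tensor, z_rej: torch.Tensor,
                    z_star: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective over contexts: pull the searched optimal
    context z* closer to the preferred trajectory's embedding than to the
    dispreferred one (our reading of 'preference modeling over the context')."""
    d_pref = torch.norm(z_star - z_pref, dim=-1)
    d_rej = torch.norm(z_star - z_rej, dim=-1)
    return -torch.log(torch.sigmoid(d_rej - d_pref) + 1e-8).mean()

# Usage sketch with made-up dimensions (Hopper-like: 11-dim obs, 3-dim actions).
obs_dim, act_dim, z_dim = 11, 3, 64
enc = ContextEncoder(obs_dim, act_dim, z_dim)
pi = ContextPolicy(obs_dim, act_dim, z_dim)
z_star = nn.Parameter(torch.zeros(1, z_dim))     # the searched optimal context z*
params = list(enc.parameters()) + list(pi.parameters()) + [z_star]
opt = torch.optim.AdamW(params)                  # paper reports AdamW with PyTorch defaults

traj_i = torch.randn(8, 100, obs_dim + act_dim)  # trajectories labeled as preferred
traj_j = torch.randn(8, 100, obs_dim + act_dim)  # trajectories labeled as dispreferred
loss = preference_loss(enc(traj_i), enc(traj_j), z_star)
loss.backward()
opt.step()

# At evaluation time the policy is conditioned on the learned z*.
action = pi(torch.randn(8, obs_dim), z_star.expand(8, -1).detach())
```

The sketch omits the hindsight information matching and supervised policy losses that the paper combines with the preference objective; it is only meant to make the Require/Return signature in the Pseudocode row concrete.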