Beyond Reward: Offline Preference-guided Policy Optimization
Authors: Yachen Kang, Diyuan Shi, Jinxin Liu, Li He, Donglin Wang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms performed over either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023. [...] Firstly, we propose OPPO, a concise, stable, and one-step offline PbRL paradigm that avoids the need for separate reward function learning. Secondly, we present an instance of a preference-based hindsight information matching objective and a novel preference modeling objective over the context. Finally, extensive experiments are conducted to demonstrate the superiority of OPPO over previous competitive baselines and to analyze its performance. |
| Researcher Affiliation | Academia | 1 College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China; 2 Machine Intelligence Lab (MiLAB) of the School of Engineering, Westlake University, Hangzhou, Zhejiang, China. Correspondence to: Donglin Wang <wangdonglin@westlake.edu.cn>, Yachen Kang <kangyachen@westlake.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 OPPO: Offline Preference-guided Policy Optimization. Require: Dataset D := {τ} and labeled dataset D := {(τ^i, τ^j, y)}, where τ^i ∈ D and τ^j ∈ D. Return: π(a\|s, z) and z*. (A hedged sketch of this interface follows the table.) |
| Open Source Code | Yes | Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023. [...] Our code is based on Decision Transformer, and our implementation of OPPO is available at: https://github.com/bkkgbkjb/OPPO |
| Open Datasets | Yes | To answer the above questions, we evaluate OPPO on the continuous control tasks from the D4RL benchmark (Fu et al., 2020). Specifically, we choose Hopper, Walker, and Halfcheetah as three base tasks, with medium, medium-replay, medium-expert as the datasets for each task. |
| Dataset Splits | No | The paper describes the general nature of the D4RL datasets (e.g., "Medium: 1 million timesteps", "Medium-Replay: the replay buffer of an agent"), but does not explicitly state the specific train/validation/test splits used for their experiments. While it refers to D4RL for more information, it does not detail *their* partitioning of the data. |
| Hardware Specification | Yes | The experiments were run on a computational cluster with 20x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for about 20 days. |
| Software Dependencies | No | The paper mentions "AdamW optimizer ... following PyTorch defaults" and that their "code is based on Decision Transformer" with a GitHub link, but it does not specify explicit version numbers for PyTorch, Decision Transformer, or any other software libraries used, which is required for reproducibility. |
| Experiment Setup | Yes | For specific hyperparameter selection during the training process, please refer to the detailed description in Appendix A.1.3. [...] Table 7. Hyperparameters of coefficients of combined losses during Offline HIM. [...] Table 8. Hyperparameters of z searching for OpenAI Gym experiments. [...] Table 9. Hyperparameters of Transformer for OpenAI Gym experiments. |
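
For context, the Require/Return line quoted in the Pseudocode row implies three moving parts: an encoder that maps trajectories to context embeddings, a context-conditioned policy π(a|s, z), and a searched optimal context z* trained against the preference labels y. The sketch below illustrates that interface only; the module names (ContextEncoder, ContextPolicy), the GRU encoder, the distance-based preference loss, and all dimensions are our assumptions for illustration, not the authors' implementation (which, per the table above, is based on Decision Transformer).

```python
# Minimal sketch of the interface described by Algorithm 1's Require/Return line.
# Everything below (module names, GRU encoder, distance-based preference loss,
# dimensions) is an assumption for illustration, not the authors' released code.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Hypothetical encoder mapping a trajectory τ to a context embedding z."""
    def __init__(self, obs_dim: int, act_dim: int, z_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, z_dim, batch_first=True)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, horizon, obs_dim + act_dim) -> z: (batch, z_dim)
        _, h = self.rnn(traj)
        return h[-1]

class ContextPolicy(nn.Module):
    """Hypothetical context-conditioned policy π(a | s, z)."""
    def __init__(self, obs_dim: int, act_dim: int, z_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, z], dim=-1))

def preference_loss(z_pref: torch.Tensor, z_rej: torch.Tensor,
                    z_star: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective over contexts: pull the searched optimal
    context z* closer to the preferred trajectory's embedding than to the
    dispreferred one (our reading of 'preference modeling over the context')."""
    d_pref = torch.norm(z_star - z_pref, dim=-1)
    d_rej = torch.norm(z_star - z_rej, dim=-1)
    return -torch.log(torch.sigmoid(d_rej - d_pref) + 1e-8).mean()

# Usage sketch with made-up dimensions (Hopper-like: 11-dim obs, 3-dim actions).
obs_dim, act_dim, z_dim = 11, 3, 64
enc = ContextEncoder(obs_dim, act_dim, z_dim)
pi = ContextPolicy(obs_dim, act_dim, z_dim)
z_star = nn.Parameter(torch.zeros(1, z_dim))     # the searched optimal context z*
params = list(enc.parameters()) + list(pi.parameters()) + [z_star]
opt = torch.optim.AdamW(params)                  # paper reports AdamW with PyTorch defaults

traj_i = torch.randn(8, 100, obs_dim + act_dim)  # trajectories labeled as preferred
traj_j = torch.randn(8, 100, obs_dim + act_dim)  # trajectories labeled as dispreferred
loss = preference_loss(enc(traj_i), enc(traj_j), z_star)
loss.backward()
opt.step()

# At evaluation time the policy is conditioned on the learned z*.
action = pi(torch.randn(8, obs_dim), z_star.expand(8, -1).detach())
```

The sketch omits the hindsight information matching and supervised policy losses that the paper combines with the preference objective; it is only meant to make the Require/Return signature in the Pseudocode row concrete.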