Direct Preference-based Policy Optimization without Reward Modeling

Authors: Gaon An, Junhyeok Lee, Xingdong Zuo, Norio Kosaka, Kyung-Min Kim, Hyun Oh Song

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results on offline RL settings with actual human preference labels show that the proposed algorithm outperforms or is on par with the baselines on all of the tasks considered. Notably, in high-dimensional control tasks, our algorithm outperforms offline RL methods that utilize ground-truth reward information.
Researcher Affiliation | Collaboration | Gaon An (Seoul National University, white0234@mllab.snu.ac.kr); Junhyeok Lee (Seoul National University, riman314@mllab.snu.ac.kr); Xingdong Zuo (NAVER, xingdong.zuo@navercorp.com); Norio Kosaka (NAVER Line Corporation, kosaka.norio@linecorp.com); Kyung-Min Kim (NAVER, kyungmin.kim.ml@navercorp.com); Hyun Oh Song (Seoul National University, hyunoh@mllab.snu.ac.kr)
Pseudocode | Yes | Algorithm 1: Direct Preference-based Policy Optimization
Open Source Code | Yes | Our official code is available at https://github.com/snu-mllab/DPPO.
Open Datasets | Yes | We evaluate our algorithm on D4RL, a standard benchmark for offline RL, with preference datasets generated by actual human teachers [16]. ... For the Gym hopper, Gym walker2d, and Adroit pen tasks, we utilize publicly available human preference datasets released by [27].
Dataset Splits | Yes | Following recent PbRL works, we evaluate our algorithm on the offline setting, which assumes a large unlabeled dataset D is given along with a much smaller preference-labeled dataset D_pref [49, 27].
Hardware Specification | Yes | All our offline RL experiments were run on a single RTX 3090 GPU with 10 CPU cores (Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz). For the RLHF experiments, we used an A100 GPU with 10 CPU cores (AMD EPYC 7402 24-Core Processor).
Software Dependencies | No | The paper mentions that the algorithm was implemented on DeepSpeed-Chat but does not provide specific version numbers for any software libraries or dependencies.
Experiment Setup | Yes | Table 4: Hyperparameter settings for IQL in PT+IQL. ... Table 5: Hyperparameter settings for CQL in PT+CQL. ... Table 6: Hyperparameter settings of the preference predictor training process in DPPO (Ours). ... Table 7: Hyperparameter settings of the policy optimization process in DPPO (Ours).
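
The Open Datasets row above cites D4RL tasks with human preference labels. For orientation, the unlabeled offline dataset for such tasks is typically loaded through the d4rl package; the snippet below is a generic loading sketch, not the authors' exact data pipeline, and the task name is only illustrative.

```python
# Generic D4RL loading sketch (assumes the `gym` and `d4rl` packages are installed).
# The task name is illustrative; this is not the authors' exact pipeline.
import gym
import d4rl  # noqa: F401  (importing d4rl registers its environments with gym)

env = gym.make("hopper-medium-replay-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of numpy arrays

print(dataset["observations"].shape,   # (N, obs_dim)
      dataset["actions"].shape,        # (N, act_dim)
      dataset["terminals"].shape)      # (N,)
```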
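The Dataset Splits row quotes the offline PbRL setting: a large unlabeled dataset D plus a much smaller preference-labeled dataset D_pref. The sketch below only illustrates one plausible data layout for D_pref (pairs of fixed-length segments with a label); in the paper the labels come from human teachers, so the random label here is purely a placeholder to keep the snippet self-contained.

```python
# Illustrative layout of a small preference-labeled set D_pref built from the
# large unlabeled dataset D. The random label is a placeholder; real labels
# come from human annotators.
import numpy as np

def sample_segment(data: dict, length: int, rng: np.random.Generator) -> dict:
    """Cut a random fixed-length segment of (observation, action) pairs."""
    start = int(rng.integers(0, len(data["observations"]) - length))
    return {k: data[k][start:start + length] for k in ("observations", "actions")}

def build_pref_dataset(data: dict, num_pairs: int = 500, length: int = 25, seed: int = 0):
    """Return a list of (segment_0, segment_1, label) triples."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(num_pairs):
        seg0 = sample_segment(data, length, rng)
        seg1 = sample_segment(data, length, rng)
        label = int(rng.integers(0, 2))  # placeholder for a human preference label
        pairs.append((seg0, seg1, label))
    return pairs

# Dummy stand-in for the unlabeled dataset D, so the example runs on its own.
dummy_data = {"observations": np.random.randn(10_000, 11),
              "actions": np.random.randn(10_000, 3)}
print(len(build_pref_dataset(dummy_data)))  # 500
```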
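The Pseudocode row points to Algorithm 1 in the paper, which is not reproduced here. As a rough illustration of the idea named in the title, optimizing a policy directly against a learned preference predictor rather than a fitted reward model, here is a minimal PyTorch sketch; the network shapes, the frozen predictor, and the loss form are all assumptions, not the authors' algorithm.

```python
# Minimal sketch only: a policy updated to increase the score of a learned,
# frozen preference predictor, with no explicit reward model in the loop.
# Dimensions, architectures, and the loss form are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim = 11, 3  # hopper-like sizes, placeholder values

policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                       nn.Linear(256, action_dim), nn.Tanh())

# Hypothetical preference predictor trained beforehand on D_pref; frozen here.
pref_predictor = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                               nn.Linear(256, 1))
for p in pref_predictor.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def policy_update(states: torch.Tensor) -> float:
    """One gradient step pushing the policy toward higher predicted preference."""
    actions = policy(states)
    score = pref_predictor(torch.cat([states, actions], dim=-1)).mean()
    loss = -score  # maximize the predicted preference of the policy's actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.item())

# States would come from the offline dataset D; random tensors keep this runnable.
print(policy_update(torch.randn(256, state_dim)))
```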
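Finally, the Experiment Setup row only names the relevant hyperparameter tables; the concrete values are in Tables 4-7 of the paper and its repository. The snippet below merely sketches how such settings could be grouped in code, and every value is a placeholder rather than a setting from the paper.

```python
# Placeholder grouping of the hyperparameter categories named in Tables 6-7;
# all values below are illustrative defaults, not the paper's settings.
from dataclasses import dataclass

@dataclass
class PredictorConfig:  # cf. Table 6: preference predictor training
    batch_size: int = 256
    learning_rate: float = 3e-4
    num_epochs: int = 100

@dataclass
class PolicyConfig:     # cf. Table 7: policy optimization
    batch_size: int = 256
    learning_rate: float = 3e-4
    gradient_steps: int = 1_000_000

print(PredictorConfig(), PolicyConfig())
```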