Direct Preference-based Policy Optimization without Reward Modeling
Authors: Gaon An, Junhyeok Lee, Xingdong Zuo, Norio Kosaka, Kyung-Min Kim, Hyun Oh Song
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results on offline RL settings with actual human preference labels show that the proposed algorithm outperforms or is on par with the baselines on all of the tasks considered. Notably, in high-dimensional control tasks, our algorithm outperforms offline RL methods that utilize ground-truth reward information. |
| Researcher Affiliation | Collaboration | Gaon An, Seoul National University (white0234@mllab.snu.ac.kr); Junhyeok Lee, Seoul National University (riman314@mllab.snu.ac.kr); Xingdong Zuo, NAVER (xingdong.zuo@navercorp.com); Norio Kosaka, NAVER Line Corporation (kosaka.norio@linecorp.com); Kyung-Min Kim, NAVER (kyungmin.kim.ml@navercorp.com); Hyun Oh Song, Seoul National University (hyunoh@mllab.snu.ac.kr) |
| Pseudocode | Yes | Algorithm 1 Direct Preference-based Policy Optimization |
| Open Source Code | Yes | Our official code is available at https://github.com/snu-mllab/DPPO. |
| Open Datasets | Yes | We evaluate our algorithm on D4RL, a standard benchmark for offline RL, with preference datasets generated by actual human teachers [16]. ... For the Gym hopper, Gym walker2d, and Adroit pen tasks, we utilize publicly available human preference datasets released by [27]. |
| Dataset Splits | Yes | Following recent PbRL works, we evaluate our algorithm on the offline setting which assumes a large unlabeled dataset D is given along with a much smaller preference-labeled dataset Dpref [49, 27]. |
| Hardware Specification | Yes | All our offline RL experiments were run on a single RTX 3090 GPU with 10 CPU cores (Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz). For the RLHF experiments, we used an A100 GPU with 10 CPU cores (AMD EPYC 7402 24-Core Processor). |
| Software Dependencies | No | The paper mentions that the algorithm was implemented on DeepSpeed-Chat but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Table 4: Hyperparameter settings for IQL in PT+IQL. ... Table 5: Hyperparameter settings for CQL in PT+CQL. ... Table 6: Hyperparameter settings of the preference predictor training process in DPPO (Ours). ... Table 7: Hyperparameter settings of the policy optimization process in DPPO (Ours). |
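The Pseudocode and Experiment Setup rows above point to the paper's two training stages: preference predictor training (Table 6) and policy optimization (Table 7), per Algorithm 1. The PyTorch sketch below is an illustrative assumption rather than the authors' DPPO implementation from the linked repository; the `PreferencePredictor` class, its segment-scoring architecture, and the contrastive policy loss are placeholder choices meant only to show how a policy can be optimized directly against a learned preference predictor, with no intermediate reward model.

```python
# Minimal sketch of preference-based policy optimization without reward modeling.
# Placeholder architecture and losses; not the authors' released DPPO code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferencePredictor(nn.Module):
    """Scores a trajectory segment; P(seg1 preferred over seg0) = sigmoid(score1 - score0)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (B, T, obs_dim), act: (B, T, act_dim) -> per-segment score of shape (B,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1).sum(dim=1)


def predictor_loss(predictor, seg0, seg1, label):
    """Stage 1: fit the predictor to human labels in Dpref (label = 1 means seg1 was preferred)."""
    logits = predictor(*seg1) - predictor(*seg0)
    return F.binary_cross_entropy_with_logits(logits, label.float())


def policy_loss(predictor, policy, obs, dataset_act):
    """Stage 2: push the policy's actions to be preferred over dataset actions
    under the frozen predictor -- no reward model is trained in between."""
    policy_act = policy(obs)                                   # (B, T, act_dim)
    logits = predictor(obs, policy_act) - predictor(obs, dataset_act)
    return -F.logsigmoid(logits).mean()
```

A full run would alternate minibatch updates of `predictor_loss` on the preference-labeled dataset, freeze the predictor, and then optimize `policy_loss` over the large unlabeled dataset, with the hyperparameters reported in Tables 6 and 7 governing the two stages.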