Direct Preference-based Policy Optimization without Reward Modeling
Authors: Gaon An, Junhyeok Lee, Xingdong Zuo, Norio Kosaka, Kyung-Min Kim, Hyun Oh Song
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results on offline RL settings with actual human preference labels show that the proposed algorithm outperforms or is on par with the baselines on all of the tasks considered. Notably, in high-dimensional control tasks, our algorithm outperforms offline RL methods that utilize ground-truth reward information. |
| Researcher Affiliation | Collaboration | Gaon An, Seoul National University (white0234@mllab.snu.ac.kr); Junhyeok Lee, Seoul National University (riman314@mllab.snu.ac.kr); Xingdong Zuo, NAVER (xingdong.zuo@navercorp.com); Norio Kosaka, NAVER Line Corporation (kosaka.norio@linecorp.com); Kyung-Min Kim, NAVER (kyungmin.kim.ml@navercorp.com); Hyun Oh Song, Seoul National University (hyunoh@mllab.snu.ac.kr) |
| Pseudocode | Yes | Algorithm 1 Direct Preference-based Policy Optimization |
| Open Source Code | Yes | Our official code is available at https://github.com/snu-mllab/DPPO. |
| Open Datasets | Yes | We evaluate our algorithm on D4RL, a standard benchmark for offline RL, with preference datasets generated by actual human teachers [16]. ... For the Gym hopper, Gym walker2d, and Adroit pen tasks, we utilize publicly available human preference datasets released by [27]. |
| Dataset Splits | Yes | Following recent PbRL works, we evaluate our algorithm on the offline setting which assumes a large unlabeled dataset D is given along with a much smaller preference-labeled dataset Dpref [49, 27]. |
| Hardware Specification | Yes | All our offline RL experiments were run on a single RTX 3090 GPU with 10 CPU cores (Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz). For the RLHF experiments, we used an A100 GPU with 10 CPU cores (AMD EPYC 7402 24-Core Processor). |
| Software Dependencies | No | The paper mentions that the algorithm was implemented on DeepSpeed-Chat but does not provide specific version numbers for any software libraries or dependencies. |
| Experiment Setup | Yes | Table 4: Hyperparameter settings for IQL in PT+IQL. ... Table 5: Hyperparameter settings for CQL in PT+CQL. ... Table 6: Hyperparameter settings of the preference predictor training process in DPPO (Ours). ... Table 7: Hyperparameter settings of the policy optimization process in DPPO (Ours). |
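The Pseudocode and Experiment Setup rows above point to the paper's two training stages: preference predictor training (Table 6) and policy optimization (Table 7), per Algorithm 1. The PyTorch sketch below is an illustrative assumption rather than the authors' DPPO implementation from the linked repository; the `PreferencePredictor` class, its segment-scoring architecture, and the contrastive policy loss are placeholder choices meant only to show how a policy can be optimized directly against a learned preference predictor, with no intermediate reward model.

```python
# Minimal sketch of preference-based policy optimization without reward modeling.
# Placeholder architecture and losses; not the authors' released DPPO code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferencePredictor(nn.Module):
    """Scores a trajectory segment; P(seg1 preferred over seg0) = sigmoid(score1 - score0)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (B, T, obs_dim), act: (B, T, act_dim) -> per-segment score of shape (B,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1).sum(dim=1)


def predictor_loss(predictor, seg0, seg1, label):
    """Stage 1: fit the predictor to human labels in Dpref (label = 1 means seg1 was preferred)."""
    logits = predictor(*seg1) - predictor(*seg0)
    return F.binary_cross_entropy_with_logits(logits, label.float())


def policy_loss(predictor, policy, obs, dataset_act):
    """Stage 2: push the policy's actions to be preferred over dataset actions
    under the frozen predictor -- no reward model is trained in between."""
    policy_act = policy(obs)                                   # (B, T, act_dim)
    logits = predictor(obs, policy_act) - predictor(obs, dataset_act)
    return -F.logsigmoid(logits).mean()
```

A full run would alternate minibatch updates of `predictor_loss` on the preference-labeled dataset, freeze the predictor, and then optimize `policy_loss` over the large unlabeled dataset, with the hyperparameters reported in Tables 6 and 7 governing the two stages.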