Is RLHF More Difficult than Standard RL? A Theoretical Perspective
Authors: Yuanhao Wang, Qinghua Liu, Chi Jin
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. |
| Researcher Affiliation | Academia | Yuanhao Wang, Qinghua Liu, Chi Jin; Princeton University; {yuanhao,qinghual,chij}@princeton.edu |
| Pseudocode | Yes | Algorithm 1: Preference-to-Reward (P2R) Interface; Algorithm 2: Preference-based OMLE (P-OMLE); Algorithm 3: Optimistic MLE with ϵ-Perturbed Reward Feedback; Algorithm 4: Learning von Neumann Winner via Adversarial MDP Algorithms; Algorithm 5: Learning von Neumann Winner via Optimistic MLE. A hedged sketch of the P2R idea follows the table. |
| Open Source Code | No | The paper does not provide a statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments using datasets. It refers to various MDP types (e.g., tabular MDPs, linear MDPs) as problem settings to which its theoretical results apply, not as specific datasets used for training. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments that involve dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not describe any computational experiments that would require specific hardware specifications. |
| Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithms and proofs rather than empirical experimental setups with hyperparameters or training configurations. |
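
Algorithm 1 (P2R) is the paper's core reduction: it lets a standard reward-based RL algorithm run on preference feedback by turning pairwise comparisons into reward estimates. The Python sketch below is a minimal illustration of that idea under stated assumptions, not the paper's exact procedure: it assumes a Bradley-Terry link (preference probability is a sigmoid of the reward gap), a fixed baseline trajectory, hashable trajectories, and caller-supplied `preference_oracle` and `num_queries`, all of which are hypothetical names introduced here.

```python
import math

def estimate_reward(traj, baseline, preference_oracle, num_queries=100):
    """Illustrative preference-to-reward conversion (a sketch, not the paper's exact P2R).

    Assumed model (Bradley-Terry): P(traj preferred over baseline)
    = sigmoid(r(traj) - r(baseline)). We query the caller-supplied
    preference oracle repeatedly, estimate that probability, and invert
    the sigmoid to recover the reward gap relative to the baseline.
    """
    wins = sum(preference_oracle(traj, baseline) for _ in range(num_queries))
    p_hat = min(max(wins / num_queries, 1e-3), 1 - 1e-3)  # keep the estimate away from 0 and 1
    return math.log(p_hat / (1 - p_hat))  # logit = inverse sigmoid

def make_reward_interface(baseline, preference_oracle):
    """Wrap comparisons so a reward-based RL algorithm can simply call reward(traj).

    Assumes trajectories are hashable (e.g., tuples of state-action pairs).
    """
    cache = {}
    def reward(traj):
        if traj not in cache:  # issue comparison queries only once per trajectory
            cache[traj] = estimate_reward(traj, baseline, preference_oracle)
        return cache[traj]
    return reward

# Hypothetical usage with a toy oracle that prefers longer trajectories:
# oracle = lambda a, b: int(len(a) >= len(b))
# reward = make_reward_interface(baseline=("s0",), preference_oracle=oracle)
# reward(("s0", "s1"))  # estimated reward gap relative to the baseline
```

In the paper, the decision of when to issue comparison queries is driven by confidence intervals so that the total number of preference queries stays small; the simple cache above only stands in for that mechanism.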