Is RLHF More Difficult than Standard RL? A Theoretical Perspective

Authors: Yuanhao Wang, Qinghua Liu, Chi Jin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper theoretically proves that, for a wide range of preference models, preference-based RL can be solved directly using existing algorithms and techniques for reward-based RL, with small or no extra cost.
Researcher Affiliation | Academia | Yuanhao Wang, Qinghua Liu, Chi Jin (Princeton University); {yuanhao,qinghual,chij}@princeton.edu
Pseudocode | Yes | Algorithm 1: Preference-to-Reward (P2R) Interface; Algorithm 2: Preference-based OMLE (P-OMLE); Algorithm 3: Optimistic MLE with ϵ-Perturbed Reward Feedback; Algorithm 4: Learning von Neumann Winner via Adversarial MDP Algorithms; Algorithm 5: Learning von Neumann Winner via Optimistic MLE. A hedged sketch of the P2R idea follows this table.
Open Source Code | No | The paper does not provide a statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and does not conduct experiments using datasets. It refers to various MDP classes (e.g., tabular MDPs, linear MDPs) as problem settings to which its theoretical results apply, not as datasets used for training.
Dataset Splits | No | The paper is theoretical and does not conduct experiments that involve dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any computational experiments that would require specific hardware.
Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and focuses on algorithms and proofs rather than empirical setups with hyperparameters or training configurations.
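
As context for the Pseudocode row, below is a minimal, hedged sketch of the Preference-to-Reward (P2R) idea: turning comparison feedback against a fixed reference trajectory back into approximate reward feedback that a standard reward-based RL algorithm can consume. The helper names (p2r_reward, oracle, link_inverse) and the Bradley–Terry (logistic) link in the usage example are illustrative assumptions, not the paper's Algorithm 1 verbatim, which additionally decides when a fresh comparison query is actually needed.

```python
import numpy as np

def p2r_reward(traj, oracle, ref_traj, link_inverse, num_queries=200):
    """Illustrative Preference-to-Reward style conversion (assumed interface).

    `oracle(a, b)` is assumed to return True with probability
    sigma(r(a) - r(b)) for a known link function sigma, and `link_inverse`
    is sigma^{-1}. The return value estimates r(traj) - r(ref_traj).
    """
    # Estimate the probability that `traj` is preferred over the fixed reference.
    wins = sum(oracle(traj, ref_traj) for _ in range(num_queries))
    p_hat = wins / num_queries
    # Keep the estimate away from {0, 1} so the inverse link stays finite.
    p_hat = float(np.clip(p_hat, 1e-3, 1 - 1e-3))
    # Invert the link to recover a relative reward usable by a reward-based learner.
    return link_inverse(p_hat)


# Toy usage with a Bradley–Terry (logistic) link and two named trajectories.
rng = np.random.default_rng(0)
true_reward = {"tau": 1.0, "tau_ref": 0.0}          # hypothetical ground truth
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
oracle = lambda a, b: rng.random() < sigmoid(true_reward[a] - true_reward[b])
logit = lambda p: float(np.log(p / (1.0 - p)))      # inverse of the logistic link

print(p2r_reward("tau", oracle, "tau_ref", logit))  # ≈ 1.0 up to sampling noise
```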