Is RLHF More Difficult than Standard RL? A Theoretical Perspective

Authors: Yuanhao Wang, Qinghua Liu, Chi Jin

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | This paper theoretically proves that, for a wide range of preference models, preference-based RL can be solved directly using existing algorithms and techniques for reward-based RL, with small or no extra cost.
Researcher Affiliation | Academia | Yuanhao Wang, Qinghua Liu, Chi Jin (Princeton University); {yuanhao,qinghual,chij}@princeton.edu
Pseudocode | Yes | Algorithm 1: Preference-to-Reward (P2R) Interface; Algorithm 2: Preference-based OMLE (P-OMLE); Algorithm 3: Optimistic MLE with ϵ-Perturbed Reward Feedback; Algorithm 4: Learning von Neumann Winner via Adversarial MDP Algorithms; Algorithm 5: Learning von Neumann Winner via Optimistic MLE. A hedged sketch of the P2R idea follows this table.
Open Source Code | No | The paper does not provide a statement or link indicating that source code for the described methodology is publicly available.
Open Datasets | No | The paper is theoretical and does not conduct experiments using datasets. It refers to various MDP classes (e.g., tabular MDPs, linear MDPs) as problem settings to which its theoretical results apply, not as datasets used for training.
Dataset Splits | No | The paper is theoretical and does not conduct experiments that involve dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any computational experiments that would require specific hardware.
Software Dependencies | No | The paper is theoretical and does not describe empirical experiments or specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and focuses on algorithms and proofs rather than empirical setups with hyperparameters or training configurations.
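
As context for the Pseudocode row, below is a minimal, hedged sketch of the Preference-to-Reward (P2R) idea: turning comparison feedback against a fixed reference trajectory back into approximate reward feedback that a standard reward-based RL algorithm can consume. The helper names (p2r_reward, oracle, link_inverse) and the Bradley–Terry (logistic) link in the usage example are illustrative assumptions, not the paper's Algorithm 1 verbatim, which additionally decides when a fresh comparison query is actually needed.

```python
import numpy as np

def p2r_reward(traj, oracle, ref_traj, link_inverse, num_queries=200):
    """Illustrative Preference-to-Reward style conversion (assumed interface).

    `oracle(a, b)` is assumed to return True with probability
    sigma(r(a) - r(b)) for a known link function sigma, and `link_inverse`
    is sigma^{-1}. The return value estimates r(traj) - r(ref_traj).
    """
    # Estimate the probability that `traj` is preferred over the fixed reference.
    wins = sum(oracle(traj, ref_traj) for _ in range(num_queries))
    p_hat = wins / num_queries
    # Keep the estimate away from {0, 1} so the inverse link stays finite.
    p_hat = float(np.clip(p_hat, 1e-3, 1 - 1e-3))
    # Invert the link to recover a relative reward usable by a reward-based learner.
    return link_inverse(p_hat)


# Toy usage with a Bradley–Terry (logistic) link and two named trajectories.
rng = np.random.default_rng(0)
true_reward = {"tau": 1.0, "tau_ref": 0.0}          # hypothetical ground truth
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
oracle = lambda a, b: rng.random() < sigmoid(true_reward[a] - true_reward[b])
logit = lambda p: float(np.log(p / (1.0 - p)))      # inverse of the logistic link

print(p2r_reward("tau", oracle, "tau_ref", logit))  # ≈ 1.0 up to sampling noise
```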