MaxMin-RLHF: Alignment with Diverse Human Preferences

Authors: Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Dinesh Manocha, Furong Huang, Amrit Bedi, Mengdi Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences.
Researcher Affiliation | Collaboration | 1Department of Computer Science, University of Maryland, College Park, MD, USA. 2Department of Electrical and Computer Engineering, Princeton University, NJ, USA. 3JP Morgan Chase AI Research, New York, USA. 4Department of Computer Science, University of Central Florida, FL, USA.
Pseudocode | Yes | Algorithm 1: MaxMin RLHF. Algorithm 2: Learning Rewards with EM Algorithm. (Illustrative sketches of both algorithms appear after this table.)
Open Source Code | No | The paper does not include any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We use the IMDb dataset as a basis for our inputs (Maas et al., 2011)... We use the same dataset as Jang et al. (2023) and 10k data points from GPT4Alpaca (Peng et al., 2023) are used as the instruction dataset...
Dataset Splits | No | The paper mentions training and testing data splits but does not explicitly mention a "validation" split or provide details for one.
Hardware Specification | No | The paper mentions models like GPT-2 and Tulu2-7B and states experiments were run, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using GPT-2, Tulu2-7B, PPO, and the EM algorithm, but does not specify version numbers for these or any other software dependencies such as programming languages or libraries.
Experiment Setup | Yes | For SFT, we fine-tune GPT-2 until convergence on reviews from the train split of the IMDB dataset and use this GPT-2 backbone for both the reward model and PPO training. We use the same dataset as Jang et al. (2023), and 10k data points from GPT4Alpaca (Peng et al., 2023) are used as the instruction dataset to generate rollouts, collect pairwise feedback data, and run PPO training. We utilize GPT-4 to simulate human annotators with the preference prompts described in Table 4 in Appendix F. We divide the datasets into groups of human users. Each group has 40 users, split into 30 users in the training data and 10 users in the testing data. For the experiments in this subsection, we use Tulu2-7B (Ivison et al., 2023) as the base model. We have 60 users in the training data, mixed from two different groups with diverse preferences; originally, users are evenly distributed between the two clusters. Then, we use the EM algorithm to train |U| = 2 reward models until convergence, updating ϕ_u, u = 1, ..., |U|, by minimizing the negative log-likelihood loss (2). Following Jang et al. (2023), we use the same 50 instances from the Koala evaluation (Geng et al., 2023) and test the model's ability to generate answers matching the preferences of different groups of users. We run pairwise evaluations by GPT-4 using the AlpacaFarm codebase (Dubois et al., 2023) and use the win rate against the base Tulu2-7B model as the metric. (Sketches of the EM reward-learning step and the max-min reward aggregation follow this table.)
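
Since the paper provides Algorithm 2 (Learning Rewards with EM Algorithm) only as pseudocode, below is a minimal PyTorch sketch of one EM iteration under a Bradley-Terry preference model with latent user groups. The reward-model interface, the per-user data layout, and all names (`em_step`, `user_data`, `mixture_weights`) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def em_step(reward_models, optimizers, user_data, mixture_weights):
    """One EM iteration over per-user pairwise preference data (illustrative sketch).

    reward_models: list of nn.Modules mapping (n_pairs, feat_dim) -> (n_pairs, 1)
    user_data: list over users of (chosen_feats, rejected_feats) tensors, a
               hypothetical encoding of each user's (prompt, response) pairs
    mixture_weights: tensor of shape (n_groups,) with current group proportions
    """
    n_users, n_groups = len(user_data), len(reward_models)

    # E-step: posterior responsibility of each latent group for each user,
    # using the Bradley-Terry log-likelihood of that user's preferences.
    log_resp = torch.zeros(n_users, n_groups)
    with torch.no_grad():
        for i, (chosen, rejected) in enumerate(user_data):
            for u, rm in enumerate(reward_models):
                margin = rm(chosen) - rm(rejected)            # (n_pairs, 1)
                log_resp[i, u] = torch.log(mixture_weights[u]) + F.logsigmoid(margin).sum()
    resp = torch.softmax(log_resp, dim=1)                      # (n_users, n_groups)

    # M-step: update each phi_u by minimizing the responsibility-weighted
    # negative log-likelihood (the paper's loss (2)).
    for u, (rm, opt) in enumerate(zip(reward_models, optimizers)):
        opt.zero_grad()
        loss = torch.zeros(())
        for i, (chosen, rejected) in enumerate(user_data):
            margin = rm(chosen) - rm(rejected)
            loss = loss - resp[i, u] * F.logsigmoid(margin).sum()
        loss.backward()
        opt.step()

    return resp.mean(dim=0)   # updated mixture proportions
```

Repeating `em_step` until the responsibilities and mixture weights stop changing corresponds to the "train |U| = 2 reward models until convergence" step quoted above.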
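
Algorithm 1 (MaxMin RLHF) then optimizes the policy against the minimum of the learned group rewards. Below is a minimal sketch of that aggregation step, assuming the same hypothetical reward-model interface as above; the output would stand in for the single reward signal during PPO training.

```python
import torch

def maxmin_reward(reward_models, response_feats):
    """Per-sample reward = min over group reward models (the max-min objective).

    response_feats: (batch, feat_dim) encoding of the rollouts; the resulting
    scalar rewards are what a PPO step would maximize, so the policy is pushed
    to serve the worst-off preference group.
    """
    scores = torch.stack([rm(response_feats).squeeze(-1) for rm in reward_models])
    return scores.min(dim=0).values
```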