Preference Learning Algorithms Do Not Learn Preference Rankings

Authors: Angelica Chen, Sadhika Malladi, Lily Zhang, Xinyi Chen, Qiuyi (Richard) Zhang, Rajesh Ranganath, Kyunghyun Cho

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We also derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. (a minimal ranking-accuracy sketch follows the table)
Researcher Affiliation | Collaboration | Angelica Chen, New York University, ac5968@nyu.edu; Sadhika Malladi, Princeton University, smalladi@princeton.edu; Lily H. Zhang, New York University, lily.h.zhang@nyu.edu; Xinyi Chen, Google DeepMind and Princeton University, xinyic@google.com; Qiuyi Zhang, Google DeepMind, qiuyiz@google.com; Rajesh Ranganath, New York University, rajeshr@cims.nyu.edu; Kyunghyun Cho, New York University, Genentech, and CIFAR LMB, kyunghyun.cho@nyu.edu
Pseudocode | No | The paper contains mathematical definitions and theorems (e.g., Definition 2.2, Theorem 3.1), but no explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | We provide an anonymized version of our code.
Open Datasets | Yes | common preference datasets, such as UltraFeedback [7], Anthropic helpfulness and harmlessness (HH-RLHF, [14]), and Stanford Human Preferences (SHP, [11]) (Figure 1). ... AlpacaFarm Validation [9] is sourced from the AlpacaEval dataset, but with new splits repurposed for training preference-tuned models. ... This particular split can be found at https://huggingface.co/datasets/tatsu-lab/alpaca_eval/blob/main/alpaca_farm_human_crossannotations.json (a dataset-loading sketch follows the table)
Dataset Splits | Yes | We split the test dataset in half, using half for validation during hyperparameter tuning.
Hardware Specification | Yes | The GPT2 models were trained on a single Nvidia A100 GPU each, and the Pythia 2.8B and Llama 2 7B models were trained on two Nvidia A100 GPUs each.
Software Dependencies | No | We use PyTorch and the Hugging Face transformers and datasets libraries to compute all ranking accuracies.
Experiment Setup | Yes | We ran a separate hyperparameter search for each class of model and for each stage of training (i.e., SFT versus DPO). The hyperparameter ranges we searched were: SFT: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {64, 128, 256, 512}; DPO: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {32, 64, 128}, β {0.01, 0.1, 1.0, 10.0} (a grid-enumeration sketch follows the table)
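To make the quoted finding concrete, the following is a minimal sketch (not the authors' released code) of how a pairwise ranking accuracy can be computed with PyTorch and the Hugging Face transformers library, the dependencies named in the Software Dependencies row. The model name, helper names, and the plain log-likelihood comparison are assumptions for illustration; the paper's exact definition (for example, reference-model normalization in the DPO-style accuracy) should be taken from the released code.

```python
# Sketch: fraction of preference pairs where the model assigns higher
# log-likelihood to the chosen response than to the rejected one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates GPT2, Pythia 2.8B, and Llama 2 7B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def response_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    # Log-probability of each token given the preceding tokens.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens (positions after the prompt).
    n_prompt = prompt_ids.shape[1]
    return token_logprobs[:, n_prompt - 1:].sum().item()

def ranking_accuracy(pairs: list[tuple[str, str, str]]) -> float:
    """`pairs` is a list of (prompt, chosen, rejected) triples."""
    correct = sum(
        response_logprob(p, chosen) > response_logprob(p, rejected)
        for p, chosen, rejected in pairs
    )
    return correct / len(pairs)
```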
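The preference datasets cited in the Open Datasets row are publicly hosted on the Hugging Face Hub. Below is a small sketch, assuming the standard Hub IDs for HH-RLHF and SHP (the specific mirrors the authors used are not stated in the excerpt), that also shows the test-split halving described in the Dataset Splits row.

```python
from datasets import load_dataset

# Assumed Hub IDs; UltraFeedback and the AlpacaFarm validation split are also
# available on the Hub but are omitted here.
hh = load_dataset("Anthropic/hh-rlhf")    # Anthropic helpfulness and harmlessness
shp = load_dataset("stanfordnlp/SHP")     # Stanford Human Preferences

# Halve the test split: one half for validation during hyperparameter tuning,
# the other half for final evaluation.
halves = shp["test"].train_test_split(test_size=0.5, seed=0)
validation_set, test_set = halves["train"], halves["test"]
```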
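Finally, the search described in the Experiment Setup row can be enumerated as a simple grid. The sketch below only reproduces the stated search space; the variable names are placeholders and the training loop itself is not shown.

```python
from itertools import product

# SFT grid: 4 learning rates x 4 batch sizes = 16 configurations.
sft_configs = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in product([5e-7, 1e-6, 5e-6, 1e-5], [64, 128, 256, 512])
]

# DPO grid: 4 learning rates x 3 batch sizes x 4 betas = 48 configurations.
dpo_configs = [
    {"lr": lr, "batch_size": bs, "beta": beta}
    for lr, bs, beta in product(
        [5e-7, 1e-6, 5e-6, 1e-5], [32, 64, 128], [0.01, 0.1, 1.0, 10.0]
    )
]

print(len(sft_configs), len(dpo_configs))  # 16 48
```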