Preference Learning Algorithms Do Not Learn Preference Rankings
Authors: Angelica Chen, Sadhika Malladi, Lily Zhang, Xinyi Chen, Qiuyi (Richard) Zhang, Rajesh Ranganath, Kyunghyun Cho
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We also derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. |
| Researcher Affiliation | Collaboration | Angelica Chen, New York University, ac5968@nyu.edu; Sadhika Malladi, Princeton University, smalladi@princeton.edu; Lily H. Zhang, New York University, lily.h.zhang@nyu.edu; Xinyi Chen, Google DeepMind and Princeton University, xinyic@google.com; Qiuyi Zhang, Google DeepMind, qiuyiz@google.com; Rajesh Ranganath, New York University, rajeshr@cims.nyu.edu; Kyunghyun Cho, New York University, Genentech, and CIFAR LMB, kyunghyun.cho@nyu.edu |
| Pseudocode | No | The paper contains mathematical definitions and theorems (e.g., Definition 2.2, Theorem 3.1), but no explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | We provide an anonymized version of our code. |
| Open Datasets | Yes | common preference datasets, such as UltraFeedback [7], Anthropic helpfulness and harmlessness (HH-RLHF, [14]), and Stanford Human Preferences (SHP, [11]) (Figure 1). ... AlpacaFarm Validation [9] is sourced from the AlpacaEval dataset, but with new splits repurposed for training preference-tuned models. ... This particular split can be found at https://huggingface.co/datasets/tatsu-lab/alpaca_eval/blob/main/alpaca_farm_human_crossannotations.json. |
| Dataset Splits | Yes | We split the test dataset in half, using half for validation during hyperparameter tuning. (A split of this form is sketched below the table.) |
| Hardware Specification | Yes | The GPT2 models were trained on a single Nvidia A100 GPU each, and the Pythia 2.8B and Llama 2 7B models were trained on two Nvidia A100 GPUs each. |
| Software Dependencies | No | We use PyTorch and the Hugging Face transformers and datasets libraries to compute all ranking accuracies. (A sketch of such a computation appears below the table.) |
| Experiment Setup | Yes | We ran a separate hyperparameter search for each class of model and for each stage of training (i.e., SFT versus DPO). The hyperparameter ranges we searched were: SFT: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {64, 128, 256, 512}; DPO: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {32, 64, 128}, β {0.01, 0.1, 1.0, 10.0}. (The full grid is sketched below the table.) |
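
The Software Dependencies row notes that ranking accuracies were computed with PyTorch and the Hugging Face transformers library. Below is a minimal sketch of that computation for DPO-style models, using the implicit reward r(x, y) = β(log πθ(y|x) − log π_ref(y|x)); the prompt/chosen/rejected field names, the helper functions, and the per-example loop are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: pairwise ranking accuracy under the DPO implicit reward
#   r(x, y) = beta * (log pi_theta(y | x) - log pi_ref(y | x)).
# Field names ("prompt", "chosen", "rejected") and helper names are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt, completion):
    """Sum of token log-probabilities of `completion` conditioned on `prompt`."""
    device = next(model.parameters()).device
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # next-token targets
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only positions that belong to the completion (approximate: BPE merges
    # at the prompt/completion boundary can shift this by a token).
    start = prompt_ids.shape[1] - 1
    return token_logprobs[:, start:].sum().item()

def ranking_accuracy(policy, reference, tokenizer, pairs, beta=0.1):
    """Fraction of (chosen, rejected) pairs the implicit reward orders correctly."""
    correct = 0
    for ex in pairs:
        def reward(completion):
            return beta * (
                completion_logprob(policy, tokenizer, ex["prompt"], completion)
                - completion_logprob(reference, tokenizer, ex["prompt"], completion)
            )
        correct += int(reward(ex["chosen"]) > reward(ex["rejected"]))
    return correct / len(pairs)
```

Averaging this quantity over a preference dataset gives the ranking accuracy that the paper reports as below 60% for most state-of-the-art preference-tuned models.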
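
The Dataset Splits row reports splitting the test set in half, with one half used for validation during hyperparameter tuning. Below is a minimal sketch of such a split, assuming the Hugging Face datasets library and the public Anthropic/hh-rlhf preference dataset; the seed and the choice of dataset are illustrative assumptions.

```python
# Minimal sketch: halve a preference dataset's test split into validation and
# held-out test portions. Dataset choice and seed are assumptions.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf")                       # splits: "train", "test"
halves = hh["test"].train_test_split(test_size=0.5, seed=0)
validation_set = halves["train"]     # used for hyperparameter tuning
heldout_test_set = halves["test"]    # used for final ranking-accuracy evaluation
print(len(validation_set), len(heldout_test_set))
```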
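
The Experiment Setup row lists the hyperparameter ranges searched for SFT and DPO. The sketch below enumerates that grid; only the value ranges come from the quoted text, while the enumeration helper and variable names are assumptions.

```python
# Hypothetical grid enumeration over the reported hyperparameter ranges.
from itertools import product

SFT_GRID = {
    "learning_rate": [5e-7, 1e-6, 5e-6, 1e-5],
    "batch_size": [64, 128, 256, 512],
}
DPO_GRID = {
    "learning_rate": [5e-7, 1e-6, 5e-6, 1e-5],
    "batch_size": [32, 64, 128],
    "beta": [0.01, 0.1, 1.0, 10.0],
}

def configs(grid):
    """Yield one config dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

print(sum(1 for _ in configs(SFT_GRID)))  # 16 SFT configurations
print(sum(1 for _ in configs(DPO_GRID)))  # 48 DPO configurations
```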