Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Preference Learning Algorithms Do Not Learn Preference Rankings
Authors: Angelica Chen, Sadhika Malladi, Lily Zhang, Xinyi Chen, Qiuyi (Richard) Zhang, Rajesh Ranganath, Kyunghyun Cho
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We also derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. |
| Researcher Affiliation | Collaboration | Angelica Chen New York University EMAIL Sadhika Malladi Princeton University EMAIL Lily H. Zhang New York University EMAIL Xinyi Chen Google DeepMind; Princeton University EMAIL Qiuyi Zhang Google DeepMind EMAIL Rajesh Ranganath New York University EMAIL Kyunghyun Cho New York University; Genentech; CIFAR LMB EMAIL |
| Pseudocode | No | The paper contains mathematical definitions and theorems (e.g., Definition 2.2, Theorem 3.1), but no explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | We provide an anonymized version of our code. |
| Open Datasets | Yes | common preference datasets, such as UltraFeedback [7], Anthropic helpfulness and harmlessness (HH-RLHF, [14]), and Stanford Human Preferences (SHP, [11]) (Figure 1). ... AlpacaFarm Validation [9] is sourced from the AlpacaEval dataset, but with new splits repurposed for training preference-tuned models. ... This particular split can be found at https://huggingface.co/datasets/tatsu-lab/alpaca_eval/blob/main/alpaca_farm_human_crossannotations.json. |
| Dataset Splits | Yes | We split the test dataset in half, using half for validation during hyperparameter tuning. |
| Hardware Specification | Yes | The GPT2 models were trained on a single Nvidia A100 GPU each, and the Pythia 2.8B and Llama 2 7B models were trained on two Nvidia A100 GPUs each. |
| Software Dependencies | No | We use PyTorch and the Hugging Face transformers and datasets libraries to compute all ranking accuracies. |
| Experiment Setup | Yes | We ran a separate hyperparameter search for each class of model and for each stage of training (i.e., SFT versus DPO). The hyperparameter ranges we searched were: SFT: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {64, 128, 256, 512}; DPO: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {32, 64, 128}, β {0.01, 0.1, 1.0, 10.0}. |
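
The search ranges quoted in the Experiment Setup row can be written out as explicit configuration lists. The following is a minimal sketch, assuming the search was an exhaustive grid over the listed values (the quoted text gives only the ranges, not the search strategy). The names `SFT_GRID`, `DPO_GRID`, and `expand_grid` are hypothetical and do not come from the authors' code.

```python
from itertools import product

# Search ranges as quoted in the Experiment Setup row.
# Dictionary keys are illustrative names, not the authors' identifiers.
SFT_GRID = {
    "learning_rate": [5e-7, 1e-6, 5e-6, 1e-5],
    "batch_size": [64, 128, 256, 512],
}
DPO_GRID = {
    "learning_rate": [5e-7, 1e-6, 5e-6, 1e-5],
    "batch_size": [32, 64, 128],
    "beta": [0.01, 0.1, 1.0, 10.0],
}


def expand_grid(grid):
    """Return every combination of the grid's values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Assumption: a full Cartesian product; the source does not state this.
sft_configs = expand_grid(SFT_GRID)
dpo_configs = expand_grid(DPO_GRID)

print(len(sft_configs), len(dpo_configs))  # prints "16 48"
```

Under the full-grid assumption this gives 16 SFT and 48 DPO configurations per model class. If the search was random or partial, the actual number of runs would be smaller.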