reproducibilityindex.ai

Preference Learning Algorithms Do Not Learn Preference Rankings

Authors: Angelica Chen, Sadhika Malladi, Lily Zhang, Xinyi Chen, Qiuyi (Richard) Zhang, Rajesh Ranganath, Kyunghyun Cho

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We also derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly.
Researcher Affiliation	Collaboration	Angelica Chen New York University ac5968@nyu.edu Sadhika Malladi Princeton University smalladi@princeton.edu Lily H. Zhang New York University lily.h.zhang@nyu.edu Xinyi Chen Google DeepMind; Princeton University xinyic@google.com Qiuyi Zhang Google DeepMind qiuyiz@google.com Rajesh Ranganath New York University rajeshr@cims.nyu.edu Kyunghyun Cho New York University; Genentech; CIFAR LMB kyunghyun.cho@nyu.edu
Pseudocode	No	The paper contains mathematical definitions and theorems (e.g., Definition 2.2, Theorem 3.1), but no explicit pseudocode blocks or algorithm listings.
Open Source Code	Yes	We provide an anonymized version of our code.
Open Datasets	Yes	common preference datasets, such as UltraFeedback [7], Anthropic helpfulness and harmlessness (HH-RLHF, [14]), and Stanford Human Preferences (SHP, [11]) (Figure 1). ... AlpacaFarm Validation [9] is sourced from the AlpacaEval dataset, but with new splits repurposed for training preference-tuned models. ... This particular split can be found at https://huggingface.co/datasets/tatsu-lab/alpaca_eval/blob/main/alpaca_farm_human_crossannotations.json.
Dataset Splits	Yes	We split the test dataset in half, using half for validation during hyperparameter tuning.
Hardware Specification	Yes	The GPT2 models were trained on a single Nvidia A100 GPU each, and the Pythia 2.8B and Llama 2 7B models were trained on two Nvidia A100 GPUs each.
Software Dependencies	No	We use PyTorch and the Hugging Face transformers and datasets libraries to compute all ranking accuracies.
Experiment Setup	Yes	We ran a separate hyperparameter search for each class of model and for each stage of training (i.e., SFT versus DPO). The hyperparameter ranges we searched were: SFT: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {64, 128, 256, 512}; DPO: learning rate {5e-7, 1e-6, 5e-6, 1e-5}, batch size {32, 64, 128}, β {0.01, 0.1, 1.0, 10.0}.
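
To give a sense of the size of the search space in the Experiment Setup row, the sketch below enumerates the listed SFT and DPO ranges as a Cartesian product. It assumes an exhaustive grid search, which the excerpt does not state; the authors may have sampled only a subset of these combinations. The names SFT_GRID, DPO_GRID, and expand_grid are hypothetical and do not come from the paper's code.

```python
from itertools import product

# Ranges copied from the Experiment Setup row above.
# The dictionary and key names are illustrative assumptions.
SFT_GRID = {
    "learning_rate": [5e-7, 1e-6, 5e-6, 1e-5],
    "batch_size": [64, 128, 256, 512],
}
DPO_GRID = {
    "learning_rate": [5e-7, 1e-6, 5e-6, 1e-5],
    "batch_size": [32, 64, 128],
    "beta": [0.01, 0.1, 1.0, 10.0],
}


def expand_grid(grid):
    """Return every combination of the grid's values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


sft_configs = expand_grid(SFT_GRID)
dpo_configs = expand_grid(DPO_GRID)

# 4 * 4 = 16 SFT configurations and 4 * 3 * 4 = 48 DPO configurations.
print(len(sft_configs), len(dpo_configs))
```

Under the exhaustive-grid assumption, this comes to 16 SFT and 48 DPO configurations per model class, each of which would need its own training run.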