Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

Authors: Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Assessment (Variable / Result / LLM Response)
Research Type: Experimental
LLM Response: Our main finding is that approaches that use on-policy sampling and attempt to push down the likelihood on certain responses (i.e., employ a negative gradient) outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or negative gradient under a notion of mode-seeking objectives for categorical distributions. Our goal is to provide clarity on these questions by performing a rigorous study to understand the behavior of existing methods. Concretely, we operate under typical assumptions in preference fine-tuning literature such as existence of a ground-truth reward function that explains the preference dataset and study surrogate objectives that optimize KL-penalized (with respect to a reference policy) expected reward. We develop an analysis framework consisting of didactic bandit problems, synthetic LLM problems, and full-scale LLM problems, constructed out of AlpacaFarm (Dubois et al., 2024) and UltraFeedback (Cui et al., 2023). We then study behaviors of different methods given coverage conditions and geometric relationships in the problem.

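For reference, the KL-penalized expected-reward objective mentioned in that response is conventionally written as below (a standard formulation rather than a quote from the paper), where r is the reward, β the KL weight, and π_ref the reference policy:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\Bigl[\,
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\bigl( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)
\Bigr]
```
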
Researcher Affiliation: Collaboration
LLM Response: 1 CMU, 2 Stanford, 3 UW-Madison, 4 Google DeepMind.

Pseudocode: Yes
LLM Response: Algorithm 1: A Unified Fine-Tuning Algorithm.

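Algorithm 1 itself is not reproduced in this report. As an illustrative, hedged sketch of the kind of unified loop such an algorithm describes (draw responses on-policy or reuse offline ones, score them, then apply the chosen update, which may or may not push down likelihood on dispreferred samples), here is a minimal PyTorch example on a didactic bandit; the pairing scheme, loss choice, and all names are ours, not the paper's.

```python
# Hypothetical sketch of a unified preference fine-tuning loop on a didactic
# bandit (categorical policy over K "responses"): sample on-policy, rank the
# samples by reward, and update with a contrastive loss that pushes down the
# likelihood of the lower-ranked responses (a "negative gradient").
import torch

K = 10                                        # number of discrete responses (arms)
reward = torch.randn(K)                       # ground-truth reward per response
logits = torch.zeros(K, requires_grad=True)   # policy parameters
ref_logits = torch.zeros(K)                   # frozen reference policy
opt = torch.optim.Adam([logits], lr=1e-2)
beta = 0.1                                    # KL weight / contrastive temperature

for step in range(500):
    # On-policy sampling: draw a batch of responses from the current policy.
    probs = torch.softmax(logits, dim=-1)
    batch = torch.multinomial(probs.detach(), num_samples=32, replacement=True)
    r = reward[batch]

    # Build (chosen, rejected) pairs from the sampled batch by reward ranking.
    order = torch.argsort(r, descending=True)
    chosen, rejected = batch[order[:16]], batch[order[16:]]

    # DPO-style contrastive loss: raises chosen log-probs and lowers rejected
    # log-probs relative to the reference policy.
    logp = torch.log_softmax(logits, dim=-1)
    ref_logp = torch.log_softmax(ref_logits, dim=-1)
    margin = (logp[chosen] - ref_logp[chosen]) - (logp[rejected] - ref_logp[rejected])
    loss = -torch.nn.functional.logsigmoid(beta * margin).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

print("Policy mode:", torch.argmax(logits).item(), "Best arm:", torch.argmax(reward).item())
```

Replacing the contrastive loss with plain maximum likelihood on the top-ranked samples removes the negative-gradient term, and sampling from a fixed dataset instead of the current policy makes the update off-policy; these are the axes the paper varies.
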
Open Source Code: Yes
LLM Response: We have made the code for this project public in this repository. Please check this link for the project website, also this arXiv link for an extended version of this paper.

Open Datasets: Yes
LLM Response: We develop an analysis framework consisting of didactic bandit problems, synthetic LLM problems, and full-scale LLM problems, constructed out of AlpacaFarm (Dubois et al., 2024) and UltraFeedback (Cui et al., 2023). The additional datasets used in our experiments are listed below: Mode Length, Skew Length, Relabelled AlpacaFarm. Please check this link for the project website, also this arXiv link for an extended version of this paper.

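The two public datasets named in this entry are available on the Hugging Face Hub; a minimal loading sketch is below, assuming the standard hub IDs (tatsu-lab/alpaca_farm and openbmb/UltraFeedback), which are our assumption rather than identifiers stated in the paper. The paper-specific variants (Mode Length, Skew Length, Relabelled AlpacaFarm) are distributed through the project repository and are not covered here.

```python
# Minimal sketch for pulling the two public datasets with the Hugging Face
# `datasets` library. The hub IDs and config name are assumptions (the usual
# public releases), not taken from the paper.
from datasets import load_dataset

# AlpacaFarm human-preference config; script-based hub datasets may also need
# trust_remote_code=True depending on the installed `datasets` version.
alpaca_farm = load_dataset("tatsu-lab/alpaca_farm", "alpaca_human_preference")

# UltraFeedback preference data.
ultra_feedback = load_dataset("openbmb/UltraFeedback")

print(alpaca_farm)
print(ultra_feedback)
```
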
Dataset Splits: No
LLM Response: The paper mentions using datasets like AlpacaFarm and UltraFeedback but does not specify explicit numerical training, validation, or test splits (e.g., percentages or counts) or reference predefined splits for reproducibility within the text.

Hardware Specification: Yes
LLM Response: Synthetic LLM experiments use a single A40 GPU. Bandit experiments use an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz, with 4 threads.

Software Dependencies: No
LLM Response: The paper mentions building implementations off the 'Huggingface TRL implementation (von Werra et al., 2020)' and using the 'Adam Gradient Optimizer', as well as 'minGPT (Karpathy)'. However, it does not provide specific version numbers for these software components or for other key libraries such as Python, PyTorch, or CUDA, which are necessary for reproducible software dependencies.

Experiment Setup: Yes
LLM Response:
Table 4 (Algorithm-Agnostic Hyperparameters): batch size B = 64; mini-batch size B_mini = 8; gradient accumulation steps G = 8.
Table 5 (Sampling Hyperparameters): top_k = 0.0; top_p = 1.0; max_new_tokens = 256; temperature = 1.0.
Table 6 (DPO Hyperparameters): learning rate lr ∈ {1e-7, 5e-7, 1e-6, 5e-6, 1e-5}; KL weight β ∈ {0.01, 0.05, 0.1, 0.5}.
Table 7 (Pref-FT / Binary FeedME Hyperparameters): learning rate η ∈ {1e-7, 5e-7, 1e-6, 5e-6}.
Table 8 (PPO Hyperparameters): learning rate η ∈ {1e-7, 5e-7, 1e-6, 5e-6, 1e-5}; value function loss coefficient vf_coef = 0.1; initial KL penalty coefficient init_kl_coef = 0.2; target KL divergence for policy updates target_kl = 0.1.
Table 9 (RWR Hyperparameters): learning rate η ∈ {1e-7, 5e-7, 1e-6, 5e-6, 1e-5}; temperature β ∈ {0.1, 1, 10, 20}.
Table 10 (Iterated Best-of-N Hyperparameters): learning rate η ∈ {1e-7, 5e-7, 1e-6, 5e-6, 1e-5}; actions per prompt N ∈ {4, 10}.
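
To make the grids above easier to consume programmatically, here is a minimal sketch that collects them into plain Python dictionaries and enumerates one sweep; the layout and key names are ours, only the values come from Tables 4-10.

```python
from itertools import product

# Hyperparameter grids from Tables 4-10 above, collected into plain Python
# dicts (the layout and key names are ours; the values are as reported).
SHARED = {"batch_size": 64, "mini_batch_size": 8, "grad_accum_steps": 8}
SAMPLING = {"top_k": 0.0, "top_p": 1.0, "max_new_tokens": 256, "temperature": 1.0}

SWEEPS = {
    "dpo": {"lr": [1e-7, 5e-7, 1e-6, 5e-6, 1e-5], "beta": [0.01, 0.05, 0.1, 0.5]},
    "pref_ft_binary_feedme": {"lr": [1e-7, 5e-7, 1e-6, 5e-6]},
    "ppo": {"lr": [1e-7, 5e-7, 1e-6, 5e-6, 1e-5], "vf_coef": [0.1],
            "init_kl_coef": [0.2], "target_kl": [0.1]},
    "rwr": {"lr": [1e-7, 5e-7, 1e-6, 5e-6, 1e-5], "beta": [0.1, 1, 10, 20]},
    "iterated_best_of_n": {"lr": [1e-7, 5e-7, 1e-6, 5e-6, 1e-5],
                           "actions_per_prompt": [4, 10]},
}

if __name__ == "__main__":
    # Example: enumerate the DPO grid (20 configurations) for a sweep driver.
    for lr, beta in product(SWEEPS["dpo"]["lr"], SWEEPS["dpo"]["beta"]):
        print(f"dpo: lr={lr}, beta={beta}")
```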