The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Authors: Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that HyPO outperforms DPO on the TL;DR summarization task [40] on all metrics, including both the GPT-4 win rate and the reverse KL divergence to the reference policy, and on general chat benchmarks such as AlpacaEval 2.0 [15], trained with the UltraFeedback dataset [14]. Theoretically and empirically, we demonstrate that HyPO is more performant than its pure offline counterpart DPO, while still preserving its computational and memory efficiency.
Researcher Affiliation | Collaboration | Yuda Song (Carnegie Mellon University, yudas@cs.cmu.edu); Gokul Swamy (Carnegie Mellon University, gswamy@cs.cmu.edu); Aarti Singh (Carnegie Mellon University, aarti@cs.cmu.edu); J. Andrew Bagnell (Aurora Innovation and Carnegie Mellon University, dbagnell@aurora.tech); Wen Sun (Cornell University, ws455@cornell.edu)
Pseudocode | Yes | Algorithm 1: Hybrid Preference Optimization (HyPO). Require: pretrained LLM π_{θ_0}, reference policy π_ref, offline data D, learning rate α, KL coefficient λ.
1: for t = 1, ..., T do
2:   Sample a minibatch of offline data D_off := {(x, y+, y-)} ∼ D.
3:   Compute the DPO loss ℓ_dpo := Σ_{(x, y+, y-) ∈ D_off} log σ( β log[π_{θ_{t-1}}(y+|x) / π_ref(y+|x)] - β log[π_{θ_{t-1}}(y-|x) / π_ref(y-|x)] ).
4:   Sample (unlabeled) online data D_on := {(x, y)} with x ∼ D, y ∼ π_{θ_{t-1}}(·|x).
5:   Compute ℓ_kl := Σ_{(x, y) ∈ D_on} log π_{θ_{t-1}}(y|x) · sg( log[π_{θ_{t-1}}(y|x) / π_ref(y|x)] ).
6:   Update θ_t = θ_{t-1} + α ∇_{θ_{t-1}} (ℓ_dpo - λ ℓ_kl).
7: end for
8: Return π_{θ_T}.
(A runnable sketch of this update appears after the table.)
Open Source Code | No | No explicit statement about the release of source code for the described methodology, or a direct link to a code repository, is provided within the main body of the paper.
Open Datasets | Yes | Our first experiment is on the TL;DR dataset [40]... The TL;DR dataset is available at https://github.com/openai/summarize-from-feedback. ...fine-tune the Meta-Llama-3-8B-Instruct [27] model on the UltraFeedback dataset [14]... The dataset card of the UltraFeedback dataset [14] is Hugging Face H4/ultrafeedback_binarized. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | The human reference dataset contains 117K training, 6.45K validation, and 6.55K testing examples. The preference dataset contains 92.9K training and 83.8K validation examples.
Hardware Specification | Yes | For our experiment, we run on a cluster with a mixture of Nvidia A6000 and L40 GPUs with 48 GB of VRAM, using 4 GPUs in parallel for training... We run the general chat experiment on a node of 8 Nvidia A100 80 GB GPUs.
Software Dependencies | No | The paper mentions tools and models such as Pythia, Llama-3, LoRA, and RLOO by name and citation, but it does not specify version numbers for these or for core programming languages and deep learning frameworks (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | We provide the hyperparameters for HyPO and DPO. ... We summarize the hyperparameters of each baseline below. (Referring to Tables 4-9, which contain specific learning rates, batch sizes, optimizers, and other training parameters for RM/SFT, DPO, and HyPO; an illustrative configuration sketch follows the table.)
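
The HyPO update in Algorithm 1 combines an offline DPO term with an on-policy KL regularizer whose log-ratio is placed under a stop-gradient. The following is a minimal PyTorch sketch of that objective, assuming a hypothetical helper logprob_fn(model, x, y) that returns the summed token log-probability log π(y|x); it is not the authors' implementation, and the default beta and lam values are placeholders rather than values from the paper.

    import torch
    import torch.nn.functional as F

    def hypo_loss(logprob_fn, policy, ref_policy, offline_batch, online_batch,
                  beta=0.05, lam=1e-4):
        # Offline DPO term over preference triples (x, y+, y-) from D_off.
        x_off, y_pos, y_neg = offline_batch
        logp_pos = logprob_fn(policy, x_off, y_pos)   # log pi_theta(y+|x)
        logp_neg = logprob_fn(policy, x_off, y_neg)   # log pi_theta(y-|x)
        with torch.no_grad():
            ref_pos = logprob_fn(ref_policy, x_off, y_pos)
            ref_neg = logprob_fn(ref_policy, x_off, y_neg)
        margin = beta * (logp_pos - ref_pos) - beta * (logp_neg - ref_neg)
        l_dpo = F.logsigmoid(margin).sum()

        # Online KL term over samples y ~ pi_theta(.|x) from D_on; computing the
        # log-ratio inside no_grad plays the role of sg[.] in line 5 of Algorithm 1.
        x_on, y_on = online_batch
        logp_on = logprob_fn(policy, x_on, y_on)
        with torch.no_grad():
            sg_ratio = logp_on - logprob_fn(ref_policy, x_on, y_on)
        l_kl = (logp_on * sg_ratio).sum()

        # Algorithm 1 ascends l_dpo - lam * l_kl; return its negation so that a
        # standard minimizer (e.g., AdamW) performs the same update direction.
        return -(l_dpo - lam * l_kl)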
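Both preference datasets referenced in the Open Datasets row are publicly hosted. The sketch below shows one way they might be pulled with the Hugging Face datasets library; the hub ids, configuration name, and split names are assumptions based on the public dataset cards, not details quoted from the paper.

    from datasets import load_dataset

    # UltraFeedback (binarized); hub id and split name are assumed from the dataset card.
    ultrafeedback = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                                 split="train_prefs")

    # TL;DR preference comparisons mirroring the data released at
    # https://github.com/openai/summarize-from-feedback (hub id is an assumption).
    tldr = load_dataset("openai/summarize_from_feedback", "comparisons", split="train")

    # Inspect the available fields of each dataset.
    print(ultrafeedback.column_names)
    print(tldr.column_names)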
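Since the Experiment Setup row defers the concrete values to Tables 4-9, the sketch below only illustrates how such hyperparameters could be grouped into a single config object; every number shown is a placeholder, not a value reported in the paper.

    from dataclasses import dataclass

    @dataclass
    class HyPOConfig:
        """Illustrative container for the kinds of hyperparameters listed in
        Tables 4-9; all defaults below are placeholders."""
        learning_rate: float = 1e-6   # placeholder; see the paper's tables
        batch_size: int = 64          # placeholder
        dpo_beta: float = 0.05        # beta in the DPO loss (placeholder)
        kl_coeff: float = 1e-4        # lambda in Algorithm 1 (placeholder)
        optimizer: str = "adamw"      # placeholder
        num_epochs: int = 1           # placeholder

    # Override fields with the values from the relevant table when reproducing a run.
    config = HyPOConfig(learning_rate=3e-6)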