The Importance of Online Data: Understanding Preference Fine-tuning via Coverage
Authors: Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that HyPO outperforms DPO on the TL;DR summarization task [40] on all metrics, including both the GPT-4 win rate and the reverse KL divergence to the reference policy, and on general chat benchmarks such as AlpacaEval 2.0 [15], trained with the UltraFeedback dataset [14]. Theoretically and empirically, we demonstrate that HyPO is more performant than its pure offline counterpart DPO, while still preserving its computational and memory efficiency. |
| Researcher Affiliation | Collaboration | Yuda Song (Carnegie Mellon University, yudas@cs.cmu.edu); Gokul Swamy (Carnegie Mellon University, gswamy@cs.cmu.edu); Aarti Singh (Carnegie Mellon University, aarti@cs.cmu.edu); J. Andrew Bagnell (Aurora Innovation and Carnegie Mellon University, dbagnell@aurora.tech); Wen Sun (Cornell University, ws455@cornell.edu) |
| Pseudocode | Yes | Algorithm 1: Hybrid Preference Optimization (HyPO). Require: pretrained LLM π_{θ0}, reference policy π_ref, offline data D, learning rate α, KL coefficient λ. 1: for t = 1, ..., T do. 2: Sample a minibatch of offline data D_off := {x, y⁺, y⁻} ∼ D. 3: Compute DPO loss ℓ_dpo := Σ_{(x,y⁺,y⁻)∈D_off} log σ(β log(π_{θ_{t−1}}(y⁺|x) / π_ref(y⁺|x)) − β log(π_{θ_{t−1}}(y⁻|x) / π_ref(y⁻|x))). 4: Sample (unlabeled) online data D_on := {x, y} where x ∼ D, y ∼ π_{θ_{t−1}}(·|x). 5: Compute ℓ_kl := Σ_{(x,y)∈D_on} log π_{θ_{t−1}}(y|x) · sg(log(π_{θ_{t−1}}(y|x) / π_ref(y|x))). 6: Update θ_t = θ_{t−1} + α ∇_{θ_{t−1}}(ℓ_dpo − λ ℓ_kl). 7: return π_T. (A hedged PyTorch sketch of this update appears after this table.) |
| Open Source Code | No | No explicit statement about the release of source code for the described methodology or a direct link to a code repository is provided within the main body of the paper. |
| Open Datasets | Yes | Our first experiment is on the TL;DR dataset [40]... The TL;DR dataset is available at https://github.com/openai/summarize-from-feedback. ...finetune the Meta-Llama-3-8B-Instruct [27] model on the UltraFeedback dataset [14]... The dataset card of the UltraFeedback dataset [14] is HuggingFaceH4/ultrafeedback_binarized on Hugging Face. (A loading snippet appears after this table.) |
| Dataset Splits | Yes | The human reference dataset contains 117K training, 6.45K validation, and 6.55K test examples. The preference dataset contains 92.9K training and 83.8K validation examples. |
| Hardware Specification | Yes | For our experiment, we run on a cluster with a mixture of Nvidia A6000 and L40 GPUs (48 GB VRAM). We use 4 GPUs in parallel for training... We run the general chat experiment on a node of 8 Nvidia A100 80GB GPUs. |
| Software Dependencies | No | The paper mentions software like Pythia, Llama-3, LoRA, and RLOO by name and citation, but it does not specify version numbers for these or for core programming languages or deep learning frameworks (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | We provide the hyperparameters for HyPO and DPO. ... We summarize the hyperparameters of each baseline below. (Referring to Tables 4-9, which contain specific learning rates, batch sizes, optimizers, and other training parameters for RM/SFT, DPO, and HyPO.) |
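
The Pseudocode row above reproduces Algorithm 1 (HyPO). Below is a minimal PyTorch sketch of that per-minibatch objective, not the authors' released code: it assumes the sequence-level log-probabilities log π(y|x) have already been summed over tokens for both the current policy and the frozen reference, and all function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def hypo_step_loss(policy_logps_chosen, policy_logps_rejected,
                   ref_logps_chosen, ref_logps_rejected,
                   policy_logps_online, ref_logps_online,
                   beta: float, lam: float) -> torch.Tensor:
    """One HyPO objective evaluation, sign-flipped so an optimizer can minimize it."""
    # Offline DPO term over (x, y+, y-) pairs: log-sigmoid of the beta-scaled
    # difference of policy/reference log-ratios (Algorithm 1, step 3).
    chosen_logratio = policy_logps_chosen - ref_logps_chosen
    rejected_logratio = policy_logps_rejected - ref_logps_rejected
    dpo_term = F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).sum()

    # Online reverse-KL surrogate on self-generated y ~ pi_theta(.|x): a
    # score-function estimator with a stop-gradient (detach) on the
    # policy/reference log-ratio (Algorithm 1, step 5).
    online_logratio = (policy_logps_online - ref_logps_online).detach()
    kl_term = (policy_logps_online * online_logratio).sum()

    # Step 6 ascends dpo_term - lam * kl_term; negate for gradient descent.
    return -(dpo_term - lam * kl_term)

# Toy call with random sequence-level log-probabilities of shape [batch].
b = 4
loss = hypo_step_loss(
    torch.randn(b, requires_grad=True), torch.randn(b, requires_grad=True),
    torch.randn(b), torch.randn(b),
    torch.randn(b, requires_grad=True), torch.randn(b),
    beta=0.1, lam=0.05,
)
loss.backward()
```

The `detach` mirrors the sg(·) operator in step 5, which keeps the gradient of the KL term in the score-function form the algorithm uses.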
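For the Open Datasets row, the snippet below is a hedged example of pulling the cited Hugging Face dataset card (HuggingFaceH4/ultrafeedback_binarized); it inspects the available splits and columns rather than assuming their names, and the TL;DR data would instead come from the linked openai/summarize-from-feedback repository.

```python
from datasets import load_dataset

# Download the UltraFeedback binarized dataset card cited above.
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

# Print the DatasetDict to see split names and sizes, then the columns of one
# split, rather than hard-coding either here.
print(ds)
first_split = next(iter(ds))
print(first_split, ds[first_split].column_names)
```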