Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
Authors: Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sushil Sikchi, Joey Hejna, Brad Knox, Chelsea Finn, Scott Niekum
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical experimentation, this work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales. |
| Researcher Affiliation | Academia | Rafael Rafailov (Stanford University, rafailov@cs.stanford.edu); Yaswanth Chittepu (UMass Amherst, ychittepu@umass.edu); Ryan Park (Stanford University, rypark@stanford.edu); Harshit Sikchi (UT Austin, hsikchi@utexas.edu); Joey Hejna (Stanford University, jhejna@cs.stanford.edu); W. Bradley Knox (UT Austin, bradknox@cs.utexas.edu); Chelsea Finn (Stanford University, cbfinn@cs.stanford.edu); Scott Niekum (UMass Amherst, sniekum@cs.umass.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We largely follow the DPO setup unless otherwise mentioned and build on their code (https://github.com/eric-mitchell/direct-preference-optimization) without changing any hyperparameters unless otherwise mentioned. [...] We have only used open-source models with open-source datasets for all aspects of the work. Please refer to section B for details on reproducing the results. (A minimal sketch of the DPO objective appears below the table.) |
| Open Datasets | Yes | All our experiments are carried out using the Reddit TL;DR summarization dataset [55] [...] For all DAA experiments, we used the curated Open AI TL;DR dataset with 92K preferred-dispreferred summary completions [55]. [...] The TL;DR comparison dataset used in this work uses a modified MIT License https://github.com/openai/summarize-from-feedback/blob/master/LICENSE. |
| Dataset Splits | Yes | 256 prompts sampled from the held-out set are used for all evaluations (e.g. loss, accuracy, KL, winrates, length), with temperature 1.0 and max length 512. (A data-loading and evaluation-sampling sketch appears below the table.) |
| Hardware Specification | Yes | 1B models were trained on 2 NVIDIA A40 GPUs, 2.8B models were trained on 4 NVIDIA A40 GPUs, and 6.9B models were trained on 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper does not specify version numbers for key software components or libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For all DAA experiments, we used the curated Open AI TL;DR dataset with 92K preferred-dispreferred summary completions [55]. Each prompt is a Reddit post belonging to one of several topic forums, with title/post metadata included. 256 prompts sampled from the held-out set are used for all evaluations (e.g. loss, accuracy, KL, winrates, length), with temperature 1.0 and max length 512. Model sizes include 1B, 2.8B, and 6.9B and were initialized from the base Pythia pre-trained weights. All models underwent supervised fine-tuning on TL;DR prior to direct alignment. Across all SFT and DAA runs, we used a batch size of 128 (8 gradient accumulation steps), and RMSProp with a learning rate of 0.5 * 10^-6 (linear warmup for 150 steps) for 1 epoch. (A configuration sketch using these values appears below the table.) |
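
The Open Source Code row states that the experiments build on the public DPO codebase. For orientation, here is a minimal PyTorch sketch of the DPO objective that codebase optimizes; the function name, argument names, and the `beta=0.1` default are illustrative assumptions, not values quoted in the paper.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss from sequence-level log-probabilities.

    Inputs are 1-D tensors of token log-probs summed over each response,
    for the preferred (chosen) and dispreferred (rejected) summaries under
    the policy and the frozen SFT reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen vs. rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The log-probabilities here are summed over response tokens only, with the prompt masked out, which is the usual convention for sequence-level preference losses.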
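The Open Datasets and Dataset Splits rows describe the TL;DR comparison data and the evaluation protocol. The sketch below is a hedged illustration, assuming the Hugging Face mirror `openai/summarize_from_feedback` of the original OpenAI release, an arbitrary shuffle seed, and one possible reading of the 512-token length cap.

```python
from datasets import load_dataset

# Assumes the Hugging Face mirror of the OpenAI TL;DR comparison data;
# the canonical release is at github.com/openai/summarize-from-feedback.
comparisons = load_dataset("openai/summarize_from_feedback", "comparisons")

train_pairs = comparisons["train"]       # ~92K preferred/dispreferred summary pairs
held_out = comparisons["validation"]

# Evaluation protocol from the table: 256 held-out prompts, sampled at
# temperature 1.0 with a 512-token cap. Each record's `info` field holds the
# Reddit post (title/post metadata) that serves as the prompt.
eval_examples = held_out.shuffle(seed=0).select(range(256))  # seed is an assumption
sampling_kwargs = dict(do_sample=True, temperature=1.0, max_new_tokens=512)
```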
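The Experiment Setup row lists the shared optimization hyperparameters. The following is a minimal configuration sketch with those values, assuming a Pythia 2.8B checkpoint, a micro-batch size of 16, and a learning rate held constant after the 150-step linear warmup; none of those three assumptions are stated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM

# Values quoted in the Experiment Setup row; checkpoint name, micro-batch
# size, and post-warmup schedule shape are illustrative assumptions.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")

micro_batch_size = 16        # 16 sequences x 8 accumulation steps = effective batch of 128
grad_accum_steps = 8
learning_rate = 0.5e-6       # i.e. 5e-7, as reported
warmup_steps = 150

optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / (warmup_steps + 1))
)

# Schematic training loop: accumulate grad_accum_steps micro-batches per
# optimizer/scheduler step, for one epoch over the 92K comparison pairs.
```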