Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
Authors: Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sushil Sikchi, Joey Hejna, Brad Knox, Chelsea Finn, Scott Niekum
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive empirical experimentation, this work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales. |
| Researcher Affiliation | Academia | Rafael Rafailov (Stanford University, rafailov@cs.stanford.edu); Yaswanth Chittepu (UMass Amherst, ychittepu@umass.edu); Ryan Park (Stanford University, rypark@stanford.edu); Harshit Sikchi (UT Austin, hsikchi@utexas.edu); Joey Hejna (Stanford University, jhejna@cs.stanford.edu); W. Bradley Knox (UT Austin, bradknox@cs.utexas.edu); Chelsea Finn (Stanford University, cbfinn@cs.stanford.edu); Scott Niekum (UMass Amherst, sniekum@cs.umass.edu) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We largely follow the DPO setup unless otherwise mentioned and build on their code (https://github.com/eric-mitchell/direct-preference-optimization) without changing any hyperparameters unless otherwise mentioned. [...] We have only used open-source models with open-source datasets for all aspects of the work. Please refer to section B for details on reproducing the results. (A minimal sketch of the DPO objective appears below the table.) |
| Open Datasets | Yes | All our experiments are carried out using the Reddit TL;DR summarization dataset [55] [...] For all DAA experiments, we used the curated Open AI TL;DR dataset with 92K preferred-dispreferred summary completions [55]. [...] The TL;DR comparison dataset used in this work uses a modified MIT License https://github.com/openai/summarize-from-feedback/blob/master/LICENSE. |
| Dataset Splits | Yes | 256 prompts sampled from the held-out set are used for all evaluations (e.g. loss, accuracy, KL, winrates, length), with temperature 1.0 and max length 512. (A data-loading and evaluation-sampling sketch appears below the table.) |
| Hardware Specification | Yes | 1B models were trained on 2 NVIDIA A40 GPUs, 2.8B models were trained on 4 NVIDIA A40 GPUs, and 6.9B models were trained on 4 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper does not specify version numbers for key software components or libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For all DAA experiments, we used the curated Open AI TL;DR dataset with 92K preferred-dispreferred summary completions [55]. Each prompt is a Reddit post belonging to one of several topic forums, with title/post metadata included. 256 prompts sampled from the held-out set are used for all evaluations (e.g. loss, accuracy, KL, winrates, length), with temperature 1.0 and max length 512. Model sizes include 1B, 2.8B, and 6.9B and were initialized from the base Pythia pre-trained weights. All models underwent supervised fine-tuning on TL;DR prior to direct alignment. Across all SFT and DAA runs, we used a batch size of 128 (8 gradient accumulation steps), and RMSProp with a learning rate of 0.5 * 10^-6 (linear warmup for 150 steps) for 1 epoch. (A configuration sketch using these values appears below the table.) |
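
The Open Source Code row states that the experiments build on the public DPO codebase. For orientation, here is a minimal PyTorch sketch of the DPO objective that codebase optimizes; the function name, argument names, and the `beta=0.1` default are illustrative assumptions, not values quoted in the paper.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss from sequence-level log-probabilities.

    Inputs are 1-D tensors of token log-probs summed over each response,
    for the preferred (chosen) and dispreferred (rejected) summaries under
    the policy and the frozen SFT reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen vs. rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The log-probabilities here are summed over response tokens only, with the prompt masked out, which is the usual convention for sequence-level preference losses.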
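The Open Datasets and Dataset Splits rows describe the TL;DR comparison data and the evaluation protocol. The sketch below is a hedged illustration, assuming the Hugging Face mirror `openai/summarize_from_feedback` of the original OpenAI release, an arbitrary shuffle seed, and one possible reading of the 512-token length cap.

```python
from datasets import load_dataset

# Assumes the Hugging Face mirror of the OpenAI TL;DR comparison data;
# the canonical release is at github.com/openai/summarize-from-feedback.
comparisons = load_dataset("openai/summarize_from_feedback", "comparisons")

train_pairs = comparisons["train"]       # ~92K preferred/dispreferred summary pairs
held_out = comparisons["validation"]

# Evaluation protocol from the table: 256 held-out prompts, sampled at
# temperature 1.0 with a 512-token cap. Each record's `info` field holds the
# Reddit post (title/post metadata) that serves as the prompt.
eval_examples = held_out.shuffle(seed=0).select(range(256))  # seed is an assumption
sampling_kwargs = dict(do_sample=True, temperature=1.0, max_new_tokens=512)
```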
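The Experiment Setup row lists the shared optimization hyperparameters. The following is a minimal configuration sketch with those values, assuming a Pythia 2.8B checkpoint, a micro-batch size of 16, and a learning rate held constant after the 150-step linear warmup; none of those three assumptions are stated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM

# Values quoted in the Experiment Setup row; checkpoint name, micro-batch
# size, and post-warmup schedule shape are illustrative assumptions.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")

micro_batch_size = 16        # 16 sequences x 8 accumulation steps = effective batch of 128
grad_accum_steps = 8
learning_rate = 0.5e-6       # i.e. 5e-7, as reported
warmup_steps = 150

optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / (warmup_steps + 1))
)

# Schematic training loop: accumulate grad_accum_steps micro-batches per
# optimizer/scheduler step, for one epoch over the 92K comparison pairs.
```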