Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation

Authors: Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on the Anthropic HH and TL;DR summarization datasets, we verify the effectiveness of ADVPO in mitigating the overoptimization problem, resulting in enhanced RLHF performance as evaluated through human-assisted evaluation.
Researcher Affiliation | Collaboration | Xiaoying Zhang (ByteDance Research, zhangxiaoying.xy@bytedance.com); Jean-François Ton (ByteDance Research, jeanfrancois@bytedance.com); Wei Shen (Fudan University, wshen21@m.fudan.edu.cn); Hongning Wang (Tsinghua University, wang.hongn@gmail.com); Yang Liu (UC Santa Cruz, yangliu@ucsc.edu)
Pseudocode | No | The paper describes its proposed methods and procedures using mathematical formulations and textual descriptions, but it does not include a distinct 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | Upon acceptance of the paper, we will release all the code necessary to reproduce the results of the paper.
Open Datasets | Yes | We used two widely adopted datasets, the Anthropic HH [2] and TL;DR summarization [35] datasets, for empirical investigation. (An illustrative loading sketch appears after the table.)
Dataset Splits | Yes | For both datasets, the preference data is randomly divided into two halves: one for reward model training and the other for policy optimization. Within the reward-modeling half of each dataset, we randomly allocate 90% for training and 10% for validation. In each run, every 100 optimization steps we use the current policy to generate responses for prompts in the validation dataset and record the average reward on that set. (A split sketch appears after the table.)
Hardware Specification | Yes | All experiments were conducted on a single node equipped with 8 Nvidia A100-SXM-80GB GPUs (100 CPUs, 100 GB RAM), using the DeepSpeed library with ZeRO stage 2 [29] along with Hugging Face Accelerate [14].
Software Dependencies | No | We utilized the DeepSpeed library with ZeRO stage 2 [29], along with Hugging Face Accelerate [14], and employed the AdamW optimizer [23]. Specific version numbers for these software components are not provided. (An illustrative Accelerate/DeepSpeed setup appears after the table.)
Experiment Setup | Yes | We set the learning rate to 5e-6 for the Anthropic HH dataset and 3e-5 for the TL;DR dataset; in both cases, the batch size is 64 and the models are trained for 5 epochs. For both gold and proxy reward model training, we set the initial learning rate to 5e-6, a batch size of 64, and a context window length of 2048 tokens. For both algorithms, we train the model for 1500 steps with an initial learning rate of 1e-6, a batch size of 64, a context window length of 2048, and a PPO value clip threshold of 0.2, consistent with previous procedures. For efficient online sampling, we set the maximum number of generated tokens to 512 and the KL coefficient β to 0 to encourage the most severe over-optimization scenario, following previous work [9]. (The quoted hyperparameters are collected in a sketch after the table.)
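
The "Open Datasets" row names the Anthropic HH and TL;DR summarization preference datasets. Below is a minimal loading sketch, assuming the public Hugging Face Hub copies `Anthropic/hh-rlhf` and `openai/summarize_from_feedback` are reasonable stand-ins for the versions used in the paper; the paper itself does not give download locations.

```python
# Hedged sketch: load the two preference datasets from the Hugging Face Hub.
# The Hub IDs below are assumptions; the paper only cites Anthropic HH [2]
# and TL;DR summarization [35].
from datasets import load_dataset

# Anthropic Helpful & Harmless preference pairs (chosen vs. rejected responses).
hh = load_dataset("Anthropic/hh-rlhf")

# OpenAI "summarize from feedback" comparisons over Reddit TL;DR posts.
tldr = load_dataset("openai/summarize_from_feedback", "comparisons")

print(hh)    # DatasetDict with train/test splits
print(tldr)  # DatasetDict with train/validation splits
```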
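
The "Dataset Splits" row describes a two-stage split: half of the preference data is reserved for reward modeling (itself split 90/10 into train/validation) and half for policy optimization. A minimal sketch of that procedure, assuming Hugging Face `datasets` objects and an illustrative random seed:

```python
# Hedged sketch of the split described in the "Dataset Splits" row; the seed is illustrative.
from datasets import load_dataset

prefs = load_dataset("Anthropic/hh-rlhf", split="train").shuffle(seed=42)

# Half of the preference data for reward-model training, half for policy optimization.
half = len(prefs) // 2
rm_data = prefs.select(range(half))
ppo_data = prefs.select(range(half, len(prefs)))

# Within the reward-model half: 90% training, 10% validation.
rm_split = rm_data.train_test_split(test_size=0.1, seed=42)
rm_train, rm_val = rm_split["train"], rm_split["test"]

print(len(rm_train), len(rm_val), len(ppo_data))
```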
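
The hardware and dependency rows mention DeepSpeed ZeRO stage 2 [29], Hugging Face Accelerate [14], and the AdamW optimizer [23] on a single 8x A100 node. The snippet below is only one illustrative way to wire those pieces together via Accelerate's `DeepSpeedPlugin`; the plugin arguments, mixed precision, and placeholder model are assumptions, not settings reported in the paper.

```python
# Hedged sketch: Accelerate + DeepSpeed ZeRO stage 2 with an AdamW optimizer.
# Plugin arguments, precision, and the placeholder model are assumptions.
import torch
import torch.nn as nn
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_clipping=1.0)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)

model = nn.Linear(16, 1)  # placeholder; the paper trains LLM policy/reward models
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
model, optimizer = accelerator.prepare(model, optimizer)

# Typically launched across the node's 8 GPUs with:
#   accelerate launch --num_processes 8 train.py
```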
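
For quick reference, the hyperparameters quoted in the "Experiment Setup" row can be collected into a plain configuration mapping. The grouping below (initial per-dataset training vs. reward-model training vs. PPO/ADVPO policy optimization) follows the order of the quoted sentences and should be read as an assumption about which stage each sentence describes; key names are illustrative, not taken from released code.

```python
# Hedged summary of the hyperparameters quoted in the "Experiment Setup" row.
SFT_CONFIG = {                      # per-dataset initial training (assumed SFT stage)
    "learning_rate": {"anthropic_hh": 5e-6, "tldr": 3e-5},
    "batch_size": 64,
    "epochs": 5,
}

REWARD_MODEL_CONFIG = {             # gold and proxy reward models
    "learning_rate": 5e-6,
    "batch_size": 64,
    "context_window": 2048,
}

POLICY_OPT_CONFIG = {               # PPO and ADVPO policy optimization
    "train_steps": 1500,
    "learning_rate": 1e-6,
    "batch_size": 64,
    "context_window": 2048,
    "ppo_value_clip": 0.2,
    "max_new_tokens": 512,
    "kl_coefficient": 0.0,          # beta = 0 to provoke the most severe overoptimization
}
```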