Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Authors: Banghua Zhu, Michael Jordan, Jiantao Jiao

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical findings highlight the superior performance of this approach over the traditional methods. Empirically, we present experimental evidence that the proposed method improves reward training in both bandit and neural network settings.
Researcher Affiliation | Academia | Department of EECS, University of California, Berkeley. Correspondence to: Banghua Zhu <banghua@berkeley.edu>.
Pseudocode | Yes | Algorithm 1 Iterative Data Smoothing (D, θ0, α, β) and Algorithm 2 Iterative Data Smoothing V2 (D, θ0, α, β) are provided. (An illustrative sketch of the core update appears below the table.)
Open Source Code | No | The paper does not provide an explicit statement about the availability of open-source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We use the human-labeled Helpfulness and Harmlessness (HH) dataset from Bai et al. (2022) (https://huggingface.co/datasets/Dahoas/static-hh) ... and the TLDR dataset (https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons). (A loading example appears below the table.)
Dataset Splits | No | The paper mentions using a 'large validation and test dataset' and selecting the best checkpoint based on the smallest loss in the validation set, but it does not specify the percentages or absolute counts for the training, validation, and test splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions general algorithms and models used (e.g., the PPO algorithm, Dahoas/pythia-125M-static-sft), but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We select step sizes α = 10^-5 and β = 0.7 for all experiments. The hyper-parameters for the neural network experiments are listed in Table 1. (Table 1 then lists detailed parameters such as the learning rate α, the label update parameter β, batch size, PPO epochs, and a fixed KL coefficient.)
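
For context on the Pseudocode row, below is a minimal sketch of the iterative data smoothing update that Algorithm 1 describes, assuming a pairwise Bradley-Terry reward model trained with cross-entropy. The `reward_model` interface, variable names, and loop structure are illustrative assumptions rather than the authors' implementation; the defaults α = 1e-5 and β = 0.7 are taken from the Experiment Setup row.

```python
# Illustrative sketch of iterative data smoothing (cf. Algorithm 1), under the assumption
# of a pairwise Bradley-Terry reward model; interfaces and names are hypothetical.
import torch

def iterative_data_smoothing(pairs, reward_model, alpha=1e-5, beta=0.7, epochs=10):
    """pairs: list of (chosen_features, rejected_features) tensors, one per comparison."""
    opt = torch.optim.SGD(reward_model.parameters(), lr=alpha)
    # Soft labels start at 1.0: the first response in every pair is the preferred one.
    labels = torch.ones(len(pairs))
    for _ in range(epochs):
        for i, (chosen, rejected) in enumerate(pairs):
            # Model probability that the chosen response beats the rejected one.
            p = torch.sigmoid(reward_model(chosen) - reward_model(rejected))
            p = p.squeeze().clamp(1e-6, 1 - 1e-6)
            # Gradient step on the cross-entropy against the current *soft* label.
            loss = -(labels[i] * torch.log(p) + (1 - labels[i]) * torch.log(1 - p))
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Smooth the label toward the model's own prediction rather than keeping it hard.
            labels[i] = beta * labels[i] + (1 - beta) * p.detach()
    return reward_model, labels
```

The point of the sketch is the last line of the inner loop: instead of always fitting the hard 0/1 preference, the training target is progressively mixed with the model's prediction, which is the mechanism the paper proposes for mitigating reward overfitting.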
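
The datasets in the Open Datasets row are hosted on the Hugging Face Hub, so they can be pulled with the `datasets` library. This is only a minimal loading example; any filtering or splitting applied in the paper is not reproduced here.

```python
# Minimal example of fetching the two comparison datasets named above from the
# Hugging Face Hub; preprocessing and split handling from the paper are omitted.
from datasets import load_dataset

hh = load_dataset("Dahoas/static-hh")                         # Helpfulness and Harmlessness (HH)
tldr = load_dataset("CarperAI/openai_summarize_comparisons")  # TL;DR summarization comparisons
```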