Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

Authors: Banghua Zhu, Michael Jordan, Jiantao Jiao

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical findings highlight the superior performance of this approach over the traditional methods. Empirically, we present experimental evidence that the proposed method improves reward training in both bandit and neural network settings.
Researcher Affiliation | Academia | Department of EECS, University of California, Berkeley. Correspondence to: Banghua Zhu <banghua@berkeley.edu>.
Pseudocode | Yes | Algorithm 1 Iterative Data Smoothing (D, θ0, α, β) and Algorithm 2 Iterative Data Smoothing V2 (D, θ0, α, β) are provided. (An illustrative sketch of the core update appears below the table.)
Open Source Code | No | The paper does not provide an explicit statement about the availability of open-source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | We use the human-labeled Helpfulness and Harmlessness (HH) dataset from Bai et al. (2022) (https://huggingface.co/datasets/Dahoas/static-hh) ... and the TLDR dataset (https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons). (A loading example appears below the table.)
Dataset Splits | No | The paper mentions using a 'large validation and test dataset' and selecting the best checkpoint based on the smallest loss in the validation set, but it does not specify the percentages or absolute counts for the training, validation, and test splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions general algorithms and models used (e.g., the PPO algorithm, Dahoas/pythia-125M-static-sft), but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We select step sizes α = 10^-5 and β = 0.7 for all experiments. The hyper-parameters for the neural network experiments are listed in Table 1. (Table 1 then lists detailed parameters such as the learning rate α, the label update parameter β, batch size, PPO epochs, and a fixed KL coefficient.)
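
For context on the Pseudocode row, below is a minimal sketch of the iterative data smoothing update that Algorithm 1 describes, assuming a pairwise Bradley-Terry reward model trained with cross-entropy. The `reward_model` interface, variable names, and loop structure are illustrative assumptions rather than the authors' implementation; the defaults α = 1e-5 and β = 0.7 are taken from the Experiment Setup row.

```python
# Illustrative sketch of iterative data smoothing (cf. Algorithm 1), under the assumption
# of a pairwise Bradley-Terry reward model; interfaces and names are hypothetical.
import torch

def iterative_data_smoothing(pairs, reward_model, alpha=1e-5, beta=0.7, epochs=10):
    """pairs: list of (chosen_features, rejected_features) tensors, one per comparison."""
    opt = torch.optim.SGD(reward_model.parameters(), lr=alpha)
    # Soft labels start at 1.0: the first response in every pair is the preferred one.
    labels = torch.ones(len(pairs))
    for _ in range(epochs):
        for i, (chosen, rejected) in enumerate(pairs):
            # Model probability that the chosen response beats the rejected one.
            p = torch.sigmoid(reward_model(chosen) - reward_model(rejected))
            p = p.squeeze().clamp(1e-6, 1 - 1e-6)
            # Gradient step on the cross-entropy against the current *soft* label.
            loss = -(labels[i] * torch.log(p) + (1 - labels[i]) * torch.log(1 - p))
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Smooth the label toward the model's own prediction rather than keeping it hard.
            labels[i] = beta * labels[i] + (1 - beta) * p.detach()
    return reward_model, labels
```

The point of the sketch is the last line of the inner loop: instead of always fitting the hard 0/1 preference, the training target is progressively mixed with the model's prediction, which is the mechanism the paper proposes for mitigating reward overfitting.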
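
The datasets in the Open Datasets row are hosted on the Hugging Face Hub, so they can be pulled with the `datasets` library. This is only a minimal loading example; any filtering or splitting applied in the paper is not reproduced here.

```python
# Minimal example of fetching the two comparison datasets named above from the
# Hugging Face Hub; preprocessing and split handling from the paper are omitted.
from datasets import load_dataset

hh = load_dataset("Dahoas/static-hh")                         # Helpfulness and Harmlessness (HH)
tldr = load_dataset("CarperAI/openai_summarize_comparisons")  # TL;DR summarization comparisons
```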