Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
Authors: Hritik Bansal, John Dang, Aditya Grover
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we analyze the effect of this design choice for the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree, 60% of the time, for both human and AI annotators. Our subsequent analysis identifies various facets of annotator biases that explain this phenomenon, such as human annotators rating denser responses higher while preferring accuracy during pairwise judgments for a particular comparison instance. To our surprise, we observe that the choice of feedback protocol has a significant effect on the evaluation of aligned LLMs. |
| Researcher Affiliation | Academia | Hritik Bansal, John Dang, Aditya Grover Department of Computer Science, University of California Los Angeles {hbansal,john.dang,adityag}@cs.ucla.edu |
| Pseudocode | No | The paper describes mathematical formulations for reward models and policies (e.g., Eq. 1, 2, 3), but it does not include any clearly labeled pseudocode blocks or algorithms in a structured format. |
| Open Source Code | Yes | Our code and data are available at https://github.com/Hritikbansal/sparse_feedback. |
| Open Datasets | Yes | We collect 5.2K instructions from varied sources such as Dolly [10], user-oriented instructions [42], and Super-NI [42]. |
| Dataset Splits | Yes | Further, 70% of the feedback data is used for training and the rest is for validation. |
| Hardware Specification | Yes | We train the reward models on a single Nvidia RTX A6000 GPU with 48GB VRAM. |
| Software Dependencies | No | The paper mentions using specific software components like the 'AdamW optimizer [21]' and 'LoRA', but it does not provide specific version numbers for these or other key software dependencies required for replication. |
| Experiment Setup | Yes | The reward models optimized using equation 2 are trained with an effective batch size = 16 where the batch size = 4 and the number of gradient accumulation steps = 4. The reward models optimized using equation 1 are trained with an effective batch size = 64 where the batch size = 16 and the number of gradient accumulation steps = 4. Both the reward models use the AdamW optimizer [21] with a linear warmup of 100 steps to a maximum learning rate followed by a cosine decay. We perform a hyperparameter search over {1e-4, 1e-5, 1e-6} for maximum learning rate. We also apply a weight decay of 0.001 to all the reward models and train them at fp16 precision. |
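
As a reproduction aid, the sketch below shows a minimal PyTorch training loop matching the Experiment Setup row above (AdamW with weight decay 0.001, linear warmup of 100 steps to a peak learning rate followed by cosine decay, gradient accumulation of 4 steps, fp16). The model, dataloader, total step count, and the pairwise ranking loss are placeholders and assumptions, not the authors' implementation; the actual code is in the linked repository (https://github.com/Hritikbansal/sparse_feedback).

```python
# Hedged sketch of the quoted reward-model training setup. Placeholder model,
# dataloader, and loss; whether the loss matches the paper's Eq. 1/2 exactly
# is an assumption.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

WARMUP_STEPS = 100        # linear warmup reported in the excerpt
MAX_LR = 1e-5             # one of the searched values {1e-4, 1e-5, 1e-6}
WEIGHT_DECAY = 0.001
GRAD_ACCUM_STEPS = 4      # effective batch = per-device batch * 4
TOTAL_STEPS = 10_000      # assumption; not stated in the excerpt

def warmup_cosine(step: int) -> float:
    """LR multiplier: linear warmup for WARMUP_STEPS, then cosine decay."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def train_reward_model(model, dataloader, device="cuda"):
    optimizer = AdamW(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)
    scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)
    scaler = torch.cuda.amp.GradScaler()  # fp16 mixed-precision training

    model.train()
    optimizer.zero_grad()
    for step, (chosen, rejected) in enumerate(dataloader):
        with torch.cuda.amp.autocast(dtype=torch.float16):
            # Placeholder pairwise (Bradley-Terry style) ranking objective.
            reward_chosen = model(chosen.to(device))
            reward_rejected = model(rejected.to(device))
            loss = -torch.nn.functional.logsigmoid(
                reward_chosen - reward_rejected
            ).mean() / GRAD_ACCUM_STEPS

        scaler.scale(loss).backward()
        # Step the optimizer only every GRAD_ACCUM_STEPS micro-batches.
        if (step + 1) % GRAD_ACCUM_STEPS == 0:
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()
```

With a per-device batch size of 4 and 4 accumulation steps this reproduces the effective batch size of 16 described for the ranking-based reward model; swapping in a batch size of 16 gives the effective batch size of 64 described for the rating-based one.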