Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models
Authors: Hritik Bansal, John Dang, Aditya Grover
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we analyze the effect of this design choice for the alignment and evaluation of LLMs. We uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree, 60% of the time, for both human and AI annotators. Our subsequent analysis identifies various facets of annotator biases that explain this phenomenon, such as human annotators rating denser responses higher while preferring accuracy during pairwise judgments for a particular comparison instance. To our surprise, we observe that the choice of feedback protocol has a significant effect on the evaluation of aligned LLMs. |
| Researcher Affiliation | Academia | Hritik Bansal, John Dang, Aditya Grover Department of Computer Science, University of California Los Angeles {hbansal,john.dang,adityag}@cs.ucla.edu |
| Pseudocode | No | The paper describes mathematical formulations for reward models and policies (e.g., Eq. 1, 2, 3), but it does not include any clearly labeled pseudocode blocks or algorithms in a structured format. |
| Open Source Code | Yes | Our code and data are available at https://github.com/Hritikbansal/sparse_feedback. |
| Open Datasets | Yes | We collect 5.2K instructions from varied sources such as Dolly [10], user-oriented instructions [42], and Super-NI [42]. |
| Dataset Splits | Yes | Further, 70% of the feedback data is used for training and the rest is for validation. |
| Hardware Specification | Yes | We train the reward models on a single Nvidia RTX A6000 GPU with 48GB VRAM. |
| Software Dependencies | No | The paper mentions using specific software components like the 'AdamW optimizer [21]' and 'LoRA', but it does not provide specific version numbers for these or other key software dependencies required for replication. |
| Experiment Setup | Yes | The reward models optimized using equation 2 are trained with an effective batch size = 16 where the batch size = 4 and the number of gradient accumulation steps = 4. The reward models optimized using equation 1 are trained with an effective batch size = 64 where the batch size = 16 and the number of gradient accumulation steps = 4. Both the reward models use the AdamW optimizer [21] with a linear warmup of 100 steps to a maximum learning rate followed by a cosine decay. We perform a hyperparameter search over {1e-4, 1e-5, 1e-6} for maximum learning rate. We also apply a weight decay of 0.001 to all the reward models and train them at fp16 precision. |
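
As a reproduction aid, the sketch below shows a minimal PyTorch training loop matching the Experiment Setup row above (AdamW with weight decay 0.001, linear warmup of 100 steps to a peak learning rate followed by cosine decay, gradient accumulation of 4 steps, fp16). The model, dataloader, total step count, and the pairwise ranking loss are placeholders and assumptions, not the authors' implementation; the actual code is in the linked repository (https://github.com/Hritikbansal/sparse_feedback).

```python
# Hedged sketch of the quoted reward-model training setup. Placeholder model,
# dataloader, and loss; whether the loss matches the paper's Eq. 1/2 exactly
# is an assumption.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

WARMUP_STEPS = 100        # linear warmup reported in the excerpt
MAX_LR = 1e-5             # one of the searched values {1e-4, 1e-5, 1e-6}
WEIGHT_DECAY = 0.001
GRAD_ACCUM_STEPS = 4      # effective batch = per-device batch * 4
TOTAL_STEPS = 10_000      # assumption; not stated in the excerpt

def warmup_cosine(step: int) -> float:
    """LR multiplier: linear warmup for WARMUP_STEPS, then cosine decay."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def train_reward_model(model, dataloader, device="cuda"):
    optimizer = AdamW(model.parameters(), lr=MAX_LR, weight_decay=WEIGHT_DECAY)
    scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)
    scaler = torch.cuda.amp.GradScaler()  # fp16 mixed-precision training

    model.train()
    optimizer.zero_grad()
    for step, (chosen, rejected) in enumerate(dataloader):
        with torch.cuda.amp.autocast(dtype=torch.float16):
            # Placeholder pairwise (Bradley-Terry style) ranking objective.
            reward_chosen = model(chosen.to(device))
            reward_rejected = model(rejected.to(device))
            loss = -torch.nn.functional.logsigmoid(
                reward_chosen - reward_rejected
            ).mean() / GRAD_ACCUM_STEPS

        scaler.scale(loss).backward()
        # Step the optimizer only every GRAD_ACCUM_STEPS micro-batches.
        if (step + 1) % GRAD_ACCUM_STEPS == 0:
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()
```

With a per-device batch size of 4 and 4 accumulation steps this reproduces the effective batch size of 16 described for the ranking-based reward model; swapping in a batch size of 16 gives the effective batch size of 64 described for the rating-based one.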