Understanding the Learning Dynamics of Alignment with Human Feedback
Authors: Shawn Im, Yixuan Li
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work provides an initial attempt to theoretically analyze the learning dynamics of human preference alignment. We formally show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on the training accuracy. Our theory also reveals an intricate phenomenon where the optimization is prone to prioritizing certain behaviors with higher preference distinguishability. We empirically validate our findings on contemporary LLMs and alignment tasks, reinforcing our theoretical insights and shedding light on considerations for future alignment approaches. |
| Researcher Affiliation | Academia | Department of Computer Sciences, University of Wisconsin-Madison. Correspondence to: Shawn Im <shawnim@cs.wisc.edu>, Yixuan Li <sharonli@cs.wisc.edu>. |
| Pseudocode | No | The paper includes mathematical formulations and descriptions of algorithms (RLHF, DPO) but does not present any structured pseudocode or algorithm blocks labeled as such. |
| Open Source Code | Yes | Furthermore, we are committed to enhancing reproducibility and broader applicability by releasing our code publicly, which is available here. |
| Open Datasets | Yes | For training, we leverage Anthropic's Persona dataset (Perez et al., 2022), which encompasses diverse types of personas (https://github.com/anthropics/evals/tree/main/persona). We verify that this behavior occurs in practice by using the HH-RLHF dataset (Bai et al., 2022a) in Appendix C. (A hedged data-loading sketch for the Persona files appears below the table.) |
| Dataset Splits | No | The paper mentions training and testing, and loss curves for both, but it does not explicitly provide details about the dataset splits (e.g., percentages or counts for training, validation, and test sets). |
| Hardware Specification | No | The paper mentions the use of 'Llama-2-7B model' and 'Mistral-7B model' for experiments, but it does not specify any particular hardware components such as GPU models, CPU types, or memory used. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' and 'LoRA' configuration parameters (r=8, α=32, 0.05 dropout), but it does not list any specific software libraries (e.g., PyTorch, TensorFlow) or their version numbers. (A hedged LoRA configuration sketch appears below the table.) |
| Experiment Setup | Yes | All of the following experiments are conducted with full fine-tuning on the Llama-2-7B model with the AdamW optimizer (Loshchilov & Hutter, 2018). The learning rate is 1e-5, and β = 0.01. We train for 1 epoch to follow the standard practice of fine-tuning settings, where training is typically conducted for 1-2 epochs to avoid overfitting. (A minimal sketch of this training objective appears below the table.) |
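
The Persona files referenced in the Open Datasets row are plain JSONL. The snippet below is a minimal sketch of turning one such file into (prompt, chosen, rejected) preference pairs for alignment training; the field names (`question`, `answer_matching_behavior`, `answer_not_matching_behavior`), file path, and pairing scheme are assumptions based on the public repository layout, not details taken from the paper.

```python
import json

def load_persona_pairs(path):
    """Build (prompt, chosen, rejected) preference triples from one Persona JSONL file.

    Field names are assumed from the public anthropics/evals repository layout;
    adjust them if the file format differs.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append({
                "prompt": record["question"],
                # Treat the behavior-matching answer as preferred; swap the two
                # fields to align the model against the behavior instead.
                "chosen": record["answer_matching_behavior"],
                "rejected": record["answer_not_matching_behavior"],
            })
    return pairs

# Hypothetical usage with a locally cloned copy of the evals repository:
# pairs = load_persona_pairs("evals/persona/agreeableness.jsonl")
```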
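
For the LoRA numbers quoted in the Software Dependencies row (r=8, α=32, dropout 0.05), a configuration along the following lines would reproduce the adapter setup. The paper does not name a library, so the use of Hugging Face peft/transformers and the choice of target modules are assumptions; this is a sketch, not the authors' code.

```python
# Minimal sketch, assuming Hugging Face transformers + peft (neither library is
# named in the paper); target_modules is an assumed choice for Llama-style models.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,                # rank reported in the paper
    lora_alpha=32,      # alpha reported in the paper
    lora_dropout=0.05,  # dropout reported in the paper
    target_modules=["q_proj", "v_proj"],  # assumption; not stated in the paper
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # confirms only the adapter weights are trainable
```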
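
The hyperparameters in the Experiment Setup row (AdamW, learning rate 1e-5, β = 0.01, one epoch) correspond to the standard DPO objective. The function below is a minimal PyTorch sketch of that loss, assuming per-sequence log-probabilities already summed over response tokens; it illustrates the objective rather than reproducing the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.01):
    """Standard DPO loss on summed per-sequence log-probabilities.

    beta=0.01 matches the value quoted in the Experiment Setup row.
    """
    # Log-ratio of chosen vs. rejected responses under the policy and the
    # frozen reference model.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # -log sigma(beta * (policy margin - reference margin)), averaged over the batch.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Dummy batch of four sequence-level log-probabilities, for illustration only.
loss = dpo_loss(torch.tensor([-10.0, -12.0, -9.0, -11.0]),
                torch.tensor([-11.0, -10.0, -13.0, -12.0]),
                torch.tensor([-10.5, -11.0, -10.0, -11.5]),
                torch.tensor([-10.5, -11.0, -10.0, -11.5]))
```

Under the quoted setup, the model parameters would then be updated with `torch.optim.AdamW(model.parameters(), lr=1e-5)` for a single pass over the preference data.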