A Critical Evaluation of AI Feedback for Aligning Large Language Models
Authors: Archit Sharma, Sedrick Scott Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we find that two conditions are necessary for LAIF to significantly outperform SFT: (a) a sufficiently strong pre-trained base model and, (b) a capability mismatch between the teacher used for the SFT data collection and the critic used for collecting AI feedback. |
| Researcher Affiliation | Collaboration | Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar; Stanford University, Toyota Research Institute |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/architsharma97/dpo-rlaif. |
| Open Datasets | Yes | To this end, we fix the dataset of prompts to be single-turn instructions derived from ShareGPT [Chiang et al., 2023]. |
| Dataset Splits | Yes | Therefore, we use 10% of the available prompts for the SFT stage and the rest of them to generate the AIF dataset. (See the split sketch after the table.) |
| Hardware Specification | Yes | Training was done on A100 80GB instances and took around 1 hour per epoch for a 7B model when trained on 100% of the training examples. |
| Software Dependencies | No | The paper mentions software like "Adam optimizer" but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | For SFT runs, we train the models for 9 epochs and evaluate every 3 epochs. From here, we select the best checkpoint to report. We use a batch size of 8 and conduct a hyperparameter sweep for learning rate across {1e-7, 5e-7, 1e-6}. (See the sweep sketch after the table.) |
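
The dataset-split row above states that 10% of the ShareGPT-derived prompts go to SFT and the remainder to AI-feedback data collection. The snippet below is a minimal sketch of such a split, not the authors' code: the file path, the JSONL layout, and the `"prompt"` field are assumptions for illustration.

```python
# Hypothetical sketch of the 10%/90% prompt split described in the table;
# file layout and field names are assumptions, not taken from the released code.
import json
import random

def split_prompts(path: str, sft_fraction: float = 0.10, seed: int = 0):
    """Split ShareGPT-derived single-turn prompts into an SFT set and an AIF set."""
    with open(path) as f:
        # Assumes one JSON object per line with a "prompt" field.
        prompts = [json.loads(line)["prompt"] for line in f]
    random.Random(seed).shuffle(prompts)
    n_sft = int(len(prompts) * sft_fraction)
    return prompts[:n_sft], prompts[n_sft:]  # (SFT prompts, AIF prompts)

# Example usage (hypothetical file name):
# sft_prompts, aif_prompts = split_prompts("sharegpt_prompts.jsonl")
```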
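The experiment-setup row describes an SFT sweep with batch size 8, 9 training epochs, evaluation every 3 epochs, and learning rates in {1e-7, 5e-7, 1e-6}. The sketch below shows one way such a sweep could be organized; `train_sft` and `evaluate` are placeholder callables standing in for the authors' actual training and evaluation code.

```python
# Hypothetical sweep loop matching the reported setup; not the authors' implementation.
LEARNING_RATES = [1e-7, 5e-7, 1e-6]
BATCH_SIZE = 8
TOTAL_EPOCHS = 9
EVAL_EVERY = 3

def sweep(train_sft, evaluate):
    """Return the best (score, lr, epoch, model) checkpoint across the LR sweep."""
    best = None
    for lr in LEARNING_RATES:
        model = None
        for epoch in range(1, TOTAL_EPOCHS + 1):
            # Train one epoch at a time so we can evaluate intermediate checkpoints.
            model = train_sft(lr=lr, batch_size=BATCH_SIZE, epochs=1, resume_from=model)
            if epoch % EVAL_EVERY == 0:
                score = evaluate(model)
                if best is None or score > best[0]:
                    best = (score, lr, epoch, model)
    return best
```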