ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback
Authors: Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, Maosong Sun
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We comprehensively validate the advantage of AI feedback in boosting open-source models with ULTRAFEEDBACK. By fine-tuning a LLaMA2-13B model (Touvron et al., 2023b), we build a state-of-the-art reward model UltraRM, which significantly outperforms existing open-source reward models. Based on UltraRM, we enhance a powerful open-source model UltraLM (Ding et al., 2023; Touvron et al., 2023a) with best-of-n sampling and PPO. Experiments show that both strategies improve the model dramatically. (A hedged best-of-n sketch follows the table.) |
| Researcher Affiliation | Collaboration | 1NLP Group, DCST, IAI, BNRIST, Tsinghua University 2University of Illinois Urbana-Champaign 3ModelBest Inc. 4Ping An Technology 5Tencent 6Renmin University of China 7Jiangsu Collaborative Innovation Center for Language Ability. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states "We release a suite of resources for feedback learning research, including a dataset, reward model, and critique model." However, it does not provide a specific link to, or an explicit statement about, the release of the code implementing the described methodology. |
| Open Datasets | Yes | We finally present ULTRAFEEDBACK, a large-scale, high-quality, and diversified AI feedback dataset, which contains over 1 million GPT-4 feedback for 250k user-assistant conversations from various aspects. [...] We release a suite of resources for feedback learning research, including a dataset, reward model, and critique model. |
| Dataset Splits | No | The paper discusses training on combined datasets (ULTRAFEEDBACK, Stanford SHP, OpenAI Summarization, Anthropic Helpful) and notes that OpenAI WebGPT has no training and test splits, but it does not give explicit details about the validation splits used to train its own models. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions training models but does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We train the 13B reward model for one epoch with the batch size being 512 pairs (i.e., 1024 completions) and the learning rate being 1e-5. We adopt the cosine learning rate decay strategy with a warm-up ratio of 3% and a final learning rate of 1e-6. [...] We train LLaMA2-13B for two epochs with a batch size of 256 and a learning rate of 2e-5. [...] We tune UltraLM for 80 iterations on the ULTRAFEEDBACK prompts. In each iteration, we collect 512 samples and update the policy model with a mini-batch size of 64. The learning rate is fixed at 1e-6. (The reward-model learning-rate schedule is sketched after the table.) |
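
The Research Type row quotes the paper's best-of-n sampling strategy: sample several completions from the policy model and keep the one the reward model scores highest. The sketch below is a minimal illustration of that idea, not the authors' released code; the checkpoint names and the assumption that the reward model behaves like a single-logit sequence classifier are placeholders (the released UltraRM uses its own model class), so adapt the scoring call to whatever reward model you actually load.

```python
# Hedged sketch of best-of-n sampling against a reward model.
# Checkpoint ids and the reward-model interface below are assumptions for illustration.
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

policy_name = "openbmb/UltraLM-13b"   # assumed policy checkpoint
reward_name = "openbmb/UltraRM-13b"   # assumed reward checkpoint (stand-in interface)

tokenizer = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name, torch_dtype=torch.float16, device_map="auto")
rm_tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_name, num_labels=1, torch_dtype=torch.float16, device_map="auto")

def best_of_n(prompt: str, n: int = 16, max_new_tokens: int = 512) -> str:
    """Sample n completions from the policy, return the one the reward model scores highest."""
    inputs = tokenizer(prompt, return_tensors="pt").to(policy.device)
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    # Strip the prompt tokens so only the generated continuations are decoded.
    completions = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Score each (prompt, completion) pair and keep the argmax.
    scores = []
    for completion in completions:
        rm_inputs = rm_tokenizer(prompt + completion, return_tensors="pt",
                                 truncation=True).to(reward_model.device)
        with torch.no_grad():
            scores.append(reward_model(**rm_inputs).logits.squeeze().item())
    return completions[int(torch.tensor(scores).argmax())]
```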
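
The Experiment Setup row quotes a cosine learning-rate decay from a peak of 1e-5 to a final value of 1e-6 with a 3% warm-up for the reward model. The following is a minimal sketch of such a schedule, assuming a linear warm-up and a cosine decay to a floor; the paper does not give an implementation, and the example step count is likewise an assumption.

```python
# Sketch of the quoted reward-model LR schedule: peak 1e-5, 3% warm-up,
# cosine decay to a final LR of 1e-6 over one epoch. Formula and step
# counts are assumptions for illustration.
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, final_lr: float = 1e-6,
               warmup_ratio: float = 0.03) -> float:
    """Linear warm-up to peak_lr, then cosine decay down to final_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Example: if one epoch covered roughly 250k comparison pairs (an assumption based
# on the dataset size quoted above) at 512 pairs per batch, that is ~488 steps.
total_steps = 488
print(lr_at_step(0, total_steps), lr_at_step(total_steps - 1, total_steps))
```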