ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback

Authors: Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, Maosong Sun

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We comprehensively validate the advantage of AI feedback in boosting open-source models with ULTRAFEEDBACK. By fine-tuning a LLaMA2-13B model (Touvron et al., 2023b), we build a state-of-the-art reward model UltraRM, which significantly outperforms existing open-source reward models. Based on UltraRM, we enhance a powerful open-source model UltraLM (Ding et al., 2023; Touvron et al., 2023a) with best-of-n sampling and PPO. Experiments show that both strategies improve the model dramatically. (A minimal best-of-n sampling sketch is given after the table.)
Researcher Affiliation | Collaboration | 1 NLP Group, DCST, IAI, BNRIST, Tsinghua University; 2 University of Illinois Urbana-Champaign; 3 ModelBest Inc.; 4 Ping An Technology; 5 Tencent; 6 Renmin University of China; 7 Jiangsu Collaborative Innovation Center for Language Ability.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states "We release a suite of resources for feedback learning research, including a dataset, reward model, and critique model." However, it does not provide a specific link to, or an explicit statement about releasing, the code used for the described methodology.
Open Datasets | Yes | We finally present ULTRAFEEDBACK, a large-scale, high-quality, and diversified AI feedback dataset, which contains over 1 million GPT-4 feedback for 250k user-assistant conversations from various aspects. [...] We release a suite of resources for feedback learning research, including a dataset, reward model, and critique model.
Dataset Splits | No | The paper discusses training on combined datasets (ULTRAFEEDBACK, Stanford SHP, OpenAI Summarization, Anthropic Helpful) and mentions that OpenAI WebGPT has no training and test splits, but it does not give explicit details about the validation splits used to train its own models.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions training models but does not provide specific version numbers for the software dependencies or libraries used in the experiments.
Experiment Setup | Yes | We train the 13B reward model for one epoch with a batch size of 512 pairs (i.e., 1024 completions) and a learning rate of 1e-5. We adopt the cosine learning rate decay strategy with a warm-up ratio of 3% and a final learning rate of 1e-6. [...] We train LLaMA2-13B for two epochs with a batch size of 256 and a learning rate of 2e-5. [...] We tune UltraLM for 80 iterations on the ULTRAFEEDBACK prompts. In each iteration, we collect 512 samples and update the policy model with a mini-batch size of 64. The learning rate is fixed at 1e-6. (A sketch of the reported cosine warm-up schedule is given after the table.)
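For concreteness, here is a minimal sketch of the best-of-n sampling strategy mentioned in the Research Type row: sample n completions from the policy model, score each with the reward model, and keep the highest-scoring one. The Hugging Face model identifiers, the use of a sequence-classification head for the reward model, and the sampling parameters are illustrative assumptions rather than details taken from the paper.

```python
# Best-of-n sampling sketch (model IDs, reward-head class, and sampling settings are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

POLICY_ID = "openbmb/UltraLM-13b"   # assumed identifier, for illustration only
REWARD_ID = "openbmb/UltraRM-13b"   # assumed identifier, for illustration only

policy_tok = AutoTokenizer.from_pretrained(POLICY_ID)
policy = AutoModelForCausalLM.from_pretrained(POLICY_ID, torch_dtype=torch.float16, device_map="auto")

reward_tok = AutoTokenizer.from_pretrained(REWARD_ID)
reward = AutoModelForSequenceClassification.from_pretrained(REWARD_ID, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def best_of_n(prompt: str, n: int = 16, max_new_tokens: int = 512) -> str:
    """Sample n completions, score each with the reward model, and return the best one."""
    inputs = policy_tok(prompt, return_tensors="pt").to(policy.device)
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        top_p=1.0,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    prompt_len = inputs["input_ids"].shape[1]
    completions = [
        policy_tok.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs
    ]
    # Score each (prompt, completion) pair; assumes the reward model outputs a single scalar logit.
    scores = []
    for completion in completions:
        r_inputs = reward_tok(prompt + "\n" + completion, return_tensors="pt", truncation=True).to(reward.device)
        scores.append(reward(**r_inputs).logits.squeeze().item())
    return completions[max(range(n), key=scores.__getitem__)]
```

In this setup the policy model is never updated; the reward model is used purely as a reranker over sampled candidates, in contrast to PPO, which uses its scores to update the policy.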
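The Experiment Setup row reports a cosine learning-rate decay with a 3% warm-up ratio, a peak learning rate of 1e-5, and a final learning rate of 1e-6 for reward-model training. The sketch below reproduces that schedule shape in PyTorch; the total step count and the stand-in linear model are placeholders, since the exact step budget is not restated here.

```python
# Cosine LR schedule with linear warm-up and a non-zero floor, matching the
# reported reward-model hyperparameters (peak 1e-5, 3% warm-up, final 1e-6).
import math
import torch

PEAK_LR, FINAL_LR = 1e-5, 1e-6
TOTAL_STEPS = 1000                      # placeholder; depends on dataset size and batch size
WARMUP_STEPS = int(0.03 * TOTAL_STEPS)  # 3% warm-up ratio

def lr_lambda(step: int) -> float:
    """Multiplicative factor applied to PEAK_LR at each optimizer step."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)            # linear warm-up to the peak rate
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    floor = FINAL_LR / PEAK_LR                        # decay toward 1e-6 rather than 0
    return floor + (1.0 - floor) * cosine

model = torch.nn.Linear(8, 1)                         # stand-in for the 13B reward model
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    optimizer.step()                                  # forward/backward pass elided
    scheduler.step()
```

The floor term keeps the decayed rate at the reported final value of 1e-6 instead of letting the cosine curve reach zero.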