OpenChat: Advancing Open-source Language Models with Mixed-Quality Data

Authors: Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, Yang Liu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on three standard benchmarks, our openchat-13b fine-tuned with C-RLFT achieves the highest average performance among all 13b open-source language models. Moreover, we use AGIEval to validate the model generalization performance, in which only openchat-13b surpasses the base model. Finally, we conduct a series of analyses to shed light on the effectiveness and robustness of OpenChat.
Researcher Affiliation | Collaboration | Guan Wang (1,2), Sijie Cheng (1,3,5), Xianyuan Zhan (3,4), Xiangang Li (5), Sen Song (2, corresponding), Yang Liu (1,3,4, corresponding). 1 Department of Computer Science and Technology, Tsinghua University; 2 Laboratory of Brain and Intelligence, Tsinghua University; 3 Institute for AI Industry Research (AIR), Tsinghua University; 4 Shanghai Artificial Intelligence Laboratory; 5 01.AI
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code, data, and models are publicly available at https://github.com/imoneoi/openchat and https://huggingface.co/openchat.
Open Datasets | Yes | Following Vicuna (Chiang et al., 2023), we adopt a widely-used SFT dataset, the ShareGPT dataset. The ShareGPT dataset consists of approximately 70k user-shared conversations, including around 6k expert conversations generated by GPT-4 and the remaining sub-optimal conversations from GPT-3.5. We perform experiments to assess their varying quality in Sec. 5.1. The ShareGPT dataset is collected from https://sharegpt.com/. (A data-weighting sketch based on this description follows the table.)
Dataset Splits | No | The paper specifies training details such as 'fine-tune the model for 5 epochs on the ShareGPT dataset' and 'an effective batch size of 200k tokens', but it does not specify a separate validation split or describe how validation was performed during fine-tuning on the ShareGPT data.
Hardware Specification | No | The paper mentions fine-tuning a model but does not specify any hardware details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper mentions using the AdamW optimizer and a cosine learning rate schedule, but it does not specify software dependencies with version numbers (e.g., the Python version or the version of a deep learning framework such as PyTorch or TensorFlow).
Experiment Setup | Yes | The openchat-13b is based on the llama-2-13b (Touvron et al., 2023b). We fine-tune the model for 5 epochs on the ShareGPT dataset using the AdamW optimizer with a sequence length of 4,096 tokens and an effective batch size of 200k tokens. Given that the reward weight term in Eq. (6), exp(r_c/β), remains constant within a class, we simplify the process by assigning a unit weight to D_exp and a weight of 0.1 to D_sub. The AdamW optimizer's hyperparameters are set as follows: β1 = 0.9, β2 = 0.95, ε = 10^-5, and weight decay of 0.1. We employ a cosine learning rate schedule with a maximum learning rate of 6.7 × 10^-5, which decays to 10% of the maximum value. (A hedged sketch of this training configuration also follows the table.)
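
For the Open Datasets row: a minimal sketch, not the authors' released pipeline, of how the mixed-quality ShareGPT conversations could be partitioned into the expert (GPT-4) subset D_exp and the sub-optimal (GPT-3.5) subset D_sub and tagged with the coarse-grained class weights quoted in the Experiment Setup row. The file name `sharegpt.json` and the `source` field used to identify the generating model are assumptions; the released code at https://github.com/imoneoi/openchat may organize the data differently.

```python
import json

# Class weights quoted in the paper's setup: unit weight for the expert
# (GPT-4) subset D_exp, 0.1 for the sub-optimal (GPT-3.5) subset D_sub.
EXPERT_WEIGHT = 1.0
SUBOPTIMAL_WEIGHT = 0.1

def load_mixed_quality_data(path="sharegpt.json"):
    """Split ShareGPT-style conversations by generating model and attach
    the class weight used for coarse-grained reward labeling.

    Assumes each record carries a `source` field naming the model that
    produced the assistant turns ("gpt-4" vs. "gpt-3.5"); the actual
    dataset may encode this information differently.
    """
    with open(path) as f:
        records = json.load(f)

    weighted = []
    for rec in records:
        is_expert = rec.get("source", "").startswith("gpt-4")
        weighted.append({
            "conversation": rec["conversations"],
            "class": "expert" if is_expert else "sub-optimal",
            "weight": EXPERT_WEIGHT if is_expert else SUBOPTIMAL_WEIGHT,
        })
    return weighted
```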
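
For the Experiment Setup row: a hedged PyTorch sketch of the quoted optimizer and schedule settings (AdamW with β1 = 0.9, β2 = 0.95, ε = 10^-5, weight decay 0.1; cosine decay from a peak of 6.7 × 10^-5 to 10% of that value), together with a per-example weighted cross-entropy that mirrors the unit/0.1 class weights. The model, labels, and step count are placeholders, and details the paper does not state (e.g., warmup) are omitted; this is an illustration of the reported configuration, not the authors' training code.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

MAX_LR = 6.7e-5       # peak learning rate quoted in the paper
MIN_LR_RATIO = 0.1    # cosine schedule decays to 10% of the peak

def build_optimizer_and_scheduler(model, total_steps):
    """AdamW plus cosine schedule with the quoted hyperparameters.
    `total_steps` must be derived from the 5-epoch / 200k-token effective
    batch setup, which is not spelled out here."""
    optimizer = AdamW(
        model.parameters(),
        lr=MAX_LR,
        betas=(0.9, 0.95),
        eps=1e-5,
        weight_decay=0.1,
    )

    def cosine_with_floor(step):
        # Multiplier on MAX_LR: 1.0 at step 0, MIN_LR_RATIO at the end.
        progress = min(step / max(total_steps, 1), 1.0)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine

    scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_floor)
    return optimizer, scheduler

def weighted_sft_loss(logits, labels, example_weights):
    """Next-token cross-entropy scaled per example by the class weight
    (1.0 for expert conversations, 0.1 for sub-optimal ones)."""
    # logits: (batch, seq, vocab); labels: (batch, seq) with -100 on ignored tokens
    per_token = torch.nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
        reduction="none",
    ).view(labels.size(0), -1)
    mask = (labels[:, 1:] != -100).float()
    per_example = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (example_weights * per_example).mean()
```

The scheduler keeps a floor at 10% of the peak rather than decaying to zero, matching the "decays to 10% of the maximum value" wording; whether the authors implemented the floor this way is an assumption.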