WizardArena: Post-training Large Language Models via Simulated Offline Chatbot Arena

Authors: Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jian-Guang Lou, Shifeng Chen, Yansong Tang, Weizhu Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our Wizard Arena aligns closely with the online human arena rankings, and our models, trained on extensive offline battle data through Arena Learning, demonstrate marked improvements in performance across the SFT, DPO, and PPO stages.
Researcher Affiliation | Collaboration | Haipeng Luo (1), Qingfeng Sun (2), Can Xu (2), Pu Zhao (2), Qingwei Lin (2), Jianguang Lou (2), Shifeng Chen (3), Yansong Tang (1), Weizhu Chen (2); (1) Shenzhen International Graduate School, Tsinghua University; (2) Microsoft Corporation; (3) Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Pseudocode | No | The paper describes its methods in prose but does not include any structured pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper's NeurIPS checklist answers 'Yes, please refer to the appendix source code of this article for details' regarding open access to data and code, but the provided PDF contains no such appendix and no explicit link or statement in the main body releasing the authors' implementation code for their methodology.
Open Datasets | Yes | We collected some instructions from open available datasets (i.e., Alpaca [11], FLAN [72], LMSYS-Chat-1M [87], Open Orca [88], WizardLM [12]), and optimized them using the following steps: A hedged data-loading sketch for these sources appears after this table.
Dataset Splits | No | The paper describes splitting the dataset D into nine slices (D0, D1, D2, ..., DN) for iterative training, but it does not specify explicit train/validation/test splits, with percentages or sample counts, for model evaluation and hyperparameter tuning. A slicing sketch appears after this table.
Hardware Specification | No | The paper mentions applying the method to Mistral-7B and using Llama3-70B-Chat as a judge model, but it does not specify hardware details such as GPU models, CPU types, or memory used for training or inference.
Software Dependencies | No | The paper mentions using 'Deep Speed [95] and TRL [96] for SFT and RL' but does not specify version numbers for these or for any other software dependencies.
Experiment Setup | Yes | In supervised fine-tuning, we trained three epochs with a learning rate of 5e-6, a batch size of 128, and a sequence length of 4096. For PPO reward model training, Mistral-7B was trained for one epoch at a learning rate of 1e-6. In PPO training, the learning rate was 1e-7 for one epoch with a KL coefficient of 0.4, and for DPO training, it was 5e-7 for two epochs with a beta of 0.3. A configuration sketch mirroring these values appears after this table.
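
The Open Datasets entry above quotes the instruction sources the paper draws from. Below is a minimal, hedged sketch of pulling two of them with the Hugging Face datasets library; the hub identifiers ("tatsu-lab/alpaca", "Open-Orca/OpenOrca") and the column names are assumptions about the public copies, not paths or schemas stated in the paper, and LMSYS-Chat-1M additionally requires accepting a license on the hub.

```python
# Hedged sketch: the hub IDs and column names below are assumptions about the
# public mirrors of datasets named in the paper, not identifiers the authors give.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")   # instruction-tuning pairs
orca = load_dataset("Open-Orca/OpenOrca", split="train")   # large; downloads several GB

# Keep only the prompt text, since the paper collects instructions and
# regenerates the responses during the simulated battles.
instructions = [row["instruction"] for row in alpaca]
instructions += [row["question"] for row in orca.select(range(10_000))]
print(f"collected {len(instructions)} instructions")
```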
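
The Dataset Splits entry notes that the paper divides the cleaned instruction set D into nine slices D0, D1, ..., DN that are consumed across successive training iterations. The sketch below shows one straightforward way to produce such equal slices; the shuffling, seed, and function name are assumptions, and only the nine-way split count comes from the paper.

```python
import random

def split_into_slices(records, num_slices=9, seed=0):
    """Partition instruction records into roughly equal slices D0 .. D{num_slices-1}.

    Minimal sketch of the iterative-training split described in the paper;
    the shuffle and seed are assumptions, not details the authors report.
    """
    data = list(records)
    random.Random(seed).shuffle(data)
    size = (len(data) + num_slices - 1) // num_slices  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

# D0 would seed the initial SFT model; later slices feed subsequent battle rounds.
slices = split_into_slices(range(90_000))
print([len(s) for s in slices])  # nine slices of 10,000 records each
```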
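
The Experiment Setup entry reports the headline hyperparameters. The sketch below wires the SFT numbers into Hugging Face TrainingArguments (the paper names DeepSpeed and TRL as its training stack); the output directory, per-device batch size, gradient-accumulation factor, and bf16 flag are assumptions, chosen only so the effective batch size can reach the reported 128. The DPO/PPO values are kept as plain dictionaries because the exact TRL version, and therefore its config classes, is not stated.

```python
from transformers import TrainingArguments

# Reported SFT setup: 3 epochs, learning rate 5e-6, total batch size 128,
# sequence length 4096. Per-device batch size and gradient accumulation are
# assumptions; only their product (times the GPU count) must reach 128.
sft_args = TrainingArguments(
    output_dir="wizardarena-sft",        # hypothetical path
    num_train_epochs=3,
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,      # 4 * 32 = 128 on a single GPU
    bf16=True,                           # assumption; precision is not reported
)

# Reported preference-optimization settings, kept as plain values.
reward_model = {"learning_rate": 1e-6, "num_train_epochs": 1}
ppo = {"learning_rate": 1e-7, "num_train_epochs": 1, "kl_coefficient": 0.4}
dpo = {"learning_rate": 5e-7, "num_train_epochs": 2, "beta": 0.3}
```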