WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions
Authors: Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, Daxin Jiang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both automatic and human evaluations consistently indicate that WizardLM outperforms baselines such as Alpaca (trained from Self-Instruct) and Vicuna (trained from human-created instructions). The experimental results demonstrate that the quality of the instruction-following dataset crafted by Evol-Instruct can significantly improve the performance of LLMs. |
| Researcher Affiliation | Collaboration | Microsoft and Peking University |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm" blocks. (An illustrative, non-authoritative sketch of an Evol-Instruct-style loop follows this table.) |
| Open Source Code | No | The paper references other open-source models and their repositories (e.g., Alpaca, FastChat for Vicuna), but does not provide a link or explicit statement that the code for WizardLM or Evol-Instruct is open source or available. |
| Open Datasets | Yes | We evolve the instructions from Alpaca (Taori et al., 2023) data (created by machine)... Alpaca data has a total of 52k samples... To ensure a fair comparison with Vicuna's 70k real user data, we sampled 70k from the full 250k data and fine-tuned the LLaMA 13B model. |
| Dataset Splits | No | The paper describes training data, test data (WizardEval, LLM benchmarks), and fine-tuning procedures but does not explicitly mention or specify a separate validation dataset split or how it was used in the experimental setup for their models. |
| Hardware Specification | Yes | We train our model on 8 V100 GPUs with Deepspeed Zero-3 for 140 hours on 3 epochs. |
| Software Dependencies | No | The paper mentions using "LLaMA 13B (Touvron et al., 2023)", the "Adam optimizer", "Deepspeed Zero-3", the "OpenAI ChatGPT API", and "FastChat" (for the Vicuna 13B-v1.1 model). However, it does not provide consistent specific version numbers for all key software dependencies (e.g., Python, PyTorch/TensorFlow, specific API versions for OpenAI, or exact versions for other libraries). |
| Experiment Setup | Yes | We adopt the Adam optimizer with an initial learning rate of 2 × 10⁻⁵, a maximum number of tokens of 2048, and a batch size of 4 for each GPU. We train our model on 8 V100 GPUs with Deepspeed Zero-3 for 140 hours on 3 epochs. For inference, we use greedy search for WizardLM and baseline models, and set the maximum generation length to 2048. (See the configuration sketch below the table.) |
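
The Hardware Specification and Experiment Setup rows pin down most of the reported fine-tuning recipe. The sketch below restates those numbers as a Hugging Face `TrainingArguments` configuration with a minimal DeepSpeed ZeRO-3 block. The paper does not name its training framework, so the class names, `output_dir`, the fp16 setting, and the `adamw_torch` optimizer string are assumptions; the learning rate, per-GPU batch size, epoch count, and greedy-decoding settings come from the rows above.

```python
# Hedged sketch of the reported setup: LLaMA 13B fine-tuned with Adam,
# lr 2e-5, batch size 4 per GPU, 3 epochs, DeepSpeed ZeRO-3 on 8 V100s.
from transformers import TrainingArguments

# Minimal ZeRO-3 configuration. The paper only says "Deepspeed Zero-3";
# everything beyond "stage": 3 is an assumption.
deepspeed_zero3 = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
}

training_args = TrainingArguments(
    output_dir="wizardlm-13b",        # hypothetical output path
    per_device_train_batch_size=4,    # batch size 4 per GPU (8 V100s total)
    learning_rate=2e-5,               # initial learning rate from the paper
    num_train_epochs=3,               # 3 epochs (~140 hours reported)
    optim="adamw_torch",              # paper says "Adam optimizer"
    fp16=True,
    deepspeed=deepspeed_zero3,        # shard optimizer/params across the 8 GPUs
)
# The 2048-token maximum sequence length is enforced at tokenization time,
# not through TrainingArguments.

# Inference: greedy search with a 2048-token generation cap, as reported.
generation_kwargs = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
```
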
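The Pseudocode row notes that the paper never formalizes Evol-Instruct as an algorithm block. For orientation only, here is a loose Python sketch of an instruction-evolution loop in that spirit; the operation list, the prompt wording, the `call_chatgpt` helper, and the filtering rule are hypothetical stand-ins, not the authors' actual prompts or elimination criteria.

```python
# Illustrative Evol-Instruct-style evolution loop (not the paper's prompts).
import random

IN_DEPTH_OPS = ["add constraints", "deepen the question",
                "concretize the concepts", "increase reasoning steps"]

def call_chatgpt(prompt: str) -> str:
    """Hypothetical wrapper around the ChatGPT API mentioned in the paper."""
    raise NotImplementedError

def evolve(instruction: str) -> str:
    """Produce one evolved instruction, either in-depth or in-breadth."""
    if random.random() < 0.5:
        op = random.choice(IN_DEPTH_OPS)
        prompt = f"Rewrite this instruction to {op}, keeping it answerable:\n{instruction}"
    else:  # in-breadth evolving: a brand-new instruction on a related topic
        prompt = f"Create a new instruction in the same domain as:\n{instruction}"
    return call_chatgpt(prompt)

def evolve_pool(seed_instructions: list[str], rounds: int = 4) -> list[str]:
    """Iteratively grow the instruction pool, dropping failed evolutions."""
    pool = list(seed_instructions)
    for _ in range(rounds):
        evolved = [evolve(i) for i in pool]
        # Elimination step (assumed): discard empty or unchanged outputs.
        pool += [e for e, i in zip(evolved, pool) if e and e.strip() != i.strip()]
    return pool
```

Starting from the 52k Alpaca seed instructions and running a few rounds of such a loop would yield the larger evolved pool the paper subsamples to 70k for fine-tuning; the round count and filtering heuristic above are placeholders.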