WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions

Authors: Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, Daxin Jiang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both automatic and human evaluations consistently indicate that WizardLM outperforms baselines such as Alpaca (trained from Self-Instruct) and Vicuna (trained from human-created instructions). The experimental results demonstrate that the quality of the instruction-following dataset crafted by Evol-Instruct can significantly improve the performance of LLMs.
Researcher Affiliation | Collaboration | 1 Microsoft; 2 Peking University
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm" blocks.
Open Source Code | No | The paper references other open-source models and their repositories (e.g., Alpaca, FastChat for Vicuna), but does not provide a link or explicit statement that the code for WizardLM or Evol-Instruct is open source or available.
Open Datasets | Yes | We evolve the instructions from Alpaca (Taori et al., 2023) data (created by machine)... Alpaca data has a total of 52k samples... To ensure a fair comparison with Vicuna's 70k real user data, we sampled 70k from the full 250k data and fine-tuned the LLaMA 13B model. (See the sampling sketch after the table.)
Dataset Splits | No | The paper describes training data, test data (WizardEval, LLM benchmarks), and fine-tuning procedures, but does not explicitly specify a separate validation split or how one was used in the experimental setup for their models.
Hardware Specification | Yes | We train our model on 8 V100 GPUs with Deepspeed Zero-3 for 140 hours on 3 epochs.
Software Dependencies | No | The paper mentions using "LLaMA 13B (Touvron et al., 2023)", the Adam optimizer, Deepspeed Zero-3, the OpenAI ChatGPT API, and FastChat (for the Vicuna 13B-v1.1 model). However, it does not provide specific version numbers for key software dependencies (e.g., Python, PyTorch/TensorFlow, the OpenAI API, or other libraries).
Experiment Setup | Yes | We adopt the Adam optimizer with an initial learning rate of 2 × 10⁻⁵, a maximum number of tokens of 2048, and a batch size of 4 for each GPU. We train our model on 8 V100 GPUs with Deepspeed Zero-3 for 140 hours on 3 epochs. For inference, we use greedy search for WizardLM and baseline models, and set the maximum generation length to 2048. (See the configuration and decoding sketches after the table.)
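
The Open Datasets row notes that a 70k training subset was drawn from the full 250k evolved instructions to match the size of Vicuna's 70k real user data. The paper does not say how that subset was drawn, so the sketch below assumes a uniform random sample; the file names, JSON layout, and random seed are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: draw a 70k subset from the full 250k evolved instruction set.
# Uniform random sampling is an assumption; the paper does not describe the
# sampling procedure. File names and data layout are also assumptions.
import json
import random

with open("evol_instruct_250k.json") as f:   # assumed filename
    full_data = json.load(f)                 # assumed: a list of instruction-response records

random.seed(0)                               # assumption: no seed is reported
subset = random.sample(full_data, k=70_000)  # 70k examples, matching Vicuna's data size

with open("evol_instruct_70k.json", "w") as f:
    json.dump(subset, f, ensure_ascii=False, indent=2)
```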
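The Hardware Specification and Experiment Setup rows together give the reported fine-tuning configuration: 8 V100 GPUs, DeepSpeed ZeRO-3, Adam with an initial learning rate of 2 × 10⁻⁵, 2048 maximum tokens, a per-GPU batch size of 4, and 3 epochs. The sketch below writes those numbers out as a DeepSpeed ZeRO-3 config plus a hyperparameter dictionary; the gradient-accumulation setting, the fp16 flag, the file name, and the launcher command are assumptions, since the paper does not specify them.

```python
# Sketch of the reported fine-tuning configuration. Only the values echoed in
# the comments come from the paper; everything marked "assumption" is illustrative.
import json

ds_zero3_config = {
    "zero_optimization": {"stage": 3},      # DeepSpeed ZeRO-3
    "train_micro_batch_size_per_gpu": 4,    # batch size is 4 for each GPU
    "gradient_accumulation_steps": 1,       # assumption: not stated in the paper
    "fp16": {"enabled": True},              # assumption: mixed precision on V100
}

train_hparams = {
    "optimizer": "adam",       # Adam optimizer
    "learning_rate": 2e-5,     # initial learning rate 2 x 10^-5
    "max_seq_length": 2048,    # maximum number of tokens 2048
    "num_train_epochs": 3,     # 3 epochs
    "num_gpus": 8,             # 8 V100 GPUs
}

# Write the DeepSpeed config to disk so it can be passed to a launcher, e.g.
#   deepspeed --num_gpus=8 train.py --deepspeed ds_zero3.json
# (train.py is a hypothetical training script, not something the paper provides.)
with open("ds_zero3.json", "w") as f:
    json.dump(ds_zero3_config, f, indent=2)

print(json.dumps(train_hparams, indent=2))
```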
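For inference, the Experiment Setup row reports greedy search with a maximum generation length of 2048. A minimal sketch of those decoding settings using Hugging Face's GenerationConfig follows; the choice of the transformers library is an assumption, since the paper does not name an inference stack.

```python
# Sketch of the reported decoding settings: greedy search, max length 2048.
# Using transformers' GenerationConfig is an assumption.
from transformers import GenerationConfig

greedy_config = GenerationConfig(
    do_sample=False,   # greedy search: take the argmax token at every step
    num_beams=1,       # no beam search
    max_length=2048,   # maximum generation length of 2048
)

# This config would be passed to model.generate(..., generation_config=greedy_config).
print(greedy_config)
```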