WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Authors: Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments on five prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, DS-1000, and MultiPL-E, our models showcase outstanding performance.
Researcher Affiliation | Collaboration | Ziyang Luo (2), Can Xu (1), Pu Zhao (1), Qingfeng Sun (1), Xiubo Geng (1), Wenxiang Hu (1), Chongyang Tao (2), Jing Ma (2), Qingwei Lin (1), Daxin Jiang (1); (1) Microsoft, (2) Hong Kong Baptist University
Pseudocode | No | The paper includes examples of code and prompt templates, but no structured pseudocode or algorithm blocks labeled as such.
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for its WizardCoder models or its Code Evol-Instruct method. It references third-party open-source projects and datasets, but not its own implementation code.
Open Datasets | Yes | HumanEval (Chen et al., 2021b), HumanEval+ (Liu et al., 2023), MBPP (Austin et al., 2021), DS-1000 (Lai et al., 2022), and MultiPL-E (Cassano et al., 2022). (A loading sketch follows the table.)
Dataset Splits | Yes | An external dev set serves as the controlled Evol stop: if the performance drops, we halt the evolution. In Appendix C, we outline the approach employed to prevent data leakage. Additionally, Appendix D showcases some evolved examples for reference. [...] Due to the limited size of the dev set of MBPP, we merged the training set and dev set, forming the MBPP-400 dev set. (A sketch of this stopping rule follows the table.)
Hardware Specification | No | The paper mentions using OpenAI's gpt-3.5-turbo and GPT-4 for the evolution process, but these are model names, not hardware specifications such as the GPU models or CPU types used for the authors' own fine-tuning experiments.
Software Dependencies | No | The paper mentions OpenAI's gpt-3.5-turbo, Python, TensorFlow, and PyTorch, but does not provide specific version numbers for these software components or libraries, which reproducibility would require.
Experiment Setup | Yes | To fine-tune the basic models, we employ specific configurations, including a batch size of 512, a sequence length of 2048, 200 fine-tuning steps, 30 warmup steps, a learning rate of 2e-5, a Cosine learning rate scheduler, and fp16 mixed precision. (A hedged config sketch follows the table.)
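
The benchmarks cited in the Open Datasets row are all publicly distributed. As a hedged illustration only: HumanEval and MBPP are mirrored on the Hugging Face Hub under the IDs used below, though the paper does not say how the authors loaded them, so this sketch is not their pipeline.

    # Minimal sketch: loading two of the cited benchmarks with the Hugging Face
    # `datasets` library. The hub IDs are this page's assumption, not a detail
    # taken from the paper.
    from datasets import load_dataset

    humaneval = load_dataset("openai_humaneval", split="test")  # 164 problems
    mbpp = load_dataset("mbpp", split="test")                   # 500 problems

    print(humaneval[0]["prompt"])  # signature + docstring the model must complete
    print(mbpp[0]["text"])         # natural-language task description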
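
The Dataset Splits row describes the paper's stopping rule for Code Evol-Instruct: evolve the instruction data round by round and halt once dev-set performance drops. A minimal sketch of that control loop follows; every helper passed in (evolve_instructions, finetune, eval_pass_at_1) is a hypothetical placeholder, not code from the paper.

    # Sketch of the dev-set-controlled "Evol stop" loop. All helpers are
    # hypothetical placeholders standing in for the paper's actual components.
    def run_code_evol(seed_data, dev_set, evolve_instructions, finetune,
                      eval_pass_at_1, max_rounds=5):
        pool = list(seed_data)
        best_model, best_score = None, float("-inf")
        for _ in range(max_rounds):
            evolved = evolve_instructions(pool)    # e.g., via gpt-3.5-turbo prompts
            candidate = finetune(pool + evolved)   # fine-tune the base code LLM
            score = eval_pass_at_1(candidate, dev_set)
            if score < best_score:                 # dev performance dropped: halt
                break
            best_model, best_score = candidate, score
            pool += evolved                        # keep this round's data
        return best_model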
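
The Experiment Setup row reports concrete hyperparameters. As a hedged sketch, they map onto Hugging Face TrainingArguments as shown below; the paper does not state which training framework was used, and the per-device/gradient-accumulation split that realizes the effective batch size of 512 is an assumption.

    # Hedged mapping of the reported hyperparameters onto Hugging Face
    # TrainingArguments; the framework choice and the batch-size factorization
    # (8 GPUs x 8 per device x 8 accumulation = 512) are assumptions.
    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="wizardcoder-ft",       # hypothetical output path
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,     # with 8 GPUs -> effective batch 512
        max_steps=200,                     # 200 fine-tuning steps
        warmup_steps=30,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        fp16=True,                         # fp16 mixed precision
    )
    # The sequence length of 2048 would be enforced at tokenization time, e.g.
    # tokenizer(batch, truncation=True, max_length=2048).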