OctoPack: Instruction Tuning Code Large Language Models

Authors: Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark COMMITPACK against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HUMANEVALPACK, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OCTOCODER and OCTOGEEX, achieve the best performance across HUMANEVALPACK among all permissive models, demonstrating COMMITPACK's benefits in generalizing to a wider set of languages and natural coding tasks.
Researcher Affiliation | Collaboration | Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre. n.muennighoff@gmail.com. Code, models and data are freely available at https://github.com/bigcode-project/octopack. We thank Hugging Face for providing compute instances.
Pseudocode | No | The paper does not contain any explicit 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code, models and data are freely available at https://github.com/bigcode-project/octopack.
Open Datasets | Yes | We compile COMMITPACK: 4 terabytes of Git commits across 350 programming languages. Code, models and data are freely available at https://github.com/bigcode-project/octopack. Instruction tuning StarCoder (Li et al., 2023b) on a filtered variant of COMMITPACK and OASST leads to our best model, OCTOCODER. (A hedged dataset-loading sketch follows the table.)
Dataset Splits | No | For finetuning on these datasets, we use small subsets with around 5,000 samples each. We generate n = 20 samples, which is enough to get reliable pass@1 estimates (Li et al., 2023b). The paper does not describe explicit training, validation, and test splits for its own training of OCTOCODER and OCTOGEEX beyond the general use of "finetuning" data. (A sketch of the standard pass@1 estimator follows the table.)
Hardware Specification | No | We thank Hugging Face for providing compute instances. This is not specific enough to identify GPU/CPU models or other hardware details.
Software Dependencies | No | The paper mentions models such as StarCoder and CodeGeeX2 and programming languages such as Python, but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | No | Training hyperparameters for both models are in Appendix P. The content of Appendix P is not included in the provided text.
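The open-data rows above point readers to the OctoPack repository for COMMITPACK and HUMANEVALPACK. As a rough illustration of how the released data could be pulled in, below is a minimal sketch using the Hugging Face datasets library; the Hub IDs (bigcode/commitpackft, bigcode/humanevalpack) and the per-language configuration names are assumptions, not details stated in the excerpts above, and should be verified against the repository README.

    from datasets import load_dataset

    # Assumed Hub IDs and config names -- verify against
    # https://github.com/bigcode-project/octopack before relying on them.
    commitpackft = load_dataset("bigcode/commitpackft", "python", split="train")
    humanevalpack = load_dataset("bigcode/humanevalpack", "python", split="test")

    # Inspect the schema rather than assuming field names.
    print(commitpackft)
    print(humanevalpack)
    print(commitpackft[0].keys())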
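The 46.2% pass@1 figure and the n = 20 samples per problem cited in the rows above refer to the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). A minimal sketch of that estimator follows; the per-problem counts are made-up illustrative values, not results from the paper.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021):
        1 - C(n - c, k) / C(n, k), computed in a numerically stable form."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Illustrative example: 20 generations per problem, c of which pass the tests.
    correct_counts = [9, 0, 14, 20, 3]  # hypothetical per-problem counts
    print(f"pass@1 = {np.mean([pass_at_k(20, c, 1) for c in correct_counts]):.1%}")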