OctoPack: Instruction Tuning Code Large Language Models

Authors: Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark COMMITPACK against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HUMANEVALPACK, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OCTOCODER and OCTOGEEX, achieve the best performance across HUMANEVALPACK among all permissive models, demonstrating COMMITPACK's benefits in generalizing to a wider set of languages and natural coding tasks.
Researcher Affiliation | Collaboration | Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre. n.muennighoff@gmail.com. Code, models and data are freely available at https://github.com/bigcode-project/octopack. We thank Hugging Face for providing compute instances.
Pseudocode | No | The paper does not contain any explicit 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code, models and data are freely available at https://github.com/bigcode-project/octopack.
Open Datasets | Yes | We compile COMMITPACK: 4 terabytes of Git commits across 350 programming languages. Code, models and data are freely available at https://github.com/bigcode-project/octopack. Instruction tuning StarCoder (Li et al., 2023b) on a filtered variant of COMMITPACK and OASST leads to our best model, OCTOCODER. (A hedged dataset-loading sketch follows the table.)
Dataset Splits | No | For finetuning on these datasets, we use small subsets with around 5,000 samples each. We generate n = 20 samples, which is enough to get reliable pass@1 estimates (Li et al., 2023b). The paper does not describe explicit training, validation, and test splits for its own training of OCTOCODER and OCTOGEEX beyond the general use of "finetuning" data. (A sketch of the standard pass@1 estimator follows the table.)
Hardware Specification | No | We thank Hugging Face for providing compute instances. This is not specific enough to identify GPU/CPU models or other hardware details.
Software Dependencies | No | The paper mentions models such as StarCoder and CodeGeeX2 and programming languages such as Python, but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | No | Training hyperparameters for both models are in Appendix P. The content of Appendix P is not included in the provided text.
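The open-data rows above point readers to the OctoPack repository for COMMITPACK and HUMANEVALPACK. As a rough illustration of how the released data could be pulled in, below is a minimal sketch using the Hugging Face datasets library; the Hub IDs (bigcode/commitpackft, bigcode/humanevalpack) and the per-language configuration names are assumptions, not details stated in the excerpts above, and should be verified against the repository README.

    from datasets import load_dataset

    # Assumed Hub IDs and config names -- verify against
    # https://github.com/bigcode-project/octopack before relying on them.
    commitpackft = load_dataset("bigcode/commitpackft", "python", split="train")
    humanevalpack = load_dataset("bigcode/humanevalpack", "python", split="test")

    # Inspect the schema rather than assuming field names.
    print(commitpackft)
    print(humanevalpack)
    print(commitpackft[0].keys())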
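The 46.2% pass@1 figure and the n = 20 samples per problem cited in the rows above refer to the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). A minimal sketch of that estimator follows; the per-problem counts are made-up illustrative values, not results from the paper.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021):
        1 - C(n - c, k) / C(n, k), computed in a numerically stable form."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Illustrative example: 20 generations per problem, c of which pass the tests.
    correct_counts = [9, 0, 14, 20, 3]  # hypothetical per-problem counts
    print(f"pass@1 = {np.mean([pass_at_k(20, c, 1) for c in correct_counts]):.1%}")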