SelfCodeAlign: Self-Alignment for Code Generation

Authors: Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller.
Researcher Affiliation | Collaboration | 1 University of Illinois Urbana-Champaign; 2 Northeastern University; 3 University of California, Berkeley; 4 ServiceNow Research; 5 Hugging Face; 6 Roblox; 7 Cursor AI
Pseudocode | No | The paper provides Python code snippets and examples in the appendices (e.g., Listing 1, Listing 2, Appendix D.1), but it contains no section explicitly labeled 'Pseudocode' or 'Algorithm', nor structured pseudocode-style steps.
Open Source Code | Yes | https://github.com/bigcode-project/selfcodealign. The paper states that the source code is licensed under Apache-2.0 (C.5 License) and that "(ii) We generate a series of datasets using SelfCodeAlign and train multiple models on these datasets, which will all be released to the public."
Open Datasets | Yes | SelfCodeAlign extracts diverse coding concepts from high-quality seed snippets in The Stack V1 [28], a large corpus of permissively licensed code. (A hedged sketch of streaming such seed snippets appears after the table.)
Dataset Splits | No | The paper describes training on a generated dataset and evaluating on external benchmarks, but it does not specify train/validation/test splits of the generated dataset used for model training.
Hardware Specification | Yes | We primarily conduct data generation, training, and evaluation on a node equipped with 4 NVIDIA A100 PCI-E GPUs, 128 cores, and 512 GB of memory. For experiments involving DeepSeek-Coder, we use a node with 8 NVIDIA H100 GPUs.
Software Dependencies | No | The paper mentions software components such as PyTorch's Distributed Data Parallel (DDP), Pyright, Tree-sitter, Adafactor, and DeepSpeed ZeRO-3, but it does not provide version numbers for these or other key software dependencies. (A minimal DeepSpeed ZeRO-3 configuration sketch appears after the table.)
Experiment Setup | Yes | We set the initial learning rate at 1e-5 for training on self-generated data and 2e-5 for training on data generated from other models. Empirically, we find this to be the optimal setting for both cases. We adopt a 0.05 warmup ratio and a linear scheduler. We use Adafactor [58] as our optimizer and choose a batch size of 64 with a sequence truncation length of 1280. (A hedged sketch of these hyperparameters as a training configuration appears after the table.)
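Related to the Open Datasets row: a minimal sketch, assuming The Stack V1 is available on the Hugging Face Hub as `bigcode/the-stack`, of how candidate Python seed snippets could be streamed and coarsely filtered. The docstring heuristic below is illustrative only; the paper's actual seed selection applies stricter quality filters (e.g., the Pyright-based checks it mentions) that are not reproduced here.

```python
# Sketch: stream candidate Python seed files from The Stack V1.
# Assumptions: Hub ID "bigcode/the-stack" (access may require accepting the
# dataset's terms on the Hub) and a crude docstring-based filter that is NOT
# the paper's exact filtering pipeline.
from datasets import load_dataset

stack = load_dataset(
    "bigcode/the-stack",      # assumed Hub ID for The Stack V1
    data_dir="data/python",   # restrict to the Python slice
    split="train",
    streaming=True,           # avoid downloading the full corpus
)

def looks_like_seed(example: dict) -> bool:
    """Crude heuristic: keep files containing a function with a docstring."""
    src = example["content"]
    return "def " in src and '"""' in src

seed_candidates = (ex for ex in stack if looks_like_seed(ex))

# Peek at the first few candidates.
for i, ex in enumerate(seed_candidates):
    print(ex["content"][:120])
    if i >= 4:
        break
```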
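Related to the Software Dependencies row: since the paper names DeepSpeed ZeRO-3 but gives no version or configuration, the following is a minimal sketch of what a ZeRO-3 configuration could look like when paired with the Hugging Face Trainer. All values here are assumptions, not the authors' settings; the "auto" placeholders are resolved by the Trainer integration.

```python
# Sketch: write a minimal DeepSpeed ZeRO-3 config (assumed values, not the
# authors' settings). The "auto" entries are filled in by the Hugging Face
# Trainer when this file is passed via TrainingArguments(deepspeed=...).
import json

ds_config = {
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients, and optimizer states
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},  # assumed mixed-precision setting
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

with open("ds_zero3.json", "w") as fp:
    json.dump(ds_config, fp, indent=2)
```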
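Related to the Experiment Setup row: a hedged sketch of the reported hyperparameters (learning rate, 0.05 warmup ratio, linear scheduler, Adafactor, batch size 64, 1280-token truncation) expressed as Hugging Face `TrainingArguments`. The output directory name, the per-device vs. gradient-accumulation split of the batch size, and the bf16 setting are assumptions, not details from the paper.

```python
# Sketch: the reported finetuning hyperparameters as TrainingArguments.
# Assumptions: the 4 GPUs x (per-device 4) x (accumulation 4) = 64 batch
# split, bf16, and the output directory name; the learning rate, warmup
# ratio, scheduler, optimizer, and truncation length follow the paper.
from transformers import AutoTokenizer, TrainingArguments

args = TrainingArguments(
    output_dir="selfcodealign-sft",  # hypothetical name
    learning_rate=1e-5,              # 2e-5 when training on data from other models
    warmup_ratio=0.05,
    lr_scheduler_type="linear",
    optim="adafactor",               # Adafactor optimizer
    per_device_train_batch_size=4,   # assumed split of the effective batch size of 64
    gradient_accumulation_steps=4,
    bf16=True,                       # assumed precision
)

# The 1280-token sequence truncation would be applied at tokenization time:
tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B")
encoded = tokenizer("def add(a, b):\n    return a + b", truncation=True, max_length=1280)
```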