SelfCodeAlign: Self-Alignment for Code Generation
Authors: Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. (The pass@1 metric is sketched below the table.) |
| Researcher Affiliation | Collaboration | 1 University of Illinois Urbana-Champaign; 2 Northeastern University; 3 University of California, Berkeley; 4 ServiceNow Research; 5 Hugging Face; 6 Roblox; 7 Cursor AI |
| Pseudocode | No | The paper provides Python code snippets and examples in the appendices (e.g., Listing 1, Listing 2, Appendix D.1), but it contains no sections explicitly labeled 'Pseudocode' or 'Algorithm' and no structured steps written in pseudocode form. |
| Open Source Code | Yes | https://github.com/bigcode-project/selfcodealign; "Our source code is licensed under Apache-2.0" (C.5 License); "We generate a series of datasets using SelfCodeAlign and train multiple models on these datasets, which will all be released to the public." |
| Open Datasets | Yes | SelfCodeAlign extracts diverse coding concepts from high-quality seed snippets in The Stack V1 [28], a large corpus of permissively licensed code. |
| Dataset Splits | No | The paper describes training on a generated dataset and evaluating on external benchmarks, but it does not specify train/validation/test splits for the generated dataset used in model training. |
| Hardware Specification | Yes | We primarily conduct data generation, training, and evaluation on a node equipped with 4 NVIDIA A100 PCI-E GPUs, 128 cores, and 512 GB of memory. For experiments involving DeepSeek-Coder, we use a node with 8 NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions software components such as PyTorch's Distributed Data Parallel (DDP), Pyright, Tree-sitter, Adafactor, and DeepSpeed ZeRO-3, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | We set the initial learning rate at 1e-5 for training on self-generated data and 2e-5 for training on data generated from other models. Empirically, we find this to be the optimal setting for both cases. We adopt a 0.05 warmup ratio and a linear scheduler. We use Adafactor [58] as our optimizer and choose a batch size of 64 with a sequence truncation length of 1280. |
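The Experiment Setup row quotes the reported fine-tuning hyperparameters. The sketch below collects them into a Hugging Face `TrainingArguments` object for readability; the use of `TrainingArguments`, the output path, and the per-device batch-size split are illustrative assumptions, not the authors' released training script.

```python
from transformers import TrainingArguments

# Minimal sketch of the reported hyperparameters, assuming a Hugging Face
# Trainer-style setup (the paper does not publish this exact configuration code).
args = TrainingArguments(
    output_dir="selfcodealign-sft",   # hypothetical output directory
    learning_rate=1e-5,               # 2e-5 when training on data generated by other models
    warmup_ratio=0.05,                # 0.05 warmup ratio
    lr_scheduler_type="linear",       # linear scheduler
    optim="adafactor",                # Adafactor optimizer
    per_device_train_batch_size=16,   # assumed split: 16 x 4 A100 GPUs = effective batch size 64
)
# The quoted sequence truncation length of 1280 would be applied at tokenization
# time, e.g. tokenizer(text, truncation=True, max_length=1280), not via TrainingArguments.
```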
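The Research Type row reports 67.1 pass@1 on HumanEval+. For reference, pass@k is the standard estimator from the HumanEval evaluation protocol: given n generated samples per problem, of which c pass the tests, the unbiased per-problem estimate is 1 - C(n-c, k)/C(n, k), averaged over problems. The helper below is a sketch of that estimator; the function name is ours, and with greedy decoding (n = k = 1) pass@1 reduces to the fraction of problems whose single completion passes all tests.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn from n generations with c correct, passes."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example with made-up numbers: 200 samples for one problem, 134 correct.
print(pass_at_k(n=200, c=134, k=1))  # == 134 / 200 = 0.67
```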