LeDex: Training LLMs to Better Self-Debug and Explain Code
Authors: Nan Jiang, Xiaopeng Li, Shiqi Wang, Qiang Zhou, Soneya Hossain, Baishakhi Ray, Varun Kumar, Xiaofei Ma, Anoop Deoras
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories with a novel reward design that considers code explanation and refinement quality. SFT improves pass@1 by up to 15.92% and pass@10 by 9.30% over four benchmarks. RL training brings an additional improvement of up to 3.54% on pass@1 and 2.55% on pass@10. |
| Researcher Affiliation | Collaboration | ¹Purdue University, ²AWS AI Labs, ³University of Virginia |
| Pseudocode | No | The paper refers to the PPO algorithm in Appendix A.3 and provides mathematical formulations, but no explicit pseudocode or algorithm blocks are presented. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We use MBPP [3] (only use the 374 problems in the training set during training), APPS [4] (only use the 5,000 problems in the training set) and Code Contests [2] as our base training datasets, which contain programming problems and solutions collected from various platforms (a dataset-loading sketch follows the table). |
| Dataset Splits | No | For supervised fine-tuning, we fine-tune three LLMs (StarCoder-15B, Code Llama-7B, and Code Llama-13B) using the correct initial solutions and correct refinements collected from the MBPP training set, APPS training set, and Code Contests. |
| Hardware Specification | Yes | Both the supervised fine-tuning and reinforcement learning are conducted on 8 NVIDIA A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | The paper mentions software components such as 'AdamW [24]', the 'TRL [25] library', and 'RoBERTa(e)' for calculating sentiment similarity, but it does not provide specific version numbers for these or other software dependencies (a TRL usage sketch follows the table). |
| Experiment Setup | Yes | For supervised fine-tuning, we fine-tune three LLMs (StarCoder-15B, Code Llama-7B, and Code Llama-13B) using the correct initial solutions and correct refinements collected from the MBPP training set, APPS training set, and Code Contests. The model is fine-tuned for two epochs, using a batch size of 128. The optimizer is AdamW [24] with the learning rate set to 2e-5. The learning rate is adjusted using a warmup of 500 steps and then decayed following a cosine scheduler (a configuration sketch follows the table). |
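
The Open Datasets row names MBPP, APPS, and Code Contests as the training sources. Below is a minimal sketch of pulling the corresponding training splits with the Hugging Face `datasets` library; the Hub dataset IDs (`mbpp`, `codeparrot/apps`, `deepmind/code_contests`) are assumptions about where these datasets are hosted, not something the paper specifies.

```python
# Minimal sketch of assembling the three training sources from the
# Open Datasets row. The Hub IDs below are assumptions; the paper only
# names MBPP, APPS, and Code Contests, not where to download them.
from datasets import load_dataset

# MBPP: the paper uses only the 374 problems in the training split.
mbpp_train = load_dataset("mbpp", split="train")                   # assumed Hub ID
# APPS: only the 5,000 problems in the training split are used.
apps_train = load_dataset("codeparrot/apps", split="train")        # assumed Hub ID
# Code Contests: problems and solutions collected from various platforms.
cc_train = load_dataset("deepmind/code_contests", split="train")   # assumed Hub ID

print(len(mbpp_train), len(apps_train), len(cc_train))
```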
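The Experiment Setup row fully specifies the SFT hyperparameters (two epochs, global batch size 128, AdamW at 2e-5, 500 warmup steps, cosine decay). The following is a minimal sketch of encoding that configuration with Hugging Face Transformers `TrainingArguments`; the checkpoint name, the per-device/gradient-accumulation split of the 128-example batch, and the bf16 setting are assumptions, and the fine-tuning data (correct initial solutions plus correct refinements) is not constructed here.

```python
# Sketch of the SFT configuration from the Experiment Setup row, expressed
# as Transformers TrainingArguments. Only the hyperparameters quoted in the
# table (2 epochs, batch 128, AdamW, lr 2e-5, 500 warmup steps, cosine decay)
# come from the paper; everything else is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model_name = "codellama/CodeLlama-7b-hf"   # one of the three fine-tuned models (assumed Hub ID)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="ledex-sft",
    num_train_epochs=2,                 # "fine-tuned for two epochs"
    per_device_train_batch_size=4,      # 8 GPUs x 4 per device x 4 accumulation = 128 (assumed split)
    gradient_accumulation_steps=4,
    learning_rate=2e-5,                 # AdamW with learning rate 2e-5
    optim="adamw_torch",
    warmup_steps=500,                   # warmup of 500 steps ...
    lr_scheduler_type="cosine",         # ... then cosine decay
    bf16=True,                          # assumed mixed precision on the A100s
)

# The SFT dataset (correct initial solutions + correct refinements) is built
# elsewhere; `sft_pairs` is a hypothetical name for that tokenized dataset.
# trainer = Trainer(model=model, args=args, train_dataset=sft_pairs, tokenizer=tokenizer)
# trainer.train()
```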
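The Software Dependencies row notes that the RL stage uses the TRL library without giving a version. The skeleton below targets the `PPOTrainer.step` interface of older TRL releases (roughly 0.7–0.11; newer releases restructure this API). The model name, generation settings, and scalar reward are assumptions; the placeholder reward stands in for the paper's reward over code explanation and refinement quality, which is not reproduced here.

```python
# Skeleton of a single PPO update with the older TRL PPOTrainer interface.
# The reward below is a placeholder, not the paper's reward function.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "codellama/CodeLlama-7b-hf"          # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
config = PPOConfig(model_name=model_name,
                   learning_rate=1e-5,            # assumed RL learning rate
                   batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config=config, model=model, tokenizer=tokenizer)

# One toy PPO step: prompt -> sampled explanation + refinement -> scalar reward.
query = tokenizer("Explain and fix the buggy code:\n...", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, return_prompt=False, max_new_tokens=64)[0]
reward = torch.tensor(1.0)                        # placeholder reward
stats = ppo_trainer.step([query], [response], [reward])
```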