Fault-Aware Neural Code Rankers
Authors: Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu Lahiri, Madanlal Musuvathi, Jianfeng Gao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that CODERANKER can significantly increase the pass@1 accuracy of various code generation models (including Codex [11], GPT-Neo, GPT-J) on the APPS [25], HumanEval [11], and MBPP [3] datasets. |
| Researcher Affiliation | Industry | Microsoft Research {jinala,chenwang,meiyang,andres.codas,markenc,shuvendu,madanm,jfgao}@microsoft.com |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | The code and data are released on GitHub: https://github.com/microsoft/CodeRanker. |
| Open Datasets | Yes | We consider three existing code generation datasets for our evaluation: (1) APPS [25]: a collection of 5000 training and 5000 test tasks collected from coding competitions and interview problems, (2) HumanEval [11]: a set of 164 test tasks, and (3) MBPP [3]: a set of 974 mostly basic Python programming tasks with 474 training problems and 500 test problems. |
| Dataset Splits | Yes | The APPS dataset does not come with a validation dataset, so we used a set of 600 tasks from the original training dataset for validation; these are then excluded from the training dataset. (A hedged split sketch follows the table.) |
| Hardware Specification | Yes | All experiments are conducted on V100-32GB GPUs. |
| Software Dependencies | No | No specific software versions (e.g., Python 3.x, PyTorch 1.x) or library versions were explicitly mentioned. |
| Experiment Setup | Yes | We finetuned the GPT-J and GPT-Neo code generation models on the APPS training dataset for 2 epochs with a batch size of 256 and a learning rate of 1e-5, and chose the checkpoint with the lowest validation loss. For inference, we used temperature sampling with T = 0.8 for the Codex model and T = 0.9 for the GPT-J and GPT-Neo models unless specified otherwise. We finetuned the CODERANKER models for 30 epochs with a batch size of 512 and a learning rate of 1e-4, and chose the checkpoint with the best ranked pass@1 metric on the validation dataset. (A hedged configuration sketch follows the table.) |
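
The Dataset Splits row above describes holding out 600 APPS training tasks for validation. The snippet below is a minimal sketch of that split, assuming the community `codeparrot/apps` mirror on the Hugging Face Hub; the authors' exact 600 held-out task IDs are not reported here, so a seeded random selection stands in for them.

```python
from datasets import load_dataset

# Hypothetical reconstruction of the APPS train/validation split described above.
# Assumes the "codeparrot/apps" mirror; the specific 600 validation tasks used by
# the authors are not reported here, so a seeded random split is used instead.
apps_train = load_dataset("codeparrot/apps", split="train")  # 5000 training tasks
apps_test = load_dataset("codeparrot/apps", split="test")    # 5000 test tasks

split = apps_train.train_test_split(test_size=600, seed=0)
finetune_tasks = split["train"]    # 4400 tasks left for finetuning
validation_tasks = split["test"]   # 600 tasks for checkpoint selection

print(len(finetune_tasks), len(validation_tasks), len(apps_test))  # 4400 600 5000
```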
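
The Experiment Setup row lists the generator hyperparameters (2 epochs, batch size 256, learning rate 1e-5) and the sampling temperatures. The sketch below shows one way those settings could be expressed with the Hugging Face `transformers` API; the model checkpoint, per-device batch layout, prompt format, and sample count are assumptions, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Sketch of the generator finetuning settings quoted above (2 epochs, effective
# batch size 256, lr 1e-5). The per-device / gradient-accumulation breakdown is an
# assumption; only the effective batch size is given in the paper.
training_args = TrainingArguments(
    output_dir="gptj-apps-finetune",
    num_train_epochs=2,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=32,  # 8 x 32 = effective batch size of 256 on one GPU
    save_strategy="epoch",           # checkpoint selection by validation loss is done separately
)

# Temperature sampling at inference time (T = 0.9 for GPT-J / GPT-Neo per the table).
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b", torch_dtype=torch.float16)
prompt = "Write a Python function that ..."  # illustrative prompt; the real prompt format is task-specific
inputs = tokenizer(prompt, return_tensors="pt")
candidates = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    max_new_tokens=512,
    num_return_sequences=8,  # illustrative sample count; the paper varies n per experiment
)
```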