Fault-Aware Neural Code Rankers

Authors: Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu Lahiri, Madanlal Musuvathi, Jianfeng Gao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that CODERANKER can significantly increase the pass@1 accuracy of various code generation models (including Codex [11], GPT-Neo, GPT-J) on the APPS [25], HumanEval [11], and MBPP [3] datasets. (The pass@k metric is sketched below the table.)
Researcher Affiliation | Industry | Microsoft Research {jinala,chenwang,meiyang,andres.codas,markenc,shuvendu,madanm,jfgao}@microsoft.com
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | The code and data is released on GitHub: https://github.com/microsoft/CodeRanker.
Open Datasets | Yes | We consider three existing code generation datasets for our evaluation: (1) APPS [25]: a collection of 5000 training and 5000 test tasks collected from coding competitions and interview problems, (2) HumanEval [11]: a set of 164 test tasks, and (3) MBPP [3]: a set of 974 mostly basic Python programming tasks with 474 training problems and 500 test problems.
Dataset Splits | Yes | The APPS dataset does not come with a validation dataset, so we used a set of 600 tasks from the original training dataset for validation; these are then excluded from the training dataset. (A hold-out of this kind is sketched below the table.)
Hardware Specification | Yes | All experiments are conducted on V100-32GB GPUs.
Software Dependencies | No | No specific software versions (e.g., Python 3.x, PyTorch 1.x) or library versions were explicitly mentioned.
Experiment Setup | Yes | We finetuned the GPT-J and GPT-Neo code generation models on the APPS training dataset for 2 epochs with a batch size of 256 and a learning rate of 1e-5, and chose the checkpoint that has the lowest validation loss. For inference, we used temperature sampling with T = 0.8 for the Codex model and T = 0.9 for the GPT-J and GPT-Neo models unless specified otherwise. We finetuned the CODERANKER models for 30 epochs with a batch size of 512 and a learning rate of 1e-4, and chose the checkpoint that results in the best ranked pass@1 metric on the validation dataset. (These hyperparameters are sketched as a configuration below the table.)
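
Note on the metric: the pass@1 numbers quoted in the Research Type row follow the pass@k methodology of the Codex paper [11]. As a reference point, a minimal sketch of the standard unbiased pass@k estimator; the function name and the use of NumPy are illustrative and not taken from the paper's released code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. [11].

    n: total samples generated for a task
    c: number of samples that pass all unit tests
    k: number of samples the user is allowed to try
    """
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), computed as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 100 samples for a task, 7 of them correct -> pass@1 = c / n = 0.07
print(pass_at_k(100, 7, 1))
```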
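
The Dataset Splits row reports that 600 of the 5000 APPS training tasks were held out for validation. A minimal sketch of such a hold-out, assuming the codeparrot/apps mirror on the Hugging Face Hub; the mirror and the seed are assumptions, and the paper's exact task selection is not stated here.

```python
from datasets import load_dataset

# APPS ships with 5000 training and 5000 test tasks and no official validation split.
apps = load_dataset("codeparrot/apps")  # assumed Hub mirror of APPS [25]

# Hold out 600 training tasks for validation, as described in the paper.
# seed=0 is illustrative; the authors' actual selection is not given here.
split = apps["train"].train_test_split(test_size=600, seed=0)
train_tasks, valid_tasks = split["train"], split["test"]

print(len(train_tasks), len(valid_tasks))  # 4400 600
```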
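
The Experiment Setup row lists the main hyperparameters. A minimal sketch of how they could map onto a Hugging Face transformers fine-tuning and sampling configuration; the model checkpoint, batch-size factorization, and prompt are illustrative assumptions, not the paper's released training script.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Generator fine-tuning hyperparameters quoted above (GPT-J / GPT-Neo on APPS):
# 2 epochs, effective batch size 256, learning rate 1e-5; the checkpoint with the
# lowest validation loss is kept. The ranker uses 30 epochs, batch size 512, lr 1e-4.
gen_args = TrainingArguments(
    output_dir="gpt-neo-apps",
    num_train_epochs=2,
    learning_rate=1e-5,
    per_device_train_batch_size=8,    # illustrative per-GPU size
    gradient_accumulation_steps=32,   # 8 * 32 = 256 effective batch size
)

# Temperature sampling at inference time (T = 0.9 for GPT-J / GPT-Neo).
model_name = "EleutherAI/gpt-neo-1.3B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a Python function that ..."  # an APPS-style task description
inputs = tokenizer(prompt, return_tensors="pt")
samples = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    max_new_tokens=256,
    num_return_sequences=8,           # illustrative sample budget per task
)
```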