Coder Reviewer Reranking for Code Generation
Authors: Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-Tau Yih, Daniel Fried, Sida Wang
ICML 2023 | Conference PDF | arXiv PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. (A minimal sketch of the reranking rule appears after the table.) |
| Researcher Affiliation | Collaboration | 1Stanford University 2The University of Hong Kong 3Meta AI FAIR 4Carnegie Mellon University. |
| Pseudocode | No | The paper describes methods and processes, but it does not include any figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code. |
| Open Source Code | No | The code will be made available after the anonymity period. |
| Open Datasets | Yes | HumanEval (Chen et al., 2021) contains 164 hand-written Python programming questions... MBPP-Sanitized (Austin et al., 2021)... Plotting is a subset of DS-1000 (Lai et al., 2022)... Spider (Yu et al., 2018) is a benchmark of natural language to SQL query generation... NL2Bash (Lin et al., 2018) is a benchmark of translating natural language to bash commands. |
| Dataset Splits | No | Due to the lack of a validation split on the benchmarks we experimented with, we refrain from hyperparameter search and rely on a single set of decoding hyperparameters. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments. It only refers to the size of the language models used. |
| Software Dependencies | No | The paper mentions using Python and a canonicalization software 'pyminifier' but does not provide specific version numbers for these or any other ancillary software dependencies. |
| Experiment Setup | Yes | On all datasets, we sample with temperature 0.4 and set the max tokens to be 300. For our main results in Table 1, we sample 125 different programs for each problem and then bootstrap 50 times to report the mean accuracy of reranking 25 samples. (A sketch of this bootstrap protocol follows the table.) |
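The Coder-Reviewer score referenced in the Research Type row combines the Coder likelihood log p(program | instruction) with the Reviewer likelihood log p(instruction | program) and reranks candidates by their sum. Below is a minimal sketch of that rule, assuming both log-likelihoods have already been computed for each sample; the function and field names are illustrative, not taken from the authors' code, and the paper also studies length-normalized variants not shown here.

```python
def coder_reviewer_rerank(samples):
    """Rerank sampled programs by the Coder-Reviewer score.

    `samples` is a list of dicts with precomputed log-likelihoods
    (hypothetical field names):
      - "coder_logp":    log p(program | instruction), from the Coder prompt
      - "reviewer_logp": log p(instruction | program), from the Reviewer
        prompt, i.e. the same LM queried with the roles reversed.
    """
    def score(s):
        # Coder-Reviewer: log p(y|x) + log p(x|y).
        # Using the Coder term alone recovers the likelihood-only baseline
        # that the paper reports up to 17% absolute accuracy below this rule.
        return s["coder_logp"] + s["reviewer_logp"]

    # Highest combined log-likelihood first; the top element is the
    # program that would be submitted for execution.
    return sorted(samples, key=score, reverse=True)
```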
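The Experiment Setup row describes a bootstrap protocol: from the 125 generations per problem, draw 25 with replacement, rerank them, check whether the top-ranked program is correct, and average accuracy over 50 such resamples. A hedged sketch of that evaluation loop, with hypothetical `rerank` and `is_correct` hooks standing in for the reranker above and an execution-based unit-test harness:

```python
import random
import statistics

def bootstrap_accuracy(problems, rerank, n_boot=50, k=25, seed=0):
    """Mean top-1 accuracy over bootstrap resamples.

    `problems` is a list of (samples, is_correct) pairs, where `samples`
    holds the 125 generations for one problem and `is_correct` maps a
    sample to pass/fail (e.g. by executing it against unit tests).
    These names are illustrative; the paper does not publish this code.
    """
    rng = random.Random(seed)
    accs = []
    for _ in range(n_boot):
        correct = 0
        for samples, is_correct in problems:
            # Resample k=25 of the 125 generations, with replacement.
            subset = [rng.choice(samples) for _ in range(k)]
            # Keep only the top-ranked program from this subset.
            best = rerank(subset)[0]
            correct += is_correct(best)
        accs.append(correct / len(problems))
    # Reported metric: mean accuracy across the 50 resamples.
    return statistics.mean(accs)
```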