Coder Reviewer Reranking for Code Generation

Authors: Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-Tau Yih, Daniel Fried, Sida Wang

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. (A minimal sketch of the Coder-Reviewer scoring rule follows the table.)
Researcher Affiliation | Collaboration | Stanford University; The University of Hong Kong; Meta AI FAIR; Carnegie Mellon University.
Pseudocode | No | The paper describes methods and processes, but it does not include any figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code.
Open Source Code | No | The code will be made available after the anonymity period.
Open Datasets | Yes | HumanEval (Chen et al., 2021) contains 164 hand-written Python programming questions... MBPP-Sanitized (Austin et al., 2021)... Plotting is a subset of DS-1000 (Lai et al., 2022)... Spider (Yu et al., 2018) is a benchmark of natural language to SQL query generation... NL2Bash (Lin et al., 2018) is a benchmark of translating natural language to bash commands.
Dataset Splits | No | Due to the lack of a validation split on the benchmarks we experimented with, we refrain from hyperparameter search and rely on a single set of decoding hyperparameters.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments. It only refers to the size of the language models used.
Software Dependencies | No | The paper mentions using Python and the canonicalization tool 'pyminifier', but it does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup | Yes | On all datasets, we sample with temperature 0.4 and set the max tokens to 300. For our main results in Table 1, we sample 125 different programs for each problem and then bootstrap 50 times to report the mean accuracy of reranking 25 samples. (An illustrative sketch of this bootstrap protocol follows the table.)
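To make the Research Type row concrete, the sketch below illustrates what "Coder-Reviewer reranking" compares against "reranking with the Coder model only": each sampled program is scored by the sum of the Coder (forward, program-given-instruction) and Reviewer (reverse, instruction-given-program) log-likelihoods. The `Candidate` fields and the `rerank` helper are illustrative names chosen here, not the paper's released API; this is a minimal sketch assuming the two log-likelihoods have already been computed.

```python
# Minimal sketch of Coder-Reviewer reranking, assuming both log-likelihoods
# have been obtained from a language model for each sampled program.
from dataclasses import dataclass


@dataclass
class Candidate:
    program: str
    logp_code_given_prompt: float   # Coder (forward) log-likelihood, log p(y|x)
    logp_prompt_given_code: float   # Reviewer (reverse) log-likelihood, log p(x|y)


def coder_reviewer_score(c: Candidate) -> float:
    # Coder-Reviewer reranking combines the two criteria by summing the
    # log-likelihoods, i.e. log p(y|x) + log p(x|y).
    return c.logp_code_given_prompt + c.logp_prompt_given_code


def rerank(candidates: list[Candidate]) -> list[Candidate]:
    # Highest combined score first; reranking with the Coder model only
    # would instead sort by logp_code_given_prompt alone.
    return sorted(candidates, key=coder_reviewer_score, reverse=True)
```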
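The Experiment Setup row quotes the sampling and bootstrap protocol (125 programs per problem, 50 bootstrap draws, reranking 25 samples per draw, mean accuracy reported). The sketch below shows one plausible reading of that protocol under stated assumptions: it reuses the illustrative `rerank` helper above, assumes a hypothetical `passes_tests` execution check, and subsamples without replacement, none of which is specified in the quoted text.

```python
# Illustrative bootstrap evaluation sketch; not the paper's released code.
import random


def passes_tests(program: str, problem) -> bool:
    # Hypothetical execution check: True if the program passes the problem's tests.
    raise NotImplementedError


def bootstrap_accuracy(problems, samples_per_problem, n_bootstrap=50, k=25, seed=0):
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_bootstrap):
        correct = 0
        for problem in problems:
            pool = samples_per_problem[problem.id]   # e.g. 125 scored candidates
            subset = rng.sample(pool, k)             # draw 25 (without replacement, assumed)
            best = rerank(subset)[0]                 # top-1 after Coder-Reviewer reranking
            correct += passes_tests(best.program, problem)
        accuracies.append(correct / len(problems))
    return sum(accuracies) / len(accuracies)         # mean accuracy over bootstrap draws
```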