Coder Reviewer Reranking for Code Generation

Authors: Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-Tau Yih, Daniel Fried, Sida Wang

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an extensive study across six datasets with eight models from three model families. Experimental results show that Coder-Reviewer reranking leads to consistent and significant improvement (up to 17% absolute accuracy gain) over reranking with the Coder model only. (A minimal sketch of the Coder-Reviewer scoring rule follows the table.)
Researcher Affiliation | Collaboration | Stanford University; The University of Hong Kong; Meta AI FAIR; Carnegie Mellon University.
Pseudocode | No | The paper describes methods and processes, but it does not include any figures, blocks, or sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps formatted like code.
Open Source Code | No | The code will be made available after the anonymity period.
Open Datasets | Yes | HumanEval (Chen et al., 2021) contains 164 hand-written Python programming questions... MBPP-Sanitized (Austin et al., 2021)... Plotting is a subset of DS-1000 (Lai et al., 2022)... Spider (Yu et al., 2018) is a benchmark of natural language to SQL query generation... NL2Bash (Lin et al., 2018) is a benchmark of translating natural language to bash commands.
Dataset Splits | No | Due to the lack of a validation split on the benchmarks we experimented with, we refrain from hyperparameter search and rely on a single set of decoding hyperparameters.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments. It only refers to the size of the language models used.
Software Dependencies | No | The paper mentions using Python and the canonicalization tool 'pyminifier', but it does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup | Yes | On all datasets, we sample with temperature 0.4 and set the max tokens to 300. For our main results in Table 1, we sample 125 different programs for each problem and then bootstrap 50 times to report the mean accuracy of reranking 25 samples. (An illustrative sketch of this bootstrap protocol follows the table.)
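To make the Research Type row concrete, the sketch below illustrates what "Coder-Reviewer reranking" compares against "reranking with the Coder model only": each sampled program is scored by the sum of the Coder (forward, program-given-instruction) and Reviewer (reverse, instruction-given-program) log-likelihoods. The `Candidate` fields and the `rerank` helper are illustrative names chosen here, not the paper's released API; this is a minimal sketch assuming the two log-likelihoods have already been computed.

```python
# Minimal sketch of Coder-Reviewer reranking, assuming both log-likelihoods
# have been obtained from a language model for each sampled program.
from dataclasses import dataclass


@dataclass
class Candidate:
    program: str
    logp_code_given_prompt: float   # Coder (forward) log-likelihood, log p(y|x)
    logp_prompt_given_code: float   # Reviewer (reverse) log-likelihood, log p(x|y)


def coder_reviewer_score(c: Candidate) -> float:
    # Coder-Reviewer reranking combines the two criteria by summing the
    # log-likelihoods, i.e. log p(y|x) + log p(x|y).
    return c.logp_code_given_prompt + c.logp_prompt_given_code


def rerank(candidates: list[Candidate]) -> list[Candidate]:
    # Highest combined score first; reranking with the Coder model only
    # would instead sort by logp_code_given_prompt alone.
    return sorted(candidates, key=coder_reviewer_score, reverse=True)
```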
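The Experiment Setup row quotes the sampling and bootstrap protocol (125 programs per problem, 50 bootstrap draws, reranking 25 samples per draw, mean accuracy reported). The sketch below shows one plausible reading of that protocol under stated assumptions: it reuses the illustrative `rerank` helper above, assumes a hypothetical `passes_tests` execution check, and subsamples without replacement, none of which is specified in the quoted text.

```python
# Illustrative bootstrap evaluation sketch; not the paper's released code.
import random


def passes_tests(program: str, problem) -> bool:
    # Hypothetical execution check: True if the program passes the problem's tests.
    raise NotImplementedError


def bootstrap_accuracy(problems, samples_per_problem, n_bootstrap=50, k=25, seed=0):
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_bootstrap):
        correct = 0
        for problem in problems:
            pool = samples_per_problem[problem.id]   # e.g. 125 scored candidates
            subset = rng.sample(pool, k)             # draw 25 (without replacement, assumed)
            best = rerank(subset)[0]                 # top-1 after Coder-Reviewer reranking
            correct += passes_tests(best.program, problem)
        accuracies.append(correct / len(problems))
    return sum(accuracies) / len(accuracies)         # mean accuracy over bootstrap draws
```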