Is Self-Repair a Silver Bullet for Code Generation?

Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we analyze Code Llama, GPT-3.5, and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all.
Researcher Affiliation | Collaboration | MIT CSAIL; Microsoft Research
Pseudocode | Yes | Algorithm 1: Generating a repair tree T, computing T |= ψ and its token count with batched self-repair.
Open Source Code | Yes | Code and data available at github.com/theoxo/self-repair.
Open Datasets | Yes | We consider Python programming challenges from both APPS (Hendrycks et al., 2021) and HumanEval (Chen et al., 2021).
Dataset Splits | No | The paper evaluates on the test sets of the APPS and HumanEval benchmarks but does not describe custom training, validation, or test splits for model training.
Hardware Specification | No | The paper notes that Code Llama can be run "locally on consumer-level hardware" but does not give specific hardware details (CPU/GPU models, memory) for the experiments performed.
Software Dependencies | Yes | We use the frozen endpoints gpt-3.5-turbo-0301 and gpt-4-0314.
Experiment Setup | Yes | Based on preliminary experiments, we set the decoding temperature to 0.8 for all models. We use Np = 50 for all experiments, and consider np ≤ 25 for the self-repair approaches and np ≤ 50 for the baseline, no-repair approach. Similarly, for the feedback strings, we use Nf = 25 and nf ≤ 10 (except for Section 4.2, in which we only consider nf = 1 and therefore settle for Nf = 10 instead). Finally, for the repair candidates we set Nr = nr = 1, since we do joint sampling of feedback and repair in most of our experiments.
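The Pseudocode row above refers to Algorithm 1, which builds a repair tree T rooted at the specification ψ, with sampled initial programs, feedback strings, and repair candidates beneath it, and then checks whether T |= ψ (some program in the tree passes the unit tests) while tracking how many tokens were sampled. Below is a minimal Python sketch of that bookkeeping only; the class layout and helper names (RepairNode, passes_unit_tests, num_tokens) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the repair-tree bookkeeping described in the Pseudocode row.
# Node layout and helper names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RepairNode:
    """One sampled initial program, with its feedback strings and repair candidates."""
    program: str
    feedbacks: List[str] = field(default_factory=list)
    repairs: List[str] = field(default_factory=list)


@dataclass
class RepairTree:
    """Root is the task specification psi; children are the sampled initial programs."""
    specification: str
    children: List[RepairNode] = field(default_factory=list)


def satisfies(tree: RepairTree, passes_unit_tests: Callable[[str], bool]) -> bool:
    """T |= psi iff any initial program or any repair candidate passes the unit tests."""
    for node in tree.children:
        if passes_unit_tests(node.program):
            return True
        if any(passes_unit_tests(r) for r in node.repairs):
            return True
    return False


def token_count(tree: RepairTree, num_tokens: Callable[[str], int]) -> int:
    """Total tokens sampled while building the tree (programs, feedback, repairs)."""
    total = 0
    for node in tree.children:
        total += num_tokens(node.program)
        total += sum(num_tokens(f) for f in node.feedbacks)
        total += sum(num_tokens(r) for r in node.repairs)
    return total
```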
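The Experiment Setup row fixes a budget of Np = 50 trees per task but reports results at smaller budgets np ≤ 25 (self-repair) and np ≤ 50 (no-repair baseline). One way to read off a pass rate at a given np, shown in the hedged sketch below, is to bootstrap subsamples of size np from the Np precomputed trees and estimate the chance that at least one of them satisfies the specification; this estimator and its defaults are assumptions for illustration, not the authors' evaluation code.

```python
# Hedged sketch: bootstrapped pass-rate estimate at a sub-budget np,
# given precomputed outcomes for Np repair trees. Illustrative only.
import random
from typing import List


def estimate_pass_rate(tree_outcomes: List[bool], np_budget: int,
                       n_bootstrap: int = 1000, seed: int = 0) -> float:
    """Probability that at least one of `np_budget` subsampled trees passes.

    `tree_outcomes[i]` is True iff the i-th of the Np precomputed repair
    trees satisfies the specification (T |= psi).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_bootstrap):
        sample = rng.sample(tree_outcomes, np_budget)  # subsample without replacement
        hits += any(sample)
    return hits / n_bootstrap


# Example: 50 precomputed trees of which 12 succeed, evaluated at np = 10.
outcomes = [True] * 12 + [False] * 38
print(estimate_pass_rate(outcomes, np_budget=10))
```

The same subsampling idea applies to the feedback budget (Nf = 25, nf ≤ 10); comparing self-repair and the baseline at matched token counts, rather than matched sample counts, is what drives the paper's cost-aware conclusions.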