Is Self-Repair a Silver Bullet for Code Generation?
Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all. |
| Researcher Affiliation | Collaboration | ¹MIT CSAIL, ²Microsoft Research |
| Pseudocode | Yes | Algorithm 1: Generating a repair tree T, computing T ⊨ ψ and its token count with batched self-repair. |
| Open Source Code | Yes | Code and data available at github.com/theoxo/self-repair. |
| Open Datasets | Yes | We consider Python programming challenges from both APPS (Hendrycks et al., 2021) and HumanEval (Chen et al., 2021). |
| Dataset Splits | No | The paper uses the test sets of the APPS and HumanEval benchmarks for evaluation, but does not describe custom training, validation, or test dataset splits for model training. |
| Hardware Specification | No | The paper mentions Code Llama can be run 'locally on consumer-level hardware' but does not provide specific hardware details like CPU/GPU models or memory for the experiments performed. |
| Software Dependencies | Yes | We use the frozen endpoints gpt-3.5-turbo-0301 and gpt-4-0314. |
| Experiment Setup | Yes | Based on preliminary experiments, we set the decoding temperature to 0.8 for all models. We use Np = 50 for all experiments, and consider np ≤ 25 for the self-repair approaches and np ≤ 50 for the baseline, no-repair approach. Similarly, for the feedback strings, we use Nf = 25 and nf ≤ 10 (except for Section 4.2, in which we only consider nf = 1 and therefore settle for Nf = 10 instead). Finally, for the repair candidates we set Nr = nr = 1, since we do joint sampling of feedback and repair in most of our experiments. |
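
To make the sampling budget in the Experiment Setup row concrete, the sketch below records the quoted settings and computes an upper bound on the number of program samples in one repair tree. It is an illustration only, not the authors' code: the class `RepairSetup`, its field names, and the method `max_program_samples` are hypothetical. The bound assumes that every initial program fails and is repaired, and it takes the largest self-repair values quoted in the row (np = 25, nf = 10, nr = 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairSetup:
    """Hypothetical container for the settings quoted in the table above."""

    temperature: float = 0.8  # decoding temperature used for all models
    n_programs: int = 25      # n_p: initial programs in one repair tree
    n_feedback: int = 10      # n_f: feedback strings per failing program
    n_repairs: int = 1        # n_r: repair candidates per feedback string

    def max_program_samples(self) -> int:
        # Assumption: every initial program fails, so each one receives
        # n_feedback * n_repairs repair candidates (an upper bound).
        return self.n_programs * (1 + self.n_feedback * self.n_repairs)


if __name__ == "__main__":
    setup = RepairSetup()
    print(setup.max_program_samples())
```

Setting `n_feedback=0` models the no-repair baseline, where the count is just the number of initial programs.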