
Is Self-Repair a Silver Bullet for Code Generation?

Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all.
Researcher Affiliation	Collaboration	¹MIT CSAIL, ²Microsoft Research
Pseudocode	Yes	Algorithm 1: Generating a repair tree T, computing T ⊨ ψ and its token count with batched self-repair.
Open Source Code	Yes	Code and data available at github.com/theoxo/self-repair.
Open Datasets	Yes	We consider Python programming challenges from both APPS (Hendrycks et al., 2021) and HumanEval (Chen et al., 2021).
Dataset Splits	No	The paper uses the test sets of the APPS and HumanEval benchmarks for evaluation, but does not describe custom training, validation, or test dataset splits for model training.
Hardware Specification	No	The paper mentions Code Llama can be run 'locally on consumer-level hardware' but does not provide specific hardware details like CPU/GPU models or memory for the experiments performed.
Software Dependencies	Yes	We use the frozen endpoints gpt-3.5-turbo-0301 and gpt-4-0314.
Experiment Setup	Yes	Based on preliminary experiments, we set the decoding temperature to 0.8 for all models. We use Np = 50 for all experiments, and consider np ≤ 25 for the self-repair approaches and np ≤ 50 for the baseline, no-repair approach. Similarly, for the feedback strings, we use Nf = 25 and nf ≤ 10 (except for Section 4.2, in which we only consider nf = 1 and therefore settle for Nf = 10 instead). Finally, for the repair candidates we set Nr = nr = 1, since we do joint sampling of feedback and repair in most of our experiments.
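
The settings in the Experiment Setup row can be collected into a small configuration object when attempting a reproduction. The following is a minimal sketch, not code from the paper's repository: the class name `RepairSetup`, its field names, and the `validate` helper are hypothetical. It also assumes that the uppercase values (Np, Nf, Nr) are the numbers of samples drawn and that the lowercase values (np, nf, nr) are upper bounds on the subsample sizes considered.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RepairSetup:
    """Hypothetical container for the sampling settings quoted in the table."""

    temperature: float = 0.8     # decoding temperature, all models
    N_p: int = 50                # programs: samples drawn (assumed meaning)
    max_n_p_repair: int = 25     # largest n_p for the self-repair approaches
    max_n_p_baseline: int = 50   # largest n_p for the no-repair baseline
    N_f: int = 25                # feedback strings: samples drawn (assumed meaning)
    max_n_f: int = 10            # largest n_f considered
    N_r: int = 1                 # repair candidates: samples drawn (assumed meaning)
    n_r: int = 1                 # repair candidates considered

    def validate(self) -> None:
        """Check that no subsample size exceeds the number of samples drawn."""
        checks = {
            "max_n_p_repair": (self.max_n_p_repair, self.N_p),
            "max_n_p_baseline": (self.max_n_p_baseline, self.N_p),
            "max_n_f": (self.max_n_f, self.N_f),
            "n_r": (self.n_r, self.N_r),
        }
        for name, (subsample, drawn) in checks.items():
            if subsample > drawn:
                raise ValueError(f"{name}={subsample} exceeds samples drawn ({drawn})")


# Default settings as quoted in the table.
DEFAULT = RepairSetup()

# Section 4.2 variant: only n_f = 1 is considered, with N_f = 10.
SECTION_4_2 = replace(DEFAULT, N_f=10, max_n_f=1)

DEFAULT.validate()
SECTION_4_2.validate()
```

Freezing the dataclass keeps the two configurations from being mutated by accident, and `dataclasses.replace` makes the Section 4.2 exception explicit as a difference from the default.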