Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

Authors: Hao Tang, Keya Hu, Jin Peng Zhou, Sicheng Zhong, Wei-Long Zheng, Xujie Si, Kevin Ellis

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Across loop invariant synthesis, visual reasoning puzzles, and competition programming problems, we find that our new method can solve more problems using fewer language model calls.
Researcher Affiliation | Academia | Hao Tang (Cornell University, haotang@cs.cornell.edu); Keya Hu (Shanghai Jiao Tong University, hu_keya@sjtu.edu.cn); Jin Peng Zhou (Cornell University, jpzhou@cs.cornell.edu); Sicheng Zhong (University of Toronto, sicheng.zhong@mail.utoronto.ca); Wei-Long Zheng (Shanghai Jiao Tong University, weilong@sjtu.edu.cn); Xujie Si (University of Toronto; CIFAR AI Chair, Mila; six@cs.toronto.edu); Kevin Ellis (Cornell University, kellis@cornell.edu)
Pseudocode | Yes | Algorithm 1: Bandit formulation of program synthesis (see the illustrative sketch after this table).
Open Source Code | Yes | We use public benchmarks and attach the code in the supplementary material for reproduction.
Open Datasets | Yes | "APPS, one of the most challenging LLM programming problem benchmarks [13]." "visual reasoning puzzles from the Abstraction and Reasoning Corpus (ARC [7, 17])." "We collect 38 non-linear loop invariant synthesis tasks [19] from [20, 21]."
Dataset Splits | No | The paper uses existing benchmarks (APPS, ARC, and loop invariant tasks). For ARC it notes: "Each problem contains several training tasks as examples. We utilize GPT-4 Turbo to generate code that summarizes the transformation rules and refine code that fails to pass the training examples." However, it does not provide explicit train/validation/test splits, either as percentages or as sample counts, for evaluating the method.
Hardware Specification | No | The paper states that experiments use "GPT-4 (temp=1)" and also mentions GPT-3.5-Turbo, Claude-3.5-Sonnet, and Llama-3.1-405B, all of which are language models. It does not specify the CPU or GPU hardware used to run the experiments or make the API calls.
Software Dependencies | No | The paper mentions a Python implementation, the `numpy` library (indicated by `np.beta` in the pseudocode and `np.ndarray` in the prompts), and the `Z3` solver, but it does not provide version numbers for any of these software dependencies.
Experiment Setup | Yes | "We use GPT-4 (temp=1)." "For REx, large C values work well on all datasets (C = 20)." "We set the hyperparameters of each method accordingly as follows: 1. Greedy: empty value = 0, 0.5; 2. BFS: branching factor = 2, 3, 4; 3. Fixed-Width: width = 2, 4, 8; 4. REx: C = 5, 10, 15, 20, 25, 30." (These values are collected into a search grid in the second sketch after this table.)
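
The Pseudocode row refers to Algorithm 1, a bandit formulation of program synthesis in which candidate programs are the arms and the LLM repairs whichever program the bandit selects. The sketch below is a minimal illustration of that idea, assuming Thompson sampling with Beta priors (consistent with the `np.beta` call noted under Software Dependencies). The exact Beta parameterization and the helpers `llm_generate`, `llm_refine`, `check`, and `pass_rate` are assumptions made for illustration, not the authors' exact Algorithm 1.

```python
import numpy as np

def rex_synthesize(problem, llm_generate, llm_refine, check, pass_rate,
                   C=20, budget=100, seed=0):
    """Bandit-style refinement loop: each candidate program is an arm.

    A program's Beta prior is shaped by a heuristic value h in [0, 1]
    (here: the fraction of tests passed) and by how often it has already
    been refined, trading off exploiting promising programs against
    exploring neglected ones.
    """
    rng = np.random.default_rng(seed)
    root = llm_generate(problem)                 # initial candidate from the LLM
    programs = {root: {"h": pass_rate(root), "n": 0}}

    for _ in range(budget):
        def thompson_score(stats):
            alpha = 1 + C * stats["h"]                    # exploit: high heuristic value
            beta = 1 + C * (1 - stats["h"]) + stats["n"]  # explore: rarely refined programs
            return rng.beta(alpha, beta)

        parent = max(programs, key=lambda p: thompson_score(programs[p]))
        child = llm_refine(problem, parent)      # ask the LLM to repair the chosen program
        if check(child):                         # all tests pass / verifier accepts
            return child
        programs[parent]["n"] += 1               # record one more refinement of the parent
        programs.setdefault(child, {"h": pass_rate(child), "n": 0})
    return None                                  # budget exhausted without a solution
```

In this (assumed) parameterization, a larger C weights the pass-rate heuristic more heavily relative to the refinement count, pushing the loop toward exploitation, which is consistent with the quoted observation that large C values such as C = 20 work well across datasets.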
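
For reference, the hyperparameter values quoted in the Experiment Setup row can be collected into a small search grid. The sketch below is purely illustrative: the dictionary keys mirror the quoted parameter names verbatim (including "empty value"), and the iteration helper is an assumption, not code from the paper.

```python
from itertools import product

# Hyperparameter values quoted in the Experiment Setup row above.
SEARCH_GRID = {
    "Greedy":      {"empty value": [0, 0.5]},
    "BFS":         {"branching factor": [2, 3, 4]},
    "Fixed-Width": {"width": [2, 4, 8]},
    "REx":         {"C": [5, 10, 15, 20, 25, 30]},
}

def iter_configs(grid=SEARCH_GRID):
    """Yield (method, {parameter: value}) pairs for every setting in the grid."""
    for method, params in grid.items():
        names, value_lists = zip(*params.items())
        for combo in product(*value_lists):
            yield method, dict(zip(names, combo))

# Example: enumerate every configuration that would be evaluated.
for method, config in iter_configs():
    print(method, config)
```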