Code Repair with LLMs gives an Exploration-Exploitation Tradeoff
Authors: Hao Tang, Keya Hu, Jin Peng Zhou, Sicheng Zhong, Wei-Long Zheng, Xujie Si, Kevin Ellis
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across loop invariant synthesis, visual reasoning puzzles, and competition programming problems, we find that our new method can solve more problems using fewer language model calls. |
| Researcher Affiliation | Academia | Hao Tang (Cornell University, haotang@cs.cornell.edu); Keya Hu (Shanghai Jiao Tong University, hu_keya@sjtu.edu.cn); Jin Peng Zhou (Cornell University, jpzhou@cs.cornell.edu); Sicheng Zhong (University of Toronto, sicheng.zhong@mail.utoronto.ca); Wei-Long Zheng (Shanghai Jiao Tong University, weilong@sjtu.edu.cn); Xujie Si (University of Toronto; CIFAR AI Chair, Mila; six@cs.toronto.edu); Kevin Ellis (Cornell University, kellis@cornell.edu) |
| Pseudocode | Yes | Algorithm 1 Bandit formulation of program synthesis |
| Open Source Code | Yes | We use public benchmarks and attach the code in the supplementary material for reproduction. |
| Open Datasets | Yes | "APPS, one of the most challenging LLM programming problem benchmarks [13]."; "visual reasoning puzzles from the Abstraction and Reasoning Corpus (ARC [7, 17])."; "We collect 38 non-linear loop invariant synthesis tasks [19] from [20, 21]." |
| Dataset Splits | No | The paper describes using existing datasets like APPS, ARC, and Loop Invariants. For ARC, it mentions 'Each problem contains several training tasks as examples. We utilize GPT-4 Turbo to generate code that summarizes the transformation rules and refine code that fails to pass the training examples.' However, it does not provide explicit train/validation/test dataset splits with percentages or sample counts for the evaluation of their method. |
| Hardware Specification | No | The paper states that experiments were conducted 'with GPT-4 (temp=1)' and also mentions 'GPT-3.5-turbo, Claude-3.5-Sonnet, and Llama-3.1-405B', which are language models. However, it does not specify any particular hardware like CPU or GPU models used for running their experiments or making API calls. |
| Software Dependencies | No | The paper mentions using Python for implementation, the `numpy` library (indicated by `np.beta` in pseudocode and `np.ndarray` in prompts), and the `Z3` solver. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | "We use GPT-4 (temp=1)."; "For REx, large C values work well on all datasets (C = 20)."; "We set the hyperparameters of each method accordingly as follows: 1. Greedy: empty value = 0, 0.5; 2. BFS: branching factor = 2, 3, 4; 3. Fixed-Width: width = 2, 4, 8; 4. REx: C = 5, 10, 15, 20, 25, 30." |
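The pseudocode noted above ("Algorithm 1 Bandit formulation of program synthesis") casts iterative code repair as a multi-armed bandit: each candidate program is an arm, and the next program to refine is chosen by Thompson sampling from a Beta distribution shaped by a heuristic score (e.g. fraction of tests passed) and the exploitation constant C. Below is a minimal, hedged sketch of that loop; `refine` and `heuristic` are caller-supplied stand-ins for the paper's LLM repair call and test-based scoring, and are not part of the released artifact.

```python
import numpy as np


def rex(initial_program, refine, heuristic, C=20, budget=100):
    """Sketch of the bandit formulation of program repair (REx-style).

    heuristic(p) -> score in [0, 1], e.g. fraction of tests passed.
    refine(p)    -> a repaired child program (stands in for an LLM call).
    C            -> exploitation constant; larger C focuses refinement
                    on programs that already score well.
    """
    pool = [initial_program]
    for _ in range(budget):
        # Thompson sampling: draw Beta(1 + C*h, 1 + C*(1-h)) per candidate
        # and refine the candidate with the highest draw.
        draws = [
            np.random.beta(1 + C * heuristic(p), 1 + C * (1 - heuristic(p)))
            for p in pool
        ]
        parent = pool[int(np.argmax(draws))]
        child = refine(parent)
        if heuristic(child) == 1.0:  # all tests pass: stop early
            return child
        pool.append(child)
    # Budget exhausted: return the best candidate found so far.
    return max(pool, key=heuristic)
```

As a toy illustration, programs can be modeled as their own heuristic scores, with refinement nudging a score upward; the loop then concentrates calls on the highest-scoring candidates, which is the exploration-exploitation tradeoff the paper studies.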