Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Code Repair with LLMs gives an Exploration-Exploitation Tradeoff
Authors: Hao Tang, Keya Hu, Jin Peng Zhou, Sicheng Zhong, Wei-Long Zheng, Xujie Si, Kevin Ellis
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across loop invariant synthesis, visual reasoning puzzles, and competition programming problems, we find that our new method can solve more problems using fewer language model calls. |
| Researcher Affiliation | Academia | Hao Tang (Cornell University); Keya Hu (Shanghai Jiao Tong University); Jin Peng Zhou (Cornell University); Sicheng Zhong (University of Toronto); Wei-Long Zheng (Shanghai Jiao Tong University); Xujie Si (University of Toronto; CIFAR AI Chair, Mila); Kevin Ellis (Cornell University) |
| Pseudocode | Yes | Algorithm 1 Bandit formulation of program synthesis |
| Open Source Code | Yes | We use public benchmarks and attach the code in the supplementary material for reproduction. |
| Open Datasets | Yes | "APPS, one of the most challenging LLM programming problem benchmarks [13]." "visual reasoning puzzles from the Abstraction and Reasoning Corpus (ARC [7, 17])." "We collect 38 non-linear loop invariant synthesis tasks [19] from [20, 21]." |
| Dataset Splits | No | The paper describes using existing datasets like APPS, ARC, and Loop Invariants. For ARC, it mentions 'Each problem contains several training tasks as examples. We utilize GPT-4 Turbo to generate code that summarizes the transformation rules and refine code that fails to pass the training examples.' However, it does not provide explicit train/validation/test dataset splits with percentages or sample counts for the evaluation of their method. |
| Hardware Specification | No | The paper states that experiments were conducted 'with GPT-4 (temp=1)' and also mentions 'GPT-3.5-turbo, Claude-3.5-Sonnet, and Llama-3.1-405B', which are language models. However, it does not specify any particular hardware like CPU or GPU models used for running their experiments or making API calls. |
| Software Dependencies | No | The paper mentions using Python for implementation, the `numpy` library (indicated by `np.beta` in pseudocode and `np.ndarray` in prompts), and the `Z3` solver. However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | "We use GPT-4 (temp=1)." "For REx, large C values work well on all datasets (C = 20)." "We set the hyperparameters of each method accordingly as follows: (1) Greedy: empty value = 0, 0.5; (2) BFS: branching factor = 2, 3, 4; (3) Fixed-Width: width = 2, 4, 8; (4) REx: C = 5, 10, 15, 20, 25, 30." |
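The table references Algorithm 1 ("Bandit formulation of program synthesis") and the use of `np.beta` in the paper's pseudocode. Below is a minimal, hedged sketch of what such a select-and-refine bandit loop could look like, assuming a Thompson-style rule that samples each candidate's value from a Beta distribution shaped by a heuristic score `h` in [0, 1] (e.g. the fraction of tests passed) and an exploration constant `C`. The function names `rex_select`, `rex_loop`, and the `refine`/`heuristic` callables are illustrative, not the paper's actual API.

```python
import numpy as np


def rex_select(programs, heuristic, C=20, rng=None):
    """Pick the program whose sampled Beta value is largest.

    Assumption (hypothetical sketch): each candidate's value is drawn
    from Beta(1 + C*h, 1 + C*(1 - h)), so high-scoring programs are
    exploited while low-scoring ones are still occasionally explored.
    """
    rng = rng or np.random.default_rng()
    samples = [
        rng.beta(1 + C * heuristic(p), 1 + C * (1 - heuristic(p)))
        for p in programs
    ]
    return programs[int(np.argmax(samples))]


def rex_loop(initial_program, refine, heuristic, budget=50, C=20, rng=None):
    """Repeatedly select a candidate and refine it (one LLM repair call
    per iteration in the paper's setting) until the budget is exhausted
    or a candidate passes every test (heuristic == 1.0)."""
    pool = [initial_program]
    for _ in range(budget):
        parent = rex_select(pool, heuristic, C, rng)
        child = refine(parent)
        if heuristic(child) >= 1.0:
            return child
        pool.append(child)
    return max(pool, key=heuristic)
```

A larger `C` concentrates the Beta distributions around each program's heuristic score (more exploitation); `C = 0` reduces selection to uniform sampling from Beta(1, 1) (pure exploration), which matches the table's note that the C hyperparameter controls the tradeoff swept in the experiments.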