Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

Authors: Hao Tang, Keya Hu, Jin Peng Zhou, Sicheng Zhong, Wei-Long Zheng, Xujie Si, Kevin Ellis

NeurIPS 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Across loop invariant synthesis, visual reasoning puzzles, and competition programming problems, we find that our new method can solve more problems using fewer language model calls." |
| Researcher Affiliation | Academia | Hao Tang (Cornell University), Keya Hu (Shanghai Jiao Tong University), Jin Peng Zhou (Cornell University), Sicheng Zhong (University of Toronto), Wei-Long Zheng (Shanghai Jiao Tong University), Xujie Si (University of Toronto; CIFAR AI Chair, Mila), Kevin Ellis (Cornell University) |
| Pseudocode | Yes | "Algorithm 1: Bandit formulation of program synthesis" (a hedged sketch of such a loop follows the table) |
| Open Source Code | Yes | "We use public benchmarks and attach the code in the supplementary material for reproduction." |
| Open Datasets | Yes | "APPS, one of the most challenging LLM programming problem benchmarks [13]."; "visual reasoning puzzles from the Abstraction and Reasoning Corpus (ARC [7, 17])."; "We collect 38 non-linear loop invariant synthesis tasks [19] from [20, 21]." |
| Dataset Splits | No | The paper uses existing datasets (APPS, ARC, and loop invariant tasks). For ARC it notes that "Each problem contains several training tasks as examples. We utilize GPT-4 Turbo to generate code that summarizes the transformation rules and refine code that fails to pass the training examples." However, it does not provide explicit train/validation/test splits, with percentages or sample counts, for evaluating the method. |
| Hardware Specification | No | The paper states that experiments were run "with GPT-4 (temp=1)" and also mentions GPT-3.5-turbo, Claude-3.5-Sonnet, and Llama-3.1-405B, all of which are language models. It does not specify the CPU or GPU hardware used to run the experiments or issue the API calls. |
| Software Dependencies | No | The paper mentions a Python implementation, the `numpy` library (indicated by `np.beta` in the pseudocode and `np.ndarray` in the prompts), and the Z3 solver, but it gives no version numbers for any of these dependencies (a sketch of recording such versions follows the table). |
| Experiment Setup | Yes | "We use GPT-4 (temp=1)."; "For REx, large C values work well on all datasets (C = 20)."; "We set the hyperparameters of each method accordingly as follows: 1. Greedy: empty value = 0, 0.5; 2. BFS: branching factor = 2, 3, 4; 3. Fixed-Width: width = 2, 4, 8; 4. REx: C = 5, 10, 15, 20, 25, 30." (The grids are restated as a config sketch after the table.) |
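
The Pseudocode row cites "Algorithm 1: Bandit formulation of program synthesis", and the Software Dependencies row notes `np.beta` in that pseudocode. Below is a minimal Python sketch of what such a Thompson-sampling refinement loop could look like. The helper names `run_tests` and `llm_refine`, the exact Beta-posterior parameters, and the stopping rule are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def rex_refine(initial_program, run_tests, llm_refine, C=20, budget=100):
    """Bandit-style refinement loop in the spirit of Algorithm 1.

    Each candidate program is an arm; its heuristic value h is the
    fraction of tests it passes (in [0, 1]). Thompson sampling from a
    Beta posterior trades off exploiting high-scoring programs against
    exploring less-refined ones. Programs are assumed hashable (e.g.,
    source strings); `run_tests` and `llm_refine` are assumed helpers.
    """
    pool = [initial_program]
    h = {initial_program: run_tests(initial_program)}
    n_refined = {initial_program: 0}  # times each arm has been pulled

    for _ in range(budget):  # one LLM call per iteration
        # Thompson sampling: draw one Beta sample per candidate, pick
        # the max. Repeated unsuccessful refinements add pseudo-counts
        # to the failure side, gradually discouraging that arm.
        def score(p):
            return np.random.beta(1 + C * h[p],
                                  1 + C * (1 - h[p]) + n_refined[p])
        parent = max(pool, key=score)

        child = llm_refine(parent)     # ask the LLM to repair the program
        n_refined[parent] += 1
        if child not in h:             # dedupe identical refinements
            h[child] = run_tests(child)
            n_refined[child] = 0
            pool.append(child)
        if h[child] == 1.0:            # all tests pass: solved
            return child
    return max(pool, key=h.get)        # best effort within budget

```

Under this prior, a large C (the Experiment Setup row reports C = 20 working well across datasets) concentrates the posterior near each program's heuristic value, so selection stays exploitative until an arm has accumulated many failed refinements.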
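Because the Software Dependencies row flags missing version numbers, one low-effort way a reproduction could pin them is to record the installed versions at run time. The snippet below only assumes the `numpy` and `z3-solver` packages the row already names; the paper itself reports no versions.

```python
# Record the dependency versions the paper omits; this prints whatever
# the local environment has installed, not versions from the paper.
import numpy
import z3  # provided by the `z3-solver` pip package

print("numpy", numpy.__version__)
print("z3", z3.get_version_string())
```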
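For reference, the hyperparameter grids quoted in the Experiment Setup row can be written down as a plain config. The values are exactly those quoted; the dictionary key names are illustrative spellings, not identifiers from the paper's code.

```python
# Hyperparameter grids quoted in the Experiment Setup row.
GRIDS = {
    "Greedy":      {"empty_value": [0, 0.5]},
    "BFS":         {"branching_factor": [2, 3, 4]},
    "Fixed-Width": {"width": [2, 4, 8]},
    "REx":         {"C": [5, 10, 15, 20, 25, 30]},
}
```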