NExT: Teaching Large Language Models to Reason about Code Execution
Authors: Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model by 26.1% and 10.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters. |
| Researcher Affiliation | Collaboration | Yale University, Google DeepMind, University of Illinois at Urbana-Champaign. |
| Pseudocode | Yes | Algorithm 1 Naturalized Execution Tuning (NExT) |
| Open Source Code | No | The paper does not include an explicit statement about releasing its source code or provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | We use two Python program repair benchmarks, MBPP-R and HUMANEVALFIX-PLUS (HEFIX+ hereafter). MBPP-R is a new repair benchmark that we create from MBPP (Austin et al., 2021)... We further augment HUMANEVALFIX with the more rigorous test suites from EvalPlus (Liu et al., 2023) to obtain HEFIX+. |
| Dataset Splits | Yes | This yields the MBPP-R dataset, with 10,047 repair tasks in the training set and 1,468 examples in the dev set. |
| Hardware Specification | No | The paper mentions using "PaLM 2-L (Unicorn)" and that its "finetuning API is publicly accessible on Google Cloud Vertex AI platform." However, it does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions "Python" programs and uses the "sys.settrace() hook in Python" but does not specify version numbers for Python or any other key software libraries or dependencies used in the experiments. A minimal tracing sketch based on this hook is shown after the table. |
| Experiment Setup | Yes | We perform temperature sampling (T = 0.8) with a sample size of 32 for training (|Sj| = 32 in Algo. 1) and PASS@k evaluation. In the first iteration in Algo. 1, we use PASS@1 estimated with these 32 samples as the filtering metric M(·) to find challenging problems whose M(·) ≤ 10% for training. We perform 10 iterations of NExT training and pick the best model using PASS@1 on the development set. A minimal sketch of this filtering step is also shown after the table. |
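
The Software Dependencies row notes that execution traces are collected through Python's `sys.settrace()` hook. The snippet below is a minimal sketch of how such line-level tracing can be done; the function names (`collect_trace`, `buggy_min`) and the exact fields recorded are illustrative assumptions, not the paper's actual instrumentation.

```python
import sys
import copy

def collect_trace(func, *args):
    """Record (line number, local variables) for each executed line of `func`.

    Minimal sys.settrace-based sketch; the NExT pipeline presumably records
    richer state and then "naturalizes" it into an NL execution rationale.
    """
    trace = []

    def tracer(frame, event, arg):
        # Only record line events inside the traced function's code object.
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, copy.deepcopy(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace

def buggy_min(xs):
    best = 0  # bug: should be initialized from xs[0]
    for x in xs:
        if x < best:
            best = x
    return best

if __name__ == "__main__":
    _, steps = collect_trace(buggy_min, [3, 1, 2])
    for lineno, local_vars in steps:
        print(lineno, local_vars)
```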
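
The Experiment Setup row states that PASS@1 estimated from 32 samples per problem serves as the filtering metric M(·), keeping problems with M(·) ≤ 10% for training. Below is a minimal sketch of that filtering step; the unbiased pass@k estimator follows Chen et al. (2021), and the data layout (a dict of per-problem correctness flags) is an assumption for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def select_hard_problems(results, threshold=0.10):
    """Keep problems whose estimated pass@1 over sampled repairs is <= threshold.

    `results` maps a problem id to per-sample correctness flags
    (32 temperature-0.8 samples per problem in the paper's setup).
    """
    hard = []
    for pid, flags in results.items():
        n, c = len(flags), sum(flags)
        if pass_at_k(n, c, k=1) <= threshold:  # pass@1 reduces to c / n
            hard.append(pid)
    return hard

# Toy usage: "p1" is solved in 2/32 samples (pass@1 ~ 6%), so it is kept for training.
example = {"p1": [True] * 2 + [False] * 30, "p2": [True] * 20 + [False] * 12}
print(select_hard_problems(example))  # -> ['p1']
```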