NExT: Teaching Large Language Models to Reason about Code Execution

Authors: Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model, by 26.1% and 10.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters.
Researcher Affiliation | Collaboration | Yale University; Google DeepMind; University of Illinois at Urbana-Champaign.
Pseudocode | Yes | Algorithm 1: Naturalized Execution Tuning (NExT). A self-training loop sketch follows after the table.
Open Source Code | No | The paper does not include an explicit statement about releasing its source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | We use two Python program repair benchmarks, MBPP-R and HumanEvalFix-Plus (HEFix+ hereafter). MBPP-R is a new repair benchmark that we create from MBPP (Austin et al., 2021)... We further augment HumanEvalFix with the more rigorous test suites from EvalPlus (Liu et al., 2023) to obtain HEFix+.
Dataset Splits | Yes | This yields the MBPP-R dataset, with 10,047 repair tasks in the training set and 1,468 examples in the dev set.
Hardware Specification | No | The paper mentions using "PaLM 2-L (Unicorn)" and that its "finetuning API is publicly accessible on Google Cloud Vertex AI platform." However, it does not specify any particular GPU models, CPU models, or detailed hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions "Python" programs and the "sys.settrace() hook in Python" but does not specify version numbers for Python or any other key software libraries or dependencies used in the experiments. A trace-collection sketch follows after the table.
Experiment Setup | Yes | We perform temperature sampling (T = 0.8) with a sample size of 32 for training (|Sj| = 32 in Algo. 1) and PASS@k evaluation. In the first iteration in Algo. 1, we use PASS@1 estimated with these 32 samples as the filtering metric M(·) to find challenging problems whose M(·) ≤ 10% for training. We perform 10 iterations of NExT training and pick the best model using PASS@1 on the development set. A PASS@k filtering sketch follows after the table.
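
PASS@k filtering sketch. The Experiment Setup row selects "challenging" problems by estimating PASS@1 from 32 samples per problem. The sketch below assumes the standard unbiased pass@k estimator; only the 32-sample size and the 10% threshold come from the quoted setup, while the is_challenging helper and the example sample counts are illustrative.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate from n samples, c of which pass all tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Values taken from the quoted setup; everything else is illustrative.
    N_SAMPLES = 32    # |Sj| = 32 samples per problem
    THRESHOLD = 0.10  # keep problems whose estimated PASS@1 is at most 10%

    def is_challenging(num_passing_samples: int) -> bool:
        return pass_at_k(N_SAMPLES, num_passing_samples, k=1) <= THRESHOLD

    print(pass_at_k(32, 2, 1))   # 0.0625
    print(is_challenging(2))     # True  (kept for training)
    print(is_challenging(5))     # False (5/32 = 0.15625 > 0.10)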
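
Trace-collection sketch. The Software Dependencies row notes that execution traces are collected through Python's sys.settrace() hook. The sketch below shows one way to gather per-line state with that hook; the buggy example function and the (line number, locals) trace format are assumptions for illustration, not the paper's naturalized trace format.

    import sys

    def collect_trace(func, *args):
        """Record (line number, local variables) for each executed line of func."""
        trace = []
        code_obj = func.__code__

        def tracer(frame, event, arg):
            # Record only 'line' events that occur inside the target function.
            if event == "line" and frame.f_code is code_obj:
                trace.append((frame.f_lineno, dict(frame.f_locals)))
            return tracer

        sys.settrace(tracer)
        try:
            result = func(*args)
        finally:
            sys.settrace(None)  # always detach the hook
        return result, trace

    # Hypothetical buggy program standing in for a repair-task input.
    def buggy_sum_of_squares(nums):
        total = 0
        for n in nums:
            total += n  # bug: should be n * n
        return total

    result, trace = collect_trace(buggy_sum_of_squares, [1, 2, 3])
    for lineno, local_vars in trace:
        print(f"line {lineno}: {local_vars}")
    print("result:", result)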
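
Self-training loop sketch. Algorithm 1 (Naturalized Execution Tuning) is referenced in the Pseudocode row but not reproduced on this page. The sketch below outlines a generic iterative self-training loop consistent with the quoted setup (32 samples per task, 10 iterations, best checkpoint chosen by PASS@1 on the dev set). It is not the paper's algorithm: sample_fn, passes_tests, and finetune are hypothetical placeholders, and details such as the challenging-problem filtering and the finetuning starting point are omitted.

    # sample_fn(model, task, n, T) -> list of (rationale, fixed_program) strings
    # passes_tests(program, task)  -> bool
    # finetune(model, dataset)     -> new model

    def estimate_pass1(model, tasks, sample_fn, passes_tests,
                       n_samples=32, temperature=0.8):
        """Mean fraction of sampled fixes per task that pass the task's tests."""
        rates = []
        for task in tasks:
            samples = sample_fn(model, task, n_samples, temperature)
            rates.append(sum(passes_tests(p, task) for _, p in samples) / n_samples)
        return sum(rates) / len(rates)

    def self_training_loop(base_model, train_tasks, dev_tasks,
                           sample_fn, passes_tests, finetune,
                           n_iterations=10, n_samples=32, temperature=0.8):
        """Generic loop: sample rationales and fixes, keep test-passing ones, finetune."""
        model, best_model, best_dev = base_model, base_model, -1.0
        for _ in range(n_iterations):
            dataset = []
            for task in train_tasks:
                for rationale, program in sample_fn(model, task, n_samples, temperature):
                    if passes_tests(program, task):
                        # Keep rationale/fix pairs whose repaired program is correct.
                        dataset.append((task, rationale, program))
            model = finetune(model, dataset)
            dev_pass1 = estimate_pass1(model, dev_tasks, sample_fn, passes_tests,
                                       n_samples, temperature)
            if dev_pass1 > best_dev:
                best_model, best_dev = model, dev_pass1  # keep best dev checkpoint
        return best_model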