NExT: Teaching Large Language Models to Reason about Code Execution

Authors: Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model, by 26.1% and 10.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters.
Researcher Affiliation | Collaboration | Yale University; Google DeepMind; University of Illinois at Urbana-Champaign.
Pseudocode | Yes | Algorithm 1: Naturalized Execution Tuning (NExT). A self-training loop sketch follows after the table.
Open Source Code | No | The paper does not include an explicit statement about releasing its source code or provide a link to a code repository for the described methodology.
Open Datasets | Yes | We use two Python program repair benchmarks, MBPP-R and HumanEvalFix-Plus (HEFix+ hereafter). MBPP-R is a new repair benchmark that we create from MBPP (Austin et al., 2021)... We further augment HumanEvalFix with the more rigorous test suites from EvalPlus (Liu et al., 2023) to obtain HEFix+.
Dataset Splits | Yes | This yields the MBPP-R dataset, with 10,047 repair tasks in the training set and 1,468 examples in the dev set.
Hardware Specification | No | The paper mentions using "PaLM 2-L (Unicorn)" and that its "finetuning API is publicly accessible on Google Cloud Vertex AI platform." However, it does not specify any particular GPU models, CPU models, or detailed hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions "Python" programs and the "sys.settrace() hook in Python" but does not specify version numbers for Python or any other key software libraries or dependencies used in the experiments. A trace-collection sketch follows after the table.
Experiment Setup | Yes | We perform temperature sampling (T = 0.8) with a sample size of 32 for training (|Sj| = 32 in Algo. 1) and PASS@k evaluation. In the first iteration in Algo. 1, we use PASS@1 estimated with these 32 samples as the filtering metric M(·) to find challenging problems whose M(·) ≤ 10% for training. We perform 10 iterations of NExT training and pick the best model using PASS@1 on the development set. A PASS@k filtering sketch follows after the table.
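
PASS@k filtering sketch. The Experiment Setup row selects "challenging" problems by estimating PASS@1 from 32 samples per problem. The sketch below assumes the standard unbiased pass@k estimator; only the 32-sample size and the 10% threshold come from the quoted setup, while the is_challenging helper and the example sample counts are illustrative.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate from n samples, c of which pass all tests."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Values taken from the quoted setup; everything else is illustrative.
    N_SAMPLES = 32    # |Sj| = 32 samples per problem
    THRESHOLD = 0.10  # keep problems whose estimated PASS@1 is at most 10%

    def is_challenging(num_passing_samples: int) -> bool:
        return pass_at_k(N_SAMPLES, num_passing_samples, k=1) <= THRESHOLD

    print(pass_at_k(32, 2, 1))   # 0.0625
    print(is_challenging(2))     # True  (kept for training)
    print(is_challenging(5))     # False (5/32 = 0.15625 > 0.10)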
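
Trace-collection sketch. The Software Dependencies row notes that execution traces are collected through Python's sys.settrace() hook. The sketch below shows one way to gather per-line state with that hook; the buggy example function and the (line number, locals) trace format are assumptions for illustration, not the paper's naturalized trace format.

    import sys

    def collect_trace(func, *args):
        """Record (line number, local variables) for each executed line of func."""
        trace = []
        code_obj = func.__code__

        def tracer(frame, event, arg):
            # Record only 'line' events that occur inside the target function.
            if event == "line" and frame.f_code is code_obj:
                trace.append((frame.f_lineno, dict(frame.f_locals)))
            return tracer

        sys.settrace(tracer)
        try:
            result = func(*args)
        finally:
            sys.settrace(None)  # always detach the hook
        return result, trace

    # Hypothetical buggy program standing in for a repair-task input.
    def buggy_sum_of_squares(nums):
        total = 0
        for n in nums:
            total += n  # bug: should be n * n
        return total

    result, trace = collect_trace(buggy_sum_of_squares, [1, 2, 3])
    for lineno, local_vars in trace:
        print(f"line {lineno}: {local_vars}")
    print("result:", result)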
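
Self-training loop sketch. Algorithm 1 (Naturalized Execution Tuning) is referenced in the Pseudocode row but not reproduced on this page. The sketch below outlines a generic iterative self-training loop consistent with the quoted setup (32 samples per task, 10 iterations, best checkpoint chosen by PASS@1 on the dev set). It is not the paper's algorithm: sample_fn, passes_tests, and finetune are hypothetical placeholders, and details such as the challenging-problem filtering and the finetuning starting point are omitted.

    # sample_fn(model, task, n, T) -> list of (rationale, fixed_program) strings
    # passes_tests(program, task)  -> bool
    # finetune(model, dataset)     -> new model

    def estimate_pass1(model, tasks, sample_fn, passes_tests,
                       n_samples=32, temperature=0.8):
        """Mean fraction of sampled fixes per task that pass the task's tests."""
        rates = []
        for task in tasks:
            samples = sample_fn(model, task, n_samples, temperature)
            rates.append(sum(passes_tests(p, task) for _, p in samples) / n_samples)
        return sum(rates) / len(rates)

    def self_training_loop(base_model, train_tasks, dev_tasks,
                           sample_fn, passes_tests, finetune,
                           n_iterations=10, n_samples=32, temperature=0.8):
        """Generic loop: sample rationales and fixes, keep test-passing ones, finetune."""
        model, best_model, best_dev = base_model, base_model, -1.0
        for _ in range(n_iterations):
            dataset = []
            for task in train_tasks:
                for rationale, program in sample_fn(model, task, n_samples, temperature):
                    if passes_tests(program, task):
                        # Keep rationale/fix pairs whose repaired program is correct.
                        dataset.append((task, rationale, program))
            model = finetune(model, dataset)
            dev_pass1 = estimate_pass1(model, dev_tasks, sample_fn, passes_tests,
                                       n_samples, temperature)
            if dev_pass1 > best_dev:
                best_model, best_dev = model, dev_pass1  # keep best dev checkpoint
        return best_model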