ANPL: Towards Natural Programming with Interactive Decomposition
Authors: Di Huang, Ziyuan Nan, Xing Hu, Pengwei Jin, Shaohui Peng, Yuanbo Wen, Rui Zhang, Zidong Du, Qi Guo, Yewen Pu, Yunji Chen
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated ANPL by conducting a large-scale user study using the Abstraction and Reasoning Corpus (ARC), a well-known corpus that consists of 400 unique tasks (in the training set) without predefined programmatic solutions. We recruited 19 Python programmers who interacted with our system for a total of 440 man-hours. Compared to prior works evaluating LLMs' code generation capabilities in interaction with real users, Xu et al. [69] (166 man-hours) and Vaithilingam et al. [61] (7.87 man-hours), ours is the most comprehensive evaluation to date to the best of our knowledge. We find that programmers interacting with ANPL perform significantly better than those interacting with the vanilla LLM (75.0% of tasks solved vs. 58.4%). |
| Researcher Affiliation | Collaboration | 1 SKL of Processors, Institute of Computing Technology, CAS; 2 Intelligent Software Research Center, Institute of Software, CAS; 3 University of Chinese Academy of Sciences; 4 Autodesk Research |
| Pseudocode | Yes | Figure 2: The pseudo-code of the ANPL compiler, consisting of the direct compiling process and the differential compiling process. (A hedged sketch of these two modes appears after this table.) |
| Open Source Code | No | The paper commits to releasing a dataset ('We will release the collected programmatic decomposition dataset, DARC'), but not the source code for the ANPL system itself. The provided URL points to a project page, not explicitly to a code repository for the system. |
| Open Datasets | Yes | We evaluated ANPL by conducting a large-scale user study using the Abstraction and Reasoning Corpus (ARC) [14], a well-known corpus that consists of 400 unique tasks (in the training set) without predefined programmatic solutions. |
| Dataset Splits | Yes | We conducted a user study on 400 ARC training tasks to evaluate the effectiveness of ANPL compared to the original ChatGPT (GPT-3.5-turbo). [...] The held-out test input-output examples provided in ARC are used to check the correctness of the generated Python program. |
| Hardware Specification | No | The paper mentions interacting with LLMs (GPT-3.5-turbo, GPT-4) via API calls and discusses waiting times for their responses, but it does not specify any particular hardware (e.g., GPU models, CPU types) used to run their ANPL system or experiments locally. |
| Software Dependencies | No | The paper states that ANPL is 'Python-like' and compatible with the 'original Python language', and mentions using 'NumPy' and 'PyTorch' in a questionnaire, but it does not provide specific version numbers for Python or any of its libraries. |
| Experiment Setup | Yes | During the initial user input and function editing, ANPL goes through a sequence of five attempts, starting with a temperature parameter of 0 and incrementing it by 0.1 with each try until it succeeds. In the resynthesis stage, ANPL requests the underlying LLM to produce 10 potential completions for each prompt. The text that ChatGPT generates is subject to a maximum token constraint of 1024. (See the retry-loop sketch after this table.) |
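The pseudocode row only quotes the caption of Figure 2, so as a rough illustration of what "direct" versus "differential" compiling could mean, here is a minimal Python sketch. Everything in it (`ANPLCompiler`, `compile_hole`, the cache-keyed-by-description scheme) is a hypothetical reconstruction for intuition, not the paper's actual pseudo-code:

```python
# Hypothetical sketch of the two compiling modes named in Figure 2's caption.
# The real ANPL compiler's pseudo-code is in Figure 2 and may differ.

def compile_hole(description: str) -> str:
    """Stand-in for an LLM call that turns a natural-language
    hole description into Python code (hypothetical helper)."""
    return f"# code generated for: {description}\npass"

class ANPLCompiler:
    def __init__(self):
        self.cache = {}  # hole description -> previously generated code

    def direct_compile(self, holes):
        """Direct compiling: generate code for every hole from scratch."""
        return {h: compile_hole(h) for h in holes}

    def differential_compile(self, holes):
        """Differential compiling: regenerate code only for holes whose
        descriptions changed since the last compile; reuse the rest."""
        out = {}
        for h in holes:
            if h not in self.cache:  # new or edited description
                self.cache[h] = compile_hole(h)
            out[h] = self.cache[h]
        return out
```

The appeal of the differential mode, on this reading, is that a user edit to one function does not force the LLM to regenerate (and possibly break) the parts that already work.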
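The experiment-setup row describes a concrete sampling schedule. Below is a minimal sketch of that schedule, assuming hypothetical `generate(prompt, temperature, n, max_tokens)` and `passes_checks` helpers in place of the actual ChatGPT API wrapper and ANPL's correctness checks; neither name comes from the paper:

```python
# Sketch of the sampling schedule quoted above; helper names are assumptions.

MAX_TOKENS = 1024  # maximum tokens per generated completion (from the paper)

def compile_with_retries(prompt, generate, passes_checks):
    """Initial user input / function editing: up to five attempts,
    with the temperature starting at 0 and rising by 0.1 per attempt."""
    for attempt in range(5):
        temperature = 0.1 * attempt  # 0.0, 0.1, 0.2, 0.3, 0.4
        code = generate(prompt, temperature=temperature, n=1,
                        max_tokens=MAX_TOKENS)[0]
        if passes_checks(code):
            return code
    return None  # all five attempts failed

def resynthesize(prompt, generate):
    """Resynthesis stage: request 10 candidate completions per prompt.
    The temperature used here is not given in the quoted setup, so 0.8
    below is only a placeholder."""
    return generate(prompt, temperature=0.8, n=10, max_tokens=MAX_TOKENS)
```

Raising the temperature only after a failed attempt keeps the first try deterministic while letting later tries explore more diverse completions.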