ANPL: Towards Natural Programming with Interactive Decomposition
Authors: Di Huang, Ziyuan Nan, Xing Hu, Pengwei Jin, Shaohui Peng, Yuanbo Wen, Rui Zhang, Zidong Du, Qi Guo, Yewen Pu, Yunji Chen
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated ANPL by conducting a large-scale user study using the Abstraction and Reasoning Corpus (ARC), a well-known corpus that consists of 400 unique tasks (in the training set) without predefined programmatic solutions. We recruited 19 Python programmers who interacted with our system for a total of 440 man-hours. Compared to prior works evaluating LLMs' code generation capabilities in interaction with real users, Xu et al. [69] (166 man-hours) and Vaithilingam et al. [61] (7.87 man-hours), ours is the most comprehensive evaluation to date to the best of our knowledge. We find that programmers interacting with ANPL perform significantly better than those interacting with the vanilla LLM (75.0% of tasks solved vs. 58.4%). |
| Researcher Affiliation | Collaboration | 1 SKL of Processors, Institute of Computing Technology, CAS; 2 Intelligent Software Research Center, Institute of Software, CAS; 3 University of Chinese Academy of Sciences; 4 Autodesk Research |
| Pseudocode | Yes | Figure 2: The pseudo-code of the ANPL compiler, consisting of the direct compiling process and the differential compiling process. (A hedged sketch of these two modes appears after this table.) |
| Open Source Code | No | The paper commits to releasing a dataset ('We will release the collected programmatic decomposition dataset, DARC'), but not the source code for the ANPL system itself. The provided URL points to a project page, not explicitly to a code repository for the system. |
| Open Datasets | Yes | We evaluated ANPL by conducting a large-scale user study using the Abstraction and Reasoning Corpus (ARC) [14], a well-known corpus that consists of 400 unique tasks (in the training set) without predefined programmatic solutions. |
| Dataset Splits | Yes | We conducted a user study on 400 ARC training tasks to evaluate the effectiveness of ANPL compared to the original ChatGPT (GPT-3.5-turbo). [...] The held-out test input-output examples provided in ARC are used to check the correctness of the generated Python program. |
| Hardware Specification | No | The paper mentions interacting with LLMs (GPT-3.5-turbo, GPT-4) via API calls and discusses waiting times for their responses, but it does not specify any particular hardware (e.g., GPU models, CPU types) used to run their ANPL system or experiments locally. |
| Software Dependencies | No | The paper states that ANPL is 'Python-like' and compatible with the 'original Python language', and mentions using 'NumPy' and 'PyTorch' in a questionnaire, but it does not provide specific version numbers for Python or any of its libraries. |
| Experiment Setup | Yes | During the initial user input and function editing, ANPL goes through a sequence of five attempts, starting with a temperature parameter of 0 and incrementing it by 0.1 with each try until it succeeds. In the resynthesis stage, ANPL requests the underlying LLM to produce 10 potential completions for each prompt. The text that ChatGPT generates is subject to a maximum token constraint of 1024. (See the retry-loop sketch after this table.) |
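The pseudocode row only quotes the caption of Figure 2, so as a rough illustration of what "direct" versus "differential" compiling could mean, here is a minimal Python sketch. Everything in it (`ANPLCompiler`, `compile_hole`, the cache-keyed-by-description scheme) is a hypothetical reconstruction for intuition, not the paper's actual pseudo-code:

```python
# Hypothetical sketch of the two compiling modes named in Figure 2's caption.
# The real ANPL compiler's pseudo-code is in Figure 2 and may differ.

def compile_hole(description: str) -> str:
    """Stand-in for an LLM call that turns a natural-language
    hole description into Python code (hypothetical helper)."""
    return f"# code generated for: {description}\npass"

class ANPLCompiler:
    def __init__(self):
        self.cache = {}  # hole description -> previously generated code

    def direct_compile(self, holes):
        """Direct compiling: generate code for every hole from scratch."""
        return {h: compile_hole(h) for h in holes}

    def differential_compile(self, holes):
        """Differential compiling: regenerate code only for holes whose
        descriptions changed since the last compile; reuse the rest."""
        out = {}
        for h in holes:
            if h not in self.cache:  # new or edited description
                self.cache[h] = compile_hole(h)
            out[h] = self.cache[h]
        return out
```

The appeal of the differential mode, on this reading, is that a user edit to one function does not force the LLM to regenerate (and possibly break) the parts that already work.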
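The experiment-setup row describes a concrete sampling schedule. Below is a minimal sketch of that schedule, assuming hypothetical `generate(prompt, temperature, n, max_tokens)` and `passes_checks` helpers in place of the actual ChatGPT API wrapper and ANPL's correctness checks; neither name comes from the paper:

```python
# Sketch of the sampling schedule quoted above; helper names are assumptions.

MAX_TOKENS = 1024  # maximum tokens per generated completion (from the paper)

def compile_with_retries(prompt, generate, passes_checks):
    """Initial user input / function editing: up to five attempts,
    with the temperature starting at 0 and rising by 0.1 per attempt."""
    for attempt in range(5):
        temperature = 0.1 * attempt  # 0.0, 0.1, 0.2, 0.3, 0.4
        code = generate(prompt, temperature=temperature, n=1,
                        max_tokens=MAX_TOKENS)[0]
        if passes_checks(code):
            return code
    return None  # all five attempts failed

def resynthesize(prompt, generate):
    """Resynthesis stage: request 10 candidate completions per prompt.
    The temperature used here is not given in the quoted setup, so 0.8
    below is only a placeholder."""
    return generate(prompt, temperature=0.8, n=10, max_tokens=MAX_TOKENS)
```

Raising the temperature only after a failed attempt keeps the first try deterministic while letting later tries explore more diverse completions.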