Planning with Large Language Models for Code Generation
Authors: Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, Chuang Gan
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods |
| Researcher Affiliation | Collaboration | Shun Zhang, Zhenfang Chen, Yikang Shen (MIT-IBM Watson AI Lab); Mingyu Ding (The University of Hong Kong); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL); Chuang Gan (UMass Amherst, MIT-IBM Watson AI Lab) |
| Pseudocode | Yes | We provide the pseudocode of our Planning-Guided Transformer Decoding algorithm (PG-TD) in Algorithm 1 and illustrate the whole process in Figure 2. |
| Open Source Code | No | Project page: https://codeaimcts.github.io. The paper provides a project page link, but it does not explicitly state that the source code for the methodology is available there, and the link does not point directly to a code repository. |
| Open Datasets | Yes | The APPS dataset (Hendrycks et al., 2021) is released under MIT License. The Code Contests dataset (Li et al., 2022) is released under Apache License 2.0. |
| Dataset Splits | Yes | For the APPS dataset, we split all the test cases of a program evenly into two sets, where the first set is used as the public test cases for the algorithms to optimize the pass rate, and the second set is used as the private test cases for evaluating the generated programs. ... We use the first 500 interview-level problems in the APPS test set for validation and the introductory-level problems in the APPS test set for testing. |
| Hardware Specification | Yes | Our experiments are run on machines with two Intel(R) Xeon(R) Gold 6258R CPUs (@ 2.70GHz), and one V100-SXM2 GPU. |
| Software Dependencies | No | The paper mentions 'Huggingface Transformer (Wolf et al., 2019)' and 'GPT-2 and GPT-Neo' but does not provide specific version numbers for software dependencies used in the experiments (e.g., PyTorch, TensorFlow, specific Python libraries). |
| Experiment Setup | Yes | For PG-TD, we set the maximum number of children of any node (k) to be 3, and the beam size (b) to be 1 by default. For the baseline methods, we sample at most 512 programs in Sampling + Filtering, and maintain a set of 200 partial programs in each iteration for SMCG-TD. |