Planning with Large Language Models for Code Generation

Authors: Shun Zhang, Zhenfang Chen, Yikang Shen, Mingyu Ding, Joshua B. Tenenbaum, Chuang Gan

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods
Researcher Affiliation | Collaboration | Shun Zhang, Zhenfang Chen, Yikang Shen (MIT-IBM Watson AI Lab); Mingyu Ding (The University of Hong Kong); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL); Chuang Gan (UMass Amherst, MIT-IBM Watson AI Lab)
Pseudocode | Yes | We provide the pseudocode of our Planning-Guided Transformer Decoding algorithm (PG-TD) in Algorithm 1 and illustrate the whole process in Figure 2. (A hedged sketch of the PG-TD loop follows the table.)
Open Source Code | No | Project page: https://codeaimcts.github.io. The paper provides a project page link but does not explicitly state that the source code for the methodology is available there, nor is it a direct link to a code repository.
Open Datasets | Yes | The APPS dataset (Hendrycks et al., 2021) is released under the MIT License. The Code Contests dataset (Li et al., 2022) is released under the Apache License 2.0.
Dataset Splits | Yes | For the APPS dataset, we split all the test cases of a program evenly into two sets, where the first set is used as the public test cases for the algorithms to optimize the pass rate, and the second set is used as the private test cases for evaluating the generated programs. ... We use the first 500 interview-level problems in the APPS test set for validation and the introductory-level problems in the APPS test set for testing. (A sketch of this split follows the table.)
Hardware Specification | Yes | Our experiments are run on machines with two Intel(R) Xeon(R) Gold 6258R CPUs (@ 2.70GHz) and one V100-SXM2 GPU.
Software Dependencies | No | The paper mentions 'Huggingface Transformer (Wolf et al., 2019)' and 'GPT-2 and GPT-Neo' but does not provide specific version numbers for the software dependencies used in the experiments (e.g., PyTorch, TensorFlow, or specific Python libraries).
Experiment Setup | Yes | For PG-TD, we set the maximum number of children of any node (k) to be 3, and the beam size (b) to be 1 by default. For the baseline methods, we sample at most 512 programs in Sampling + Filtering, and maintain a set of 200 partial programs in each iteration for SMCG-TD.
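
Based on the Pseudocode and Experiment Setup rows above, the following is a minimal sketch of the planning loop that PG-TD describes: a tree search over partial programs in which the language model proposes up to k next tokens at each node, beam search completes the partial program, and the pass rate on the public test cases serves as the reward. The stub functions (`top_k_next_tokens`, `beam_search_complete`, `pass_rate_on_public_tests`), the plain UCB selection score, and all constants are illustrative assumptions standing in for the paper's actual components, not the authors' implementation.

```python
import math
from typing import Dict, List

# Hypothetical stand-ins for the paper's components. A real implementation
# would query a code LLM (e.g., GPT-Neo) for next-token proposals and a
# beam-search completion, and would execute programs on the public tests.
def top_k_next_tokens(prefix: str, k: int) -> List[str]:
    return ["a", "b", "c"][:k]            # placeholder token proposals

def beam_search_complete(prefix: str, beam_size: int) -> str:
    return prefix + "<completion>"        # placeholder full program

def pass_rate_on_public_tests(program: str) -> float:
    return 0.0                            # placeholder reward in [0, 1]


class Node:
    """A node in the search tree; its state is a partial program."""
    def __init__(self, prefix: str):
        self.prefix = prefix
        self.children: Dict[str, "Node"] = {}
        self.visits = 0
        self.value = 0.0                  # best reward observed below this node


def ucb_score(parent: Node, child: Node, c: float = 1.0) -> float:
    # A simple UCB-style score; it stands in for the paper's selection rule.
    return child.value + c * math.sqrt(math.log(parent.visits + 1) / (child.visits + 1))


def pg_td_sketch(prompt: str = "", rollouts: int = 16, k: int = 3, beam_size: int = 1) -> str:
    root = Node(prompt)
    best_program, best_reward = "", -1.0

    for _ in range(rollouts):
        # 1) Selection: descend from the root to a leaf, greedily by UCB score.
        node, path = root, [root]
        while node.children:
            parent = node
            node = max(parent.children.values(), key=lambda ch: ucb_score(parent, ch))
            path.append(node)

        # 2) Expansion: the language model proposes up to k next tokens.
        for tok in top_k_next_tokens(node.prefix, k):
            node.children.setdefault(tok, Node(node.prefix + tok))

        # 3) Evaluation: complete the partial program with beam search and
        #    score it by its pass rate on the public test cases.
        program = beam_search_complete(node.prefix, beam_size)
        reward = pass_rate_on_public_tests(program)
        if reward > best_reward:
            best_program, best_reward = program, reward

        # 4) Backpropagation: update visit counts and values along the path.
        for n in path:
            n.visits += 1
            n.value = max(n.value, reward)

    return best_program
```

With the paper's defaults (k = 3, beam size b = 1), each rollout expands at most three children and evaluates a single beam-search completion against the public test cases.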
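
The Dataset Splits row states that each problem's test cases are split evenly into a public set (used to optimize the pass rate) and a private set (used for final evaluation). A minimal sketch of that split, assuming the first half becomes the public set (the ordering is not specified in the quoted text):

```python
from typing import List, Tuple

def split_test_cases(test_cases: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split a problem's test cases evenly: the public half is exposed to the
    search algorithm to optimize pass rate, while the private half is held out
    to evaluate the final generated program."""
    mid = len(test_cases) // 2
    return test_cases[:mid], test_cases[mid:]
```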