CodeT: Code Generation with Generated Tests

Authors: Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen

ICLR 2023

Reproducibility assessment (variable, result, and LLM response for each item):
Research Type: Experimental. We conduct comprehensive experiments on four benchmarks, HumanEval, MBPP, APPS, and CodeContests, using five different pre-trained language models with varying sizes and capabilities. Our results show that CODET can significantly improve the performance of code solution selection over previous methods, achieving remarkable and consistent gains across different models and benchmarks.
Researcher Affiliation: Industry. Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen (Microsoft Corporation) {beichen, v-fengjzhang, anhnguyen, v-dazan, zeqi.lin, jlou, wzchen}@microsoft.com
Pseudocode: No. The paper describes the CODET method in prose and references the RANSAC algorithm, but it does not include a formal pseudocode or algorithm block.
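Absent an algorithm block in the paper, a minimal sketch of the dual execution agreement idea as described (RANSAC-style grouping of sampled solutions into consensus sets by the generated test cases they pass) could look like the following. The `run_test` sandbox helper and the plain product score are illustrative assumptions; the experiment setup entry below notes that the authors additionally take the square root of the solution count.

```python
from collections import defaultdict


def run_test(solution: str, test: str, timeout: float = 0.1) -> bool:
    """Hypothetical sandboxed runner: return True if `test` passes against
    `solution` within `timeout` seconds (0.1 s in the paper's setup)."""
    raise NotImplementedError


def codet_rank(solutions, tests):
    # Record which generated test cases each sampled solution passes.
    passed = {s: frozenset(t for t in tests if run_test(s, t)) for s in solutions}

    # Solutions that pass exactly the same tests are treated as functionally
    # agreeing and grouped into one consensus set.
    groups = defaultdict(list)
    for s, key in passed.items():
        groups[key].append(s)

    # Score a consensus set by (#solutions in the set) * (#tests it passes),
    # then rank every solution by the score of its consensus set.
    def score(key):
        return len(groups[key]) * len(key)

    return sorted(solutions, key=lambda s: score(passed[s]), reverse=True)
```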
Open Source Code: Yes. Our work is publicly available at https://github.com/microsoft/CodeT.
Open Datasets: Yes. We conduct experiments on four public code generation benchmarks in the zero-shot setting. The statistics of benchmarks are shown in Table 1.
Dataset Splits: No. The paper evaluates pre-trained language models in a zero-shot setting and reports the pass@k metric, which is computed by sampling and selecting from generated solutions. Because no model is trained, it describes no training, validation, or test splits of its own and no cross-validation setup for data partitioning.
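For context, pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021); a minimal sketch is below (the function name and use of NumPy are our choices, not taken from the paper).

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of which pass the
    hidden tests, evaluated at budget k. Returns E[1 - C(n-c, k) / C(n, k)]."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```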
Hardware Specification: No. The paper does not provide specific details about the hardware used for experiments, such as CPU or GPU models, memory specifications, or cloud computing instances. It only mentions running models 'with half precision'.
Software Dependencies: No. The paper mentions using 'the Hugging Face transformers library (Wolf et al., 2019)' for the INCODER and CODEGEN implementations, but it does not specify the version number of the library.
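Since only the library is named, without a version, a hedged sketch of what loading one of the evaluated models through Hugging Face transformers in half precision might look like follows; the checkpoint name, torch_dtype choice, and device placement are assumptions, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper evaluates several models,
# including CODEGEN and INCODER variants.
checkpoint = "Salesforce/codegen-2B-mono"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,  # "half precision", as stated in the paper
).to("cuda")
```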
Experiment Setup: Yes. We set the temperature to 0.8, the top-p to 0.95, the max generation length to 300, and the timeout of executing a test case to 0.1 seconds. Specifically, for the baseline pass@1, we use the greedy search setting with temperature 0. The number of sampled test cases for each problem is set to 100 for the HumanEval and MBPP benchmarks, and 50 for the APPS and CodeContests benchmarks. When scoring consensus sets in CODET, we use the square root of |S_x| to reduce the impact caused by code solutions.
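The stated decoding parameters map naturally onto a transformers generate call; the sketch below (continuing the assumed loading snippet above) also spells out the square-root consensus-set score. The prompt, the sample count, and reading 'max generation length 300' as max_new_tokens are assumptions.

```python
import math

# Prompt and number of returned samples are illustrative only.
prompt = "def incr(x):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
samples = model.generate(
    **inputs,
    do_sample=True,           # greedy decoding (temperature 0) is used only for the pass@1 baseline
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=300,
    num_return_sequences=10,  # illustrative number of samples per problem
)


# Square-root adjustment described in the setup: score a consensus set as
# sqrt(|S_x|) * |S_y|, where S_x is the set of solutions in the consensus set
# and S_y is the set of test cases they pass.
def consensus_score(num_solutions: int, num_tests_passed: int) -> float:
    return math.sqrt(num_solutions) * num_tests_passed
```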