CodeT: Code Generation with Generated Tests

Authors: Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen

ICLR 2023

Reproducibility assessment (variable, result, and LLM response for each item):
Research Type: Experimental. We conduct comprehensive experiments on four benchmarks, HumanEval, MBPP, APPS, and CodeContests, using five different pre-trained language models with varying sizes and capabilities. Our results show that CODET can significantly improve the performance of code solution selection over previous methods, achieving remarkable and consistent gains across different models and benchmarks.
Researcher Affiliation: Industry. Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen (Microsoft Corporation) {beichen, v-fengjzhang, anhnguyen, v-dazan, zeqi.lin, jlou, wzchen}@microsoft.com
Pseudocode: No. The paper describes the CODET method in prose and references the RANSAC algorithm, but it does not include a formal pseudocode or algorithm block.
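Absent an algorithm block in the paper, a minimal sketch of the dual execution agreement idea as described (RANSAC-style grouping of sampled solutions into consensus sets by the generated test cases they pass) could look like the following. The `run_test` sandbox helper and the plain product score are illustrative assumptions; the experiment setup entry below notes that the authors additionally take the square root of the solution count.

```python
from collections import defaultdict


def run_test(solution: str, test: str, timeout: float = 0.1) -> bool:
    """Hypothetical sandboxed runner: return True if `test` passes against
    `solution` within `timeout` seconds (0.1 s in the paper's setup)."""
    raise NotImplementedError


def codet_rank(solutions, tests):
    # Record which generated test cases each sampled solution passes.
    passed = {s: frozenset(t for t in tests if run_test(s, t)) for s in solutions}

    # Solutions that pass exactly the same tests are treated as functionally
    # agreeing and grouped into one consensus set.
    groups = defaultdict(list)
    for s, key in passed.items():
        groups[key].append(s)

    # Score a consensus set by (#solutions in the set) * (#tests it passes),
    # then rank every solution by the score of its consensus set.
    def score(key):
        return len(groups[key]) * len(key)

    return sorted(solutions, key=lambda s: score(passed[s]), reverse=True)
```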
Open Source Code: Yes. Our work is publicly available at https://github.com/microsoft/CodeT.
Open Datasets: Yes. We conduct experiments on four public code generation benchmarks in the zero-shot setting. The statistics of benchmarks are shown in Table 1.
Dataset Splits: No. The paper evaluates pre-trained language models in a zero-shot setting and reports the pass@k metric, which is computed by sampling and selecting from generated solutions. Because no model is trained, it describes no training, validation, or test splits of its own and no cross-validation setup for data partitioning.
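For context, pass@k is conventionally reported with the unbiased estimator of Chen et al. (2021); a minimal sketch is below (the function name and use of NumPy are our choices, not taken from the paper).

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of which pass the
    hidden tests, evaluated at budget k. Returns E[1 - C(n-c, k) / C(n, k)]."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```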
Hardware Specification: No. The paper does not provide specific details about the hardware used for experiments, such as CPU or GPU models, memory specifications, or cloud computing instances. It only mentions running models 'with half precision'.
Software Dependencies: No. The paper mentions using 'the Hugging Face transformers library (Wolf et al., 2019)' for the INCODER and CODEGEN implementations, but it does not specify the version number of the library.
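Since only the library is named, without a version, a hedged sketch of what loading one of the evaluated models through Hugging Face transformers in half precision might look like follows; the checkpoint name, torch_dtype choice, and device placement are assumptions, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper evaluates several models,
# including CODEGEN and INCODER variants.
checkpoint = "Salesforce/codegen-2B-mono"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,  # "half precision", as stated in the paper
).to("cuda")
```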
Experiment Setup: Yes. We set the temperature to 0.8, the top-p to 0.95, the max generation length to 300, and the timeout of executing a test case to 0.1 seconds. Specifically, for the baseline pass@1, we use the greedy search setting with temperature 0. The number of sampled test cases for each problem is set to 100 for the HumanEval and MBPP benchmarks, and 50 for the APPS and CodeContests benchmarks. When scoring consensus sets in CODET, we use the square root of |S_x| to reduce the impact caused by code solutions.
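The stated decoding parameters map naturally onto a transformers generate call; the sketch below (continuing the assumed loading snippet above) also spells out the square-root consensus-set score. The prompt, the sample count, and reading 'max generation length 300' as max_new_tokens are assumptions.

```python
import math

# Prompt and number of returned samples are illustrative only.
prompt = "def incr(x):\n    "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
samples = model.generate(
    **inputs,
    do_sample=True,           # greedy decoding (temperature 0) is used only for the pass@1 baseline
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=300,
    num_return_sequences=10,  # illustrative number of samples per problem
)


# Square-root adjustment described in the setup: score a consensus set as
# sqrt(|S_x|) * |S_y|, where S_x is the set of solutions in the consensus set
# and S_y is the set of test cases they pass.
def consensus_score(num_solutions: int, num_tests_passed: int) -> float:
    return math.sqrt(num_solutions) * num_tests_passed
```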