Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation
Authors: Wenqing Zheng, S P Sharan, Ajay Kumar Jaiswal, Kevin Wang, Yihan Xi, Dejia Xu, Zhangyang Wang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations show that ChainCoder outperforms state-of-the-arts, demonstrating that our progressive generation eases the reasoning procedure and guides the language model to generate higher-quality solutions. Evaluations on the competition-level datasets show that ChainCoder performs better than other state-of-the-art models even with smaller model sizes. Ablation studies verified the effectiveness of the coarse-to-fine guidance and other design choices. |
| Researcher Affiliation | Academia | VITA Group, The University of Texas at Austin, Austin, TX, US. Correspondence to: Wenqing Zheng, Zhangyang Wang <w.zheng@utexas.edu, atlaswang@utexas.edu>. |
| Pseudocode | Yes | Algorithm 1 The Encoding Step of ChainCoder |
| Open Source Code | Yes | Our codes are available at: https://github.com/VITA-Group/ChainCoder. |
| Open Datasets | Yes | we leverage the CodeParrot GitHub-Code (CodeParrot, 2022) dataset for model-pretraining on general purpose source code. For the fine-tuning and evaluation, we choose two carefully curated datasets, CodeContests (Li et al., 2022) and APPS (Hendrycks et al., 2021) |
| Dataset Splits | No | After pre-training on the GitHub code, we use two copies of the model to fine-tune on the training sets of CodeContests and APPS, and evaluate on their test sets respectively. The paper explicitly mentions training and test sets but does not specify a separate validation split or its size/percentage. |
| Hardware Specification | No | The paper mentions architectural details like 'ChainCoder contains two sample-embedder transformer blocks, two transformer encoder blocks, 224 transformer decoder blocks, with 512 hidden dimensions for these submodules, yielding 1.09 billion parameters in total.' but does not specify any particular hardware (e.g., GPU/CPU models, memory) used for running the experiments. A hypothetical configuration sketch of these reported sizes is given after the table. |
| Software Dependencies | No | We utilize the BERT model (Devlin et al., 2018) to perform natural language understanding, and distill its outputs into four tokens: the first token, the last token, the minimum pooling token, and the maximum pooling token. The output dimension for one instance is 4×E. We employ a pre-trained CodeT5-based (Wang et al., 2021) Python code explanation model to generate natural language descriptions for each code sample. The paper names software models (BERT, CodeT5) but does not provide specific version numbers for these or any other software dependencies crucial for reproducibility (e.g., Python, PyTorch, TensorFlow versions). A hedged code sketch of the four-token distillation appears after the table. |
| Experiment Setup | Yes | We diversify the learning difficulty during the fine-tuning phase by periodically changing the number of I/O data pairs (sweeping between 1 and 32, with a 15-epoch period) and the number of programs that the model predicts (sweeping between 1 and 8, with a 37-epoch period). At inference time, we set our beam-search width to 5. An illustrative sketch of this periodic schedule appears after the table. |
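
For reference, the sizes reported in the Hardware Specification row can be collected into a small configuration object. This is a hypothetical sketch (the class and field names are not from the released code); only the numeric values are quoted from the paper, which states they yield roughly 1.09 billion parameters in total.

```python
# Hypothetical configuration sketch of the reported ChainCoder sizes.
# Only the numbers are quoted from the paper; names are illustrative.
from dataclasses import dataclass

@dataclass
class ChainCoderConfig:
    num_sample_embedder_blocks: int = 2   # sample-embedder transformer blocks
    num_encoder_blocks: int = 2           # transformer encoder blocks
    num_decoder_blocks: int = 224         # transformer decoder blocks
    hidden_dim: int = 512                 # hidden dimension of these submodules

print(ChainCoderConfig())
```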
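The Software Dependencies row describes distilling BERT's hidden states into four summary tokens (first token, last token, element-wise minimum pooling, element-wise maximum pooling), giving a 4×E output per instance. The sketch below is a hedged illustration of that idea using the Hugging Face transformers API; the checkpoint name and the helper function are assumptions, not the authors' released code.

```python
# Minimal sketch: distill BERT hidden states into four summary tokens.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

def distill_four_tokens(text: str) -> torch.Tensor:
    """Return a (4, E) summary of the BERT hidden states for one instance."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]  # shape: (seq_len, E)
    first_tok = hidden[0]                  # first token
    last_tok = hidden[-1]                  # last token
    min_pool = hidden.min(dim=0).values    # element-wise minimum over positions
    max_pool = hidden.max(dim=0).values    # element-wise maximum over positions
    return torch.stack([first_tok, last_tok, min_pool, max_pool])  # (4, E)

summary = distill_four_tokens("Given an array of integers, print the maximum subarray sum.")
print(summary.shape)  # torch.Size([4, 768]) for bert-base, i.e. 4 x E
```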
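The Experiment Setup row describes a periodic curriculum over the number of I/O pairs (1 to 32, 15-epoch period) and the number of predicted programs (1 to 8, 37-epoch period). The sketch below shows one way such a schedule could look; the linear-ramp shape of the sweep is an assumption, since the paper only states the ranges and the periods.

```python
# Illustrative periodic difficulty schedule (sweep shape is assumed).
def periodic_sweep(epoch: int, low: int, high: int, period: int) -> int:
    """Linearly sweep from `low` to `high` over `period` epochs, then restart."""
    phase = (epoch % period) / max(period - 1, 1)  # fraction of the period elapsed
    return round(low + phase * (high - low))

for epoch in range(40):
    num_io_pairs = periodic_sweep(epoch, low=1, high=32, period=15)
    num_programs = periodic_sweep(epoch, low=1, high=8, period=37)
    # ...construct training batches with `num_io_pairs` I/O pairs and ask the
    # model to predict `num_programs` candidate programs for this epoch...
    print(epoch, num_io_pairs, num_programs)
```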