Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning
Authors: Binbin Xie, Jinsong Su, Yubin Ge, Xiang Li, Jianwei Cui, Junfeng Yao, Bin Wang
AAAI 2021, pp. 14121–14128
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results and in-depth analysis on several benchmark datasets demonstrate the effectiveness of our approach. |
| Researcher Affiliation | Collaboration | Binbin Xie1,2, Jinsong Su1,2 *, Yubin Ge3, Xiang Li4, Jianwei Cui4, Junfeng Yao1 and Bin Wang4 1. Xiamen University, Xiamen, China 2. Peng Cheng Laboratory, Shenzhen, China 3. University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA 4. Xiaomi AI Lab, Beijing, China |
| Pseudocode | Yes | Algorithm 1 The training procedure of our framework. |
| Open Source Code | Yes | We release our code at https://github.com/DeepLearnXMU/CGML. |
| Open Datasets | Yes | 1) DJANGO (Oda et al. 2015). This dataset consists of 18,805 lines of Python source code, with each line paired with an NL utterance. We split the dataset into training/validation/test sets containing 16,000/1,000/1,805 instances, respectively. 2) ATIS. This dataset includes NL questions about a flight database, with each question annotated with a lambda calculus query. Following previous studies (Yin and Neubig 2018, 2019; Xu et al. 2020), we use the standard splits of training/validation/test sets, which contain 4,473/491/448 instances, respectively. 3) GEO. It contains NL questions about US geography paired with corresponding Prolog database queries. We use the standard split of 600/280 training/test instances. 4) IFTTT (Quirk, Mooney, and Galley 2015). It consists of if-this-then-that programs, paired with NL utterances of their purpose. The dataset is split into 68,083 training, 5,171 validation and 3,868 test instances. |
| Dataset Splits | Yes | We split the dataset into training/validation/test sets containing 16,000/1,000/1,805 instances, respectively. Following previous studies (Yin and Neubig 2018, 2019; Xu et al. 2020), we use the standard splits of training/validation/test sets, which contain 4,473/491/448 instances, respectively. The dataset is split into 68,083 training, 5,171 validation and 3,868 test instances. Since there exists no validation set in GEO, we temporarily split the training data into two parts: 480 instances for training and 120 instances for validation, and use them to determine the optimal λ. |
| Hardware Specification | No | We use the same experimental setup as (Yin and Neubig 2017). Specifically, we use 256 hidden units and 128-dimensional word vectors for NL utterance encoding, and tune the dimension of various embeddings on validation datasets for each corpus. We initialize all parameters by uniformly sampling within the interval [-0.1, 0.1]. Besides, we set the batch size as 10 and employ dropout after each layer, where the drop rate is sequentially set to 0.5, 0.3, 0.4, and 0.3 for our four datasets, respectively. The paper does not explicitly describe the specific hardware (GPU/CPU models, memory, etc.) used to run the experiments. |
| Software Dependencies | No | TRANX is based on an attentional encoder-decoder framework, where a BiLSTM encoder is used to learn word-level semantic representations. The paper mentions software components and frameworks (TRANX, LSTM) but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We use the same experimental setup as (Yin and Neubig 2017). Specifically, we use 256 hidden units and 128-dimensional word vectors for NL utterance encoding, and tune the dimension of various embeddings on validation datasets for each corpus. We initialize all parameters by uniformly sampling within the interval [-0.1, 0.1]. Besides, we set the batch size as 10 and employ dropout after each layer, where the drop rate is sequentially set to 0.5, 0.3, 0.4, and 0.3 for our four datasets, respectively. According to the average performance of our models on four validation sets, we set λ as 0.75, 0.5, 0.25 and 0.25 for our four datasets in all experiments thereafter. |
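
The Pseudocode row above points to Algorithm 1, the paper's mutual-learning training procedure, without reproducing it. The following is a minimal sketch of what one such training step could look like, assuming a standard deep-mutual-learning formulation in PyTorch: each of the two decoders is trained on the gold output with cross-entropy plus a KL term pulling it toward the other decoder's (detached) output distribution, weighted by λ. The function and argument names are illustrative assumptions, not the authors' released implementation (see https://github.com/DeepLearnXMU/CGML for that).

```python
# Hypothetical mutual-learning step between two decoders (illustrative only).
import torch
import torch.nn.functional as F

def mutual_learning_step(decoder_a, decoder_b, batch,
                         optimizer_a, optimizer_b, lam=0.5):
    """One joint training step: cross-entropy on the gold output plus a
    KL term toward the peer decoder's distribution, weighted by lam (λ)."""
    logits_a = decoder_a(batch["input"])      # (batch, steps, vocab)
    logits_b = decoder_b(batch["input"])
    gold = batch["target"]                    # (batch, steps)

    ce_a = F.cross_entropy(logits_a.flatten(0, 1), gold.flatten())
    ce_b = F.cross_entropy(logits_b.flatten(0, 1), gold.flatten())

    # Mutual distillation: each decoder matches the other's detached distribution.
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=-1),
                    F.softmax(logits_b.detach(), dim=-1), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=-1),
                    F.softmax(logits_a.detach(), dim=-1), reduction="batchmean")

    loss_a = (1 - lam) * ce_a + lam * kl_a
    loss_b = (1 - lam) * ce_b + lam * kl_b

    optimizer_a.zero_grad(); loss_a.backward(); optimizer_a.step()
    optimizer_b.zero_grad(); loss_b.backward(); optimizer_b.step()
    return loss_a.item(), loss_b.item()
```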
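
The Experiment Setup row quotes concrete hyperparameters: 256 hidden units, 128-dimensional word vectors, uniform initialization in [-0.1, 0.1], batch size 10, and per-dataset dropout rates and λ. Below is a small sketch of collecting those values into a configuration object and applying the uniform initialization, assuming PyTorch; the `Config` class and its field names are assumptions for illustration, not part of the released code.

```python
# Hyperparameters quoted in the Experiment Setup row; the dataclass and
# field names are illustrative, not taken from the authors' repository.
from dataclasses import dataclass
import torch.nn as nn

@dataclass
class Config:
    hidden_size: int = 256   # hidden units for NL utterance encoding
    word_dim: int = 128      # word-vector dimension
    batch_size: int = 10
    dropout: float = 0.5     # 0.5 / 0.3 / 0.4 / 0.3 for DJANGO / ATIS / GEO / IFTTT
    lam: float = 0.75        # λ = 0.75 / 0.5 / 0.25 / 0.25 for the four datasets

def init_uniform(model: nn.Module, bound: float = 0.1) -> None:
    """Initialize all parameters by uniform sampling within [-bound, bound],
    as described in the setup excerpts above."""
    for p in model.parameters():
        nn.init.uniform_(p, -bound, bound)
```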
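
The Dataset Splits row also notes that GEO has no validation set, so 120 of its 600 training instances were temporarily held out to choose λ. A hedged sketch of such a hold-out split follows; the shuffling strategy and fixed seed are assumptions, since the paper does not say how the 480/120 split was drawn.

```python
# Illustrative 480/120 hold-out for λ tuning on GEO (split strategy assumed).
import random

def geo_lambda_split(train_instances, dev_size=120, seed=0):
    """Carve a temporary validation set out of the GEO training data."""
    instances = list(train_instances)
    random.Random(seed).shuffle(instances)
    return instances[dev_size:], instances[:dev_size]  # (480 train, 120 dev)
```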