Compositional Generalization by Learning Analytical Expressions

Authors: Qian Liu, Shengnan An, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, Dongmei Zhang

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the well-known benchmark SCAN demonstrate that our model seizes a great ability of compositional generalization, solving all challenges addressed by previous works with 100% accuracies.
Researcher Affiliation | Collaboration | Beihang University, Beijing, China; Xi'an Jiaotong University, Xi'an, China; Microsoft Research, Beijing, China
Pseudocode | No | The paper describes the model's processes (Composer and Solver) with textual explanations and figures, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We open-source our code at https://github.com/microsoft/ContextualSP.
Open Datasets | Yes | As one of the most important benchmarks, the SCAN dataset is proposed to evaluate the compositional generalization ability of translation models [19]. Systematicity is evaluated on Add Jump, Around Right and Length of SCAN [19], while distribution-based systematicity is assessed on MCD splits of SCAN [17]. Productivity is evaluated on the SCAN-ext dataset.
Dataset Splits | No | The paper states, "We follow previous works to split datasets for all tasks," implying standard splits are used, but it does not explicitly provide percentages or sample counts for a validation set in the main text. It only details train/test splits for specific tasks such as Add Jump.
Hardware Specification | Yes | Our model is trained on a single Tesla-P100 (16GB) and the training time for a single run is about 20-25 hours.
Software Dependencies | No | The paper mentions "Our model is implemented in PyTorch [28]" and "updated via the AdaDelta [40] optimizer," but it does not provide specific version numbers for PyTorch or other dependencies.
Experiment Setup | Yes | Dimensions of word embeddings, hidden states, key vectors and value vectors are set as 128. Hyperparameters γ and N are set as 0.5 and 10 respectively. All parameters are randomly initialized and updated via the AdaDelta [40] optimizer, with a learning rate of 0.1 for Composer and 1.0 for Solver. Meanwhile, as done in previous works [14], we introduce a regularization term to prevent our model from overfitting in the early stage of training. Its weight is set to 0.1 at the beginning, and exponentially anneals with a rate 0.5 as the lesson increases.
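To make the reported setup concrete, the following is a minimal PyTorch sketch of how these hyperparameters could be wired together. The Composer and Solver module bodies and the exact annealing formula (weight = 0.1 * 0.5**lesson) are illustrative assumptions, not the authors' code; the actual implementation is in the repository linked above.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's Composer and Solver modules; the real
# architectures are in https://github.com/microsoft/ContextualSP.
class Composer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.LSTM(dim, dim, batch_first=True)

class Solver(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.decoder = nn.LSTM(dim, dim, batch_first=True)

DIM = 128      # embeddings, hidden states, key and value vectors (reported in the paper)
GAMMA = 0.5    # hyperparameter gamma (reported)
N = 10         # hyperparameter N (reported)

composer, solver = Composer(DIM), Solver(DIM)

# AdaDelta with per-module learning rates: 0.1 for Composer, 1.0 for Solver.
optimizer = torch.optim.Adadelta([
    {"params": composer.parameters(), "lr": 0.1},
    {"params": solver.parameters(), "lr": 1.0},
])

def reg_weight(lesson, init=0.1, rate=0.5):
    """Regularization weight: starts at 0.1 and exponentially anneals with
    rate 0.5 as the lesson index increases (assumed form: init * rate**lesson)."""
    return init * rate ** lesson
```

Using a single optimizer with two parameter groups lets the Composer and Solver be updated in one step while keeping their distinct learning rates, which matches the per-module rates the paper reports.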