Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Compositional Generalization by Learning Analytical Expressions
Authors: Qian Liu, Shengnan An, Jian-Guang Lou, Bei Chen, Zeqi Lin, Yan Gao, Bin Zhou, Nanning Zheng, Dongmei Zhang
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the well-known benchmark SCAN demonstrate that our model seizes a great ability of compositional generalization, solving all challenges addressed by previous works with 100% accuracies. |
| Researcher Affiliation | Collaboration | Beihang University, Beijing, China; Xi'an Jiaotong University, Xi'an, China; Microsoft Research, Beijing, China |
| Pseudocode | No | The paper describes the model’s processes (Composer and Solver) with textual explanations and figures, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our code at https://github.com/microsoft/ContextualSP. |
| Open Datasets | Yes | As one of the most important benchmarks, the SCAN dataset is proposed to evaluate the compositional generalization ability of translation models [19]. Systematicity is evaluated on Add Jump, Around Right and Length of SCAN [19], while distribution-based systematicity is assessed on MCD splits of SCAN [17]. Productivity is evaluated on the SCAN-ext dataset. |
| Dataset Splits | No | The paper states, “We follow previous works to split datasets for all tasks,” implying standard splits are used, but it does not explicitly provide percentages or sample counts for a validation set in the main text. It only details train/test splits for specific tasks like Add Jump. |
| Hardware Specification | Yes | Our model is trained on a single Tesla-P100 (16GB) and the training time for a single run is about 20-25 hours. |
| Software Dependencies | No | The paper mentions “Our model is implemented in PyTorch [28]” and “updated via the AdaDelta [40] optimizer,” but it does not provide specific version numbers for PyTorch or AdaDelta. |
| Experiment Setup | Yes | Dimensions of word embeddings, hidden states, key vectors and value vectors are set as 128. Hyperparameters γ and N are set as 0.5 and 10 respectively. All parameters are randomly initialized and updated via the AdaDelta [40] optimizer, with a learning rate of 0.1 for Composer and 1.0 for Solver. Meanwhile, as done in previous works [14], we introduce a regularization term to prevent our model from overfitting in the early stage of training. Its weight is set to 0.1 at the beginning, and exponentially anneals with a rate 0.5 as the lesson increases. |
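
The values quoted in the Experiment Setup row can be collected into a small configuration object, shown here as a minimal sketch rather than the authors' implementation. The names `ExperimentSetup` and `regularization_weight` are hypothetical. The annealing formula is an assumption: the quoted text only says the weight starts at 0.1 and "exponentially anneals with a rate 0.5 as the lesson increases", so the sketch assumes lessons are numbered from 1 and the weight is halved once per lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters as quoted in the Experiment Setup row (field names are hypothetical)."""
    hidden_dim: int = 128         # word embeddings, hidden states, key and value vectors
    gamma: float = 0.5            # hyperparameter γ
    n: int = 10                   # hyperparameter N
    composer_lr: float = 0.1      # AdaDelta learning rate for Composer
    solver_lr: float = 1.0        # AdaDelta learning rate for Solver
    reg_weight_init: float = 0.1  # regularization weight at the beginning
    reg_anneal_rate: float = 0.5  # exponential annealing rate


def regularization_weight(lesson: int, setup: ExperimentSetup = ExperimentSetup()) -> float:
    """Return the regularization weight for a given lesson.

    ASSUMPTION: lessons start at 1 and the weight is multiplied by the
    annealing rate once per lesson; the source does not spell this out.
    """
    if lesson < 1:
        raise ValueError("lesson must be >= 1")
    return setup.reg_weight_init * setup.reg_anneal_rate ** (lesson - 1)


if __name__ == "__main__":
    # Print the assumed schedule for the first few lessons.
    for lesson in range(1, 5):
        print(lesson, regularization_weight(lesson))
```

Under this assumed schedule the regularization term fades quickly, which is consistent with the stated goal of limiting overfitting only in the early stage of training. The actual schedule should be checked against the released code.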