A Benchmark for Systematic Generalization in Grounded Language Understanding
Authors: Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, Brenden M. Lake
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test a strong multi-modal baseline model and a state-of-the-art compositional method, finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules. Table 1: Results for each split, showing exact match accuracy (average of 3 runs ± std. dev.). A sketch of this metric appears after the table. |
| Researcher Affiliation | Collaboration | Laura Ruis (University of Amsterdam, laura.ruis@student.uva.nl); Jacob Andreas (Massachusetts Institute of Technology, jda@mit.edu); Marco Baroni (ICREA / Facebook AI Research, mbaroni@fb.com); Diane Bouchacourt (Facebook AI Research, dianeb@fb.com); Brenden M. Lake (New York University / Facebook AI Research, brenden@nyu.edu) |
| Pseudocode | No | The paper describes the model architecture and training process in text and a diagram (Figure 3), but it does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN). All experiment and model code is available so our results can be reproduced and built upon (https://github.com/LauraRuis/multimodal_seq2seq_gSCAN). |
| Open Datasets | Yes | We introduce grounded SCAN (gSCAN), a new benchmark that, like the original SCAN, focuses on rule-based generalization, but where meaning is grounded in states of a grid world accessible to the agent. The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN). |
| Dataset Splits | Yes | The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E). The shared training set across splits has more than 300k demonstrations of instructions and their action sequences, and each test instruction evaluates just one systematic difference. For more details on the number of examples in the training and test sets of the experiments, refer to Appendix C. A sketch of dev-set model selection appears after the table. |
| Hardware Specification | No | The paper describes model architectures and training procedures but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Adam with default parameters" for optimization but does not provide specific version numbers for any software libraries, frameworks (like TensorFlow or PyTorch), or programming languages used. |
| Experiment Setup | Yes | Training optimizes cross-entropy using Adam with default parameters [23]. ... The learning rate starts at 0.001 and decays by 0.9 every 20,000 steps. We train for 200,000 steps with batch size 200. The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E). A training-loop sketch based on these hyperparameters follows the table. |
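
The results quoted in the Research Type row are reported as exact match accuracy, averaged over 3 runs with a standard deviation. Below is a minimal sketch of how such a metric is typically computed; the paper's actual evaluation code lives in the multimodal_seq2seq_gSCAN repository, and `per_run_outputs` is a hypothetical stand-in for each run's (predictions, references) pairs.

```python
import statistics

def exact_match_accuracy(predictions, references):
    """Fraction of predicted action sequences that match their reference exactly."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Aggregate over independent training runs (the paper reports 3 runs).
# per_run_outputs is a hypothetical list of (predictions, references) pairs.
run_accuracies = [exact_match_accuracy(p, r) for p, r in per_run_outputs]
mean = statistics.mean(run_accuracies)
std = statistics.stdev(run_accuracies)
print(f"exact match: {mean:.3f} ± {std:.3f}")
```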
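
The Dataset Splits row notes that the best model was selected on a 2,000-example development set. A minimal sketch of checkpoint selection by dev accuracy follows; `load_split`, `evaluate`, and `saved_checkpoints` are hypothetical stand-ins for the real loading and evaluation code in the benchmark repositories.

```python
# Hypothetical helpers: load_split and evaluate stand in for the
# benchmark's own data-loading and exact-match evaluation code.
dev_set = load_split("dev")  # 2,000 examples, per the paper

best_acc, best_checkpoint = 0.0, None
for checkpoint in saved_checkpoints:  # checkpoints written during training
    acc = evaluate(checkpoint, dev_set)
    if acc > best_acc:
        best_acc, best_checkpoint = acc, checkpoint
```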
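
Finally, the Experiment Setup row pins down the optimizer and schedule: cross-entropy loss, Adam with default parameters, an initial learning rate of 0.001 decayed by 0.9 every 20,000 steps, 200,000 training steps, and batch size 200. A minimal sketch of that configuration, assuming PyTorch (the paper names no framework, per the Software Dependencies row) and hypothetical `model` and `next_batch` placeholders:

```python
import torch
from torch import nn, optim

model = nn.Linear(16, 16)  # hypothetical placeholder for the seq2seq baseline
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam, default parameters
criterion = nn.CrossEntropyLoss()  # cross-entropy training objective

# lr(step) = 0.001 * 0.9 ** (step // 20_000): decay by 0.9 every 20,000 steps.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.9)

for step in range(200_000):  # 200,000 training steps
    inputs, targets = next_batch(batch_size=200)  # hypothetical data loader
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The `StepLR` scheduler reproduces the stated decay rule exactly; whether the authors used this scheduler or a manual decay is not specified in the paper.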