A Benchmark for Systematic Generalization in Grounded Language Understanding

Authors: Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, Brenden M. Lake

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test a strong multi-modal baseline model and a state-of-the-art compositional method, finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules. Table 1: Results for each split, showing exact match accuracy (average of 3 runs ± std. dev.).
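The exact-match metric quoted above is strict equality between the full predicted action sequence and the target sequence, averaged over 3 training runs. Below is a minimal sketch of that computation; it is illustrative only, not the authors' released evaluation script, and all names are placeholders.

```python
# Minimal sketch of the exact-match metric in Table 1 (not the authors' evaluation
# code): a prediction counts as correct only if the entire predicted action
# sequence equals the target sequence. Names here are placeholders.
import statistics

def exact_match_accuracy(predicted_sequences, target_sequences):
    """Fraction of examples whose predicted action sequence matches the target exactly."""
    assert len(predicted_sequences) == len(target_sequences)
    correct = sum(pred == gold for pred, gold in zip(predicted_sequences, target_sequences))
    return correct / len(target_sequences)

def summarize_runs(per_run_accuracies):
    """Mean and standard deviation over runs, as reported (average of 3 runs ± std. dev.)."""
    return statistics.mean(per_run_accuracies), statistics.stdev(per_run_accuracies)
```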
Researcher Affiliation | Collaboration | Laura Ruis, University of Amsterdam (laura.ruis@student.uva.nl); Jacob Andreas, Massachusetts Institute of Technology (jda@mit.edu); Marco Baroni, ICREA / Facebook AI Research (mbaroni@fb.com); Diane Bouchacourt, Facebook AI Research (dianeb@fb.com); Brenden M. Lake, New York University / Facebook AI Research (brenden@nyu.edu)
Pseudocode | No | The paper describes the model architecture and training process in text and a diagram (Figure 3), but it does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN). All experiment and model code is available so our results can be reproduced and built upon (https://github.com/LauraRuis/multimodal_seq2seq_gSCAN).
Open Datasets | Yes | We introduce grounded SCAN (gSCAN), a new benchmark that, like the original SCAN, focuses on rule-based generalization, but where meaning is grounded in states of a grid world accessible to the agent. The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN).
Dataset Splits | Yes | The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E). The shared training set across splits has more than 300k demonstrations of instructions and their action sequences, and each test instruction evaluates just one systematic difference. For more details on the number of examples in the training and test sets of the experiments, refer to Appendix C.
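Model selection is done on a small development set of 2,000 examples held out from the training demonstrations. The sketch below shows one straightforward way to carve out such a dev set, assuming the training examples are already parsed into a Python list; it is not the authors' released code, and the data layout is an assumption.

```python
# Illustrative sketch only (not the authors' released code): holding out a
# 2,000-example development set for model selection from the ~300k-example
# gSCAN training split. `train_examples` is assumed to be a list of examples
# already parsed from the released data files.
import random

def split_dev(train_examples, dev_size=2000, seed=0):
    """Shuffle with a fixed seed and hold out `dev_size` examples as the dev set."""
    rng = random.Random(seed)
    examples = list(train_examples)
    rng.shuffle(examples)
    return examples[dev_size:], examples[:dev_size]  # (remaining train, dev)
```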
Hardware Specification | No | The paper describes model architectures and training procedures but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using "Adam with default parameters" for optimization but does not provide specific version numbers for any software libraries, frameworks (like TensorFlow or PyTorch), or programming languages used.
Experiment Setup | Yes | Training optimizes cross-entropy using Adam with default parameters [23]. ... The learning rate starts at 0.001 and decays by 0.9 every 20,000 steps. We train for 200,000 steps with batch size 200. The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E).
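The reported hyperparameters map directly onto a standard optimizer/scheduler configuration. The following PyTorch sketch shows one way to reproduce that schedule (cross-entropy loss, Adam with default parameters, learning rate 0.001 decayed by 0.9 every 20,000 steps, 200,000 training steps, batch size 200); `model` and `batch_iterator` are placeholders, and this is not the authors' released training loop.

```python
# A minimal PyTorch sketch of the reported optimization setup, not the authors'
# code. `model` and `batch_iterator` are placeholders; the real model decodes
# full action sequences, whereas this sketch scores a single prediction per batch.
import torch

def train(model, batch_iterator, total_steps=200_000):
    criterion = torch.nn.CrossEntropyLoss()                 # cross-entropy objective
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam with default parameters
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000, gamma=0.9)

    for step in range(total_steps):
        inputs, targets = next(batch_iterator)               # batch size 200 in the paper
        logits = model(inputs)                               # placeholder forward pass
        loss = criterion(logits, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                                     # decay lr by 0.9 every 20,000 steps
```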