Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Benchmark for Systematic Generalization in Grounded Language Understanding

Authors: Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, Brenden M. Lake

NeurIPS 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type Experimental We test a strong multi-modal baseline model and a state-of-the-art compositional method, finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules. Table 1: Results for each split, showing exact match accuracy (average of 3 runs ± std. dev.).
Researcher Affiliation Collaboration Laura Ruis, University of Amsterdam (EMAIL); Jacob Andreas, Massachusetts Institute of Technology (EMAIL); Marco Baroni, ICREA & Facebook AI Research (EMAIL); Diane Bouchacourt, Facebook AI Research (EMAIL); Brenden M. Lake, New York University & Facebook AI Research (EMAIL)
Pseudocode No The paper describes the model architecture and training process in text and a diagram (Figure 3), but it does not include structured pseudocode or algorithm blocks.
Open Source Code Yes The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN). All experiment and model code is available so our results can be reproduced and built upon (https://github.com/LauraRuis/multimodal_seq2seq_gSCAN).
Open Datasets Yes We introduce grounded SCAN (gSCAN), a new benchmark that, like the original SCAN, focuses on rule-based generalization, but where meaning is grounded in states of a grid world accessible to the agent. The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN).
Dataset Splits Yes The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E). The shared training set across splits has more than 300k demonstrations of instructions and their action sequences, and each test instruction evaluates just one systematic difference. For more details on the number of examples in the training and test sets of the experiments, refer to Appendix C.
Hardware Specification No The paper describes model architectures and training procedures but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies No The paper mentions using "Adam with default parameters" for optimization but does not provide specific version numbers for any software libraries, frameworks (like TensorFlow or PyTorch), or programming languages used.
Experiment Setup Yes Training optimizes cross-entropy using Adam with default parameters [23]. ... The learning rate starts at 0.001 and decays by 0.9 every 20,000 steps. We train for 200,000 steps with batch size 200. The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E).
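The step-decay schedule quoted in the Experiment Setup row (initial learning rate 0.001, multiplied by 0.9 every 20,000 steps, over 200,000 total steps) can be sketched in plain Python. Since the paper does not name a framework (see the Software Dependencies row), this is a framework-agnostic sketch; the helper name `learning_rate` is our own, not from the paper:

```python
def learning_rate(step, base_lr=0.001, decay=0.9, decay_every=20_000):
    """Step-decay schedule as described in the paper: the learning rate
    starts at base_lr and is multiplied by `decay` every `decay_every`
    training steps."""
    return base_lr * decay ** (step // decay_every)

# Over the 200,000-step run, the rate passes through 10 decay boundaries:
# step 0 -> 0.001, step 20,000 -> 0.0009, step 40,000 -> 0.00081, ...
```

Note that 0.001 is also Adam's conventional default learning rate, consistent with the paper's "Adam with default parameters."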