Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Benchmark for Systematic Generalization in Grounded Language Understanding

Authors: Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, Brenden M. Lake

NeurIPS 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type Experimental We test a strong multi-modal baseline model and a state-of-the-art compositional method, finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules. Table 1: Results for each split, showing exact match accuracy (average of 3 runs ± std. dev.).
Researcher Affiliation Collaboration Laura Ruis, University of Amsterdam (EMAIL); Jacob Andreas, Massachusetts Institute of Technology (EMAIL); Marco Baroni, ICREA & Facebook AI Research (EMAIL); Diane Bouchacourt, Facebook AI Research (EMAIL); Brenden M. Lake, New York University & Facebook AI Research (EMAIL)
Pseudocode No The paper describes the model architecture and training process in text and a diagram (Figure 3), but it does not include structured pseudocode or algorithm blocks.
Open Source Code Yes The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN). All experiment and model code is available so our results can be reproduced and built upon (https://github.com/LauraRuis/multimodal_seq2seq_gSCAN).
Open Datasets Yes We introduce grounded SCAN (gSCAN), a new benchmark that, like the original SCAN, focuses on rule-based generalization, but where meaning is grounded in states of a grid world accessible to the agent. The code to generate the benchmark and the data used in the experiments are both publicly available (https://github.com/LauraRuis/groundedSCAN).
Dataset Splits Yes The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E). The shared training set across splits has more than 300k demonstrations of instructions and their action sequences, and each test instruction evaluates just one systematic difference. For more details on the number of examples in the training and test sets of the experiments, refer to Appendix C.
Hardware Specification No The paper describes model architectures and training procedures but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies No The paper mentions using "Adam with default parameters" for optimization but does not provide specific version numbers for any software libraries, frameworks (like TensorFlow or PyTorch), or programming languages used.
Experiment Setup Yes Training optimizes cross-entropy using Adam with default parameters [23]. ... The learning rate starts at 0.001 and decays by 0.9 every 20,000 steps. We train for 200,000 steps with batch size 200. The best model was chosen based on a small development set of 2,000 examples (full details in Appendix E).
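The step-decay schedule quoted in the Experiment Setup row (initial learning rate 0.001, multiplied by 0.9 every 20,000 steps, over 200,000 total steps) can be sketched in plain Python. Since the paper does not name a framework (see the Software Dependencies row), this is a framework-agnostic sketch; the helper name `learning_rate` is our own, not from the paper:

```python
def learning_rate(step, base_lr=0.001, decay=0.9, decay_every=20_000):
    """Step-decay schedule as described in the paper: the learning rate
    starts at base_lr and is multiplied by `decay` every `decay_every`
    training steps."""
    return base_lr * decay ** (step // decay_every)

# Over the 200,000-step run, the rate passes through 10 decay boundaries:
# step 0 -> 0.001, step 20,000 -> 0.0009, step 40,000 -> 0.00081, ...
```

Note that 0.001 is also Adam's conventional default learning rate, consistent with the paper's "Adam with default parameters."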