SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark

Authors: Victor Zhong, Austin W. Hanjie, Sida Wang, Karthik Narasimhan, Luke Zettlemoyer

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG)... In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures." (A sketch of the egocentric-crop idea appears after this table.)
Researcher Affiliation | Collaboration | Victor Zhong (1,3), Austin W. Hanjie (2), Sida I. Wang (3), Karthik Narasimhan (2), and Luke Zettlemoyer (1,3); (1) Department of Computer Science, University of Washington; (2) Department of Computer Science, Princeton University; (3) Facebook AI Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 3 is a diagram of the model architecture, and the equations describe computations but are not pseudocode.
Open Source Code | Yes | The code for SILG is available at https://github.com/vzhong/silg.
Open Datasets | Yes | "SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown)."
Dataset Splits | Yes | "For each environment (separately), we train on the training split, do early stopping on the validation split, and evaluate on the test split. NetHack does not distinguish between train and evaluation, hence we create our own splits by dividing the seed range (first 1 million seeds for training, second for validation, and third for test)." (A sketch of this seed-range split appears after this table.)
Hardware Specification | Yes | "All experiments were run on an internal cluster with 80 NVIDIA V100 GPUs and 20 Intel Xeon E5-2630 v4 CPUs for about 3 weeks." (Appendix I)
Software Dependencies | No | The paper mentions "TorchBeast [33], a distributed RL framework with importance weighted actor-learners based on IMPALA [18]" but does not provide specific version numbers for these or other software components.
Experiment Setup | Yes | "The hyperparameters and compute resources are shown in Appendices H and I, respectively." (Section 4, Setup)
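
Of the architecture components named in the Research Type row, egocentric local convolution is the most self-contained: the symbolic grid is re-centred on the agent before convolution so that the network's receptive field is agent-relative. The sketch below illustrates only that re-centring step; the function name, tensor layout, and zero-padding at map borders are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def egocentric_crop(grid: torch.Tensor, agent_yx: tuple, radius: int) -> torch.Tensor:
    """Crop a (C, H, W) symbolic grid to a (2*radius+1)-square window centred
    on the agent, zero-padding at map borders so the crop is always full-size."""
    padded = F.pad(grid, (radius, radius, radius, radius))  # pad width, then height
    y, x = agent_yx
    return padded[:, y : y + 2 * radius + 1, x : x + 2 * radius + 1]

# Example: a local convolution applied to the agent-centred view of a 3-channel grid.
grid = torch.randn(3, 12, 16)                   # (channels, height, width)
crop = egocentric_crop(grid, (5, 7), radius=2)  # -> (3, 5, 5)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
features = conv(crop.unsqueeze(0))              # -> (1, 8, 5, 5)
```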
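
The NetHack seed-range protocol quoted in the Dataset Splits row is easy to make concrete. The sketch below partitions the first three million seeds into disjoint train/validation/test pools and samples an episode seed from a requested split; the `sample_seed` helper is an illustrative assumption, not code from the SILG repository, and the sampled seed would then be passed to the environment's seeding hook before reset.

```python
import random

# Seed-range splits for NetHack as described in the paper:
# first million seeds -> train, second -> validation, third -> test.
SPLITS = {
    "train": range(0, 1_000_000),
    "validation": range(1_000_000, 2_000_000),
    "test": range(2_000_000, 3_000_000),
}

def sample_seed(split: str, rng: random.Random) -> int:
    """Draw an episode seed from the disjoint pool for the given split."""
    pool = SPLITS[split]
    return rng.randrange(pool.start, pool.stop)

if __name__ == "__main__":
    rng = random.Random(0)
    seed = sample_seed("validation", rng)
    print(seed)  # some seed in [1_000_000, 2_000_000)
```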