SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark
Authors: Victor Zhong, Austin W. Hanjie, Sida I. Wang, Karthik Narasimhan, Luke Zettlemoyer
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG)... In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures. (An illustrative sketch of one of these components, entity-centric attention, follows this table.) |
| Researcher Affiliation | Collaboration | Victor Zhong1,3, Austin W. Hanjie2, Sida I. Wang3, Karthik Narasimhan2 and Luke Zettlemoyer1,3 1Department of Computer Science, University of Washington 2Department of Computer Science, Princeton University 3Facebook AI Research |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 3 is a diagram of the model architecture, and equations describe computations but are not pseudocode. |
| Open Source Code | Yes | The code for SILG is available at https://github.com/vzhong/silg. |
| Open Datasets | Yes | SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown). |
| Dataset Splits | Yes | For each environment (separately), we train on the training set, early-stop on validation, and evaluate on test. NetHack does not distinguish between train and evaluation, so we create our own splits by dividing the seed range (first 1 million seeds for training, second million for validation, and third million for test; a sketch of this split appears after the table). |
| Hardware Specification | Yes | All experiments were run on an internal cluster with 80 NVIDIA V100 GPUs and 20 Intel Xeon E5-2630 v4 CPUs for about 3 weeks. (Appendix I) |
| Software Dependencies | No | The paper mentions “Torchbeast [33], a distributed RL framework with importance weighted actor-learners based on IMPALA [18]” but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | The hyperparameters and compute resources are shown in Appendices H and I, respectively. (Section 4, Setup) |
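The Research Type row names the components of the shared architecture but not their internals. As a minimal illustrative sketch only, the PyTorch snippet below shows one plausible form of entity-centric attention, in which each symbolic grid cell attends over the instruction tokens; the function name, tensor shapes, and scaled dot-product formulation are assumptions for illustration, not the SILG authors' implementation.

```python
import torch
import torch.nn.functional as F

def entity_centric_attention(grid, text, text_mask):
    """Attend from each grid cell to the instruction tokens (illustrative only).

    grid:      (B, H, W, D) embeddings of the symbols in each map cell
    text:      (B, T, D) token embeddings of the instruction/manual
    text_mask: (B, T) boolean mask, True for real (non-padding) tokens
    Returns a (B, H, W, D) text-conditioned grid representation.
    """
    B, H, W, D = grid.shape
    queries = grid.view(B, H * W, D)                    # one query per cell
    scores = queries @ text.transpose(1, 2) / D ** 0.5  # (B, HW, T)
    scores = scores.masked_fill(~text_mask[:, None, :], float("-inf"))
    attn = F.softmax(scores, dim=-1)        # per-cell distribution over tokens
    return (attn @ text).view(B, H, W, D)   # weighted sum of token embeddings

# Toy usage: a 6x6 grid, 10 instruction tokens, embedding size 32.
grid = torch.randn(2, 6, 6, 32)
text = torch.randn(2, 10, 32)
mask = torch.ones(2, 10, dtype=torch.bool)
out = entity_centric_attention(grid, text, mask)
print(out.shape)  # torch.Size([2, 6, 6, 32])
```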
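For concreteness, here is a minimal sketch of the seed-range split described in the Dataset Splits row: the first million RNG seeds for training, the second million for validation, and the third for test. The constant and function names are hypothetical, not taken from the SILG codebase.

```python
# Hypothetical illustration of the NetHack seed-range split described above.
# SPLIT_SIZE and make_seed_splits are made-up names, not from the SILG repo.

SPLIT_SIZE = 1_000_000  # one million seeds per split, per the paper

def make_seed_splits(split_size: int = SPLIT_SIZE):
    """Return disjoint seed ranges for train / validation / test."""
    train = range(0, split_size)                  # first million seeds
    val = range(split_size, 2 * split_size)       # second million
    test = range(2 * split_size, 3 * split_size)  # third million
    return train, val, test

train_seeds, val_seeds, test_seeds = make_seed_splits()
print(train_seeds.stop, val_seeds.stop, test_seeds.stop)  # 1000000 2000000 3000000
```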