SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark

Authors: Victor Zhong, Austin W. Hanjie, Sida Wang, Karthik Narasimhan, Luke Zettlemoyer

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG)... In addition, we propose the first shared model architecture for RL on these environments, and evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG. Our shared architecture achieves comparable performance to environment-specific architectures." (A sketch of the egocentric-crop idea appears after this table.)
Researcher Affiliation | Collaboration | Victor Zhong (1,3), Austin W. Hanjie (2), Sida I. Wang (3), Karthik Narasimhan (2), and Luke Zettlemoyer (1,3); (1) Department of Computer Science, University of Washington; (2) Department of Computer Science, Princeton University; (3) Facebook AI Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 3 is a diagram of the model architecture, and the equations describe computations but are not pseudocode.
Open Source Code | Yes | The code for SILG is available at https://github.com/vzhong/silg.
Open Datasets | Yes | "SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack), as well as symbolic counterparts of visual worlds that require interpreting rich natural language with respect to complex scenes (ALFWorld, Touchdown)."
Dataset Splits | Yes | "For each environment (separately), we train on the training split, do early stopping on the validation split, and evaluate on the test split. NetHack does not distinguish between train and evaluation, hence we create our own splits by dividing the seed range (first 1 million seeds for training, second for validation, and third for test)." (A sketch of this seed-range split appears after this table.)
Hardware Specification | Yes | "All experiments were run on an internal cluster with 80 NVIDIA V100 GPUs and 20 Intel Xeon E5-2630 v4 CPUs for about 3 weeks." (Appendix I)
Software Dependencies | No | The paper mentions "TorchBeast [33], a distributed RL framework with importance weighted actor-learners based on IMPALA [18]" but does not provide specific version numbers for these or other software components.
Experiment Setup | Yes | "The hyperparameters and compute resources are shown in Appendices H and I, respectively." (Section 4, Setup)
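
Of the architecture components named in the Research Type row, egocentric local convolution is the most self-contained: the symbolic grid is re-centred on the agent before convolution so that the network's receptive field is agent-relative. The sketch below illustrates only that re-centring step; the function name, tensor layout, and zero-padding at map borders are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def egocentric_crop(grid: torch.Tensor, agent_yx: tuple, radius: int) -> torch.Tensor:
    """Crop a (C, H, W) symbolic grid to a (2*radius+1)-square window centred
    on the agent, zero-padding at map borders so the crop is always full-size."""
    padded = F.pad(grid, (radius, radius, radius, radius))  # pad width, then height
    y, x = agent_yx
    return padded[:, y : y + 2 * radius + 1, x : x + 2 * radius + 1]

# Example: a local convolution applied to the agent-centred view of a 3-channel grid.
grid = torch.randn(3, 12, 16)                   # (channels, height, width)
crop = egocentric_crop(grid, (5, 7), radius=2)  # -> (3, 5, 5)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
features = conv(crop.unsqueeze(0))              # -> (1, 8, 5, 5)
```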
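
The NetHack seed-range protocol quoted in the Dataset Splits row is easy to make concrete. The sketch below partitions the first three million seeds into disjoint train/validation/test pools and samples an episode seed from a requested split; the `sample_seed` helper is an illustrative assumption, not code from the SILG repository, and the sampled seed would then be passed to the environment's seeding hook before reset.

```python
import random

# Seed-range splits for NetHack as described in the paper:
# first million seeds -> train, second -> validation, third -> test.
SPLITS = {
    "train": range(0, 1_000_000),
    "validation": range(1_000_000, 2_000_000),
    "test": range(2_000_000, 3_000_000),
}

def sample_seed(split: str, rng: random.Random) -> int:
    """Draw an episode seed from the disjoint pool for the given split."""
    pool = SPLITS[split]
    return rng.randrange(pool.start, pool.stop)

if __name__ == "__main__":
    rng = random.Random(0)
    seed = sample_seed("validation", rng)
    print(seed)  # some seed in [1_000_000, 2_000_000)
```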