Learning Interpretable Spatial Operations in a Rich 3D Blocks World

Authors: Yonatan Bisk, Kevin Shih, Yejin Choi, Daniel Marcu

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations to rich natural language descriptions that require complex spatial and pragmatic interpretations such as mirroring, twisting, and balancing. This dataset, built on the simulation environment of Bisk, Yuret, and Marcu (2016), attains language that is significantly richer and more complex, while also doubling the size of the original dataset in the 2D environment with 100 new world configurations and 250,000 tokens. In addition, we propose a new neural architecture that achieves competitive results while automatically discovering an inventory of interpretable spatial operations (Figure 5).
Researcher Affiliation | Collaboration | Yonatan Bisk (1), Kevin J. Shih (2), Yejin Choi (1), Daniel Marcu (3). (1) Paul G. Allen School of Computer Science & Engineering, University of Washington; (2) University of Illinois at Urbana-Champaign; (3) Amazon Inc. {ybisk,yejin}@cs.washington.edu, kjshih2@illinois.edu, marcud@amazon.com
Pseudocode | No | The paper does not contain any explicit pseudocode blocks or algorithms.
Open Source Code | No | The paper provides a link (https://groundedlanguage.github.io/) for the released data, but does not explicitly state that the source code for the methodology is available there. The text only says 'In our released data, we captured block orientations as quaternions.'
Open Datasets | Yes | Our new dataset comprises 100 configurations split 70-20-10 between training, testing, and development. Each configuration has between five and twenty steps (and blocks). We present type and token statistics in Table 1, where we use NLTK's (Bird, Klein, and Loper 2009) treebank tokenizer. In our released data (https://groundedlanguage.github.io/), we captured block orientations as quaternions. This allows for a complete and accurate re-rendering of the exact block orientations produced by our annotators.
Dataset Splits | Yes | Our new dataset comprises 100 configurations split 70-20-10 between training, testing, and development.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. It only mentions that the model has convolutional layers.
Software Dependencies | No | The paper mentions using the Adam optimizer and NLTK, but does not provide specific version numbers for any software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or other libraries.
Experiment Setup | Yes | Our model is trained end-to-end using Adam (Kingma and Ba 2014) with a batch size of 32. The convolutional aspect of the model has 3 layers and operates on a world representation of dimensions 32 × 4 × 64 × 64 × 32 (batch, depth, height, width, channels). The first convolutional layer uses a filter of size 4 × 5 × 5 and the second of size 4 × 3 × 3, each followed by a tanh nonlinearity for the 3D model. Both layers output a tensor with the same dimensions as the input world. The final prediction layer is a 1 × 1 × 1 filter that projects the 32-dimensional vector at each location down to 8 values as detailed in the previous section. We further include an entropy term to encourage peakier distributions in the argument and operation softmaxes.
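The Experiment Setup row states that both convolutional layers output a tensor with the same dimensions as the input world. A minimal sketch of the shape arithmetic, assuming stride-1 convolutions with TensorFlow-style "SAME" padding (the paper does not state padding or stride explicitly), checks that the quoted 4 × 5 × 5 and 4 × 3 × 3 filters do preserve the (4, 64, 64) spatial dimensions:

```python
import math

def same_out(n, k, stride=1):
    # "SAME" padding: output length depends only on input length and
    # stride; padding is chosen internally to cover the filter of size k.
    return math.ceil(n / stride)

def valid_out(n, k, stride=1):
    # "VALID" (no padding) output length, shown for contrast.
    return (n - k) // stride + 1

# World tensor from the quoted setup: (batch, depth, height, width, channels).
world = (32, 4, 64, 64, 32)

# Both quoted filter sizes preserve the spatial dims under SAME padding.
for filt in [(4, 5, 5), (4, 3, 3)]:
    out = tuple(same_out(n, k) for n, k in zip(world[1:4], filt))
    assert out == (4, 64, 64)

# Without padding, a 4 x 5 x 5 filter would shrink the world instead.
print(valid_out(4, 4), valid_out(64, 5), valid_out(64, 5))  # 1 60 60
```

The final 1 × 1 × 1 prediction layer trivially preserves spatial dimensions as well; it only projects the 32 channels at each location down to 8 values.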
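The Open Datasets row notes that block orientations are stored as quaternions so the annotators' exact block poses can be re-rendered. As a minimal pure-Python sketch of what such re-rendering involves (the helper `quat_rotate` is hypothetical, not from the paper or its data release), here is rotation of a 3D point by a unit quaternion q = (w, x, y, z):

```python
import math

def quat_rotate(q, v):
    """Rotate 3D point v by unit quaternion q = (w, x, y, z), i.e. q * v * q^-1."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * cross(q_vec, v)
    tx = 2 * (y * vz - z * vy)
    ty = 2 * (z * vx - x * vz)
    tz = 2 * (x * vy - y * vx)
    # v' = v + w * t + cross(q_vec, t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )

# A 90-degree rotation about the z-axis: w = cos(45 deg), z = sin(45 deg).
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
rx, ry, rz = quat_rotate(q, (1.0, 0.0, 0.0))
print(round(rx, 6), round(ry, 6), round(rz, 6))  # 0.0 1.0 0.0
```

This is the standard quaternion sandwich product written out without a dependency; in practice a library routine such as SciPy's `Rotation.from_quat` would be used instead.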