Learning Interpretable Spatial Operations in a Rich 3D Blocks World
Authors: Yonatan Bisk, Kevin Shih, Yejin Choi, Daniel Marcu
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations to rich natural language descriptions that require complex spatial and pragmatic interpretations such as mirroring, twisting, and balancing. This dataset, built on the simulation environment of Bisk, Yuret, and Marcu (2016), attains language that is significantly richer and more complex, while also doubling the size of the original dataset in the 2D environment with 100 new world configurations and 250,000 tokens. In addition, we propose a new neural architecture that achieves competitive results while automatically discovering an inventory of interpretable spatial operations (Figure 5). |
| Researcher Affiliation | Collaboration | Yonatan Bisk,1 Kevin J. Shih,2 Yejin Choi,1 Daniel Marcu3 1Paul G. Allen School of Computer Science & Engineering, University of Washington 2University of Illinois at Urbana-Champaign 3Amazon Inc. {ybisk,yejin}@cs.washington.edu, kjshih2@illinois.edu, marcud@amazon.com |
| Pseudocode | No | The paper does not contain any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | The paper provides a link (https://groundedlanguage.github.io/) for their released data, but does not explicitly state that the source code for their methodology is provided or available at this link. The text only says 'In our released data, we captured block orientations as quaternions.' |
| Open Datasets | Yes | Our new dataset comprises 100 configurations split 70-20-10 between training, testing, and development. Each configuration has between five and twenty steps (and blocks). We present type and token statistics in Table 1, where we use NLTK's (Bird, Klein, and Loper 2009) treebank tokenizer. In our released data (https://groundedlanguage.github.io/), we captured block orientations as quaternions. This allows for a complete and accurate re-rendering of the exact block orientations produced by our annotators. |
| Dataset Splits | Yes | Our new dataset comprises 100 configurations split 70-20-10 between training, testing, and development. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments. It only mentions the model has convolutional layers. |
| Software Dependencies | No | The paper mentions using Adam optimizer and NLTK, but does not provide specific version numbers for any software dependencies like deep learning frameworks (e.g., PyTorch, TensorFlow) or other libraries. |
| Experiment Setup | Yes | Our model is trained end-to-end using Adam (Kingma and Ba 2014) with a batch size of 32. The convolutional aspect of the model has 3 layers and operates on a world representation of dimensions 32 × 4 × 64 × 64 × 32 (batch, depth, height, width, channels). The first convolutional layer uses a filter of size 4 × 5 × 5 and the second of size 4 × 3 × 3, each followed by a tanh nonlinearity for the 3D model. Both layers output a tensor with the same dimensions as the input world. The final prediction layer is a 1 × 1 × 1 filter that projects the 32-dimensional vector at each location down to 8 values as detailed in the previous section. We further include an entropy term to encourage peakier distributions in the argument and operation softmaxes. |
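The convolutional stack quoted in the Experiment Setup row (same-sized outputs at every layer, tanh nonlinearities, and a final 1 × 1 × 1 projection to 8 values) can be sketched as a shape-propagation check. This is a hypothetical reconstruction for illustration, not the authors' released code; the `conv_shape` helper and the assumption of stride-1 "same" padding are ours.

```python
# Sketch of the tensor shapes in the 3-layer convolutional model described
# in the paper: per-example world representation (depth, height, width,
# channels) = (4, 64, 64, 32), batch size 32.  "Same" padding with stride 1
# is assumed, since the paper says both conv layers output a tensor with the
# same dimensions as the input world.

def same_pad_out(size, stride=1):
    """Spatial output size of a 'same'-padded convolution (stride 1 keeps size)."""
    return (size + stride - 1) // stride

def conv_shape(shape, kernel, out_channels):
    """Propagate a (depth, height, width, channels) shape through one conv layer."""
    d, h, w, _ = shape
    kd, kh, kw = kernel  # kernel sizes are recorded for documentation only;
    # under 'same' padding they do not change the spatial dimensions.
    return (same_pad_out(d), same_pad_out(h), same_pad_out(w), out_channels)

world = (4, 64, 64, 32)                 # per-example world representation
h1 = conv_shape(world, (4, 5, 5), 32)   # layer 1: 4x5x5 filters, then tanh
h2 = conv_shape(h1, (4, 3, 3), 32)      # layer 2: 4x3x3 filters, then tanh
out = conv_shape(h2, (1, 1, 1), 8)      # prediction layer: 1x1x1 projection to 8 values

print(h1)   # (4, 64, 64, 32) -- same dimensions as the input world
print(h2)   # (4, 64, 64, 32)
print(out)  # (4, 64, 64, 8)
```

The check makes explicit why the paper can speak of "the 32-dimensional vector at each location": the spatial grid is preserved end to end, and only the channel dimension changes at the final layer.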